* [BUG ?] Offline Memory gets stuck in offline_pages()
@ 2024-07-01 1:25 Zhijian Li (Fujitsu)
2024-07-01 7:14 ` David Hildenbrand
2024-07-04 7:43 ` Zhijian Li (Fujitsu)
0 siblings, 2 replies; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-01 1:25 UTC (permalink / raw)
To: linux-mm@kvack.org, linux-cxl@vger.kernel.org
Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
david@redhat.com >> David Hildenbrand, Oscar Salvador,
akpm@linux-foundation.org
Hi all
Overview:
While testing CXL memory hot-removal, we noticed that `daxctl offline-memory dax0.0`
would sometimes get stuck forever. `daxctl offline-memory dax0.0` writes "offline" to
/sys/devices/system/memory/memoryNNN/state.
Workaround:
When this happens, we can type Ctrl-C to abort the command and then retry.
Then the CXL memory is able to offline successfully.
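Since the offlining context can be terminated by a signal, this manual workaround lends itself to automation. The sketch below is our illustration (not part of the report): it runs an arbitrary command and delivers SIGINT, as Ctrl-C would, if the command does not finish within a timeout; the command names and the timeout value are placeholders.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[] as a child process; if it is still running after `timeout`
 * seconds, send SIGINT (the same signal Ctrl-C delivers) and reap it.
 * Returns the raw wait status, so the caller can distinguish a clean
 * exit from a timed-out, signal-terminated run and decide to retry. */
static int run_with_timeout(char *const argv[], unsigned int timeout)
{
    int status;
    pid_t pid = fork();

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                     /* exec failed */
    }

    /* poll in 100ms steps instead of blocking in waitpid() */
    for (unsigned int i = 0; i < timeout * 10; i++) {
        if (waitpid(pid, &status, WNOHANG) == pid)
            return status;
        usleep(100 * 1000);
    }

    kill(pid, SIGINT);                  /* "Ctrl-C" the stuck child */
    waitpid(pid, &status, 0);
    return status;
}
```

A caller could then loop: run `{"daxctl", "offline-memory", "dax0.0", NULL}` with a chosen timeout and retry while the returned status shows termination by SIGINT.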
Where the kernel gets stuck:
After digging into the kernel, we found that when the issue occurs, the kernel
is stuck in the outer loop of offline_pages(). Below is a simplified excerpt of
offline_pages() with the relevant paths annotated:
```
int __ref offline_pages()
{
    do { // outer loop
        pfn = start_pfn;
        do {
            ret = scan_movable_pages(pfn, end_pfn, &pfn); // It returns -ENOENT
            if (!ret)
                do_migrate_range(pfn, end_pfn); // Not reached here
        } while (!ret);
        ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
    } while (ret); // ret is -EBUSY
}
```
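To see why this structure can spin forever, here is a toy userspace model of the retry loop (our sketch, NOT kernel code): a page that is neither movable nor recognized as isolated, such as one stranded on a per-CPU free list, keeps the isolation check failing on every pass, until something else (modeled here as drain()) changes its state.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the offline_pages() retry loop above. Pages are either
 * movable (migratable), stuck on a per-CPU free list, or isolated. */
enum page_state { MOVABLE, ON_PCP_LIST, ISOLATED };

#define NPAGES 4
static enum page_state pages[NPAGES] = { MOVABLE, ON_PCP_LIST, MOVABLE, MOVABLE };

static void migrate_movable(void)       /* do_migrate_range() analogue */
{
    for (int i = 0; i < NPAGES; i++)
        if (pages[i] == MOVABLE)
            pages[i] = ISOLATED;
}

static bool all_isolated(void)          /* test_pages_isolated() analogue */
{
    for (int i = 0; i < NPAGES; i++)
        if (pages[i] != ISOLATED)
            return false;
    return true;
}

static void drain(void)                 /* __drain_all_pages() analogue */
{
    for (int i = 0; i < NPAGES; i++)
        if (pages[i] == ON_PCP_LIST)
            pages[i] = ISOLATED;
}

/* Returns the number of iterations on success, or -1 if the loop would
 * spin past max_iters (the real kernel has no such cap and loops forever). */
static int offline(int max_iters)
{
    int iters = 0;

    do {
        migrate_movable();
        if (++iters > max_iters)
            return -1;
    } while (!all_isolated());
    return iters;
}
```

The -1 branch corresponds to the endless -EBUSY retries described above; the post-drain() run corresponds to the escape observed later in this thread after calling __drain_all_pages().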
In this case, we dumped the first page that cannot be isolated (see the dump_page() output below); its
content does not change across iterations:
```
Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
```
Every time the issue occurs, the content of the page structure is similar.
Questions:
Q1. Is this behavior expected? From an OS administrator's point of view, the operation
should return promptly (success or failure) instead of hanging indefinitely.
Q2. In offline_pages(), encountering such a page does cause an endless loop. Shouldn't
some other part of the kernel have changed the state of this page in a timely manner?
When I use the workaround mentioned above (Ctrl-C, then retry the offline), I find
that the page state has changed (see the dump_page() output below):
```
Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000
Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
```
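A side note on the raw words in this second dump (our observation, not stated in the thread): the values dead000000000100 and dead000000000122 match the kernel's LIST_POISON1/LIST_POISON2 on x86_64 (include/linux/poison.h), which list_del() writes into list_head.next/.prev. If that reading is right, the page had by then been deleted from whatever list it was sitting on. A quick sanity check of the constants:

```c
#include <assert.h>
#include <stdint.h>

/* x86_64: CONFIG_ILLEGAL_POINTER_VALUE = 0xdead000000000000, and
 * include/linux/poison.h defines the list_del() poison values as
 * offsets from it. */
#define POISON_POINTER_DELTA    0xdead000000000000ULL
#define LIST_POISON1            (POISON_POINTER_DELTA + 0x100)  /* ->next */
#define LIST_POISON2            (POISON_POINTER_DELTA + 0x122)  /* ->prev */

/* Returns 1 if a struct page's lru words look like a list_del()ed entry. */
static int looks_list_deleted(uint64_t next, uint64_t prev)
{
    return next == LIST_POISON1 && prev == LIST_POISON2;
}
```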
What our test does:
We have a CXL memory device that is configured as kmem and onlined into the MOVABLE
zone as NUMA node 2. We run two processes, consume-memory and offline-memory, in parallel;
see the pseudo-code below:
```
main()
{
    if (fork() == 0)
        numactl -m 2 ./consume-memory
    else {
        daxctl offline-memory dax0.0
        wait()
    }
}
```
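The pseudo-code above can be fleshed out into runnable C along these lines (a sketch, not the reporters' actual harness; the commands `numactl -m 2 ./consume-memory` and `daxctl offline-memory dax0.0` are assumed to exist on the test machine and are passed in as argv-style arrays; both are run as children here for symmetry):

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork-and-exec one command, returning the child's pid. */
static pid_t spawn(char *const argv[])
{
    pid_t pid = fork();

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }
    return pid;
}

/* Run the memory consumer and the offliner in parallel, wait for both,
 * and return 0 only if both exited cleanly with status 0. */
static int run_parallel(char *const consumer[], char *const offliner[])
{
    int cs, os;
    pid_t c = spawn(consumer);
    pid_t o = spawn(offliner);

    waitpid(c, &cs, 0);
    waitpid(o, &os, 0);
    return (WIFEXITED(cs) && WEXITSTATUS(cs) == 0 &&
            WIFEXITED(os) && WEXITSTATUS(os) == 0) ? 0 : -1;
}
```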
Below is the process information (captured while it is stuck):
```
root 25716 0.0 0.0 2460 1408 pts/0 S+ 15:28 0:00 ./main
root 25719 0.0 0.0 0 0 pts/0 Z+ 15:28 0:00 [consume-memory] <defunct>
root 25720 98.6 0.0 9476 3740 pts/0 R+ 15:28 0:26 daxctl offline-memory /dev/dax0.0
```
Feel free to let me know if you need more details.
Thank you for your attention to this issue. Looking forward to your insights.
Thanks
Zhijian
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [BUG ?] Offline Memory gets stuck in offline_pages() 2024-07-01 1:25 [BUG ?] Offline Memory gets stuck in offline_pages() Zhijian Li (Fujitsu) @ 2024-07-01 7:14 ` David Hildenbrand 2024-07-01 12:07 ` Zhijian Li (Fujitsu) 2024-07-04 7:43 ` Zhijian Li (Fujitsu) 1 sibling, 1 reply; 8+ messages in thread From: David Hildenbrand @ 2024-07-01 7:14 UTC (permalink / raw) To: Zhijian Li (Fujitsu), linux-mm@kvack.org, linux-cxl@vger.kernel.org Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu), Oscar Salvador, akpm@linux-foundation.org On 01.07.24 03:25, Zhijian Li (Fujitsu) wrote: > Hi all > > > Overview: > During testing the CXL memory hotremove, we noticed that `daxctl offline-memory dax0.0` > would get stuck forever sometimes. daxctl offline-memory dax0.0 will write "offline" to > /sys/devices/system/memory/memoryNNN/state. Hi, See Documentation/admin-guide/mm/memory-hotplug.rst " Further, when running into out of memory situations while migrating pages, or when still encountering permanently unmovable pages within ZONE_MOVABLE (-> BUG), memory offlining will keep retrying until it eventually succeeds. When offlining is triggered from user space, the offlining context can be terminated by sending a signal. A timeout based offlining can easily be implemented via:: % timeout $TIMEOUT offline_block | failure_handling " > > Workaround: > When it happens, we can type Ctrl-C to abort it and then retry again. > Then the CXL memory is able to offline successfully. > > Where the kernel gets stuck: > After digging into the kernel, we found that when the issue occurs, the kernel > is stuck in the outer loop of offline_pages(). 
Below is a piece of the > highlighted offline_pages(): > > ``` > int __ref offline_pages() > { > do { // outer loop > pfn = start_pfn; > do { > ret = scan_movable_pages(pfn, end_pfn, &pfn); // It returns -ENOENT > if (!ret) > do_migrate_range(pfn, end_pfn); // Not reach here > } while (!ret); > ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE); > } while (ret); // ret is -EBUSY > } > ``` > > In this case, we dumped the first page that cannot be isolated (see dump_page below), it's > content does not change in each iteration.: > ``` > Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd > Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff) > Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000 > Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 > Jun 28 15:29:26 linux kernel: page dumped because: trouble page... Are you sure that's the problematic page? refcount:0 Indicates that the page is free. But maybe it does not have PageBuddy() set. It could also be that this is a "tail" page of a PageBuddy() page, and somehow we always end up on the tail in test_pages_isolated(). Which kernel + architecture are you testing with? > ``` > > Every time the issue occurs, the content of the page structure is similar. > > Questions: > Q1. Is this behavior expected? At least for an OS administrator, it should return > promptly (success or failure) instead of hanging indefinitely. It's expected that it might take a long time (possibly forever) in corner cases. See documentation. But it's likely unexpected that we have some problematic page here. > Q2. Regarding the offline_pages() function, encountering such a page indeed causes > an endless loop. Shouldn't another part of the kernel timely changed the state > of this page? There are various things that can go wrong. 
One issue might be that we try migrating a page but continuously fail to allocate memory to be used as a migration target. It seems unlikely with the page you dumped above, though. Do you maybe have that CXL memory be on a separate "fake" NUMA node, and your workload mbind() itself to that NUMA node, possibly refusing to migrate somewhere else? > > When I use the workaround mentioned above (Ctrl-C and try offline again), I find > that the page state changes (see dump_page below): > ``` > Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd > Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff) > Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000 > Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 > Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page > ``` > > What our test does: > We have a CXL memory device, which is configured as kmem and online into the MOVABLE > zone as NUMA node2. We run two processes, consume-memory and offline-memory, in parallel, > see the pseudo code below: > > ``` > main() > { > if (fork() == 0) > numactl -m 2 ./consume-memory What exactly does "consume-memory" do? Does it involve hugetlb maybe? -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUG ?] Offline Memory gets stuck in offline_pages() 2024-07-01 7:14 ` David Hildenbrand @ 2024-07-01 12:07 ` Zhijian Li (Fujitsu) 0 siblings, 0 replies; 8+ messages in thread From: Zhijian Li (Fujitsu) @ 2024-07-01 12:07 UTC (permalink / raw) To: David Hildenbrand, linux-mm@kvack.org, linux-cxl@vger.kernel.org Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu), Oscar Salvador, akpm@linux-foundation.org, Xingtao Yao (Fujitsu) on 7/1/2024 3:14 PM, David Hildenbrand wrote: > On 01.07.24 03:25, Zhijian Li (Fujitsu) wrote: >> Hi all >> Overview: >> During testing the CXL memory hotremove, we noticed that `daxctl offline-memory dax0.0` >> would get stuck forever sometimes. daxctl offline-memory dax0.0 will write "offline" to >> /sys/devices/system/memory/memoryNNN/state. > Hi, > See > Documentation/admin-guide/mm/memory-hotplug.rst Many thanks for this quotation. It reminds me that we encountered OOM during the test sometimes. > " > Further, when running into out of memory situations while migrating > pages, or when still encountering permanently unmovable pages within > ZONE_MOVABLE (-> BUG), memory offlining will keep retrying until it > eventually succeeds. > When offlining is triggered from user space, the offlining context can > be terminated by sending a signal. A timeout based offlining can easily > be implemented via:: > % timeout $TIMEOUT offline_block | failure_handling > " >> Workaround: >> When it happens, we can type Ctrl-C to abort it and then retry again. >> Then the CXL memory is able to offline successfully. >> Where the kernel gets stuck: >> After digging into the kernel, we found that when the issue occurs, the kernel >> is stuck in the outer loop of offline_pages(). 
Below is a piece of the >> highlighted offline_pages(): >> ``` >> int __ref offline_pages() >> { >> do { // outer loop >> pfn = start_pfn; >> do { >> ret = scan_movable_pages(pfn, end_pfn, &pfn); // It returns -ENOENT >> if (!ret) >> do_migrate_range(pfn, end_pfn); // Not reach here >> } while (!ret); >> ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE); >> } while (ret); // ret is -EBUSY >> } >> ``` >> In this case, we dumped the first page that cannot be isolated (see dump_page below), it's >> content does not change in each iteration.: >> ``` >> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd >> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff) >> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000 >> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 >> Jun 28 15:29:26 linux kernel: page dumped because: trouble page... > Are you sure that's the problematic page? Yes, I dumped the page in the `else` in __test_page_isolated_in_pageblock(), see below 573 __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn, 574 int flags) 575 { 576 struct page *page; 577 578 while (pfn < end_pfn) { 579 page = pfn_to_page(pfn); 580 if (PageBuddy(page)) 581 /* 582 * If the page is on a free list, it has to be on 583 * the correct MIGRATE_ISOLATE freelist. There is no 584 * simple way to verify that as VM_BUG_ON(), though. 585 */ 586 pfn += 1 << buddy_order(page); 587 else if ((flags & MEMORY_OFFLINE) && PageHWPoison(page)) 588 /* A HWPoisoned page cannot be also PageBuddy */ 589 pfn++; 590 else if ((flags & MEMORY_OFFLINE) && PageOffline(page) && 591 !page_count(page)) 592 /* 593 * The responsible driver agreed to skip PageOffline() 594 * pages when offlining memory by dropping its 595 * reference in MEM_GOING_OFFLINE. 
596 */ 597 pfn++; 598 else /****************** dump_page(page) here ****************/ 599 break; 600 } 601 602 return pfn; 603 } We also dumped that page at the beginning of offline_pages(), it had the same page structure content. IOW, this page has been problematic before the loop. > refcount:0 > Indicates that the page is free. But maybe it does not have PageBuddy() set. > It could also be that this is a "tail" page of a PageBuddy() page, It doesn't seem it's the tail page of the PageBuddy(), I also tested it that it didn't covered by the buddy_order(page) of the previous pageBuddy. > and > somehow we always end up on the tail in test_pages_isolated(). > Which kernel + architecture are you testing with? This test is running on QEMU/tcg x86_64 guest with kernel v6.10-rc2, the host is x86_64. /home/lizhijian/qemu/build/qemu-system-x86_64 \ -name guest=fedora-37-client \ -nographic \ -machine pc-q35-3.1,accel=tcg,nvdimm=on,cxl=on \ -cpu qemu64 \ -smp 4,sockets=4,cores=1,threads=1 \ -m size=8G,slots=8,maxmem=19922944k \ -hda ./Fedora-Server-1.qcow2 \ -object memory-backend-ram,size=4G,id=m0 \ -object memory-backend-ram,size=4G,id=m1 \ -numa node,nodeid=0,cpus=0-1,memdev=m0 \ -numa node,nodeid=1,cpus=2-3,memdev=m1 \ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \ -object memory-backend-ram,size=2G,share=on,id=vmem0 \ -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=type3-cxl-vmem0 \ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=32G,cxl-fmw.0.interleave-granularity=4k \ > >> ``` >> >> Every time the issue occurs, the content of the page structure is >> similar. >> >> Questions: >> Q1. Is this behavior expected? At least for an OS administrator, it >> should return >> promptly (success or failure) instead of hanging indefinitely. > > It's expected that it might take a long time (possibly forever) in > corner cases. See documentation. 
> > But it's likely unexpected that we have some problematic page here. > >> Q2. Regarding the offline_pages() function, encountering such a page >> indeed causes >> an endless loop. Shouldn't another part of the kernel timely >> changed the state >> of this page? > > There are various things that can go wrong. One issue might be that we > try migrating a page but continuously fail to allocate memory to be > used as a migration target. It seems unlikely with the page you dumped > above, though. > > Do you maybe have that CXL memory be on a separate "fake" NUMA node, Yes, it's a memory only(CPU less) node. ``` [root@localhost guest]# numactl -H available: 3 nodes (0-2) node 0 cpus: 0 1 node 0 size: 3927 MB node 0 free: 3430 MB node 1 cpus: 2 3 node 1 size: 4028 MB node 1 free: 3620 MB node 2 cpus: node 2 size: 0 MB node 2 free: 0 MB node distances: node 0 1 2 0: 10 20 20 1: 20 10 20 2: 20 20 10 [root@localhost guest]# daxctl online-memory dax0.0 --movable onlined memory for 1 device [root@localhost guest]# numactl -H available: 3 nodes (0-2) node 0 cpus: 0 1 node 0 size: 3927 MB node 0 free: 3449 MB node 1 cpus: 2 3 node 1 size: 4028 MB node 1 free: 3614 MB node 2 cpus: node 2 size: 2048 MB node 2 free: 2048 MB node distances: node 0 1 2 0: 10 20 20 1: 20 10 20 2: 20 20 10 ``` > and your workload mbind() itself to that NUMA node, possibly refusing > to migrate somewhere else? In most testing runs, we do see the pages migrate to other node when we trigger a offline memory. 
> >> >> When I use the workaround mentioned above (Ctrl-C and try >> offline again), I find >> that the page state changes (see dump_page below): >> ``` >> Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 >> mapping:0000000000000000 index:0x0 pfn:0x7980dd >> Jun 28 15:33:12 linux kernel: flags: >> 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff) >> Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 >> dead000000000122 0000000000000000 >> Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 >> 00000000ffffffff 0000000000000000 >> Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page >> ``` >> >> What our test does: >> We have a CXL memory device, which is configured as kmem and online >> into the MOVABLE >> zone as NUMA node2. We run two processes, consume-memory and >> offline-memory, in parallel, >> see the pseudo code below: >> >> ``` >> main() >> { >> if (fork() == 0) >> numactl -m 2 ./consume-memory > > What exactly does "consume-memory" do? Does it involve hugetlb maybe? No, they are just malloc() pages, see the code as below. We did the 2M hugetlb pattern, the hugetlb pattern will get offlined success or fail with EBUSY promptly. ``` int main(int argc, char **argv) { unsigned long long mem_size = 0; if (argc < 2) { printf("please specify the mem size in MB!\n"); return -1; } mem_size = strtoull(argv[1], NULL, 10); if (mem_size <= 0) { printf("invalid mem size '%s'\n", argv[1]); return -1; } printf("the mem size is %llu MB\n", mem_size); mem_size *= 1024 * 1024; char * a = (char *)malloc(mem_size); if (!a) { printf("malloc failed\n"); return -1; } memset(a, 0, mem_size); return 0; } ``` Feel free to let me know if you want to add some trace/debug in the code to do a further check. Thanks Zhijian > > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-01  1:25 [BUG ?] Offline Memory gets stuck in offline_pages() Zhijian Li (Fujitsu)
  2024-07-01  7:14 ` David Hildenbrand
@ 2024-07-04  7:43 ` Zhijian Li (Fujitsu)
  2024-07-04  8:14   ` David Hildenbrand
  1 sibling, 1 reply; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-04 7:43 UTC (permalink / raw)
To: linux-mm@kvack.org, linux-cxl@vger.kernel.org
Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
    david@redhat.com >> David Hildenbrand, Oscar Salvador,
    akpm@linux-foundation.org, Xingtao Yao (Fujitsu)

All,

Some progress updates.

When the issue occurs, calling __drain_all_pages() can make offline_pages()
escape from the loop.

>
> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
>

Given this problematic page structure's contents, it seems that the
list_head = {ffffdfbd9e603788, ffffd4f0ffd97ef0} is valid.

I guess it was linked into a pcp_list, so I dumped per_cpu_pages[cpu].count
at every critical timing.
An example is as below:

```
offline_pages()
{
    // per_cpu_pages[1].count = 0
    zone_pcp_disable() // will call __drain_all_pages()
    // per_cpu_pages[1].count = 188
    do {
        do {
            scan_movable_pages()
            ret = do_migrate_range()
        } while (!ret)

        ret = test_pages_isolated()

        if (is the 1st iteration)
            // per_cpu_pages[1].count = 182

        if (issue occurs) { /* if the loop takes more than 10 seconds */
            // per_cpu_pages[1].count = 61
            __drain_all_pages()
            // per_cpu_pages[1].count = 0
            /* will escape from the outer loop in later iterations */
        }
    } while (ret)
}
```

Some interesting points:
- After the 1st __drain_all_pages(), per_cpu_pages[1].count increased from 0 to 188;
  does that mean it's racing with something...?
- per_cpu_pages[1].count decreases during the iterations, but never down to 0
- when the issue occurs, calling __drain_all_pages() decreases per_cpu_pages[1].count to 0.

So I wonder if it's fine to call __drain_all_pages() in the loop?

Looking forward to your insights.

Thanks
Zhijian

On 01/07/2024 09:25, Zhijian Li (Fujitsu) wrote:
> Hi all
>
>
> Overview:
> During testing the CXL memory hotremove, we noticed that `daxctl offline-memory dax0.0`
> would get stuck forever sometimes. daxctl offline-memory dax0.0 will write "offline" to
> /sys/devices/system/memory/memoryNNN/state.
>
> Workaround:
> When it happens, we can type Ctrl-C to abort it and then retry again.
> Then the CXL memory is able to offline successfully.
>
> Where the kernel gets stuck:
> After digging into the kernel, we found that when the issue occurs, the kernel
> is stuck in the outer loop of offline_pages().
Below is a piece of the
> highlighted offline_pages():
>
> ```
> int __ref offline_pages()
> {
>     do { // outer loop
>         pfn = start_pfn;
>         do {
>             ret = scan_movable_pages(pfn, end_pfn, &pfn); // It returns -ENOENT
>             if (!ret)
>                 do_migrate_range(pfn, end_pfn); // Not reach here
>         } while (!ret);
>         ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>     } while (ret); // ret is -EBUSY
> }
> ```
>
> In this case, we dumped the first page that cannot be isolated (see dump_page below), it's
> content does not change in each iteration.:
> ```
> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
> ```
>
> Every time the issue occurs, the content of the page structure is similar.
>
> Questions:
> Q1. Is this behavior expected? At least for an OS administrator, it should return
> promptly (success or failure) instead of hanging indefinitely.
> Q2. Regarding the offline_pages() function, encountering such a page indeed causes
> an endless loop. Shouldn't another part of the kernel timely changed the state
> of this page?
>
> When I use the workaround mentioned above (Ctrl-C and try offline again), I find
> that the page state changes (see dump_page below):
> ```
> Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000
> Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
> ```
>
> What our test does:
> We have a CXL memory device, which is configured as kmem and online into the MOVABLE
> zone as NUMA node2. We run two processes, consume-memory and offline-memory, in parallel,
> see the pseudo code below:
>
> ```
> main()
> {
>     if (fork() == 0)
>         numactl -m 2 ./consume-memory
>     else {
>         daxctl offline-memory dax0.0
>         wait()
>     }
> }
> ```
>
> Attached is the process information (when it gets stuck):
> ```
> root 25716 0.0 0.0 2460 1408 pts/0 S+ 15:28 0:00 ./main
> root 25719 0.0 0.0 0 0 pts/0 Z+ 15:28 0:00 [consume-memory] <defunct>
> root 25720 98.6 0.0 9476 3740 pts/0 R+ 15:28 0:26 daxctl offline-memory /dev/dax0.0
> ```
>
> Feel free to let me know if you need more details.
> Thank you for your attention to this issue. Looking forward to your insights.
>
> Thanks
> Zhijian

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-04  7:43 ` Zhijian Li (Fujitsu)
@ 2024-07-04  8:14   ` David Hildenbrand
  2024-07-04 13:07     ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2024-07-04 8:14 UTC (permalink / raw)
To: Zhijian Li (Fujitsu), linux-mm@kvack.org, linux-cxl@vger.kernel.org
Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu), Oscar Salvador,
    akpm@linux-foundation.org, Xingtao Yao (Fujitsu), Zi Yan, Johannes Weiner

On 04.07.24 09:43, Zhijian Li (Fujitsu) wrote:
> All,
>
> Some progress updates
>
> When issue occurs, calling __drain_all_pages() can make offline_pages() escape from the loop.
>
>>
>> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
>> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
>> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
>> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
>>
>
> With this problematic page structure contents, it seems that the
> list_head = {ffffdfbd9e603788, ffffd4f0ffd97ef0} is valid.
>
> I guess it was linking to the pcp_list, so I dumped the
> per_cpu_pages[cpu].count in every in critical timings.

So, is your reproducer getting fixed when you call __drain_all_pages()
in the loop? (not that it's the right fix, but a good datapoint :) )

>
> An example is as below,
> offline_pages()
> {
>     // per_cpu_pages[1].count = 0
>     zone_pcp_disable() // will call __drain_all_pages()
>     // per_cpu_pages[1].count = 188
>     do {
>         do {
>             scan_movable_pages()
>             ret = do_migrate_range()
>         } while (!ret)
>
>         ret = test_pages_isolated()
>
>         if (is the 1st iteration)
>             // per_cpu_pages[1].count = 182
>
>         if (issue occurs) { /* if the loop take beyond 10 seconds */
>             // per_cpu_pages[1].count = 61
>             __drain_all_pages()
>             // per_cpu_pages[1].count = 0
>             /* will escape from the outer loop in later iterations */
>         }
>     } while (ret)
> }
>
> Some interesting points:
> - After the 1st __drain_all_pages(), per_cpu_pages[1].count increased to 188 from 0,
>   does it mean it's racing with something...?
> - per_cpu_pages[1].count will decrease but not decrease to 0 during iterations
> - when issue occurs, calling __drain_all_pages() will decrease per_cpu_pages[1].count to 0.

That's indeed weird. Maybe there is a race, or zone_pcp_disable() is not
fully effective for a zone?

>
> So I wonder if it's fine to call __drain_all_pages() in the loop?
>
> Looking forward to your insights.

So, in free_unref_page(), we make sure to never place MIGRATE_ISOLATE
onto the PCP. All pageblocks we are going to offline should be isolated
at this point, so no page that is getting freed and part of the
to-be-offlined range should end up on the PCP. So far the theory.

In offlining code we do

1) Set MIGRATE_ISOLATE
2) zone_pcp_disable() -> set high-and-batch to 0 and drain

Could there be a race in free_unref_page(), such that although
zone_pcp_disable() succeeds, we would still end up with a page in the
pcp? (especially, one that has MIGRATE_ISOLATE set for its pageblock?)

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 8+ messages in thread
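The suspected interleaving can be illustrated with a sequential toy model (our sketch, NOT kernel code, and only one hypothetical schedule): if the freeing CPU samples the pcp limit before zone_pcp_disable()'s drain but links the page in after it, the page is stranded on the pcp list, and only a later drain can remove it.

```c
#include <assert.h>

/* Toy model of the suspected free_unref_page() vs. zone_pcp_disable()
 * race. One possible interleaving:
 *
 *   freeing CPU                       offlining CPU
 *   -----------                       -------------
 *   reads pcp->high (nonzero)
 *                                     zone_pcp_disable(): high = 0, drain
 *   adds page to pcp list
 *
 * The page added in the last step is stranded: the drain already ran. */
struct pcp {
    int high;   /* 0 means "pcp disabled" */
    int count;  /* pages currently on the pcp list */
};

/* freeing side, split at the racy point */
static int free_sample_high(const struct pcp *p)
{
    return p->high;
}

static void free_add_page(struct pcp *p, int sampled_high)
{
    if (sampled_high > 0)   /* decision based on the stale sample */
        p->count++;
}

/* offlining side */
static void zone_pcp_disable_model(struct pcp *p)
{
    p->high = 0;
    p->count = 0;           /* models the __drain_all_pages() call */
}

static void drain_again(struct pcp *p)
{
    p->count = 0;           /* a later, rescuing drain */
}
```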
* RE: [BUG ?] Offline Memory gets stuck in offline_pages() 2024-07-04 8:14 ` David Hildenbrand @ 2024-07-04 13:07 ` Zhijian Li (Fujitsu) 2024-07-12 1:50 ` Zhijian Li (Fujitsu) 0 siblings, 1 reply; 8+ messages in thread From: Zhijian Li (Fujitsu) @ 2024-07-04 13:07 UTC (permalink / raw) To: David Hildenbrand, linux-mm@kvack.org, linux-cxl@vger.kernel.org Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu), Oscar Salvador, akpm@linux-foundation.org, Xingtao Yao (Fujitsu), Zi Yan, Johannes Weiner > -----Original Message----- > From: David Hildenbrand <david@redhat.com> > Sent: Thursday, July 4, 2024 4:15 PM > > On 04.07.24 09:43, Zhijian Li (Fujitsu) wrote: > > All, > > > > Some progress updates > > > > When issue occurs, calling __drain_all_pages() can make offline_pages() escape > from the loop. > > > >> > >> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 > >> mapping:0000000000000000 index:0x0 pfn:0x7980dd Jun 28 15:29:26 linux > >> kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff) > >> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 > >> ffffd4f0ffd97ef0 0000000000000000 Jun 28 15:29:26 linux kernel: raw: > >> 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 > Jun 28 15:29:26 linux kernel: page dumped because: trouble page... > >> > > > > With this problematic page structure contents, it seems that the > > list_head = {ffffdfbd9e603788, ffffd4f0ffd97ef0} is valid. > > > > I guess it was linking to the pcp_list, so I dumped the > > per_cpu_pages[cpu].count in every in critical timings. > > So, is your reproducer getting fixed when you call __drain_all_pages() in the loop? > (not that it's the right fix, but a could datapoint :) ) Yeah, it works for my reproducer. 
> > > > > An example is as below, > > offline_pages() > > { > > // per_cpu_pages[1].count = 0 > > zone_pcp_disable() // will call __drain_all_pages() > > // per_cpu_pages[1].count = 188 > > do { > > do { > > scan_movable_pages() > > ret = do_migrate_range() > > } while (!ret) > > > > ret = test_pages_isolated() > > > > if(is the 1st iteration) > > // per_cpu_pages[1].count = 182 > > > > if (issue occurs) { /* if the loop take beyond 10 seconds */ > > // per_cpu_pages[1].count = 61 > > __drain_all_pages() > > // per_cpu_pages[1].count = 0 > > /* will escape from the outer loop in later iterations */ > > } > > } while (ret) > > } > > > > Some interesting points: > > - After the 1st __drain_all_pages(), per_cpu_pages[1].count increased to 188 > from 0, > > does it mean it's racing with something...? > > - per_cpu_pages[1].count will decrease but not decrease to 0 during > iterations > > - when issue occurs, calling __drain_all_pages() will decrease > per_cpu_pages[1].count to 0. > > That's indeed weird. Maybe there is a race, or zone_pcp_disable() is not fully > effective for a zone? I often see there still are pages in PCP after the zone_pcp_disable(). > > > > > So I wonder if it's fine to call __drain_all_pages() in the loop? > > > > Looking forward to your insights. > > So, in free_unref_page(), we make sure to never place MIGRATE_ISOLATE onto > the PCP. All pageblocks we are going to offline should be isolated at this point, so > no page that is getting freed and part of the to-be-offlined range should end up on > the PCP. So far the theory. > > > In offlining code we do > > 1) Set MIGRATE_ISOLATE > 2) zone_pcp_disable() -> set high-and-batch to 0 and drain > > Could there be a race in free_unref_page(), such that although > zone_pcp_disable() succeeds, we would still end up with a page in the pcp? > (especially, one that has MIGRATE_ISOLATE set for its pageblock?) Thanks for your idea, I will further investigate in this direction. 
Thanks Zhijian > > > -- > Cheers, > > David / dhildenb ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUG ?] Offline Memory gets stuck in offline_pages() 2024-07-04 13:07 ` Zhijian Li (Fujitsu) @ 2024-07-12 1:50 ` Zhijian Li (Fujitsu) 2024-07-12 5:51 ` Zhijian Li (Fujitsu) 0 siblings, 1 reply; 8+ messages in thread From: Zhijian Li (Fujitsu) @ 2024-07-12 1:50 UTC (permalink / raw) To: David Hildenbrand, linux-mm@kvack.org, linux-cxl@vger.kernel.org Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu), Oscar Salvador, akpm@linux-foundation.org, Xingtao Yao (Fujitsu), Zi Yan, Johannes Weiner David && ALL, Some progress updates On 04/07/2024 21:07, Zhijian Li (Fujitsu) wrote: > > >> -----Original Message----- >> From: David Hildenbrand <david@redhat.com> >> Sent: Thursday, July 4, 2024 4:15 PM > > >> >> On 04.07.24 09:43, Zhijian Li (Fujitsu) wrote: >>> All, >>> >>> Some progress updates >>> >>> When issue occurs, calling __drain_all_pages() can make offline_pages() escape >> from the loop. >>> >>>> >>>> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 >>>> mapping:0000000000000000 index:0x0 pfn:0x7980dd Jun 28 15:29:26 linux >>>> kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff) >>>> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 >>>> ffffd4f0ffd97ef0 0000000000000000 Jun 28 15:29:26 linux kernel: raw: >>>> 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 >> Jun 28 15:29:26 linux kernel: page dumped because: trouble page... >>>> >>> >>> With this problematic page structure contents, it seems that the >>> list_head = {ffffdfbd9e603788, ffffd4f0ffd97ef0} is valid. >>> >>> I guess it was linking to the pcp_list, so I dumped the >>> per_cpu_pages[cpu].count in every in critical timings. >> >> So, is your reproducer getting fixed when you call __drain_all_pages() in the loop? >> (not that it's the right fix, but a could datapoint :) ) > > Yeah, it works for my reproducer. 
>
>>
>>> An example is as below,
>>> offline_pages()
>>> {
>>>     // per_cpu_pages[1].count = 0
>>>     zone_pcp_disable() // will call __drain_all_pages()
>>>     // per_cpu_pages[1].count = 188
>>>     do {
>>>         do {
>>>             scan_movable_pages()
>>>             ret = do_migrate_range()
>>>         } while (!ret)
>>>
>>>         ret = test_pages_isolated()
>>>
>>>         if (is the 1st iteration)
>>>             // per_cpu_pages[1].count = 182
>>>
>>>         if (issue occurs) { /* if the loop takes beyond 10 seconds */
>>>             // per_cpu_pages[1].count = 61
>>>             __drain_all_pages()
>>>             // per_cpu_pages[1].count = 0
>>>             /* will escape from the outer loop in later iterations */
>>>         }
>>>     } while (ret)
>>> }
>>>
>>> Some interesting points:
>>> - After the 1st __drain_all_pages(), per_cpu_pages[1].count increased to 188 from 0,
>>>   does it mean it's racing with something...?
>>> - per_cpu_pages[1].count will decrease but not decrease to 0 during iterations
>>> - when the issue occurs, calling __drain_all_pages() will decrease per_cpu_pages[1].count to 0.
>>
>> That's indeed weird. Maybe there is a race, or zone_pcp_disable() is not fully
>> effective for a zone?
>
> I often see there still are pages in PCP after the zone_pcp_disable().
>
>>>
>>> So I wonder if it's fine to call __drain_all_pages() in the loop?
>>>
>>> Looking forward to your insights.
>>
>> So, in free_unref_page(), we make sure to never place MIGRATE_ISOLATE onto
>> the PCP. All pageblocks we are going to offline should be isolated at this
>> point, so no page that is getting freed and part of the to-be-offlined range
>> should end up on the PCP. So far the theory.
>>
>>
>> In offlining code we do
>>
>> 1) Set MIGRATE_ISOLATE
>> 2) zone_pcp_disable() -> set high-and-batch to 0 and drain
>>
>> Could there be a race in free_unref_page(), such that although
>> zone_pcp_disable() succeeds, we would still end up with a page in the pcp?
>> (especially, one that has MIGRATE_ISOLATE set for its pageblock?)
> Thanks for your idea, I will further investigate in this direction.

some updates

CPU0                                 CPU1
-----------                          ---------
// erase pcp_list
zone_pcp_disable // pcp->count = 0
lru_cache_disable()                  __rmqueue_pcplist() // re-add pages to pcp_list
                                     __rmqueue_pcplist() // drop pages from pcp_list
                                     decay_pcp_high()    // drop pages from pcp_list
loop {                               ...
    __rmqueue_pcplist()              // drop pages from pcp_list; it will only be
                                     // called a few times during the loop
    scan_movable_pages()             ...
    migrate_pages()                  decay_pcp_high()    // drop pages from pcp_list; called by
                                     // a worker periodically during the loop
    // wait for pcp_list to be empty
} while (test_pages_isolated())

And we noticed that the re-adding of pages to the pcp_list in __rmqueue_pcplist()
only happens once; pcp->count changed from 0 to 200, for example.

The later calls to __rmqueue_pcplist() drop pcp->count by a step of 1 per call,
for example pcp->count: 199->198->197->196... However, __rmqueue_pcplist() stops
being called after a few times, before pcp->count has dropped to 0.

In the normal/good case, we also noticed that __rmqueue_pcplist() dropped pcp->count to 0.

Here is the __rmqueue_pcplist() call trace:

CPU: 1 PID: 3615 Comm: consume_std_pag Not tainted 6.10.0-rc2+ #147
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x64/0x80
 __rmqueue_pcplist+0xd55/0xdf0
 get_page_from_freelist+0x2a1/0x1770
 __alloc_pages_noprof+0x1a0/0x380
 alloc_pages_mpol_noprof+0xe3/0x1f0
 vma_alloc_folio_noprof+0x5c/0xb0
 folio_prealloc+0x21/0x80
 do_pte_missing+0x695/0xa20
 ? __pte_offset_map+0x1b/0x180
 __handle_mm_fault+0x65f/0xc10
 ? kmem_cache_free+0x370/0x410
 handle_mm_fault+0x128/0x360
 do_user_addr_fault+0x309/0x810
 exc_page_fault+0x7e/0x180
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x7f3d8729028a

In the meantime, decay_pcp_high() will be called periodically to drop pcp->count:
[  145.117256]  decay_pcp_high+0x68/0x90
[  145.117256]  refresh_cpu_vm_stats+0x149/0x2a0
[  145.117256]  vmstat_update+0x13/0x50
[  145.117256]  process_scheduled_works+0xa6/0x420
[  145.117256]  worker_thread+0x117/0x270
[  145.117256]  ? __pfx_worker_thread+0x10/0x10
[  145.117256]  kthread+0xe5/0x120
[  145.117256]  ? __pfx_kthread+0x10/0x10
[  145.117256]  ret_from_fork+0x34/0x40
[  145.117256]  ? __pfx_kthread+0x10/0x10
[  145.117256]  ret_from_fork_asm+0x1a/0x30

But decay_pcp_high() stops dropping pcp->count a few moments later. IOW, although
decay_pcp_high() keeps being called periodically, it no longer drops pcp->count.
A piece of the pcp content shows as below:

...
count = 7,
high = 7,
high_min = 0,
high_max = 0,
batch = 1,
flags = 0 '\000',
alloc_factor = 3 '\003',
expire = 0 '\000',
free_count = 0,
lists = {{
...

2244  * Called from the vmstat counter updater to decay the PCP high.
2245  * Return whether there are addition works to do.
2246  */
2247 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
2248 {
2249         int high_min, to_drain, batch;
2250         int todo = 0;
2251
2252         high_min = READ_ONCE(pcp->high_min);
2253         batch = READ_ONCE(pcp->batch);
2254         /*
2255          * Decrease pcp->high periodically to try to free possible
2256          * idle PCP pages. And, avoid to free too many pages to
2257          * control latency. This caps pcp->high decrement too.
2258          */
2259         if (pcp->high > high_min) {
2260                 pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
2261                                  pcp->high - (pcp->high >> 3), high_min);
2262                 if (pcp->high > high_min)
2263                         todo++;
2264         }
2265
2266         to_drain = pcp->count - pcp->high; // to_drain will be 0 (when count == high),
                                                // so no pages can be dropped from the pcp_list.
2267         if (to_drain > 0) {
2268                 spin_lock(&pcp->lock);
2269                 free_pcppages_bulk(zone, to_drain, pcp, 0);
2270                 spin_unlock(&pcp->lock);
2271                 todo++;
2272         }
2273
2274         if (mutex_is_locked(&pcp_batch_high_lock) && pcp->high_max == 0 &&
                 to_drain > 0 && pcp->count >= 0)
2275                 pr_info("lizhijian:%s,%d: cpu%d, to_drain %d, new %d\n",
                             __func__, __LINE__, smp_processor_id(), to_drain, pcp->count);
2277         return todo;
2278 }

I'm wondering if we can fix it in decay_pcp_high(): let it drop pcp->count to 0
when zone_pcp_disable() is in effect.

The following logs show pcp->count in the bad case (*new* is pcp->count at the
end of the function):
=====================
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a0159, old 71, new 70, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015a, old 70, new 69, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015b, old 69, new 68, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015c, old 68, new 67, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015d, old 67, new 66, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015e, old 66, new 65, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015f, old 65, new 64, add 0, drop 1
...
Jul 11 18:34:18 linux kernel: lizhijian: offline_pages,2087: cpu0: [6a0000-6a8000] get trouble: pcplist[1]: 0->63, batch 1, high 154
...
Jul 11 18:34:25 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 7, new 56
Jul 11 18:34:26 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 7, new 49
Jul 11 18:34:27 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 6, new 43
...
Jul 11 18:34:40 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 10
Jul 11 18:34:41 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 9
Jul 11 18:34:42 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 8
Jul 11 18:34:43 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 7
=====================

Thanks
Zhijian

>
> Thanks
> Zhijian
>
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>
* Re: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-12  1:50         ` Zhijian Li (Fujitsu)
@ 2024-07-12  5:51           ` Zhijian Li (Fujitsu)
  0 siblings, 0 replies; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-12  5:51 UTC (permalink / raw)
To: David Hildenbrand, linux-mm@kvack.org, linux-cxl@vger.kernel.org
Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu), Oscar Salvador,
    akpm@linux-foundation.org, Xingtao Yao (Fujitsu), Zi Yan, Johannes Weiner

On 12/07/2024 09:50, Zhijian Li (Fujitsu) wrote:
> CPU0                                 CPU1
> -----------                          ---------
> // erase pcp_list
> zone_pcp_disable // pcp->count = 0
> lru_cache_disable()                  __rmqueue_pcplist() // re-add pages to pcp_list
>                                      __rmqueue_pcplist() // drop pages from pcp_list
>                                      decay_pcp_high()    // drop pages from pcp_list
> loop {                               ...
>     __rmqueue_pcplist()              // drop pages from pcp_list; it will only be
>                                      // called a few times during the loop
>     scan_movable_pages()             ...
>     migrate_pages()                  decay_pcp_high()    // drop pages from pcp_list; called by
>                                      // a worker periodically during the loop
>     // wait for pcp_list to be empty
> } while (test_pages_isolated())
>
> And we noticed that the re-adding of pages to the pcp_list in __rmqueue_pcplist()
> only happens once; pcp->count changed from 0 to 200, for example.
>
> The later calls to __rmqueue_pcplist() drop pcp->count by a step of 1 per call,
> for example pcp->count: 199->198->197->196... However, __rmqueue_pcplist() stops
> being called after a few times, before pcp->count has dropped to 0.
>
> In the normal/good case, we also noticed that __rmqueue_pcplist() dropped pcp->count to 0.

I have doubts about all of these (after zone_pcp_disable() has been called):
1. whether __rmqueue_pcplist() should re-add pages to the pcp_list at all;
2. whether __rmqueue_pcplist() should drop pcp->count to 0, if 1 is true;
3. whether decay_pcp_high() should drop pcp->count to 0 — its name and comments
   don't indicate this.
end of thread, other threads:[~2024-07-12  5:51 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2024-07-01  1:25 [BUG ?] Offline Memory gets stuck in offline_pages() Zhijian Li (Fujitsu)
2024-07-01  7:14 ` David Hildenbrand
2024-07-01 12:07   ` Zhijian Li (Fujitsu)
2024-07-04  7:43   ` Zhijian Li (Fujitsu)
2024-07-04  8:14     ` David Hildenbrand
2024-07-04 13:07       ` Zhijian Li (Fujitsu)
2024-07-12  1:50         ` Zhijian Li (Fujitsu)
2024-07-12  5:51           ` Zhijian Li (Fujitsu)