linux-mm.kvack.org archive mirror
* [BUG ?] Offline Memory gets stuck in offline_pages()
@ 2024-07-01  1:25 Zhijian Li (Fujitsu)
  2024-07-01  7:14 ` David Hildenbrand
  2024-07-04  7:43 ` Zhijian Li (Fujitsu)
  0 siblings, 2 replies; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-01  1:25 UTC (permalink / raw)
  To: linux-mm@kvack.org, linux-cxl@vger.kernel.org
  Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
	david@redhat.com >> David Hildenbrand, Oscar Salvador,
	akpm@linux-foundation.org

Hi all


Overview:
While testing CXL memory hot-remove, we noticed that `daxctl offline-memory dax0.0`
sometimes gets stuck forever. `daxctl offline-memory dax0.0` writes "offline" to
/sys/devices/system/memory/memoryNNN/state.

Workaround:
When it happens, we can press Ctrl-C to abort the command and then retry.
The CXL memory is then offlined successfully.

Where the kernel gets stuck:
After digging into the kernel, we found that when the issue occurs, the kernel
is stuck in the outer loop of offline_pages(). Below is an annotated excerpt of
offline_pages():

```
int __ref offline_pages()
{
   do { // outer loop
     pfn = start_pfn;
     do {
       ret = scan_movable_pages(pfn, end_pfn, &pfn);  // It returns -ENOENT
       if (!ret)
          do_migrate_range(pfn, end_pfn);             // Not reach here
     } while (!ret);
     ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
     } while (ret);                                   // ret is -EBUSY
}
```

In this case, we dumped the first page that cannot be isolated (see the dump_page()
output below); its content does not change across iterations:
```
Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
```

Every time the issue occurs, the content of the page structure is similar.

Questions:
Q1. Is this behavior expected? At least from an OS administrator's point of view,
     the operation should return promptly (success or failure) instead of hanging
     indefinitely.
Q2. Regarding the offline_pages() function, encountering such a page indeed causes
     an endless loop. Shouldn't some other part of the kernel change the state of
     this page in a timely manner?

     When I use the workaround mentioned above (Ctrl-C and retry the offline), I find
     that the page state changes (see the dump_page() output below):
```
Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000
Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
```

What our test does:
We have a CXL memory device that is configured as kmem and onlined into ZONE_MOVABLE
as NUMA node 2. We run two processes, consume-memory and offline-memory, in parallel;
see the pseudo code below (a more explicit sketch follows it):

```
main()
{
     if (fork() == 0)
         numactl -m 2 ./consume-memory
     else {
         daxctl offline-memory dax0.0
         wait()
     }
}
```
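
For reference, a more explicit (runnable) version of the pseudo code above; the
memory size argument and the use of execlp()/system() are illustrative assumptions,
not taken from the actual test harness:

```
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* child: allocate and touch memory bound to NUMA node 2 */
		execlp("numactl", "numactl", "-m", "2",
		       "./consume-memory", "4096", (char *)NULL);
		perror("execlp");
		_exit(1);
	} else if (pid > 0) {
		/* parent: race memory offlining against the child's allocation */
		if (system("daxctl offline-memory dax0.0") != 0)
			fprintf(stderr, "daxctl offline-memory failed\n");
		wait(NULL);
	} else {
		perror("fork");
		return 1;
	}
	return 0;
}
```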

Below is the process information when the offline gets stuck:
```
root 25716 0.0 0.0 2460 1408 pts/0 S+ 15:28 0:00 ./main
root 25719 0.0 0.0 0 0 pts/0 Z+ 15:28 0:00 [consume-memory] <defunct>
root 25720 98.6 0.0 9476 3740 pts/0 R+ 15:28 0:26 daxctl offline-memory /dev/dax0.0
```

Feel free to let me know if you need more details.
Thank you for your attention to this issue. Looking forward to your insights.

Thanks
Zhijian


* Re: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-01  1:25 [BUG ?] Offline Memory gets stuck in offline_pages() Zhijian Li (Fujitsu)
@ 2024-07-01  7:14 ` David Hildenbrand
  2024-07-01 12:07   ` Zhijian Li (Fujitsu)
  2024-07-04  7:43 ` Zhijian Li (Fujitsu)
  1 sibling, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2024-07-01  7:14 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu), linux-mm@kvack.org,
	linux-cxl@vger.kernel.org
  Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
	Oscar Salvador, akpm@linux-foundation.org

On 01.07.24 03:25, Zhijian Li (Fujitsu) wrote:
> Hi all
> 
> 
> Overview:
> While testing CXL memory hot-remove, we noticed that `daxctl offline-memory dax0.0`
> sometimes gets stuck forever. `daxctl offline-memory dax0.0` writes "offline" to
> /sys/devices/system/memory/memoryNNN/state.

Hi,

See

Documentation/admin-guide/mm/memory-hotplug.rst

"
Further, when running into out of memory situations while migrating 
pages, or when still encountering permanently unmovable pages within 
ZONE_MOVABLE (-> BUG), memory offlining will keep retrying until it 
eventually succeeds.

When offlining is triggered from user space, the offlining context can 
be terminated by sending a signal. A timeout based offlining can easily 
be implemented via::

	% timeout $TIMEOUT offline_block | failure_handling
"

> 
> Workaround:
> When it happens, we can type Ctrl-C to abort it and then retry again.
> Then the CXL memory is able to offline successfully.
> 
> Where the kernel gets stuck:
> After digging into the kernel, we found that when the issue occurs, the kernel
> is stuck in the outer loop of offline_pages(). Below is a piece of the
> highlighted offline_pages():
> 
> ```
> int __ref offline_pages()
> {
>     do { // outer loop
>       pfn = start_pfn;
>       do {
>         ret = scan_movable_pages(pfn, end_pfn, &pfn);  // It returns -ENOENT
>         if (!ret)
>            do_migrate_range(pfn, end_pfn);             // Not reach here
>       } while (!ret);
>       ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>       } while (ret);                                   // ret is -EBUSY
> }
> ```
> 
> In this case, we dumped the first page that cannot be isolated (see the dump_page()
> output below); its content does not change across iterations:
> ```
> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...

Are you sure that's the problematic page?

refcount:0

Indicates that the page is free. But maybe it does not have PageBuddy() set.

It could also be that this is a "tail" page of a PageBuddy() page, and 
somehow we always end up on the tail in test_pages_isolated().

Which kernel + architecture are you testing with?

> ```
> 
> Every time the issue occurs, the content of the page structure is similar.
> 
> Questions:
> Q1. Is this behavior expected? At least for an OS administrator, it should return
>       promptly (success or failure) instead of hanging indefinitely.

It's expected that it might take a long time (possibly forever) in 
corner cases. See documentation.

But it's likely unexpected that we have some problematic page here.

> Q2. Regarding the offline_pages() function, encountering such a page indeed causes
>       an endless loop. Shouldn't some other part of the kernel change the state of
>       this page in a timely manner?

There are various things that can go wrong. One issue might be that we 
try migrating a page but continuously fail to allocate memory to be used 
as a migration target. It seems unlikely with the page you dumped above, 
though.

Do you maybe have that CXL memory be on a separate "fake" NUMA node, and 
your workload mbind() itself to that NUMA node, possibly refusing to 
migrate somewhere else?
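
For illustration only (not taken from the report): a workload that hard-binds its
allocations to node 2, roughly what "numactl -m 2" sets up via set_mempolicy().
Build with -lnuma; the 512 MB size is arbitrary.

```
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	/* nodemask with only bit 2 set: allocations must come from node 2 */
	unsigned long nodemask = 1UL << 2;

	if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)) != 0) {
		perror("set_mempolicy");
		return 1;
	}

	/* Pages faulted in from here on are placed on node 2; with MPOL_BIND
	 * the allocation cannot fall back to another node. */
	size_t sz = 512UL << 20;
	char *p = malloc(sz);
	if (!p)
		return 1;
	memset(p, 0, sz);
	return 0;
}
```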

> 
>       When I use the workaround mentioned above (Ctrl-C and try offline again), I find
>       that the page state changes (see dump_page below):
> ```
> Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000
> Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
> ```
> 
> What our test does:
> We have a CXL memory device that is configured as kmem and onlined into ZONE_MOVABLE
> as NUMA node 2. We run two processes, consume-memory and offline-memory, in parallel,
> see the pseudo code below:
> 
> ```
> main()
> {
>       if (fork() == 0)
>           numactl -m 2 ./consume-memory

What exactly does "consume-memory" do? Does it involve hugetlb maybe?


-- 
Cheers,

David / dhildenb




* Re: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-01  7:14 ` David Hildenbrand
@ 2024-07-01 12:07   ` Zhijian Li (Fujitsu)
  0 siblings, 0 replies; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-01 12:07 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm@kvack.org, linux-cxl@vger.kernel.org
  Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
	Oscar Salvador, akpm@linux-foundation.org, Xingtao Yao (Fujitsu)


On 7/1/2024 3:14 PM, David Hildenbrand wrote:

> On 01.07.24 03:25, Zhijian Li (Fujitsu) wrote:
>> Hi all
>> Overview:
>> During testing the CXL memory hotremove, we noticed that `daxctl offline-memory dax0.0`
>> would get stuck forever sometimes. daxctl offline-memory dax0.0 will write "offline" to
>> /sys/devices/system/memory/memoryNNN/state.
> Hi,
> See
> Documentation/admin-guide/mm/memory-hotplug.rst

Many thanks for this quotation.
It reminds me that we sometimes encountered OOM during the test.

  

> "
> Further, when running into out of memory situations while migrating
> pages, or when still encountering permanently unmovable pages within
> ZONE_MOVABLE (-> BUG), memory offlining will keep retrying until it
> eventually succeeds.
> When offlining is triggered from user space, the offlining context can
> be terminated by sending a signal. A timeout based offlining can easily
> be implemented via::
>      % timeout $TIMEOUT offline_block | failure_handling
> "
>> Workaround:
>> When it happens, we can type Ctrl-C to abort it and then retry again.
>> Then the CXL memory is able to offline successfully.
>> Where the kernel gets stuck:
>> After digging into the kernel, we found that when the issue occurs, the kernel
>> is stuck in the outer loop of offline_pages(). Below is a piece of the
>> highlighted offline_pages():
>> ```
>> int __ref offline_pages()
>> {
>>      do { // outer loop
>>        pfn = start_pfn;
>>        do {
>>          ret = scan_movable_pages(pfn, end_pfn, &pfn);  // It returns -ENOENT
>>          if (!ret)
>>             do_migrate_range(pfn, end_pfn);             // Not reach here
>>        } while (!ret);
>>        ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>>        } while (ret);                                   // ret is -EBUSY
>> }
>> ```
>> In this case, we dumped the first page that cannot be isolated (see the dump_page()
>> output below); its content does not change across iterations:
>> ```
>> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
>> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
>> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
>> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
> Are you sure that's the problematic page?

Yes. I dumped the page in the `else` branch of __test_page_isolated_in_pageblock(); see below:

573 __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
574                                   int flags)
575 {
576         struct page *page;
577
578         while (pfn < end_pfn) {
579                 page = pfn_to_page(pfn);
580                 if (PageBuddy(page))
581                         /*
582                          * If the page is on a free list, it has to be on
583                          * the correct MIGRATE_ISOLATE freelist. There is no
584                          * simple way to verify that as VM_BUG_ON(), though.
585                          */
586                         pfn += 1 << buddy_order(page);
587                 else if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
588                         /* A HWPoisoned page cannot be also PageBuddy */
589                         pfn++;
590                 else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
591                          !page_count(page))
592                         /*
593                          * The responsible driver agreed to skip PageOffline()
594                          * pages when offlining memory by dropping its
595                          * reference in MEM_GOING_OFFLINE.
596                          */
597                         pfn++;
598                 else
                             /****************** dump_page(page) here ****************/
599                         break;
600         }
601
602         return pfn;
603 }

We also dumped that page at the beginning of offline_pages(); it had the same page
structure content. IOW, this page was already problematic before the loop.

> refcount:0
> Indicates that the page is free. But maybe it does not have PageBuddy() set.
> It could also be that this is a "tail" page of a PageBuddy() page,

It doesn't seem to be a tail page of a PageBuddy() page; I also verified that it is
not covered by the buddy_order(page) range of the preceding PageBuddy() page.


>   and
> somehow we always end up on the tail in test_pages_isolated().
> Which kernel + architecture are you testing with?

This test runs in a QEMU/TCG x86_64 guest with kernel v6.10-rc2; the host is x86_64.

/home/lizhijian/qemu/build/qemu-system-x86_64 \
  -name guest=fedora-37-client \
  -nographic \
  -machine pc-q35-3.1,accel=tcg,nvdimm=on,cxl=on \
  -cpu qemu64 \
  -smp 4,sockets=4,cores=1,threads=1 \
  -m size=8G,slots=8,maxmem=19922944k \
  -hda ./Fedora-Server-1.qcow2 \
  -object memory-backend-ram,size=4G,id=m0 \
  -object memory-backend-ram,size=4G,id=m1 \
  -numa node,nodeid=0,cpus=0-1,memdev=m0 \
  -numa node,nodeid=1,cpus=2-3,memdev=m1 \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
  -object memory-backend-ram,size=2G,share=on,id=vmem0 \
  -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=type3-cxl-vmem0 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=32G,cxl-fmw.0.interleave-granularity=4k \



>
>> ```
>>
>> Every time the issue occurs, the content of the page structure is 
>> similar.
>>
>> Questions:
>> Q1. Is this behavior expected? At least for an OS administrator, it 
>> should return
>>       promptly (success or failure) instead of hanging indefinitely.
>
> It's expected that it might take a long time (possibly forever) in 
> corner cases. See documentation.
>
> But it's likely unexpected that we have some problematic page here.
>
>> Q2. Regarding the offline_pages() function, encountering such a page 
>> indeed causes
>>       an endless loop. Shouldn't another part of the kernel timely 
>> changed the state
>>       of this page?
>
> There are various things that can go wrong. One issue might be that we 
> try migrating a page but continuously fail to allocate memory to be 
> used as a migration target. It seems unlikely with the page you dumped 
> above, though.
>
> Do you maybe have that CXL memory be on a separate "fake" NUMA node, 

Yes, it's a memory-only (CPU-less) node.
```
[root@localhost guest]# numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 3927 MB
node 0 free: 3430 MB
node 1 cpus: 2 3
node 1 size: 4028 MB
node 1 free: 3620 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node   0   1   2
   0:  10  20  20
   1:  20  10  20
   2:  20  20  10
[root@localhost guest]# daxctl online-memory dax0.0 --movable
onlined memory for 1 device
[root@localhost guest]# numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 3927 MB
node 0 free: 3449 MB
node 1 cpus: 2 3
node 1 size: 4028 MB
node 1 free: 3614 MB
node 2 cpus:
node 2 size: 2048 MB
node 2 free: 2048 MB
node distances:
node   0   1   2
   0:  10  20  20
   1:  20  10  20
   2:  20  20  10
```



> and your workload mbind() itself to that NUMA node, possibly refusing 
> to migrate somewhere else?

In most test runs, we do see the pages migrate to another node when we trigger the memory offline.


>
>>
>>       When I use the workaround mentioned above (Ctrl-C and try 
>> offline again), I find
>>       that the page state changes (see dump_page below):
>> ```
>> Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 
>> mapping:0000000000000000 index:0x0 pfn:0x7980dd
>> Jun 28 15:33:12 linux kernel: flags: 
>> 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
>> Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 
>> dead000000000122 0000000000000000
>> Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 
>> 00000000ffffffff 0000000000000000
>> Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
>> ```
>>
>> What our test does:
>> We have a CXL memory device, which is configured as kmem and online 
>> into the MOVABLE
>> zone as NUMA node2. We run two processes, consume-memory and 
>> offline-memory, in parallel,
>> see the pseudo code below:
>>
>> ```
>> main()
>> {
>>       if (fork() == 0)
>>           numactl -m 2 ./consume-memory
>
> What exactly does "consume-memory" do? Does it involve hugetlb maybe?

No, the pages are just malloc()'ed; see the code below. We also tried a 2M hugetlb
pattern; with hugetlb, offlining either succeeds or fails with EBUSY promptly.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
         unsigned long long mem_size = 0;
         if (argc < 2) {
                 printf("please specify the mem size in MB!\n");
                 return -1;
         }
         mem_size = strtoull(argv[1], NULL, 10);
         if (mem_size == 0) { /* strtoull() returns 0 for invalid input */
                 printf("invalid mem size '%s'\n", argv[1]);
                 return -1;
         }

         printf("the mem size is %llu MB\n", mem_size);
         mem_size *= 1024 * 1024;

         char * a = (char *)malloc(mem_size);
         if (!a) {
                 printf("malloc failed\n");
                 return -1;
         }
         memset(a, 0, mem_size);
         return 0;
}
```

Feel free to let me know if you want me to add some tracing/debugging code for
further checks.


Thanks
Zhijian



>
>


* Re: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-01  1:25 [BUG ?] Offline Memory gets stuck in offline_pages() Zhijian Li (Fujitsu)
  2024-07-01  7:14 ` David Hildenbrand
@ 2024-07-04  7:43 ` Zhijian Li (Fujitsu)
  2024-07-04  8:14   ` David Hildenbrand
  1 sibling, 1 reply; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-04  7:43 UTC (permalink / raw)
  To: linux-mm@kvack.org, linux-cxl@vger.kernel.org
  Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
	david@redhat.com >> David Hildenbrand, Oscar Salvador,
	akpm@linux-foundation.org, Xingtao Yao (Fujitsu)

All,

Some progress updates

When the issue occurs, calling __drain_all_pages() makes offline_pages() escape from the loop.

> 
> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
> 

Judging from the problematic page structure content, the
list_head = {ffffdfbd9e603788, ffffd4f0ffd97ef0} looks valid.

I guess the page was linked into the pcp_list, so I dumped per_cpu_pages[cpu].count
at each critical point in time.

An example is shown below:
offline_pages()
{
	// per_cpu_pages[1].count = 0
	zone_pcp_disable() // will call __drain_all_pages()
	// per_cpu_pages[1].count = 188
	do {
		do {
			scan_movable_pages()
			ret = do_migrate_range()
		} while (!ret)

		ret = test_pages_isolated()

		if(is the 1st iteration)
			// per_cpu_pages[1].count = 182

		if (issue occurs) { /* if the loop takes more than 10 seconds */
			// per_cpu_pages[1].count = 61
			__drain_all_pages()
			// per_cpu_pages[1].count = 0
			/* will escape from the outer loop in later iterations */
		}
	} while (ret)
}

Some interesting points:
  - After the 1st __drain_all_pages(), per_cpu_pages[1].count increased from 0 to 188;
    does that mean it is racing with something?
  - per_cpu_pages[1].count decreases during the iterations, but never drops to 0.
  - When the issue occurs, calling __drain_all_pages() again drops per_cpu_pages[1].count to 0.

So I wonder whether it is acceptable to call __drain_all_pages() inside the loop; a rough sketch of what I mean is below.
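
(Sketch only; "stuck_too_long()" stands for a hypothetical timeout check, not existing
code. __drain_all_pages(zone, true) is the same drain that zone_pcp_disable() performs
internally.)

```
	do {
		/* ... scan_movable_pages() / do_migrate_range() as above ... */

		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
		if (ret && stuck_too_long())
			__drain_all_pages(zone, true);
	} while (ret);
```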

Looking forward to your insights.


Thanks
Zhijian


On 01/07/2024 09:25, Zhijian Li (Fujitsu) wrote:
> Hi all
> 
> 
> Overview:
> While testing CXL memory hot-remove, we noticed that `daxctl offline-memory dax0.0`
> sometimes gets stuck forever. `daxctl offline-memory dax0.0` writes "offline" to
> /sys/devices/system/memory/memoryNNN/state.
> 
> Workaround:
> When it happens, we can type Ctrl-C to abort it and then retry again.
> Then the CXL memory is able to offline successfully.
> 
> Where the kernel gets stuck:
> After digging into the kernel, we found that when the issue occurs, the kernel
> is stuck in the outer loop of offline_pages(). Below is a piece of the
> highlighted offline_pages():
> 
> ```
> int __ref offline_pages()
> {
>     do { // outer loop
>       pfn = start_pfn;
>       do {
>         ret = scan_movable_pages(pfn, end_pfn, &pfn);  // It returns -ENOENT
>         if (!ret)
>            do_migrate_range(pfn, end_pfn);             // Not reach here
>       } while (!ret);
>       ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>       } while (ret);                                   // ret is -EBUSY
> }
> ```
> 
> In this case, we dumped the first page that cannot be isolated (see the dump_page()
> output below); its content does not change across iterations:
> ```
> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
> ```
> 
> Every time the issue occurs, the content of the page structure is similar.
> 
> Questions:
> Q1. Is this behavior expected? At least for an OS administrator, it should return
>       promptly (success or failure) instead of hanging indefinitely.
> Q2. Regarding the offline_pages() function, encountering such a page indeed causes
>       an endless loop. Shouldn't some other part of the kernel change the state of
>       this page in a timely manner?
> 
>       When I use the workaround mentioned above (Ctrl-C and try offline again), I find
>       that the page state changes (see dump_page below):
> ```
> Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
> Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000
> Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
> ```
> 
> What our test does:
> We have a CXL memory device that is configured as kmem and onlined into ZONE_MOVABLE
> as NUMA node 2. We run two processes, consume-memory and offline-memory, in parallel,
> see the pseudo code below:
> 
> ```
> main()
> {
>       if (fork() == 0)
>           numactl -m 2 ./consume-memory
>       else {
>           daxctl offline-memory dax0.0
>           wait()
>       }
> }
> ```
> 
> Attached is the process information (when it gets stuck):
> ```
> root 25716 0.0 0.0 2460 1408 pts/0 S+ 15:28 0:00 ./main
> root 25719 0.0 0.0 0 0 pts/0 Z+ 15:28 0:00 [consume-memory] <defunct>
> root 25720 98.6 0.0 9476 3740 pts/0 R+ 15:28 0:26 daxctl offline-memory /dev/dax0.0
> ```
> 
> Feel free to let me know if you need more details.
> Thank you for your attention to this issue. Looking forward to your insights.
> 
> Thanks
> Zhijian


* Re: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-04  7:43 ` Zhijian Li (Fujitsu)
@ 2024-07-04  8:14   ` David Hildenbrand
  2024-07-04 13:07     ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2024-07-04  8:14 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu), linux-mm@kvack.org,
	linux-cxl@vger.kernel.org
  Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
	Oscar Salvador, akpm@linux-foundation.org, Xingtao Yao (Fujitsu),
	Zi Yan, Johannes Weiner

On 04.07.24 09:43, Zhijian Li (Fujitsu) wrote:
> All,
> 
> Some progress updates
> 
> When issue occurs, calling __drain_all_pages() can make offline_pages() escape from the loop.
> 
>>
>> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
>> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
>> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
>> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
>>
> 
> With this problematic page structure contents, it seems that the
> list_head = {ffffdfbd9e603788, ffffd4f0ffd97ef0} is valid.
> 
> I guess the page was linked into the pcp_list, so I dumped per_cpu_pages[cpu].count
> at each critical point in time.

So, is your reproducer getting fixed when you call __drain_all_pages() 
in the loop? (not that it's the right fix, but a good data point :) )

> 
> An example is as below,
> offline_pages()
> {
> 	// per_cpu_pages[1].count = 0
> 	zone_pcp_disable() // will call __drain_all_pages()
> 	// per_cpu_pages[1].count = 188
> 	do {
> 		do {
> 			scan_movable_pages()
> 			ret = do_migrate_range()
> 		} while (!ret)
> 
> 		ret = test_pages_isolated()
> 
> 		if(is the 1st iteration)
> 			// per_cpu_pages[1].count = 182
> 
> 		if (issue occurs) { /* if the loop take beyond 10 seconds */
> 			// per_cpu_pages[1].count = 61
> 			__drain_all_pages()
> 			// per_cpu_pages[1].count = 0
> 			/* will escape from the outer loop in later iterations */
> 		}
> 	} while (ret)
> }
> 
> Some interesting points:
>    - After the 1st __drain_all_pages(), per_cpu_pages[1].count increased to 188 from 0,
>      does it mean it's racing with something...?
>    - per_cpu_pages[1].count will decrease but not decrease to 0 during iterations
>    - when issue occurs, calling __drain_all_pages() will decrease per_cpu_pages[1].count to 0.

That's indeed weird. Maybe there is a race, or zone_pcp_disable() is not 
fully effective for a zone?

> 
> So I wonder if it's fine to call __drain_all_pages() in the loop?
> 
> Looking forward to your insights.

So, in free_unref_page(), we make sure to never place MIGRATE_ISOLATE 
onto the PCP. All pageblocks we are going to offline should be isolated 
at this point, so no page that is getting freed and part of the 
to-be-offlined range should end up on the PCP. So far the theory.


In offlining code we do

1) Set MIGRATE_ISOLATE
2) zone_pcp_disable() -> set high-and-batch to 0  and drain

Could there be a race in free_unref_page(), such that although 
zone_pcp_disable() succeeds, we would still end up with a page in the 
pcp? (especially, one that has MIGRATE_ISOLATE set for its pageblock?)
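
From memory, the relevant part of free_unref_page() looks roughly like the sketch
below (simplified, arguments elided; details differ between kernel versions). The
pageblock's migratetype is sampled once, before the page is queued on the PCP, which
is where such a race window would have to sit:

```
	migratetype = get_pfnblock_migratetype(page, pfn);
	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
		if (unlikely(is_migrate_isolate(migratetype))) {
			/* bypass the PCP: free straight to the buddy lists */
			free_one_page(...);
			return;
		}
		migratetype = MIGRATE_MOVABLE;
	}
	/*
	 * If the pageblock becomes MIGRATE_ISOLATE only after the check above,
	 * but before the page is added to the PCP list below, an isolated page
	 * could still land on the PCP and survive the zone_pcp_disable() drain.
	 */
	free_unref_page_commit(...);
```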


-- 
Cheers,

David / dhildenb




* RE: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-04  8:14   ` David Hildenbrand
@ 2024-07-04 13:07     ` Zhijian Li (Fujitsu)
  2024-07-12  1:50       ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-04 13:07 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm@kvack.org, linux-cxl@vger.kernel.org
  Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
	Oscar Salvador, akpm@linux-foundation.org, Xingtao Yao (Fujitsu),
	Zi Yan, Johannes Weiner



> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Thursday, July 4, 2024 4:15 PM


> 
> On 04.07.24 09:43, Zhijian Li (Fujitsu) wrote:
> > All,
> >
> > Some progress updates
> >
> > When issue occurs, calling __drain_all_pages() can make offline_pages() escape
> from the loop.
> >
> >>
> >> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0
> >> mapping:0000000000000000 index:0x0 pfn:0x7980dd Jun 28 15:29:26 linux
> >> kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
> >> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788
> >> ffffd4f0ffd97ef0 0000000000000000 Jun 28 15:29:26 linux kernel: raw:
> >> 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
> >>
> >
> > With this problematic page structure contents, it seems that the
> > list_head = {ffffdfbd9e603788, ffffd4f0ffd97ef0} is valid.
> >
> > I guess the page was linked into the pcp_list, so I dumped
> > per_cpu_pages[cpu].count at each critical point in time.
> 
> So, is your reproducer getting fixed when you call __drain_all_pages() in the loop?
> (not that it's the right fix, but a good data point :) )

Yeah, it works for my reproducer.



> 
> >
> > An example is as below,
> > offline_pages()
> > {
> > 	// per_cpu_pages[1].count = 0
> > 	zone_pcp_disable() // will call __drain_all_pages()
> > 	// per_cpu_pages[1].count = 188
> > 	do {
> > 		do {
> > 			scan_movable_pages()
> > 			ret = do_migrate_range()
> > 		} while (!ret)
> >
> > 		ret = test_pages_isolated()
> >
> > 		if(is the 1st iteration)
> > 			// per_cpu_pages[1].count = 182
> >
> > 		if (issue occurs) { /* if the loop take beyond 10 seconds */
> > 			// per_cpu_pages[1].count = 61
> > 			__drain_all_pages()
> > 			// per_cpu_pages[1].count = 0
> > 			/* will escape from the outer loop in later iterations */
> > 		}
> > 	} while (ret)
> > }
> >
> > Some interesting points:
> >    - After the 1st __drain_all_pages(), per_cpu_pages[1].count increased to 188
> from 0,
> >      does it mean it's racing with something...?
> >    - per_cpu_pages[1].count will decrease but not decrease to 0 during
> iterations
> >    - when issue occurs, calling __drain_all_pages() will decrease
> per_cpu_pages[1].count to 0.
> 
> That's indeed weird. Maybe there is a race, or zone_pcp_disable() is not fully
> effective for a zone?

I often see that there are still pages in the PCP after zone_pcp_disable().

> 
> >
> > So I wonder if it's fine to call __drain_all_pages() in the loop?
> >
> > Looking forward to your insights.
> 
> So, in free_unref_page(), we make sure to never place MIGRATE_ISOLATE onto
> the PCP. All pageblocks we are going to offline should be isolated at this point, so
> no page that is getting freed and part of the to-be-offlined range should end up on
> the PCP. So far the theory.
> 
> 
> In offlining code we do
> 
> 1) Set MIGRATE_ISOLATE
> 2) zone_pcp_disable() -> set high-and-batch to 0  and drain
> 
> Could there be a race in free_unref_page(), such that although
> zone_pcp_disable() succeeds, we would still end up with a page in the pcp?
> (especially, one that has MIGRATE_ISOLATE set for its pageblock?)

Thanks for your idea; I will investigate further in this direction.


Thanks
Zhijian

> 
> 
> --
> Cheers,
> 
> David / dhildenb



* Re: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-04 13:07     ` Zhijian Li (Fujitsu)
@ 2024-07-12  1:50       ` Zhijian Li (Fujitsu)
  2024-07-12  5:51         ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-12  1:50 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm@kvack.org, linux-cxl@vger.kernel.org
  Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
	Oscar Salvador, akpm@linux-foundation.org, Xingtao Yao (Fujitsu),
	Zi Yan, Johannes Weiner

David && ALL,

Some progress updates

On 04/07/2024 21:07, Zhijian Li (Fujitsu) wrote:
> 
> 
>> -----Original Message-----
>> From: David Hildenbrand <david@redhat.com>
>> Sent: Thursday, July 4, 2024 4:15 PM
> 
> 
>>
>> On 04.07.24 09:43, Zhijian Li (Fujitsu) wrote:
>>> All,
>>>
>>> Some progress updates
>>>
>>> When issue occurs, calling __drain_all_pages() can make offline_pages() escape
>> from the loop.
>>>
>>>>
>>>> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0
>>>> mapping:0000000000000000 index:0x0 pfn:0x7980dd Jun 28 15:29:26 linux
>>>> kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
>>>> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788
>>>> ffffd4f0ffd97ef0 0000000000000000 Jun 28 15:29:26 linux kernel: raw:
>>>> 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
>>>>
>>>
>>> With this problematic page structure contents, it seems that the
>>> list_head = {ffffdfbd9e603788, ffffd4f0ffd97ef0} is valid.
>>>
>>> I guess the page was linked into the pcp_list, so I dumped
>>> per_cpu_pages[cpu].count at each critical point in time.
>>
>> So, is your reproducer getting fixed when you call __drain_all_pages() in the loop?
>> (not that it's the right fix, but a good data point :) )
> 
> Yeah,  it works for my reproducer.
> 
> 
> 
>>
>>>
>>> An example is as below,
>>> offline_pages()
>>> {
>>> 	// per_cpu_pages[1].count = 0
>>> 	zone_pcp_disable() // will call __drain_all_pages()
>>> 	// per_cpu_pages[1].count = 188
>>> 	do {
>>> 		do {
>>> 			scan_movable_pages()
>>> 			ret = do_migrate_range()
>>> 		} while (!ret)
>>>
>>> 		ret = test_pages_isolated()
>>>
>>> 		if(is the 1st iteration)
>>> 			// per_cpu_pages[1].count = 182
>>>
>>> 		if (issue occurs) { /* if the loop take beyond 10 seconds */
>>> 			// per_cpu_pages[1].count = 61
>>> 			__drain_all_pages()
>>> 			// per_cpu_pages[1].count = 0
>>> 			/* will escape from the outer loop in later iterations */
>>> 		}
>>> 	} while (ret)
>>> }
>>>
>>> Some interesting points:
>>>     - After the 1st __drain_all_pages(), per_cpu_pages[1].count increased to 188
>> from 0,
>>>       does it mean it's racing with something...?
>>>     - per_cpu_pages[1].count will decrease but not decrease to 0 during
>> iterations
>>>     - when issue occurs, calling __drain_all_pages() will decrease
>> per_cpu_pages[1].count to 0.
>>
>> That's indeed weird. Maybe there is a race, or zone_pcp_disable() is not fully
>> effective for a zone?
> 
> I often see there still are pages in PCP after the zone_pcp_disable().
> 
>>
>>>
>>> So I wonder if it's fine to call __drain_all_pages() in the loop?
>>>
>>> Looking forward to your insights.
>>
>> So, in free_unref_page(), we make sure to never place MIGRATE_ISOLATE onto
>> the PCP. All pageblocks we are going to offline should be isolated at this point, so
>> no page that is getting freed and part of the to-be-offlined range should end up on
>> the PCP. So far the theory.
>>
>>
>> In offlining code we do
>>
>> 1) Set MIGRATE_ISOLATE
>> 2) zone_pcp_disable() -> set high-and-batch to 0  and drain
>>
>> Could there be a race in free_unref_page(), such that although
>> zone_pcp_disable() succeeds, we would still end up with a page in the pcp?
>> (especially, one that has MIGRATE_ISOLATE set for its pageblock?)
> 
> Thanks for your idea, I will further investigate in this direction.

Some updates:


         CPU0                                  CPU1
     -----------                           ---------
// drain the pcp_list
zone_pcp_disable()  // pcp->count = 0

lru_cache_disable()                   __rmqueue_pcplist()  // re-adds pages to the pcp_list
                                      __rmqueue_pcplist()  // drops pages from the pcp_list
                                      decay_pcp_high()     // drops pages from the pcp_list
loop  {                                     ...
                                      __rmqueue_pcplist()  // drops pages from the pcp_list,
                                                           // but is only called a few times during the loop
      scan_movable_pages()                  ...
      do_migrate_range()              decay_pcp_high()     // drops pages from the pcp_list; called by a
                                                           // worker periodically during the loop

// wait for the pcp_list to become empty
} while (test_pages_isolated())


We noticed that the re-adding of pages to the pcp_list in __rmqueue_pcplist() only
happens once; pcp->count jumped from 0 to 200, for example.

The later calls to __rmqueue_pcplist() drop pcp->count by 1 each time, for example
199->198->197->196..., but __rmqueue_pcplist() stops being called after a few times,
before pcp->count has dropped to 0.

In the normal/good case, we also observed that __rmqueue_pcplist() drops pcp->count to 0.


Here is the __rmqueue_pcplist() call trace.
CPU: 1 PID: 3615 Comm: consume_std_pag Not tainted 6.10.0-rc2+ #147
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
  <TASK>
  dump_stack_lvl+0x64/0x80
  __rmqueue_pcplist+0xd55/0xdf0
  get_page_from_freelist+0x2a1/0x1770
  __alloc_pages_noprof+0x1a0/0x380
  alloc_pages_mpol_noprof+0xe3/0x1f0
  vma_alloc_folio_noprof+0x5c/0xb0
  folio_prealloc+0x21/0x80
  do_pte_missing+0x695/0xa20
  ? __pte_offset_map+0x1b/0x180
  __handle_mm_fault+0x65f/0xc10
  ? kmem_cache_free+0x370/0x410
  handle_mm_fault+0x128/0x360
  do_user_addr_fault+0x309/0x810
  exc_page_fault+0x7e/0x180
  asm_exc_page_fault+0x26/0x30
RIP: 0033:0x7f3d8729028a


In the meantime, decay_pcp_high() is called periodically to drop pcp->count.
[  145.117256]  decay_pcp_high+0x68/0x90
[  145.117256]  refresh_cpu_vm_stats+0x149/0x2a0
[  145.117256]  vmstat_update+0x13/0x50
[  145.117256]  process_scheduled_works+0xa6/0x420
[  145.117256]  worker_thread+0x117/0x270
[  145.117256]  ? __pfx_worker_thread+0x10/0x10
[  145.117256]  kthread+0xe5/0x120
[  145.117256]  ? __pfx_kthread+0x10/0x10
[  145.117256]  ret_from_fork+0x34/0x40
[  145.117256]  ? __pfx_kthread+0x10/0x10
[  145.117256]  ret_from_fork_asm+0x1a/0x30

But decay_pcp_high() stops dropping pcp->count after a short while. IOW, although
decay_pcp_high() is still called periodically, it no longer drops pcp->count.
A piece of the pcp content is shown below:
...
   count = 7,
   high = 7,
   high_min = 0,
   high_max = 0,
   batch = 1,
   flags = 0 '\000',
   alloc_factor = 3 '\003',
   expire = 0 '\000',
   free_count = 0,
   lists = {{
...

2244  * Called from the vmstat counter updater to decay the PCP high.
2245  * Return whether there are addition works to do.
2246  */
2247 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
2248 {
2249         int high_min, to_drain, batch;
2250         int todo = 0;
2251
2252         high_min = READ_ONCE(pcp->high_min);
2253         batch = READ_ONCE(pcp->batch);
2254         /*
2255          * Decrease pcp->high periodically to try to free possible
2256          * idle PCP pages.  And, avoid to free too many pages to
2257          * control latency.  This caps pcp->high decrement too.
2258          */
2259         if (pcp->high > high_min) {
2260                 pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
2261                                  pcp->high - (pcp->high >> 3), high_min);
2262                 if (pcp->high > high_min)
2263                         todo++;
2264         }
2265
2266         to_drain = pcp->count - pcp->high;    // to_drain will be 0 (since count == high),
                                                    // so no pages can be dropped from the pcp_list.
2267         if (to_drain > 0) {
2268                 spin_lock(&pcp->lock);
2269                 free_pcppages_bulk(zone, to_drain, pcp, 0);
2270                 spin_unlock(&pcp->lock);
2271                 todo++;
2272         }
2273
2274         if (mutex_is_locked(&pcp_batch_high_lock) && pcp->high_max == 0 && to_drain > 0 && pcp->count >= 0)
2275                 pr_info("lizhijian:%s,%d: cpu%d, to_drain %d, new %d\n", __func__, __LINE__, smp_processor_id(), to_drain, pcp->count);
2277         return todo;
2278 }


I'm wondering whether we can fix this in decay_pcp_high() by letting it drop pcp->count to 0 once zone_pcp_disable() has taken effect; a rough sketch of that idea follows.
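
(Sketch only, based on the decay_pcp_high() listing above. The "high_max == 0" test
relies on the pcp dump above, where high_min == high_max == 0 while zone_pcp_disable()
is in effect; the "max(pcp->high >> 3, 1)" change is an assumption for guaranteeing
progress, since "pcp->high >> 3" is 0 once high drops below 8.)

```
int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
	int high_min, to_drain, batch;
	int todo = 0;

	high_min = READ_ONCE(pcp->high_min);
	batch = READ_ONCE(pcp->batch);
	/*
	 * Decrease pcp->high periodically, but by at least one page, so that
	 * a small pcp->high cannot get stuck (e.g. 7 >> 3 == 0).
	 */
	if (pcp->high > high_min) {
		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
				 pcp->high - max(pcp->high >> 3, 1), high_min);
		if (pcp->high > high_min)
			todo++;
	}

	/* While the PCP is disabled (high_min == high_max == 0), drain everything. */
	if (READ_ONCE(pcp->high_max) == 0)
		to_drain = pcp->count;
	else
		to_drain = pcp->count - pcp->high;
	if (to_drain > 0) {
		spin_lock(&pcp->lock);
		free_pcppages_bulk(zone, to_drain, pcp, 0);
		spin_unlock(&pcp->lock);
		todo++;
	}

	return todo;
}
```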


The following logs show pcp->count in the bad case (*new* is pcp->count at the end of the function).
=====================

Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a0159, old 71, new 70, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015a, old 70, new 69, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015b, old 69, new 68, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015c, old 68, new 67, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015d, old 67, new 66, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015e, old 66, new 65, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015f, old 65, new 64, add 0, drop 1
...
Jul 11 18:34:18 linux kernel: lizhijian: offline_pages,2087: cpu0: [6a0000-6a8000] get trouble: pcplist[1]: 0->63, batch 1, high 154
...

Jul 11 18:34:25 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 7, new 56
Jul 11 18:34:26 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 7, new 49
Jul 11 18:34:27 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 6, new 43
...
Jul 11 18:34:40 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 10
Jul 11 18:34:41 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 9
Jul 11 18:34:42 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 8
Jul 11 18:34:43 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 7
=====================

Thanks
Zhijian


> 
> 
> Thanks
> Zhijian
> 
>>
>>
>> --
>> Cheers,
>>
>> David / dhildenb
> 


* Re: [BUG ?] Offline Memory gets stuck in offline_pages()
  2024-07-12  1:50       ` Zhijian Li (Fujitsu)
@ 2024-07-12  5:51         ` Zhijian Li (Fujitsu)
  0 siblings, 0 replies; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2024-07-12  5:51 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm@kvack.org, linux-cxl@vger.kernel.org
  Cc: dan.j.williams@intel.com, Yasunori Gotou (Fujitsu),
	Oscar Salvador, akpm@linux-foundation.org, Xingtao Yao (Fujitsu),
	Zi Yan, Johannes Weiner



On 12/07/2024 09:50, Zhijian Li (Fujitsu) wrote:
>           CPU0                                  CPU1
>       -----------                           ---------
> // drain the pcp_list
> zone_pcp_disable()  // pcp->count = 0
>
> lru_cache_disable()                   __rmqueue_pcplist()  // re-adds pages to the pcp_list
>                                       __rmqueue_pcplist()  // drops pages from the pcp_list
>                                       decay_pcp_high()     // drops pages from the pcp_list
> loop  {                                     ...
>                                       __rmqueue_pcplist()  // drops pages from the pcp_list,
>                                                            // but is only called a few times during the loop
>       scan_movable_pages()                  ...
>       do_migrate_range()              decay_pcp_high()     // drops pages from the pcp_list; called by a
>                                                            // worker periodically during the loop
>
> // wait for the pcp_list to become empty
> } while (test_pages_isolated())
> 
> 
> We noticed that the re-adding of pages to the pcp_list in __rmqueue_pcplist() only
> happens once; pcp->count jumped from 0 to 200, for example.
> 
> The later calls to __rmqueue_pcplist() drop pcp->count by 1 each time, for example
> 199->198->197->196..., but __rmqueue_pcplist() stops being called after a few times,
> before pcp->count has dropped to 0.
> 
> In the normal/good case, we also observed that __rmqueue_pcplist() drops pcp->count to 0.

I doubt all of the following (after zone_pcp_disable() has been called):

1. whether __rmqueue_pcplist() should re-add pages to the pcp_list at all;
2. whether __rmqueue_pcplist() should drop pcp->count all the way to 0, if 1 is true;
3. whether decay_pcp_high() should drop pcp->count to 0; its name and comments don't
   suggest that it should.

