From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Zhu Haoran <zhr1502@sjtu.edu.cn>
Cc: linux-mm@kvack.org,  dev.jain@arm.com
Subject: Re: [Question] About memory.c: process_huge_page
Date: Sun, 28 Sep 2025 08:48:36 +0800
Message-ID: <873487v1uj.fsf@DESKTOP-5N7EMDA>
In-Reply-To: <20250926122735.25478-1-zhr1502@sjtu.edu.cn> (Zhu Haoran's message of "Fri, 26 Sep 2025 20:27:35 +0800")

Zhu Haoran <zhr1502@sjtu.edu.cn> writes:

> "Huang, Ying" <ying.huang@linux.alibaba.com> writes:
>>Hi, Haoran,
>>
>>Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>>
>>> Hi!
>>>
>>> I recently noticed the process_huge_page function in memory.c, which is
>>> intended to keep the target page cache-hot after processing. I compared
>>> the vm-scalability anon-cow-seq-hugetlb microbench using the default
>>> process_huge_page and sequential processing (code posted below).
>>>
>>> I ran the test on an epyc-7T83 with 36 vCPUs and 64GB memory. Using the
>>> default process_huge_page, the avg bandwidth is 1148 MB/s. However,
>>> sequential processing yielded a better bandwidth of about 1255 MB/s and
>>> only one-third of the cache-miss rate of the default one.
>>>
>>> The same test was run on an epyc-9654 with 36 vCPUs and 64GB memory. The
>>> bandwidth result was similar but the difference was smaller: 1170 MB/s
>>> for default and 1230 MB/s for sequential. However, the cache-miss rate
>>> went the other way here: sequential processing saw 3 times more misses
>>> than the default.
>>>
>>> These results seem really inconsistent with what is described in your
>>> patchset [1]. What factors might explain these behaviors?
>>
>>One possible difference is cache topology.  Can you try to bind the test
>>process to the CPUs in one CCX (that is, sharing one LLC)?  This makes it
>>possible to hit the local cache.
>
> Thank you for the suggestion.
>
> I reduced the test to 16 vCPUs and bound them to one CCX on the epyc-9654. The
> rerun results are:
>
>                      sequential    process_huge_page
> BW (MB/s)                523.88               531.60   ( + 1.47%)
> user cachemiss           0.318%               0.446%   ( +40.25%)
> kernel cachemiss         1.405%              18.406%   ( + 1210%)
> usertime                  26.72                18.76   ( -29.79%)
> systime                   35.97                42.64   ( +18.54%)
>
> I was able to reproduce the much lower user time, but the BW gap is still not
> as significant as in your patch. It was bottlenecked by kernel cache misses
> and execution time. One possible explanation is that AMD has a less aggressive
> cache prefetcher, which fails to predict the access pattern of the current
> process_huge_page in the kernel. To verify that, I ran a microbench that
> iterates through 4K pages in sequential/reverse order and accesses each page
> in seq/rev order (4 combinations in total).
>
> cachemiss rate
>                   seq-seq  seq-rev  rev-seq  rev-rev
> epyc-9654           0.08%    1.71%    1.98%    0.09%
> epyc-7T83           1.07%   13.64%    6.23%    1.12%
> i5-13500H          27.08%   28.87%   29.57%   25.35%
>
> I also ran the anon-cow-seq on my laptop i5-13500H and all metrics aligned well
> with your patch. So I guess this could be the root cause why AMD won't benefit
> from the patch?

The cache size per process needs to be checked too.  The smaller the
cache size per process, the larger the benefit.
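
The microbenchmark described in the quoted message (4K pages walked in
sequential or reverse order, each page accessed in sequential or reverse
order) could look roughly like the sketch below. This is not the code that
produced the numbers above, which was not posted. The names sketch_index and
sketch_walk, the 64-byte line size, and the choice to touch one byte per cache
line are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed 4K base page */
#define SKETCH_LINE_SIZE 64u   /* assumed cache-line size */

/* Map a loop step to an index, counting up or down over n items. */
static size_t sketch_index(size_t step, size_t n, int reverse)
{
    return reverse ? n - 1 - step : step;
}

/*
 * Read one byte per cache line from every page in buf.
 * pages_reverse selects the order in which pages are visited,
 * lines_reverse selects the order of accesses inside each page.
 * The returned sum only exists so the loads are not optimized away.
 */
static uint64_t sketch_walk(const volatile unsigned char *buf, size_t nr_pages,
                            int pages_reverse, int lines_reverse)
{
    const size_t lines = SKETCH_PAGE_SIZE / SKETCH_LINE_SIZE;
    uint64_t sum = 0;

    assert(buf != NULL);
    for (size_t p = 0; p < nr_pages; p++) {
        size_t page = sketch_index(p, nr_pages, pages_reverse);

        for (size_t l = 0; l < lines; l++) {
            size_t line = sketch_index(l, lines, lines_reverse);

            sum += buf[page * SKETCH_PAGE_SIZE + line * SKETCH_LINE_SIZE];
        }
    }
    return sum;
}
```

Calling sketch_walk(buf, nr_pages, 0, 0), (0, 1), (1, 0) and (1, 1) on a
buffer larger than the LLC gives the seq-seq, seq-rev, rev-seq and rev-rev
cases; each call would be run separately under a cache-miss counter.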

>>> Thanks for your time.
>>>
>>> [1] https://lkml.org/lkml/2018/5/23/1072
>>>
>>> ---
>>> Sincere,
>>> Zhu Haoran
>>>
>>> ---
>>>
>>> static int process_huge_page(
>>>         unsigned long addr_hint, unsigned int nr_pages,
>>>         int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>         void *arg)
>>> {
>>>     int i, ret;
>>>     unsigned long addr = addr_hint &
>>>         ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>
>>>     might_sleep();
>>>     for (i = 0; i < nr_pages; i++) {
>>>         cond_resched();
>>>         ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>>         if (ret)
>>>             return ret;
>>>     }
>>>
>>>     return 0;
>>> }
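
For contrast with the sequential variant quoted above, the following
user-space sketch prints nothing and touches no memory; it only computes the
order in which the default process_huge_page visits subpages. It is written
from my reading of mm/memory.c in mainline (far end first, then alternating
left and right towards the target, target last) and should be checked against
the tree actually being tested. The name sketch_order and the output-array
interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill out[] with the subpage indices in the order the default
 * process_huge_page would visit them for the given target subpage.
 * out[] must have room for nr_pages entries. Returns the count written.
 */
static size_t sketch_order(int target, int nr_pages, int *out)
{
    int i, base, l, n = target;
    size_t k = 0;

    assert(out != NULL && target >= 0 && target < nr_pages);
    if (2 * n <= nr_pages) {
        /* Target in first half: visit the tail of the huge page first. */
        base = 0;
        l = n;
        for (i = nr_pages - 1; i >= 2 * n; i--)
            out[k++] = i;
    } else {
        /* Target in second half: visit the head of the huge page first. */
        base = nr_pages - 2 * (nr_pages - n);
        l = nr_pages - n;
        for (i = 0; i < base; i++)
            out[k++] = i;
    }
    /* Remaining subpages: left, right, left, right, closing in on target. */
    for (i = 0; i < l; i++) {
        out[k++] = base + i;
        out[k++] = base + 2 * l - 1 - i;
    }
    return k;
}
```

With nr_pages = 8 and target = 2 the order is 7 6 5 4 0 3 1 2, so the access
stream changes direction several times within one huge page, which is the
pattern a hardware prefetcher would have to follow.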

---
Best Regards,
Huang, Ying
