Re: Regression of madvise(MADV_COLD) on shmem?

From: Minchan Kim <minchan@kernel.org>
To: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	mhocko@suse.com, hannes@cmpxchg.org, timmurray@google.com,
	joel@joelfernandes.org, surenb@google.com, dancol@google.com,
	shakeelb@google.com, sonnyrao@google.com, oleksandr@redhat.com,
	hdanton@sina.com, lizeb@google.com, dave.hansen@intel.com,
	kirill.shutemov@linux.intel.com
Subject: Re: Regression of madvise(MADV_COLD) on shmem?
Date: Fri, 4 Mar 2022 16:18:26 -0800
Message-ID: <YiKsUr1FQwmDP7V0@google.com>
In-Reply-To: <dd620dbd-6d71-7553-d1e9-95676ff12c82@nutanix.com>

On Fri, Mar 04, 2022 at 05:55:58PM +0000, Ivan Teterevkov wrote:
> Hi folks,
> 
> I want to check if there's a regression in the madvise(MADV_COLD) behaviour
> with shared memory or my understanding of how it works is inaccurate.
> 
> The MADV_COLD advice was introduced in Linux 5.4 and allowed the users to
> mark selected memory ranges as more "inactive" than others, overruling the
> default LRU accounting. It helped to preserve the working set of an
> application. With more recent kernels, e.g. at least 5.17.0-rc6 and 5.10.42,
> MADV_COLD has stopped working as expected. Please take a look at a short
> program that demonstrates it:
> 
>     /*
>      * madvise(MADV_COLD) demo.
>      */
>     #include <assert.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <sys/mman.h>
>     #include <unistd.h>
> 
>     /* Requires the kernel 5.4 or newer. */
>     #ifndef MADV_COLD
>     #define MADV_COLD 20
>     #endif
> 
>     #define GIB(x) ((size_t)(x) << 30)
> 
>     int main(void)
>     {
>         char *shmem, *zeroes;
>         int page_size = getpagesize();
>         size_t i;
> 
>         /* Allocate 8 GiB of shared memory. */
>         shmem = mmap(/* addr */ NULL,
>                      /* length */ GIB(8),
>                      /* prot */ PROT_READ | PROT_WRITE,
>                      /* flags */ MAP_SHARED | MAP_ANONYMOUS,
>                      /* fd */ -1,
>                      /* offset */ 0);
>         assert(shmem != MAP_FAILED);
> 
>         /* Allocate a zero page for future use. */
>         zeroes = calloc(1, page_size);
>         assert(zeroes != NULL);
> 
>         /* Put 1 GiB blob at the beginning of the shared memory range. */
>         memset(shmem, 0xaa, GIB(1));
> 
>         /* Read memory adjacent to the blob. */
>         for (i = GIB(1); i < GIB(8); i = i + page_size) {
>             int res = memcmp(shmem + i, zeroes, page_size);
>             assert(res == 0);
> 
>             /* Cool down a zero page and make it "less active" than the blob.
>              * Under memory pressure, it'll likely become a reclaim target
>              * and thus will help to preserve the blob in memory.
>              */
>             res = madvise(shmem + i, page_size, MADV_COLD);
>             assert(res == 0);
>         }
> 
>         /* Let the user check smaps. */
>         printf("done\n");
>         pause();
> 
>         free(zeroes);
>         munmap(shmem, GIB(8));
> 
>         return 0;
>     }
> 
> How to run this program:
> 
> 1. Create a "test" cgroup with a memory limit of 3 GiB.
> 
> 1.1. cgroup v1:
> 
>     # mkdir /sys/fs/cgroup/memory/test
>     # echo 3G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> 
> 1.2. cgroup v2:
> 
>     # mkdir /sys/fs/cgroup/test
>     # echo 3G > /sys/fs/cgroup/test/memory.max
> 
> 2. Enable at least a 1 GiB swap device.
> 
> 3. Run the program in the "test" cgroup:
> 
>     # cgexec -g memory:test ./a.out
> 
> 4. Wait until it has finished, i.e. has printed "done".
> 
> 5. Check the shared memory VMA stats.
> 
> 5.1. In 5.17.0-rc6 and 5.10.42:
> 
>     # cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
>     7f8ed4648000-7f90d4648000 rw-s 00000000 00:01 2055      /dev/zero (deleted)
>     Size:            8388608 kB
>     KernelPageSize:        4 kB
>     MMUPageSize:           4 kB
>     Rss:             3119556 kB
>     Pss:             3119556 kB
>     Shared_Clean:          0 kB
>     Shared_Dirty:          0 kB
>     Private_Clean:   3119556 kB
>     Private_Dirty:         0 kB
>     Referenced:            0 kB
>     Anonymous:             0 kB
>     LazyFree:              0 kB
>     AnonHugePages:         0 kB
>     ShmemPmdMapped:        0 kB
>     FilePmdMapped:         0 kB
>     Shared_Hugetlb:        0 kB
>     Private_Hugetlb:       0 kB
>     Swap:            1048576 kB
>     SwapPss:               0 kB
>     Locked:                0 kB
>     THPeligible:    0
>     VmFlags: rd wr sh mr mw me ms sd
> 
> 5.2. In 5.4.109:
> 
>     # cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
>     7fca5f78b000-7fcc5f78b000 rw-s 00000000 00:01 173051      /dev/zero (deleted)
>     Size:            8388608 kB
>     KernelPageSize:        4 kB
>     MMUPageSize:           4 kB
>     Rss:             3121504 kB
>     Pss:             3121504 kB
>     Shared_Clean:          0 kB
>     Shared_Dirty:          0 kB
>     Private_Clean:   2072928 kB
>     Private_Dirty:   1048576 kB
>     Referenced:            0 kB
>     Anonymous:             0 kB
>     LazyFree:              0 kB
>     AnonHugePages:         0 kB
>     ShmemPmdMapped:        0 kB
>     FilePmdMapped:        0 kB
>     Shared_Hugetlb:        0 kB
>     Private_Hugetlb:       0 kB
>     Swap:                  0 kB
>     SwapPss:               0 kB
>     Locked:                0 kB
>     THPeligible:            0
>     VmFlags: rd wr sh mr mw me ms
> 
> The "Swap" fields differ noticeably: the older kernel doesn't swap the
> blob, but the newer ones do.
> 
> According to ftrace, the newer kernels still call deactivate_page() in
> madvise_cold():
> 
>     # trace-cmd record -p function_graph -g madvise_cold
>     # trace-cmd report | less
>     a.out-4877  [000]  1485.266106: funcgraph_entry: |  madvise_cold() {
>     a.out-4877  [000]  1485.266115: funcgraph_entry: |    walk_page_range() {
>     a.out-4877  [000]  1485.266116: funcgraph_entry: |      __walk_page_range() {
>     a.out-4877  [000]  1485.266117: funcgraph_entry: |        madvise_cold_or_pageout_pte_range() {
>     a.out-4877  [000]  1485.266118: funcgraph_entry:        0.179 us |          deactivate_page();
> 
> (The irrelevant bits are removed for brevity.)
> 
> It makes me think there may be a regression in MADV_COLD. Please let me know
> what you reckon.

Since deactivate_page() is called, I guess that's not a regression(?) from [1].

Then, my random guess is that the "Swap" difference you mentioned as a
regression might be related to "workingset detection for anon page", which
was merged into v5.8 and changed how the kernel balances reclaim between the
file and anonymous LRUs. It would be helpful if you could try it on v5.7 and
v5.8.

[1] 12e967fd8e4e6, mm: do not allow MADV_PAGEOUT for CoW page
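
To make the v5.7 vs v5.8 comparison less manual, the "Swap" figure could be
pulled out of smaps programmatically instead of with grep. The following is
only a minimal sketch, assuming the smaps line format shown in the report
above ("Field:   <number> kB"); the helper names smaps_field_kb() and
smaps_sum_kb() are hypothetical and not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one smaps line such as "Swap:            1048576 kB".
 * Returns 1 and stores the value in *kb if the line starts with "<field>:",
 * and 0 otherwise. Requiring the ':' right after the name keeps "Swap" from
 * matching "SwapPss".
 */
static int smaps_field_kb(const char *line, const char *field, long *kb)
{
    size_t n;

    assert(line != NULL && field != NULL && kb != NULL);

    n = strlen(field);
    if (strncmp(line, field, n) != 0 || line[n] != ':')
        return 0;

    /* %ld skips the padding spaces between the colon and the number. */
    return sscanf(line + n + 1, "%ld kB", kb) == 1;
}

/*
 * Sum one field over every VMA in an already opened smaps stream,
 * e.g. a FILE obtained from fopen("/proc/<pid>/smaps", "r").
 */
static long smaps_sum_kb(FILE *f, const char *field)
{
    char line[512];
    long kb, total = 0;

    while (fgets(line, sizeof(line), f) != NULL)
        if (smaps_field_kb(line, field, &kb))
            total += kb;

    return total;
}
```

Calling smaps_sum_kb(f, "Swap") once the test program has printed "done" on
each kernel would give one number per kernel to compare. Note that it sums
over all VMAs of the process rather than only the 8 GiB mapping.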
