All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Barry Song <21cnbao@gmail.com>
Cc: usamaarif642@gmail.com, akpm@linux-foundation.org,
	chengming.zhou@linux.dev, david@redhat.com, hanchuanhua@oppo.com,
	kanchana.p.sridhar@intel.com, kernel-team@meta.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, minchan@kernel.org, nphamcs@gmail.com,
	riel@surriel.com, ryan.roberts@arm.com, senozhatsky@chromium.org,
	shakeel.butt@linux.dev, v-songbaohua@oppo.com,
	willy@infradead.org, ying.huang@intel.com, yosryahmed@google.com
Subject: Re: [RFC 0/4] mm: zswap: add support for zswapin of large folios
Date: Thu, 24 Oct 2024 10:29:42 -0400	[thread overview]
Message-ID: <20241024142942.GA279597@cmpxchg.org> (raw)
In-Reply-To: <20241023233548.23348-1-21cnbao@gmail.com>

On Thu, Oct 24, 2024 at 12:35:48PM +1300, Barry Song wrote:
> On Thu, Oct 24, 2024 at 9:36 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, Oct 24, 2024 at 8:47 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > >
> > >
> > >
> > > On 23/10/2024 19:52, Barry Song wrote:
> > > > On Thu, Oct 24, 2024 at 7:31 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 23/10/2024 19:02, Yosry Ahmed wrote:
> > > >>> [..]
> > > >>>>>> I suspect the regression occurs because you're running an edge case
> > > >>>>>> where the memory cgroup stays nearly full most of the time (this isn't
> > > >>>>>> an inherent issue with large folio swap-in). As a result, swapping in
> > > >>>>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The
> > > >>>>>> next swap-in then recreates the overflow, leading to a repeating
> > > >>>>>> cycle.
> > > >>>>>>
> > > >>>>>
> > > >>>>> Yes, agreed! Looking at the swap counters, I think this is what is going
> > > >>>>> on as well.
> > > >>>>>
> > > >>>>>> We need a way to stop the cup from repeatedly filling to the brim and
> > > >>>>>> overflowing. While not a definitive fix, the following change might help
> > > >>>>>> improve the situation:
> > > >>>>>>
> > > >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > >>>>>>
> > > >>>>>> index 17af08367c68..f2fa0eeb2d9a 100644
> > > >>>>>> --- a/mm/memcontrol.c
> > > >>>>>> +++ b/mm/memcontrol.c
> > > >>>>>>
> > > >>>>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio
> > > >>>>>> *folio, struct mm_struct *mm,
> > > >>>>>>                 memcg = get_mem_cgroup_from_mm(mm);
> > > >>>>>>         rcu_read_unlock();
> > > >>>>>>
> > > >>>>>> -       ret = charge_memcg(folio, memcg, gfp);
> > > >>>>>> +       if (folio_test_large(folio) && mem_cgroup_margin(memcg) <
> > > >>>>>> MEMCG_CHARGE_BATCH)
> > > >>>>>> +               ret = -ENOMEM;
> > > >>>>>> +       else
> > > >>>>>> +               ret = charge_memcg(folio, memcg, gfp);
> > > >>>>>>
> > > >>>>>>         css_put(&memcg->css);
> > > >>>>>>         return ret;
> > > >>>>>> }
> > > >>>>>>
> > > >>>>>
> > > >>>>> The diff makes sense to me. Let me test later today and get back to you.
> > > >>>>>
> > > >>>>> Thanks!
> > > >>>>>
> > > >>>>>> Please confirm if it makes the kernel build with memcg limitation
> > > >>>>>> faster. If so, let's
> > > >>>>>> work together to figure out an official patch :-) The above code hasn't consider
> > > >>>>>> the parent memcg's overflow, so not an ideal fix.
> > > >>>>>>
> > > >>>>
> > > >>>> Thanks Barry, I think this fixes the regression, and even gives an improvement!
> > > >>>> I think the below might be better to do:
> > > >>>>
> > > >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > >>>> index c098fd7f5c5e..0a1ec55cc079 100644
> > > >>>> --- a/mm/memcontrol.c
> > > >>>> +++ b/mm/memcontrol.c
> > > >>>> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > > >>>>                 memcg = get_mem_cgroup_from_mm(mm);
> > > >>>>         rcu_read_unlock();
> > > >>>>
> > > >>>> -       ret = charge_memcg(folio, memcg, gfp);
> > > >>>> +       if (folio_test_large(folio) &&
> > > >>>> +           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> > > >>>> +               ret = -ENOMEM;
> > > >>>> +       else
> > > >>>> +               ret = charge_memcg(folio, memcg, gfp);
> > > >>>>
> > > >>>>         css_put(&memcg->css);
> > > >>>>         return ret;
> > > >>>>
> > > >>>>
> > > >>>> AMD 16K+32K THP=always
> > > >>>> metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
> > > >>>> real           1m23.038s        1m23.050s                                   1m22.704s
> > > >>>> user           53m57.210s       53m53.437s                                  53m52.577s
> > > >>>> sys            7m24.592s        7m48.843s                                   7m22.519s
> > > >>>> zswpin         612070           999244                                      815934
> > > >>>> zswpout        2226403          2347979                                     2054980
> > > >>>> pgfault        20667366         20481728                                    20478690
> > > >>>> pgmajfault     385887           269117                                      309702
> > > >>>>
> > > >>>> AMD 16K+32K+64K THP=always
> > > >>>> metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> > > >>>> real           1m22.975s        1m23.266s                                  1m22.549s
> > > >>>> user           53m51.302s       53m51.069s                                 53m46.471s
> > > >>>> sys            7m40.168s        7m57.104s                                  7m25.012s
> > > >>>> zswpin         676492           1258573                                    1225703
> > > >>>> zswpout        2449839          2714767                                    2899178
> > > >>>> pgfault        17540746         17296555                                   17234663
> > > >>>> pgmajfault     429629           307495                                     287859
> > > >>>>
> > > >>>
> > > >>> Thanks Usama and Barry for looking into this. It seems like this would
> > > >>> fix a regression with large folio swapin regardless of zswap. Can the
> > > >>> same result be reproduced on zram without this series?
> > > >>
> > > >>
> > > >> Yes, its a regression in large folio swapin support regardless of zswap/zram.
> > > >>
> > > >> Need to do 3 tests, one with probably the below diff to remove large folio support,
> > > >> one with current upstream and one with upstream + swap thrashing fix.
> > > >>
> > > >> We only use zswap and dont have a zram setup (and I am a bit lazy to create one :)).
> > > >> Any zram volunteers to try this?
> > > >
> > > > Hi Usama,
> > > >
> > > > I tried a quick experiment:
> > > >
> > > > echo 1 > /sys/module/zswap/parameters/enabled
> > > > echo 0 > /sys/module/zswap/parameters/enabled
> > > >
> > > > This was to test the zRAM scenario. Enabling zswap even
> > > > once disables mTHP swap-in. :)
> > > >
> > > > I noticed a similar regression with zRAM alone, but the change resolved
> > > > the issue and even sped up the kernel build compared to the setup without
> > > > mTHP swap-in.
> > >
> > > Thanks for trying, this is amazing!
> > > >
> > > > However, I’m still working on a proper patch to address this. The current
> > > > approach:
> > > >
> > > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))
> > > >
> > > > isn’t sufficient, as it doesn’t cover cases where group A contains group B, and
> > > > we’re operating within group B. The problem occurs not at the boundary of
> > > > group B but at the boundary of group A.
> > >
> > > I am not sure I completely followed this. As MEMCG_CHARGE_BATCH=64, if we are
> > > trying to swapin a 16kB page, we basically check if atleast 64/4 = 16 folios can be
> > > charged to cgroup, which is reasonable. If we try to swapin a 1M folio, we just
> > > check if we can charge atleast 1 folio. Are you saying that checking just 1 folio
> > > is not enough in this case and can still cause thrashing, i.e we should check more?
> >
> > My understanding is that cgroups are hierarchical. Even if we don’t
> > hit the memory
> >  limit of the folio’s direct memcg, we could still reach the limit of
> > one of its parent
> > memcgs. Imagine a structure like:
> >
> > /sys/fs/cgroup/a/b/c/d
> >
> > If we’re compiling the kernel in d, there’s a chance that while d
> > isn’t at its limit, its
> > parents (c, b, or a) could be. Currently, the check only applies to d.
> 
> To clarify, I mean something like this:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 17af08367c68..cc6d21848ee8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4530,6 +4530,29 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>  	return 0;
>  }
> 
> +/*
> + * When the memory cgroup is nearly full, swapping in large folios can
> + * easily lead to swap thrashing, as the memcg operates on the edge of
> + * being full. We maintain a margin to allow for quick fallback to
> + * smaller folios during the swap-in process.
> + */
> +static inline bool mem_cgroup_swapin_margin_protected(struct mem_cgroup *memcg,
> +		struct folio *folio)
> +{
> +	unsigned int nr;
> +
> +	if (!folio_test_large(folio))
> +		return false;
> +
> +	nr = max_t(unsigned int, folio_nr_pages(folio), MEMCG_CHARGE_BATCH);
> +	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> +		if (mem_cgroup_margin(memcg) < nr)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  /**
>   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
>   * @folio: folio to charge.
> @@ -4547,7 +4570,8 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>  {
>  	struct mem_cgroup *memcg;
>  	unsigned short id;
> -	int ret;
> +	int ret = -ENOMEM;
> +	bool margin_prot;
> 
>  	if (mem_cgroup_disabled())
>  		return 0;
> @@ -4557,9 +4581,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>  	memcg = mem_cgroup_from_id(id);
>  	if (!memcg || !css_tryget_online(&memcg->css))
>  		memcg = get_mem_cgroup_from_mm(mm);
> +	margin_prot = mem_cgroup_swapin_margin_protected(memcg, folio);
>  	rcu_read_unlock();
> 
> -	ret = charge_memcg(folio, memcg, gfp);
> +	if (!margin_prot)
> +		ret = charge_memcg(folio, memcg, gfp);
> 
>  	css_put(&memcg->css);
>  	return ret;

I'm not quite following.

The charging code DOES the margin check. If you just want to avoid
reclaim, pass gfp without __GFP_DIRECT_RECLAIM, and it will return
-ENOMEM if there is no margin.

alloc_swap_folio() passes the THP mask, which should not include the
reclaim flag per default (GFP_TRANSHUGE_LIGHT). Unless you run with
defrag=always. Is that what's going on?

  reply	other threads:[~2024-10-24 14:29 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-18 10:48 [RFC 0/4] mm: zswap: add support for zswapin of large folios Usama Arif
2024-10-18 10:48 ` [RFC 1/4] mm/zswap: skip swapcache for swapping in zswap pages Usama Arif
2024-10-21 21:09   ` Yosry Ahmed
2024-10-22 19:49     ` Usama Arif
2024-10-23  0:45       ` Yosry Ahmed
2024-10-25 18:19         ` Nhat Pham
2024-10-25 19:10           ` Yosry Ahmed
2024-10-21 21:11   ` Yosry Ahmed
2024-10-22 19:59     ` Usama Arif
2024-10-23  0:47       ` Yosry Ahmed
2024-10-18 10:48 ` [RFC 2/4] mm/zswap: modify zswap_decompress to accept page instead of folio Usama Arif
2024-10-18 10:48 ` [RFC 3/4] mm/zswap: add support for large folio zswapin Usama Arif
2024-10-21  5:49   ` Barry Song
2024-10-21 10:44     ` Usama Arif
2024-10-21 10:55       ` Barry Song
2024-10-21 12:21         ` Usama Arif
2024-10-21 20:28           ` Barry Song
2024-10-21 20:57             ` Usama Arif
2024-10-21 21:34               ` Yosry Ahmed
2024-10-18 10:48 ` [RFC 4/4] mm/zswap: count successful large folio zswap loads Usama Arif
2024-10-21  5:09 ` [RFC 0/4] mm: zswap: add support for zswapin of large folios Barry Song
2024-10-21 10:40   ` Usama Arif
2024-10-22 15:26     ` Usama Arif
2024-10-22 20:46       ` Barry Song
2024-10-22 21:17         ` Usama Arif
2024-10-22 22:07           ` Barry Song
2024-10-23 10:26             ` Barry Song
2024-10-23 10:48               ` Usama Arif
2024-10-23 13:08                 ` Usama Arif
2024-10-23 18:02                   ` Yosry Ahmed
2024-10-23 18:31                     ` Usama Arif
2024-10-23 18:52                       ` Barry Song
2024-10-23 19:47                         ` Usama Arif
2024-10-23 20:36                           ` Barry Song
2024-10-23 23:35                           ` Barry Song
2024-10-24 14:29                             ` Johannes Weiner [this message]
2024-10-24 17:48                               ` Barry Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241024142942.GA279597@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chengming.zhou@linux.dev \
    --cc=david@redhat.com \
    --cc=hanchuanhua@oppo.com \
    --cc=kanchana.p.sridhar@intel.com \
    --cc=kernel-team@meta.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan@kernel.org \
    --cc=nphamcs@gmail.com \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=senozhatsky@chromium.org \
    --cc=shakeel.butt@linux.dev \
    --cc=usamaarif642@gmail.com \
    --cc=v-songbaohua@oppo.com \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.