From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=HqIU=JL=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	HTML_MESSAGE,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL
	autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C3820C433B4
	for <linux-mm@archiver.kernel.org>; Wed, 14 Apr 2021 04:13:31 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 2D4BE61163
	for <linux-mm@archiver.kernel.org>; Wed, 14 Apr 2021 04:13:31 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2D4BE61163
Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 8FB3A6B0036; Wed, 14 Apr 2021 00:13:30 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 8AC236B006C; Wed, 14 Apr 2021 00:13:30 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 6D9748D0001; Wed, 14 Apr 2021 00:13:30 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0234.hostedemail.com [216.40.44.234])
	by kanga.kvack.org (Postfix) with ESMTP id 4BF9D6B0036
	for <linux-mm@kvack.org>; Wed, 14 Apr 2021 00:13:30 -0400 (EDT)
Received: from smtpin02.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay04.hostedemail.com (Postfix) with ESMTP id E5F4645803AA
	for <linux-mm@kvack.org>; Wed, 14 Apr 2021 04:13:29 +0000 (UTC)
X-FDA: 78029653338.02.6E1D507
Received: from mail-wm1-f47.google.com (mail-wm1-f47.google.com [209.85.128.47])
	by imf08.hostedemail.com (Postfix) with ESMTP id 74E4180192D5
	for <linux-mm@kvack.org>; Wed, 14 Apr 2021 04:13:16 +0000 (UTC)
Received: by mail-wm1-f47.google.com with SMTP id o20-20020a05600c4fd4b0290114265518afso9774715wmq.4
        for <linux-mm@kvack.org>; Tue, 13 Apr 2021 21:13:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=c7Bol88Q7MbFMRHy5KwfEKPAhnOypedOILU5zWWBv/4=;
        b=XE3X0BNPRbXDcEDwNlSvE81/O1NzMR/WVnkCFuompOYFrM7CYklbN/bLsV3043wQiK
         Ck/b9T6sEU7ExwXPXyFGmSfFKDkyvk82D76a4OgrA7Yvp2Nd1+Hm6bNtkSzrydigbxbb
         cihywWa9XKEHPACOUnsDjqheNc8nN+F5A+PWMnm37Lw6kU5lrA+vBz9D3VM2Uumfw/9L
         ZaQ6wpIXvPuhCQlAGCCHNb3AZcvuNGCe9IPZK5Z75AMdyDck+jmAEoIcj3MRYAYxvxGG
         OpSynBn19JHUxfl4POoClJCfJGrUf3i9vK7JaB8Ekkf/Z0RvaNAEWLipAZGTFxjEwpb4
         6zHA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=c7Bol88Q7MbFMRHy5KwfEKPAhnOypedOILU5zWWBv/4=;
        b=UHYh+YMGabbwMCIOUlzfa3CHdx4IuTlZ/uNJDw6jsdCnuw4CFOIcsGYYbmwevOfg8Q
         TK8GNawHURPFJKdaS1ZWCN7Y7Lz6hhZfysvbwFjannXncA/OPpWNYOjnQ2jt8JHVMivF
         5KQfdBT7BpZTUszHYgrFSAc2DgSXT8+k/cCNoI7btufYaMcmjUNvtqv0cfiqDKxgrOj6
         +xB44wiRJZ1Ed30xx44MgcXqGYFFW2o8fqtZx0Ee+PsXR8CmpaQa4MGE0oTe3LE8EiwT
         6GmANmzVPCV8QGbd7RNIJQv6gqs+NbNwhD0E7mMJ/Udyrxw9MyL3zi6oVqQVTHLkEB04
         NxeA==
X-Gm-Message-State: AOAM531CH+ueRM3X8TgosG2D50O1vkmADE/H5W0XRTaCzv0UzT9u6BxF
	IIbl4mepgdEJC+Nc0HX7F1xkpRxEZE6wZUnnv7jE4g==
X-Google-Smtp-Source: ABdhPJzxLcDR7FBu447OEeQcCvIz/Kv1UiV/I0YFHN4pttQyNOK+aBTZnpoB/KwIPnX02Q8wHM5HZzhVaWU7a8YHyAs=
X-Received: by 2002:a7b:ce8a:: with SMTP id q10mr831376wmj.101.1618373608139;
 Tue, 13 Apr 2021 21:13:28 -0700 (PDT)
MIME-Version: 1.0
References: <20210413075155.32652-1-sjpark@amazon.de> <3ddd4f8a-8e51-662b-df11-a63a0e75b2bc@kernel.dk>
 <20210413231436.GF63242@dread.disaster.area> <f4750f9431bd12b7338a47925de8b17015da51a7.camel@surriel.com>
In-Reply-To: <f4750f9431bd12b7338a47925de8b17015da51a7.camel@surriel.com>
From: Yu Zhao <yuzhao@google.com>
Date: Tue, 13 Apr 2021 22:13:16 -0600
Message-ID: <CAOUHufafMcaG8sOS=1YMy2P_6p0R1FzP16bCwpUau7g1-PybBQ@mail.gmail.com>
Subject: Re: [PATCH v2 00/16] Multigenerational LRU Framework
To: Rik van Riel <riel@surriel.com>
Cc: Dave Chinner <david@fromorbit.com>, Jens Axboe <axboe@kernel.dk>, 
	SeongJae Park <sj38.park@gmail.com>, Linux-MM <linux-mm@kvack.org>, 
	Andi Kleen <ak@linux.intel.com>, Andrew Morton <akpm@linux-foundation.org>, 
	Benjamin Manes <ben.manes@gmail.com>, Dave Hansen <dave.hansen@linux.intel.com>, 
	Hillf Danton <hdanton@sina.com>, Johannes Weiner <hannes@cmpxchg.org>, Jonathan Corbet <corbet@lwn.net>, 
	Joonsoo Kim <iamjoonsoo.kim@lge.com>, Matthew Wilcox <willy@infradead.org>, 
	Mel Gorman <mgorman@suse.de>, Miaohe Lin <linmiaohe@huawei.com>, 
	Michael Larabel <michael@michaellarabel.com>, Michal Hocko <mhocko@suse.com>, 
	Michel Lespinasse <michel@lespinasse.org>, Roman Gushchin <guro@fb.com>, Rong Chen <rong.a.chen@intel.com>, 
	SeongJae Park <sjpark@amazon.de>, Tim Chen <tim.c.chen@linux.intel.com>, 
	Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>, Ying Huang <ying.huang@intel.com>, 
	Zi Yan <ziy@nvidia.com>, linux-kernel <linux-kernel@vger.kernel.org>, lkp@lists.01.org, 
	Kernel Page Reclaim v2 <page-reclaim@google.com>
Content-Type: multipart/alternative; boundary="00000000000007439505bfe6f885"
X-Rspamd-Queue-Id: 74E4180192D5
X-Stat-Signature: ayjdt95qyum7ox5g8hfq3qt8ahaxxh7p
X-Rspamd-Server: rspam02
Received-SPF: none (google.com>: No applicable sender policy available) receiver=imf08; identity=mailfrom; envelope-from="<yuzhao@google.com>"; helo=mail-wm1-f47.google.com; client-ip=209.85.128.47
X-HE-DKIM-Result: pass/pass
X-HE-Tag: 1618373596-686825
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

--00000000000007439505bfe6f885
Content-Type: text/plain; charset="UTF-8"

On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >
> > > The initial posting of this patchset did no better, in fact it did a bit
> > > worse. Performance dropped to the same levels and kswapd was using as
> > > much CPU as before, but on top of that we also got excessive swapping.
> > > Not at a high rate, but 5-10MB/sec continually.
> > >
> > > I had some back and forths with Yu Zhao and tested a few new revisions,
> > > and the current series does much better in this regard. Performance
> > > still dips a bit when page cache fills, but not nearly as much, and
> > > kswapd is using less CPU than before.
> >
> > Profiles would be interesting, because it sounds to me like reclaim
> > *might* be batching page cache removal better (e.g. fewer, larger
> > batches) and so spending less time contending on the mapping tree
> > lock...
> >
> > IOWs, I suspect this result might actually be a result of less lock
> > contention due to a change in batch processing characteristics of
> > the new algorithm rather than it being a "better" algorithm...
>
> That seems quite likely to me, given the issues we have
> had with virtual scan reclaim algorithms in the past.

Hi Rik,

Let me paste the code so we can move beyond the "batching" hypothesis:

static int __remove_mapping(struct address_space *mapping, struct page *page,
                            bool reclaimed, struct mem_cgroup *target_memcg)
{
        unsigned long flags;
        int refcount;
        void *shadow = NULL;

        BUG_ON(!PageLocked(page));
        BUG_ON(mapping != page_mapping(page));

        /* the mapping tree lock is taken here, once for every page removed */
        xa_lock_irqsave(&mapping->i_pages, flags);

> SeongJae, what is this algorithm supposed to do when faced
> with situations like this:

I'll assume the questions were directed at me, not SeongJae.

> 1) Running on a system with 8 NUMA nodes, and memory
>    pressure in one of those nodes.
> 2) Running PostgresQL or Oracle, with hundreds of
>    processes mapping the same (very large) shared
>    memory segment.
>
> How do you keep your algorithm from falling into the worst
> case virtual scanning scenarios that were crippling the
> 2.4 kernel 15+ years ago on systems with just a few GB of
> memory?

There is a fundamental shift: back then we were scanning for cold pages,
and nowadays we are scanning for hot pages.

I'd be surprised if scanning for cold pages didn't fall apart, because it'd
find most of the entries accessed, if they are present at all.

Scanning for hot pages, on the other hand, is way better. Let me just
reiterate:
1) It will not scan page tables from processes that have been sleeping
   since the last scan.
2) It will not scan PTE tables under non-leaf PMD entries that do not
   have the accessed bit set, when
   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
3) It will not zigzag between the PGD table and the same PMD or PTE
   table spanning multiple VMAs. In other words, it finishes all the
   VMAs within the range of the same PMD or PTE table before it returns
   to the PGD table. This optimizes workloads that have large numbers
   of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.

So the cost is roughly proportional to the number of referenced pages it
discovers. If there is no memory pressure, there is no scanning at all. For a
system under heavy memory pressure, most of the pages are referenced (otherwise
why would it be under memory pressure?), and if we use the rmap, we need to
scan a lot of pages anyway. Why not just scan them all? This way you save a
lot because of batching (now it's time to talk about batching). Besides,
page tables have far better memory locality than the rmap. For the shared
memory example you gave, the rmap needs to lock *each* page it scans. How
many 4KB pages does your large file have? I'll leave the math to you.

Here are some profiles:

zram with the rmap (mainline)
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

zram with page table scanning (this patchset)
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

Note that this is not just what I saw in some local benchmarks. We have
observed it on *millions* of machines in our fleet.

I encourage you to try it and see for yourself. It's as simple as:

git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1

CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y

and build and run your favorite benchmarks.


--00000000000007439505bfe6f885--