From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.5 required=3.0 tests=BAYES_00,DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AA9FAC433B4 for ; Wed, 14 Apr 2021 10:00:14 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id F187B6128E for ; Wed, 14 Apr 2021 10:00:13 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org F187B6128E Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 5291D6B0071; Wed, 14 Apr 2021 06:00:13 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 4DB7C6B0072; Wed, 14 Apr 2021 06:00:13 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 305198D0003; Wed, 14 Apr 2021 06:00:13 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0137.hostedemail.com [216.40.44.137]) by kanga.kvack.org (Postfix) with ESMTP id 0A56E6B0071 for ; Wed, 14 Apr 2021 06:00:13 -0400 (EDT) Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id B567282499A8 for ; Wed, 14 Apr 2021 10:00:12 +0000 (UTC) X-FDA: 78030527064.13.02F0105 Received: from mail-io1-f49.google.com (mail-io1-f49.google.com [209.85.166.49]) by imf13.hostedemail.com (Postfix) with ESMTP id 1F92CE005F02 for ; Wed, 14 Apr 2021 10:00:08 +0000 (UTC) Received: by mail-io1-f49.google.com with SMTP id h141so11700395iof.2 for ; Wed, 14 Apr 2021 03:00:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to; bh=swi+6kI5Lk3F77BNju/FBA7qsCHeXk+GPezJxX+2Hcc=; b=wFKtJfxIABTmiQDEfm7rWPGz6rXnuPaF+sLVoHM51LkYyZt7UcAdGppARYFDIxSXPQ P/LhbFST1y/mX7ytG2MhDtQ3JlDxsk0JWjJmolHQf/vFxoz/PeP4PcBfThg/fw1n8FqE Y3771lL9inQwqzxXDY0ASzzlwb/IyrW2U7Db1Lp18/c5S9RDLV54wfyogZyGdSYI/eZI 4yjYk+iYJLwYRWIRbA7h2erF/yoxGczTNSFXeDBFV8q689IEZ6KYu3UFmlncNzSEu3DV VbHQ5mTMfZT39dLR2dKqUEffE9CpVrOdvUrCTWgMp1MSox4sNjcf+UNQD+nbSe4AXR69 pZcg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to; bh=swi+6kI5Lk3F77BNju/FBA7qsCHeXk+GPezJxX+2Hcc=; b=OrAzv3cJhPxFT+S7uvzb5L1nkybjIUPposjFGofrlklEYSAQuJEMgri093vFT39S+N fWCi6KNLqWB9rIlRuTa7yvueuWv4FBtEOhD3OqAfkRtGF8AUviKBT0AkfGktZGnriHB3 g6fILRyDXOL1ltWqxtIofsyHHGwmoe8OGDRcFxCP3duU83Zd2YqLIN0XDpDiGGKehMMg N8nGW7LepQi/dpf9uivyXMFZxBoCQ3CPNiPTI2CxJ6Yrd0n2wCgdJBHQ97lrGytMpmEt DTX+TLSMRKkyZAnJ8CGKaGnkLu5bMkTBMVli+M4cFSp+UEZTju+mFFknJbN9bNwXDMku iKbA== X-Gm-Message-State: AOAM5308lQnTamNVotCFxp8sBHTnYNG/dNYYsfdoh6NTBvcU+qE1Cl06 OeWcvFC/FCGxaJqMzqRtSTTQVw== X-Google-Smtp-Source: ABdhPJzOHyPBiGGCOh9t+y3JEbx8ndRwoe05bs+4vJk+Fuyl9XhnH56JSk+rkVz/BGhNNc4ZI5UPFw== X-Received: by 2002:a05:6638:20a:: with SMTP id e10mr10703880jaq.48.1618394411106; Wed, 14 Apr 2021 03:00:11 -0700 (PDT) Received: from google.com ([2620:15c:183:200:2822:c2d3:ef00:fa3c]) by smtp.gmail.com with ESMTPSA id l1sm8050426ioj.52.2021.04.14.03.00.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 14 Apr 2021 03:00:10 -0700 (PDT) Date: Wed, 14 Apr 2021 04:00:05 -0600 From: Yu Zhao To: Dave Chinner Cc: Jens Axboe , SeongJae Park , Linux-MM , Andi Kleen , Andrew Morton , Benjamin Manes , Dave Hansen , Hillf Danton , Johannes Weiner , Jonathan Corbet , Joonsoo Kim , Matthew Wilcox , Mel Gorman , Miaohe Lin , Michael Larabel , Michal Hocko , Michel Lespinasse , Rik van Riel , Roman Gushchin , Rong Chen , SeongJae Park , Tim Chen , Vlastimil Babka , Yang Shi , Ying Huang , Zi Yan , linux-kernel , lkp@lists.01.org, Kernel Page Reclaim v2 Subject: Re: [PATCH v2 00/16] Multigenerational LRU Framework Message-ID: References: <20210413075155.32652-1-sjpark@amazon.de> <3ddd4f8a-8e51-662b-df11-a63a0e75b2bc@kernel.dk> <20210413231436.GF63242@dread.disaster.area> <20210414045006.GR1990290@dread.disaster.area> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: X-Rspamd-Queue-Id: 1F92CE005F02 X-Stat-Signature: scrcf6jd8ko6wwz7ej7tip4myrno1tkm X-Rspamd-Server: rspam02 Received-SPF: none (google.com>: No applicable sender policy available) receiver=imf13; identity=mailfrom; envelope-from=""; helo=mail-io1-f49.google.com; client-ip=209.85.166.49 X-HE-DKIM-Result: pass/pass X-HE-Tag: 1618394408-573970 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote: > On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner wro= te: > > > > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote: > > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner = wrote: > > > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: > > > > > On 4/13/21 1:51 AM, SeongJae Park wrote: > > > > > > From: SeongJae Park > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > Very interesting work, thank you for sharing this :) > > > > > > > > > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote: > > > > > > > > > > > >> What's new in v2 > > > > > >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > > >> Special thanks to Jens Axboe for reporting a regression in b= uffered > > > > > >> I/O and helping test the fix. > > > > > > > > > > > > Is the discussion open? If so, could you please give me a li= nk? > > > > > > > > > > I wasn't on the initial post (or any of the lists it was posted= to), but > > > > > it's on the google page reclaim list. Not sure if that is publi= c or not. > > > > > > > > > > tldr is that I was pretty excited about this work, as buffered = IO tends > > > > > to suck (a lot) for high throughput applications. My test case = was > > > > > pretty simple: > > > > > > > > > > Randomly read a fast device, using 4k buffered IO, and watch wh= at > > > > > happens when the page cache gets filled up. For this particular= test, > > > > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-= 1.6GB/sec > > > > > with kswapd using a lot of CPU trying to keep up. That's mainli= ne > > > > > behavior. > > > > > > > > I see this exact same behaviour here, too, but I RCA'd it to > > > > contention between the inode and memory reclaim for the mapping > > > > structure that indexes the page cache. Basically the mapping tree > > > > lock is the contention point here - you can either be adding page= s > > > > to the mapping during IO, or memory reclaim can be removing pages > > > > from the mapping, but we can't do both at once. > > > > > > > > So we end up with kswapd spinning on the mapping tree lock like s= o > > > > when doing 1.6GB/s in 4kB buffered IO: > > > > > > > > - 20.06% 0.00% [kernel] [k] kswapd = = =E2=96=92 > > > > - 20.06% kswapd = = =E2=96=92 > > > > - 20.05% balance_pgdat = = =E2=96=92 > > > > - 20.03% shrink_node = = =E2=96=92 > > > > - 19.92% shrink_lruvec = = =E2=96=92 > > > > - 19.91% shrink_inactive_list = = =E2=96=92 > > > > - 19.22% shrink_page_list = = =E2=96=92 > > > > - 17.51% __remove_mapping = = =E2=96=92 > > > > - 14.16% _raw_spin_lock_irqsave = = =E2=96=92 > > > > - 14.14% do_raw_spin_lock = = =E2=96=92 > > > > __pv_queued_spin_lock_slowpath = = =E2=96=92 > > > > - 1.56% __delete_from_page_cache = = =E2=96=92 > > > > 0.63% xas_store = = =E2=96=92 > > > > - 0.78% _raw_spin_unlock_irqrestore = = =E2=96=92 > > > > - 0.69% do_raw_spin_unlock = = =E2=96=92 > > > > __raw_callee_save___pv_queued_spi= n_unlock = =E2=96=92 > > > > - 0.82% free_unref_page_list = = =E2=96=92 > > > > - 0.72% free_unref_page_commit = = =E2=96=92 > > > > 0.57% free_pcppages_bulk = = =E2=96=92 > > > > > > > > And these are the processes consuming CPU: > > > > > > > > 5171 root 20 0 1442496 5696 1284 R 99.7 0.0 1:= 07.78 fio > > > > 1150 root 20 0 0 0 0 S 47.4 0.0 0:= 22.70 kswapd1 > > > > 1146 root 20 0 0 0 0 S 44.0 0.0 0:= 21.85 kswapd0 > > > > 1152 root 20 0 0 0 0 S 39.7 0.0 0:= 18.28 kswapd3 > > > > 1151 root 20 0 0 0 0 S 15.2 0.0 0:= 12.14 kswapd2 > > > > > > > > i.e. when memory reclaim kicks in, the read process has 20% less > > > > time with exclusive access to the mapping tree to insert new page= s. > > > > Hence buffered read performance goes down quite substantially whe= n > > > > memory reclaim kicks in, and this really has nothing to do with t= he > > > > memory reclaim LRU scanning algorithm. > > > > > > > > I can actually get this machine to pin those 5 processes to 100% = CPU > > > > under certain conditions. Each process is spinning all that extra > > > > time on the mapping tree lock, and performance degrades further. > > > > Changing the LRU reclaim algorithm won't fix this - the workload = is > > > > solidly bound by the exclusive nature of the mapping tree lock an= d > > > > the number of tasks trying to obtain it exclusively... > > > > > > > > > The initial posting of this patchset did no better, in fact it = did a bit > > > > > worse. Performance dropped to the same levels and kswapd was us= ing as > > > > > much CPU as before, but on top of that we also got excessive sw= apping. > > > > > Not at a high rate, but 5-10MB/sec continually. > > > > > > > > > > I had some back and forths with Yu Zhao and tested a few new re= visions, > > > > > and the current series does much better in this regard. Perform= ance > > > > > still dips a bit when page cache fills, but not nearly as much,= and > > > > > kswapd is using less CPU than before. > > > > > > > > Profiles would be interesting, because it sounds to me like recla= im > > > > *might* be batching page cache removal better (e.g. fewer, larger > > > > batches) and so spending less time contending on the mapping tree > > > > lock... > > > > > > > > IOWs, I suspect this result might actually be a result of less lo= ck > > > > contention due to a change in batch processing characteristics of > > > > the new algorithm rather than it being a "better" algorithm... > > > > > > I appreciate the profile. But there is no batching in > > > __remove_mapping() -- it locks the mapping for each page, and > > > therefore the lock contention penalizes the mainline and this patch= set > > > equally. It looks worse on your system because the four kswapd thre= ads > > > from different nodes were working on the same file. > > > > I think you misunderstand exactly what I mean by "batching" here. > > I'm not talking about doing multiple pieces of work under a single > > lock. What I mean is that the overall amount of work done in a > > single reclaim scan (i.e a "reclaim batch") is packaged differently. > > > > We already batch up page reclaim via building a page list and then > > passing it to shrink_page_list() to process the batch of pages in a > > single pass. Each page in this page list batch then calls > > remove_mapping() to pull the page form the LRU, we have a run of > > contention between the foreground read() thread and the background > > kswapd. > > > > If the size or nature of the pages in the batch passed to > > shrink_page_list() changes, then the amount of time a reclaim batch > > is going to put pressure on the mapping tree lock will also change. > > That's the "change in batching behaviour" I'm referring to here. I > > haven't read through the patchset to determine if you change the > > shrink_page_list() algorithm, but it likely changes what is passed > > to be reclaimed and that in turn changes the locking patterns that > > fall out of shrink_page_list... >=20 > Ok, if we are talking about the size of the batch passed to > shrink_page_list(), both the mainline and this patchset cap it at > SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when > running fio/io_uring, it's safe to say both use 32. >=20 > > > And kswapd is only one of two paths that could affect the performan= ce. > > > The kernel context of the test process is where the improvement mai= nly > > > comes from. > > > > > > I also suspect you were testing a file much larger than your memory > > > size. If so, sorry to tell you that a file only a few times larger, > > > e.g. twice, would be worse. > > > > > > Here is my take: > > > > > > Claim > > > ----- > > > This patchset is a "better" algorithm. (Technically it's not an > > > algorithm, it's a feedback loop.) > > > > > > Theoretical basis > > > ----------------- > > > An open-loop control (the mainline) can only be better if the margi= n > > > of error in its prediction of the future events is less than that f= rom > > > the trial-and-error of a closed-loop control (this patchset). For > > > simple machines, it surely can. For page reclaim, AFAIK, it can't. > > > > > > A typical example: when randomly accessing a (not infinitely) large > > > file via buffered io long enough, we're bound to hit the same block= s > > > multiple times. Should we activate the pages containing those block= s, > > > i.e., to move them to the active lru list? No. > > > > > > RCA > > > --- > > > For the fio/io_uring benchmark, the "No" is the key. > > > > > > The mainline activates pages accessed multiple times. This is done = in > > > the buffered io access path by mark_page_accessed(), and it takes t= he > > > lru lock, which is contended under memory pressure. This contention > > > slows down both the access path and kswapd. But kswapd is not the > > > problem here because we are measuring the io_uring process, not ksw= ap. > > > > > > For this patchset, there are no activations since the refault rates= of > > > pages accessed multiple times are similar to those accessed only on= ce > > > -- activations will only be done to pages from tiers with higher > > > refault rates. > > > > > > If you wish to debunk > > > --------------------- > > > > Nope, it's your job to convince us that it works, not the other way > > around. It's up to you to prove that your assertions are correct, > > not for us to prove they are false. >=20 > Just trying to keep people motivated, my homework is my own. >=20 > > > git fetch https://linux-mm.googlesource.com/page-reclaim refs/chang= es/73/1173/1 > > > > > > CONFIG_LRU_GEN=3Dy > > > CONFIG_LRU_GEN_ENABLED=3Dy > > > > > > Run your benchmarks > > > > > > Profiles (200G mem + 400G file) > > > ------------------------------- > > > A quick test from Jens' fio/io_uring: > > > > > > -rc7 > > > 13.30% io_uring xas_load > > > 13.22% io_uring _copy_to_iter > > > 12.30% io_uring __add_to_page_cache_locked > > > 7.43% io_uring clear_page_erms > > > 4.18% io_uring filemap_get_read_batch > > > 3.54% io_uring get_page_from_freelist > > > 2.98% io_uring ***native_queued_spin_lock_slowpath*** > > > 1.61% io_uring page_cache_ra_unbounded > > > 1.16% io_uring xas_start > > > 1.08% io_uring filemap_read > > > 1.07% io_uring ***__activate_page*** > > > > > > lru lock: 2.98% (lru addition + activation) > > > activation: 1.07% > > > > > > -rc7 + this patchset > > > 14.44% io_uring xas_load > > > 14.14% io_uring _copy_to_iter > > > 11.15% io_uring __add_to_page_cache_locked > > > 6.56% io_uring clear_page_erms > > > 4.44% io_uring filemap_get_read_batch > > > 2.14% io_uring get_page_from_freelist > > > 1.32% io_uring page_cache_ra_unbounded > > > 1.20% io_uring psi_group_change > > > 1.18% io_uring filemap_read > > > 1.09% io_uring ****native_queued_spin_lock_slowpath**** > > > 1.08% io_uring do_mpage_readpage > > > > > > lru lock: 1.09% (lru addition only) > > > > All this tells us is that there was *less contention on the mapping > > tree lock*. It does not tell us why there was less contention. > > > > You've handily omitted the kswapd profile, which is really the one > > of interest to the discussion here - how did the memory reclaim CPU > > usage profile also change at the same time? >=20 > Well, let me attach them. Suffix -1 is the mainline, -2 is the patchset= . >=20 > mainline > 57.65% kswapd0 __remove_mapping > this patchset > 61.61% kswapd0 __remove_mapping >=20 > As I said, the mapping lock contention penalizes both heavily. Its > percentage is even higher with the patchset, because it has less > overhead. I'm trying to explain "the less overhead" part: it's the > activations that make the mainline worse. >=20 > mainline > 6.53% kswapd0 shrink_active_list > this patchset > 0 >=20 > From the io_uring context: > mainline > 2.53% io_uring mark_page_accessed > this patchset > 0.52% io_uring mark_page_accessed >=20 > mark_page_accessed() moves pages accessed multiple times to the active > lru list. Then shrink_active_list() moves them back to the inactive > list. All for nothing. >=20 > I don't want to paste everything here -- they'd clutter. Please see > all the detailed profiles in the attachment. Let me know if their > formats are no to your liking. I still have the raw perf.data. >=20 > > > And I plan to reach out to other communities, e.g., PostgreSQL, to > > > benchmark the patchset. I heard they have been complaining about th= e > > > buffered io performance under memory pressure. Any other benchmarks > > > you'd suggest? > > > > > > BTW, you might find another surprise in how less frequently slab > > > shrinkers are called under memory pressure, because this patchset i= s a > > > lot better at finding pages to reclaim and therefore doesn't overki= ll > > > slabs. > > > > That's actually very likely to be a Bad Thing and cause unexpected > > perofrmance and OOM based regressions. When the machine finally runs > > out of page cache it can easily reclaim, it's going to get stuck > > with long tail latencies reclaiming huge slab caches as they've had > > no substantial ongoing pressure put on them to keep them in balance > > with the overall memory pressure the system is under... >=20 > Well. It does use the existing equation. That is if it scans X% of > pages, then it scans X% of slab objects. But 1) it often finds pages > to reclaim at a lower X% 2) the pages it reclaims are less likely to > refault. So the side effect is the overall slab objects it scans also > reduce. I do see your point but don't see any options, at the moment. I apologize for the spam. Apparent the attachment in my previous email didn't reach everybody. I hope this would work: git clone https://linux-mm.googlesource.com/benchmarks Repo contains profiles collected when running fio/io_uring, mainline: kswapd-1.txt kswapd-1.svg io_uring-1.txt io_uring-1.svg =20 patched: kswapd-2.txt kswapd-2.svg io_uring-2.txt io_uring-2.svg Thanks.