From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: multipart/mixed; boundary="===============2579184042334878469==" MIME-Version: 1.0 From: Jens Axboe To: lkp@lists.01.org Subject: Re: [PATCH v2 00/16] Multigenerational LRU Framework Date: Wed, 14 Apr 2021 08:43:36 -0600 Message-ID: <91146ee7-3054-a81a-296e-e75c24f4e290@kernel.dk> In-Reply-To: <20210413231436.GF63242@dread.disaster.area> List-Id: --===============2579184042334878469== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable On 4/13/21 5:14 PM, Dave Chinner wrote: > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: >> On 4/13/21 1:51 AM, SeongJae Park wrote: >>> From: SeongJae Park >>> >>> Hello, >>> >>> >>> Very interesting work, thank you for sharing this :) >>> >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote: >>> >>>> What's new in v2 >>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >>>> Special thanks to Jens Axboe for reporting a regression in buffered >>>> I/O and helping test the fix. >>> >>> Is the discussion open? If so, could you please give me a link? >> >> I wasn't on the initial post (or any of the lists it was posted to), but >> it's on the google page reclaim list. Not sure if that is public or not. >> >> tldr is that I was pretty excited about this work, as buffered IO tends >> to suck (a lot) for high throughput applications. My test case was >> pretty simple: >> >> Randomly read a fast device, using 4k buffered IO, and watch what >> happens when the page cache gets filled up. For this particular test, >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec >> with kswapd using a lot of CPU trying to keep up. That's mainline >> behavior. > = > I see this exact same behaviour here, too, but I RCA'd it to > contention between the inode and memory reclaim for the mapping > structure that indexes the page cache. Basically the mapping tree > lock is the contention point here - you can either be adding pages > to the mapping during IO, or memory reclaim can be removing pages > from the mapping, but we can't do both at once. > = > So we end up with kswapd spinning on the mapping tree lock like so > when doing 1.6GB/s in 4kB buffered IO: > = > - 20.06% 0.00% [kernel] [k] kswapd = = =E2=96=92 > - 20.06% kswapd = = =E2=96=92 > - 20.05% balance_pgdat = = =E2=96=92 > - 20.03% shrink_node = = =E2=96=92 > - 19.92% shrink_lruvec = = =E2=96=92 > - 19.91% shrink_inactive_list = = =E2=96=92 > - 19.22% shrink_page_list = = =E2=96=92 > - 17.51% __remove_mapping = = =E2=96=92 > - 14.16% _raw_spin_lock_irqsave = = =E2=96=92 > - 14.14% do_raw_spin_lock = = =E2=96=92 > __pv_queued_spin_lock_slowpath = = =E2=96=92 > - 1.56% __delete_from_page_cache = = =E2=96=92 > 0.63% xas_store = = =E2=96=92 > - 0.78% _raw_spin_unlock_irqrestore = = =E2=96=92 > - 0.69% do_raw_spin_unlock = = =E2=96=92 > __raw_callee_save___pv_queued_spin_unlock= = =E2=96=92 > - 0.82% free_unref_page_list = = =E2=96=92 > - 0.72% free_unref_page_commit = = =E2=96=92 > 0.57% free_pcppages_bulk = = =E2=96=92 > = > And these are the processes consuming CPU: > = > 5171 root 20 0 1442496 5696 1284 R 99.7 0.0 1:07.78 fio > 1150 root 20 0 0 0 0 S 47.4 0.0 0:22.70 ks= wapd1 > 1146 root 20 0 0 0 0 S 44.0 0.0 0:21.85 ks= wapd0 > 1152 root 20 0 0 0 0 S 39.7 0.0 0:18.28 ks= wapd3 > 1151 root 20 0 0 0 0 S 15.2 0.0 0:12.14 ks= wapd2 Here's my profile when memory reclaim is active for the above mentioned test case. This is a single node system, so just kswapd. It's using around 40-45% CPU: 43.69% kswapd0 [kernel.vmlinux] [k] xas_create | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list shrink_page_list __delete_from_page_cache xas_store xas_create 16.88% kswapd0 [kernel.vmlinux] [k] queued_spin_lock_slowpath | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec | = --16.82%--shrink_inactive_list | = --16.55%--shrink_page_list | = --16.26%--_raw_spin_lock_irqsave queued_spin_lock_slowpath 9.89% kswapd0 [kernel.vmlinux] [k] shrink_page_list | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list shrink_page_list 5.46% kswapd0 [kernel.vmlinux] [k] xas_init_marks | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list shrink_page_list | = --5.41%--__delete_from_page_cache xas_init_marks 4.42% kswapd0 [kernel.vmlinux] [k] __delete_from_page_cache | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list | = --4.40%--shrink_page_list __delete_from_page_cache 2.82% kswapd0 [kernel.vmlinux] [k] isolate_lru_pages | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec | = |--1.43%--shrink_active_list | isolate_lru_pages | = --1.39%--shrink_inactive_list isolate_lru_pages 1.99% kswapd0 [kernel.vmlinux] [k] free_pcppages_bulk | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list shrink_page_list free_unref_page_list free_unref_page_commit free_pcppages_bulk 1.79% kswapd0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave | ---ret_from_fork kthread kswapd balance_pgdat | = --1.76%--shrink_node shrink_lruvec shrink_inactive_list | = --1.72%--shrink_page_list _raw_spin_lock_irqsave 1.02% kswapd0 [kernel.vmlinux] [k] workingset_eviction | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list | = --1.00%--shrink_page_list workingset_eviction > i.e. when memory reclaim kicks in, the read process has 20% less > time with exclusive access to the mapping tree to insert new pages. > Hence buffered read performance goes down quite substantially when > memory reclaim kicks in, and this really has nothing to do with the > memory reclaim LRU scanning algorithm. > = > I can actually get this machine to pin those 5 processes to 100% CPU > under certain conditions. Each process is spinning all that extra > time on the mapping tree lock, and performance degrades further. > Changing the LRU reclaim algorithm won't fix this - the workload is > solidly bound by the exclusive nature of the mapping tree lock and > the number of tasks trying to obtain it exclusively... I've seen way worse than the above as well, it's just my go-to easy test case for "man I wish buffered IO didn't suck so much". >> The initial posting of this patchset did no better, in fact it did a bit >> worse. Performance dropped to the same levels and kswapd was using as >> much CPU as before, but on top of that we also got excessive swapping. >> Not at a high rate, but 5-10MB/sec continually. >> >> I had some back and forths with Yu Zhao and tested a few new revisions, >> and the current series does much better in this regard. Performance >> still dips a bit when page cache fills, but not nearly as much, and >> kswapd is using less CPU than before. > = > Profiles would be interesting, because it sounds to me like reclaim > *might* be batching page cache removal better (e.g. fewer, larger > batches) and so spending less time contending on the mapping tree > lock... > = > IOWs, I suspect this result might actually be a result of less lock > contention due to a change in batch processing characteristics of > the new algorithm rather than it being a "better" algorithm... See above - let me know if you want to see more specific profiling as well. -- = Jens Axboe --===============2579184042334878469==-- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.1 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E1385C433B4 for ; Wed, 14 Apr 2021 14:43:41 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id F3D8B61155 for ; Wed, 14 Apr 2021 14:43:40 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org F3D8B61155 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.dk Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 3EAA56B0036; Wed, 14 Apr 2021 10:43:40 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 3C2496B006C; Wed, 14 Apr 2021 10:43:40 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 23B288D0002; Wed, 14 Apr 2021 10:43:40 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0157.hostedemail.com [216.40.44.157]) by kanga.kvack.org (Postfix) with ESMTP id 037696B0036 for ; Wed, 14 Apr 2021 10:43:39 -0400 (EDT) Received: from smtpin35.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 4E6A318019B02 for ; Wed, 14 Apr 2021 14:43:39 +0000 (UTC) X-FDA: 78031241358.35.19E4F28 Received: from mail-io1-f43.google.com (mail-io1-f43.google.com [209.85.166.43]) by imf08.hostedemail.com (Postfix) with ESMTP id F31C880192F0 for ; Wed, 14 Apr 2021 14:43:23 +0000 (UTC) Received: by mail-io1-f43.google.com with SMTP id h141so12518765iof.2 for ; Wed, 14 Apr 2021 07:43:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=YZVZDpwAdidWFtU6RiI402UdeGUs8c1Nk5u1ItVyewM=; b=N0AjO9bryWx3Ag8bLuoxEt2+guOISDaF9i5gzO9Mdr2w98+j+auDIXMUs1+g9TcBWQ Ykb4bTkF84KkfPpSFPb5BeVj3H+PQIw/zX93Uc9T+BuNEztLex0ugwV3cnUo+gAKexar ToZAcSDgTFWXrY1dzY3v9gfZE7H/QB6uJ4YB37pAqbhEkr2L9bvf7CrA4yUyvnoTSNMC PWcreuW6WuUyCmW1WCfTg/it1+Lk9cOlxeuD3QUP8XrFQNKndOlho4PAfxD1VFePiP6e T/ZMVg9ipP5x3FvwehDa3IM4KherTZWvlI3z+cImcyejFU7kvygHHwwykZkB+qC0iPru gAZg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=YZVZDpwAdidWFtU6RiI402UdeGUs8c1Nk5u1ItVyewM=; b=mLqczD4oRcSIXPqOr8TMK13pI4fUbOmzpnaOuvs9M8JweUI+tNGsitq/nxmSp/1CMc MB4E+yLPCj40jHr0qvP3LfuZQtVBe8X+239tEr58TXvhzv3N9tNm06nXQQmcJMERcNUx /OAkBn6GQQMdW0M2qsU/K+leBOe1KFpUWAwuDAaZVx8aW7ROnpK7TUCUFSpL7eIN36zG qZfiONWqDR+FMkJjVarliBX7Qaa8KdYXYngoQ5kgTRSCGChYY2VjUlCYllluL0VC7Zit ICcUUn9LmUbT24JwCxlxz5oAVILl/k1uWN64hdjofahznHrMJdxAP4Gh6gknLdMcYNUC 2/QA== X-Gm-Message-State: AOAM531xyBxohXXDsUdr4CUcR57IhBA1NDChoPxCwW28+jcvKRZXONr1 4nrNmUML/jbxeorV7f+5/tgb9Q== X-Google-Smtp-Source: ABdhPJzyjJzBMO59YSTbgba3MsXmD20jos+S+Jo0tu8a1CQqA2feVDd+7+Hmi9Ff/ONyb6f4ZP7E7g== X-Received: by 2002:a02:230d:: with SMTP id u13mr39458646jau.53.1618411416481; Wed, 14 Apr 2021 07:43:36 -0700 (PDT) Received: from [192.168.1.30] ([65.144.74.34]) by smtp.gmail.com with ESMTPSA id o6sm8264302ioa.21.2021.04.14.07.43.35 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 14 Apr 2021 07:43:36 -0700 (PDT) Subject: Re: [PATCH v2 00/16] Multigenerational LRU Framework To: Dave Chinner Cc: SeongJae Park , Yu Zhao , linux-mm@kvack.org, Andi Kleen , Andrew Morton , Benjamin Manes , Dave Hansen , Hillf Danton , Johannes Weiner , Jonathan Corbet , Joonsoo Kim , Matthew Wilcox , Mel Gorman , Miaohe Lin , Michael Larabel , Michal Hocko , Michel Lespinasse , Rik van Riel , Roman Gushchin , Rong Chen , SeongJae Park , Tim Chen , Vlastimil Babka , Yang Shi , Ying Huang , Zi Yan , linux-kernel@vger.kernel.org, lkp@lists.01.org, page-reclaim@google.com References: <20210413075155.32652-1-sjpark@amazon.de> <3ddd4f8a-8e51-662b-df11-a63a0e75b2bc@kernel.dk> <20210413231436.GF63242@dread.disaster.area> From: Jens Axboe Message-ID: <91146ee7-3054-a81a-296e-e75c24f4e290@kernel.dk> Date: Wed, 14 Apr 2021 08:43:36 -0600 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: <20210413231436.GF63242@dread.disaster.area> Content-Type: text/plain; charset=utf-8 Content-Language: en-US X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: F31C880192F0 X-Stat-Signature: cskn7fwqs6uarj39hfcr5kdfyxfrzedb Received-SPF: none (kernel.dk>: No applicable sender policy available) receiver=imf08; identity=mailfrom; envelope-from=""; helo=mail-io1-f43.google.com; client-ip=209.85.166.43 X-HE-DKIM-Result: pass/pass X-HE-Tag: 1618411403-739436 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 4/13/21 5:14 PM, Dave Chinner wrote: > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: >> On 4/13/21 1:51 AM, SeongJae Park wrote: >>> From: SeongJae Park >>> >>> Hello, >>> >>> >>> Very interesting work, thank you for sharing this :) >>> >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote: >>> >>>> What's new in v2 >>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >>>> Special thanks to Jens Axboe for reporting a regression in buffered >>>> I/O and helping test the fix. >>> >>> Is the discussion open? If so, could you please give me a link? >> >> I wasn't on the initial post (or any of the lists it was posted to), b= ut >> it's on the google page reclaim list. Not sure if that is public or no= t. >> >> tldr is that I was pretty excited about this work, as buffered IO tend= s >> to suck (a lot) for high throughput applications. My test case was >> pretty simple: >> >> Randomly read a fast device, using 4k buffered IO, and watch what >> happens when the page cache gets filled up. For this particular test, >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/s= ec >> with kswapd using a lot of CPU trying to keep up. That's mainline >> behavior. >=20 > I see this exact same behaviour here, too, but I RCA'd it to > contention between the inode and memory reclaim for the mapping > structure that indexes the page cache. Basically the mapping tree > lock is the contention point here - you can either be adding pages > to the mapping during IO, or memory reclaim can be removing pages > from the mapping, but we can't do both at once. >=20 > So we end up with kswapd spinning on the mapping tree lock like so > when doing 1.6GB/s in 4kB buffered IO: >=20 > - 20.06% 0.00% [kernel] [k] kswapd = = =E2=96=92 > - 20.06% kswapd = = =E2=96=92 > - 20.05% balance_pgdat = = =E2=96=92 > - 20.03% shrink_node = = =E2=96=92 > - 19.92% shrink_lruvec = = =E2=96=92 > - 19.91% shrink_inactive_list = = =E2=96=92 > - 19.22% shrink_page_list = = =E2=96=92 > - 17.51% __remove_mapping = = =E2=96=92 > - 14.16% _raw_spin_lock_irqsave = = =E2=96=92 > - 14.14% do_raw_spin_lock = = =E2=96=92 > __pv_queued_spin_lock_slowpath = = =E2=96=92 > - 1.56% __delete_from_page_cache = = =E2=96=92 > 0.63% xas_store = = =E2=96=92 > - 0.78% _raw_spin_unlock_irqrestore = = =E2=96=92 > - 0.69% do_raw_spin_unlock = = =E2=96=92 > __raw_callee_save___pv_queued_spin_unlo= ck = =E2=96=92 > - 0.82% free_unref_page_list = = =E2=96=92 > - 0.72% free_unref_page_commit = = =E2=96=92 > 0.57% free_pcppages_bulk = = =E2=96=92 >=20 > And these are the processes consuming CPU: >=20 > 5171 root 20 0 1442496 5696 1284 R 99.7 0.0 1:07.78 = fio > 1150 root 20 0 0 0 0 S 47.4 0.0 0:22.70 = kswapd1 > 1146 root 20 0 0 0 0 S 44.0 0.0 0:21.85 = kswapd0 > 1152 root 20 0 0 0 0 S 39.7 0.0 0:18.28 = kswapd3 > 1151 root 20 0 0 0 0 S 15.2 0.0 0:12.14 = kswapd2 Here's my profile when memory reclaim is active for the above mentioned test case. This is a single node system, so just kswapd. It's using aroun= d 40-45% CPU: 43.69% kswapd0 [kernel.vmlinux] [k] xas_create | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list shrink_page_list __delete_from_page_cache xas_store xas_create 16.88% kswapd0 [kernel.vmlinux] [k] queued_spin_lock_slowpath | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec | =20 --16.82%--shrink_inactive_list | =20 --16.55%--shrink_page_list | =20 --16.26%--_raw_spin_lock_irqsave queued_spin_lock_slowpath 9.89% kswapd0 [kernel.vmlinux] [k] shrink_page_list | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list shrink_page_list 5.46% kswapd0 [kernel.vmlinux] [k] xas_init_marks | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list shrink_page_list | =20 --5.41%--__delete_from_page_cache xas_init_marks 4.42% kswapd0 [kernel.vmlinux] [k] __delete_from_page_cache | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list | =20 --4.40%--shrink_page_list __delete_from_page_cache 2.82% kswapd0 [kernel.vmlinux] [k] isolate_lru_pages | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec | =20 |--1.43%--shrink_active_list | isolate_lru_pages | =20 --1.39%--shrink_inactive_list isolate_lru_pages 1.99% kswapd0 [kernel.vmlinux] [k] free_pcppages_bulk | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list shrink_page_list free_unref_page_list free_unref_page_commit free_pcppages_bulk 1.79% kswapd0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave | ---ret_from_fork kthread kswapd balance_pgdat | =20 --1.76%--shrink_node shrink_lruvec shrink_inactive_list | =20 --1.72%--shrink_page_list _raw_spin_lock_irqsave 1.02% kswapd0 [kernel.vmlinux] [k] workingset_eviction | ---ret_from_fork kthread kswapd balance_pgdat shrink_node shrink_lruvec shrink_inactive_list | =20 --1.00%--shrink_page_list workingset_eviction > i.e. when memory reclaim kicks in, the read process has 20% less > time with exclusive access to the mapping tree to insert new pages. > Hence buffered read performance goes down quite substantially when > memory reclaim kicks in, and this really has nothing to do with the > memory reclaim LRU scanning algorithm. >=20 > I can actually get this machine to pin those 5 processes to 100% CPU > under certain conditions. Each process is spinning all that extra > time on the mapping tree lock, and performance degrades further. > Changing the LRU reclaim algorithm won't fix this - the workload is > solidly bound by the exclusive nature of the mapping tree lock and > the number of tasks trying to obtain it exclusively... I've seen way worse than the above as well, it's just my go-to easy test case for "man I wish buffered IO didn't suck so much". >> The initial posting of this patchset did no better, in fact it did a b= it >> worse. Performance dropped to the same levels and kswapd was using as >> much CPU as before, but on top of that we also got excessive swapping. >> Not at a high rate, but 5-10MB/sec continually. >> >> I had some back and forths with Yu Zhao and tested a few new revisions= , >> and the current series does much better in this regard. Performance >> still dips a bit when page cache fills, but not nearly as much, and >> kswapd is using less CPU than before. >=20 > Profiles would be interesting, because it sounds to me like reclaim > *might* be batching page cache removal better (e.g. fewer, larger > batches) and so spending less time contending on the mapping tree > lock... >=20 > IOWs, I suspect this result might actually be a result of less lock > contention due to a change in batch processing characteristics of > the new algorithm rather than it being a "better" algorithm... See above - let me know if you want to see more specific profiling as well. --=20 Jens Axboe