Date: Tue, 19 Mar 2024 22:19:57 +0000
From: Matthew Wilcox
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] mm/filemap: optimize filemap folio adding
References: <20240319092733.4501-1-ryncsn@gmail.com> <20240319092733.4501-5-ryncsn@gmail.com>
In-Reply-To: <20240319092733.4501-5-ryncsn@gmail.com>

On Tue, Mar 19, 2024 at 05:27:33PM +0800, Kairui Song wrote:
> From: Kairui Song
>
> Instead of doing multiple tree walks, do one optimism range check
> with lock hold, and exit if raced with another insertion. If a shadow
> exists, check it with a new xas_get_order helper before releasing the
> lock to avoid redundant tree walks for getting its order.
>
> Drop the lock and do the allocation only if a split is needed.
>
> In the best case, it only need to walk the tree once. If it needs
> to alloc and split, 3 walks are issued (One for first ranced
> conflict check and order retrieving, one for the second check after
> allocation, one for the insert after split).
>
> Testing with 4k pages, in an 8G cgroup, with 20G brd as block device:
>
> fio -name=cached --numjobs=16 --filename=/mnt/test.img \
>   --buffered=1 --ioengine=mmap --rw=randread --time_based \
>   --ramp_time=30s --runtime=5m --group_reporting
>
> Before:
> bw ( MiB/s): min= 790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
> iops : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698
>
> After (+4%):
> bw ( MiB/s): min= 451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
> iops : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653
>
> Test result with THP (do a THP randread then switch to 4K page in hope it
> issues a lot of splitting):
>
> fio -name=cached --numjobs=16 --filename=/mnt/test.img \
>   --buffered=1 --ioengine mmap -thp=1 --readonly \
>   --rw=randread --random_distribution=random \
>   --time_based --runtime=5m --group_reporting
>
> fio -name=cached --numjobs=16 --filename=/mnt/test.img \
>   --buffered=1 --ioengine mmap --readonly \
>   --rw=randread --random_distribution=random \
>   --time_based --runtime=5s --group_reporting
>
> Before:
> bw ( KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
> iops : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
> bw ( MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
> iops : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144
>
> After (+-0.0%):
> bw ( KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
> iops : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
> bw ( MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
> iops : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144
>
> The performance is better (+4%) for 4K cached read and unchanged for THP.
>
> Signed-off-by: Kairui Song
> ---
>  mm/filemap.c | 127 ++++++++++++++++++++++++++++++---------------------
>  1 file changed, 76 insertions(+), 51 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6bbec8783793..c1484bcdbddb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -848,12 +848,77 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
>  }
>  EXPORT_SYMBOL_GPL(replace_page_cache_folio);
>
> +static int __split_add_folio_locked(struct xa_state *xas, struct folio *folio,
> +				    pgoff_t index, gfp_t gfp, void **shadowp)

I don't love the name of this function.  Splitting is a rare thing that
it does.  I'd suggest it's more filemap_store().

> +{
> +	void *entry, *shadow, *alloced_shadow = NULL;
> +	int order, alloced_order = 0;
> +
> +	gfp &= GFP_RECLAIM_MASK;
> +	for (;;) {
> +		shadow = NULL;
> +		order = 0;
> +
> +		xas_for_each_conflict(xas, entry) {
> +			if (!xa_is_value(entry))
> +				return -EEXIST;
> +			shadow = entry;
> +		}
> +
> +		if (shadow) {
> +			if (shadow == xas_reload(xas)) {

Why do you need the xas_reload here?

> +				order = xas_get_order(xas);
> +				if (order && order > folio_order(folio)) {
> +					/* entry may have been split before we acquired lock */
> +					if (shadow != alloced_shadow || order != alloced_order)
> +						goto unlock;
> +					xas_split(xas, shadow, order);
> +					xas_reset(xas);
> +				}
> +				order = 0;
> +			}

I don't think this is right.  I think we can end up skipping a split
and storing a folio into a slot which is of greater order than the folio
we're storing.

> +			if (shadowp)
> +				*shadowp = shadow;
> +		}
> +
> +		xas_store(xas, folio);
> +		/* Success, return with mapping locked */
> +		if (!xas_error(xas))
> +			return 0;
> +unlock:
> +		/*
> +		 * Unlock path, if errored, return unlocked.
> +		 * If allocation needed, alloc and retry.
> +		 */
> +		xas_unlock_irq(xas);
> +		if (order) {
> +			if (unlikely(alloced_order))
> +				xas_destroy(xas);
> +			xas_split_alloc(xas, shadow, order, gfp);
> +			if (!xas_error(xas)) {
> +				alloced_shadow = shadow;
> +				alloced_order = order;
> +			}
> +			goto next;
> +		}
> +		/* xas_nomem result checked by xas_error below */
> +		xas_nomem(xas, gfp);
> +next:
> +		xas_lock_irq(xas);
> +		if (xas_error(xas))
> +			return xas_error(xas);
> +
> +		xas_reset(xas);
> +	}
> +}

Splitting this out into a different function while changing the logic
really makes this hard to review ;-(  I don't object to splitting the
function, but maybe two patches; one to move the logic and a second
to change it?
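
The concern about skipping a split can be illustrated with a small
userspace model.  This is only a sketch under stated assumptions, not
kernel code: the toy_* names are hypothetical and nothing here uses the
real XArray API.  It assumes that an entry of order n covers 1 << n
consecutive indices, and that an existing entry of greater order than
the folio has to be split before the folio is stored.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only (hypothetical names, not the kernel XArray API).
 * An entry of order n is assumed to cover 1 << n consecutive indices.
 */
static unsigned long toy_nr_indices(unsigned int order)
{
	return 1UL << order;
}

/*
 * An existing entry has to be split before a folio is stored over it
 * when the entry covers more indices than the folio does.
 */
static bool toy_needs_split(unsigned int entry_order, unsigned int folio_order)
{
	return entry_order > folio_order;
}

/*
 * Number of indices in the old entry's range that the new folio does
 * not cover; zero when no split is needed.
 */
static unsigned long toy_uncovered_indices(unsigned int entry_order,
					   unsigned int folio_order)
{
	if (!toy_needs_split(entry_order, folio_order))
		return 0;
	return toy_nr_indices(entry_order) - toy_nr_indices(folio_order);
}
```

With an order-9 shadow entry and an order-0 folio, toy_needs_split()
returns true and 511 indices of the old entry's range lie outside the
new folio.  That is the case the comment on the "order = 0" path is
worried about.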