Date: Tue, 27 Oct 2020 12:14:47 +0000
From: Matthew Wilcox
To: Dave Chinner
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: Splitting a THP beyond EOF
Message-ID: <20201027121447.GV20115@casper.infradead.org>
References: <20201020014357.GW20115@casper.infradead.org>
 <20201020045928.GO7391@dread.disaster.area>
 <20201020112138.GZ20115@casper.infradead.org>
 <20201020211634.GQ7391@dread.disaster.area>
 <20201020225331.GE20115@casper.infradead.org>
 <20201021221435.GR7391@dread.disaster.area>
 <20201021230422.GP20115@casper.infradead.org>
 <20201027053126.GY7391@dread.disaster.area>
In-Reply-To: <20201027053126.GY7391@dread.disaster.area>

On Tue, Oct 27, 2020 at 04:31:26PM +1100, Dave Chinner wrote:
> On Thu, Oct 22, 2020 at 12:04:22AM +0100, Matthew Wilcox wrote:
> > On Thu, Oct 22, 2020 at 09:14:35AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 20, 2020 at 11:53:31PM +0100, Matthew Wilcox wrote:
> > > > True, we don't _have to_ split THP on holepunch/truncation/... but it's
> > > > a better implementation to free pages which cover blocks that no longer
> > > > have data associated with them.
> > >
> > > "Better" is a very subjective measure.  What numbers do you have
> > > to back that up?
> >
> > None.  When we choose to use a THP, we're choosing to treat a chunk
> > of a file as a single unit for the purposes of tracking dirtiness,
> > age, membership of the workingset, etc.  We're trading off reduced
> > precision for reduced overhead; just like the CPU tracks dirtiness on
> > a cacheline basis instead of at byte level.
> >
> > So at some level, we're making the assumption that this 128kB THP is
> > all one thing and it should be tracked together.  But the user has just
> > punched a hole in it.  I can think of no stronger signal to say "The
> > piece before this hole, the piece I just got rid of and the piece after
> > this are three separate pieces of the file".
>
> There's a difference between the physical layout of the file and
> representing data efficiently in the page cache. Just because we can
> use a THP to represent a single extent doesn't mean we should always
> use that relationship, nor should we require that small
> manipulations of on-disk extent state require that page cache pages
> be split or gathered.
>
> i.e. the whole point of the page cache is to decouple the physical
> layout of the file from the user access mechanisms for performance
> reasons, not tie them tightly together. I think that's the wrong
> approach to be taking here - truncate/holepunch do not imply that
> THPs need to be split unconditionally. Indeed, readahead doesn't
> care that a THP might be split across multiple extents and require
> multiple bios to bring the data into cache, so why should
> truncate/holepunch type operations require the THP to be split to
> reflect underlying disk layouts?

At the time we do readahead, we've inferred from the user's access
patterns that they're reading this file, if not sequentially, then
close enough to sequentially that it makes sense to bring in more of
the file.  The on-media layout of the file is irrelevant, as you say.

Now the user has given us another hint about how they see the file.
A call to FALLOC_FL_PUNCH_HOLE is certainly an instruction to the
filesystem to change the layout, but it's also giving the page cache
information about how the file is being treated.  It tells us that the
portion of the file before the hole is different from the portion of
the file after the hole, and treating those two portions of the file
as being similar for the purposes of working set tracking is going to
lead to wrong decisions.

Let's take an example where an app uses 1kB fixed-size records.  First
it does a linear scan (so readahead kicks in and we get all the way up
to allocating 256kB pages).  Then it decides some records are obsolete,
so it calls PUNCH_HOLE on the range 20kB to 27kB in the page, then
PUNCH_HOLE 40kB-45kB and finally PUNCH_HOLE 150kB-160kB (see the
fallocate() sketch at the end of this mail).

In my current scheme, this splits the page into 4kB pages.  If the app
then only operates on the records after 160kB and before 20kB, the
pages used to cache records in the 24kB-40kB and 44kB-150kB ranges will
naturally fall out of cache and the memory will be used for other
purposes.  With your scheme, the 256kB page would be retained in cache
as a single piece.

> > If I could split them into pieces that weren't single pages, I would.
> > Zi Yan has a patch to do just that, and I'm very much looking forward
> > to that being merged.  But saying "Oh, this is quite small, I'll keep
> > the rest of the THP together" is conceptually wrong.
>
> Yet that's exactly what we do with block size < PAGE_SIZE
> configurations, so I fail to see why it's conceptually wrong for
> THPs to behave the same way as normal pages....

We don't have the ability to mmap files at smaller than PAGE_SIZE
granularity, so we can't do that.

> > I'm not saying that my patchset is the last word and there will be no
> > tweaking.  I'm saying I think it's good enough, an improvement on the
> > status quo, and it's better to merge it for 5.11 than to keep it out of
> > tree for another three months while we tinker with improving it.
> >
> > Do you disagree?
>
> In part. Concepts and algorithms need to be sound and agreed upon
> before we merge patches, and right now I disagree with some of
> the basic assumptions about how THP and filesystem layout operations
> are being coupled. That part needs to be sorted before stuff gets
> merged...

They're not being coupled.  I'm using the information the user is
giving the kernel to make better decisions about what to cache.
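
For concreteness, here is a minimal userspace sketch of the hole punches
in the record-store example above.  The filename, the punch() helper and
the record geometry are made up for illustration; the only real API used
is fallocate(2) with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: punch a hole covering [offset, offset + len) */
static void punch(int fd, off_t offset, off_t len)
{
	/* FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len) < 0) {
		perror("fallocate");
		exit(EXIT_FAILURE);
	}
}

int main(void)
{
	/*
	 * Hypothetical file of 1kB fixed-size records, already scanned
	 * sequentially so the page cache holds 256kB pages.
	 */
	int fd = open("records.db", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* The three hole punches from the example above */
	punch(fd, 20 * 1024,  7 * 1024);	/* 20kB-27kB */
	punch(fd, 40 * 1024,  5 * 1024);	/* 40kB-45kB */
	punch(fd, 150 * 1024, 10 * 1024);	/* 150kB-160kB */

	return 0;
}

Whether those three calls leave the cache split into 4kB pages (so the
records before 20kB and after 160kB can stay hot while the middle ages
out) or leave the whole 256kB page intact is exactly the behavioural
difference being argued about above.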