From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9BCCDC433E1 for ; Sat, 22 Aug 2020 06:06:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7C4C1208E4 for ; Sat, 22 Aug 2020 06:06:28 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="SOovbYFa" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725959AbgHVGG1 (ORCPT ); Sat, 22 Aug 2020 02:06:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47628 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725863AbgHVGG1 (ORCPT ); Sat, 22 Aug 2020 02:06:27 -0400 Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C5451C061573; Fri, 21 Aug 2020 23:06:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=rguNHvh6bim75mcF3k7vPor/ululkyldZg3PQC54Ho4=; b=SOovbYFaTWd9jsdqgjd4Jq3A2J 2SZ3Pz4NKpVrwfAa+g6oPkFfF2n5NMqo1CZs56uv8LnTOAT5/Z0VUl1RtYgP69nh8xzVPcdKNTK71 SxolTnvYhHwbuMobewVkhpJZWgumDYWoHf4gDBi+f/Y72MGt9/osvQhRNN7cNSEOIiQduhV/kZA0V YJicFZNpsWjy0TcIG1mv/KZ9vFmuSom/HpR/w60c6Ye9c4XHpc+1kmeuJqtTH4kfhveHEvNt1mGHe VGWXiZ269P9tzHOlkuQmDr6OFdkMsp2hHZX1o2IE2tcYnBekq477FRyFV/tQV4SlUKRfh6/9YQ2eG uiadUBLQ==; Received: from hch by casper.infradead.org with local (Exim 4.92.3 #3 (Red Hat Linux)) id 1k9MfS-00054o-Gu; Sat, 22 Aug 2020 06:06:18 +0000 Date: Sat, 22 Aug 2020 07:06:18 +0100 From: Christoph Hellwig To: Matthew Wilcox Cc: Christoph Hellwig , darrick.wong@oracle.com, david@fromorbit.com, yukuai3@huawei.com, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org Subject: Re: Splitting an iomap_page Message-ID: <20200822060618.GE17129@infradead.org> References: <20200821144021.GV17456@casper.infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200821144021.GV17456@casper.infradead.org> X-SRS-Rewrite: SMTP reverse-path rewritten from by casper.infradead.org. See http://www.infradead.org/rpr.html Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Fri, Aug 21, 2020 at 03:40:21PM +0100, Matthew Wilcox wrote: > I have only bad ideas for solving this problem, so I thought I'd share. > > Operations like holepunch or truncate call into > truncate_inode_pages_range() which just remove THPs which are > entirely within the punched range, but pages which contain one or both > ends of the range need to be split. > > What I have right now, and works, calls do_invalidatepage() from > truncate_inode_pages_range(). The iomap implementation just calls > iomap_page_release(). We have the page locked, and we've waited for > writeback to finish, so there is no I/O outstanding against the page. > We may then get into one of the writepage methods without an iomap being > allocated, so we have to allocate it. Patches here: > > http://git.infradead.org/users/willy/pagecache.git/commitdiff/167f81e880ef00d83ab7db50d56ed85bfbae2601 > http://git.infradead.org/users/willy/pagecache.git/commitdiff/82fe90cde95420c3cf155b54ed66c74d5fb6ffc5 > > If the page is Uptodate, this works great. But if it's not, then we're > going to unnecessarily re-read data from storage -- indeed, we may as > well just dump the entire page if it's !Uptodate. Of course, right now > the only way to get THPs is through readahead and that's going to always > read the entire page, so we're never going to see a !Uptodate THP. But > in the future, maybe we shall? I wouldn't like to rely on that without > pasting some big warnings for anyone who tries to change it. > > Alternative 1: Allocate a new iop for each base page if needed. This is > the obvious approach. If the block size is >= PAGE_SIZE, we don't even > need to allocate a new iop; we can just mark the page as being Uptodate > if that range is. The problem is that we might need to allocate a lot of > them -- 512 if we're splitting a 2MB page into 4kB chunks (which would > be 12kB -- plus a 2kB array to hold 512 pointers). And at the point > where we know we need to allocate them, we're under a spin_lock_irq(). > We could allocate them in advance, but then we might find out we aren't > going to split this page after all. > > Alternative 2: Always allocate an iop for each base page in a THP. We pay > the allocation price up front for every THP, even though the majority > will never be split. It does save us from allocating any iop at all for > block size >= page size, but it's a lot of extra overhead to gather all > the Uptodate bits. As above, it'd be 12kB of iops vs 80 bytes that we > currently allocate for a 2MB THP. 144 once we also track dirty bits. > > Alternative 3: Allow pages to share an iop. Do something like add a > pgoff_t base and a refcount_t to the iop and have each base page point > to the same iop, using the part of the bit array indexed by (page->index > - base) * blocks_per_page. The read/write count are harder to share, > and I'm not sure I see a way to do it. > > Alternative 4: Don't support partially-uptodate THPs. We declare (and > enforce with strategic assertions) that all THPs must always be Uptodate > (or Error) once unlocked. If your workload benefits from using THPs, > you want to do big I/Os anyway, so these "write 512 bytes at a time > using O_SYNC" workloads aren't going to use THPs. > > Funnily, buffer_heads are easier here. They just need to be reparented > to their new page. Of course, they're allocated up front, so they're > essentially alternative 2. At least initially I'd go for 4. And then see if someone screams loudly enough to reconsider. And if we really have to I wonder if we can do a variation of variant 1 where we avoid allocating under the irqs disabled spinlock by a clever retry trick.