Date: Tue, 20 Jun 2023 08:57:48 +1000
From: Dave Chinner <david@fromorbit.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Pankaj Raghav, Matthew Wilcox, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, gost.dev@samsung.com
Subject: Re: [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize
References: <20230614114637.89759-1-hare@suse.de>
	<20230614114637.89759-7-hare@suse.de>
	<20230619080857.qxx5c7uaz6pm4h3m@localhost>
List-ID: <linux-block.vger.kernel.org>

On Mon, Jun 19, 2023 at 10:42:38AM +0200, Hannes Reinecke wrote:
> On 6/19/23 10:08, Pankaj Raghav wrote:
> > Hi Hannes,
> > On Wed, Jun 14, 2023
at 01:46:36PM +0200, Hannes Reinecke wrote:
> > > The mapping has an underlying blocksize (by virtue of
> > > mapping->host->i_blkbits), so if the mapping blocksize
> > > is larger than the pagesize we should allocate folios
> > > in the correct order.
> > >
> > Network filesystems such as 9pfs set blkbits to the maximum amount
> > of data they want to transfer, leading to unnecessary memory
> > pressure, as we will try to allocate higher order folios (order 5
> > in my setup). Isn't it better for each filesystem to request the
> > minimum folio order it needs for its page cache early on? Block
> > devices can do the same for their block cache.

Folio size is not a "filesystem wide" thing - it's a per-inode
configuration. We can have inodes within a filesystem that have
different "block" sizes. A prime example of this is XFS directories -
they can have 64kB block sizes on a 4kB block size filesystem.

Another example is extent size hints in XFS data files - they trigger
aligned allocation-around, similar to using large folios in the page
cache for small writes. Effectively this gives data files a "block
size" of the extent size hint regardless of the filesystem block
size.

Hence in future we might want different sizes of folios for
different types of inodes, and so whatever we do we need to support
per-inode folio size configuration for the inode mapping tree.

> > I have a prototype along those lines and I will post it soon.
> > This is also something willy indicated before in a mailing list
> > conversation.
> >
> Well; I _thought_ that's why we had things like optimal I/O size and
> maximal I/O size. But these seem to be relegated to request queue
> limits, so I guess they're not available from 'struct block_device'
> or 'struct gendisk'.

Yes, those are block device constructs to enable block device based
filesystems to be laid out best for the given block device. They
don't exist for non-block-based filesystems like network
filesystems...
> So I've been thinking of adding a flag somewhere (possibly in
> 'struct address_space') to indicate that blkbits is a hard limit
> and not just an advisory thing.

This still relies on interpreting inode->i_blkbits repeatedly at
runtime in some way, in mm code that really has no business looking
at filesystem block sizes.

What is needed is a field in the mapping that defines the folio
order that all folios allocated for the page cache must be
aligned/sized to so that they can be inserted into the mapping.

This means the minimum folio order and alignment is maintained
entirely by the mapping (e.g. it allows truncate to do the right
thing), and the filesystem/device side code does not need to do
anything special (except support large folios) to ensure that the
page cache always contains folios that are block sized and aligned.

We already have mapping_set_large_folios() that we use at
inode/mapping instantiation time to enable large folios in the page
cache for that mapping. What we need is a new
mapping_set_large_folio_order() API to enable the filesystem/device
to set the base folio order for the mapping tree at instantiation
time, and for all the page cache instantiation code to align/size
folios to the order stored in the mapping...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com