From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from flow-b5-smtp.messagingengine.com (flow-b5-smtp.messagingengine.com [202.12.124.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5624E1E501C; Tue, 15 Jul 2025 21:21:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=202.12.124.140 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1752614465; cv=none; b=kFdVgS0dGdYGRQjg3EPgXtu0f3g8vlK+mxejG3kVLDQL7cFwGKNGRPF8y7B0bkFjqmce4pnmGJl9CucqdsFVxj9U1AqywYmVPXdA/YHEFBIassnB7UTn/YR/+m61LEcPetbVvcnpHe4qUH+7Fgi6RbF55dDkjYnZTI0spXsntt0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1752614465; c=relaxed/simple; bh=vhYV6kzLF08a5hDrKXNokh9k27mccWgw1jxcRH+uRk8=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=ABB/PE+Cweg4jkeMtT0AnvvZ2I11BL52W3GNByVkD0LerId8Vh+Orx+nDQ1HKh7DJSFbjiOrNqek1Ux/f7/xuIvnnrcsWl9O1t0/uFskXqIoITnZWQkG6FvRFdfiI/jB7OA3TQP2ZI/KHmR1BR88CXiWeUkGtFXwSs2i72j/8NE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=bur.io; spf=pass smtp.mailfrom=bur.io; dkim=pass (2048-bit key) header.d=bur.io header.i=@bur.io header.b=lRkCdsfV; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b=Ce9F+IH1; arc=none smtp.client-ip=202.12.124.140 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=bur.io Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bur.io Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bur.io header.i=@bur.io header.b="lRkCdsfV"; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b="Ce9F+IH1" Received: from phl-compute-10.internal (phl-compute-10.phl.internal [10.202.2.50]) by mailflow.stl.internal (Postfix) with ESMTP id 62E9B1300E65; Tue, 15 Jul 2025 17:21:01 -0400 (EDT) Received: from phl-mailfrontend-02 ([10.202.2.163]) by phl-compute-10.internal (MEProxy); Tue, 15 Jul 2025 17:21:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc:cc :content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm3; t=1752614461; x=1752621661; bh=UYHWhle+vh Soss1fvdZ2btqEJcq1ge5i+HAEGBVMw8w=; b=lRkCdsfVtMWMwLHZr39F7f/8mX 921zmhadMTq7JcDH40jPOUA3zEj3R8hvQ31WXo8U+bDofR3ATiX4rwN3fcki5uL8 IVnPSyWwglK5YOStOSWQGsJlswe1a/bgW4B6uDqK+n0gpg8LHvZFjWWUcG7ZIsD5 DxuALrsh3O1VydYoDvs9AOccGag8gaCIl8ND1nBlAyE/w+1pho7we6cRkuD9WWhE xvf2bWjKL1ux0lIVVQDBG+eZ6QqsFwW0BaE/VyP+OkDaIWye4XLGehVLoILn5kDJ +xXs4v++Br2DKNV3PlvDZcRaf9VjLw6d6OD6ws8kKE331gCAlGxUn4d88x9A== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm2; t= 1752614461; x=1752621661; bh=UYHWhle+vhSoss1fvdZ2btqEJcq1ge5i+HA EGBVMw8w=; b=Ce9F+IH1uvDNbsSwJIOP/rG/i042S/gMBWsX56WO7ElW3Brcxmj c1JjYLwvXBU+k0VWWkBnSh50U7X4rUP8GAZWPiuGETvDoibxN2/HRM1Plm8bl8qY vu50j3amKiU4VMH4HzDgz05hzNo5aTorV31PNArsuFQBCWB6HMYcxYwQ+jobOedV NtLRKYyUhzppcTmuTu6k1K/7xHIEtx1/XKMnqfGHQQN6ixZ5iJGW4nMdFVcc4zMi eK+JRLEKsiUH6f6pZTdSS6m/cJmNYwKQ9XszzgDD598L05RWFPyW0ezHAewLufF9 C1jEE3fzp1bmaBf72mOvXoP8C7TURS3nvUw== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgdehheeklecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpeffhffvvefukfhfgggtuggjsehttdertddttddvnecuhfhrohhmpeeuohhrihhsuceu uhhrkhhovhcuoegsohhrihhssegsuhhrrdhioheqnecuggftrfgrthhtvghrnhepkedvke ffjeellefhveehvdejudfhjedthfdvveeiieeiudfguefgtdejgfefleejnecuvehluhhs thgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomhepsghorhhishessghurh drihhopdhnsggprhgtphhtthhopedvgedpmhhouggvpehsmhhtphhouhhtpdhrtghpthht ohepfihilhhlhiesihhnfhhrrgguvggrugdrohhrghdprhgtphhtthhopegtlhhmsehfsg drtghomhdprhgtphhtthhopehjohhsvghfsehtohigihgtphgrnhgurgdrtghomhdprhgt phhtthhopegushhtvghrsggrsehsuhhsvgdrtghomhdprhgtphhtthhopehlihhnuhigqd gsthhrfhhssehvghgvrhdrkhgvrhhnvghlrdhorhhgpdhrtghpthhtohepnhhitghosehf lhhugihnihgtrdhnvghtpdhrtghpthhtohepgihirghngheskhgvrhhnvghlrdhorhhgpd hrtghpthhtoheptghhrghosehkvghrnhgvlhdrohhrghdprhgtphhtthhopehlihhnuhig qdgvrhhofhhssehlihhsthhsrdhoiihlrggsshdrohhrgh X-ME-Proxy: Feedback-ID: i083147f8:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 15 Jul 2025 17:20:58 -0400 (EDT) Date: Tue, 15 Jul 2025 14:22:33 -0700 From: Boris Burkov To: Matthew Wilcox Cc: Chris Mason , Josef Bacik , David Sterba , linux-btrfs@vger.kernel.org, Nicolas Pitre , Gao Xiang , Chao Yu , linux-erofs@lists.ozlabs.org, Jaegeuk Kim , linux-f2fs-devel@lists.sourceforge.net, Jan Kara , linux-fsdevel@vger.kernel.org, David Woodhouse , Richard Weinberger , linux-mtd@lists.infradead.org, David Howells , netfs@lists.linux.dev, Paulo Alcantara , Konstantin Komarov , ntfs3@lists.linux.dev, Steve French , linux-cifs@vger.kernel.org, Phillip Lougher Subject: Re: Compressed files & the page cache Message-ID: <20250715212233.GA1680311@zen.localdomain> References: Precedence: bulk X-Mailing-List: linux-cifs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Tue, Jul 15, 2025 at 09:40:42PM +0100, Matthew Wilcox wrote: > I've started looking at how the page cache can help filesystems handle > compressed data better. Feedback would be appreciated! I'll probably > say a few things which are obvious to anyone who knows how compressed > files work, but I'm trying to be explicit about my assumptions. > > First, I believe that all filesystems work by compressing fixed-size > plaintext into variable-sized compressed blocks. This would be a good > point to stop reading and tell me about counterexamples. As far as I know, btrfs with zstd does not used fixed size plaintext. I am going off the btrfs logic itself, not the zstd internals which I am sadly ignorant of. We are using the streaming interface for whatever that is worth. Through the following callpath, the len is piped from the async_chunk\ through to zstd via the slightly weirdly named total_out parameter: compress_file_range() btrfs_compress_folios() compression_compress_pages() zstd_compress_folios() zstd_get_btrfs_parameters() // passes len zstd_init_cstream() // passes len for-each-folio: zstd_compress_stream() // last folio is truncated if short # bpftrace to check the size in the zstd callsite $ sudo bpftrace -e 'fentry:zstd_init_cstream {printf("%llu\n", args.pledged_src_size);}' Attaching 1 probe... 76800 # diff terminal, write a compressed extent with a weird source size $ sudo dd if=/dev/zero of=/mnt/lol/foo bs=75k count=1 We do operate in terms of folios for calling zstd_compress_stream, so that can be thought of as a fixed size plaintext block, but even so, we pass in a short block for the last one: $ sudo bpftrace -e 'fentry:zstd_compress_stream {printf("%llu\n", args.input->size);}' Attaching 1 probe... 4096 4096 4096 4096 4096 4096 4096 4096 4096 4096 4096 4096 4096 4096 4096 4096 4096 4096 3072 > > From what I've been reading in all your filesystems is that you want to > allocate extra pages in the page cache in order to store the excess data > retrieved along with the page that you're actually trying to read. That's > because compressing in larger chunks leads to better compression. > > There's some discrepancy between filesystems whether you need scratch > space for decompression. Some filesystems read the compressed data into > the pagecache and decompress in-place, while other filesystems read the > compressed data into scratch pages and decompress into the page cache. > > There also seems to be some discrepancy between filesystems whether the > decompression involves vmap() of all the memory allocated or whether the > decompression routines can handle doing kmap_local() on individual pages. > > So, my proposal is that filesystems tell the page cache that their minimum > folio size is the compression block size. That seems to be around 64k, btrfs has a max uncompressed extent size of 128K, for what it's worth. In practice, many compressed files are comprised of a large number of compressed extents each representing a 128k plaintext extent. Not sure if that is exactly the constant you are concerned with here, or if it refutes your idea in any way, just figured I would mention it as well. > so not an unreasonable minimum allocation size. That removes all the > extra code in filesystems to allocate extra memory in the page cache. > It means we don't attempt to track dirtiness at a sub-folio granularity > (there's no point, we have to write back the entire compressed bock > at once). We also get a single virtually contiguous block ... if you're > willing to ditch HIGHMEM support. Or there's a proposal to introduce a > vmap_file() which would give us a virtually contiguous chunk of memory > (and could be trivially turned into a noop for the case of trying to > vmap a single large folio). >