From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from fhigh-a5-smtp.messagingengine.com (fhigh-a5-smtp.messagingengine.com [103.168.172.156]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EA4D4380 for ; Thu, 24 Apr 2025 00:27:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=103.168.172.156 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745454443; cv=none; b=rhbEwsWXglwEO/anewRDqNLgSx8YkQGQ6vOM6wJ9aNgCwVf9WqIN9ZWikvEOXAx6LHyP7KpVLdFUYeNg9VB6OrspF0hvElOg1Oi5NgqWaV9w1unFF5M5h0BGsXSnLyHjMaNnhTU81iErSWNdeaFb0K4dyz6x+XM+GHkQAJApj2M= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745454443; c=relaxed/simple; bh=8fPZHx6mPFs76gHWx+dAB3pv1I1QTKCGSu/ydUfMGU8=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=IdfjKDFh+v5RHMgB0sUibrOk7OyZdTmWzlUTuTph9P5DrS/cMBw8ZXhG0gPtd3FTbVnfup83i0DP9h1lKlgevYUSB2MLrtJvSdT390H/yPu4jI1XqomGN1cBRspwIKjbAtS3N4IIWLKTwe2oKdI1lFtThcNF8LEk797cIzh1ouY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=bur.io; spf=pass smtp.mailfrom=bur.io; dkim=pass (2048-bit key) header.d=bur.io header.i=@bur.io header.b=bjxSX4ir; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b=J5Jmx7Vr; arc=none smtp.client-ip=103.168.172.156 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=bur.io Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bur.io Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bur.io header.i=@bur.io header.b="bjxSX4ir"; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b="J5Jmx7Vr" Received: from phl-compute-11.internal (phl-compute-11.phl.internal [10.202.2.51]) by mailfhigh.phl.internal (Postfix) with ESMTP id D14FE1140154; Wed, 23 Apr 2025 20:27:19 -0400 (EDT) Received: from phl-mailfrontend-02 ([10.202.2.163]) by phl-compute-11.internal (MEProxy); Wed, 23 Apr 2025 20:27:19 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc:cc :content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm3; t=1745454439; x=1745540839; bh=w/20hQmDEu eTan254ZtpfGuw5C2YtNNeRFkwDAFDrhI=; b=bjxSX4irqogQa18YVWDUmO8FQN vXRW8MLDDG5q+kkHyTx0u+Ffxl2yv4HmQF27DuVMlFRO4fObTe0obAT0ha/OYJAN WF21BDxKxV8F4xWsb2gQqLCp+d0snbFogfM2wTMjdEwlaxHjtipmQtGHvgevxKJa qo4vfgbA1RwPRSwtxdpvMLxhuin9DsKM0FA+rp3tAVQggrP/jjt0Lzd0VZmXFVxG ZhgFJk2UgMBbXy3heiahdIeMDMmKhwIBCmuEj1FE0oEc2X8X2ASwzVhNBzW/a/tV xPeVFPfTMbadzmRbDVVLS1qM7YVEevgb/zBDwVeaHERVtmmmSYNNHuKSsjAA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm2; t= 1745454439; x=1745540839; bh=w/20hQmDEueTan254ZtpfGuw5C2YtNNeRFk wDAFDrhI=; b=J5Jmx7VrYt8mcsltLF9wgXVHfWme21HQxk1M11NecIuUyiDFy4Z 36seQ0S5KQ4Lzd2AB+UC/mi6e1GRstKmXWcHytgZKYC23C5qUqmv509Lki+OKGda uAzhVDoFr/yiznRQZdZqTlgafKRQGf8vueyL5UYdtI8jtUOLKXg6njvYpaoSvxst cL20nCS2UZOKfnQ3EPiGQv6y7J1trcS5tvgL8a/4qw0ibEc84EaawFNwwTA82sfr e4z5EIQX40AQd/0q4F7N4TjzmolRz6F6LcKeLmTYE1t95Ykn4v4XOjp0E0czhK0l YIYhSsHjrVC0+ykQLpkfFEMtXAtr50k9/jA== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefvddrtddtgddvgeektdegucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdggtfgfnhhsuhgsshgtrhhisggv pdfurfetoffkrfgpnffqhgenuceurghilhhouhhtmecufedttdenucesvcftvggtihhpih gvnhhtshculddquddttddmnecujfgurhepfffhvfevuffkfhggtggujgesthdtredttddt vdenucfhrhhomhepuehorhhishcuuehurhhkohhvuceosghorhhishessghurhdrihhoqe enucggtffrrghtthgvrhhnpeekvdekffejleelhfevhedvjeduhfejtdfhvdevieeiiedu gfeugfdtjefgfeeljeenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrih hlfhhrohhmpegsohhrihhssegsuhhrrdhiohdpnhgspghrtghpthhtohepvddpmhhouggv pehsmhhtphhouhhtpdhrtghpthhtohepfihquhesshhushgvrdgtohhmpdhrtghpthhtoh eplhhinhhugidqsghtrhhfshesvhhgvghrrdhkvghrnhgvlhdrohhrgh X-ME-Proxy: Feedback-ID: i083147f8:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 23 Apr 2025 20:27:18 -0400 (EDT) Date: Wed, 23 Apr 2025 17:27:14 -0700 From: Boris Burkov To: Qu Wenruo Cc: linux-btrfs@vger.kernel.org Subject: Re: [PATCH v4 2/2] btrfs: make btrfs_truncate_block() zero folio range for certain subpage corner cases Message-ID: References: <50e41f048f26b5b58f55b25c045e5dfe94a8dcfd.1745443508.git.wqu@suse.com> Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50e41f048f26b5b58f55b25c045e5dfe94a8dcfd.1745443508.git.wqu@suse.com> On Thu, Apr 24, 2025 at 06:56:32AM +0930, Qu Wenruo wrote: > [BUG] > For the following fsx -e 1 run, the btrfs still fails the run on 64K > page size with 4K fs block size: > > READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk > OFFSET GOOD BAD RANGE > 0x26b3a 0x0000 0x15b4 0x0 > operation# (mod 256) for the bad data may be 21 > [...] > LOG DUMP (28 total operations): > 1( 1 mod 256): SKIPPED (no operation) > 2( 2 mod 256): SKIPPED (no operation) > 3( 3 mod 256): SKIPPED (no operation) > 4( 4 mod 256): SKIPPED (no operation) > 5( 5 mod 256): WRITE 0x1ea90 thru 0x285e0 (0x9b51 bytes) HOLE > 6( 6 mod 256): ZERO 0x1b1a8 thru 0x20bd4 (0x5a2d bytes) > 7( 7 mod 256): FALLOC 0x22b1a thru 0x272fa (0x47e0 bytes) INTERIOR > 8( 8 mod 256): WRITE 0x741d thru 0x13522 (0xc106 bytes) > 9( 9 mod 256): MAPWRITE 0x73ee thru 0xdeeb (0x6afe bytes) > 10( 10 mod 256): FALLOC 0xb719 thru 0xb994 (0x27b bytes) INTERIOR > 11( 11 mod 256): COPY 0x15ed8 thru 0x18be1 (0x2d0a bytes) to 0x25f6e thru 0x28c77 > 12( 12 mod 256): ZERO 0x1615e thru 0x1770e (0x15b1 bytes) > 13( 13 mod 256): SKIPPED (no operation) > 14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff (0x8000 bytes) to 0x1000 thru 0x8fff > 15( 15 mod 256): SKIPPED (no operation) > 16( 16 mod 256): CLONE 0xa000 thru 0xffff (0x6000 bytes) to 0x36000 thru 0x3bfff > 17( 17 mod 256): ZERO 0x14adc thru 0x1b78a (0x6caf bytes) > 18( 18 mod 256): TRUNCATE DOWN from 0x3c000 to 0x1e2e3 ******WWWW > 19( 19 mod 256): CLONE 0x4000 thru 0x11fff (0xe000 bytes) to 0x16000 thru 0x23fff > 20( 20 mod 256): FALLOC 0x311e1 thru 0x3681b (0x563a bytes) PAST_EOF > 21( 21 mod 256): FALLOC 0x351c5 thru 0x40000 (0xae3b bytes) EXTENDING > 22( 22 mod 256): WRITE 0x920 thru 0x7e51 (0x7532 bytes) > 23( 23 mod 256): COPY 0x2b58 thru 0xc508 (0x99b1 bytes) to 0x117b1 thru 0x1b161 > 24( 24 mod 256): TRUNCATE DOWN from 0x40000 to 0x3c9a5 > 25( 25 mod 256): SKIPPED (no operation) > 26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06 (0x1ae7 bytes) > 27( 27 mod 256): SKIPPED (no operation) > 28( 28 mod 256): READ 0x26b3a thru 0x36633 (0xfafa bytes) ***RRRR*** > > [CAUSE] > The involved operations are: > > fallocating to largest ever: 0x40000 > 21 pollute_eof 0x24000 thru 0x2ffff (0xc000 bytes) > 21 falloc from 0x351c5 to 0x40000 (0xae3b bytes) > 28 read 0x26b3a thru 0x36633 (0xfafa bytes) > > At operation #21 a pollute_eof is done, by memory mappaed write into > range [0x24000, 0x2ffff). > At this stage, the inode size is 0x24000, which is block aligned. > > Then fallocate happens, and since it's expanding the inode, it will call > btrfs_truncate_block() to truncate any unaligned range. > > But since the inode size is already block aligned, > btrfs_truncate_block() does nothing and exit. > > However remember the folio at 0x20000 has some range polluted already, > although they will not be written back to disk, it still affects the > page cache, resulting the later operation #28 to read out the polluted > value. > > [FIX] > Instead of early exit from btrfs_truncate_block() if the range is > already block aligned, do extra filio zeroing if the fs block size is > smaller than the page size. > > This is to address exactly the above case where memory mapped write can > still leave some garbage beyond EOF. > > Signed-off-by: Qu Wenruo > --- > fs/btrfs/inode.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 88 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 6f910c056819..23cd03716e57 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -4760,6 +4760,80 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) > return ret; > } > > +/* > + * A helper to zero out all blocks inside range [@orig_start, @orig_end) of > + * the target folio. > + * The target folio is the one containing the head or tail block of the range > + * [@from, @end]. > + * > + * This is a special case for fs block size < page size, where even if the range > + * [from, end] is already block aligned, we can still have blocks beyond EOF being > + * polluted by memory mapped write. > + */ This function name is basically the same as folio_zero_range which behaves very differently. I think it needs a name that captures the fact that it is a special case for btrfs_truncate_block > +static int zero_range_folio(struct btrfs_inode *inode, u64 from, u64 end, > + u64 orig_start, u64 orig_end, > + enum btrfs_truncate_where where) > +{ > + const u32 blocksize = inode->root->fs_info->sectorsize; > + struct address_space *mapping = inode->vfs_inode.i_mapping; > + struct extent_io_tree *io_tree = &inode->io_tree; > + struct extent_state *cached_state = NULL; > + struct btrfs_ordered_extent *ordered; > + pgoff_t index = (where == BTRFS_TRUNCATE_HEAD_BLOCK) ? > + (from >> PAGE_SHIFT) : (end >> PAGE_SHIFT); > + struct folio *folio; > + u64 block_start; > + u64 block_end; > + u64 clamp_start; > + u64 clamp_end; > + int ret = 0; > + > +again: > + folio = filemap_lock_folio(mapping, index); > + /* No folio present. */ > + if (IS_ERR(folio)) > + return 0; > + > + if (!folio_test_uptodate(folio)) { > + ret = btrfs_read_folio(NULL, folio); > + folio_lock(folio); > + if (folio->mapping != mapping) { > + folio_unlock(folio); > + folio_put(folio); > + goto again; > + } > + if (!folio_test_uptodate(folio)) { > + ret = -EIO; > + goto out_unlock; > + } > + } > + folio_wait_writeback(folio); This lock/read/wait pattern is the same as in btrfs_truncate_block and I think could benefit from being lifted into a function. > + > + clamp_start = max_t(u64, folio_pos(folio), orig_start); > + clamp_end = min_t(u64, folio_pos(folio) + folio_size(folio) - 1, > + orig_end); > + block_start = round_down(clamp_start, blocksize); > + block_end = round_up(clamp_end + 1, blocksize) - 1; > + lock_extent(io_tree, block_start, block_end, &cached_state); > + ordered = btrfs_lookup_ordered_range(inode, block_start, block_end + 1 - block_end); > + if (ordered) { > + unlock_extent(io_tree, block_start, block_end, &cached_state); > + folio_unlock(folio); > + folio_put(folio); > + btrfs_start_ordered_extent(ordered); > + btrfs_put_ordered_extent(ordered); > + goto again; > + } > + folio_zero_range(folio, clamp_start - folio_pos(folio), > + clamp_end - clamp_start + 1); > + unlock_extent(io_tree, block_start, block_end, &cached_state); > + > +out_unlock: > + folio_unlock(folio); > + folio_put(folio); > + return ret; > +} > + > /* > * Read, zero a chunk and write a block. > * > @@ -4801,8 +4875,20 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t end, > if (end == (loff_t)-1) > ASSERT(where == BTRFS_TRUNCATE_HEAD_BLOCK); > > - if (IS_ALIGNED(from, blocksize) && IS_ALIGNED(end + 1, blocksize)) > - goto out; > + if (IS_ALIGNED(from, blocksize) && IS_ALIGNED(end + 1, blocksize)) { Does this need to depend on `where` now or is it still necessary to check both? > + /* > + * The target head/tail range is already block aligned. > + * If block size >= PAGE_SIZE, meaning it's impossible to mmap a > + * page containing anything other than the target block. > + * So we can safely exit. > + * > + * Otherwise we still need to zero out the range inside the folio > + * to avoid memory mapped write to pollute beyond EOF. > + */ > + if (blocksize >= PAGE_SIZE) > + return 0; > + return zero_range_folio(inode, from, end, orig_start, orig_end, where); > + } > > if (where == BTRFS_TRUNCATE_HEAD_BLOCK) > block_start = round_down(from, blocksize); > -- > 2.49.0