From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Brian Foster <bfoster@redhat.com>, Theodore Ts'o <tytso@mit.edu>,
Sasha Levin <sashal@kernel.org>,
adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org
Subject: [PATCH AUTOSEL 6.12 19/19] ext4: partial zero eof block on unaligned inode size extension
Date: Sun, 24 Nov 2024 07:38:54 -0500 [thread overview]
Message-ID: <20241124123912.3335344-19-sashal@kernel.org> (raw)
In-Reply-To: <20241124123912.3335344-1-sashal@kernel.org>
From: Brian Foster <bfoster@redhat.com>
[ Upstream commit c7fc0366c65628fd69bfc310affec4918199aae2 ]
Using mapped writes, it's technically possible to expose stale
post-eof data on a truncate up operation. Consider the following
example:
$ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \
-c "truncate 8k" -c "pread -v 2k 16" <file>
...
00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX
...
This shows that the post-eof data written via mwrite lands within
EOF after a truncate up. While this is deliberate of the test case,
behavior is somewhat unpredictable because writeback does post-eof
zeroing, and writeback can occur at any time in the background. For
example, an fsync inserted between the mwrite and truncate causes
the subsequent read to instead return zeroes. This basically means
that there is a race window in this situation between any subsequent
extending operation and writeback that dictates whether post-eof
data is exposed to the file or zeroed.
To prevent this problem, perform partial block zeroing as part of
the various inode size extending operations that are susceptible to
it. For truncate extension, zero around the original eof similar to
how truncate down does partial zeroing of the new eof. For extension
via writes and fallocate related operations, zero the newly exposed
range of the file to cover any partial zeroing that must occur at
the original and new eof blocks.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/ext4/extents.c | 7 ++++++-
fs/ext4/inode.c | 51 +++++++++++++++++++++++++++++++++--------------
2 files changed, 42 insertions(+), 16 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 34e25eee65219..20a0a5c0bfd93 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4482,7 +4482,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
int depth = 0;
struct ext4_map_blocks map;
unsigned int credits;
- loff_t epos;
+ loff_t epos, old_size = i_size_read(inode);
BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
map.m_lblk = offset;
@@ -4541,6 +4541,11 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
if (ext4_update_inode_size(inode, epos) & 0x1)
inode_set_mtime_to_ts(inode,
inode_get_ctime(inode));
+ if (epos > old_size) {
+ pagecache_isize_extended(inode, old_size, epos);
+ ext4_zero_partial_blocks(handle, inode,
+ old_size, epos - old_size);
+ }
}
ret2 = ext4_mark_inode_dirty(handle, inode);
ext4_update_inode_fsync_trans(handle, inode, 1);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 54bdd4884fe67..f460418e2bdae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1307,8 +1307,10 @@ static int ext4_write_end(struct file *file,
folio_unlock(folio);
folio_put(folio);
- if (old_size < pos && !verity)
+ if (old_size < pos && !verity) {
pagecache_isize_extended(inode, old_size, pos);
+ ext4_zero_partial_blocks(handle, inode, old_size, pos - old_size);
+ }
/*
* Don't mark the inode dirty under folio lock. First, it unnecessarily
* makes the holding time of folio lock longer. Second, it forces lock
@@ -1423,8 +1425,10 @@ static int ext4_journalled_write_end(struct file *file,
folio_unlock(folio);
folio_put(folio);
- if (old_size < pos && !verity)
+ if (old_size < pos && !verity) {
pagecache_isize_extended(inode, old_size, pos);
+ ext4_zero_partial_blocks(handle, inode, old_size, pos - old_size);
+ }
if (size_changed) {
ret2 = ext4_mark_inode_dirty(handle, inode);
@@ -2985,7 +2989,8 @@ static int ext4_da_do_write_end(struct address_space *mapping,
struct inode *inode = mapping->host;
loff_t old_size = inode->i_size;
bool disksize_changed = false;
- loff_t new_i_size;
+ loff_t new_i_size, zero_len = 0;
+ handle_t *handle;
if (unlikely(!folio_buffers(folio))) {
folio_unlock(folio);
@@ -3029,18 +3034,21 @@ static int ext4_da_do_write_end(struct address_space *mapping,
folio_unlock(folio);
folio_put(folio);
- if (old_size < pos)
+ if (pos > old_size) {
pagecache_isize_extended(inode, old_size, pos);
+ zero_len = pos - old_size;
+ }
- if (disksize_changed) {
- handle_t *handle;
+ if (!disksize_changed && !zero_len)
+ return copied;
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
- ext4_mark_inode_dirty(handle, inode);
- ext4_journal_stop(handle);
- }
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ if (zero_len)
+ ext4_zero_partial_blocks(handle, inode, old_size, zero_len);
+ ext4_mark_inode_dirty(handle, inode);
+ ext4_journal_stop(handle);
return copied;
}
@@ -5426,6 +5434,14 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
}
if (attr->ia_size != inode->i_size) {
+ /* attach jbd2 jinode for EOF folio tail zeroing */
+ if (attr->ia_size & (inode->i_sb->s_blocksize - 1) ||
+ oldsize & (inode->i_sb->s_blocksize - 1)) {
+ error = ext4_inode_attach_jinode(inode);
+ if (error)
+ goto err_out;
+ }
+
handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
if (IS_ERR(handle)) {
error = PTR_ERR(handle);
@@ -5436,12 +5452,17 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
orphan = 1;
}
/*
- * Update c/mtime on truncate up, ext4_truncate() will
- * update c/mtime in shrink case below
+ * Update c/mtime and tail zero the EOF folio on
+ * truncate up. ext4_truncate() handles the shrink case
+ * below.
*/
- if (!shrink)
+ if (!shrink) {
inode_set_mtime_to_ts(inode,
inode_set_ctime_current(inode));
+ if (oldsize & (inode->i_sb->s_blocksize - 1))
+ ext4_block_truncate_page(handle,
+ inode->i_mapping, oldsize);
+ }
if (shrink)
ext4_fc_track_range(handle, inode,
--
2.43.0
prev parent reply other threads:[~2024-11-24 12:39 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-24 12:38 [PATCH AUTOSEL 6.12 01/19] s390/pci: Sort PCI functions prior to creating virtual busses Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 02/19] s390/pci: Use topology ID for multi-function devices Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 03/19] s390/pci: Ignore RID for isolated VFs Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 04/19] epoll: annotate racy check Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 05/19] kselftest/arm64: Log fp-stress child startup errors to stdout Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 06/19] s390/cpum_sf: Handle CPU hotplug remove during sampling Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 07/19] block: RCU protect disk->conv_zones_bitmap Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 08/19] btrfs: don't take dev_replace rwsem on task already holding it Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 09/19] btrfs: zlib: make the compression path to handle sector size < page size Sasha Levin
2024-11-25 15:20 ` David Sterba
2024-12-10 16:20 ` Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 10/19] btrfs: make extent_range_clear_dirty_for_io() to handle sector size < page size cases Sasha Levin
2024-11-25 15:21 ` David Sterba
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 11/19] btrfs: avoid unnecessary device path update for the same device Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 12/19] btrfs: canonicalize the device path before adding it Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 13/19] btrfs: reduce lock contention when eb cache miss for btree search Sasha Levin
2024-11-25 11:23 ` Filipe Manana
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 14/19] btrfs: do not clear read-only when adding sprout device Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 15/19] btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl() Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 16/19] md/raid1: Handle bio_split() errors Sasha Levin
2024-11-25 8:55 ` John Garry
2024-12-10 16:21 ` Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 17/19] kselftest/arm64: Corrupt P0 in the irritator when testing SSVE Sasha Levin
2024-11-24 12:38 ` [PATCH AUTOSEL 6.12 18/19] kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all() Sasha Levin
2024-11-24 12:38 ` Sasha Levin [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241124123912.3335344-19-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=adilger.kernel@dilger.ca \
--cc=bfoster@redhat.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=stable@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox