* [PATCHSET 0/2] iomap: fix unshare data corruption bug
@ 2023-09-18 23:11 Darrick J. Wong
2023-09-18 23:12 ` [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range Darrick J. Wong
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Darrick J. Wong @ 2023-09-18 23:11 UTC (permalink / raw)
To: djwong; +Cc: ritesh.list, willy, linux-fsdevel, linux-xfs, ritesh.list, willy
Hi all,
I rebased djwong-dev atop 6.6-rc1, and discovered that the iomap unshare
code writes garbage into the unshared file if the unshared range covers
at least one base page's worth of file range and there weren't any
folios in the pagecache for that region.
The root cause is an optimization applied to __iomap_write_begin for 6.6
that caused it to ignore !uptodate folios. This is fine for the
write/zeroing cases since they're going to write to the folio anyway,
but unshare merely marks the folio dirty and lets writeback handle the
unsharing.
While I was rooting around in there, I also noticed that the unshare
operation wasn't ported to use large folios. This leads to suboptimal
performance if userspace funshares a file and continues using the page
cache, since the cache is now using base pages.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been lightly tested with fstests. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=iomap-fix-unshare-6.6
---
fs/iomap/buffered-io.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range
2023-09-18 23:11 [PATCHSET 0/2] iomap: fix unshare data corruption bug Darrick J. Wong
@ 2023-09-18 23:12 ` Darrick J. Wong
2023-09-19 4:42 ` Ritesh Harjani
2023-09-19 5:14 ` Ritesh Harjani
2023-09-18 23:12 ` [PATCH 2/2] iomap: convert iomap_unshare_iter to use large folios Darrick J. Wong
2023-09-18 23:19 ` [PATCH 3/2] fstests: test FALLOC_FL_UNSHARE when pagecache is not loaded Darrick J. Wong
2 siblings, 2 replies; 11+ messages in thread
From: Darrick J. Wong @ 2023-09-18 23:12 UTC (permalink / raw)
To: djwong; +Cc: ritesh.list, willy, linux-fsdevel, linux-xfs, ritesh.list, willy
From: Darrick J. Wong <djwong@kernel.org>
Prior to commit a01b8f225248e, we would always read in the contents of a
!uptodate folio prior to writing userspace data into the folio,
allocated a folio state object, etc. Ritesh introduced an optimization
that skips all of that if the write would cover the entire folio.
Unfortunately, the optimization misses the unshare case, where we always
have to read in the folio contents since there isn't a data buffer
supplied by userspace. This can result in stale kernel memory exposure
if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
file that isn't already cached.
This was caught by observing fstests regressions in the "unshare around"
mechanism that is used for unaligned writes to a reflinked realtime
volume when the realtime extent size is larger than 1FSB, though I think
it applies to any shared file.
Cc: ritesh.list@gmail.com, willy@infradead.org
Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ae8673ce08b1..0350830fc989 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -640,11 +640,13 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
size_t poff, plen;
/*
- * If the write completely overlaps the current folio, then
+ * If the write or zeroing completely overlaps the current folio, then
* entire folio will be dirtied so there is no need for
* per-block state tracking structures to be attached to this folio.
+ * For the unshare case, we must read in the ondisk contents because we
+ * are not changing pagecache contents.
*/
- if (pos <= folio_pos(folio) &&
+ if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
pos + len >= folio_pos(folio) + folio_size(folio))
return 0;
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/2] iomap: convert iomap_unshare_iter to use large folios
2023-09-18 23:11 [PATCHSET 0/2] iomap: fix unshare data corruption bug Darrick J. Wong
2023-09-18 23:12 ` [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range Darrick J. Wong
@ 2023-09-18 23:12 ` Darrick J. Wong
2023-09-19 8:03 ` Ritesh Harjani
2023-09-18 23:19 ` [PATCH 3/2] fstests: test FALLOC_FL_UNSHARE when pagecache is not loaded Darrick J. Wong
2 siblings, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2023-09-18 23:12 UTC (permalink / raw)
To: djwong; +Cc: ritesh.list, willy, linux-fsdevel, linux-xfs, ritesh.list, willy
From: Darrick J. Wong <djwong@kernel.org>
Convert iomap_unshare_iter to create large folios if possible, since the
write and zeroing paths already do that. I think this got missed in the
conversion of the write paths that landed in 6.6-rc1.
Cc: ritesh.list@gmail.com, willy@infradead.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0350830fc989..db889bdfd327 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1263,7 +1263,6 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
const struct iomap *srcmap = iomap_iter_srcmap(iter);
loff_t pos = iter->pos;
loff_t length = iomap_length(iter);
- long status = 0;
loff_t written = 0;
/* don't bother with blocks that are not shared to start with */
@@ -1274,9 +1273,10 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
return length;
do {
- unsigned long offset = offset_in_page(pos);
- unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
struct folio *folio;
+ int status;
+ size_t offset;
+ size_t bytes = min_t(u64, SIZE_MAX, length);
status = iomap_write_begin(iter, pos, bytes, &folio);
if (unlikely(status))
@@ -1284,18 +1284,22 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
if (iter->iomap.flags & IOMAP_F_STALE)
break;
- status = iomap_write_end(iter, pos, bytes, bytes, folio);
- if (WARN_ON_ONCE(status == 0))
+ offset = offset_in_folio(folio, pos);
+ if (bytes > folio_size(folio) - offset)
+ bytes = folio_size(folio) - offset;
+
+ bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+ if (WARN_ON_ONCE(bytes == 0))
return -EIO;
cond_resched();
- pos += status;
- written += status;
- length -= status;
+ pos += bytes;
+ written += bytes;
+ length -= bytes;
balance_dirty_pages_ratelimited(iter->inode->i_mapping);
- } while (length);
+ } while (length > 0);
return written;
}
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 3/2] fstests: test FALLOC_FL_UNSHARE when pagecache is not loaded
2023-09-18 23:11 [PATCHSET 0/2] iomap: fix unshare data corruption bug Darrick J. Wong
2023-09-18 23:12 ` [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range Darrick J. Wong
2023-09-18 23:12 ` [PATCH 2/2] iomap: convert iomap_unshare_iter to use large folios Darrick J. Wong
@ 2023-09-18 23:19 ` Darrick J. Wong
2023-09-19 5:51 ` Ritesh Harjani
2 siblings, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2023-09-18 23:19 UTC (permalink / raw)
To: ritesh.list, willy, linux-fsdevel, linux-xfs
Add a regression test for funsharing uncached files to ensure that we
actually manage the pagecache state correctly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
tests/xfs/1936 | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++
tests/xfs/1936.out | 4 ++
2 files changed, 92 insertions(+)
create mode 100755 tests/xfs/1936
create mode 100644 tests/xfs/1936.out
diff --git a/tests/xfs/1936 b/tests/xfs/1936
new file mode 100755
index 0000000000..bcf9b6b478
--- /dev/null
+++ b/tests/xfs/1936
@@ -0,0 +1,88 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2023 Oracle. All Rights Reserved.
+#
+# FS QA Test 1936
+#
+# This is a regression test for the kernel commit noted below. The stale
+# memory exposure can be exploited by creating a file with shared blocks,
+# evicting the page cache for that file, and then funshareing at least one
+# memory page's worth of data. iomap will mark the page uptodate and dirty
+# without ever reading the ondisk contents.
+#
+. ./common/preamble
+_begin_fstest auto quick unshare clone
+
+_cleanup()
+{
+ cd /
+ rm -r -f $tmp.* $testdir
+}
+
+# real QA test starts here
+
+# Import common functions.
+. ./common/filter
+. ./common/attr
+. ./common/reflink
+
+_fixed_by_git_commit kernel XXXXXXXXXXXXX \
+ "iomap: don't skip reading in !uptodate folios when unsharing a range"
+
+# real QA test starts here
+_require_test_reflink
+_require_cp_reflink
+_require_xfs_io_command "funshare"
+
+testdir=$TEST_DIR/test-$seq
+rm -rf $testdir
+mkdir $testdir
+
+# Create a file that is at least four pages in size and aligned to the
+# file allocation unit size so that we don't trigger any unnecessary zeroing.
+pagesz=$(_get_page_size)
+alloc_unit=$(_get_file_block_size $TEST_DIR)
+filesz=$(( ( (4 * pagesz) + alloc_unit - 1) / alloc_unit * alloc_unit))
+
+echo "Create the original file and a clone"
+_pwrite_byte 0x61 0 $filesz $testdir/file2.chk >> $seqres.full
+_pwrite_byte 0x61 0 $filesz $testdir/file1 >> $seqres.full
+_cp_reflink $testdir/file1 $testdir/file2
+_cp_reflink $testdir/file1 $testdir/file3
+
+_test_cycle_mount
+
+cat $testdir/file3 > /dev/null
+
+echo "Funshare at least one pagecache page"
+$XFS_IO_PROG -c "funshare 0 $filesz" $testdir/file2
+$XFS_IO_PROG -c "funshare 0 $filesz" $testdir/file3
+_pwrite_byte 0x61 0 $filesz $testdir/file2.chk >> $seqres.full
+
+echo "Check contents"
+
+# file2 wasn't cached when it was unshared, but it should match
+if ! cmp -s $testdir/file2.chk $testdir/file2; then
+ echo "file2.chk does not match file2"
+
+ echo "file2.chk contents" >> $seqres.full
+ od -tx1 -Ad -c $testdir/file2.chk >> $seqres.full
+ echo "file2 contents" >> $seqres.full
+ od -tx1 -Ad -c $testdir/file2 >> $seqres.full
+ echo "end bad contents" >> $seqres.full
+fi
+
+# file3 was cached when it was unshared, and it should match
+if ! cmp -s $testdir/file2.chk $testdir/file3; then
+ echo "file2.chk does not match file3"
+
+ echo "file2.chk contents" >> $seqres.full
+ od -tx1 -Ad -c $testdir/file2.chk >> $seqres.full
+ echo "file3 contents" >> $seqres.full
+ od -tx1 -Ad -c $testdir/file3 >> $seqres.full
+ echo "end bad contents" >> $seqres.full
+fi
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1936.out b/tests/xfs/1936.out
new file mode 100644
index 0000000000..c7c820ced5
--- /dev/null
+++ b/tests/xfs/1936.out
@@ -0,0 +1,4 @@
+QA output created by 1936
+Create the original file and a clone
+Funshare at least one pagecache page
+Check contents
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range
2023-09-18 23:12 ` [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range Darrick J. Wong
@ 2023-09-19 4:42 ` Ritesh Harjani
2023-09-19 9:24 ` Ritesh Harjani
2023-09-19 5:14 ` Ritesh Harjani
1 sibling, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2023-09-19 4:42 UTC (permalink / raw)
To: Darrick J. Wong, djwong; +Cc: willy, linux-fsdevel, linux-xfs, willy
"Darrick J. Wong" <djwong@kernel.org> writes:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Prior to commit a01b8f225248e, we would always read in the contents of a
> !uptodate folio prior to writing userspace data into the folio,
> allocated a folio state object, etc. Ritesh introduced an optimization
> that skips all of that if the write would cover the entire folio.
>
> Unfortunately, the optimization misses the unshare case, where we always
> have to read in the folio contents since there isn't a data buffer
> supplied by userspace. This can result in stale kernel memory exposure
> if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
> file that isn't already cached.
>
> This was caught by observing fstests regressions in the "unshare around"
> mechanism that is used for unaligned writes to a reflinked realtime
> volume when the realtime extent size is larger than 1FSB, though I think
> it applies to any shared file.
>
> Cc: ritesh.list@gmail.com, willy@infradead.org
> Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/iomap/buffered-io.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
Thanks for catching this case. Fix for this looks good to me.
I have verified on my setup. w/o this patch it indeed can cause
corruption in the unshare case, since we don't read the disk contents
and we might end up writing garbage from the page cache.
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ae8673ce08b1..0350830fc989 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -640,11 +640,13 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> size_t poff, plen;
>
> /*
> - * If the write completely overlaps the current folio, then
> + * If the write or zeroing completely overlaps the current folio, then
> * entire folio will be dirtied so there is no need for
> * per-block state tracking structures to be attached to this folio.
> + * For the unshare case, we must read in the ondisk contents because we
> + * are not changing pagecache contents.
> */
> - if (pos <= folio_pos(folio) &&
> + if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
> pos + len >= folio_pos(folio) + folio_size(folio))
> return 0;
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range
2023-09-18 23:12 ` [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range Darrick J. Wong
2023-09-19 4:42 ` Ritesh Harjani
@ 2023-09-19 5:14 ` Ritesh Harjani
2023-09-19 5:24 ` Darrick J. Wong
1 sibling, 1 reply; 11+ messages in thread
From: Ritesh Harjani @ 2023-09-19 5:14 UTC (permalink / raw)
To: Darrick J. Wong, djwong; +Cc: willy, linux-fsdevel, linux-xfs, willy
"Darrick J. Wong" <djwong@kernel.org> writes:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Prior to commit a01b8f225248e, we would always read in the contents of a
> !uptodate folio prior to writing userspace data into the folio,
> allocated a folio state object, etc. Ritesh introduced an optimization
> that skips all of that if the write would cover the entire folio.
>
> Unfortunately, the optimization misses the unshare case, where we always
> have to read in the folio contents since there isn't a data buffer
> supplied by userspace. This can result in stale kernel memory exposure
> if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
> file that isn't already cached.
>
> This was caught by observing fstests regressions in the "unshare around"
> mechanism that is used for unaligned writes to a reflinked realtime
> volume when the realtime extent size is larger than 1FSB,
I was wondering what is testcase that you are referring here to?
Can you please tell the testcase no. and the mkfs / mount config options
which I can use to observe the regression please?
> though I think it applies to any shared file.
>
> Cc: ritesh.list@gmail.com, willy@infradead.org
> Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/iomap/buffered-io.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ae8673ce08b1..0350830fc989 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -640,11 +640,13 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> size_t poff, plen;
>
> /*
> - * If the write completely overlaps the current folio, then
> + * If the write or zeroing completely overlaps the current folio, then
> * entire folio will be dirtied so there is no need for
> * per-block state tracking structures to be attached to this folio.
> + * For the unshare case, we must read in the ondisk contents because we
> + * are not changing pagecache contents.
> */
> - if (pos <= folio_pos(folio) &&
> + if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
> pos + len >= folio_pos(folio) + folio_size(folio))
> return 0;
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range
2023-09-19 5:14 ` Ritesh Harjani
@ 2023-09-19 5:24 ` Darrick J. Wong
2023-09-19 5:32 ` Darrick J. Wong
0 siblings, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2023-09-19 5:24 UTC (permalink / raw)
To: Ritesh Harjani; +Cc: willy, linux-fsdevel, linux-xfs
On Tue, Sep 19, 2023 at 10:44:58AM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
>
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Prior to commit a01b8f225248e, we would always read in the contents of a
> > !uptodate folio prior to writing userspace data into the folio,
> > allocated a folio state object, etc. Ritesh introduced an optimization
> > that skips all of that if the write would cover the entire folio.
> >
> > Unfortunately, the optimization misses the unshare case, where we always
> > have to read in the folio contents since there isn't a data buffer
> > supplied by userspace. This can result in stale kernel memory exposure
> > if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
> > file that isn't already cached.
> >
> > This was caught by observing fstests regressions in the "unshare around"
> > mechanism that is used for unaligned writes to a reflinked realtime
> > volume when the realtime extent size is larger than 1FSB,
>
> I was wondering what is testcase that you are referring here to?
> Can you please tell the testcase no. and the mkfs / mount config options
> which I can use to observe the regression please?
https://lore.kernel.org/linux-fsdevel/169507871947.772278.5767091361086740046.stgit@frogsfrogsfrogs/T/#m8081f74f4f1fcb862399aa1544be082aabe56765
(any xfs config with reflink enabled)
--D
> > though I think it applies to any shared file.
> >
> > Cc: ritesh.list@gmail.com, willy@infradead.org
> > Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/iomap/buffered-io.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index ae8673ce08b1..0350830fc989 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -640,11 +640,13 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> > size_t poff, plen;
> >
> > /*
> > - * If the write completely overlaps the current folio, then
> > + * If the write or zeroing completely overlaps the current folio, then
> > * entire folio will be dirtied so there is no need for
> > * per-block state tracking structures to be attached to this folio.
> > + * For the unshare case, we must read in the ondisk contents because we
> > + * are not changing pagecache contents.
> > */
> > - if (pos <= folio_pos(folio) &&
> > + if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
> > pos + len >= folio_pos(folio) + folio_size(folio))
> > return 0;
> >
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range
2023-09-19 5:24 ` Darrick J. Wong
@ 2023-09-19 5:32 ` Darrick J. Wong
0 siblings, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2023-09-19 5:32 UTC (permalink / raw)
To: Ritesh Harjani; +Cc: willy, linux-fsdevel, linux-xfs
On Mon, Sep 18, 2023 at 10:24:34PM -0700, Darrick J. Wong wrote:
> On Tue, Sep 19, 2023 at 10:44:58AM +0530, Ritesh Harjani wrote:
> > "Darrick J. Wong" <djwong@kernel.org> writes:
> >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Prior to commit a01b8f225248e, we would always read in the contents of a
> > > !uptodate folio prior to writing userspace data into the folio,
> > > allocated a folio state object, etc. Ritesh introduced an optimization
> > > that skips all of that if the write would cover the entire folio.
> > >
> > > Unfortunately, the optimization misses the unshare case, where we always
> > > have to read in the folio contents since there isn't a data buffer
> > > supplied by userspace. This can result in stale kernel memory exposure
> > > if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
> > > file that isn't already cached.
> > >
> > > This was caught by observing fstests regressions in the "unshare around"
> > > mechanism that is used for unaligned writes to a reflinked realtime
> > > volume when the realtime extent size is larger than 1FSB,
> >
> > I was wondering what is testcase that you are referring here to?
> > Can you please tell the testcase no. and the mkfs / mount config options
> > which I can use to observe the regression please?
>
> https://lore.kernel.org/linux-fsdevel/169507871947.772278.5767091361086740046.stgit@frogsfrogsfrogs/T/#m8081f74f4f1fcb862399aa1544be082aabe56765
>
> (any xfs config with reflink enabled)
*OH* you meant which testcase in the realtime reflink patchset.
This testcase:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/commit/tests/xfs/1919?h=djwong-wtf&id=56538e8882ac52e606882cfcab7e46dcb64d2a62
And this tag:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=realtime-reflink-extsize_2023-09-12
If you rebase this branch against 6.6-rc1.
Then you need this xfsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/tag/?h=realtime-reflink-extsize_2023-09-12
and ... MKFS_OPTIONS='-d rtinherit=1, -n parent=1, -r extsize=28k,rtgroups=1'
along with a SCRATCH_RTDE.
I'm basically done porting djwong-dev to 6.6 and will likely have an
initial patchbomb of more online fsck stuff for 6.7 in a few days.
--D
> --D
>
> > > though I think it applies to any shared file.
> > >
> > > Cc: ritesh.list@gmail.com, willy@infradead.org
> > > Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > fs/iomap/buffered-io.c | 6 ++++--
> > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index ae8673ce08b1..0350830fc989 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -640,11 +640,13 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> > > size_t poff, plen;
> > >
> > > /*
> > > - * If the write completely overlaps the current folio, then
> > > + * If the write or zeroing completely overlaps the current folio, then
> > > * entire folio will be dirtied so there is no need for
> > > * per-block state tracking structures to be attached to this folio.
> > > + * For the unshare case, we must read in the ondisk contents because we
> > > + * are not changing pagecache contents.
> > > */
> > > - if (pos <= folio_pos(folio) &&
> > > + if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
> > > pos + len >= folio_pos(folio) + folio_size(folio))
> > > return 0;
> > >
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/2] fstests: test FALLOC_FL_UNSHARE when pagecache is not loaded
2023-09-18 23:19 ` [PATCH 3/2] fstests: test FALLOC_FL_UNSHARE when pagecache is not loaded Darrick J. Wong
@ 2023-09-19 5:51 ` Ritesh Harjani
0 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2023-09-19 5:51 UTC (permalink / raw)
To: Darrick J. Wong, willy, linux-fsdevel, linux-xfs
"Darrick J. Wong" <djwong@kernel.org> writes:
> Add a regression test for funsharing uncached files to ensure that we
> actually manage the pagecache state correctly.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> tests/xfs/1936 | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> tests/xfs/1936.out | 4 ++
> 2 files changed, 92 insertions(+)
> create mode 100755 tests/xfs/1936
> create mode 100644 tests/xfs/1936.out
1936? I am not sure how that works though.
./new I guess automatically gives the testcase no. for a new testcase right.
>
> diff --git a/tests/xfs/1936 b/tests/xfs/1936
> new file mode 100755
> index 0000000000..bcf9b6b478
> --- /dev/null
> +++ b/tests/xfs/1936
> @@ -0,0 +1,88 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2023 Oracle. All Rights Reserved.
> +#
> +# FS QA Test 1936
> +#
> +# This is a regression test for the kernel commit noted below. The stale
> +# memory exposure can be exploited by creating a file with shared blocks,
> +# evicting the page cache for that file, and then funshareing at least one
> +# memory page's worth of data. iomap will mark the page uptodate and dirty
> +# without ever reading the ondisk contents.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick unshare clone
> +
> +_cleanup()
> +{
> + cd /
> + rm -r -f $tmp.* $testdir
> +}
> +
> +# real QA test starts here
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/attr
> +. ./common/reflink
> +
> +_fixed_by_git_commit kernel XXXXXXXXXXXXX \
> + "iomap: don't skip reading in !uptodate folios when unsharing a range"
> +
> +# real QA test starts here
> +_require_test_reflink
> +_require_cp_reflink
> +_require_xfs_io_command "funshare"
> +
> +testdir=$TEST_DIR/test-$seq
> +rm -rf $testdir
> +mkdir $testdir
> +
> +# Create a file that is at least four pages in size and aligned to the
> +# file allocation unit size so that we don't trigger any unnecessary zeroing.
> +pagesz=$(_get_page_size)
> +alloc_unit=$(_get_file_block_size $TEST_DIR)
> +filesz=$(( ( (4 * pagesz) + alloc_unit - 1) / alloc_unit * alloc_unit))
> +
> +echo "Create the original file and a clone"
> +_pwrite_byte 0x61 0 $filesz $testdir/file2.chk >> $seqres.full
> +_pwrite_byte 0x61 0 $filesz $testdir/file1 >> $seqres.full
> +_cp_reflink $testdir/file1 $testdir/file2
> +_cp_reflink $testdir/file1 $testdir/file3
> +
> +_test_cycle_mount
> +
> +cat $testdir/file3 > /dev/null
> +
> +echo "Funshare at least one pagecache page"
> +$XFS_IO_PROG -c "funshare 0 $filesz" $testdir/file2
> +$XFS_IO_PROG -c "funshare 0 $filesz" $testdir/file3
> +_pwrite_byte 0x61 0 $filesz $testdir/file2.chk >> $seqres.full
We don't need to write the bytes again to file2.chk. We already wrote above.
Otherwise looks right to me. Nice testcase indeed.
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> +
> +echo "Check contents"
> +
> +# file2 wasn't cached when it was unshared, but it should match
> +if ! cmp -s $testdir/file2.chk $testdir/file2; then
> + echo "file2.chk does not match file2"
> +
> + echo "file2.chk contents" >> $seqres.full
> + od -tx1 -Ad -c $testdir/file2.chk >> $seqres.full
> + echo "file2 contents" >> $seqres.full
> + od -tx1 -Ad -c $testdir/file2 >> $seqres.full
> + echo "end bad contents" >> $seqres.full
> +fi
> +
> +# file3 was cached when it was unshared, and it should match
> +if ! cmp -s $testdir/file2.chk $testdir/file3; then
> + echo "file2.chk does not match file3"
> +
> + echo "file2.chk contents" >> $seqres.full
> + od -tx1 -Ad -c $testdir/file2.chk >> $seqres.full
> + echo "file3 contents" >> $seqres.full
> + od -tx1 -Ad -c $testdir/file3 >> $seqres.full
> + echo "end bad contents" >> $seqres.full
> +fi
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/xfs/1936.out b/tests/xfs/1936.out
> new file mode 100644
> index 0000000000..c7c820ced5
> --- /dev/null
> +++ b/tests/xfs/1936.out
> @@ -0,0 +1,4 @@
> +QA output created by 1936
> +Create the original file and a clone
> +Funshare at least one pagecache page
> +Check contents
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] iomap: convert iomap_unshare_iter to use large folios
2023-09-18 23:12 ` [PATCH 2/2] iomap: convert iomap_unshare_iter to use large folios Darrick J. Wong
@ 2023-09-19 8:03 ` Ritesh Harjani
0 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2023-09-19 8:03 UTC (permalink / raw)
To: Darrick J. Wong, djwong; +Cc: willy, linux-fsdevel, linux-xfs, willy
"Darrick J. Wong" <djwong@kernel.org> writes:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Convert iomap_unshare_iter to create large folios if possible, since the
> write and zeroing paths already do that. I think this got missed in the
> conversion of the write paths that landed in 6.6-rc1.
>
> Cc: ritesh.list@gmail.com, willy@infradead.org
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/iomap/buffered-io.c | 22 +++++++++++++---------
> 1 file changed, 13 insertions(+), 9 deletions(-)
Looks right to me. Feel free to add -
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
(small nit below)
>
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 0350830fc989..db889bdfd327 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1263,7 +1263,6 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> const struct iomap *srcmap = iomap_iter_srcmap(iter);
> loff_t pos = iter->pos;
> loff_t length = iomap_length(iter);
> - long status = 0;
> loff_t written = 0;
>
> /* don't bother with blocks that are not shared to start with */
> @@ -1274,9 +1273,10 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> return length;
>
> do {
> - unsigned long offset = offset_in_page(pos);
> - unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
> struct folio *folio;
> + int status;
> + size_t offset;
> + size_t bytes = min_t(u64, SIZE_MAX, length);
>
> status = iomap_write_begin(iter, pos, bytes, &folio);
> if (unlikely(status))
> @@ -1284,18 +1284,22 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> if (iter->iomap.flags & IOMAP_F_STALE)
Just noticed that we already have "iomap = &iter->iomap" at the start of this
function. We need not dereference for iomap again here.
(Not related to this patch, but I thought I will mention while we are
still at it)
> break;
>
> - status = iomap_write_end(iter, pos, bytes, bytes, folio);
> - if (WARN_ON_ONCE(status == 0))
> + offset = offset_in_folio(folio, pos);
> + if (bytes > folio_size(folio) - offset)
> + bytes = folio_size(folio) - offset;
> +
> + bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
> + if (WARN_ON_ONCE(bytes == 0))
> return -EIO;
>
> cond_resched();
>
> - pos += status;
> - written += status;
> - length -= status;
> + pos += bytes;
> + written += bytes;
> + length -= bytes;
>
> balance_dirty_pages_ratelimited(iter->inode->i_mapping);
> - } while (length);
> + } while (length > 0);
>
> return written;
> }
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range
2023-09-19 4:42 ` Ritesh Harjani
@ 2023-09-19 9:24 ` Ritesh Harjani
0 siblings, 0 replies; 11+ messages in thread
From: Ritesh Harjani @ 2023-09-19 9:24 UTC (permalink / raw)
To: Darrick J. Wong, djwong; +Cc: willy, linux-fsdevel, linux-xfs, willy
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> "Darrick J. Wong" <djwong@kernel.org> writes:
>
>> From: Darrick J. Wong <djwong@kernel.org>
>>
>> Prior to commit a01b8f225248e, we would always read in the contents of a
>> !uptodate folio prior to writing userspace data into the folio,
>> allocated a folio state object, etc. Ritesh introduced an optimization
>> that skips all of that if the write would cover the entire folio.
>>
>> Unfortunately, the optimization misses the unshare case, where we always
>> have to read in the folio contents since there isn't a data buffer
>> supplied by userspace. This can result in stale kernel memory exposure
>> if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
>> file that isn't already cached.
>>
>> This was caught by observing fstests regressions in the "unshare around"
>> mechanism that is used for unaligned writes to a reflinked realtime
>> volume when the realtime extent size is larger than 1FSB, though I think
>> it applies to any shared file.
>>
>> Cc: ritesh.list@gmail.com, willy@infradead.org
>> Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
>> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
>> ---
>> fs/iomap/buffered-io.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> Thanks for catching this case. Fix for this looks good to me.
> I have verified on my setup. w/o this patch it indeed can cause
> corruption in the unshare case, since we don't read the disk contents
> and we might end up writing garbage from the page cache.
To add more info to my above review. iomap_write_begin() is used by
1. iomap_write_iter()
2. iomap_zero_iter()
3. iomap_unshare_iter()
And looks like out of the 3, iomap_unshare_iter() is the only one which
will not write anything to the folio in the foliocache, & we
definitely need to read the extent in folio cache in iomap_write_begin()
for unsharing.
Hence I believe iomap_unshare_iter() should be the only path to be
fixed, which this patch does by checking IOMAP_UNSHARE flag in
__iomap_write_begin().
>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
>
-ritesh
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2023-09-19 9:25 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-18 23:11 [PATCHSET 0/2] iomap: fix unshare data corruption bug Darrick J. Wong
2023-09-18 23:12 ` [PATCH 1/2] iomap: don't skip reading in !uptodate folios when unsharing a range Darrick J. Wong
2023-09-19 4:42 ` Ritesh Harjani
2023-09-19 9:24 ` Ritesh Harjani
2023-09-19 5:14 ` Ritesh Harjani
2023-09-19 5:24 ` Darrick J. Wong
2023-09-19 5:32 ` Darrick J. Wong
2023-09-18 23:12 ` [PATCH 2/2] iomap: convert iomap_unshare_iter to use large folios Darrick J. Wong
2023-09-19 8:03 ` Ritesh Harjani
2023-09-18 23:19 ` [PATCH 3/2] fstests: test FALLOC_FL_UNSHARE when pagecache is not loaded Darrick J. Wong
2023-09-19 5:51 ` Ritesh Harjani
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).