* [PATCH] ext4: evict inline data when writing to memory map
@ 2017-03-13 0:47 Eric Biggers
2017-03-13 4:39 ` Andreas Dilger
2017-03-13 23:33 ` Christoph Hellwig
0 siblings, 2 replies; 6+ messages in thread
From: Eric Biggers @ 2017-03-13 0:47 UTC (permalink / raw)
To: linux-ext4
Cc: Theodore Ts'o, Andreas Dilger, Nick Alcock, Eric Biggers,
stable
From: Eric Biggers <ebiggers@google.com>
Currently the case of writing via mmap to a file with inline data is not
handled. This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger with on
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:
mkfs.ext4 -O inline_data /dev/vdb
mount /dev/vdb /mnt
xfs_io -f /mnt/file \
-c 'pwrite 0 1' \
-c 'mmap -w 0 1m' \
-c 'mwrite 0 1' \
-c 'fsync'
kernel BUG at fs/ext4/inode.c:2723!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
task: ffff88003d3a8040 task.stack: ffffc90000300000
RIP: 0010:ext4_writepages+0xc89/0xf8a
RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
Call Trace:
? _raw_spin_unlock+0x27/0x2a
? kvm_clock_read+0x1e/0x20
do_writepages+0x23/0x2c
? do_writepages+0x23/0x2c
__filemap_fdatawrite_range+0x80/0x87
filemap_write_and_wait_range+0x67/0x8c
ext4_sync_file+0x20e/0x472
vfs_fsync_range+0x8e/0x9f
? syscall_trace_enter+0x25b/0x2d0
vfs_fsync+0x1c/0x1e
do_fsync+0x31/0x4a
SyS_fsync+0x10/0x14
do_syscall_64+0x69/0x131
entry_SYSCALL64_slow_path+0x25/0x25
We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be. So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.
Reported-by: Nick Alcock <nick.alcock@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
fs/ext4/inode.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7385e6a6b6cb..0fc3e25ef531 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5839,6 +5839,11 @@ int ext4_page_mkwrite(struct vm_fault *vmf)
file_update_time(vma->vm_file);
down_read(&EXT4_I(inode)->i_mmap_sem);
+
+ ret = ext4_convert_inline_data(inode);
+ if (ret)
+ goto out_ret;
+
/* Delalloc case is easy... */
if (test_opt(inode->i_sb, DELALLOC) &&
!ext4_should_journal_data(inode) &&
--
2.12.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map
2017-03-13 0:47 [PATCH] ext4: evict inline data when writing to memory map Eric Biggers
@ 2017-03-13 4:39 ` Andreas Dilger
2017-03-13 23:33 ` Christoph Hellwig
1 sibling, 0 replies; 6+ messages in thread
From: Andreas Dilger @ 2017-03-13 4:39 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-ext4, Theodore Ts'o, Nick Alcock, Eric Biggers, stable
[-- Attachment #1: Type: text/plain, Size: 3584 bytes --]
> On Mar 12, 2017, at 6:47 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Currently the case of writing via mmap to a file with inline data is not
> handled. This is maybe a rare case since it requires a writable memory
> map of a very small file, but it is trivial to trigger with on
> inline_data filesystem, and it causes the
> 'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
> ext4_writepages() to be hit:
>
> mkfs.ext4 -O inline_data /dev/vdb
> mount /dev/vdb /mnt
> xfs_io -f /mnt/file \
> -c 'pwrite 0 1' \
> -c 'mmap -w 0 1m' \
> -c 'mwrite 0 1' \
> -c 'fsync'
>
> kernel BUG at fs/ext4/inode.c:2723!
> invalid opcode: 0000 [#1] SMP
> CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
> task: ffff88003d3a8040 task.stack: ffffc90000300000
> RIP: 0010:ext4_writepages+0xc89/0xf8a
> RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
> RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
> RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
> RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
> R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
> R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
> FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
> Call Trace:
> ? _raw_spin_unlock+0x27/0x2a
> ? kvm_clock_read+0x1e/0x20
> do_writepages+0x23/0x2c
> ? do_writepages+0x23/0x2c
> __filemap_fdatawrite_range+0x80/0x87
> filemap_write_and_wait_range+0x67/0x8c
> ext4_sync_file+0x20e/0x472
> vfs_fsync_range+0x8e/0x9f
> ? syscall_trace_enter+0x25b/0x2d0
> vfs_fsync+0x1c/0x1e
> do_fsync+0x31/0x4a
> SyS_fsync+0x10/0x14
> do_syscall_64+0x69/0x131
> entry_SYSCALL64_slow_path+0x25/0x25
>
> We could try to be smart and keep the inline data in this case, or at
> least support delayed allocation when allocating the block, but these
> solutions would be more complicated and don't seem worthwhile given how
> rare this case seems to be. So just fix the bug by calling
> ext4_convert_inline_data() when we're asked to make a page writable, so
> that any inline data gets evicted, with the block allocated immediately.
That is what I was going to suggest also, before I somehow lost this
thread in my inbox. Keeping inline file data is a tiny improvement,
and any complexity to make this work is not worth it.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
> Reported-by: Nick Alcock <nick.alcock@oracle.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
> fs/ext4/inode.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 7385e6a6b6cb..0fc3e25ef531 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5839,6 +5839,11 @@ int ext4_page_mkwrite(struct vm_fault *vmf)
> file_update_time(vma->vm_file);
>
> down_read(&EXT4_I(inode)->i_mmap_sem);
> +
> + ret = ext4_convert_inline_data(inode);
> + if (ret)
> + goto out_ret;
> +
> /* Delalloc case is easy... */
> if (test_opt(inode->i_sb, DELALLOC) &&
> !ext4_should_journal_data(inode) &&
> --
> 2.12.0
>
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map
2017-03-13 0:47 [PATCH] ext4: evict inline data when writing to memory map Eric Biggers
2017-03-13 4:39 ` Andreas Dilger
@ 2017-03-13 23:33 ` Christoph Hellwig
2017-03-14 23:32 ` Eric Biggers
1 sibling, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2017-03-13 23:33 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Nick Alcock,
Eric Biggers, stable
> mkfs.ext4 -O inline_data /dev/vdb
> mount /dev/vdb /mnt
> xfs_io -f /mnt/file \
> -c 'pwrite 0 1' \
> -c 'mmap -w 0 1m' \
> -c 'mwrite 0 1' \
> -c 'fsync'
Please add this test case to xfstests.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map
2017-03-13 23:33 ` Christoph Hellwig
@ 2017-03-14 23:32 ` Eric Biggers
2017-04-30 4:13 ` Theodore Ts'o
0 siblings, 1 reply; 6+ messages in thread
From: Eric Biggers @ 2017-03-14 23:32 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Nick Alcock,
Eric Biggers, stable
On Mon, Mar 13, 2017 at 04:33:13PM -0700, Christoph Hellwig wrote:
> > mkfs.ext4 -O inline_data /dev/vdb
> > mount /dev/vdb /mnt
> > xfs_io -f /mnt/file \
> > -c 'pwrite 0 1' \
> > -c 'mmap -w 0 1m' \
> > -c 'mwrite 0 1' \
> > -c 'fsync'
>
> Please add this test case to xfstests.
I'm working on this, and I discovered there's still a bug. After the data is
written with mwrite, if the filesystem is then mount-cycled, the contents of the
file are the old contents rather than the new contents.
I believe this is caused by a bug in ext4_convert_inline_data(). Specifically,
the new block containing the evicted data is journalled using a buffer_head
associated with the block device. This is wrong because it can overwrite data
that is later written through non-journalled writeback.
I'll look into this more when I have time, but in any case it appears this fix
isn't complete.
Eric
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map
2017-03-14 23:32 ` Eric Biggers
@ 2017-04-30 4:13 ` Theodore Ts'o
2017-05-01 22:10 ` Eric Biggers
0 siblings, 1 reply; 6+ messages in thread
From: Theodore Ts'o @ 2017-04-30 4:13 UTC (permalink / raw)
To: Eric Biggers
Cc: Christoph Hellwig, linux-ext4, Andreas Dilger, Nick Alcock,
Eric Biggers, stable
On Tue, Mar 14, 2017 at 04:32:39PM -0700, Eric Biggers wrote:
> I'm working on this, and I discovered there's still a bug. After the data is
> written with mwrite, if the filesystem is then mount-cycled, the contents of the
> file are the old contents rather than the new contents.
>
> I believe this is caused by a bug in ext4_convert_inline_data(). Specifically,
> the new block containing the evicted data is journalled using a buffer_head
> associated with the block device. This is wrong because it can overwrite data
> that is later written through non-journalled writeback.
I'll apply this patch for now, since it fixes a userspace-triggerable
BUG, but you're right. ext4_convert_inline_data() is busted; it
should not be using data journalling, but rather it should check to
make sure the page is already in memory (and creating it if
necessary), and just write it out to memory.
This is a separate bug, and we should fix it, but the first patch is
correct and should go in.
- Ted
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map
2017-04-30 4:13 ` Theodore Ts'o
@ 2017-05-01 22:10 ` Eric Biggers
0 siblings, 0 replies; 6+ messages in thread
From: Eric Biggers @ 2017-05-01 22:10 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Christoph Hellwig, linux-ext4, Andreas Dilger, Nick Alcock,
Eric Biggers, stable
On Sun, Apr 30, 2017 at 12:13:57AM -0400, Theodore Ts'o wrote:
> On Tue, Mar 14, 2017 at 04:32:39PM -0700, Eric Biggers wrote:
> > I'm working on this, and I discovered there's still a bug. After the data is
> > written with mwrite, if the filesystem is then mount-cycled, the contents of the
> > file are the old contents rather than the new contents.
> >
> > I believe this is caused by a bug in ext4_convert_inline_data(). Specifically,
> > the new block containing the evicted data is journalled using a buffer_head
> > associated with the block device. This is wrong because it can overwrite data
> > that is later written through non-journalled writeback.
>
> I'll apply this patch for now, since it fixes a userspace-triggerable
> BUG, but you're right. ext4_convert_inline_data() is busted; it
> should not be using data journalling, but rather it should check to
> make sure the page is already in memory (and creating it if
> necessary), and just write it out to memory.
>
> This is a separate bug, and we should fix it, but the first patch is
> correct and should go in.
>
> - Ted
Okay, thanks. A while ago I looked into the second bug a bit, but I never got
around to a proper fix. Just in case someone else wants to look into it sooner,
this is the xfstest I wrote which reproduces both of these bugs:
diff --git a/tests/generic/500 b/tests/generic/500
new file mode 100755
index 00000000..8326791e
--- /dev/null
+++ b/tests/generic/500
@@ -0,0 +1,82 @@
+#! /bin/bash
+# FS QA Test generic/500
+#
+# Test writing to a file using a memory map, including a tiny file whose data
+# the filesystem may store inline. On ext4 with inline_data, this reproduces
+# the crash fixed by: "ext4: evict inline data when writing to memory map".
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2017 Google, Inc. All Rights Reserved.
+#
+# Author: Eric Biggers <ebiggers@google.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1 # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+ cd /
+ rm -f $tmp.*
+ rm -f $testfile
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs generic
+_require_test
+
+testfile=$TEST_DIR/test-$seq
+
+file_sizes=( 4 4096 65999)
+mwrite_starts=(1 2048 65500)
+mwrite_sizes=( 2 32 100)
+
+for i in ${!file_sizes[@]}; do
+
+ file_size=${file_sizes[$i]}
+ echo -e "\n=== File size $file_size ==="
+
+ # create a non-sparse file containing $file_size bytes
+ rm -f $testfile
+ yes | tr -d '\n' | head -c $file_size > $testfile
+
+ # sync the file to clean its pages, then write to it with mwrite
+ $XFS_IO_PROG $testfile \
+ -c "fsync" \
+ -c "mmap -w 0 1m" \
+ -c "mwrite ${mwrite_starts[$i]} ${mwrite_sizes[$i]}"
+
+ # cycle the mount and verify the data we wrote got to disk
+ _test_cycle_mount
+ hexdump -C $testfile
+done
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/500.out b/tests/generic/500.out
new file mode 100644
index 00000000..6e9103fc
--- /dev/null
+++ b/tests/generic/500.out
@@ -0,0 +1,24 @@
+QA output created by 500
+
+=== File size 4 ===
+00000000 79 58 58 79 |yXXy|
+00000004
+
+=== File size 4096 ===
+00000000 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 |yyyyyyyyyyyyyyyy|
+*
+00000800 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 |XXXXXXXXXXXXXXXX|
+*
+00000820 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 |yyyyyyyyyyyyyyyy|
+*
+00001000
+
+=== File size 65999 ===
+00000000 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 |yyyyyyyyyyyyyyyy|
+*
+0000ffd0 79 79 79 79 79 79 79 79 79 79 79 79 58 58 58 58 |yyyyyyyyyyyyXXXX|
+0000ffe0 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 |XXXXXXXXXXXXXXXX|
+*
+00010040 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 |yyyyyyyyyyyyyyyy|
+*
+000101cf
diff --git a/tests/generic/group b/tests/generic/group
index b3051752..1df3b409 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -431,3 +431,4 @@
426 auto quick exportfs
427 auto quick aio rw
428 auto quick
+500 auto quick
--
2.13.0.rc0.306.g87b477812d-goog
^ permalink raw reply related [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-05-01 22:10 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-03-13 0:47 [PATCH] ext4: evict inline data when writing to memory map Eric Biggers
2017-03-13 4:39 ` Andreas Dilger
2017-03-13 23:33 ` Christoph Hellwig
2017-03-14 23:32 ` Eric Biggers
2017-04-30 4:13 ` Theodore Ts'o
2017-05-01 22:10 ` Eric Biggers
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).