* [PATCH] ext4: evict inline data when writing to memory map @ 2017-03-13 0:47 Eric Biggers 2017-03-13 4:39 ` Andreas Dilger 2017-03-13 23:33 ` Christoph Hellwig 0 siblings, 2 replies; 6+ messages in thread From: Eric Biggers @ 2017-03-13 0:47 UTC (permalink / raw) To: linux-ext4 Cc: Theodore Ts'o, Andreas Dilger, Nick Alcock, Eric Biggers, stable From: Eric Biggers <ebiggers@google.com> Currently the case of writing via mmap to a file with inline data is not handled. This is maybe a rare case since it requires a writable memory map of a very small file, but it is trivial to trigger with on inline_data filesystem, and it causes the 'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in ext4_writepages() to be hit: mkfs.ext4 -O inline_data /dev/vdb mount /dev/vdb /mnt xfs_io -f /mnt/file \ -c 'pwrite 0 1' \ -c 'mmap -w 0 1m' \ -c 'mwrite 0 1' \ -c 'fsync' kernel BUG at fs/ext4/inode.c:2723! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 task: ffff88003d3a8040 task.stack: ffffc90000300000 RIP: 0010:ext4_writepages+0xc89/0xf8a RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283 RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246 RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5 R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698 R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0 Call Trace: ? _raw_spin_unlock+0x27/0x2a ? kvm_clock_read+0x1e/0x20 do_writepages+0x23/0x2c ? do_writepages+0x23/0x2c __filemap_fdatawrite_range+0x80/0x87 filemap_write_and_wait_range+0x67/0x8c ext4_sync_file+0x20e/0x472 vfs_fsync_range+0x8e/0x9f ? syscall_trace_enter+0x25b/0x2d0 vfs_fsync+0x1c/0x1e do_fsync+0x31/0x4a SyS_fsync+0x10/0x14 do_syscall_64+0x69/0x131 entry_SYSCALL64_slow_path+0x25/0x25 We could try to be smart and keep the inline data in this case, or at least support delayed allocation when allocating the block, but these solutions would be more complicated and don't seem worthwhile given how rare this case seems to be. So just fix the bug by calling ext4_convert_inline_data() when we're asked to make a page writable, so that any inline data gets evicted, with the block allocated immediately. Reported-by: Nick Alcock <nick.alcock@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> --- fs/ext4/inode.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 7385e6a6b6cb..0fc3e25ef531 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5839,6 +5839,11 @@ int ext4_page_mkwrite(struct vm_fault *vmf) file_update_time(vma->vm_file); down_read(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_convert_inline_data(inode); + if (ret) + goto out_ret; + /* Delalloc case is easy... */ if (test_opt(inode->i_sb, DELALLOC) && !ext4_should_journal_data(inode) && -- 2.12.0 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map 2017-03-13 0:47 [PATCH] ext4: evict inline data when writing to memory map Eric Biggers @ 2017-03-13 4:39 ` Andreas Dilger 2017-03-13 23:33 ` Christoph Hellwig 1 sibling, 0 replies; 6+ messages in thread From: Andreas Dilger @ 2017-03-13 4:39 UTC (permalink / raw) To: Eric Biggers Cc: linux-ext4, Theodore Ts'o, Nick Alcock, Eric Biggers, stable [-- Attachment #1: Type: text/plain, Size: 3584 bytes --] > On Mar 12, 2017, at 6:47 PM, Eric Biggers <ebiggers3@gmail.com> wrote: > > From: Eric Biggers <ebiggers@google.com> > > Currently the case of writing via mmap to a file with inline data is not > handled. This is maybe a rare case since it requires a writable memory > map of a very small file, but it is trivial to trigger with on > inline_data filesystem, and it causes the > 'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in > ext4_writepages() to be hit: > > mkfs.ext4 -O inline_data /dev/vdb > mount /dev/vdb /mnt > xfs_io -f /mnt/file \ > -c 'pwrite 0 1' \ > -c 'mmap -w 0 1m' \ > -c 'mwrite 0 1' \ > -c 'fsync' > > kernel BUG at fs/ext4/inode.c:2723! > invalid opcode: 0000 [#1] SMP > CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 > task: ffff88003d3a8040 task.stack: ffffc90000300000 > RIP: 0010:ext4_writepages+0xc89/0xf8a > RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283 > RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc > RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246 > RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5 > R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698 > R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff > FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0 > Call Trace: > ? _raw_spin_unlock+0x27/0x2a > ? kvm_clock_read+0x1e/0x20 > do_writepages+0x23/0x2c > ? do_writepages+0x23/0x2c > __filemap_fdatawrite_range+0x80/0x87 > filemap_write_and_wait_range+0x67/0x8c > ext4_sync_file+0x20e/0x472 > vfs_fsync_range+0x8e/0x9f > ? syscall_trace_enter+0x25b/0x2d0 > vfs_fsync+0x1c/0x1e > do_fsync+0x31/0x4a > SyS_fsync+0x10/0x14 > do_syscall_64+0x69/0x131 > entry_SYSCALL64_slow_path+0x25/0x25 > > We could try to be smart and keep the inline data in this case, or at > least support delayed allocation when allocating the block, but these > solutions would be more complicated and don't seem worthwhile given how > rare this case seems to be. So just fix the bug by calling > ext4_convert_inline_data() when we're asked to make a page writable, so > that any inline data gets evicted, with the block allocated immediately. That is what I was going to suggest also, before I somehow lost this thread in my inbox. Keeping inline file data is a tiny improvement, and any complexity to make this work is not worth it. Reviewed-by: Andreas Dilger <adilger@dilger.ca> > Reported-by: Nick Alcock <nick.alcock@oracle.com> > Cc: stable@vger.kernel.org > Signed-off-by: Eric Biggers <ebiggers@google.com> > --- > fs/ext4/inode.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 7385e6a6b6cb..0fc3e25ef531 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -5839,6 +5839,11 @@ int ext4_page_mkwrite(struct vm_fault *vmf) > file_update_time(vma->vm_file); > > down_read(&EXT4_I(inode)->i_mmap_sem); > + > + ret = ext4_convert_inline_data(inode); > + if (ret) > + goto out_ret; > + > /* Delalloc case is easy... */ > if (test_opt(inode->i_sb, DELALLOC) && > !ext4_should_journal_data(inode) && > -- > 2.12.0 > Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map 2017-03-13 0:47 [PATCH] ext4: evict inline data when writing to memory map Eric Biggers 2017-03-13 4:39 ` Andreas Dilger @ 2017-03-13 23:33 ` Christoph Hellwig 2017-03-14 23:32 ` Eric Biggers 1 sibling, 1 reply; 6+ messages in thread From: Christoph Hellwig @ 2017-03-13 23:33 UTC (permalink / raw) To: Eric Biggers Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Nick Alcock, Eric Biggers, stable > mkfs.ext4 -O inline_data /dev/vdb > mount /dev/vdb /mnt > xfs_io -f /mnt/file \ > -c 'pwrite 0 1' \ > -c 'mmap -w 0 1m' \ > -c 'mwrite 0 1' \ > -c 'fsync' Please add this test case to xfstests. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map 2017-03-13 23:33 ` Christoph Hellwig @ 2017-03-14 23:32 ` Eric Biggers 2017-04-30 4:13 ` Theodore Ts'o 0 siblings, 1 reply; 6+ messages in thread From: Eric Biggers @ 2017-03-14 23:32 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Nick Alcock, Eric Biggers, stable On Mon, Mar 13, 2017 at 04:33:13PM -0700, Christoph Hellwig wrote: > > mkfs.ext4 -O inline_data /dev/vdb > > mount /dev/vdb /mnt > > xfs_io -f /mnt/file \ > > -c 'pwrite 0 1' \ > > -c 'mmap -w 0 1m' \ > > -c 'mwrite 0 1' \ > > -c 'fsync' > > Please add this test case to xfstests. I'm working on this, and I discovered there's still a bug. After the data is written with mwrite, if the filesystem is then mount-cycled, the contents of the file are the old contents rather than the new contents. I believe this is caused by a bug in ext4_convert_inline_data(). Specifically, the new block containing the evicted data is journalled using a buffer_head associated with the block device. This is wrong because it can overwrite data that is later written through non-journalled writeback. I'll look into this more when I have time, but in any case it appears this fix isn't complete. Eric ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map 2017-03-14 23:32 ` Eric Biggers @ 2017-04-30 4:13 ` Theodore Ts'o 2017-05-01 22:10 ` Eric Biggers 0 siblings, 1 reply; 6+ messages in thread From: Theodore Ts'o @ 2017-04-30 4:13 UTC (permalink / raw) To: Eric Biggers Cc: Christoph Hellwig, linux-ext4, Andreas Dilger, Nick Alcock, Eric Biggers, stable On Tue, Mar 14, 2017 at 04:32:39PM -0700, Eric Biggers wrote: > I'm working on this, and I discovered there's still a bug. After the data is > written with mwrite, if the filesystem is then mount-cycled, the contents of the > file are the old contents rather than the new contents. > > I believe this is caused by a bug in ext4_convert_inline_data(). Specifically, > the new block containing the evicted data is journalled using a buffer_head > associated with the block device. This is wrong because it can overwrite data > that is later written through non-journalled writeback. I'll apply this patch for now, since it fixes a userspace-triggerable BUG, but you're right. ext4_convert_inline_data() is busted; it should not be using data journalling, but rather it should check to make sure the page is already in memory (and creating it if necessary), and just write it out to memory. This is a separate bug, and we should fix it, but the first patch is correct and should go in. - Ted ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] ext4: evict inline data when writing to memory map 2017-04-30 4:13 ` Theodore Ts'o @ 2017-05-01 22:10 ` Eric Biggers 0 siblings, 0 replies; 6+ messages in thread From: Eric Biggers @ 2017-05-01 22:10 UTC (permalink / raw) To: Theodore Ts'o Cc: Christoph Hellwig, linux-ext4, Andreas Dilger, Nick Alcock, Eric Biggers, stable On Sun, Apr 30, 2017 at 12:13:57AM -0400, Theodore Ts'o wrote: > On Tue, Mar 14, 2017 at 04:32:39PM -0700, Eric Biggers wrote: > > I'm working on this, and I discovered there's still a bug. After the data is > > written with mwrite, if the filesystem is then mount-cycled, the contents of the > > file are the old contents rather than the new contents. > > > > I believe this is caused by a bug in ext4_convert_inline_data(). Specifically, > > the new block containing the evicted data is journalled using a buffer_head > > associated with the block device. This is wrong because it can overwrite data > > that is later written through non-journalled writeback. > > I'll apply this patch for now, since it fixes a userspace-triggerable > BUG, but you're right. ext4_convert_inline_data() is busted; it > should not be using data journalling, but rather it should check to > make sure the page is already in memory (and creating it if > necessary), and just write it out to memory. > > This is a separate bug, and we should fix it, but the first patch is > correct and should go in. > > - Ted Okay, thanks. A while ago I looked into the second bug a bit, but I never got around to a proper fix. Just in case someone else wants to look into it sooner, this is the xfstest I wrote which reproduces both of these bugs: diff --git a/tests/generic/500 b/tests/generic/500 new file mode 100755 index 00000000..8326791e --- /dev/null +++ b/tests/generic/500 @@ -0,0 +1,82 @@ +#! /bin/bash +# FS QA Test generic/500 +# +# Test writing to a file using a memory map, including a tiny file whose data +# the filesystem may store inline. On ext4 with inline_data, this reproduces +# the crash fixed by: "ext4: evict inline data when writing to memory map". +# +#----------------------------------------------------------------------- +# Copyright (c) 2017 Google, Inc. All Rights Reserved. +# +# Author: Eric Biggers <ebiggers@google.com> +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#----------------------------------------------------------------------- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* + rm -f $testfile +} + +# get standard environment, filters and checks +. ./common/rc + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here +_supported_fs generic +_require_test + +testfile=$TEST_DIR/test-$seq + +file_sizes=( 4 4096 65999) +mwrite_starts=(1 2048 65500) +mwrite_sizes=( 2 32 100) + +for i in ${!file_sizes[@]}; do + + file_size=${file_sizes[$i]} + echo -e "\n=== File size $file_size ===" + + # create a non-sparse file containing $file_size bytes + rm -f $testfile + yes | tr -d '\n' | head -c $file_size > $testfile + + # sync the file to clean its pages, then write to it with mwrite + $XFS_IO_PROG $testfile \ + -c "fsync" \ + -c "mmap -w 0 1m" \ + -c "mwrite ${mwrite_starts[$i]} ${mwrite_sizes[$i]}" + + # cycle the mount and verify the data we wrote got to disk + _test_cycle_mount + hexdump -C $testfile +done + +# success, all done +status=0 +exit diff --git a/tests/generic/500.out b/tests/generic/500.out new file mode 100644 index 00000000..6e9103fc --- /dev/null +++ b/tests/generic/500.out @@ -0,0 +1,24 @@ +QA output created by 500 + +=== File size 4 === +00000000 79 58 58 79 |yXXy| +00000004 + +=== File size 4096 === +00000000 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 |yyyyyyyyyyyyyyyy| +* +00000800 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 |XXXXXXXXXXXXXXXX| +* +00000820 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 |yyyyyyyyyyyyyyyy| +* +00001000 + +=== File size 65999 === +00000000 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 |yyyyyyyyyyyyyyyy| +* +0000ffd0 79 79 79 79 79 79 79 79 79 79 79 79 58 58 58 58 |yyyyyyyyyyyyXXXX| +0000ffe0 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 |XXXXXXXXXXXXXXXX| +* +00010040 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 |yyyyyyyyyyyyyyyy| +* +000101cf diff --git a/tests/generic/group b/tests/generic/group index b3051752..1df3b409 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -431,3 +431,4 @@ 426 auto quick exportfs 427 auto quick aio rw 428 auto quick +500 auto quick -- 2.13.0.rc0.306.g87b477812d-goog ^ permalink raw reply related [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-05-01 22:10 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-03-13 0:47 [PATCH] ext4: evict inline data when writing to memory map Eric Biggers 2017-03-13 4:39 ` Andreas Dilger 2017-03-13 23:33 ` Christoph Hellwig 2017-03-14 23:32 ` Eric Biggers 2017-04-30 4:13 ` Theodore Ts'o 2017-05-01 22:10 ` Eric Biggers
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).