* kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs @ 2015-09-30 19:57 Roman Lebedev 2015-09-30 19:57 ` [RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory Roman Lebedev 2015-11-06 2:57 ` kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs Jeff Mahoney 0 siblings, 2 replies; 11+ messages in thread From: Roman Lebedev @ 2015-09-30 19:57 UTC (permalink / raw) To: linux-btrfs, linux-unionfs, linux-fsdevel, fstests; +Cc: Roman Lebedev Hello. My / is btrfs. To do some my local stuff more cleanly i wanted to use overlayfs, but it didn't quite work. Simple non-automatic sequence to reproduce the issue: mkdir lower upper work merged mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged vi merged/file :wq Results in vi being killed on exit, and the following trace appears in dmesg: [34304.047841] BUG: unable to handle kernel paging request at 0000000009618e56 [34304.047846] IP: [<ffffffffa01667b6>] btrfs_sync_file+0xa6/0x350 [btrfs] [34304.047864] PGD 0 [34304.047866] Oops: 0002 [#12] SMP [34304.047867] Modules linked in: overlay cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc fglrx(PO) nls_utf8 joydev nls_cp437 vfat fat hid_generic usbhid kvm_amd hid kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi sha256_ssse3 sha256_generic snd_hda_intel snd_hda_codec hmac drbg ansi_cprng aesni_intel snd_hda_core aes_x86_64 mxm_wmi snd_hwdep lrw eeepc_wmi snd_pcm gf128mul asus_wmi sparse_keymap rfkill video snd_timer glue_helper sp5100_tco evdev ablk_helper e1000e ohci_pci pcspkr snd ohci_hcd xhci_pci edac_mce_amd ehci_pci serio_raw xhci_hcd soundcore fam15h_power ehci_hcd cryptd edac_core ptp pps_core usbcore k10temp i2c_piix4 [34304.047893] sg usb_common acpi_cpufreq wmi tpm_infineon button processor shpchp tpm_tis tpm thermal_sys tcp_yeah tcp_vegas it87 hwmon_vid loop parport_pc ppdev lp parport autofs4 crc32c_generic btrfs xor raid6_pq sd_mod crc32c_intel ahci libahci libata scsi_mod [34304.047905] CPU: 4 PID: 13990 Comm: vi Tainted: P D O 4.2.0-1-amd64 #1 Debian 4.2.1-2 [34304.047906] Hardware name: To be filled by O.E.M. To be filled by O.E.M./CROSSHAIR V FORMULA-Z, BIOS 2201 03/23/2015 [34304.047908] task: ffff8803d5f7f2c0 ti: ffff8806a3ec8000 task.ti: ffff8806a3ec8000 [34304.047909] RIP: 0010:[<ffffffffa01667b6>] [<ffffffffa01667b6>] btrfs_sync_file+0xa6/0x350 [btrfs] [34304.047920] RSP: 0018:ffff8806a3ecbe88 EFLAGS: 00010246 [34304.047921] RAX: ffff8803d5f7f2c0 RBX: ffff8807b2d46600 RCX: ffffffff81a6ad00 [34304.047922] RDX: 0000000080000000 RSI: 0000000000000000 RDI: ffff8807c19f8970 [34304.047923] RBP: ffff8807c19f8970 R08: 0000000000000000 R09: 0000000000000001 [34304.047924] R10: 0000000000000000 R11: 0000000000000246 R12: ffff8807c19f88c8 [34304.047925] R13: 0000000000000000 R14: 0000000009618b22 R15: 000055cb20184a70 [34304.047926] FS: 00007f31c5492800(0000) GS:ffff88082fd00000(0000) knlGS:0000000000000000 [34304.047927] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [34304.047928] CR2: 0000000009618e56 CR3: 000000044af44000 CR4: 00000000000406e0 [34304.047929] Stack: [34304.047930] 0000000000000001 7fffffffffffffff ffff880403d5b918 8000000000000000 [34304.047932] 0000000000000000 0000000000000000 000055cb20186d40 ffff8807b2d46600 [34304.047933] 0000000000000004 ffff88044b249000 0000000000000020 ffff8807b2d46600 [34304.047935] Call Trace: [34304.047939] [<ffffffff811e7038>] ? do_fsync+0x38/0x60 [34304.047940] [<ffffffff811e72b0>] ? SyS_fsync+0x10/0x20 [34304.047943] [<ffffffff8154de72>] ? system_call_fast_compare_end+0xc/0x6b [34304.047944] Code: 49 8b 0f 48 85 c9 75 e9 eb b3 48 8b 44 24 08 49 8d ac 24 a8 00 00 00 48 89 ef 4c 29 e8 48 83 c0 01 48 89 44 24 18 e8 3a 59 3e e1 <f0> 41 ff 86 34 03 00 00 49 8b 84 24 70 ff ff ff 48 c1 e8 07 83 [34304.047959] RIP [<ffffffffa01667b6>] btrfs_sync_file+0xa6/0x350 [btrfs] [34304.047970] RSP <ffff8806a3ecbe88> [34304.047970] CR2: 0000000009618e56 [34304.047972] ---[ end trace 414199893a542949 ]--- I was able to create a new fstests test that reproduces my issue, and i'm sending it as follow-up to this message. Roman Lebedev (1): fstests: generic: Test that fsync works on file in overlayfs merged directory tests/generic/111 | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++ tests/generic/111.out | 5 ++++ tests/generic/group | 1 + 3 files changed, 86 insertions(+) create mode 100755 tests/generic/111 create mode 100644 tests/generic/111.out -- 2.6.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory 2015-09-30 19:57 kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs Roman Lebedev @ 2015-09-30 19:57 ` Roman Lebedev 2015-09-30 21:56 ` Dave Chinner 2015-11-06 2:57 ` kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs Jeff Mahoney 1 sibling, 1 reply; 11+ messages in thread From: Roman Lebedev @ 2015-09-30 19:57 UTC (permalink / raw) To: linux-btrfs, linux-unionfs, linux-fsdevel, fstests; +Cc: Roman Lebedev As per overlayfs documentation, any activity on a merged directory for a application that is doing such activity should work exactly as if that would be a normal, non overlayfs-merged directory. That is, e.g. simple fopen-fwrite-fsync-fclose sequence should work just fine. But apparently it does not. Add a simple generic test to check that. As of right now (linux-4.2.1) this test fails at least on btrfs. PS: An alternative (and probably better approach) would be to run fstests test suite with TEST_DIR set to overlayfs work directory. Also, i'm not sure that this test fits here, but it's my best guess. Signed-off-by: Roman Lebedev <lebedev.ri@gmail.com> --- tests/generic/111 | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++ tests/generic/111.out | 5 ++++ tests/generic/group | 1 + 3 files changed, 86 insertions(+) create mode 100755 tests/generic/111 create mode 100644 tests/generic/111.out diff --git a/tests/generic/111 b/tests/generic/111 new file mode 100755 index 0000000..3c2599b --- /dev/null +++ b/tests/generic/111 @@ -0,0 +1,80 @@ +#! /bin/bash +# FS QA Test 111 +# +# Test that fsync works on file in overlayfs merged directory +# +#----------------------------------------------------------------------- +# Copyright (c) 2015 Roman Lebedev. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#----------------------------------------------------------------------- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +lower=$TEST_DIR/lower.$$ +upper=$TEST_DIR/upper.$$ +work=$TEST_DIR/work.$$ +merged=$TEST_DIR/merged.$$ + +_cleanup() +{ + cd / + rm -f $tmp.* + umount $merged + rm -rf $merged + rm -rf $work + rm -rf $upper + rm -rf $lower +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here + +_supported_fs generic +_supported_os IRIX Linux +_require_test + +mkdir $lower + +$XFS_IO_PROG -f -c "pwrite 0 4k" -c "fsync" \ + $lower/file | _filter_xfs_io + +mkdir $upper +mkdir $work +mkdir $merged + +sync + +mount -t overlay overlay -olowerdir=$lower \ + -oupperdir=$upper -oworkdir=$work $merged + +$XFS_IO_PROG -f -c "pwrite 0 4k" -c "fsync" \ + $merged/file | _filter_xfs_io + +# if we are here, then fsync did not crash, so we're good. + +# success, all done +status=0 +exit diff --git a/tests/generic/111.out b/tests/generic/111.out new file mode 100644 index 0000000..36c7fde --- /dev/null +++ b/tests/generic/111.out @@ -0,0 +1,5 @@ +QA output created by 111 +wrote 4096/4096 bytes at offset 0 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +wrote 4096/4096 bytes at offset 0 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) diff --git a/tests/generic/group b/tests/generic/group index 4ae256f..d3516f9 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -112,6 +112,7 @@ 107 auto quick metadata 108 auto quick rw 109 auto metadata dir +111 auto quick 112 rw aio auto quick 113 rw aio auto quick 117 attr auto quick -- 2.6.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory 2015-09-30 19:57 ` [RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory Roman Lebedev @ 2015-09-30 21:56 ` Dave Chinner 2015-09-30 22:07 ` Eric Sandeen 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2015-09-30 21:56 UTC (permalink / raw) To: Roman Lebedev; +Cc: linux-btrfs, linux-unionfs, linux-fsdevel, fstests On Wed, Sep 30, 2015 at 10:57:45PM +0300, Roman Lebedev wrote: > As per overlayfs documentation, any activity on a merged directory > for a application that is doing such activity should work exactly > as if that would be a normal, non overlayfs-merged directory. > > That is, e.g. simple fopen-fwrite-fsync-fclose sequence should > work just fine. We have plenty of tests that do things like that. > But apparently it does not. Add a simple generic test to check that. > As of right now (linux-4.2.1) this test fails at least on btrfs. > > PS: An alternative (and probably better approach) would be to run > fstests test suite with TEST_DIR set to overlayfs work directory. Much better is to run xfstests directly on overlayfs. THere have been some patches to do that posted in the past, but those patches and discussions kinda ended up going nowhere: http://www.mail-archive.com/fstests@vger.kernel.org/msg00474.html Perhaps you'd like to pick this up, and then overlay will by much easier to test and hence likely not to have bugs like this... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory 2015-09-30 21:56 ` Dave Chinner @ 2015-09-30 22:07 ` Eric Sandeen 0 siblings, 0 replies; 11+ messages in thread From: Eric Sandeen @ 2015-09-30 22:07 UTC (permalink / raw) To: Dave Chinner, Roman Lebedev Cc: linux-unionfs, linux-fsdevel, fstests, linux-btrfs On 9/30/15 4:56 PM, Dave Chinner wrote: > On Wed, Sep 30, 2015 at 10:57:45PM +0300, Roman Lebedev wrote: >> As per overlayfs documentation, any activity on a merged directory >> for a application that is doing such activity should work exactly >> as if that would be a normal, non overlayfs-merged directory. >> >> That is, e.g. simple fopen-fwrite-fsync-fclose sequence should >> work just fine. > > We have plenty of tests that do things like that. > >> But apparently it does not. Add a simple generic test to check that. >> As of right now (linux-4.2.1) this test fails at least on btrfs. >> >> PS: An alternative (and probably better approach) would be to run >> fstests test suite with TEST_DIR set to overlayfs work directory. > > Much better is to run xfstests directly on overlayfs. THere have > been some patches to do that posted in the past, but those patches > and discussions kinda ended up going nowhere: > > http://www.mail-archive.com/fstests@vger.kernel.org/msg00474.html > > Perhaps you'd like to pick this up, and then overlay will by much > easier to test and hence likely not to have bugs like this... Yeah, that could still be used for fun, but Zach's POV was that we should just have a specific overlayfs config (dictating paths to over/under/merge/around/through/whatever directories), a special mount_overlayfs helper, etc, ala NFS & CIFS. It may actually be easier than what I proposed. If you want to take a stab at it I'm happy to help, answer questions, etc - I'm not sure when I'll get back to it... -Eric ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs 2015-09-30 19:57 kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs Roman Lebedev 2015-09-30 19:57 ` [RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory Roman Lebedev @ 2015-11-06 2:57 ` Jeff Mahoney 2015-11-06 3:18 ` Al Viro 1 sibling, 1 reply; 11+ messages in thread From: Jeff Mahoney @ 2015-11-06 2:57 UTC (permalink / raw) To: Roman Lebedev, David Howells, Al Viro Cc: linux-btrfs, linux-unionfs, linux-fsdevel, fstests, Filipe Manana [-- Attachment #1: Type: text/plain, Size: 5946 bytes --] On 9/30/15 3:57 PM, Roman Lebedev wrote: > Hello. > > My / is btrfs. > To do some my local stuff more cleanly i wanted to use overlayfs, > but it didn't quite work. > > Simple non-automatic sequence to reproduce the issue: > mkdir lower upper work merged > mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged > vi merged/file > :wq Filipe and I got a chance to look into this today. The crash is due to commit 4bacc9c9234 (overlayfs: Make f_path always point to the overlay and f_inode to the underlay) Incidentally, the test case is as simple as ":> file ; fsync file" after mounting. The short version is that after this commit, we see: file->f_mapping->host = <actual fs inode> file->f_inode = <actual fs inode> file->f_path.dentry->d_inode = <overlayfs inode> So now file_operations callbacks can't assume that file->f_path.dentry belongs to the same file system that implements the callback. More than that, any code that could ultimately get a dentry that comes from an open file can't trust that it's from the same file system. This crash is due to this issue. Unlike xfs and ext2/3/4, we use file->f_path.dentry->d_inode to resolve the inode. Using file_inode() is an easy enough fix here, but we run into trouble later. We have logic in the btrfs fsync() call path (check_parent_dirs_for_sync) that walks back up the dentry chain examining the inode's last transaction and last unlink transaction to determine whether a full transaction commit is required. This obviously doesn't work if we're walking the overlayfs path instead. Regardless of any argument over whether that's doing the right thing, it's a pretty common pattern to assume that file->f_path.dentry comes from the same file system when using a file_operation. Is it intended that that assumption is no longer valid? -Jeff > Results in vi being killed on exit, and the following trace appears in dmesg: > > [34304.047841] BUG: unable to handle kernel paging request at 0000000009618e56 > [34304.047846] IP: [<ffffffffa01667b6>] btrfs_sync_file+0xa6/0x350 [btrfs] > [34304.047864] PGD 0 > [34304.047866] Oops: 0002 [#12] SMP > [34304.047867] Modules linked in: overlay cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc fglrx(PO) nls_utf8 joydev nls_cp437 vfat fat hid_generic usbhid kvm_amd hid kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi sha256_ssse3 sha256_generic snd_hda_intel snd_hda_codec hmac drbg ansi_cprng aesni_intel snd_hda_core aes_x86_64 mxm_wmi snd_hwdep lrw eeepc_wmi snd_pcm gf128mul asus_wmi sparse_keymap rfkill video snd_timer glue_helper sp5100_tco evdev ablk_helper e1000e ohci_pci pcspkr snd ohci_hcd xhci_pci edac_mce_amd ehci_pci serio_raw xhci_hcd soundcore fam15h_power ehci_hcd cryptd edac_core ptp pps_core usbcore k10temp i2c_piix4 > [34304.047893] sg usb_common acpi_cpufreq wmi tpm_infineon button processor shpchp tpm_tis tpm thermal_sys tcp_yeah tcp_vegas it87 hwmon_vid loop parport_pc ppdev lp parport autofs4 crc32c_generic btrfs xor raid6_pq sd_mod crc32c_intel ahci libahci libata scsi_mod > [34304.047905] CPU: 4 PID: 13990 Comm: vi Tainted: P D O 4.2.0-1-amd64 #1 Debian 4.2.1-2 > [34304.047906] Hardware name: To be filled by O.E.M. To be filled by O.E.M./CROSSHAIR V FORMULA-Z, BIOS 2201 03/23/2015 > [34304.047908] task: ffff8803d5f7f2c0 ti: ffff8806a3ec8000 task.ti: ffff8806a3ec8000 > [34304.047909] RIP: 0010:[<ffffffffa01667b6>] [<ffffffffa01667b6>] btrfs_sync_file+0xa6/0x350 [btrfs] > [34304.047920] RSP: 0018:ffff8806a3ecbe88 EFLAGS: 00010246 > [34304.047921] RAX: ffff8803d5f7f2c0 RBX: ffff8807b2d46600 RCX: ffffffff81a6ad00 > [34304.047922] RDX: 0000000080000000 RSI: 0000000000000000 RDI: ffff8807c19f8970 > [34304.047923] RBP: ffff8807c19f8970 R08: 0000000000000000 R09: 0000000000000001 > [34304.047924] R10: 0000000000000000 R11: 0000000000000246 R12: ffff8807c19f88c8 > [34304.047925] R13: 0000000000000000 R14: 0000000009618b22 R15: 000055cb20184a70 > [34304.047926] FS: 00007f31c5492800(0000) GS:ffff88082fd00000(0000) knlGS:0000000000000000 > [34304.047927] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [34304.047928] CR2: 0000000009618e56 CR3: 000000044af44000 CR4: 00000000000406e0 > [34304.047929] Stack: > [34304.047930] 0000000000000001 7fffffffffffffff ffff880403d5b918 8000000000000000 > [34304.047932] 0000000000000000 0000000000000000 000055cb20186d40 ffff8807b2d46600 > [34304.047933] 0000000000000004 ffff88044b249000 0000000000000020 ffff8807b2d46600 > [34304.047935] Call Trace: > [34304.047939] [<ffffffff811e7038>] ? do_fsync+0x38/0x60 > [34304.047940] [<ffffffff811e72b0>] ? SyS_fsync+0x10/0x20 > [34304.047943] [<ffffffff8154de72>] ? system_call_fast_compare_end+0xc/0x6b > [34304.047944] Code: 49 8b 0f 48 85 c9 75 e9 eb b3 48 8b 44 24 08 49 8d ac 24 a8 00 00 00 48 89 ef 4c 29 e8 48 83 c0 01 48 89 44 24 18 e8 3a 59 3e e1 <f0> 41 ff 86 34 03 00 00 49 8b 84 24 70 ff ff ff 48 c1 e8 07 83 > [34304.047959] RIP [<ffffffffa01667b6>] btrfs_sync_file+0xa6/0x350 [btrfs] > [34304.047970] RSP <ffff8806a3ecbe88> > [34304.047970] CR2: 0000000009618e56 > [34304.047972] ---[ end trace 414199893a542949 ]--- > > I was able to create a new fstests test that reproduces my issue, > and i'm sending it as follow-up to this message. > > Roman Lebedev (1): > fstests: generic: Test that fsync works on file in overlayfs merged > directory > > tests/generic/111 | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++ > tests/generic/111.out | 5 ++++ > tests/generic/group | 1 + > 3 files changed, 86 insertions(+) > create mode 100755 tests/generic/111 > create mode 100644 tests/generic/111.out > -- Jeff Mahoney SUSE Labs [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 827 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs 2015-11-06 2:57 ` kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs Jeff Mahoney @ 2015-11-06 3:18 ` Al Viro 2015-11-06 4:03 ` Jeff Mahoney 0 siblings, 1 reply; 11+ messages in thread From: Al Viro @ 2015-11-06 3:18 UTC (permalink / raw) To: Jeff Mahoney Cc: Roman Lebedev, David Howells, linux-btrfs, linux-unionfs, linux-fsdevel, fstests, Filipe Manana On Thu, Nov 05, 2015 at 09:57:35PM -0500, Jeff Mahoney wrote: > So now file_operations callbacks can't assume that file->f_path.dentry > belongs to the same file system that implements the callback. More than > that, any code that could ultimately get a dentry that comes from an > open file can't trust that it's from the same file system. Use file_inode() for inode. > This crash is due to this issue. Unlike xfs and ext2/3/4, we use > file->f_path.dentry->d_inode to resolve the inode. Using file_inode() > is an easy enough fix here, but we run into trouble later. We have > logic in the btrfs fsync() call path (check_parent_dirs_for_sync) that > walks back up the dentry chain examining the inode's last transaction > and last unlink transaction to determine whether a full transaction > commit is required. This obviously doesn't work if we're walking the > overlayfs path instead. Regardless of any argument over whether that's > doing the right thing, it's a pretty common pattern to assume that > file->f_path.dentry comes from the same file system when using a > file_operation. Is it intended that that assumption is no longer valid? It's actually rare, and your example is a perfect demonstration of the reasons why it is so rare. What's to protect btrfs_log_dentry_safe() from racing with rename(2)? Sure, you do dget_parent(). Which protects you from having one-time parent dentry freed under you. What it doesn't do is making any promises about its relationship with your file. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs 2015-11-06 3:18 ` Al Viro @ 2015-11-06 4:03 ` Jeff Mahoney 2015-11-06 14:46 ` Jeff Mahoney 2016-03-24 15:20 ` Al Viro 0 siblings, 2 replies; 11+ messages in thread From: Jeff Mahoney @ 2015-11-06 4:03 UTC (permalink / raw) To: Al Viro Cc: Roman Lebedev, David Howells, linux-btrfs, linux-unionfs, linux-fsdevel, fstests, Filipe Manana [-- Attachment #1: Type: text/plain, Size: 2837 bytes --] On 11/5/15 10:18 PM, Al Viro wrote: > On Thu, Nov 05, 2015 at 09:57:35PM -0500, Jeff Mahoney wrote: > >> So now file_operations callbacks can't assume that file->f_path.dentry >> belongs to the same file system that implements the callback. More than >> that, any code that could ultimately get a dentry that comes from an >> open file can't trust that it's from the same file system. > > Use file_inode() for inode. > >> This crash is due to this issue. Unlike xfs and ext2/3/4, we use >> file->f_path.dentry->d_inode to resolve the inode. Using file_inode() >> is an easy enough fix here, but we run into trouble later. We have >> logic in the btrfs fsync() call path (check_parent_dirs_for_sync) that >> walks back up the dentry chain examining the inode's last transaction >> and last unlink transaction to determine whether a full transaction >> commit is required. This obviously doesn't work if we're walking the >> overlayfs path instead. Regardless of any argument over whether that's >> doing the right thing, it's a pretty common pattern to assume that >> file->f_path.dentry comes from the same file system when using a >> file_operation. Is it intended that that assumption is no longer valid? > > It's actually rare, and your example is a perfect demonstration of the > reasons why it is so rare. What's to protect btrfs_log_dentry_safe() > from racing with rename(2)? Sure, you do dget_parent(). Which protects > you from having one-time parent dentry freed under you. What it doesn't > do is making any promises about its relationship with your file. I suppose the irony here is that, AFAIK, that code is to ensure a file doesn't get lost between transactions due to rename. Isn't the file->f_path.dentry relationship stable otherwise, though? The name might change and the parent might change but the dentry that the file points to won't. I did find a few other places where that assumption happens without any questionable traversals. Sure, all three are in file systems unlikely to be used with overlayfs. ocfs2_prepare_inode_for_write uses file->f_path.dentry for should_remove_suid (due to needing to do it early since cluster locking is unknown in setattr, according to the commit). Having should_remove_suid operate on an inode would solve that easily. fat_ioctl_set_attributes uses it to call fat_setattr, but that only uses the inode and could have the inode_operation use a wrapper. cifs_new_fileinfo keeps a reference to the dentry but it seems to be used mostly to access the inode except for the nasty-looking call to build_path_from_dentry in cifs_reopen_file, which I won't be touching. That does look like a questionable traversal, especially with the "we can't take the rename lock here" comment. -Jeff -- Jeff Mahoney [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 827 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs 2015-11-06 4:03 ` Jeff Mahoney @ 2015-11-06 14:46 ` Jeff Mahoney 2016-03-24 15:20 ` Al Viro 1 sibling, 0 replies; 11+ messages in thread From: Jeff Mahoney @ 2015-11-06 14:46 UTC (permalink / raw) To: Jeff Mahoney, Al Viro Cc: Roman Lebedev, David Howells, linux-btrfs, linux-unionfs, linux-fsdevel, fstests, Filipe Manana -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/5/15 11:03 PM, Jeff Mahoney wrote: > On 11/5/15 10:18 PM, Al Viro wrote: >> On Thu, Nov 05, 2015 at 09:57:35PM -0500, Jeff Mahoney wrote: >> >>> So now file_operations callbacks can't assume that >>> file->f_path.dentry belongs to the same file system that >>> implements the callback. More than that, any code that could >>> ultimately get a dentry that comes from an open file can't >>> trust that it's from the same file system. >> >> Use file_inode() for inode. >> >>> This crash is due to this issue. Unlike xfs and ext2/3/4, we >>> use file->f_path.dentry->d_inode to resolve the inode. Using >>> file_inode() is an easy enough fix here, but we run into >>> trouble later. We have logic in the btrfs fsync() call path >>> (check_parent_dirs_for_sync) that walks back up the dentry >>> chain examining the inode's last transaction and last unlink >>> transaction to determine whether a full transaction commit is >>> required. This obviously doesn't work if we're walking the >>> overlayfs path instead. Regardless of any argument over >>> whether that's doing the right thing, it's a pretty common >>> pattern to assume that file->f_path.dentry comes from the same >>> file system when using a file_operation. Is it intended that >>> that assumption is no longer valid? >> >> It's actually rare, and your example is a perfect demonstration >> of the reasons why it is so rare. What's to protect >> btrfs_log_dentry_safe() from racing with rename(2)? Sure, you do >> dget_parent(). Which protects you from having one-time parent >> dentry freed under you. What it doesn't do is making any >> promises about its relationship with your file. > > I suppose the irony here is that, AFAIK, that code is to ensure a > file doesn't get lost between transactions due to rename. > > Isn't the file->f_path.dentry relationship stable otherwise, > though? The name might change and the parent might change but the > dentry that the file points to won't. And, taking it a bit further, it's impossible for a rename to end up with a file pointing into a different file system. So this btrfs case might misbehave, but it would never crash like we're seeing here. - -Jeff > I did find a few other places where that assumption happens without > any questionable traversals. Sure, all three are in file systems > unlikely to be used with overlayfs. > > ocfs2_prepare_inode_for_write uses file->f_path.dentry for > should_remove_suid (due to needing to do it early since cluster > locking is unknown in setattr, according to the commit). Having > should_remove_suid operate on an inode would solve that easily. > > fat_ioctl_set_attributes uses it to call fat_setattr, but that only > uses the inode and could have the inode_operation use a wrapper. > > cifs_new_fileinfo keeps a reference to the dentry but it seems to > be used mostly to access the inode except for the nasty-looking > call to build_path_from_dentry in cifs_reopen_file, which I won't > be touching. That does look like a questionable traversal, > especially with the "we can't take the rename lock here" comment. > > -Jeff > - -- Jeff Mahoney SUSE Labs -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.19 (Darwin) Comment: GPGTools - http://gpgtools.org iQIcBAEBAgAGBQJWPL01AAoJEB57S2MheeWy1+IP/RfWvnpaXOCA2HJhzyR0attX D+SYah7Dc5OBicN0lghIg5ka0U2J1+l051yOOkT2sDRE23Lyu9/wmxhQVerx7hN4 js/ZGwbmGfO9I3kXbAKzGdsAscVAgvTcEp8gYXWFCzYIRYyDKEJM8xrQMM+Z2mIy AMu6lzMRFGD7q2KIITZzML0cozgT0TREE9D9+IrT3ywxAegIPATxwFp3pDRDwl4F zb2QjJjJvw/z0LEAlatwV1H7AAIZxAVrMWVywlsrdvg+pwA508JvkN7Wk06dAcJ2 YB+ddVIQsYyJuBYMA+IQsCM9q7LjIVPskoqi8BMxS2MvYObu6Z0zU+Iwcp0RnVa+ FiKt3gfRR0yOAuulzg9wKylYasIC8kfKD1POaAmOBgLErhDFtXIsJSXuw5HgY/VR LsSAbyOMfWg+YvreswQ7d7VMnK0wIJuRnludWVbQIn8y+4RKbqj2jiYIlZ7FMeUu rSSPlNt0GKISaSM3iSBrR2qN8PLvVyxdXpZSCl5itfqNea6KAwL+Kj61x0rNZhhF GkQlwsxJxYEue1eqqZU8iEkd0y93yPo3puhH7yHtT+dJW0NahjKiJF6TAGHF3C4a dEatwl6FSvDJA1aXvHG2dMfbtIiywKM1LJ4VAP1TOsbL3sqG3i4Orh7cN4bl2tYv /D9wgUU17XXdK76ysaxM =iP2W -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs 2015-11-06 4:03 ` Jeff Mahoney 2015-11-06 14:46 ` Jeff Mahoney @ 2016-03-24 15:20 ` Al Viro 2016-03-24 15:25 ` Al Viro 2016-03-24 15:31 ` Jeff Mahoney 1 sibling, 2 replies; 11+ messages in thread From: Al Viro @ 2016-03-24 15:20 UTC (permalink / raw) To: Jeff Mahoney Cc: Roman Lebedev, David Howells, linux-btrfs, linux-unionfs, linux-fsdevel, fstests, Filipe Manana On Thu, Nov 05, 2015 at 11:03:58PM -0500, Jeff Mahoney wrote: > I suppose the irony here is that, AFAIK, that code is to ensure a file > doesn't get lost between transactions due to rename. > > Isn't the file->f_path.dentry relationship stable otherwise, though? The > name might change and the parent might change but the dentry that the > file points to won't. Sure, it is stable. But that doesn't guarantee anything about the ancestors. btrfs_log_inode_parent() is a mess. Look at that loop you've got there: while (1) { if (!parent || d_really_is_negative(parent) || sb != d_inode(parent)->i_sb) break; Really? NULL from dget_parent()? A negative from dget_parent()? Sodding superblock changing from transition to parent? inode = d_inode(parent); if (root != BTRFS_I(inode)->root) break; Umm... Something like crossing a snapshot boundary? if (BTRFS_I(inode)->generation > last_committed) { ret = btrfs_log_inode(trans, root, inode, LOG_INODE_EXISTS, 0, LLONG_MAX, ctx); if (ret) goto end_trans; } if (IS_ROOT(parent)) break; parent = dget_parent(parent); dput(old_parent); old_parent = parent; } Note that there's not a damn thing to guarantee that those directories have anything to do with your file - race with rename(2) easily could've sent you chasing the wrong chain of ancestors. Incidentally, what will happen if you do that fsync() to an unlinked file? It still has ->d_parent pointing to the what used to be its parent (quite possibly itself already rmdir'ed and pinned down by that reference; inode is still busy, as if we'd done open and rmdir). Is that what those checks are attempting to catch? And what happens if rmdir happens between the check and btrfs_log_inode()? The thing is, playing with ancestors of opened file is very easy to get wrong, regardless of overlayfs involvement. And this code is anything but clear. I don't know btrfs guts enough to be able to tell how much can actually break (as opposed to adding wrong inodes to transaction), but I would really suggest taking a good look at what's going on there... > I did find a few other places where that assumption happens without any > questionable traversals. Sure, all three are in file systems unlikely > to be used with overlayfs. > > ocfs2_prepare_inode_for_write uses file->f_path.dentry for > should_remove_suid (due to needing to do it early since cluster locking > is unknown in setattr, according to the commit). Having > should_remove_suid operate on an inode would solve that easily. Can't do - there are filesystems that _need_ dentry for ->setattr(). > cifs_new_fileinfo keeps a reference to the dentry but it seems to be > used mostly to access the inode except for the nasty-looking call to > build_path_from_dentry in cifs_reopen_file, which I won't be touching. > That does look like a questionable traversal, especially with the "we > can't take the rename lock here" comment. cifs uses fs-root-relative pathnames in protocol; nothing to do there. Where do you see that comment, BTW? It uses read_seqbegin/read_seqretry on rename_lock there... ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs 2016-03-24 15:20 ` Al Viro @ 2016-03-24 15:25 ` Al Viro 2016-03-24 15:31 ` Jeff Mahoney 1 sibling, 0 replies; 11+ messages in thread From: Al Viro @ 2016-03-24 15:25 UTC (permalink / raw) To: Jeff Mahoney Cc: Roman Lebedev, David Howells, linux-btrfs, linux-unionfs, linux-fsdevel, fstests, Filipe Manana On Thu, Mar 24, 2016 at 03:20:43PM +0000, Al Viro wrote: > > ocfs2_prepare_inode_for_write uses file->f_path.dentry for > > should_remove_suid (due to needing to do it early since cluster locking > > is unknown in setattr, according to the commit). Having > > should_remove_suid operate on an inode would solve that easily. > > Can't do - there are filesystems that _need_ dentry for ->setattr(). Grr... Sorry, misread what you'd written. should_remove_suid() ought to be switched to inode, indeed. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs 2016-03-24 15:20 ` Al Viro 2016-03-24 15:25 ` Al Viro @ 2016-03-24 15:31 ` Jeff Mahoney 1 sibling, 0 replies; 11+ messages in thread From: Jeff Mahoney @ 2016-03-24 15:31 UTC (permalink / raw) To: Al Viro Cc: Roman Lebedev, David Howells, linux-btrfs, linux-unionfs, linux-fsdevel, fstests, Filipe Manana [-- Attachment #1.1: Type: text/plain, Size: 4184 bytes --] On 3/24/16 11:20 AM, Al Viro wrote: > On Thu, Nov 05, 2015 at 11:03:58PM -0500, Jeff Mahoney wrote: > >> I suppose the irony here is that, AFAIK, that code is to ensure a file >> doesn't get lost between transactions due to rename. >> >> Isn't the file->f_path.dentry relationship stable otherwise, though? The >> name might change and the parent might change but the dentry that the >> file points to won't. > > Sure, it is stable. But that doesn't guarantee anything about the > ancestors. > > btrfs_log_inode_parent() is a mess. Look at that loop you've got there: > > while (1) { > if (!parent || d_really_is_negative(parent) || sb != d_inode(parent)->i_sb) > break; > > Really? NULL from dget_parent()? A negative from dget_parent()? Sodding > superblock changing from transition to parent? > > inode = d_inode(parent); > if (root != BTRFS_I(inode)->root) > break; > > Umm... Something like crossing a snapshot boundary? > > if (BTRFS_I(inode)->generation > last_committed) { > ret = btrfs_log_inode(trans, root, inode, > LOG_INODE_EXISTS, > 0, LLONG_MAX, ctx); > if (ret) > goto end_trans; > } > if (IS_ROOT(parent)) > break; > > parent = dget_parent(parent); > dput(old_parent); > old_parent = parent; > } > > Note that there's not a damn thing to guarantee that those directories have > anything to do with your file - race with rename(2) easily could've sent you > chasing the wrong chain of ancestors. Incidentally, what will happen if > you do that fsync() to an unlinked file? It still has ->d_parent pointing > to the what used to be its parent (quite possibly itself already rmdir'ed and > pinned down by that reference; inode is still busy, as if we'd done open and > rmdir). Is that what those checks are attempting to catch? And what happens > if rmdir happens between the check and btrfs_log_inode()? > > The thing is, playing with ancestors of opened file is very easy to get > wrong, regardless of overlayfs involvement. And this code is anything > but clear. I don't know btrfs guts enough to be able to tell how much > can actually break (as opposed to adding wrong inodes to transaction), > but I would really suggest taking a good look at what's going on there... Yeah, absolutely. The file_dentry() patch that you and Miklos worked out the other day fixes the fsync crashes for us when we use it in btrfs_file_sync but you're 100% correct in your opinion of this code. >> I did find a few other places where that assumption happens without any >> questionable traversals. Sure, all three are in file systems unlikely >> to be used with overlayfs. >> >> ocfs2_prepare_inode_for_write uses file->f_path.dentry for >> should_remove_suid (due to needing to do it early since cluster locking >> is unknown in setattr, according to the commit). Having >> should_remove_suid operate on an inode would solve that easily. > > Can't do - there are filesystems that _need_ dentry for ->setattr(). > > > Grr... Sorry, misread what you'd written. should_remove_suid() > ought to be switched to inode, indeed. Ok, I'll write that up. >> cifs_new_fileinfo keeps a reference to the dentry but it seems to be >> used mostly to access the inode except for the nasty-looking call to >> build_path_from_dentry in cifs_reopen_file, which I won't be touching. >> That does look like a questionable traversal, especially with the "we >> can't take the rename lock here" comment. > > cifs uses fs-root-relative pathnames in protocol; nothing to do there. > Where do you see that comment, BTW? It uses read_seqbegin/read_seqretry > on rename_lock there... I must've been looking at old code. I don't see it now either. -Jeff -- Jeff Mahoney [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 881 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2016-03-24 15:32 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-09-30 19:57 kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs Roman Lebedev 2015-09-30 19:57 ` [RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory Roman Lebedev 2015-09-30 21:56 ` Dave Chinner 2015-09-30 22:07 ` Eric Sandeen 2015-11-06 2:57 ` kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs Jeff Mahoney 2015-11-06 3:18 ` Al Viro 2015-11-06 4:03 ` Jeff Mahoney 2015-11-06 14:46 ` Jeff Mahoney 2016-03-24 15:20 ` Al Viro 2016-03-24 15:25 ` Al Viro 2016-03-24 15:31 ` Jeff Mahoney
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).