* kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d @ 2013-09-10 17:14 Josh Boyer 2013-09-10 17:33 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Josh Boyer @ 2013-09-10 17:14 UTC (permalink / raw) To: Linus Torvalds, Al Viro, Waiman Long Cc: Linux-Kernel@Vger. Kernel. Org, moneta.mace Hi All, We've had a user report a backtrace from hitting the BUG_ON(!ret->d_lockref.count) added with the lockref infrastructure (commit 98474236f72) on rawhide today[1]. I've grabbed the backtrace below. The user has btrfs, NFS, and sshfs in usage with this oops. I've not seen anything similar, but I could have missed it. Does this look familiar to anyone? josh [1] https://bugzilla.redhat.com/show_bug.cgi?id=1006472 kernel: [ 765.817482] ------------[ cut here ]------------ kernel: [ 765.818426] kernel BUG at fs/dcache.c:648! kernel: [ 765.818426] invalid opcode: 0000 [#1] SMP kernel: [ 765.818426] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache fuse rfcomm bnep ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack w83627ehf hwmon_vid snd_hda_codec_realtek snd_hda_intel snd_hda_codec coretemp kvm_intel snd_hwdep kvm snd_seq snd_seq_device snd_pcm iTCO_wdt iTCO_vendor_support microcode serio_raw i2c_i801 lpc_ich mfd_core btusb bluetooth r8169 rfkill mii snd_page_alloc shpchp snd_timer snd soundcore acpi_cpufreq nfsd auth_rpcgss uinput nfs_acl lockd sunrpc binfmt_misc ata_generic pata_acpi btrfs libcrc32c xor raid1 raid0 raid6_pq i915 firewire_ohci firewire_core crc_itu_t video pata_it8213 i2c_algo_bit drm_kms_helper drm i2c_dev i2c_core kernel: [ 765.818426] CPU: 0 PID: 13007 Comm: gkrellm Not tainted 3.12.0-0.rc0.git16.2.fc21.x86_64 #1 kernel: [ 765.818426] Hardware name: Supermicro C2SEA/C2SEA, BIOS 1.0c 03/11/2009 kernel: [ 765.818426] task: ffff8801a1ff6c80 ti: ffff8801cb902000 task.ti: ffff8801cb902000 kernel: [ 765.818426] RIP: 0010:[<ffffffff811c03cd>] [<ffffffff811c03cd>] dget_parent+0x6d/0x80 kernel: [ 765.818426] RSP: 0018:ffff8801cb903c38 EFLAGS: 00010246 kernel: [ 765.818426] RAX: 0000000000000000 RBX: ffff88022baca600 RCX: 00000000cced6f6f kernel: [ 765.818426] RDX: 000000000000006f RSI: 0000000000000000 RDI: ffff88022baca658 kernel: [ 765.818426] RBP: ffff8801cb903c50 R08: 8080808080808080 R09: fefefefefefefeff kernel: [ 765.818426] R10: 2f2f2f2f2f2f2f2f R11: 722e7172742e2d2d R12: ffff88022bb2bd80 kernel: [ 765.818426] R13: ffff88022baca658 R14: ffff8801a1ff6c80 R15: 0000000000000002 kernel: [ 765.818426] FS: 00007fbda2e39a00(0000) GS:ffff880237c00000(0000) knlGS:0000000000000000 kernel: [ 765.818426] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: [ 765.818426] CR2: 000000000042f860 CR3: 00000001a1d26000 CR4: 00000000000007f0 kernel: [ 765.818426] Stack: kernel: [ 765.818426] ffff8801cb903de0 ffff88022bb2bd80 0000000000000000 ffff8801cb903c70 kernel: [ 765.818426] ffffffff811b5508 ffff8801cb903de0 ffff88023761b003 ffff8801cb903d10 kernel: [ 765.818426] ffffffff811b6225 0000000000000000 000201da00000000 ffff8801a1ff6c80 kernel: [ 765.818426] Call Trace: kernel: [ 765.818426] [<ffffffff811b5508>] handle_dots+0x1c8/0x290 kernel: [ 765.818426] [<ffffffff811b6225>] link_path_walk+0x335/0x8a0 kernel: [ 765.818426] [<ffffffff8113ed21>] ? read_cache_page+0x21/0x30 kernel: [ 765.818426] [<ffffffff811b6bb4>] path_lookupat+0x334/0x7b0 kernel: [ 765.818426] [<ffffffff811b705b>] filename_lookup+0x2b/0xd0 kernel: [ 765.818426] [<ffffffff811baa64>] user_path_at_empty+0x54/0x90 kernel: [ 765.818426] [<ffffffff811baab1>] user_path_at+0x11/0x20 kernel: [ 765.818426] [<ffffffff811af2d0>] vfs_fstatat+0x50/0xa0 kernel: [ 765.818426] [<ffffffff811af4cf>] SYSC_newstat+0x1f/0x40 kernel: [ 765.818426] [<ffffffff810e7cc6>] ? __audit_syscall_exit+0x1f6/0x2a0 kernel: [ 765.818426] [<ffffffff811af91e>] SyS_newstat+0xe/0x10 kernel: [ 765.818426] [<ffffffff81667a19>] system_call_fastpath+0x16/0x1b kernel: [ 765.818426] Code: 18 4c 8d 6b 58 4c 89 ef e8 b1 ef 49 00 49 3b 5c 24 18 75 19 8b 43 5c 85 c0 74 10 83 c0 01 89 43 5c 41 80 45 00 01 48 89 d8 eb c2 <0f> 0b 80 43 58 01 eb c9 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 kernel: [ 765.818426] RIP [<ffffffff811c03cd>] dget_parent+0x6d/0x80 kernel: [ 765.818426] RSP <ffff8801cb903c38> kernel: [ 765.929977] ---[ end trace 384aa993d7ac57c0 ]--- ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 17:14 kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d Josh Boyer @ 2013-09-10 17:33 ` Linus Torvalds 2013-09-10 17:39 ` Josh Boyer 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2013-09-10 17:33 UTC (permalink / raw) To: Josh Boyer Cc: Al Viro, Waiman Long, Linux-Kernel@Vger. Kernel. Org, moneta.mace On Tue, Sep 10, 2013 at 10:14 AM, Josh Boyer <jwboyer@fedoraproject.org> wrote: > > We've had a user report a backtrace from hitting the > BUG_ON(!ret->d_lockref.count) added with the lockref infrastructure > (commit 98474236f72) on rawhide today[1]. I've grabbed the backtrace > below. The user has btrfs, NFS, and sshfs in usage with this oops. > > I've not seen anything similar, but I could have missed it. Does this > look familiar to anyone? Nope. And the dget_parent() case itself hasn't even changed - that BUG_ON() wasn't really added by the lockref code, it's just a search-and-replace change of a BUG_ON(!d_count) to BUG_ON(!d_lockref.count). The BUG_ON() existed before. That whole "dget_parent()" thing is also in the _simple_ case (not RCU mode), and the BUG_ON is for when the dentry is properly locked, so that's all "safe" code. The refcount must have gotten corrupted earlier. Do you have the mainline git ID of that rawhide kernel? Because there *was* a real bug in d_rcu_to_refcount. I don't see how it could trigger that particular issue, but it could trigger scheduling while in the rcu-protected region and that in turn could result in odd things down the line, so.. That particular bug exists between commits 15570086b590 ("vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()") that introduced it, and e5c832d55588 ("vfs: fix dentry RCU to refcounting possibly sleeping dput()") that should have fixed it. But I don't know what mainline kernel that "kernel-3.12.0-0.rc0.git16.2.fc21.x86_64" is based on. I'm sure that information exists somewhere.. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 17:33 ` Linus Torvalds @ 2013-09-10 17:39 ` Josh Boyer 2013-09-10 17:47 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Josh Boyer @ 2013-09-10 17:39 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Waiman Long, Linux-Kernel@Vger. Kernel. Org, moneta.mace On Tue, Sep 10, 2013 at 1:33 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Sep 10, 2013 at 10:14 AM, Josh Boyer <jwboyer@fedoraproject.org> wrote: >> >> We've had a user report a backtrace from hitting the >> BUG_ON(!ret->d_lockref.count) added with the lockref infrastructure >> (commit 98474236f72) on rawhide today[1]. I've grabbed the backtrace >> below. The user has btrfs, NFS, and sshfs in usage with this oops. >> >> I've not seen anything similar, but I could have missed it. Does this >> look familiar to anyone? > > Nope. And the dget_parent() case itself hasn't even changed - that > BUG_ON() wasn't really added by the lockref code, it's just a > search-and-replace change of a BUG_ON(!d_count) to > BUG_ON(!d_lockref.count). The BUG_ON() existed before. > > That whole "dget_parent()" thing is also in the _simple_ case (not RCU > mode), and the BUG_ON is for when the dentry is properly locked, so > that's all "safe" code. The refcount must have gotten corrupted > earlier. > > Do you have the mainline git ID of that rawhide kernel? Because there > *was* a real bug in d_rcu_to_refcount. I don't see how it could > trigger that particular issue, but it could trigger scheduling while > in the rcu-protected region and that in turn could result in odd > things down the line, so.. > > That particular bug exists between commits 15570086b590 ("vfs: > reimplement d_rcu_to_refcount() using lockref_get_or_lock()") that > introduced it, and e5c832d55588 ("vfs: fix dentry RCU to refcounting > possibly sleeping dput()") that should have fixed it. But I don't know > what mainline kernel that "kernel-3.12.0-0.rc0.git16.2.fc21.x86_64" is > based on. I'm sure that information exists somewhere.. The subject says v3.11-7890-ge5c832d, which is the git-describe output of the mainline kernel for that Fedora build. Sorry, I should have made that clearer. So according to that, it should be based on the actual commit you identified as the fix. I have your latest tree as of this morning (v3.11-8716-g26b0332) building right now and have asked Moneta to test it. josh ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 17:39 ` Josh Boyer @ 2013-09-10 17:47 ` Linus Torvalds 2013-09-10 18:25 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2013-09-10 17:47 UTC (permalink / raw) To: Josh Boyer Cc: Al Viro, Waiman Long, Linux-Kernel@Vger. Kernel. Org, moneta.mace On Tue, Sep 10, 2013 at 10:39 AM, Josh Boyer <jwboyer@gmail.com> wrote: > > The subject says v3.11-7890-ge5c832d, which is the git-describe output > of the mainline kernel for that Fedora build. Duh. I just read the bugzilla and the oops ;) But yes, e5c832d is obviously the "fixed" kernel. Let me think about this. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 17:47 ` Linus Torvalds @ 2013-09-10 18:25 ` Linus Torvalds [not found] ` <CAMfhy93EoDohCrV+ykj6b+R9QSCTdiPtvSpJpe3SH=ot_1LWMg@mail.gmail.com> 2013-09-10 18:43 ` Al Viro 0 siblings, 2 replies; 16+ messages in thread From: Linus Torvalds @ 2013-09-10 18:25 UTC (permalink / raw) To: Josh Boyer Cc: Al Viro, Waiman Long, Linux-Kernel@Vger. Kernel. Org, moneta.mace [-- Attachment #1: Type: text/plain, Size: 1387 bytes --] On Tue, Sep 10, 2013 at 10:47 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But yes, e5c832d is obviously the "fixed" kernel. Let me think about this. Ok, I think I found it. I missed that "terminate_walk()" for the RCU case does this: nd->flags &= ~LOOKUP_RCU; if (!(nd->flags & LOOKUP_ROOT)) nd->root.mnt = NULL; unlock_rcu_walk(); and my unlazy_walk() essentially terminated the walk _without_ clearing that nd->root.mnt thing (it did clear the LOOKUP_RCU bit and unlock_rcy_walk(). So then later, we'd end up doing an extra path_put(). Explaining a zero d_lockref.count. The whole damn root.mnt behavior with !LOOKUP_ROOT is a mystery and needs more comments. But the attached trivial patch should do the missing portion of terminate_walk(). Al, can you walk us through the rules for what "root.mnt == NULL" really means? It's basically used as a flag for whether we've gotten the root pointer or not. But it's pretty damn esoteric. Now I'm starting to wonder how come _I_ never saw any issues. Maybe it ends up underflowing so quickly that most people just see a big negative number.. Patch is entirely untested. Not that my testing apparently has been much good. Moneta, are you comfortable compiling a test-kernel, or does this need to become a rawhide package? Linus [-- Attachment #2: patch.diff --] [-- Type: application/octet-stream, Size: 442 bytes --] fs/namei.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/namei.c b/fs/namei.c index 56e4f4d537d0..5e6aaadc1dcd 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -574,9 +574,12 @@ unlock_and_drop_dentry: drop_dentry: unlock_rcu_walk(); dput(dentry); - return -ECHILD; + goto drop_root_mnt; out: unlock_rcu_walk(); +drop_root_mnt: + if (!(nd->flags & LOOKUP_ROOT)) + nd->root.mnt = NULL; return -ECHILD; } ^ permalink raw reply related [flat|nested] 16+ messages in thread
[parent not found: <CAMfhy93EoDohCrV+ykj6b+R9QSCTdiPtvSpJpe3SH=ot_1LWMg@mail.gmail.com>]
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d [not found] ` <CAMfhy93EoDohCrV+ykj6b+R9QSCTdiPtvSpJpe3SH=ot_1LWMg@mail.gmail.com> @ 2013-09-10 18:32 ` Linus Torvalds [not found] ` <CAMfhy92u1GFvTrO+8SNp8hoEBEHcd7Z-3mw+79rv8+2yYqa+ow@mail.gmail.com> 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2013-09-10 18:32 UTC (permalink / raw) To: Mace Moneta Cc: Josh Boyer, Al Viro, Waiman Long, Linux-Kernel@Vger. Kernel. Org On Tue, Sep 10, 2013 at 11:29 AM, Mace Moneta <moneta.mace@gmail.com> wrote: > I can patch and compile a kernel if you post the patch you want tested. It was attached in that previous email.. Apparently I'm not the only one who misses things like versions in subject lines or other small "details" in emails ;) Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <CAMfhy92u1GFvTrO+8SNp8hoEBEHcd7Z-3mw+79rv8+2yYqa+ow@mail.gmail.com>]
[parent not found: <CAMfhy90vEu2_hfoOnzB+ZOit4HNmKex2WObmztcY7R4+PBwd7w@mail.gmail.com>]
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d [not found] ` <CAMfhy90vEu2_hfoOnzB+ZOit4HNmKex2WObmztcY7R4+PBwd7w@mail.gmail.com> @ 2013-09-10 20:25 ` Linus Torvalds 2013-09-10 21:51 ` Josh Boyer 2013-09-11 1:48 ` Waiman Long 0 siblings, 2 replies; 16+ messages in thread From: Linus Torvalds @ 2013-09-10 20:25 UTC (permalink / raw) To: Mace Moneta Cc: Josh Boyer, Al Viro, Waiman Long, Linux-Kernel@Vger. Kernel. Org On Tue, Sep 10, 2013 at 12:57 PM, Mace Moneta <moneta.mace@gmail.com> wrote: > The (first) patch looks good; no recurrence. It has only taken 3-5 minutes > before, and I've been up for about half an hour now. Ok, good. It's pushed out. Al, your third pile of VFS stuff is also merged. Waiman, that means that your RCU path creation stuff is in. What else did you have pending for scalability? Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 20:25 ` Linus Torvalds @ 2013-09-10 21:51 ` Josh Boyer 2013-09-11 1:48 ` Waiman Long 1 sibling, 0 replies; 16+ messages in thread From: Josh Boyer @ 2013-09-10 21:51 UTC (permalink / raw) To: Linus Torvalds Cc: Mace Moneta, Al Viro, Waiman Long, Linux-Kernel@Vger. Kernel. Org On Tue, Sep 10, 2013 at 4:25 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Sep 10, 2013 at 12:57 PM, Mace Moneta <moneta.mace@gmail.com> wrote: >> The (first) patch looks good; no recurrence. It has only taken 3-5 minutes >> before, and I've been up for about half an hour now. > > Ok, good. It's pushed out. Thanks to the both of you for the quick resolution. josh ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 20:25 ` Linus Torvalds 2013-09-10 21:51 ` Josh Boyer @ 2013-09-11 1:48 ` Waiman Long 2013-09-11 3:22 ` Linus Torvalds 1 sibling, 1 reply; 16+ messages in thread From: Waiman Long @ 2013-09-11 1:48 UTC (permalink / raw) To: Linus Torvalds Cc: Mace Moneta, Josh Boyer, Al Viro, Linux-Kernel@Vger. Kernel. Org On 09/10/2013 04:25 PM, Linus Torvalds wrote: > On Tue, Sep 10, 2013 at 12:57 PM, Mace Moneta<moneta.mace@gmail.com> wrote: >> The (first) patch looks good; no recurrence. It has only taken 3-5 minutes >> before, and I've been up for about half an hour now. > Ok, good. It's pushed out. > > Al, your third pile of VFS stuff is also merged. Waiman, that means > that your RCU path creation stuff is in. What else did you have > pending for scalability? > > Linus I need to clean up some comments in the code. The other thing that I want to do is to introduce read_seqlock/read_sequnlock() primitives that do the locking without incrementing the sequence number. Then all the name lookup and translation code can use the new primitives as they don't change any of the protected structures. This will prevent one sequence number check failure from cascading into a series of failures because of the sequence number change. I will have a patch ready by tomorrow morning. Regards, Longman ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-11 1:48 ` Waiman Long @ 2013-09-11 3:22 ` Linus Torvalds 2013-09-11 3:56 ` Al Viro 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2013-09-11 3:22 UTC (permalink / raw) To: Waiman Long Cc: Mace Moneta, Josh Boyer, Al Viro, Linux-Kernel@Vger. Kernel. Org On Tue, Sep 10, 2013 at 6:48 PM, Waiman Long <waiman.long@hp.com> wrote: > > I need to clean up some comments in the code. The other thing that I want to > do is to introduce read_seqlock/read_sequnlock() primitives that do the > locking without incrementing the sequence number. I don't understand. That's the whole *point* of sequence numbers. The read-locker doesn't ever change the number, only the write-locker does. So I don't see what you would change. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-11 3:22 ` Linus Torvalds @ 2013-09-11 3:56 ` Al Viro 2013-09-11 4:12 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Al Viro @ 2013-09-11 3:56 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Mace Moneta, Josh Boyer, Linux-Kernel@Vger. Kernel. Org On Tue, Sep 10, 2013 at 08:22:56PM -0700, Linus Torvalds wrote: > On Tue, Sep 10, 2013 at 6:48 PM, Waiman Long <waiman.long@hp.com> wrote: > > > > I need to clean up some comments in the code. The other thing that I want to > > do is to introduce read_seqlock/read_sequnlock() primitives that do the > > locking without incrementing the sequence number. > > I don't understand. That's the whole *point* of sequence numbers. The > read-locker doesn't ever change the number, only the write-locker > does. > > So I don't see what you would change. I do. What we need on the second pass (one where we currently take seq_writelock()) is exclusion against writers; nothing we are doing is worth disturbing the readers - we don't change any data structures. And simple grabbing the spinlock, without touching the sequence number would achieve exactly that. Writers will have to wait and won't be able to disturb us, readers won't notice anything happening. So yes, this extra primitive does make sense here. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-11 3:56 ` Al Viro @ 2013-09-11 4:12 ` Linus Torvalds 0 siblings, 0 replies; 16+ messages in thread From: Linus Torvalds @ 2013-09-11 4:12 UTC (permalink / raw) To: Al Viro Cc: Waiman Long, Mace Moneta, Josh Boyer, Linux-Kernel@Vger. Kernel. Org On Tue, Sep 10, 2013 at 8:56 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > I do. What we need on the second pass (one where we currently > take seq_writelock()) is exclusion against writers; nothing we are > doing is worth disturbing the readers - we don't change any data > structures. And simple grabbing the spinlock, without touching the > sequence number would achieve exactly that. Writers will have to > wait and won't be able to disturb us, readers won't notice anything > happening. So yes, this extra primitive does make sense here. Ahh. Yes, as a fallback from the reader-side sequence lock that makes perfect sense.. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 18:25 ` Linus Torvalds [not found] ` <CAMfhy93EoDohCrV+ykj6b+R9QSCTdiPtvSpJpe3SH=ot_1LWMg@mail.gmail.com> @ 2013-09-10 18:43 ` Al Viro 2013-09-10 19:01 ` Linus Torvalds 1 sibling, 1 reply; 16+ messages in thread From: Al Viro @ 2013-09-10 18:43 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Boyer, Waiman Long, Linux-Kernel@Vger. Kernel. Org, moneta.mace On Tue, Sep 10, 2013 at 11:25:44AM -0700, Linus Torvalds wrote: > nd->flags &= ~LOOKUP_RCU; > if (!(nd->flags & LOOKUP_ROOT)) > nd->root.mnt = NULL; > unlock_rcu_walk(); > > and my unlazy_walk() essentially terminated the walk _without_ > clearing that nd->root.mnt thing (it did clear the LOOKUP_RCU bit and > unlock_rcy_walk(). So then later, we'd end up doing an extra > path_put(). Explaining a zero d_lockref.count. > > The whole damn root.mnt behavior with !LOOKUP_ROOT is a mystery and > needs more comments. But the attached trivial patch should do the > missing portion of terminate_walk(). > > Al, can you walk us through the rules for what "root.mnt == NULL" > really means? It's basically used as a flag for whether we've gotten > the root pointer or not. But it's pretty damn esoteric. LOOKUP_ROOT: the caller has set nd->root and we shouldn't touch that at all. !LOOKUP_ROOT: we set nd->root the first time we need / (in the very beginning if it's an absolute pathname, on the first absolute symlink otherwise). In non-RCU mode we hold a reference to it; in RCU mode we do not. As the result, leaving RCU mode should either grab a reference to the damn thing (if we intend to go on) or zero it out. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 18:43 ` Al Viro @ 2013-09-10 19:01 ` Linus Torvalds 2013-09-10 19:13 ` Al Viro 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2013-09-10 19:01 UTC (permalink / raw) To: Al Viro Cc: Josh Boyer, Waiman Long, Linux-Kernel@Vger. Kernel. Org, Mace Moneta [-- Attachment #1: Type: text/plain, Size: 1710 bytes --] On Tue, Sep 10, 2013 at 11:43 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > !LOOKUP_ROOT: we set nd->root the first time we need / (in the very > beginning if it's an absolute pathname, on the first absolute symlink > otherwise). In non-RCU mode we hold a reference to it; in RCU mode > we do not. As the result, leaving RCU mode should either grab > a reference to the damn thing (if we intend to go on) or zero it out. Yeah, that was what I was thinking. But in particular, I was wondering if we could simplify unlazy_walk() to _not_ take that root reference at all, and just always zero it out even if we succeed. IOW, doing the attached patch (_instead_ of the one I sent out). Or is there something in path lookup retrying that might get uphappy if we go back to a NULL root.mnt, and doesn't check it? Because this patch actually simplifies that nasty/complex unlazy_walk() a lot, and makes it much more understnadable, I think. It always looked really odd to me how it used to do "if LOOKUP_ROOT is _not_ set, let's take a reference count to the root"). With this patch in place, I really like how straightforward unlazy_walk() is. That used to be just about _the_ most subtle part of the whole rcu-to-refcount thing (as demonstrated by the number of bugs it has exposed in the patches), and now it looks almost trivial. Mace - ignore this newer patch for now, the one I want you to test is the minimal one you already have, not this cleanup one. But since you clearly saw the problem I never did, if you get bored _after_ testing the first patch, feel free to give this one a whirl too. But notice that this patch is a replacement for - not in addition to - the first one. Linus [-- Attachment #2: patch.diff --] [-- Type: application/octet-stream, Size: 1344 bytes --] fs/namei.c | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 56e4f4d537d0..8c6c94aba880 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -506,7 +506,6 @@ static inline void unlock_rcu_walk(void) */ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry) { - struct fs_struct *fs = current->fs; struct dentry *parent = nd->path.dentry; BUG_ON(!(nd->flags & LOOKUP_RCU)); @@ -531,6 +530,8 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry) */ mntget(nd->path.mnt); nd->flags &= ~LOOKUP_RCU; + if (!(nd->flags & LOOKUP_ROOT)) + nd->root.mnt = NULL; /* * For a negative lookup, the lookup sequence point is the parents @@ -554,23 +555,9 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry) goto drop_dentry; } - /* - * Sequence counts matched. Now make sure that the root is - * still valid and get it if required. - */ - if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT)) { - spin_lock(&fs->lock); - if (nd->root.mnt != fs->root.mnt || nd->root.dentry != fs->root.dentry) - goto unlock_and_drop_dentry; - path_get(&nd->root); - spin_unlock(&fs->lock); - } - unlock_rcu_walk(); return 0; -unlock_and_drop_dentry: - spin_unlock(&fs->lock); drop_dentry: unlock_rcu_walk(); dput(dentry); ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 19:01 ` Linus Torvalds @ 2013-09-10 19:13 ` Al Viro 2013-09-10 19:27 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Al Viro @ 2013-09-10 19:13 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Boyer, Waiman Long, Linux-Kernel@Vger. Kernel. Org, Mace Moneta On Tue, Sep 10, 2013 at 12:01:22PM -0700, Linus Torvalds wrote: > On Tue, Sep 10, 2013 at 11:43 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > !LOOKUP_ROOT: we set nd->root the first time we need / (in the very > > beginning if it's an absolute pathname, on the first absolute symlink > > otherwise). In non-RCU mode we hold a reference to it; in RCU mode > > we do not. As the result, leaving RCU mode should either grab > > a reference to the damn thing (if we intend to go on) or zero it out. > > Yeah, that was what I was thinking. But in particular, I was wondering > if we could simplify unlazy_walk() to _not_ take that root reference > at all, and just always zero it out even if we succeed. > > IOW, doing the attached patch (_instead_ of the one I sent out). > > Or is there something in path lookup retrying that might get uphappy > if we go back to a NULL root.mnt, and doesn't check it? Ugh... I really don't like that - your patch introduces the situations when race with chroot can lead to two absolute symlinks in the same path being interpreted wrt different roots. And yes, sure, anybody who gets in that kind of races isn't going to get particulary sane results, but still, I'd rather have simpler semantics here... Anyway, I think I see how to tidy the things up in a different way; let's go with your original fix for now and deal with the cleanups a bit later. Another thing: could you merge vfs.git#for-linus? It's getting really messy to keep track of changes both in mainline and in vfs.git, especially since there's a pile of pending stuff (lru_list work) that needs to be resolved wrt both... ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d 2013-09-10 19:13 ` Al Viro @ 2013-09-10 19:27 ` Linus Torvalds 0 siblings, 0 replies; 16+ messages in thread From: Linus Torvalds @ 2013-09-10 19:27 UTC (permalink / raw) To: Al Viro Cc: Josh Boyer, Waiman Long, Linux-Kernel@Vger. Kernel. Org, Mace Moneta On Tue, Sep 10, 2013 at 12:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Ugh... I really don't like that - your patch introduces the situations > when race with chroot can lead to two absolute symlinks in the same path > being interpreted wrt different roots. And yes, sure, anybody who gets > in that kind of races isn't going to get particulary sane results, but > still, I'd rather have simpler semantics here... Well, that "root is whatever it happens to be when we encounter the slash" semantics is actually what we used to have before nameidata anyway, so I don't actually believe it's anything new. But I'll commit the simple patch that doesn't clean things up. I guess we can always revisit this later (although I fear that once it all works, we'll just leave it in the ugly state). > Another thing: could you merge vfs.git#for-linus? It's getting really > messy to keep track of changes both in mainline and in vfs.git, especially > since there's a pile of pending stuff (lru_list work) that needs to be > resolved wrt both... Will do. I was hoping to get this sorted out first. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2013-09-11 4:12 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-10 17:14 kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d Josh Boyer
2013-09-10 17:33 ` Linus Torvalds
2013-09-10 17:39 ` Josh Boyer
2013-09-10 17:47 ` Linus Torvalds
2013-09-10 18:25 ` Linus Torvalds
[not found] ` <CAMfhy93EoDohCrV+ykj6b+R9QSCTdiPtvSpJpe3SH=ot_1LWMg@mail.gmail.com>
2013-09-10 18:32 ` Linus Torvalds
[not found] ` <CAMfhy92u1GFvTrO+8SNp8hoEBEHcd7Z-3mw+79rv8+2yYqa+ow@mail.gmail.com>
[not found] ` <CAMfhy90vEu2_hfoOnzB+ZOit4HNmKex2WObmztcY7R4+PBwd7w@mail.gmail.com>
2013-09-10 20:25 ` Linus Torvalds
2013-09-10 21:51 ` Josh Boyer
2013-09-11 1:48 ` Waiman Long
2013-09-11 3:22 ` Linus Torvalds
2013-09-11 3:56 ` Al Viro
2013-09-11 4:12 ` Linus Torvalds
2013-09-10 18:43 ` Al Viro
2013-09-10 19:01 ` Linus Torvalds
2013-09-10 19:13 ` Al Viro
2013-09-10 19:27 ` Linus Torvalds
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.