* Re: Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression? [not found] <87objupjlr.fsf@spindle.srvr.nix> @ 2012-10-23 1:33 ` J. Bruce Fields 2012-10-23 14:07 ` Nix 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-10-23 1:33 UTC (permalink / raw) To: Nix Cc: linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On Mon, Oct 22, 2012 at 05:17:04PM +0100, Nix wrote: > I just had a panic/oops on upgrading from 3.6.1 to 3.6.3, after weeks of > smooth operation on 3.6.1: one of the NFS changes that went into one of > the two latest stable kernels appears to be lethal after around half an > hour of uptime. The oops came from NFSv4, IIRC (relying on memory since > my camera was recharging and there is no netconsole from that box > because it is where the netconsole logs go, so I'll have to reproduce it > later today). The machine is an NFSv3 server only at present, with no > NFSv4 running (though NFSv4 is built in). Note recent clients may try to negotiate NFSv4 by default, so it's possible to use it without knowing. You didn't change anything else about your server or clients recently? > This is just a heads-up, not intended to be sufficient to track it down > unless it is instantly obvious: more later, including a .config dump and > proper oops report if it turns out to be reproducible. I just have to > recover from the usual panic-induced fs corruption and take a backup > first. Thanks. I don't see an obvious candidate on a quick skim of v3.6.1..v3.6.3 commits, but of course I could be missing something. --b. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression? 2012-10-23 1:33 ` Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression? J. Bruce Fields @ 2012-10-23 14:07 ` Nix 2012-10-23 14:30 ` J. Bruce Fields 0 siblings, 1 reply; 27+ messages in thread From: Nix @ 2012-10-23 14:07 UTC (permalink / raw) To: J. Bruce Fields Cc: linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On 23 Oct 2012, J. Bruce Fields uttered the following: > On Mon, Oct 22, 2012 at 05:17:04PM +0100, Nix wrote: >> I just had a panic/oops on upgrading from 3.6.1 to 3.6.3, after weeks of >> smooth operation on 3.6.1: one of the NFS changes that went into one of >> the two latest stable kernels appears to be lethal after around half an >> hour of uptime. The oops came from NFSv4, IIRC (relying on memory since >> my camera was recharging and there is no netconsole from that box >> because it is where the netconsole logs go, so I'll have to reproduce it >> later today). The machine is an NFSv3 server only at present, with no >> NFSv4 running (though NFSv4 is built in). > > Note recent clients may try to negotiate NFSv4 by default, so it's > possible to use it without knowing. Every NFS import from all this server's clients has 'vers=3' forced on (for now, until I get around to figuring out what if anything needs to be done to move to NFSv4: it may be the answer is 'nothing' but I tread quite carefully with this server, since my $HOME is there). /proc/fs/nfsfs/volumes on the clients confirms that everything is v3. > You didn't change anything else about your server or clients recently? Nope (other than upgrading the clients to 3.6.3 in concert). Running 3.6.1 here on that server now and 3.6.3 on all the clients, and no crash in over a day. nfs-utils is a bit old, 1.2.6-rc6, I should probably upgrade it... > I don't see an obvious candidate on a quick skim of v3.6.1..v3.6.3 > commits, but of course I could be missing something. I'll try rebooting into it again soon, and get an oops report if it should happen again (and hopefully less filesystem corruption this time, right after a reboot into a new kernel was clearly the wrong time to move a bunch of directories around). Sorry for the continuing lack of useful information... I'll try to fix that shortly (once with a report of a false alarm, since I really want this stable kernel: my server is affected by the tx stalls fixed by 3468e9d and I'm getting tired of the frequent 1s lags talking to it: I could just apply that one fix, but I'd rather track this down properly). ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression? 2012-10-23 14:07 ` Nix @ 2012-10-23 14:30 ` J. Bruce Fields 2012-10-23 16:32 ` Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) Nix 0 siblings, 1 reply; 27+ messages in thread From: J. Bruce Fields @ 2012-10-23 14:30 UTC (permalink / raw) To: Nix Cc: linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On Tue, Oct 23, 2012 at 03:07:59PM +0100, Nix wrote: > On 23 Oct 2012, J. Bruce Fields uttered the following: > > > On Mon, Oct 22, 2012 at 05:17:04PM +0100, Nix wrote: > >> I just had a panic/oops on upgrading from 3.6.1 to 3.6.3, after weeks of > >> smooth operation on 3.6.1: one of the NFS changes that went into one of > >> the two latest stable kernels appears to be lethal after around half an > >> hour of uptime. The oops came from NFSv4, IIRC (relying on memory since > >> my camera was recharging and there is no netconsole from that box > >> because it is where the netconsole logs go, so I'll have to reproduce it > >> later today). The machine is an NFSv3 server only at present, with no > >> NFSv4 running (though NFSv4 is built in). > > > > Note recent clients may try to negotiate NFSv4 by default, so it's > > possible to use it without knowing. > > Every NFS import from all this server's clients has 'vers=3' forced on > (for now, until I get around to figuring out what if anything needs to > be done to move to NFSv4: it may be the answer is 'nothing' but I tread > quite carefully with this server, since my $HOME is there). > > /proc/fs/nfsfs/volumes on the clients confirms that everything is v3. > > > You didn't change anything else about your server or clients recently? > > Nope (other than upgrading the clients to 3.6.3 in concert). Running > 3.6.1 here on that server now and 3.6.3 on all the clients, and no crash > in over a day. > > nfs-utils is a bit old, 1.2.6-rc6, I should probably upgrade it... nfs-utils shouldn't be capable of oopsing the kernel, so from my (selfish) point of view I'd actually rather you stick with whatever you have and try to reproduce the oops. (It's unlikely to make any difference anyway.) --b. > > I don't see an obvious candidate on a quick skim of v3.6.1..v3.6.3 > > commits, but of course I could be missing something. > > I'll try rebooting into it again soon, and get an oops report if it > should happen again (and hopefully less filesystem corruption this time, > right after a reboot into a new kernel was clearly the wrong time to > move a bunch of directories around). > > Sorry for the continuing lack of useful information... I'll try to fix > that shortly (once with a report of a false alarm, since I really want > this stable kernel: my server is affected by the tx stalls fixed by > 3468e9d and I'm getting tired of the frequent 1s lags talking to it: I > could just apply that one fix, but I'd rather track this down properly). ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 14:30 ` J. Bruce Fields @ 2012-10-23 16:32 ` Nix 2012-10-23 16:46 ` J. Bruce Fields 2012-10-23 20:57 ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix 0 siblings, 2 replies; 27+ messages in thread From: Nix @ 2012-10-23 16:32 UTC (permalink / raw) To: J. Bruce Fields, Ted Ts'o Cc: linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On 23 Oct 2012, J. Bruce Fields uttered the following: > nfs-utils shouldn't be capable of oopsing the kernel, so from my > (selfish) point of view I'd actually rather you stick with whatever you > have and try to reproduce the oops. Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I have runs of NULs in my logs now, which keep eating the oopses): [while in 3.6.1] [ 88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash. [ 88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash. [ 89.648152] ------------[ cut here ]------------ [ 89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42() [ 89.648614] Hardware name: empty [ 89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode] [ 89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1 [ 89.649610] Call Trace: [ 89.649832] [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b [ 89.650063] [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c [ 89.650292] [<ffffffff8112efbf>] drop_nlink+0x25/0x42 [ 89.650533] [<ffffffff81187112>] ext4_dec_count+0x26/0x28 [ 89.650763] [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4 [ 89.650993] [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7 [ 89.651224] [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7 [ 89.651454] [<ffffffff81127a53>] sys_renameat+0x1b1/0x228 [ 89.651682] [<ffffffff8114bda1>] ? fsnotify+0x226/0x249 [ 89.651911] [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20 [ 89.652145] [<ffffffff8111a240>] ? vfs_write+0x116/0x142 [ 89.652372] [<ffffffff81127ae5>] sys_rename+0x1b/0x1e [ 89.652601] [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b [...] [while having just booted into 3.6.1 after some time in 3.6.3: the FS was clean, and fscked on the previous boot into 3.6.3 after a previous instance of this bug] Oct 23 17:18:26 spindle crit: [ 67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828) This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now. Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to turn off panic_on_oops when I come home from the holidays). It's a lockd problem, and probably happens during delivery of mail over NFS (my mailserver load soars when it happens): [ 813.110354] ------------[ cut here ]------------ [ 813.110585] kernel BUG at fs/lockd/mon.c:150! [ 813.110878] invalid opcode: 0000 [#1] SMP [ 813.111173] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode] [ 813.111727] CPU 1 [ 813.111772] Pid: 1040, comm: lockd Not tainted 3.6.3-dirty #1 empty empty/S7010 [ 813.112388] RIP: 0010:[<ffffffff8120fbbc>] [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98 [ 813.112840] RSP: 0018:ffff88062163dcf0 EFLAGS: 00010246 [ 813.113069] RAX: ffff88062163dcf8 RBX: 0000000000000000 RCX: 0000000000000000 [ 813.113303] RDX: ffff88062163dd68 RSI: 0000000000000002 RDI: ffff88062163dd40 [ 813.113537] RBP: ffff88062163dd50 R08: 0000000000000000 R09: ffffffff8120c463 [ 813.113771] R10: ffffffff8120c463 R11: 0000000000000001 R12: 0000000000000000 [ 813.114006] R13: ffff88061f067e40 R14: ffff88059c763a00 R15: ffff88062163de28 [ 813.114241] FS: 0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000 [ 813.114651] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 813.114882] CR2: ffffffffff600400 CR3: 0000000001a0b000 CR4: 00000000000007e0 [ 813.115116] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 813.115350] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 813.115584] Process lockd (pid: 1040, threadinfo ffff88062163c000, task ffff880621611640) [ 813.115994] Stack: [ 813.116210] ffff88062163dd90 ffff8805f4e044b1 00000003000186b5 ffffffff00000010 [ 813.116761] ffff8805f4e044c1 000000000000008c 0000000000000000 ffff88062163dcf8 [ 813.117310] ffff88062163dd68 0000000000000000 ffff8806200ce088 ffff8805f4e04400 [ 813.117861] Call Trace: [ 813.118083] [<ffffffff8120fee4>] nsm_monitor+0x100/0x157 [ 813.118315] [<ffffffff81211b4a>] nlm4svc_retrieve_args+0x62/0xd7 [ 813.118547] [<ffffffff81211f49>] nlm4svc_proc_lock+0x3c/0xb5 [ 813.118779] [<ffffffff8121167b>] ? nlm4svc_decode_lockargs+0x47/0xb2 [ 813.119016] [<ffffffff814d89ca>] svc_process+0x3bf/0x6a1 [ 813.119246] [<ffffffff8120d5f0>] lockd+0x127/0x164 [ 813.119474] [<ffffffff8120d4c9>] ? set_grace_period+0x8a/0x8a [ 813.119708] [<ffffffff8107bcbc>] kthread+0x8b/0x93 [ 813.119938] [<ffffffff815012f4>] kernel_thread_helper+0x4/0x10 [ 813.120170] [<ffffffff8107bc31>] ? kthread_worker_fn+0xe1/0xe1 [ 813.120401] [<ffffffff815012f0>] ? gs_change+0xb/0xb [ 813.120628] Code: b8 10 00 00 00 48 89 45 c0 48 8d 81 8c 00 00 00 b9 08 00 00 00 48 89 45 c8 89 d8 f3 ab 48 8d 45 a8 48 89 55 e0 48 89 45 d8 75 02 <0f> 0b 89 f6 48 c7 02 00 00 00 00 4c 89 c7 48 6b f6 38 ba 00 04 [ 813.123627] RIP [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98 [ 813.123894] RSP <ffff88062163dcf0> [ 813.124129] ---[ end trace 11eb11a091ffd910 ]--- Aside: I am running with this patch, to prevent mounting of one NFS-mounted directory under another from failing all accesses with a spurious -EBUSY, a 3.6 regression -- but I can't see how this could cause it. I'm running with this in 3.6.1 as well: diff --git a/fs/dcache.c b/fs/dcache.c index 8086636..649a112 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -2404,6 +2404,10 @@ out_unalias: if (likely(!d_mountpoint(alias))) { __d_move(alias, dentry); ret = alias; + } else { + printk(KERN_WARNING "VFS: __d_move()ing a d_mountpoint(), uh oh\n"); + __d_move(alias, dentry); + ret = alias; } out_err: spin_unlock(&inode->i_lock); ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 16:32 ` Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) Nix @ 2012-10-23 16:46 ` J. Bruce Fields 2012-10-23 16:54 ` J. Bruce Fields 2012-10-23 16:56 ` Myklebust, Trond 2012-10-23 20:57 ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix 1 sibling, 2 replies; 27+ messages in thread From: J. Bruce Fields @ 2012-10-23 16:46 UTC (permalink / raw) To: Nix Cc: Ted Ts'o, linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs, Stanislav Kinsbursky On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote: > On 23 Oct 2012, J. Bruce Fields uttered the following: > > nfs-utils shouldn't be capable of oopsing the kernel, so from my > > (selfish) point of view I'd actually rather you stick with whatever you > > have and try to reproduce the oops. > > Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was > rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss > in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I > have runs of NULs in my logs now, which keep eating the oopses): > > [while in 3.6.1] > [ 88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash. > [ 88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash. > [ 89.648152] ------------[ cut here ]------------ > [ 89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42() > [ 89.648614] Hardware name: empty > [ 89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode] > [ 89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1 > [ 89.649610] Call Trace: > [ 89.649832] [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b > [ 89.650063] [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c > [ 89.650292] [<ffffffff8112efbf>] drop_nlink+0x25/0x42 > [ 89.650533] [<ffffffff81187112>] ext4_dec_count+0x26/0x28 > [ 89.650763] [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4 > [ 89.650993] [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7 > [ 89.651224] [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7 > [ 89.651454] [<ffffffff81127a53>] sys_renameat+0x1b1/0x228 > [ 89.651682] [<ffffffff8114bda1>] ? fsnotify+0x226/0x249 > [ 89.651911] [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20 > [ 89.652145] [<ffffffff8111a240>] ? vfs_write+0x116/0x142 > [ 89.652372] [<ffffffff81127ae5>] sys_rename+0x1b/0x1e > [ 89.652601] [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b > [...] > [while having just booted into 3.6.1 after some time in 3.6.3: the FS > was clean, and fscked on the previous boot into 3.6.3 after a previous > instance of this bug] > Oct 23 17:18:26 spindle crit: [ 67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828) > > This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to > tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now. > > > Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to > turn off panic_on_oops when I come home from the holidays). > > It's a lockd problem, and probably happens during delivery of mail over > NFS (my mailserver load soars when it happens): > > [ 813.110354] ------------[ cut here ]------------ > [ 813.110585] kernel BUG at fs/lockd/mon.c:150! So nsm_mon_unmon() is being passed a NULL client. There are three container patches between 3.6.1 and 3.6.3: lockd: per-net NSM client creation and destruction helpers introduced lockd: use rpc client's cl_nodename for id encoding lockd: create and use per-net NSM RPC clients on MON/UNMON requests and that last does change nsm_monitor's call to nsm_mon_unmon, so that's almost certainly it.... Looks like there's some confusion about whether nsm_client_get() returns NULL or an error? --b. > [ 813.110878] invalid opcode: 0000 [#1] SMP > [ 813.111173] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode] > [ 813.111727] CPU 1 > [ 813.111772] Pid: 1040, comm: lockd Not tainted 3.6.3-dirty #1 empty empty/S7010 > [ 813.112388] RIP: 0010:[<ffffffff8120fbbc>] [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98 > [ 813.112840] RSP: 0018:ffff88062163dcf0 EFLAGS: 00010246 > [ 813.113069] RAX: ffff88062163dcf8 RBX: 0000000000000000 RCX: 0000000000000000 > [ 813.113303] RDX: ffff88062163dd68 RSI: 0000000000000002 RDI: ffff88062163dd40 > [ 813.113537] RBP: ffff88062163dd50 R08: 0000000000000000 R09: ffffffff8120c463 > [ 813.113771] R10: ffffffff8120c463 R11: 0000000000000001 R12: 0000000000000000 > [ 813.114006] R13: ffff88061f067e40 R14: ffff88059c763a00 R15: ffff88062163de28 > [ 813.114241] FS: 0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000 > [ 813.114651] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 813.114882] CR2: ffffffffff600400 CR3: 0000000001a0b000 CR4: 00000000000007e0 > [ 813.115116] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 813.115350] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 813.115584] Process lockd (pid: 1040, threadinfo ffff88062163c000, task ffff880621611640) > [ 813.115994] Stack: > [ 813.116210] ffff88062163dd90 ffff8805f4e044b1 00000003000186b5 ffffffff00000010 > [ 813.116761] ffff8805f4e044c1 000000000000008c 0000000000000000 ffff88062163dcf8 > [ 813.117310] ffff88062163dd68 0000000000000000 ffff8806200ce088 ffff8805f4e04400 > [ 813.117861] Call Trace: > [ 813.118083] [<ffffffff8120fee4>] nsm_monitor+0x100/0x157 > [ 813.118315] [<ffffffff81211b4a>] nlm4svc_retrieve_args+0x62/0xd7 > [ 813.118547] [<ffffffff81211f49>] nlm4svc_proc_lock+0x3c/0xb5 > [ 813.118779] [<ffffffff8121167b>] ? nlm4svc_decode_lockargs+0x47/0xb2 > [ 813.119016] [<ffffffff814d89ca>] svc_process+0x3bf/0x6a1 > [ 813.119246] [<ffffffff8120d5f0>] lockd+0x127/0x164 > [ 813.119474] [<ffffffff8120d4c9>] ? set_grace_period+0x8a/0x8a > [ 813.119708] [<ffffffff8107bcbc>] kthread+0x8b/0x93 > [ 813.119938] [<ffffffff815012f4>] kernel_thread_helper+0x4/0x10 > [ 813.120170] [<ffffffff8107bc31>] ? kthread_worker_fn+0xe1/0xe1 > [ 813.120401] [<ffffffff815012f0>] ? gs_change+0xb/0xb > [ 813.120628] Code: b8 10 00 00 00 48 89 45 c0 48 8d 81 8c 00 00 00 b9 08 00 00 00 48 89 45 c8 89 d8 f3 ab 48 8d 45 a8 48 89 55 e0 48 89 45 d8 75 02 <0f> 0b 89 f6 48 c7 02 00 00 00 00 4c 89 c7 48 6b f6 38 ba 00 04 > [ 813.123627] RIP [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98 > [ 813.123894] RSP <ffff88062163dcf0> > [ 813.124129] ---[ end trace 11eb11a091ffd910 ]--- > > Aside: I am running with this patch, to prevent mounting of one > NFS-mounted directory under another from failing all accesses with a > spurious -EBUSY, a 3.6 regression -- but I can't see how this could > cause it. I'm running with this in 3.6.1 as well: > > diff --git a/fs/dcache.c b/fs/dcache.c > index 8086636..649a112 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -2404,6 +2404,10 @@ out_unalias: > if (likely(!d_mountpoint(alias))) { > __d_move(alias, dentry); > ret = alias; > + } else { > + printk(KERN_WARNING "VFS: __d_move()ing a d_mountpoint(), uh oh\n"); > + __d_move(alias, dentry); > + ret = alias; > } > out_err: > spin_unlock(&inode->i_lock); > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 16:46 ` J. Bruce Fields @ 2012-10-23 16:54 ` J. Bruce Fields 2012-10-23 16:56 ` Myklebust, Trond 1 sibling, 0 replies; 27+ messages in thread From: J. Bruce Fields @ 2012-10-23 16:54 UTC (permalink / raw) To: Nix Cc: Ted Ts'o, linux-kernel, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs, Stanislav Kinsbursky On Tue, Oct 23, 2012 at 12:46:21PM -0400, J. Bruce Fields wrote: > On Tue, Oct 23, 2012 at 05:32:21PM +0100, Nix wrote: > > On 23 Oct 2012, J. Bruce Fields uttered the following: > > > nfs-utils shouldn't be capable of oopsing the kernel, so from my > > > (selfish) point of view I'd actually rather you stick with whatever you > > > have and try to reproduce the oops. > > > > Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was > > rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss > > in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I > > have runs of NULs in my logs now, which keep eating the oopses): > > > > [while in 3.6.1] > > [ 88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash. > > [ 88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash. > > [ 89.648152] ------------[ cut here ]------------ > > [ 89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42() > > [ 89.648614] Hardware name: empty > > [ 89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode] > > [ 89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1 > > [ 89.649610] Call Trace: > > [ 89.649832] [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b > > [ 89.650063] [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c > > [ 89.650292] [<ffffffff8112efbf>] drop_nlink+0x25/0x42 > > [ 89.650533] [<ffffffff81187112>] ext4_dec_count+0x26/0x28 > > [ 89.650763] [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4 > > [ 89.650993] [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7 > > [ 89.651224] [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7 > > [ 89.651454] [<ffffffff81127a53>] sys_renameat+0x1b1/0x228 > > [ 89.651682] [<ffffffff8114bda1>] ? fsnotify+0x226/0x249 > > [ 89.651911] [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20 > > [ 89.652145] [<ffffffff8111a240>] ? vfs_write+0x116/0x142 > > [ 89.652372] [<ffffffff81127ae5>] sys_rename+0x1b/0x1e > > [ 89.652601] [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b > > [...] > > [while having just booted into 3.6.1 after some time in 3.6.3: the FS > > was clean, and fscked on the previous boot into 3.6.3 after a previous > > instance of this bug] > > Oct 23 17:18:26 spindle crit: [ 67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828) > > > > This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to > > tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now. > > > > > > Anyway, here's the NFSv4 oops (not a panic: it helps if I remember to > > turn off panic_on_oops when I come home from the holidays). > > > > It's a lockd problem, and probably happens during delivery of mail over > > NFS (my mailserver load soars when it happens): > > > > [ 813.110354] ------------[ cut here ]------------ > > [ 813.110585] kernel BUG at fs/lockd/mon.c:150! > > So nsm_mon_unmon() is being passed a NULL client. > > There are three container patches between 3.6.1 and 3.6.3: > > lockd: per-net NSM client creation and destruction helpers introduced > lockd: use rpc client's cl_nodename for id encoding > lockd: create and use per-net NSM RPC clients on MON/UNMON requests > > and that last does change nsm_monitor's call to nsm_mon_unmon, so that's > almost certainly it.... > > Looks like there's some confusion about whether nsm_client_get() returns > NULL or an error? The return from nsm_client_get() is either from nsm_create() or from ln->nsm_clnt. nsm_create's return is from rpc_create, and it doesn't look possible for rpc_creat to return NULL. So probably we have some case where, while holding ln->nsm_clnt_lock, you can see ln->nsm_users nonzero, but ln->nsm_clnt NULL ? --b. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 16:46 ` J. Bruce Fields 2012-10-23 16:54 ` J. Bruce Fields @ 2012-10-23 16:56 ` Myklebust, Trond 2012-10-23 17:05 ` Nix 1 sibling, 1 reply; 27+ messages in thread From: Myklebust, Trond @ 2012-10-23 16:56 UTC (permalink / raw) To: J. Bruce Fields Cc: Nix, Ted Ts'o, linux-kernel@vger.kernel.org, Schumaker, Bryan, Peng Tao, gregkh@linuxfoundation.org, linux-nfs@vger.kernel.org, Stanislav Kinsbursky T24gVHVlLCAyMDEyLTEwLTIzIGF0IDEyOjQ2IC0wNDAwLCBKLiBCcnVjZSBGaWVsZHMgd3JvdGU6 DQo+IE9uIFR1ZSwgT2N0IDIzLCAyMDEyIGF0IDA1OjMyOjIxUE0gKzAxMDAsIE5peCB3cm90ZToN Cj4gPiBPbiAyMyBPY3QgMjAxMiwgSi4gQnJ1Y2UgRmllbGRzIHV0dGVyZWQgdGhlIGZvbGxvd2lu ZzoNCj4gPiA+IG5mcy11dGlscyBzaG91bGRuJ3QgYmUgY2FwYWJsZSBvZiBvb3BzaW5nIHRoZSBr ZXJuZWwsIHNvIGZyb20gbXkNCj4gPiA+IChzZWxmaXNoKSBwb2ludCBvZiB2aWV3IEknZCBhY3R1 YWxseSByYXRoZXIgeW91IHN0aWNrIHdpdGggd2hhdGV2ZXIgeW91DQo+ID4gPiBoYXZlIGFuZCB0 cnkgdG8gcmVwcm9kdWNlIHRoZSBvb3BzLg0KPiA+IA0KPiA+IFJlcHJvZHVjZWQgaW4gMy42LjMs IG5vdCBpbiAzLjYuMSwgbm90IHRyaWVkIDMuNi4yLiBDYXB0dXJpbmcgaXQgd2FzDQo+ID4gcmVu ZGVyZWQgc29tZXdoYXQgZGlmZmljdWx0IGJ5IGFuIGV4dDQvSkJEMiBidWcgd2hpY2ggbGVhZHMg dG8gZGF0YSBsb3NzDQo+ID4gaW4gL3ZhciBvbiBldmVyeSByZWJvb3Qgb3V0IG9mIDMuNi4xIGFu ZCBvbiBzb21lIHJlYm9vdHMgb3V0IG9mIDMuNi4zIChJDQo+ID4gaGF2ZSBydW5zIG9mIE5VTHMg aW4gbXkgbG9ncyBub3csIHdoaWNoIGtlZXAgZWF0aW5nIHRoZSBvb3BzZXMpOg0KPiA+IA0KPiA+ IFt3aGlsZSBpbiAzLjYuMV0NCj4gPiBbICAgODguNTY1Njk4XSBKQkQyOiBTcG90dGVkIGRpcnR5 IG1ldGFkYXRhIGJ1ZmZlciAoZGV2ID0gZG0tNSwgYmxvY2tuciA9IDApLiBUaGVyZSdzIGEgcmlz ayBvZiBmaWxlc3lzdGVtIGNvcnJ1cHRpb24gaW4gY2FzZSBvZiBzeXN0ZW0gY3Jhc2guDQo+ID4g WyAgIDg4Ljc5OTI2M10gSkJEMjogU3BvdHRlZCBkaXJ0eSBtZXRhZGF0YSBidWZmZXIgKGRldiA9 IGRtLTUsIGJsb2NrbnIgPSAwKS4gVGhlcmUncyBhIHJpc2sgb2YgZmlsZXN5c3RlbSBjb3JydXB0 aW9uIGluIGNhc2Ugb2Ygc3lzdGVtIGNyYXNoLg0KPiA+IFsgICA4OS42NDgxNTJdIC0tLS0tLS0t LS0tLVsgY3V0IGhlcmUgXS0tLS0tLS0tLS0tLQ0KPiA+IFsgICA4OS42NDgzODZdIFdBUk5JTkc6 IGF0IGZzL2lub2RlLmM6MjgwIGRyb3BfbmxpbmsrMHgyNS8weDQyKCkNCj4gPiBbICAgODkuNjQ4 NjE0XSBIYXJkd2FyZSBuYW1lOiBlbXB0eQ0KPiA+IFsgICA4OS42NDg4MzNdIE1vZHVsZXMgbGlu a2VkIGluOiBmaXJld2lyZV9vaGNpIGZpcmV3aXJlX2NvcmUgW2xhc3QgdW5sb2FkZWQ6IG1pY3Jv Y29kZV0NCj4gPiBbICAgODkuNjQ5MzgyXSBQaWQ6IDE0ODQsIGNvbW06IGRoY3BkIE5vdCB0YWlu dGVkIDMuNi4xLWRpcnR5ICMxDQo+ID4gWyAgIDg5LjY0OTYxMF0gQ2FsbCBUcmFjZToNCj4gPiBb ICAgODkuNjQ5ODMyXSAgWzxmZmZmZmZmZjgxMDYwOGMwPl0gd2Fybl9zbG93cGF0aF9jb21tb24r MHg4My8weDliDQo+ID4gWyAgIDg5LjY1MDA2M10gIFs8ZmZmZmZmZmY4MTA2MDhmMj5dIHdhcm5f c2xvd3BhdGhfbnVsbCsweDFhLzB4MWMNCj4gPiBbICAgODkuNjUwMjkyXSAgWzxmZmZmZmZmZjgx MTJlZmJmPl0gZHJvcF9ubGluaysweDI1LzB4NDINCj4gPiBbICAgODkuNjUwNTMzXSAgWzxmZmZm ZmZmZjgxMTg3MTEyPl0gZXh0NF9kZWNfY291bnQrMHgyNi8weDI4DQo+ID4gWyAgIDg5LjY1MDc2 M10gIFs8ZmZmZmZmZmY4MTE4YWJiOD5dIGV4dDRfcmVuYW1lKzB4NGVjLzB4N2I0DQo+ID4gWyAg IDg5LjY1MDk5M10gIFs8ZmZmZmZmZmY4MTEyNWQ4MT5dID8gdmZzX3JlbmFtZSsweGJlLzB4M2I3 DQo+ID4gWyAgIDg5LjY1MTIyNF0gIFs8ZmZmZmZmZmY4MTEyNWYzZj5dIHZmc19yZW5hbWUrMHgy N2MvMHgzYjcNCj4gPiBbICAgODkuNjUxNDU0XSAgWzxmZmZmZmZmZjgxMTI3YTUzPl0gc3lzX3Jl bmFtZWF0KzB4MWIxLzB4MjI4DQo+ID4gWyAgIDg5LjY1MTY4Ml0gIFs8ZmZmZmZmZmY4MTE0YmRh MT5dID8gZnNub3RpZnkrMHgyMjYvMHgyNDkNCj4gPiBbICAgODkuNjUxOTExXSAgWzxmZmZmZmZm ZjgxMjM5Yjc1Pl0gPyBzZWN1cml0eV9pbm9kZV9wZXJtaXNzaW9uKzB4MWUvMHgyMA0KPiA+IFsg ICA4OS42NTIxNDVdICBbPGZmZmZmZmZmODExMWEyNDA+XSA/IHZmc193cml0ZSsweDExNi8weDE0 Mg0KPiA+IFsgICA4OS42NTIzNzJdICBbPGZmZmZmZmZmODExMjdhZTU+XSBzeXNfcmVuYW1lKzB4 MWIvMHgxZQ0KPiA+IFsgICA4OS42NTI2MDFdICBbPGZmZmZmZmZmODE0ZmZmYTI+XSBzeXN0ZW1f Y2FsbF9mYXN0cGF0aCsweDE2LzB4MWINCj4gPiBbLi4uXQ0KPiA+IFt3aGlsZSBoYXZpbmcganVz dCBib290ZWQgaW50byAzLjYuMSBhZnRlciBzb21lIHRpbWUgaW4gMy42LjM6IHRoZSBGUw0KPiA+ ICB3YXMgY2xlYW4sIGFuZCBmc2NrZWQgb24gdGhlIHByZXZpb3VzIGJvb3QgaW50byAzLjYuMyBh ZnRlciBhIHByZXZpb3VzDQo+ID4gIGluc3RhbmNlIG9mIHRoaXMgYnVnXQ0KPiA+IE9jdCAyMyAx NzoxODoyNiBzcGluZGxlIGNyaXQ6IFsgICA2Ny42MjUzMTldIEVYVDQtZnMgZXJyb3IgKGRldmlj ZSBkbS01KTogbWJfZnJlZV9ibG9ja3M6MTMwMDogZ3JvdXAgNjUsIGJsb2NrIDIxNDM3NDg6ZnJl ZWluZyBhbHJlYWR5IGZyZWVkIGJsb2NrIChiaXQgMTM4MjgpDQo+ID4gDQo+ID4gVGhpcyBtYXkg d2VsbCBiZSBhIDMuNi4xLXNwZWNpZmljIGJ1ZyBmaXhlZCBpbiAzLjYuMywgYnV0IGl0J3MgaGFy ZCB0bw0KPiA+IHRlbGwgc2luY2UgbW9zdCBvZiBteSByZWJvb3RzIGFyZSAzLjYuMS0+My42LjMg b3IgdmljZSB2ZXJzYSByaWdodCBub3cuDQo+ID4gDQo+ID4gDQo+ID4gQW55d2F5LCBoZXJlJ3Mg dGhlIE5GU3Y0IG9vcHMgKG5vdCBhIHBhbmljOiBpdCBoZWxwcyBpZiBJIHJlbWVtYmVyIHRvDQo+ ID4gdHVybiBvZmYgcGFuaWNfb25fb29wcyB3aGVuIEkgY29tZSBob21lIGZyb20gdGhlIGhvbGlk YXlzKS4NCj4gPiANCj4gPiBJdCdzIGEgbG9ja2QgcHJvYmxlbSwgYW5kIHByb2JhYmx5IGhhcHBl bnMgZHVyaW5nIGRlbGl2ZXJ5IG9mIG1haWwgb3Zlcg0KPiA+IE5GUyAobXkgbWFpbHNlcnZlciBs b2FkIHNvYXJzIHdoZW4gaXQgaGFwcGVucyk6DQo+ID4gDQo+ID4gWyAgODEzLjExMDM1NF0gLS0t LS0tLS0tLS0tWyBjdXQgaGVyZSBdLS0tLS0tLS0tLS0tDQo+ID4gWyAgODEzLjExMDU4NV0ga2Vy bmVsIEJVRyBhdCBmcy9sb2NrZC9tb24uYzoxNTAhDQo+IA0KPiBTbyBuc21fbW9uX3VubW9uKCkg aXMgYmVpbmcgcGFzc2VkIGEgTlVMTCBjbGllbnQuDQo+IA0KPiBUaGVyZSBhcmUgdGhyZWUgY29u dGFpbmVyIHBhdGNoZXMgYmV0d2VlbiAzLjYuMSBhbmQgMy42LjM6DQo+IA0KPiAJbG9ja2Q6IHBl ci1uZXQgTlNNIGNsaWVudCBjcmVhdGlvbiBhbmQgZGVzdHJ1Y3Rpb24gaGVscGVycyBpbnRyb2R1 Y2VkDQo+IAlsb2NrZDogdXNlIHJwYyBjbGllbnQncyBjbF9ub2RlbmFtZSBmb3IgaWQgZW5jb2Rp bmcNCj4gCWxvY2tkOiBjcmVhdGUgYW5kIHVzZSBwZXItbmV0IE5TTSBSUEMgY2xpZW50cyBvbiBN T04vVU5NT04gcmVxdWVzdHMNCj4gDQo+IGFuZCB0aGF0IGxhc3QgZG9lcyBjaGFuZ2UgbnNtX21v bml0b3IncyBjYWxsIHRvIG5zbV9tb25fdW5tb24sIHNvIHRoYXQncw0KPiBhbG1vc3QgY2VydGFp bmx5IGl0Li4uLg0KPiANCj4gTG9va3MgbGlrZSB0aGVyZSdzIHNvbWUgY29uZnVzaW9uIGFib3V0 IHdoZXRoZXIgbnNtX2NsaWVudF9nZXQoKSByZXR1cm5zDQo+IE5VTEwgb3IgYW4gZXJyb3I/DQoN Cm5zbV9jbGllbnRfZ2V0KCkgbG9va3MgZXh0cmVtZWx5IHJhY3kgaW4gdGhlIGNhc2Ugd2hlcmUg bG4tPm5zbV91c2VycyA9PQ0KMC4NCg0KU2luY2Ugd2UgbmV2ZXIgcmVjaGVjayB0aGUgdmFsdWUg b2YgbG4tPm5zbV91c2VycyBhZnRlciB0YWtpbmcNCm5zbV9jcmVhdGVfbXV0ZXgsIHdoYXQgaXMg c3RvcHBpbmcgMiBkaWZmZXJlbnQgdGhyZWFkcyBmcm9tIGJvdGggc2V0dGluZw0KbG4tPm5zbV9j bG50IGFuZCByZS1pbml0aWFsaXNpbmcgbG4tPm5zbV91c2Vycz8NCg0KDQotLSANClRyb25kIE15 a2xlYnVzdA0KTGludXggTkZTIGNsaWVudCBtYWludGFpbmVyDQoNCk5ldEFwcA0KVHJvbmQuTXlr bGVidXN0QG5ldGFwcC5jb20NCnd3dy5uZXRhcHAuY29tDQo= ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 16:56 ` Myklebust, Trond @ 2012-10-23 17:05 ` Nix 2012-10-23 17:36 ` Nix 0 siblings, 1 reply; 27+ messages in thread From: Nix @ 2012-10-23 17:05 UTC (permalink / raw) To: Myklebust, Trond Cc: J. Bruce Fields, Ted Ts'o, linux-kernel@vger.kernel.org, Schumaker, Bryan, Peng Tao, gregkh@linuxfoundation.org, linux-nfs@vger.kernel.org, Stanislav Kinsbursky On 23 Oct 2012, Trond Myklebust spake thusly: > On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote: >> Looks like there's some confusion about whether nsm_client_get() returns >> NULL or an error? > > nsm_client_get() looks extremely racy in the case where ln->nsm_users == > 0. Since we never recheck the value of ln->nsm_users after taking > nsm_create_mutex, what is stopping 2 different threads from both setting > ln->nsm_clnt and re-initialising ln->nsm_users? Yep. At the worst possible time: spin_lock(&ln->nsm_clnt_lock); if (ln->nsm_users) { if (--ln->nsm_users) ln->nsm_clnt = NULL; (1) shutdown = !ln->nsm_users; } spin_unlock(&ln->nsm_clnt_lock); If a thread reinitializes nsm_users at point (1), after the assignment, we could well end up with ln->nsm_clnt NULL and shutdown false. A bit later, nsm_mon_unmon gets called with a NULL clnt, and boom. This seems particularly likely if there is only one nsm_user (which is true in my case, since I have only one active network namespace). -- NULL && (void) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 17:05 ` Nix @ 2012-10-23 17:36 ` Nix 2012-10-23 17:43 ` J. Bruce Fields 2012-10-23 17:44 ` Myklebust, Trond 0 siblings, 2 replies; 27+ messages in thread From: Nix @ 2012-10-23 17:36 UTC (permalink / raw) To: Myklebust, Trond Cc: J. Bruce Fields, Ted Ts'o, linux-kernel@vger.kernel.org, Schumaker, Bryan, Peng Tao, gregkh@linuxfoundation.org, linux-nfs@vger.kernel.org, Stanislav Kinsbursky On 23 Oct 2012, nix@esperi.org.uk uttered the following: > On 23 Oct 2012, Trond Myklebust spake thusly: >> On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote: >>> Looks like there's some confusion about whether nsm_client_get() returns >>> NULL or an error? >> >> nsm_client_get() looks extremely racy in the case where ln->nsm_users == >> 0. Since we never recheck the value of ln->nsm_users after taking >> nsm_create_mutex, what is stopping 2 different threads from both setting >> ln->nsm_clnt and re-initialising ln->nsm_users? > > Yep. At the worst possible time: > > spin_lock(&ln->nsm_clnt_lock); > if (ln->nsm_users) { > if (--ln->nsm_users) > ln->nsm_clnt = NULL; > (1) shutdown = !ln->nsm_users; > } > spin_unlock(&ln->nsm_clnt_lock); > > If a thread reinitializes nsm_users at point (1), after the assignment, > we could well end up with ln->nsm_clnt NULL and shutdown false. A bit > later, nsm_mon_unmon gets called with a NULL clnt, and boom. Possible fix if so, utterly untested so far (will test when I can face yet another reboot and fs-corruption-recovery-hell cycle, in a few hours), may ruin performance, violate locking hierarchies, and consume kittens: diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c index e4fb3ba..da91cdf 100644 --- a/fs/lockd/mon.c +++ b/fs/lockd/mon.c @@ -98,7 +98,6 @@ static struct rpc_clnt *nsm_client_get(struct net *net) spin_unlock(&ln->nsm_clnt_lock); goto out; } - spin_unlock(&ln->nsm_clnt_lock); mutex_lock(&nsm_create_mutex); clnt = nsm_create(net); @@ -108,6 +107,7 @@ static struct rpc_clnt *nsm_client_get(struct net *net) ln->nsm_users = 1; } mutex_unlock(&nsm_create_mutex); + spin_unlock(&ln->nsm_clnt_lock); out: return clnt; } ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 17:36 ` Nix @ 2012-10-23 17:43 ` J. Bruce Fields 2012-10-23 17:44 ` Myklebust, Trond 1 sibling, 0 replies; 27+ messages in thread From: J. Bruce Fields @ 2012-10-23 17:43 UTC (permalink / raw) To: Nix Cc: Myklebust, Trond, Ted Ts'o, linux-kernel@vger.kernel.org, Schumaker, Bryan, Peng Tao, gregkh@linuxfoundation.org, linux-nfs@vger.kernel.org, Stanislav Kinsbursky On Tue, Oct 23, 2012 at 06:36:15PM +0100, Nix wrote: > On 23 Oct 2012, nix@esperi.org.uk uttered the following: > > > On 23 Oct 2012, Trond Myklebust spake thusly: > >> On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote: > >>> Looks like there's some confusion about whether nsm_client_get() returns > >>> NULL or an error? > >> > >> nsm_client_get() looks extremely racy in the case where ln->nsm_users == > >> 0. Since we never recheck the value of ln->nsm_users after taking > >> nsm_create_mutex, what is stopping 2 different threads from both setting > >> ln->nsm_clnt and re-initialising ln->nsm_users? > > > > Yep. At the worst possible time: > > > > spin_lock(&ln->nsm_clnt_lock); > > if (ln->nsm_users) { > > if (--ln->nsm_users) > > ln->nsm_clnt = NULL; > > (1) shutdown = !ln->nsm_users; > > } > > spin_unlock(&ln->nsm_clnt_lock); > > > > If a thread reinitializes nsm_users at point (1), after the assignment, > > we could well end up with ln->nsm_clnt NULL and shutdown false. A bit > > later, nsm_mon_unmon gets called with a NULL clnt, and boom. > > Possible fix if so, utterly untested so far (will test when I can face > yet another reboot and fs-corruption-recovery-hell cycle, in a few > hours), may ruin performance, violate locking hierarchies, and consume > kittens: Right, mutexes can't be taken while holding spinlocks. Keep the kittens well away from the computer. --b. > > diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c > index e4fb3ba..da91cdf 100644 > --- a/fs/lockd/mon.c > +++ b/fs/lockd/mon.c > @@ -98,7 +98,6 @@ static struct rpc_clnt *nsm_client_get(struct net *net) > spin_unlock(&ln->nsm_clnt_lock); > goto out; > } > - spin_unlock(&ln->nsm_clnt_lock); > > mutex_lock(&nsm_create_mutex); > clnt = nsm_create(net); > @@ -108,6 +107,7 @@ static struct rpc_clnt *nsm_client_get(struct net *net) > ln->nsm_users = 1; > } > mutex_unlock(&nsm_create_mutex); > + spin_unlock(&ln->nsm_clnt_lock); > out: > return clnt; > } > ^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 17:36 ` Nix 2012-10-23 17:43 ` J. Bruce Fields @ 2012-10-23 17:44 ` Myklebust, Trond 2012-10-23 17:57 ` Myklebust, Trond [not found] ` <1351015039.4622.23.camel@lade.trondhjem.org> 1 sibling, 2 replies; 27+ messages in thread From: Myklebust, Trond @ 2012-10-23 17:44 UTC (permalink / raw) To: Nix Cc: J. Bruce Fields, Ted Ts'o, linux-kernel@vger.kernel.org, Schumaker, Bryan, Peng Tao, gregkh@linuxfoundation.org, linux-nfs@vger.kernel.org, Stanislav Kinsbursky > -----Original Message----- > From: Nix [mailto:nix@esperi.org.uk] > Sent: Tuesday, October 23, 2012 1:36 PM > To: Myklebust, Trond > Cc: J. Bruce Fields; Ted Ts'o; linux-kernel@vger.kernel.org; Schumaker, > Bryan; Peng Tao; gregkh@linuxfoundation.org; linux-nfs@vger.kernel.org; > Stanislav Kinsbursky > Subject: Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also > an unrelated ext4 data loss bug) > > On 23 Oct 2012, nix@esperi.org.uk uttered the following: > > > On 23 Oct 2012, Trond Myklebust spake thusly: > >> On Tue, 2012-10-23 at 12:46 -0400, J. Bruce Fields wrote: > >>> Looks like there's some confusion about whether nsm_client_get() > >>> returns NULL or an error? > >> > >> nsm_client_get() looks extremely racy in the case where ln->nsm_users > >> == 0. Since we never recheck the value of ln->nsm_users after taking > >> nsm_create_mutex, what is stopping 2 different threads from both > >> setting > >> ln->nsm_clnt and re-initialising ln->nsm_users? > > > > Yep. At the worst possible time: > > > > spin_lock(&ln->nsm_clnt_lock); > > if (ln->nsm_users) { > > if (--ln->nsm_users) > > ln->nsm_clnt = NULL; > > (1) shutdown = !ln->nsm_users; > > } > > spin_unlock(&ln->nsm_clnt_lock); > > > > If a thread reinitializes nsm_users at point (1), after the > > assignment, we could well end up with ln->nsm_clnt NULL and shutdown > > false. A bit later, nsm_mon_unmon gets called with a NULL clnt, and boom. > > Possible fix if so, utterly untested so far (will test when I can face yet another > reboot and fs-corruption-recovery-hell cycle, in a few hours), may ruin > performance, violate locking hierarchies, and consume > kittens: > > diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c index e4fb3ba..da91cdf 100644 > --- a/fs/lockd/mon.c > +++ b/fs/lockd/mon.c > @@ -98,7 +98,6 @@ static struct rpc_clnt *nsm_client_get(struct net *net) > spin_unlock(&ln->nsm_clnt_lock); > goto out; > } > - spin_unlock(&ln->nsm_clnt_lock); > > mutex_lock(&nsm_create_mutex); > clnt = nsm_create(net); > @@ -108,6 +107,7 @@ static struct rpc_clnt *nsm_client_get(struct net *net) > ln->nsm_users = 1; > } > mutex_unlock(&nsm_create_mutex); > + spin_unlock(&ln->nsm_clnt_lock); You can't hold a spinlock while sleeping. Both mutex_lock() and nsm_create() can definitely sleep. The correct way to do this is to grab the spinlock and recheck the value of ln->nsm_users inside the 'if (!IS_ERR())' condition. If it is still zero, bump it and set ln->nsm_clnt, otherwise bump it, get the existing ln->nsm_clnt and call rpc_shutdown_clnt() on the redundant nsm client after dropping the spinlock. Cheers Trond ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 17:44 ` Myklebust, Trond @ 2012-10-23 17:57 ` Myklebust, Trond [not found] ` <1351015039.4622.23.camel@lade.trondhjem.org> 1 sibling, 0 replies; 27+ messages in thread From: Myklebust, Trond @ 2012-10-23 17:57 UTC (permalink / raw) To: Nix Cc: J. Bruce Fields, Ted Ts'o, linux-kernel@vger.kernel.org, Schumaker, Bryan, Peng Tao, gregkh@linuxfoundation.org, linux-nfs@vger.kernel.org, Stanislav Kinsbursky T24gVHVlLCAyMDEyLTEwLTIzIGF0IDE3OjQ0ICswMDAwLCBNeWtsZWJ1c3QsIFRyb25kIHdyb3Rl Og0KPiBZb3UgY2FuJ3QgaG9sZCBhIHNwaW5sb2NrIHdoaWxlIHNsZWVwaW5nLiBCb3RoIG11dGV4 X2xvY2soKSBhbmQgbnNtX2NyZWF0ZSgpIGNhbiBkZWZpbml0ZWx5IHNsZWVwLg0KPiANCj4gVGhl IGNvcnJlY3Qgd2F5IHRvIGRvIHRoaXMgaXMgdG8gZ3JhYiB0aGUgc3BpbmxvY2sgYW5kIHJlY2hl Y2sgdGhlIHZhbHVlIG9mIGxuLT5uc21fdXNlcnMgaW5zaWRlIHRoZSAnaWYgKCFJU19FUlIoKSkn IGNvbmRpdGlvbi4gSWYgaXQgaXMgc3RpbGwgemVybywgYnVtcCBpdCBhbmQgc2V0IGxuLT5uc21f Y2xudCwgb3RoZXJ3aXNlIGJ1bXAgaXQsIGdldCB0aGUgZXhpc3RpbmcgbG4tPm5zbV9jbG50IGFu ZCBjYWxsIHJwY19zaHV0ZG93bl9jbG50KCkgb24gdGhlIHJlZHVuZGFudCBuc20gY2xpZW50IGFm dGVyIGRyb3BwaW5nIHRoZSBzcGlubG9jay4NCj4gDQo+IENoZWVycw0KPiAgIFRyb25kDQoNCkNh biB5b3UgcGxlYXNlIGNoZWNrIGlmIHRoZSBmb2xsb3dpbmcgcGF0Y2ggZml4ZXMgdGhlIGlzc3Vl Pw0KDQpDaGVlcnMNCiAgVHJvbmQNCg0KODwtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KRnJvbSA0NGEwNzA0NTVkMjQ2ZTA5ZGUwY2VmYzg4 NzU4MzNmMjFjYTY1NWU4IE1vbiBTZXAgMTcgMDA6MDA6MDAgMjAwMQ0KRnJvbTogVHJvbmQgTXlr bGVidXN0IDxUcm9uZC5NeWtsZWJ1c3RAbmV0YXBwLmNvbT4NCkRhdGU6IFR1ZSwgMjMgT2N0IDIw MTIgMTM6NTE6NTggLTA0MDANClN1YmplY3Q6IFtQQVRDSF0gTE9DS0Q6IGZpeCByYWNlcyBpbiBu c21fY2xpZW50X2dldA0KDQpDb21taXQgZTk0MDZkYjIwZmVjYmZjYWI2NDZiYWQxNTdiNGNmZGM3 Y2FkZGRmYiAobG9ja2Q6IHBlci1uZXQNCk5TTSBjbGllbnQgY3JlYXRpb24gYW5kIGRlc3RydWN0 aW9uIGhlbHBlcnMgaW50cm9kdWNlZCkgY29udGFpbnMNCmEgbmFzdHkgcmFjZSBvbiBpbml0aWFs aXNhdGlvbiBvZiB0aGUgcGVyLW5ldCBOU00gY2xpZW50IGJlY2F1c2UNCml0IGRvZXNuJ3QgY2hl Y2sgd2hldGhlciBvciBub3QgdGhlIGNsaWVudCBpcyBzZXQgYWZ0ZXIgZ3JhYmJpbmcNCnRoZSBu c21fY3JlYXRlX211dGV4Lg0KDQpSZXBvcnRlZC1ieTogTml4IDxuaXhAZXNwZXJpLm9yZy51az4N ClNpZ25lZC1vZmYtYnk6IFRyb25kIE15a2xlYnVzdCA8VHJvbmQuTXlrbGVidXN0QG5ldGFwcC5j b20+DQotLS0NCiBmcy9sb2NrZC9tb24uYyB8IDIwICsrKysrKysrKysrKysrLS0tLS0tDQogMSBm aWxlIGNoYW5nZWQsIDE0IGluc2VydGlvbnMoKyksIDYgZGVsZXRpb25zKC0pDQoNCmRpZmYgLS1n aXQgYS9mcy9sb2NrZC9tb24uYyBiL2ZzL2xvY2tkL21vbi5jDQppbmRleCBlNGZiM2JhLi45NzU1 NjAzIDEwMDY0NA0KLS0tIGEvZnMvbG9ja2QvbW9uLmMNCisrKyBiL2ZzL2xvY2tkL21vbi5jDQpA QCAtODgsNyArODgsNyBAQCBzdGF0aWMgc3RydWN0IHJwY19jbG50ICpuc21fY3JlYXRlKHN0cnVj dCBuZXQgKm5ldCkNCiBzdGF0aWMgc3RydWN0IHJwY19jbG50ICpuc21fY2xpZW50X2dldChzdHJ1 Y3QgbmV0ICpuZXQpDQogew0KIAlzdGF0aWMgREVGSU5FX01VVEVYKG5zbV9jcmVhdGVfbXV0ZXgp Ow0KLQlzdHJ1Y3QgcnBjX2NsbnQJKmNsbnQ7DQorCXN0cnVjdCBycGNfY2xudAkqY2xudCwgKm5l dzsNCiAJc3RydWN0IGxvY2tkX25ldCAqbG4gPSBuZXRfZ2VuZXJpYyhuZXQsIGxvY2tkX25ldF9p ZCk7DQogDQogCXNwaW5fbG9jaygmbG4tPm5zbV9jbG50X2xvY2spOw0KQEAgLTEwMSwxMSArMTAx LDE5IEBAIHN0YXRpYyBzdHJ1Y3QgcnBjX2NsbnQgKm5zbV9jbGllbnRfZ2V0KHN0cnVjdCBuZXQg Km5ldCkNCiAJc3Bpbl91bmxvY2soJmxuLT5uc21fY2xudF9sb2NrKTsNCiANCiAJbXV0ZXhfbG9j aygmbnNtX2NyZWF0ZV9tdXRleCk7DQotCWNsbnQgPSBuc21fY3JlYXRlKG5ldCk7DQotCWlmICgh SVNfRVJSKGNsbnQpKSB7DQotCQlsbi0+bnNtX2NsbnQgPSBjbG50Ow0KLQkJc21wX3dtYigpOw0K LQkJbG4tPm5zbV91c2VycyA9IDE7DQorCW5ldyA9IG5zbV9jcmVhdGUobmV0KTsNCisJY2xudCA9 IG5ldzsNCisJaWYgKCFJU19FUlIobmV3KSkgew0KKwkJc3Bpbl9sb2NrKCZsbi0+bnNtX2NsbnRf bG9jayk7DQorCQlpZiAoIWxuLT5uc21fdXNlcnMpIHsNCisJCQlsbi0+bnNtX2NsbnQgPSBuZXc7 DQorCQkJbmV3ID0gTlVMTDsNCisJCX0NCisJCWNsbnQgPSBsbi0+bnNtX2NsbnQ7DQorCQlsbi0+ bnNtX3VzZXJzKys7DQorCQlzcGluX3VubG9jaygmbG4tPm5zbV9jbG50X2xvY2spOw0KKwkJaWYg KG5ldykNCisJCQlycGNfc2h1dGRvd25fY2xpZW50KG5ldyk7DQogCX0NCiAJbXV0ZXhfdW5sb2Nr KCZuc21fY3JlYXRlX211dGV4KTsNCiBvdXQ6DQotLSANCjEuNy4xMS43DQoNCg0KLS0gDQpUcm9u ZCBNeWtsZWJ1c3QNCkxpbnV4IE5GUyBjbGllbnQgbWFpbnRhaW5lcg0KDQpOZXRBcHANClRyb25k Lk15a2xlYnVzdEBuZXRhcHAuY29tDQp3d3cubmV0YXBwLmNvbQ0K ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <1351015039.4622.23.camel@lade.trondhjem.org>]
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) [not found] ` <1351015039.4622.23.camel@lade.trondhjem.org> @ 2012-10-23 18:23 ` Myklebust, Trond 2012-10-23 19:49 ` Nix 0 siblings, 1 reply; 27+ messages in thread From: Myklebust, Trond @ 2012-10-23 18:23 UTC (permalink / raw) To: Nix Cc: J. Bruce Fields, Ted Ts'o, linux-kernel@vger.kernel.org, Schumaker, Bryan, Peng Tao, gregkh@linuxfoundation.org, linux-nfs@vger.kernel.org, Stanislav Kinsbursky T24gVHVlLCAyMDEyLTEwLTIzIGF0IDEzOjU3IC0wNDAwLCBUcm9uZCBNeWtsZWJ1c3Qgd3JvdGU6 DQo+IE9uIFR1ZSwgMjAxMi0xMC0yMyBhdCAxNzo0NCArMDAwMCwgTXlrbGVidXN0LCBUcm9uZCB3 cm90ZToNCj4gPiBZb3UgY2FuJ3QgaG9sZCBhIHNwaW5sb2NrIHdoaWxlIHNsZWVwaW5nLiBCb3Ro IG11dGV4X2xvY2soKSBhbmQgbnNtX2NyZWF0ZSgpIGNhbiBkZWZpbml0ZWx5IHNsZWVwLg0KPiA+ IA0KPiA+IFRoZSBjb3JyZWN0IHdheSB0byBkbyB0aGlzIGlzIHRvIGdyYWIgdGhlIHNwaW5sb2Nr IGFuZCByZWNoZWNrIHRoZSB2YWx1ZSBvZiBsbi0+bnNtX3VzZXJzIGluc2lkZSB0aGUgJ2lmICgh SVNfRVJSKCkpJyBjb25kaXRpb24uIElmIGl0IGlzIHN0aWxsIHplcm8sIGJ1bXAgaXQgYW5kIHNl dCBsbi0+bnNtX2NsbnQsIG90aGVyd2lzZSBidW1wIGl0LCBnZXQgdGhlIGV4aXN0aW5nIGxuLT5u c21fY2xudCBhbmQgY2FsbCBycGNfc2h1dGRvd25fY2xudCgpIG9uIHRoZSByZWR1bmRhbnQgbnNt IGNsaWVudCBhZnRlciBkcm9wcGluZyB0aGUgc3BpbmxvY2suDQo+ID4gDQo+ID4gQ2hlZXJzDQo+ ID4gICBUcm9uZA0KPiANCj4gQ2FuIHlvdSBwbGVhc2UgY2hlY2sgaWYgdGhlIGZvbGxvd2luZyBw YXRjaCBmaXhlcyB0aGUgaXNzdWU/DQo+IA0KPiBDaGVlcnMNCj4gICBUcm9uZA0KPiANCk1laC4u LiBUaGlzIG9uZSBnZXRzIHJpZCBvZiB0aGUgMTAwJSByZWR1bmRhbnQgbXV0ZXguLi4NCg0KODwt LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LQ0KRnJvbSA0MTg3YzgxNmExNWRmMTI1NDRlYmNmYTZiOTYxZmNlOTY0NThlMjQ0IE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQ0KRnJvbTogVHJvbmQgTXlrbGVidXN0IDxUcm9uZC5NeWtsZWJ1c3RA bmV0YXBwLmNvbT4NCkRhdGU6IFR1ZSwgMjMgT2N0IDIwMTIgMTM6NTE6NTggLTA0MDANClN1Ympl Y3Q6IFtQQVRDSF0gTE9DS0Q6IGZpeCByYWNlcyBpbiBuc21fY2xpZW50X2dldA0KDQpDb21taXQg ZTk0MDZkYjIwZmVjYmZjYWI2NDZiYWQxNTdiNGNmZGM3Y2FkZGRmYiAobG9ja2Q6IHBlci1uZXQN Ck5TTSBjbGllbnQgY3JlYXRpb24gYW5kIGRlc3RydWN0aW9uIGhlbHBlcnMgaW50cm9kdWNlZCkg Y29udGFpbnMNCmEgbmFzdHkgcmFjZSBvbiBpbml0aWFsaXNhdGlvbiBvZiB0aGUgcGVyLW5ldCBO U00gY2xpZW50IGJlY2F1c2UNCml0IGRvZXNuJ3QgY2hlY2sgd2hldGhlciBvciBub3QgdGhlIGNs aWVudCBpcyBzZXQgYWZ0ZXIgZ3JhYmJpbmcNCnRoZSBuc21fY3JlYXRlX211dGV4Lg0KDQpSZXBv cnRlZC1ieTogTml4IDxuaXhAZXNwZXJpLm9yZy51az4NClNpZ25lZC1vZmYtYnk6IFRyb25kIE15 a2xlYnVzdCA8VHJvbmQuTXlrbGVidXN0QG5ldGFwcC5jb20+DQpDYzogc3RhYmxlQHZnZXIua2Vy bmVsLm9yZw0KLS0tDQogZnMvbG9ja2QvbW9uLmMgfCA0MyArKysrKysrKysrKysrKysrKysrKysr KysrKy0tLS0tLS0tLS0tLS0tLS0tDQogMSBmaWxlIGNoYW5nZWQsIDI2IGluc2VydGlvbnMoKyks IDE3IGRlbGV0aW9ucygtKQ0KDQpkaWZmIC0tZ2l0IGEvZnMvbG9ja2QvbW9uLmMgYi9mcy9sb2Nr ZC9tb24uYw0KaW5kZXggZTRmYjNiYS4uZmU2OTU2MCAxMDA2NDQNCi0tLSBhL2ZzL2xvY2tkL21v bi5jDQorKysgYi9mcy9sb2NrZC9tb24uYw0KQEAgLTg1LDI5ICs4NSwzOCBAQCBzdGF0aWMgc3Ry dWN0IHJwY19jbG50ICpuc21fY3JlYXRlKHN0cnVjdCBuZXQgKm5ldCkNCiAJcmV0dXJuIHJwY19j cmVhdGUoJmFyZ3MpOw0KIH0NCiANCitzdGF0aWMgc3RydWN0IHJwY19jbG50ICpuc21fY2xpZW50 X3NldChzdHJ1Y3QgbG9ja2RfbmV0ICpsbiwNCisJCXN0cnVjdCBycGNfY2xudCAqY2xudCkNCit7 DQorCXNwaW5fbG9jaygmbG4tPm5zbV9jbG50X2xvY2spOw0KKwlpZiAobG4tPm5zbV91c2VycyA9 PSAwKSB7DQorCQlpZiAoY2xudCA9PSBOVUxMKQ0KKwkJCWdvdG8gb3V0Ow0KKwkJbG4tPm5zbV9j bG50ID0gY2xudDsNCisJfQ0KKwljbG50ID0gbG4tPm5zbV9jbG50Ow0KKwlsbi0+bnNtX3VzZXJz Kys7DQorb3V0Og0KKwlzcGluX3VubG9jaygmbG4tPm5zbV9jbG50X2xvY2spOw0KKwlyZXR1cm4g Y2xudDsNCit9DQorDQogc3RhdGljIHN0cnVjdCBycGNfY2xudCAqbnNtX2NsaWVudF9nZXQoc3Ry dWN0IG5ldCAqbmV0KQ0KIHsNCi0Jc3RhdGljIERFRklORV9NVVRFWChuc21fY3JlYXRlX211dGV4 KTsNCi0Jc3RydWN0IHJwY19jbG50CSpjbG50Ow0KKwlzdHJ1Y3QgcnBjX2NsbnQJKmNsbnQsICpu ZXc7DQogCXN0cnVjdCBsb2NrZF9uZXQgKmxuID0gbmV0X2dlbmVyaWMobmV0LCBsb2NrZF9uZXRf aWQpOw0KIA0KLQlzcGluX2xvY2soJmxuLT5uc21fY2xudF9sb2NrKTsNCi0JaWYgKGxuLT5uc21f dXNlcnMpIHsNCi0JCWxuLT5uc21fdXNlcnMrKzsNCi0JCWNsbnQgPSBsbi0+bnNtX2NsbnQ7DQot CQlzcGluX3VubG9jaygmbG4tPm5zbV9jbG50X2xvY2spOw0KKwljbG50ID0gbnNtX2NsaWVudF9z ZXQobG4sIE5VTEwpOw0KKwlpZiAoY2xudCAhPSBOVUxMKQ0KIAkJZ290byBvdXQ7DQotCX0NCi0J c3Bpbl91bmxvY2soJmxuLT5uc21fY2xudF9sb2NrKTsNCiANCi0JbXV0ZXhfbG9jaygmbnNtX2Ny ZWF0ZV9tdXRleCk7DQotCWNsbnQgPSBuc21fY3JlYXRlKG5ldCk7DQotCWlmICghSVNfRVJSKGNs bnQpKSB7DQotCQlsbi0+bnNtX2NsbnQgPSBjbG50Ow0KLQkJc21wX3dtYigpOw0KLQkJbG4tPm5z bV91c2VycyA9IDE7DQotCX0NCi0JbXV0ZXhfdW5sb2NrKCZuc21fY3JlYXRlX211dGV4KTsNCisJ Y2xudCA9IG5ldyA9IG5zbV9jcmVhdGUobmV0KTsNCisJaWYgKElTX0VSUihjbG50KSkNCisJCWdv dG8gb3V0Ow0KKw0KKwljbG50ID0gbnNtX2NsaWVudF9zZXQobG4sIG5ldyk7DQorCWlmIChjbG50 ICE9IG5ldykNCisJCXJwY19zaHV0ZG93bl9jbGllbnQobmV3KTsNCiBvdXQ6DQogCXJldHVybiBj bG50Ow0KIH0NCi0tIA0KMS43LjExLjcNCg0KDQotLSANClRyb25kIE15a2xlYnVzdA0KTGludXgg TkZTIGNsaWVudCBtYWludGFpbmVyDQoNCk5ldEFwcA0KVHJvbmQuTXlrbGVidXN0QG5ldGFwcC5j b20NCnd3dy5uZXRhcHAuY29tDQo= ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) 2012-10-23 18:23 ` Myklebust, Trond @ 2012-10-23 19:49 ` Nix 2012-10-24 10:18 ` [PATCH] lockd: fix races in per-net NSM client handling Stanislav Kinsbursky 0 siblings, 1 reply; 27+ messages in thread From: Nix @ 2012-10-23 19:49 UTC (permalink / raw) To: Myklebust, Trond Cc: J. Bruce Fields, Ted Ts'o, linux-kernel@vger.kernel.org, Schumaker, Bryan, Peng Tao, gregkh@linuxfoundation.org, linux-nfs@vger.kernel.org, Stanislav Kinsbursky On 23 Oct 2012, Trond Myklebust outgrape: > On Tue, 2012-10-23 at 13:57 -0400, Trond Myklebust wrote: >> On Tue, 2012-10-23 at 17:44 +0000, Myklebust, Trond wrote: >> > You can't hold a spinlock while sleeping. Both mutex_lock() and nsm_create() can definitely sleep. >> > >> > The correct way to do this is to grab the spinlock and recheck the value of ln->nsm_users inside the 'if (!IS_ERR())' condition. If it is still zero, bump it and set ln->nsm_clnt, otherwise bump it, get the existing ln->nsm_clnt and call rpc_shutdown_clnt() on the redundant nsm client after dropping the spinlock. >> > >> > Cheers >> > Trond >> >> Can you please check if the following patch fixes the issue? >> >> Cheers >> Trond >> > Meh... This one gets rid of the 100% redundant mutex... No help, I'm afraid: [ 894.005699] ------------[ cut here ]------------ [ 894.005929] kernel BUG at fs/lockd/mon.c:159! [ 894.006156] invalid opcode: 0000 [#1] SMP [ 894.006451] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode] [ 894.007005] CPU 1 [ 894.007050] Pid: 1035, comm: lockd Not tainted 3.6.3-dirty #1 empty empty/S7010 [ 894.007669] RIP: 0010:[<ffffffff8120fbbc>] [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98 [ 894.008126] RSP: 0018:ffff880620a23ce0 EFLAGS: 00010246 [ 894.008355] RAX: ffff880620a23ce8 RBX: 0000000000000000 RCX: 0000000000000000 [ 894.008591] RDX: ffff880620a23d58 RSI: 0000000000000002 RDI: ffff880620a23d30 [ 894.008827] RBP: ffff880620a23d40 R08: 0000000000000000 R09: ffffea00188e4f00 [ 894.009063] R10: ffffffff814d032f R11: 0000000000000020 R12: 0000000000000000 [ 894.009300] R13: ffff88061f067e40 R14: ffff88061f067ee8 R15: ffff88062393dc00 [ 894.009537] FS: 0000000000000000(0000) GS:ffff88063fc40000(0000) knlGS:0000000000000000 [ 894.009956] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 894.010187] CR2: 00007f056a9a6ff0 CR3: 0000000001a0b000 CR4: 00000000000027e0 [ 894.010422] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 894.010659] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 894.010896] Process lockd (pid: 1035, threadinfo ffff880620a22000, task ffff8806208b5900) [ 894.011310] Stack: [ 894.011528] 0000000000000010 ffff8806102d3db1 00000003000186b5 ffffffff00000010 [ 894.012083] ffff8806102d3dc1 000000000000008c 0000000000000000 ffff880620a23ce8 [ 894.012637] ffff880620a23d58 0000000000000000 ffff88061f067ee8 ffff8806102d3d00 [ 894.013190] Call Trace: [ 894.013413] [<ffffffff8120ff07>] nsm_monitor+0x123/0x17e [ 894.013645] [<ffffffff81211b72>] nlm4svc_retrieve_args+0x62/0xd7 [ 894.013879] [<ffffffff81211f71>] nlm4svc_proc_lock+0x3c/0xb5 [ 894.014112] [<ffffffff812116a3>] ? nlm4svc_decode_lockargs+0x47/0xb2 [ 894.014349] [<ffffffff814d89fa>] svc_process+0x3bf/0x6a1 [ 894.014581] [<ffffffff8120d5f0>] lockd+0x127/0x164 [ 894.014810] [<ffffffff8120d4c9>] ? set_grace_period+0x8a/0x8a [ 894.015046] [<ffffffff8107bcbc>] kthread+0x8b/0x93 [ 894.015277] [<ffffffff81501334>] kernel_thread_helper+0x4/0x10 [ 894.015511] [<ffffffff8107bc31>] ? kthread_worker_fn+0xe1/0xe1 [ 894.015744] [<ffffffff81501330>] ? gs_change+0xb/0xb [ 894.015972] Code: b8 10 00 00 00 48 89 45 c0 48 8d 81 8c 00 00 00 b9 08 00 00 00 48 89 45 c8 89 d8 f3 ab 48 8d 45 a8 48 89 55 e0 48 89 45 d8 75 02 <0f> 0b 89 f6 48 c7 02 00 00 00 00 4c 89 c7 48 6b f6 38 ba 00 04 [ 894.018895] RIP [<ffffffff8120fbbc>] nsm_mon_unmon+0x64/0x98 [ 894.019163] RSP <ffff880620a23ce0> [ 894.019401] ---[ end trace b8ef5cb81bec72c8 ]--- Slightly different timing, but still boom. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH] lockd: fix races in per-net NSM client handling 2012-10-23 19:49 ` Nix @ 2012-10-24 10:18 ` Stanislav Kinsbursky 0 siblings, 0 replies; 27+ messages in thread From: Stanislav Kinsbursky @ 2012-10-24 10:18 UTC (permalink / raw) To: Trond.Myklebust; +Cc: bfields, linux-nfs, linux-kernel, devel This patch fixes two problems: 1) Removes races on NSM creation. 2) Fixes silly misprint on NSM client destruction (usage counter was checked for non-zero value instead of zero). Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> --- fs/lockd/mon.c | 35 +++++++++++++++++++++++------------ 1 files changed, 23 insertions(+), 12 deletions(-) diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c index e4fb3ba..e3e59f6 100644 --- a/fs/lockd/mon.c +++ b/fs/lockd/mon.c @@ -85,30 +85,41 @@ static struct rpc_clnt *nsm_create(struct net *net) return rpc_create(&args); } -static struct rpc_clnt *nsm_client_get(struct net *net) +static struct rpc_clnt *nsm_get_client(struct net *net) { - static DEFINE_MUTEX(nsm_create_mutex); - struct rpc_clnt *clnt; + struct rpc_clnt *clnt = NULL; struct lockd_net *ln = net_generic(net, lockd_net_id); spin_lock(&ln->nsm_clnt_lock); if (ln->nsm_users) { ln->nsm_users++; clnt = ln->nsm_clnt; - spin_unlock(&ln->nsm_clnt_lock); - goto out; } spin_unlock(&ln->nsm_clnt_lock); + return clnt; +} + +static struct rpc_clnt *nsm_client_get(struct net *net) +{ + static DEFINE_MUTEX(nsm_create_mutex); + struct rpc_clnt *clnt; + struct lockd_net *ln = net_generic(net, lockd_net_id); + + clnt = nsm_get_client(net); + if (clnt) + return clnt; mutex_lock(&nsm_create_mutex); - clnt = nsm_create(net); - if (!IS_ERR(clnt)) { - ln->nsm_clnt = clnt; - smp_wmb(); - ln->nsm_users = 1; + clnt = nsm_get_client(net); + if (clnt == NULL) { + clnt = nsm_create(net); + if (!IS_ERR(clnt)) { + ln->nsm_clnt = clnt; + smp_wmb(); + ln->nsm_users = 1; + } } mutex_unlock(&nsm_create_mutex); -out: return clnt; } @@ -120,7 +131,7 @@ static void nsm_client_put(struct net *net) spin_lock(&ln->nsm_clnt_lock); if (ln->nsm_users) { - if (--ln->nsm_users) + if (--ln->nsm_users == 0) ln->nsm_clnt = NULL; shutdown = !ln->nsm_users; } ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-23 16:32 ` Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) Nix 2012-10-23 16:46 ` J. Bruce Fields @ 2012-10-23 20:57 ` Nix 2012-10-24 1:13 ` Eric Sandeen 2012-10-26 20:35 ` Eric Sandeen 1 sibling, 2 replies; 27+ messages in thread From: Nix @ 2012-10-23 20:57 UTC (permalink / raw) To: Ted Ts'o, linux-ext4 Cc: linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs [Bruce, Trond, I fear it may be hard for me to continue chasing this NFS lockd crash as long as ext4 on 3.6.3 is hosing my filesystems like this. Apologies.] On 23 Oct 2012, nix@esperi.org.uk uttered the following: > Reproduced in 3.6.3, not in 3.6.1, not tried 3.6.2. Capturing it was > rendered somewhat difficult by an ext4/JBD2 bug which leads to data loss > in /var on every reboot out of 3.6.1 and on some reboots out of 3.6.3 (I > have runs of NULs in my logs now, which keep eating the oopses): > > [while in 3.6.1] > [ 88.565698] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash. > [ 88.799263] JBD2: Spotted dirty metadata buffer (dev = dm-5, blocknr = 0). There's a risk of filesystem corruption in case of system crash. > [ 89.648152] ------------[ cut here ]------------ > [ 89.648386] WARNING: at fs/inode.c:280 drop_nlink+0x25/0x42() > [ 89.648614] Hardware name: empty > [ 89.648833] Modules linked in: firewire_ohci firewire_core [last unloaded: microcode] > [ 89.649382] Pid: 1484, comm: dhcpd Not tainted 3.6.1-dirty #1 > [ 89.649610] Call Trace: > [ 89.649832] [<ffffffff810608c0>] warn_slowpath_common+0x83/0x9b > [ 89.650063] [<ffffffff810608f2>] warn_slowpath_null+0x1a/0x1c > [ 89.650292] [<ffffffff8112efbf>] drop_nlink+0x25/0x42 > [ 89.650533] [<ffffffff81187112>] ext4_dec_count+0x26/0x28 > [ 89.650763] [<ffffffff8118abb8>] ext4_rename+0x4ec/0x7b4 > [ 89.650993] [<ffffffff81125d81>] ? vfs_rename+0xbe/0x3b7 > [ 89.651224] [<ffffffff81125f3f>] vfs_rename+0x27c/0x3b7 > [ 89.651454] [<ffffffff81127a53>] sys_renameat+0x1b1/0x228 > [ 89.651682] [<ffffffff8114bda1>] ? fsnotify+0x226/0x249 > [ 89.651911] [<ffffffff81239b75>] ? security_inode_permission+0x1e/0x20 > [ 89.652145] [<ffffffff8111a240>] ? vfs_write+0x116/0x142 > [ 89.652372] [<ffffffff81127ae5>] sys_rename+0x1b/0x1e > [ 89.652601] [<ffffffff814fffa2>] system_call_fastpath+0x16/0x1b > [...] > [while having just booted into 3.6.1 after some time in 3.6.3: the FS > was clean, and fscked on the previous boot into 3.6.3 after a previous > instance of this bug] > Oct 23 17:18:26 spindle crit: [ 67.625319] EXT4-fs error (device dm-5): mb_free_blocks:1300: group 65, block 2143748:freeing already freed block (bit 13828) > > This may well be a 3.6.1-specific bug fixed in 3.6.3, but it's hard to > tell since most of my reboots are 3.6.1->3.6.3 or vice versa right now. It is now quite clear that this is a bug introduced by one or more of the post-3.6.1 ext4 patches (which have all been backported at least to 3.5, so the problem is probably there too). Rebooting from 3.6.3 back into 3.6.1, I saw this within seconds of boot: [ 60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 202, 1583 clusters in bitmap, 1675 in gd [ 60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash. [ 116.508621] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 5, 17749 clusters in bitmap, 17700 in gd [ 116.509626] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 8, 15814 clusters in bitmap, 16073 in gd [ 116.510103] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 10, 3349 clusters in bitmap, 3493 in gd [ 116.510571] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 12, 1792 clusters in bitmap, 1648 in gd [ 116.511691] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash. [ 116.512736] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: group 25, 14463 clusters in bitmap, 14462 in gd [ 116.513624] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash. [ 359.538550] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10024 [ 359.559220] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10572 [ 366.113780] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10024 [ 366.114837] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #73732: comm nfsd: deleted inode referenced: 10572 [ 456.013682] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303 [ 456.384454] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303 [ 457.508943] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303 [ 457.509422] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303 [ 457.509897] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm vi: deleted inode referenced: 10303 [ 478.779574] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303 [ 478.780047] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303 [ 478.780516] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303 [ 478.780983] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303 [ 478.782010] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303 [ 478.782480] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm zsh: deleted inode referenced: 10303 [ 479.826974] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm lesspipe.sh: deleted inode referenced: 10303 [ 479.834091] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm lesspipe.sh: deleted inode referenced: 10303 [ 479.835966] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm file: deleted inode referenced: 10303 [ 479.856946] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm less: deleted inode referenced: 10303 [ 479.857431] EXT4-fs error (device dm-3): ext4_lookup:1343: inode #49155: comm less: deleted inode referenced: 10303 [ 679.812704] EXT4-fs error (device dm-3): mb_free_blocks:1300: group 27, block 916489:freeing already freed block (bit 31753) (I'd provide more sample errors, but this bug has been eating newly-written logs in /var all day, so not much has survived.) I rebooted into 3.6.1 rescue mode and fscked everything: lots of orphans, block group corruption and cross-linked files. The problems did not recur upon booting from 3.6.1 into 3.6.1 again. It is quite clear that metadata changes made in 3.6.3 are not making it to disk reliably, thus leading to corrupted filesystems marked clean on reboot into other kernels: pretty much every file appended to in 3.6.3 loses some or all of its appended data, and newly allocated blocks often end up cross-linked between multiple files. The curious thing is this doesn't affect every filesystem: for a while it affected only /var, and now it's affecting only /var and /home. The massive writes to the ext4 filesystem mounted on /usr/src seem to have gone off without incident: fsck reports no problems. The only unusual thing about the filesystems on this machine are that they have hardware RAID-5 (using the Areca driver), so I'm mounting with 'nobarrier': the full set of options for all my ext4 filesystems are: rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota, usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota If there's anything I can do to help, I'm happy to do it, once I've restored my home directory from backup :( tune2fs output for one of the afflicted filesystems (after fscking): tune2fs 1.42.2 (9-Apr-2012) Filesystem volume name: home Last mounted on: /home Filesystem UUID: 95bd22c2-253c-456f-8e36-b6cfb9ecd4ef Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options: (none) Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 3276800 Block count: 13107200 Reserved block count: 655360 Free blocks: 5134852 Free inodes: 3174777 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 20 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8192 Inode blocks per group: 512 RAID stripe width: 16 Flex block group size: 64 Filesystem created: Tue May 26 21:29:41 2009 Last mount time: Tue Oct 23 21:32:07 2012 Last write time: Tue Oct 23 21:32:07 2012 Mount count: 2 Maximum mount count: 20 Last checked: Tue Oct 23 21:22:16 2012 Check interval: 15552000 (6 months) Next check after: Sun Apr 21 21:22:16 2013 Lifetime writes: 1092 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode: 8 First orphan inode: 1572907 Default directory hash: half_md4 Directory Hash Seed: a201983d-d8a3-460b-93ca-eb7804b62c23 Journal backup: inode blocks ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-23 20:57 ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix @ 2012-10-24 1:13 ` Eric Sandeen 2012-10-24 4:15 ` Nix 2012-10-26 20:35 ` Eric Sandeen 1 sibling, 1 reply; 27+ messages in thread From: Eric Sandeen @ 2012-10-24 1:13 UTC (permalink / raw) To: Nix Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On 10/23/12 3:57 PM, Nix wrote: <snip> > (I'd provide more sample errors, but this bug has been eating > newly-written logs in /var all day, so not much has survived.) > > I rebooted into 3.6.1 rescue mode and fscked everything: lots of > orphans, block group corruption and cross-linked files. The problems did > not recur upon booting from 3.6.1 into 3.6.1 again. It is quite clear > that metadata changes made in 3.6.3 are not making it to disk reliably, > thus leading to corrupted filesystems marked clean on reboot into other > kernels: pretty much every file appended to in 3.6.3 loses some or all > of its appended data, and newly allocated blocks often end up > cross-linked between multiple files. > > The curious thing is this doesn't affect every filesystem: for a while > it affected only /var, and now it's affecting only /var and /home. The > massive writes to the ext4 filesystem mounted on /usr/src seem to have > gone off without incident: fsck reports no problems. > > > The only unusual thing about the filesystems on this machine are that > they have hardware RAID-5 (using the Areca driver), so I'm mounting with > 'nobarrier': I should have read more. :( More questions follow: * Does the Areca have a battery backed write cache? * Are you crashing or rebooting cleanly? * Do you see log recovery messages in the logs for this filesystem? > the full set of options for all my ext4 filesystems are: > > rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota, > usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota ok journal_async_commit is off the reservation a bit; that's really not tested, and Jan had serious reservations about its safety. * Can you reproduce this w/o journal_async_commit? -Eric > If there's anything I can do to help, I'm happy to do it, once I've > restored my home directory from backup :( > > > tune2fs output for one of the afflicted filesystems (after fscking): > > tune2fs 1.42.2 (9-Apr-2012) > Filesystem volume name: home > Last mounted on: /home > Filesystem UUID: 95bd22c2-253c-456f-8e36-b6cfb9ecd4ef > Filesystem magic number: 0xEF53 > Filesystem revision #: 1 (dynamic) > Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize > Filesystem flags: signed_directory_hash > Default mount options: (none) > Filesystem state: clean > Errors behavior: Continue > Filesystem OS type: Linux > Inode count: 3276800 > Block count: 13107200 > Reserved block count: 655360 > Free blocks: 5134852 > Free inodes: 3174777 > First block: 0 > Block size: 4096 > Fragment size: 4096 > Reserved GDT blocks: 20 > Blocks per group: 32768 > Fragments per group: 32768 > Inodes per group: 8192 > Inode blocks per group: 512 > RAID stripe width: 16 > Flex block group size: 64 > Filesystem created: Tue May 26 21:29:41 2009 > Last mount time: Tue Oct 23 21:32:07 2012 > Last write time: Tue Oct 23 21:32:07 2012 > Mount count: 2 > Maximum mount count: 20 > Last checked: Tue Oct 23 21:22:16 2012 > Check interval: 15552000 (6 months) > Next check after: Sun Apr 21 21:22:16 2013 > Lifetime writes: 1092 GB > Reserved blocks uid: 0 (user root) > Reserved blocks gid: 0 (group root) > First inode: 11 > Inode size: 256 > Required extra isize: 28 > Desired extra isize: 28 > Journal inode: 8 > First orphan inode: 1572907 > Default directory hash: half_md4 > Directory Hash Seed: a201983d-d8a3-460b-93ca-eb7804b62c23 > Journal backup: inode blocks > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-24 1:13 ` Eric Sandeen @ 2012-10-24 4:15 ` Nix 2012-10-24 4:27 ` Eric Sandeen 0 siblings, 1 reply; 27+ messages in thread From: Nix @ 2012-10-24 4:15 UTC (permalink / raw) To: Eric Sandeen Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On 24 Oct 2012, Eric Sandeen uttered the following: > On 10/23/12 3:57 PM, Nix wrote: >> The only unusual thing about the filesystems on this machine are that >> they have hardware RAID-5 (using the Areca driver), so I'm mounting with >> 'nobarrier': > > I should have read more. :( More questions follow: > > * Does the Areca have a battery backed write cache? Yes (though I'm not powering off, just rebooting). Battery at 100% and happy, though the lack of power-off means it's not actually getting used, since the cache is obviously mains-backed as well. > * Are you crashing or rebooting cleanly? Rebooting cleanly, everything umounted happily including /home and /var. > * Do you see log recovery messages in the logs for this filesystem? My memory says yes, but nothing seems to be logged when this happens (though with my logs on the first filesystem damaged by this, this is rather hard to tell, they're all quite full of NULs by now). I'll double-reboot tomorrow via the faulty kernel and check, unless I get asked not to in the interim. (And then double-reboot again to fsck everything...) >> the full set of options for all my ext4 filesystems are: >> >> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota, >> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota > > ok journal_async_commit is off the reservation a bit; that's really not > tested, and Jan had serious reservations about its safety. OK, well, I've been 'testing' it for years :) No problems until now. (If anything, I was more concerned about journal_checksum. I thought that had actually been implicated in corruption before now...) > * Can you reproduce this w/o journal_async_commit? I can try! -- NULL && (void) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-24 4:15 ` Nix @ 2012-10-24 4:27 ` Eric Sandeen 0 siblings, 0 replies; 27+ messages in thread From: Eric Sandeen @ 2012-10-24 4:27 UTC (permalink / raw) To: Nix Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On 10/23/12 11:15 PM, Nix wrote: > On 24 Oct 2012, Eric Sandeen uttered the following: > >> On 10/23/12 3:57 PM, Nix wrote: >>> The only unusual thing about the filesystems on this machine are that >>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with >>> 'nobarrier': >> >> I should have read more. :( More questions follow: >> >> * Does the Areca have a battery backed write cache? > > Yes (though I'm not powering off, just rebooting). Battery at 100% and > happy, though the lack of power-off means it's not actually getting > used, since the cache is obviously mains-backed as well. > >> * Are you crashing or rebooting cleanly? > > Rebooting cleanly, everything umounted happily including /home and /var. > >> * Do you see log recovery messages in the logs for this filesystem? > > My memory says yes, but nothing seems to be logged when this happens > (though with my logs on the first filesystem damaged by this, this is > rather hard to tell, they're all quite full of NULs by now). > > I'll double-reboot tomorrow via the faulty kernel and check, unless I > get asked not to in the interim. (And then double-reboot again to fsck > everything...) > >>> the full set of options for all my ext4 filesystems are: >>> >>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota, >>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota >> >> ok journal_async_commit is off the reservation a bit; that's really not >> tested, and Jan had serious reservations about its safety. > > OK, well, I've been 'testing' it for years :) No problems until now. (If > anything, I was more concerned about journal_checksum. I thought that > had actually been implicated in corruption before now...) It had, but I fixed it AFAIK; OTOH, we turned it off by default after that episode. >> * Can you reproduce this w/o journal_async_commit? > > I can try! Ok, fair enough. If the BBU is working, nobarrier is ok; I don't trust journal_async_commit, but that doesn't mean this isn't a regression. Thanks for the answers... onward. :) -Eric ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-23 20:57 ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix 2012-10-24 1:13 ` Eric Sandeen @ 2012-10-26 20:35 ` Eric Sandeen 2012-10-26 20:37 ` Nix 1 sibling, 1 reply; 27+ messages in thread From: Eric Sandeen @ 2012-10-26 20:35 UTC (permalink / raw) To: Nix Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On 10/23/12 3:57 PM, Nix wrote: > [Bruce, Trond, I fear it may be hard for me to continue chasing this NFS > lockd crash as long as ext4 on 3.6.3 is hosing my filesystems like > this. Apologies.] <big snip> > The only unusual thing about the filesystems on this machine are that > they have hardware RAID-5 (using the Areca driver), so I'm mounting with > 'nobarrier': the full set of options for all my ext4 filesystems are: > > rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota, > usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota Out of curiosity, when I test log replay with the journal_checksum option, I almost always get something like: [ 999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt. [ 999.923904] EXT4-fs (dm-1): error loading journal after a simulated crash & log replay. Do you see anything like that in your logs? <big snip> Thanks, -Eric ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-26 20:35 ` Eric Sandeen @ 2012-10-26 20:37 ` Nix 2012-10-26 20:56 ` Theodore Ts'o 0 siblings, 1 reply; 27+ messages in thread From: Nix @ 2012-10-26 20:37 UTC (permalink / raw) To: Eric Sandeen Cc: Ted Ts'o, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On 26 Oct 2012, Eric Sandeen outgrape: > On 10/23/12 3:57 PM, Nix wrote: >> The only unusual thing about the filesystems on this machine are that >> they have hardware RAID-5 (using the Areca driver), so I'm mounting with >> 'nobarrier': the full set of options for all my ext4 filesystems are: >> >> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota, >> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota > > Out of curiosity, when I test log replay with the journal_checksum option, I > almost always get something like: > > [ 999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt. > [ 999.923904] EXT4-fs (dm-1): error loading journal > > after a simulated crash & log replay. > > Do you see anything like that in your logs? I'm not seeing any corrupt journals or abort messages at all. The journal claims to be fine, but plainly isn't. I can reproduce this on a small filesystem and stick the image somewhere if that would be of any use to anyone. (If I'm very lucky, merely making this offer will make the problem go away. :} ) -- NULL && (void) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-26 20:37 ` Nix @ 2012-10-26 20:56 ` Theodore Ts'o 2012-10-26 20:59 ` Nix 0 siblings, 1 reply; 27+ messages in thread From: Theodore Ts'o @ 2012-10-26 20:56 UTC (permalink / raw) To: Nix Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote: > > I can reproduce this on a small filesystem and stick the image somewhere > if that would be of any use to anyone. (If I'm very lucky, merely making > this offer will make the problem go away. :} ) I'm not sure the image is going to be that useful. What we really need to do is to get a reliable reproduction of what _you_ are seeing. It's clear from Eric's experiments that journal_checksum is dangerous. In fact, I will likely put it under an #ifdef EXT4_EXPERIMENTAL to try to discourage people from using it in the future. There are things I've been planning on doing to make it be safer, but there's a very good *reason* that both journal_checksum and journal_async_commit are not on by default. That's why one of the things I asked you to do when you had time was to see if you could reproduce the problem you are seeing w/o nobarrier,journal_checksum,journal_async_commit. The other experiment that would be really useful if you could do is to try to apply these two patches which I sent earlier this week: [PATCH 1/2] ext4: revert "jbd2: don't write superblock when if its empty [PATCH 2/2] ext4: fix I/O error when unmounting an ro file system ... and see if they make a difference. If they don't make a difference, I don't want to apply patches just for placebo/PR reasons. And for Eric at least, he can reproduce the journal checksum error followed by fairly significant corruption reported by e2fsck with journal_checksum, and the presence or absense of these patches make no difference for him. So I really don't want to push these patches to Linus until I get confirmation that they make a difference to *somebody*. Regards, - Ted ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-26 20:56 ` Theodore Ts'o @ 2012-10-26 20:59 ` Nix 2012-10-26 21:15 ` Theodore Ts'o 0 siblings, 1 reply; 27+ messages in thread From: Nix @ 2012-10-26 20:59 UTC (permalink / raw) To: Theodore Ts'o Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On 26 Oct 2012, Theodore Ts'o stated: > On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote: >> >> I can reproduce this on a small filesystem and stick the image somewhere >> if that would be of any use to anyone. (If I'm very lucky, merely making >> this offer will make the problem go away. :} ) > > I'm not sure the image is going to be that useful. What we really > need to do is to get a reliable reproduction of what _you_ are seeing. > > It's clear from Eric's experiments that journal_checksum is dangerous. > > That's why one of the things I asked you to do when you had time was > to see if you could reproduce the problem you are seeing w/o > nobarrier,journal_checksum,journal_async_commit. OK. Will do tomorrow. > The other experiment that would be really useful if you could do is to > try to apply these two patches which I sent earlier this week: > > [PATCH 1/2] ext4: revert "jbd2: don't write superblock when if its empty > [PATCH 2/2] ext4: fix I/O error when unmounting an ro file system > > ... and see if they make a difference. As of tomorrow I'll be able to reboot without causing a riot: I'll test it then. (Sorry for the delay :( ) > So I really don't want > to push these patches to Linus until I get confirmation that they make > a difference to *somebody*. Agreed. This isn't the first time that journal_checksum has proven problematic. It's a shame that we're stuck between two error-inducing stools here... -- NULL && (void) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-26 20:59 ` Nix @ 2012-10-26 21:15 ` Theodore Ts'o 2012-10-26 21:19 ` Nix 2012-10-27 3:11 ` Jim Rees 0 siblings, 2 replies; 27+ messages in thread From: Theodore Ts'o @ 2012-10-26 21:15 UTC (permalink / raw) To: Nix Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs > This isn't the first time that journal_checksum has proven problematic. > It's a shame that we're stuck between two error-inducing stools here... The problem is that it currently bails out be aborting the entire journal replay, and the file system will get left in a mess when it does that. It's actually safer today to just be blissfully ignorant of a corrupted block in the journal, than to have the journal getting aborted mid-replay when we detect a corrupted commit. The plan is that eventually, we will have checksums on a per-journalled block basis, instead of a per-commit basis, and when we get a failed checksum, we skip the replay of that block, but we keep going and replay all of the other blocks and commits. We'll then set the "file system corrupted" bit and force an e2fsck check. The problem is this code isn't done yet, and journal_checksum is really not ready for prime time. When it is ready, my plan is to wire it up so it is enabled by default; at the moment, it was intended for developer experimentation only. As I said, it's my fault for not clearly labelling it "Not for you!", or putting it under an #ifdef to prevent unwary civilians from coming across the feature and saying, "oooh, shiny!" and turning it on. :-( - Ted ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-26 21:15 ` Theodore Ts'o @ 2012-10-26 21:19 ` Nix 2012-10-27 0:22 ` Theodore Ts'o 2012-10-27 3:11 ` Jim Rees 1 sibling, 1 reply; 27+ messages in thread From: Nix @ 2012-10-26 21:19 UTC (permalink / raw) To: Theodore Ts'o Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On 26 Oct 2012, Theodore Ts'o uttered the following: > The plan is that eventually, we will have checksums on a > per-journalled block basis, instead of a per-commit basis, and when we > get a failed checksum, we skip the replay of that block, But not of everything it implies, since that's quite tricky to track down (it's basically the same work needed for softupdates, but in reverse). Hence the e2fsck check, I suppose. > prevent unwary civilians from coming across the feature and saying, > "oooh, shiny!" and turning it on. :-( Or having it turned on by default either, which seems to be the case now. -- NULL && (void) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-26 21:19 ` Nix @ 2012-10-27 0:22 ` Theodore Ts'o 0 siblings, 0 replies; 27+ messages in thread From: Theodore Ts'o @ 2012-10-27 0:22 UTC (permalink / raw) To: Nix Cc: Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs On Fri, Oct 26, 2012 at 10:19:21PM +0100, Nix wrote: > > prevent unwary civilians from coming across the feature and saying, > > "oooh, shiny!" and turning it on. :-( > > Or having it turned on by default either, which seems to be the case > now. Huh? It's not turned on by default. If you mount with no mount options, journal checksums are *not* turned on. - Ted ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) 2012-10-26 21:15 ` Theodore Ts'o 2012-10-26 21:19 ` Nix @ 2012-10-27 3:11 ` Jim Rees 1 sibling, 0 replies; 27+ messages in thread From: Jim Rees @ 2012-10-27 3:11 UTC (permalink / raw) To: Theodore Ts'o, Nix, Eric Sandeen, linux-ext4, linux-kernel, J. Bruce Fields, Bryan Schumaker, Peng Tao, Trond.Myklebust, gregkh, linux-nfs Theodore Ts'o wrote: The problem is this code isn't done yet, and journal_checksum is really not ready for prime time. When it is ready, my plan is to wire it up so it is enabled by default; at the moment, it was intended for developer experimentation only. As I said, it's my fault for not clearly labelling it "Not for you!", or putting it under an #ifdef to prevent unwary civilians from coming across the feature and saying, "oooh, shiny!" and turning it on. :-( Perhaps a word or two in the mount man page would be appropriate? ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2012-10-27  3:12 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <87objupjlr.fsf@spindle.srvr.nix>
2012-10-23  1:33 ` Heads-up: 3.6.2 / 3.6.3 NFS server panic: 3.6.2+ regression? J. Bruce Fields
2012-10-23 14:07   ` Nix
2012-10-23 14:30     ` J. Bruce Fields
2012-10-23 16:32       ` Heads-up: 3.6.2 / 3.6.3 NFS server oops: 3.6.2+ regression? (also an unrelated ext4 data loss bug) Nix
2012-10-23 16:46         ` J. Bruce Fields
2012-10-23 16:54           ` J. Bruce Fields
2012-10-23 16:56           ` Myklebust, Trond
2012-10-23 17:05             ` Nix
2012-10-23 17:36               ` Nix
2012-10-23 17:43                 ` J. Bruce Fields
2012-10-23 17:44                 ` Myklebust, Trond
2012-10-23 17:57                   ` Myklebust, Trond
     [not found]                   ` <1351015039.4622.23.camel@lade.trondhjem.org>
2012-10-23 18:23                     ` Myklebust, Trond
2012-10-23 19:49                       ` Nix
2012-10-24 10:18                         ` [PATCH] lockd: fix races in per-net NSM client handling Stanislav Kinsbursky
2012-10-23 20:57         ` Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?) Nix
2012-10-24  1:13           ` Eric Sandeen
2012-10-24  4:15             ` Nix
2012-10-24  4:27               ` Eric Sandeen
2012-10-26 20:35           ` Eric Sandeen
2012-10-26 20:37             ` Nix
2012-10-26 20:56               ` Theodore Ts'o
2012-10-26 20:59                 ` Nix
2012-10-26 21:15                   ` Theodore Ts'o
2012-10-26 21:19                     ` Nix
2012-10-27  0:22                       ` Theodore Ts'o
2012-10-27  3:11                     ` Jim Rees
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).