* [Ocfs2-devel] dlmglue fixes
@ 2010-01-21 18:50 Sunil Mushran
From: Sunil Mushran @ 2010-01-21 18:50 UTC
To: ocfs2-devel
David,
So here are the two patches. Remove all patches that you have and apply
these.
The first one is straightforward.
The second one will hopefully fix the livelock issue you have been
encountering.
People reviewing the patches should note that the second one is slightly
different than the one I posted earlier. It removes the BUG_ON in the if
condition where we jump to update_holders. The accompanying comment has
also been updated.
Thanks
Sunil

* [Ocfs2-devel] [PATCH 1/2] ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast
From: Sunil Mushran @ 2010-01-21 18:50 UTC
To: ocfs2-devel

From: Wengang Wang <wen.gang.wang@oracle.com>

During bast, set the OCFS2_LOCK_BLOCKED flag only if the lock needs to be
downconverted.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/dlmglue.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 0d38d67..0190f31 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -907,8 +907,6 @@ static int ocfs2_generic_handle_bast(struct ocfs2_lock_res *lockres,
 
 	assert_spin_locked(&lockres->l_lock);
 
-	lockres_or_flags(lockres, OCFS2_LOCK_BLOCKED);
-
 	if (level > lockres->l_blocking) {
 		/* only schedule a downconvert if we haven't already scheduled
 		 * one that goes low enough to satisfy the level we're
@@ -921,6 +919,9 @@ static int ocfs2_generic_handle_bast(struct ocfs2_lock_res *lockres,
 		lockres->l_blocking = level;
 	}
 
+	if (needs_downconvert)
+		lockres_or_flags(lockres, OCFS2_LOCK_BLOCKED);
+
 	mlog_exit(needs_downconvert);
 	return needs_downconvert;
 }
-- 
1.6.3.3
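
For readers following along, here is a minimal userspace sketch of the behaviour patch 1/2 describes: the blocked flag is raised only when the bast actually requires a downconvert. Everything prefixed `sk_`/`SK_` is hypothetical and exists only for illustration. The compatibility helper is an assumption about how the part of the function not shown in the diff decides `needs_downconvert`, and the numeric levels are assumed to follow the usual DLM ordering (NL lowest, EX highest).

```c
/* Hypothetical stand-ins for the kernel definitions; not the real ocfs2 types. */
enum { SK_NL = 0, SK_PR = 3, SK_EX = 5 };
#define SK_BLOCKED 0x1u

struct sk_lockres {
	unsigned int flags;
	int level;	/* level this node currently holds */
	int blocking;	/* highest level a remote node is known to want */
};

/* Assumed rule: the highest level we may keep while a remote wants 'level'. */
static int sk_highest_compat(int level)
{
	if (level == SK_EX)
		return SK_NL;
	if (level == SK_PR)
		return SK_PR;
	return SK_EX;
}

/* Returns 1 if this bast requires scheduling a downconvert, else 0. */
static int sk_handle_bast(struct sk_lockres *res, int level)
{
	int needs_downconvert = 0;

	if (level > res->blocking) {
		/* Only a bast that demands a lower level than any earlier
		 * one needs new work; duplicate basts fall through. */
		if (sk_highest_compat(level) < sk_highest_compat(res->blocking))
			needs_downconvert = 1;
		res->blocking = level;
	}

	/* Post-patch ordering: flag the lock as blocked only when needed. */
	if (needs_downconvert)
		res->flags |= SK_BLOCKED;

	return needs_downconvert;
}
```

With the pre-patch ordering, a duplicate bast at an already-recorded level would still have set the blocked flag even though no downconvert was scheduled; in this sketch the second identical call leaves the flags untouched.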

* [Ocfs2-devel] [PATCH 2/2] ocfs2: Prevent a livelock in dlmglue
From: Sunil Mushran @ 2010-01-21 18:50 UTC
To: ocfs2-devel

There is possibility of a livelock in __ocfs2_cluster_lock(). If a node were
to get an ast for an upconvert request, followed immediately by a bast,
there is a small probability the fs may downconvert the lock before the
process, that requested the upconvert, is able to take the lock.

This patch adds a new flag to indicate that the upconvert is still in
progress and that the dc thread should not downconvert it right now.

Wengang Wang <wen.gang.wang@oracle.com> and Joel Becker
<joel.becker@oracle.com> contributed heavily to this patch.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
---
 fs/ocfs2/dlmglue.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
 fs/ocfs2/ocfs2.h   |    4 ++++
 2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 0190f31..f7b9f8f 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -875,6 +875,14 @@ static inline void ocfs2_generic_handle_convert_action(struct ocfs2_lock_res *lo
 		lockres_or_flags(lockres, OCFS2_LOCK_NEEDS_REFRESH);
 
 	lockres->l_level = lockres->l_requested;
+
+	/*
+	 * We set the OCFS2_LOCK_UPCONVERT_FINISHING flag before clearing
+	 * the OCFS2_LOCK_BUSY flag to prevent the dc thread from
+	 * downconverting the lock before the upconvert has fully completed.
+	 */
+	lockres_or_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
+
 	lockres_clear_flags(lockres, OCFS2_LOCK_BUSY);
 
 	mlog_exit_void();
@@ -1134,6 +1142,7 @@ static inline void ocfs2_recover_from_dlm_error(struct ocfs2_lock_res *lockres,
 	mlog_entry_void();
 	spin_lock_irqsave(&lockres->l_lock, flags);
 	lockres_clear_flags(lockres, OCFS2_LOCK_BUSY);
+	lockres_clear_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
 	if (convert)
 		lockres->l_action = OCFS2_AST_INVALID;
 	else
@@ -1324,13 +1333,13 @@ static int __ocfs2_cluster_lock(struct ocfs2_super *osb,
 again:
 	wait = 0;
 
+	spin_lock_irqsave(&lockres->l_lock, flags);
+
 	if (catch_signals && signal_pending(current)) {
 		ret = -ERESTARTSYS;
-		goto out;
+		goto unlock;
 	}
 
-	spin_lock_irqsave(&lockres->l_lock, flags);
-
 	mlog_bug_on_msg(lockres->l_flags & OCFS2_LOCK_FREEING,
 			"Cluster lock called on freeing lockres %s! flags "
 			"0x%lx\n", lockres->l_name, lockres->l_flags);
@@ -1347,6 +1356,25 @@ again:
 		goto unlock;
 	}
 
+	if (lockres->l_flags & OCFS2_LOCK_UPCONVERT_FINISHING) {
+		/*
+		 * We've upconverted. If the lock now has a level we can
+		 * work with, we take it. If, however, the lock is not at the
+		 * required level, we go thru the full cycle. One way this could
+		 * happen is if a process requesting an upconvert to PR is
+		 * closely followed by another requesting upconvert to an EX.
+		 * If the process requesting EX lands here, we want it to
+		 * continue attempting to upconvert and let the process
+		 * requesting PR take the lock.
+		 * If multiple processes request upconvert to PR, the first one
+		 * here will take the lock. The others will have to go thru the
+		 * OCFS2_LOCK_BLOCKED check to ensure that there is no pending
+		 * downconvert request.
+		 */
+		if (level <= lockres->l_level)
+			goto update_holders;
+	}
+
 	if (lockres->l_flags & OCFS2_LOCK_BLOCKED &&
 	    !ocfs2_may_continue_on_blocked_lock(lockres, level)) {
 		/* is the lock is currently blocked on behalf of
@@ -1417,11 +1445,14 @@ again:
 		goto again;
 	}
 
+update_holders:
 	/* Ok, if we get here then we're good to go. */
 	ocfs2_inc_holders(lockres, level);
 
 	ret = 0;
 unlock:
+	lockres_clear_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
+
 	spin_unlock_irqrestore(&lockres->l_lock, flags);
 out:
 	/*
@@ -3402,6 +3433,18 @@ recheck:
 		goto leave;
 	}
 
+	/*
+	 * This prevents livelocks. OCFS2_LOCK_UPCONVERT_FINISHING flag is
+	 * set when the ast is received for an upconvert just before the
+	 * OCFS2_LOCK_BUSY flag is cleared. Now if the fs received a bast
+	 * on the heels of the ast, we want to delay the downconvert just
+	 * enough to allow the up requestor to do its task. Because this
+	 * lock is in the blocked queue, the lock will be downconverted
+	 * as soon as the requestor is done with the lock.
+	 */
+	if (lockres->l_flags & OCFS2_LOCK_UPCONVERT_FINISHING)
+		goto leave_requeue;
+
 	/* if we're blocking an exclusive and we have *any* holders,
 	 * then requeue. */
 	if ((lockres->l_blocking == DLM_LOCK_EX)
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index d963d86..782e77d 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -136,6 +136,10 @@ enum ocfs2_unlock_action {
 #define OCFS2_LOCK_PENDING       (0x00000400) /* This lockres is pending a
 					       call to dlm_lock.  Only
 					       exists with BUSY set. */
+#define OCFS2_LOCK_UPCONVERT_FINISHING (0x00000800) /* blocks the dc thread
+						     * from downconverting
+						     * before the upconvert
+						     * has completed */
 
 struct ocfs2_lock_res_ops;
 
-- 
1.6.3.3
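
The handshake that patch 2/2 sets up between the ast path, the lock requester and the dc thread can be sketched in plain userspace C as follows. This is a simplified model, not the dlmglue code: all `sk_`/`SK_` names are hypothetical, locking and waiting are omitted, and the blocked-lock check is reduced to a plain flag test where the real code calls ocfs2_may_continue_on_blocked_lock().

```c
#include <stdbool.h>

#define SK_BUSY			0x1u
#define SK_BLOCKED		0x2u
#define SK_UPCONVERT_FINISHING	0x4u

struct sk_lockres {
	unsigned int flags;
	int level;	/* level granted by the DLM */
	int requested;	/* level asked of the DLM */
	int holders;	/* local users of the lock */
};

/* ast for an upconvert: set FINISHING before BUSY is dropped. */
static void sk_upconvert_ast(struct sk_lockres *res)
{
	res->level = res->requested;
	res->flags |= SK_UPCONVERT_FINISHING;
	res->flags &= ~SK_BUSY;
}

/* bast arriving on the heels of the ast: another node wants the lock. */
static void sk_bast(struct sk_lockres *res)
{
	res->flags |= SK_BLOCKED;
}

/* dc thread: true if it may downconvert now, false if it must requeue. */
static bool sk_dc_may_downconvert(const struct sk_lockres *res)
{
	if (res->flags & (SK_BUSY | SK_UPCONVERT_FINISHING))
		return false;
	return res->holders == 0;
}

/* Requester: true if it became a holder at 'level'. */
static bool sk_try_take(struct sk_lockres *res, int level)
{
	bool taken = false;

	if ((res->flags & SK_UPCONVERT_FINISHING) && level <= res->level)
		taken = true;	/* fresh upconvert: skip the BLOCKED check */
	else if (!(res->flags & (SK_BUSY | SK_BLOCKED)) && level <= res->level)
		taken = true;

	if (taken)
		res->holders++;

	/* As in the patch, the flag is cleared on the way out. */
	res->flags &= ~SK_UPCONVERT_FINISHING;
	return taken;
}
```

In this model, calling sk_upconvert_ast() and then sk_bast() leaves the dc thread unable to downconvert until sk_try_take() has run once, which is the window the patch is meant to protect. After the holder count drops back to zero, the dc thread can proceed as usual.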

* [Ocfs2-devel] dlmglue fixes
From: Joel Becker @ 2010-01-26 12:33 UTC
To: ocfs2-devel

On Thu, Jan 21, 2010 at 10:50:01AM -0800, Sunil Mushran wrote:
> So here are the two patches. Remove all patches that you have and apply
> these.
>
> The first one is straightforward.
>
> The second one will hopefully fix the livelock issue you have been
> encountering.
>
> People reviewing the patches should note that the second one is slightly
> different than the one I posted earlier. It removes the BUG_ON in the if
> condition where we jump to update_holders. The accompanying comment has
> also been updated.

David,
	Don't know if you saw this, so I'm adding you to the CC.
Hopefully you can test it out; let us know if there are further problems!

Joel

-- 

"I inject pure kryptonite into my brain.
 It improves my kung fu, and it eases the pain."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

* [Ocfs2-devel] dlmglue fixes
From: David Teigland @ 2010-01-26 16:37 UTC
To: ocfs2-devel

On Tue, Jan 26, 2010 at 04:33:26AM -0800, Joel Becker wrote:
> On Thu, Jan 21, 2010 at 10:50:01AM -0800, Sunil Mushran wrote:
> > So here are the two patches. Remove all patches that you have and apply
> > these.

I ran http://people.redhat.com/~teigland/make_panic on three nodes for 15
minutes without any problem, so that's a big improvement.

Then I tried another little test on three nodes which quickly triggered a
BUG, http://people.redhat.com/~teigland/alternate.c

node1: alternate test 0 0 3
node2: alternate test 0 1 3
node3: alternate test 0 2 3

------------[ cut here ]------------
kernel BUG at fs/ocfs2/dlmglue.c:3281!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:02.0/0000:86:01.0/local_cpus
CPU 1
Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath shpchp amd64_edac_mod edac_core serio_raw tg3 i2c_nforce2 k8temp i2c_core qla2xxx mptspi mptscsih scsi_transport_fc ata_generic mptbase pata_acpi scsi_tgt scsi_transport_spi sata_nv pata_amd [last unloaded: scsi_wait_scan]
Pid: 2523, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
RIP: 0010:[<ffffffffa020593d>] [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
RSP: 0018:ffff88007cd89d90 EFLAGS: 00010082
RAX: 000000000000005b RBX: ffff88007c5ccc50 RCX: 0000000000000aef
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
RBP: ffff88007cd89db0 R08: ffff88007cd89cd0 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000006db00 R12: 0000000000000000
R13: ffff88007cc20000 R14: 0000000000000293 R15: ffff88007c5ccc68
FS: 00007f77b5a4e700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000011d8178 CR3: 000000013cee0000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ocfs2dc (pid: 2523, threadinfo ffff88007cd88000, task ffff880037d00000)
Stack:
 ffff880000000000 ffff88007c5ccc50 ffff88007c5ccc50 0000000000000000
<0> ffff88007cd89ee0 ffffffffa0208e98 00ff880000000000 ffff880037d004b8
<0> ffff88007df99740 ffff88007cd89e80 ffff88007cd89e10 ffffffff00000000
Call Trace:
 [<ffffffffa0208e98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
 [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
 [<ffffffffa02088c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
 [<ffffffff81074c7e>] kthread+0x7f/0x87
 [<ffffffff81012cea>] child_rip+0xa/0x20
 [<ffffffff81074bff>] ? kthread+0x0/0x87
 [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 af 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 26 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 b3 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75
RIP [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
 RSP <ffff88007cd89d90>
---[ end trace 9d3da64f968ed95a ]---

* [Ocfs2-devel] dlmglue fixes
From: Sunil Mushran @ 2010-01-26 19:18 UTC
To: ocfs2-devel

David Teigland wrote:
> On Tue, Jan 26, 2010 at 04:33:26AM -0800, Joel Becker wrote:
>> On Thu, Jan 21, 2010 at 10:50:01AM -0800, Sunil Mushran wrote:
>>> So here are the two patches. Remove all patches that you have and apply
>>> these.
>
> I ran http://people.redhat.com/~teigland/make_panic on three nodes for 15
> minutes without any problem, so that's a big improvement.
>
> Then I tried another little test on three nodes which quickly triggered a
> BUG, http://people.redhat.com/~teigland/alternate.c
>
> node1: alternate test 0 0 3
> node2: alternate test 0 1 3
> node3: alternate test 0 2 3
>
> ------------[ cut here ]------------
> kernel BUG at fs/ocfs2/dlmglue.c:3281!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:80/0000:80:02.0/0000:86:01.0/local_cpus
> CPU 1
> Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath shpchp amd64_edac_mod edac_core serio_raw tg3 i2c_nforce2 k8temp i2c_core qla2xxx mptspi mptscsih scsi_transport_fc ata_generic mptbase pata_acpi scsi_tgt scsi_transport_spi sata_nv pata_amd [last unloaded: scsi_wait_scan]
> Pid: 2523, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
> RIP: 0010:[<ffffffffa020593d>] [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
> RSP: 0018:ffff88007cd89d90 EFLAGS: 00010082
> RAX: 000000000000005b RBX: ffff88007c5ccc50 RCX: 0000000000000aef
> RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
> RBP: ffff88007cd89db0 R08: ffff88007cd89cd0 R09: 0000000000000000
> R10: 0000000000000000 R11: 000000000006db00 R12: 0000000000000000
> R13: ffff88007cc20000 R14: 0000000000000293 R15: ffff88007c5ccc68
> FS: 00007f77b5a4e700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00000000011d8178 CR3: 000000013cee0000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process ocfs2dc (pid: 2523, threadinfo ffff88007cd88000, task ffff880037d00000)
> Stack:
>  ffff880000000000 ffff88007c5ccc50 ffff88007c5ccc50 0000000000000000
> <0> ffff88007cd89ee0 ffffffffa0208e98 00ff880000000000 ffff880037d004b8
> <0> ffff88007df99740 ffff88007cd89e80 ffff88007cd89e10 ffffffff00000000
> Call Trace:
>  [<ffffffffa0208e98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
>  [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
>  [<ffffffffa02088c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
>  [<ffffffff81074c7e>] kthread+0x7f/0x87
>  [<ffffffff81012cea>] child_rip+0xa/0x20
>  [<ffffffff81074bff>] ? kthread+0x0/0x87
>  [<ffffffff81012ce0>] ? child_rip+0x0/0x20
> Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 af 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 26 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 b3 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75
> RIP [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
>  RSP <ffff88007cd89d90>
> ---[ end trace 9d3da64f968ed95a ]---

David,

Thanks for running the test. Did this happen on all three nodes?

Also, was there another message like the following?

	mlog(ML_ERROR, "lockres->l_level (%d) <= new_level (%d)\n",
	     lockres->l_level, new_level);

Wondering if you build with CONFIG_OCFS2_DEBUG_MASKLOG.

Sunil

* [Ocfs2-devel] dlmglue fixes
From: David Teigland @ 2010-01-26 19:53 UTC
To: ocfs2-devel

On Tue, Jan 26, 2010 at 11:18:34AM -0800, Sunil Mushran wrote:
> Thanks for running the test. Did this happen on all three nodes?

First time it was 2 of 3, second time it was 1 of 3.

> Also, was there another message like the following?
>
> mlog(ML_ERROR, "lockres->l_level (%d) <= new_level (%d)\n",
>      lockres->l_level, new_level);

Oops, yeah, I missed copying that:

Jan 26 10:08:31 bull-02 kernel: (1995,1):ocfs2_prepare_downconvert:3280 ERROR: lockres->l_level (0) <= new_level (0)
Jan 26 10:08:31 bull-02 kernel: ------------[ cut here ]------------
Jan 26 10:08:31 bull-02 kernel: kernel BUG at fs/ocfs2/dlmglue.c:3281!
Jan 26 10:08:31 bull-02 kernel: invalid opcode: 0000 [#1] SMP
Jan 26 10:08:31 bull-02 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:0d.0/0000:03:00.0/irq
Jan 26 10:08:31 bull-02 kernel: CPU 1
Jan 26 10:08:31 bull-02 kernel: Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath amd64_edac_mod i2c_nforce2 tg3 shpchp serio_raw edac_core i2c_core k8temp qla2xxx mptspi mptscsih ata_generic scsi_transport_fc pata_acpi mptbase sata_nv scsi_transport_spi pata_amd scsi_tgt [last unloaded: scsi_wait_scan]
Jan 26 10:08:31 bull-02 kernel: Pid: 1995, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
Jan 26 10:08:31 bull-02 kernel: RIP: 0010:[<ffffffffa020993d>] [<ffffffffa020993d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:08:31 bull-02 kernel: RSP: 0018:ffff88007aa37d90 EFLAGS: 00010082
Jan 26 10:08:31 bull-02 kernel: RAX: 000000000000005b RBX: ffff88013bf4f1d0 RCX: 0000000000000aa6
Jan 26 10:08:31 bull-02 kernel: RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
Jan 26 10:08:31 bull-02 kernel: RBP: ffff88007aa37db0 R08: ffff88007aa37cd0 R09: 0000000000000000
Jan 26 10:08:31 bull-02 kernel: R10: 0000000000000000 R11: 000000107ce1fc00 R12: 0000000000000000
Jan 26 10:08:31 bull-02 kernel: R13: ffff88007cc55000 R14: 0000000000000293 R15: ffff88013bf4f1e8
Jan 26 10:08:31 bull-02 kernel: FS: 00007fe27c39b700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
Jan 26 10:08:31 bull-02 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 26 10:08:31 bull-02 kernel: CR2: 0000000000bb3000 CR3: 00000001382f1000 CR4: 00000000000006e0
Jan 26 10:08:31 bull-02 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 26 10:08:31 bull-02 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 26 10:08:31 bull-02 kernel: Process ocfs2dc (pid: 1995, threadinfo ffff88007aa36000, task ffff88007a6b1740)
Jan 26 10:08:31 bull-02 kernel: Stack:
Jan 26 10:08:31 bull-02 kernel: ffff880000000000 ffff88013bf4f1d0 ffff88013bf4f1d0 0000000000000000
Jan 26 10:08:31 bull-02 kernel: <0> ffff88007aa37ee0 ffffffffa020ce98 00ff880000000000 ffff88007a6b1bf8
Jan 26 10:08:31 bull-02 kernel: <0> ffff88007df81740 ffff88007aa37e80 ffff88007aa37e10 ffffffff00000000
Jan 26 10:08:31 bull-02 kernel: Call Trace:
Jan 26 10:08:31 bull-02 kernel: [<ffffffffa020ce98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
Jan 26 10:08:31 bull-02 kernel: [<ffffffffa020c8c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81074c7e>] kthread+0x7f/0x87
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81012cea>] child_rip+0xa/0x20
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81074bff>] ? kthread+0x0/0x87
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Jan 26 10:08:31 bull-02 kernel: Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 ef 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 66 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 73 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75
Jan 26 10:08:31 bull-02 kernel: RIP [<ffffffffa020993d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:08:31 bull-02 kernel: RSP <ffff88007aa37d90>
Jan 26 10:08:31 bull-02 kernel: ---[ end trace 9e720f5422312a43 ]---

Jan 26 10:33:35 bull-02 kernel: (2523,1):ocfs2_prepare_downconvert:3280 ERROR: lockres->l_level (0) <= new_level (0)
Jan 26 10:33:35 bull-02 kernel: ------------[ cut here ]------------
Jan 26 10:33:35 bull-02 kernel: kernel BUG at fs/ocfs2/dlmglue.c:3281!
Jan 26 10:33:35 bull-02 kernel: invalid opcode: 0000 [#1] SMP
Jan 26 10:33:35 bull-02 kernel: last sysfs file: /sys/devices/pci0000:80/0000:80:02.0/0000:86:01.0/local_cpus
Jan 26 10:33:35 bull-02 kernel: CPU 1
Jan 26 10:33:35 bull-02 kernel: Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath shpchp amd64_edac_mod edac_core serio_raw tg3 i2c_nforce2 k8temp i2c_core qla2xxx mptspi mptscsih scsi_transport_fc ata_generic mptbase pata_acpi scsi_tgt scsi_transport_spi sata_nv pata_amd [last unloaded: scsi_wait_scan]
Jan 26 10:33:35 bull-02 kernel: Pid: 2523, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
Jan 26 10:33:35 bull-02 kernel: RIP: 0010:[<ffffffffa020593d>] [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:33:35 bull-02 kernel: RSP: 0018:ffff88007cd89d90 EFLAGS: 00010082
Jan 26 10:33:35 bull-02 kernel: RAX: 000000000000005b RBX: ffff88007c5ccc50 RCX: 0000000000000aef
Jan 26 10:33:35 bull-02 kernel: RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
Jan 26 10:33:35 bull-02 kernel: RBP: ffff88007cd89db0 R08: ffff88007cd89cd0 R09: 0000000000000000
Jan 26 10:33:35 bull-02 kernel: R10: 0000000000000000 R11: 000000000006db00 R12: 0000000000000000
Jan 26 10:33:35 bull-02 kernel: R13: ffff88007cc20000 R14: 0000000000000293 R15: ffff88007c5ccc68
Jan 26 10:33:35 bull-02 kernel: FS: 00007f77b5a4e700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
Jan 26 10:33:35 bull-02 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 26 10:33:35 bull-02 kernel: CR2: 00000000011d8178 CR3: 000000013cee0000 CR4: 00000000000006e0
Jan 26 10:33:35 bull-02 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 26 10:33:35 bull-02 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 26 10:33:35 bull-02 kernel: Process ocfs2dc (pid: 2523, threadinfo ffff88007cd88000, task ffff880037d00000)
Jan 26 10:33:35 bull-02 kernel: Stack:
Jan 26 10:33:35 bull-02 kernel: ffff880000000000 ffff88007c5ccc50 ffff88007c5ccc50 0000000000000000
Jan 26 10:33:35 bull-02 kernel: <0> ffff88007cd89ee0 ffffffffa0208e98 00ff880000000000 ffff880037d004b8
Jan 26 10:33:35 bull-02 kernel: <0> ffff88007df99740 ffff88007cd89e80 ffff88007cd89e10 ffffffff00000000
Jan 26 10:33:35 bull-02 kernel: Call Trace:
Jan 26 10:33:35 bull-02 kernel: [<ffffffffa0208e98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
Jan 26 10:33:35 bull-02 kernel: [<ffffffffa02088c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81074c7e>] kthread+0x7f/0x87
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81012cea>] child_rip+0xa/0x20
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81074bff>] ? kthread+0x0/0x87
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Jan 26 10:33:35 bull-02 kernel: Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 af 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 26 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 b3 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75
Jan 26 10:33:35 bull-02 kernel: RIP [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:33:35 bull-02 kernel: RSP <ffff88007cd89d90>
Jan 26 10:33:35 bull-02 kernel: ---[ end trace 9d3da64f968ed95a ]---

Jan 26 10:08:31 bull-04 kernel: (2047,3):ocfs2_prepare_downconvert:3280 ERROR: lockres->l_level (0) <= new_level (0)
Jan 26 10:08:31 bull-04 kernel: ------------[ cut here ]------------
Jan 26 10:08:31 bull-04 kernel: kernel BUG at fs/ocfs2/dlmglue.c:3281!
Jan 26 10:08:31 bull-04 kernel: invalid opcode: 0000 [#1] SMP
Jan 26 10:08:31 bull-04 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:0d.0/0000:03:00.0/irq
Jan 26 10:08:31 bull-04 kernel: CPU 3
Jan 26 10:08:31 bull-04 kernel: Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath amd64_edac_mod edac_core tg3 i2c_nforce2 k8temp shpchp i2c_core serio_raw qla2xxx mptspi ata_generic scsi_transport_fc pata_acpi mptscsih mptbase scsi_transport_spi scsi_tgt sata_nv pata_amd [last unloaded: scsi_wait_scan]
Jan 26 10:08:31 bull-04 kernel: Pid: 2047, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
Jan 26 10:08:31 bull-04 kernel: RIP: 0010:[<ffffffffa020693d>] [<ffffffffa020693d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:08:31 bull-04 kernel: RSP: 0018:ffff88007d301d90 EFLAGS: 00010082
Jan 26 10:08:31 bull-04 kernel: RAX: 000000000000005b RBX: ffff88007ada17d0 RCX: 0000000000000a4a
Jan 26 10:08:31 bull-04 kernel: RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
Jan 26 10:08:31 bull-04 kernel: RBP: ffff88007d301db0 R08: ffff88007d301cd0 R09: 0000000000000000
Jan 26 10:08:31 bull-04 kernel: R10: 0000000000000000 R11: ffff88013a730400 R12: 0000000000000000
Jan 26 10:08:31 bull-04 kernel: R13: ffff88013a40d000 R14: 0000000000000293 R15: ffff88007ada17e8
Jan 26 10:08:31 bull-04 kernel: FS: 00007f5e323db700(0000) GS:ffff880082100000(0000) knlGS:0000000000000000
Jan 26 10:08:31 bull-04 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 26 10:08:31 bull-04 kernel: CR2: 0000000000e074b0 CR3: 0000000139db7000 CR4: 00000000000006e0
Jan 26 10:08:31 bull-04 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 26 10:08:31 bull-04 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 26 10:08:31 bull-04 kernel: Process ocfs2dc (pid: 2047, threadinfo ffff88007d300000, task ffff88007ab3c5c0)
Jan 26 10:08:31 bull-04 kernel: Stack:
Jan 26 10:08:31 bull-04 kernel: ffff880100000000 ffff88007ada17d0 ffff88007ada17d0 0000000000000000
Jan 26 10:08:31 bull-04 kernel: <0> ffff88007d301ee0 ffffffffa0209e98 0000000000000000 ffff88007ab3ca78
Jan 26 10:08:31 bull-04 kernel: <0> ffffffff816861f0 ffff88007d301e80 ffff88007d301e10 ffffffff00000000
Jan 26 10:08:31 bull-04 kernel: Call Trace:
Jan 26 10:08:31 bull-04 kernel: [<ffffffffa0209e98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
Jan 26 10:08:31 bull-04 kernel: [<ffffffffa02098c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81074c7e>] kthread+0x7f/0x87
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81012cea>] child_rip+0xa/0x20
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81074bff>] ? kthread+0x0/0x87
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Jan 26 10:08:31 bull-04 kernel: Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 bf 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 36 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 a3 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75
Jan 26 10:08:31 bull-04 kernel: RIP [<ffffffffa020693d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:08:31 bull-04 kernel: RSP <ffff88007d301d90>
Jan 26 10:08:31 bull-04 kernel: ---[ end trace 930397e8616715ba ]---

> Wondering if you build with CONFIG_OCFS2_DEBUG_MASKLOG.

Yes I am.

Dave

* [Ocfs2-devel] dlmglue fixes
From: Sunil Mushran @ 2010-01-29 0:21 UTC
To: ocfs2-devel

David Teigland wrote:
> Oops, yeah, I missed copying that:
>
> Jan 26 10:08:31 bull-02 kernel: (1995,1):ocfs2_prepare_downconvert:3280 ERROR: lockres->l_level (0) <= new_level (0)
> Jan 26 10:08:31 bull-02 kernel: ------------[ cut here ]------------
> Jan 26 10:08:31 bull-02 kernel: kernel BUG at fs/ocfs2/dlmglue.c:3281!

Coly ran into the same some time ago.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1178

I am going thru the traces.
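
As a reading aid for the error line quoted above, the following is a small sketch of the sanity check that the message implies: a downconvert is only meaningful when the target level is strictly lower than the current one. The function name and the numeric levels are illustrative assumptions (0 standing for the null lock level, as in the usual DLM mode numbering); the real check lives in ocfs2_prepare_downconvert() and calls BUG() instead of returning.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed numeric lock levels, lowest to highest. */
enum sk_level { SK_LEVEL_NL = 0, SK_LEVEL_PR = 3, SK_LEVEL_EX = 5 };

/* Hypothetical helper: true if moving cur_level -> new_level lowers the lock. */
static bool sk_downconvert_is_valid(int cur_level, int new_level)
{
	if (cur_level <= new_level) {
		/* Same wording as the logged error, minus the kernel log prefix. */
		fprintf(stderr, "lockres->l_level (%d) <= new_level (%d)\n",
			cur_level, new_level);
		return false;
	}
	return true;
}
```

Read this way, the logged values (0) and (0) suggest the dc thread was asked to downconvert a lock that was already at the lowest level, so there was nothing left to lower.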

* [Ocfs2-devel] dlmglue fixes
From: Sunil Mushran @ 2010-01-26 22:57 UTC
To: ocfs2-devel

David Teigland wrote:
> I ran http://people.redhat.com/~teigland/make_panic on three nodes for 15
> minutes without any problem, so that's a big improvement.
>
> Then I tried another little test on three nodes which quickly triggered a
> BUG, http://people.redhat.com/~teigland/alternate.c
>
> node1: alternate test 0 0 3
> node2: alternate test 0 1 3
> node3: alternate test 0 2 3

I ran the same on 3 x86 nodes. With o2dlm. It is chugging along.

100324 1114
100327 1062
100330 1289
100333 1131
100336 1142
100339 1222
100342 1233
100345 2941
100348 1228
100351 1141
100354 1070
100357 1197
100360 1159
100363 1223
100366 1220

I'll let it run. I'll also do the same test on a ppc cluster. I've had
better luck reproducing such races on that one. Fingers crossed.