[Ocfs2-devel] dlmglue fixes

* [Ocfs2-devel] dlmglue fixes
@ 2010-01-21 18:50 Sunil Mushran
  2010-01-21 18:50 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast Sunil Mushran
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Sunil Mushran @ 2010-01-21 18:50 UTC (permalink / raw)
  To: ocfs2-devel

David,

So here are the two patches. Remove all patches that you have and apply
these.

The first one is straightforward.

The second one will hopefully fix the livelock issue you have been
encountering.

People reviewing the patches should note that the second one is slightly
different than the one I posted earlier. It removes the BUG_ON in the if
condition where we jump to update_holders. The accompanying comment has
also been updated.

Thanks
Sunil

* [Ocfs2-devel] [PATCH 1/2] ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast
  2010-01-21 18:50 [Ocfs2-devel] dlmglue fixes Sunil Mushran
@ 2010-01-21 18:50 ` Sunil Mushran
  2010-01-21 18:50 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Prevent a livelock in dlmglue Sunil Mushran
  2010-01-26 12:33 ` [Ocfs2-devel] dlmglue fixes Joel Becker
  2 siblings, 0 replies; 9+ messages in thread
From: Sunil Mushran @ 2010-01-21 18:50 UTC (permalink / raw)
  To: ocfs2-devel

From: Wengang Wang <wen.gang.wang@oracle.com>

During bast, set the OCFS2_LOCK_BLOCKED flag only if the lock needs to be
downconverted.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/dlmglue.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 0d38d67..0190f31 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -907,8 +907,6 @@ static int ocfs2_generic_handle_bast(struct ocfs2_lock_res *lockres,
 
 	assert_spin_locked(&lockres->l_lock);
 
-	lockres_or_flags(lockres, OCFS2_LOCK_BLOCKED);
-
 	if (level > lockres->l_blocking) {
 		/* only schedule a downconvert if we haven't already scheduled
 		 * one that goes low enough to satisfy the level we're
@@ -921,6 +919,9 @@ static int ocfs2_generic_handle_bast(struct ocfs2_lock_res *lockres,
 		lockres->l_blocking = level;
 	}
 
+	if (needs_downconvert)
+		lockres_or_flags(lockres, OCFS2_LOCK_BLOCKED);
+		
 	mlog_exit(needs_downconvert);
 	return needs_downconvert;
 }
-- 
1.6.3.3
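
For readers without the kernel tree at hand, here is a minimal userspace
sketch of the patched bast logic. It is only a model under stated assumptions:
the level values (NL=0, PR=3, EX=5) follow the usual DLM constants, and
model_lockres, model_highest_compat(), model_handle_bast() and
MODEL_LOCK_BLOCKED are hypothetical stand-ins, not kernel identifiers. The
compatibility helper is an assumption about what the real dlmglue helper does;
it is not shown in the diff above.

```c
#include <assert.h>

/* Assumed DLM lock levels (NL < PR < EX). */
enum { LVL_NL = 0, LVL_PR = 3, LVL_EX = 5 };

#define MODEL_LOCK_BLOCKED 0x1UL /* hypothetical stand-in for OCFS2_LOCK_BLOCKED */

struct model_lockres {
	unsigned long flags;
	int blocking; /* highest level another node is currently blocked on */
};

/* Assumed helper: highest level we may keep while a peer holds `level`. */
static int model_highest_compat(int level)
{
	if (level == LVL_EX)
		return LVL_NL;
	if (level == LVL_PR)
		return LVL_PR;
	return LVL_EX;
}

/*
 * Model of the patched handler: the BLOCKED flag is set only when this
 * bast actually requires a new downconvert, not on every bast.
 */
static int model_handle_bast(struct model_lockres *res, int level)
{
	int needs_downconvert = 0;

	assert(res);

	if (level > res->blocking) {
		/* Only schedule a downconvert if the one already scheduled
		 * does not go low enough for the new blocking level. */
		if (model_highest_compat(level) <
		    model_highest_compat(res->blocking))
			needs_downconvert = 1;
		res->blocking = level;
	}

	if (needs_downconvert)
		res->flags |= MODEL_LOCK_BLOCKED;

	return needs_downconvert;
}
```

In this model a bast for PR against a lockres already blocking at EX returns 0
and leaves the flags untouched, whereas the pre-patch code would have set the
blocked flag unconditionally.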

* [Ocfs2-devel] [PATCH 2/2] ocfs2: Prevent a livelock in dlmglue
  2010-01-21 18:50 [Ocfs2-devel] dlmglue fixes Sunil Mushran
  2010-01-21 18:50 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast Sunil Mushran
@ 2010-01-21 18:50 ` Sunil Mushran
  2010-01-26 12:33 ` [Ocfs2-devel] dlmglue fixes Joel Becker
  2 siblings, 0 replies; 9+ messages in thread
From: Sunil Mushran @ 2010-01-21 18:50 UTC (permalink / raw)
  To: ocfs2-devel

There is a possibility of a livelock in __ocfs2_cluster_lock(). If a node were
to get an ast for an upconvert request, followed immediately by a bast, there
is a small probability the fs may downconvert the lock before the process that
requested the upconvert is able to take the lock.

This patch adds a new flag to indicate that the upconvert is still in progress
and that the dc thread should not downconvert it right now.

Wengang Wang <wen.gang.wang@oracle.com> and Joel Becker <joel.becker@oracle.com>
contributed heavily to this patch.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
---
 fs/ocfs2/dlmglue.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
 fs/ocfs2/ocfs2.h   |    4 ++++
 2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 0190f31..f7b9f8f 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -875,6 +875,14 @@ static inline void ocfs2_generic_handle_convert_action(struct ocfs2_lock_res *lo
 		lockres_or_flags(lockres, OCFS2_LOCK_NEEDS_REFRESH);
 
 	lockres->l_level = lockres->l_requested;
+
+	/*
+	 * We set the OCFS2_LOCK_UPCONVERT_FINISHING flag before clearing
+	 * the OCFS2_LOCK_BUSY flag to prevent the dc thread from
+	 * downconverting the lock before the upconvert has fully completed.
+	 */
+	lockres_or_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
+
 	lockres_clear_flags(lockres, OCFS2_LOCK_BUSY);
 
 	mlog_exit_void();
@@ -1134,6 +1142,7 @@ static inline void ocfs2_recover_from_dlm_error(struct ocfs2_lock_res *lockres,
 	mlog_entry_void();
 	spin_lock_irqsave(&lockres->l_lock, flags);
 	lockres_clear_flags(lockres, OCFS2_LOCK_BUSY);
+	lockres_clear_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
 	if (convert)
 		lockres->l_action = OCFS2_AST_INVALID;
 	else
@@ -1324,13 +1333,13 @@ static int __ocfs2_cluster_lock(struct ocfs2_super *osb,
 again:
 	wait = 0;
 
+	spin_lock_irqsave(&lockres->l_lock, flags);
+
 	if (catch_signals && signal_pending(current)) {
 		ret = -ERESTARTSYS;
-		goto out;
+		goto unlock;
 	}
 
-	spin_lock_irqsave(&lockres->l_lock, flags);
-
 	mlog_bug_on_msg(lockres->l_flags & OCFS2_LOCK_FREEING,
 			"Cluster lock called on freeing lockres %s! flags "
 			"0x%lx\n", lockres->l_name, lockres->l_flags);
@@ -1347,6 +1356,25 @@ again:
 		goto unlock;
 	}
 
+	if (lockres->l_flags & OCFS2_LOCK_UPCONVERT_FINISHING) {
+		/*
+		 * We've upconverted. If the lock now has a level we can
+		 * work with, we take it. If, however, the lock is not at the
+		 * required level, we go thru the full cycle. One way this could
+		 * happen is if a process requesting an upconvert to PR is
+		 * closely followed by another requesting upconvert to an EX.
+		 * If the process requesting EX lands here, we want it to
+		 * continue attempting to upconvert and let the process
+		 * requesting PR take the lock.
+		 * If multiple processes request upconvert to PR, the first one
+		 * here will take the lock. The others will have to go thru the
+		 * OCFS2_LOCK_BLOCKED check to ensure that there is no pending
+		 * downconvert request.
+		 */
+		if (level <= lockres->l_level)
+			goto update_holders;
+	}
+
 	if (lockres->l_flags & OCFS2_LOCK_BLOCKED &&
 	    !ocfs2_may_continue_on_blocked_lock(lockres, level)) {
 		/* is the lock is currently blocked on behalf of
@@ -1417,11 +1445,14 @@ again:
 		goto again;
 	}
 
+update_holders:
 	/* Ok, if we get here then we're good to go. */
 	ocfs2_inc_holders(lockres, level);
 
 	ret = 0;
 unlock:
+	lockres_clear_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
+
 	spin_unlock_irqrestore(&lockres->l_lock, flags);
 out:
 	/*
@@ -3402,6 +3433,18 @@ recheck:
 		goto leave;
 	}
 
+	/*
+	 * This prevents livelocks. OCFS2_LOCK_UPCONVERT_FINISHING flag is
+	 * set when the ast is received for an upconvert just before the
+	 * OCFS2_LOCK_BUSY flag is cleared. Now if the fs received a bast
+	 * on the heels of the ast, we want to delay the downconvert just
+	 * enough to allow the up requestor to do its task. Because this
+	 * lock is in the blocked queue, the lock will be downconverted
+	 * as soon as the requestor is done with the lock.
+	 */
+	if (lockres->l_flags & OCFS2_LOCK_UPCONVERT_FINISHING)
+		goto leave_requeue;
+
 	/* if we're blocking an exclusive and we have *any* holders,
 	 * then requeue. */
 	if ((lockres->l_blocking == DLM_LOCK_EX)
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index d963d86..782e77d 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -136,6 +136,10 @@ enum ocfs2_unlock_action {
 #define OCFS2_LOCK_PENDING       (0x00000400) /* This lockres is pending a
 						 call to dlm_lock.  Only
 						 exists with BUSY set. */
+#define OCFS2_LOCK_UPCONVERT_FINISHING (0x00000800) /* blocks the dc thread
+						     * from downconverting
+						     * before the upconvert
+						     * has completed */
 
 struct ocfs2_lock_res_ops;
 
-- 
1.6.3.3
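
The handshake that patch 2/2 introduces can be walked through with a small
single-threaded model. This is a sketch, not the dlmglue code: every m_* name
and M_* flag below is a hypothetical stand-in, locking is omitted, and the dc
thread is simplified to requeue whenever any local holder exists (the real
thread distinguishes EX and PR holders).

```c
#include <assert.h>

/* Hypothetical stand-ins for the OCFS2_LOCK_* flags used in the patch. */
#define M_BUSY                0x1UL
#define M_BLOCKED             0x2UL
#define M_UPCONVERT_FINISHING 0x4UL

enum { M_NL = 0, M_PR = 3, M_EX = 5 }; /* assumed DLM levels */

struct m_lockres {
	unsigned long flags;
	int level;     /* currently granted level */
	int requested; /* level asked for by the pending upconvert */
	int holders;   /* local processes holding the lock */
};

/* ast for an upconvert: grant the level, set FINISHING, then clear BUSY. */
static void m_upconvert_ast(struct m_lockres *res)
{
	assert(res->flags & M_BUSY);
	res->level = res->requested;
	res->flags |= M_UPCONVERT_FINISHING;
	res->flags &= ~M_BUSY;
}

/* bast: another node wants the lock, so a downconvert is now pending. */
static void m_bast(struct m_lockres *res)
{
	res->flags |= M_BLOCKED;
}

/* One pass of the dc thread: 1 if it downconverted, 0 if it left the lock alone. */
static int m_dc_thread_pass(struct m_lockres *res)
{
	if (!(res->flags & M_BLOCKED))
		return 0;
	/* The new check: do not downconvert while the upconvert is finishing. */
	if (res->flags & (M_BUSY | M_UPCONVERT_FINISHING))
		return 0;
	if (res->holders > 0)
		return 0;
	res->level = M_NL;
	res->flags &= ~M_BLOCKED;
	return 1;
}

/* Requester: 1 if it took the lock, 0 if it has to go around again. */
static int m_cluster_lock(struct m_lockres *res, int level)
{
	int taken = 0;

	if (!(res->flags & M_BUSY) && level <= res->level) {
		/* FINISHING lets the requester past a pending BLOCKED flag. */
		if ((res->flags & M_UPCONVERT_FINISHING) ||
		    !(res->flags & M_BLOCKED)) {
			res->holders++;
			taken = 1;
		}
	}
	/* As at the patch's unlock label, FINISHING is always cleared here. */
	res->flags &= ~M_UPCONVERT_FINISHING;
	return taken;
}

static void m_cluster_unlock(struct m_lockres *res)
{
	res->holders--;
}
```

Driving the model in the order ast, bast, dc pass, lock, unlock, dc pass shows
the intent: the first dc pass backs off because the finishing flag is set, the
requester takes the lock, and the downconvert happens only after the holder
drops it.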

* [Ocfs2-devel] dlmglue fixes
  2010-01-21 18:50 [Ocfs2-devel] dlmglue fixes Sunil Mushran
  2010-01-21 18:50 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast Sunil Mushran
  2010-01-21 18:50 ` [Ocfs2-devel] [PATCH 2/2] ocfs2: Prevent a livelock in dlmglue Sunil Mushran
@ 2010-01-26 12:33 ` Joel Becker
  2010-01-26 16:37   ` David Teigland
  2 siblings, 1 reply; 9+ messages in thread
From: Joel Becker @ 2010-01-26 12:33 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Jan 21, 2010 at 10:50:01AM -0800, Sunil Mushran wrote:
> So here are the two patches. Remove all patches that you have and apply
> these.
> 
> The first one is straight forward.
> 
> The second one will hopefully fix the livelock issue you have been
> encountering.
> 
> People reviewing the patches should note that the second one is slightly
> different than the one I posted earlier. It removes the BUG_ON in the if
> condition where we jump to update_holders. The accompanying comment has
> also been updated.

David,
	Don't know if you saw this, so I'm adding you to the CC.
Hopefully you can test it out; let us know if there are further
problems!

Joel

-- 

"I inject pure kryptonite into my brain.
 It improves my kung fu, and it eases the pain."


Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

* [Ocfs2-devel] dlmglue fixes
  2010-01-26 12:33 ` [Ocfs2-devel] dlmglue fixes Joel Becker
@ 2010-01-26 16:37   ` David Teigland
  2010-01-26 19:18     ` Sunil Mushran
  2010-01-26 22:57     ` Sunil Mushran
  0 siblings, 2 replies; 9+ messages in thread
From: David Teigland @ 2010-01-26 16:37 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Jan 26, 2010 at 04:33:26AM -0800, Joel Becker wrote:
> On Thu, Jan 21, 2010 at 10:50:01AM -0800, Sunil Mushran wrote:
> > So here are the two patches. Remove all patches that you have and apply
> > these.

I ran http://people.redhat.com/~teigland/make_panic on three nodes for 15
minutes without any problem, so that's a big improvement.

Then I tried another little test on three nodes which quickly triggered a
BUG, http://people.redhat.com/~teigland/alternate.c

node1: alternate test 0 0 3
node2: alternate test 0 1 3
node3: alternate test 0 2 3

------------[ cut here ]------------
kernel BUG at fs/ocfs2/dlmglue.c:3281!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:02.0/0000:86:01.0/local_cpus
CPU 1
Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath shpchp amd64_edac_mod edac_core serio_raw tg3 i2c_nforce2 k8temp i2c_core qla2xxx mptspi mptscsih scsi_transport_fc ata_generic mptbase pata_acpi scsi_tgt scsi_transport_spi sata_nv pata_amd [last unloaded: scsi_wait_scan]
Pid: 2523, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
RIP: 0010:[<ffffffffa020593d>]  [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
RSP: 0018:ffff88007cd89d90  EFLAGS: 00010082
RAX: 000000000000005b RBX: ffff88007c5ccc50 RCX: 0000000000000aef
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
RBP: ffff88007cd89db0 R08: ffff88007cd89cd0 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000006db00 R12: 0000000000000000
R13: ffff88007cc20000 R14: 0000000000000293 R15: ffff88007c5ccc68
FS:  00007f77b5a4e700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000011d8178 CR3: 000000013cee0000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ocfs2dc (pid: 2523, threadinfo ffff88007cd88000, task ffff880037d00000)
Stack:
 ffff880000000000 ffff88007c5ccc50 ffff88007c5ccc50 0000000000000000
<0> ffff88007cd89ee0 ffffffffa0208e98 00ff880000000000 ffff880037d004b8
<0> ffff88007df99740 ffff88007cd89e80 ffff88007cd89e10 ffffffff00000000
Call Trace:
 [<ffffffffa0208e98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
 [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
 [<ffffffffa02088c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
 [<ffffffff81074c7e>] kthread+0x7f/0x87
 [<ffffffff81012cea>] child_rip+0xa/0x20
 [<ffffffff81074bff>] ? kthread+0x0/0x87
 [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 af 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 26 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 b3 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75
RIP  [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
 RSP <ffff88007cd89d90>
---[ end trace 9d3da64f968ed95a ]---

* [Ocfs2-devel] dlmglue fixes
  2010-01-26 16:37   ` David Teigland
@ 2010-01-26 19:18     ` Sunil Mushran
  2010-01-26 19:53       ` David Teigland
  2010-01-26 22:57     ` Sunil Mushran
  1 sibling, 1 reply; 9+ messages in thread
From: Sunil Mushran @ 2010-01-26 19:18 UTC (permalink / raw)
  To: ocfs2-devel

David Teigland wrote:
> On Tue, Jan 26, 2010 at 04:33:26AM -0800, Joel Becker wrote:
>   
>> On Thu, Jan 21, 2010 at 10:50:01AM -0800, Sunil Mushran wrote:
>>     
>>> So here are the two patches. Remove all patches that you have and apply
>>> these.
>>>       
>
> I ran http://people.redhat.com/~teigland/make_panic on three nodes for 15
> minutes without any problem, so that's a big improvement.
>
> Then I tried another little test on three nodes which quickly triggered a
> BUG, http://people.redhat.com/~teigland/alternate.c
>
> node1: alternate test 0 0 3
> node2: alternate test 0 1 3
> node3: alternate test 0 2 3
>
> ------------[ cut here ]------------
> kernel BUG at fs/ocfs2/dlmglue.c:3281!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:80/0000:80:02.0/0000:86:01.0/local_cpus
> CPU 1
> Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath shpchp amd64_edac_mod edac_core serio_raw tg3 i2c_nforce2 k8temp i2c_core qla2xxx mptspi mptscsih scsi_transport_fc ata_generic mptbase pata_acpi scsi_tgt scsi_transport_spi sata_nv pata_amd [last unloaded: scsi_wait_scan]
> Pid: 2523, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
> RIP: 0010:[<ffffffffa020593d>]  [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
> RSP: 0018:ffff88007cd89d90  EFLAGS: 00010082
> RAX: 000000000000005b RBX: ffff88007c5ccc50 RCX: 0000000000000aef
> RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
> RBP: ffff88007cd89db0 R08: ffff88007cd89cd0 R09: 0000000000000000
> R10: 0000000000000000 R11: 000000000006db00 R12: 0000000000000000
> R13: ffff88007cc20000 R14: 0000000000000293 R15: ffff88007c5ccc68
> FS:  00007f77b5a4e700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00000000011d8178 CR3: 000000013cee0000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process ocfs2dc (pid: 2523, threadinfo ffff88007cd88000, task ffff880037d00000)
> Stack:
>  ffff880000000000 ffff88007c5ccc50 ffff88007c5ccc50 0000000000000000
> <0> ffff88007cd89ee0 ffffffffa0208e98 00ff880000000000 ffff880037d004b8
> <0> ffff88007df99740 ffff88007cd89e80 ffff88007cd89e10 ffffffff00000000
> Call Trace:
>  [<ffffffffa0208e98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
>  [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
>  [<ffffffffa02088c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
>  [<ffffffff81074c7e>] kthread+0x7f/0x87
>  [<ffffffff81012cea>] child_rip+0xa/0x20
>  [<ffffffff81074bff>] ? kthread+0x0/0x87
>  [<ffffffff81012ce0>] ? child_rip+0x0/0x20
> Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 af 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 26 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 b3 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75
> RIP  [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
>  RSP <ffff88007cd89d90>
> ---[ end trace 9d3da64f968ed95a ]---
>   

David,

Thanks for running the test. Did this happen on all three nodes?
Also, was there another message like the following?

                mlog(ML_ERROR, "lockres->l_level (%d) <= new_level (%d)\n",
                     lockres->l_level, new_level);

Wondering if you build with CONFIG_OCFS2_DEBUG_MASKLOG.

Sunil

* [Ocfs2-devel] dlmglue fixes
  2010-01-26 19:18     ` Sunil Mushran
@ 2010-01-26 19:53       ` David Teigland
  2010-01-29  0:21         ` Sunil Mushran
  0 siblings, 1 reply; 9+ messages in thread
From: David Teigland @ 2010-01-26 19:53 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Jan 26, 2010 at 11:18:34AM -0800, Sunil Mushran wrote:
> Thanks for running the test. Did this happen on all three nodes?

First time it was 2 of 3, second time it was 1 of 3.

> Also, was there another message like the following?
> 
>                mlog(ML_ERROR, "lockres->l_level (%d) <= new_level (%d)\n",
>                     lockres->l_level, new_level);

Oops, yeah, I missed copying that:

Jan 26 10:08:31 bull-02 kernel: (1995,1):ocfs2_prepare_downconvert:3280 ERROR: lockres->l_level (0) <= new_level (0)
Jan 26 10:08:31 bull-02 kernel: ------------[ cut here ]------------
Jan 26 10:08:31 bull-02 kernel: kernel BUG at fs/ocfs2/dlmglue.c:3281!
Jan 26 10:08:31 bull-02 kernel: invalid opcode: 0000 [#1] SMP 
Jan 26 10:08:31 bull-02 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:0d.0/0000:03:00.0/irq
Jan 26 10:08:31 bull-02 kernel: CPU 1 
Jan 26 10:08:31 bull-02 kernel: Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath amd64_edac_mod i2c_nforce2 tg3 shpchp serio_raw edac_core i2c_core k8temp qla2xxx mptspi mptscsih ata_generic scsi_transport_fc pata_acpi mptbase sata_nv scsi_transport_spi pata_amd scsi_tgt [last unloaded: scsi_wait_scan]
Jan 26 10:08:31 bull-02 kernel: Pid: 1995, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
Jan 26 10:08:31 bull-02 kernel: RIP: 0010:[<ffffffffa020993d>]  [<ffffffffa020993d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:08:31 bull-02 kernel: RSP: 0018:ffff88007aa37d90  EFLAGS: 00010082
Jan 26 10:08:31 bull-02 kernel: RAX: 000000000000005b RBX: ffff88013bf4f1d0 RCX: 0000000000000aa6
Jan 26 10:08:31 bull-02 kernel: RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
Jan 26 10:08:31 bull-02 kernel: RBP: ffff88007aa37db0 R08: ffff88007aa37cd0 R09: 0000000000000000
Jan 26 10:08:31 bull-02 kernel: R10: 0000000000000000 R11: 000000107ce1fc00 R12: 0000000000000000
Jan 26 10:08:31 bull-02 kernel: R13: ffff88007cc55000 R14: 0000000000000293 R15: ffff88013bf4f1e8
Jan 26 10:08:31 bull-02 kernel: FS:  00007fe27c39b700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
Jan 26 10:08:31 bull-02 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 26 10:08:31 bull-02 kernel: CR2: 0000000000bb3000 CR3: 00000001382f1000 CR4: 00000000000006e0
Jan 26 10:08:31 bull-02 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 26 10:08:31 bull-02 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 26 10:08:31 bull-02 kernel: Process ocfs2dc (pid: 1995, threadinfo ffff88007aa36000, task ffff88007a6b1740)
Jan 26 10:08:31 bull-02 kernel: Stack:
Jan 26 10:08:31 bull-02 kernel: ffff880000000000 ffff88013bf4f1d0 ffff88013bf4f1d0 0000000000000000
Jan 26 10:08:31 bull-02 kernel: <0> ffff88007aa37ee0 ffffffffa020ce98 00ff880000000000 ffff88007a6b1bf8
Jan 26 10:08:31 bull-02 kernel: <0> ffff88007df81740 ffff88007aa37e80 ffff88007aa37e10 ffffffff00000000
Jan 26 10:08:31 bull-02 kernel: Call Trace:
Jan 26 10:08:31 bull-02 kernel: [<ffffffffa020ce98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
Jan 26 10:08:31 bull-02 kernel: [<ffffffffa020c8c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81074c7e>] kthread+0x7f/0x87
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81012cea>] child_rip+0xa/0x20
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81074bff>] ? kthread+0x0/0x87
Jan 26 10:08:31 bull-02 kernel: [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Jan 26 10:08:31 bull-02 kernel: Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 ef 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 66 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 73 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75
Jan 26 10:08:31 bull-02 kernel: RIP  [<ffffffffa020993d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:08:31 bull-02 kernel: RSP <ffff88007aa37d90>
Jan 26 10:08:31 bull-02 kernel: ---[ end trace 9e720f5422312a43 ]---

Jan 26 10:33:35 bull-02 kernel: (2523,1):ocfs2_prepare_downconvert:3280 ERROR: lockres->l_level (0) <= new_level (0)
Jan 26 10:33:35 bull-02 kernel: ------------[ cut here ]------------
Jan 26 10:33:35 bull-02 kernel: kernel BUG at fs/ocfs2/dlmglue.c:3281!
Jan 26 10:33:35 bull-02 kernel: invalid opcode: 0000 [#1] SMP 
Jan 26 10:33:35 bull-02 kernel: last sysfs file: /sys/devices/pci0000:80/0000:80:02.0/0000:86:01.0/local_cpus
Jan 26 10:33:35 bull-02 kernel: CPU 1 
Jan 26 10:33:35 bull-02 kernel: Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath shpchp amd64_edac_mod edac_core serio_raw tg3 i2c_nforce2 k8temp i2c_core qla2xxx mptspi mptscsih scsi_transport_fc ata_generic mptbase pata_acpi scsi_tgt scsi_transport_spi sata_nv pata_amd [last unloaded: scsi_wait_scan]
Jan 26 10:33:35 bull-02 kernel: Pid: 2523, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
Jan 26 10:33:35 bull-02 kernel: RIP: 0010:[<ffffffffa020593d>]  [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:33:35 bull-02 kernel: RSP: 0018:ffff88007cd89d90  EFLAGS: 00010082
Jan 26 10:33:35 bull-02 kernel: RAX: 000000000000005b RBX: ffff88007c5ccc50 RCX: 0000000000000aef
Jan 26 10:33:35 bull-02 kernel: RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
Jan 26 10:33:35 bull-02 kernel: RBP: ffff88007cd89db0 R08: ffff88007cd89cd0 R09: 0000000000000000
Jan 26 10:33:35 bull-02 kernel: R10: 0000000000000000 R11: 000000000006db00 R12: 0000000000000000
Jan 26 10:33:35 bull-02 kernel: R13: ffff88007cc20000 R14: 0000000000000293 R15: ffff88007c5ccc68
Jan 26 10:33:35 bull-02 kernel: FS:  00007f77b5a4e700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
Jan 26 10:33:35 bull-02 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 26 10:33:35 bull-02 kernel: CR2: 00000000011d8178 CR3: 000000013cee0000 CR4: 00000000000006e0
Jan 26 10:33:35 bull-02 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 26 10:33:35 bull-02 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 26 10:33:35 bull-02 kernel: Process ocfs2dc (pid: 2523, threadinfo ffff88007cd88000, task ffff880037d00000)
Jan 26 10:33:35 bull-02 kernel: Stack:
Jan 26 10:33:35 bull-02 kernel: ffff880000000000 ffff88007c5ccc50 ffff88007c5ccc50 0000000000000000
Jan 26 10:33:35 bull-02 kernel: <0> ffff88007cd89ee0 ffffffffa0208e98 00ff880000000000 ffff880037d004b8
Jan 26 10:33:35 bull-02 kernel: <0> ffff88007df99740 ffff88007cd89e80 ffff88007cd89e10 ffffffff00000000
Jan 26 10:33:35 bull-02 kernel: Call Trace:
Jan 26 10:33:35 bull-02 kernel: [<ffffffffa0208e98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
Jan 26 10:33:35 bull-02 kernel: [<ffffffffa02088c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81074c7e>] kthread+0x7f/0x87
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81012cea>] child_rip+0xa/0x20
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81074bff>] ? kthread+0x0/0x87
Jan 26 10:33:35 bull-02 kernel: [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Jan 26 10:33:35 bull-02 kernel: Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 af 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 26 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 b3 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75 
Jan 26 10:33:35 bull-02 kernel: RIP  [<ffffffffa020593d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:33:35 bull-02 kernel: RSP <ffff88007cd89d90>
Jan 26 10:33:35 bull-02 kernel: ---[ end trace 9d3da64f968ed95a ]---

Jan 26 10:08:31 bull-04 kernel: (2047,3):ocfs2_prepare_downconvert:3280 ERROR: lockres->l_level (0) <= new_level (0)
Jan 26 10:08:31 bull-04 kernel: ------------[ cut here ]------------
Jan 26 10:08:31 bull-04 kernel: kernel BUG at fs/ocfs2/dlmglue.c:3281!
Jan 26 10:08:31 bull-04 kernel: invalid opcode: 0000 [#1] SMP
Jan 26 10:08:31 bull-04 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:0d.0/0000:03:00.0/irq
Jan 26 10:08:31 bull-04 kernel: CPU 3
Jan 26 10:08:31 bull-04 kernel: Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath amd64_edac_mod edac_core tg3 i2c_nforce2 k8temp shpchp i2c_core serio_raw qla2xxx mptspi ata_generic scsi_transport_fc pata_acpi mptscsih mptbase scsi_transport_spi scsi_tgt sata_nv pata_amd [last unloaded: scsi_wait_scan]
Jan 26 10:08:31 bull-04 kernel: Pid: 2047, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
Jan 26 10:08:31 bull-04 kernel: RIP: 0010:[<ffffffffa020693d>]  [<ffffffffa020693d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:08:31 bull-04 kernel: RSP: 0018:ffff88007d301d90  EFLAGS: 00010082
Jan 26 10:08:31 bull-04 kernel: RAX: 000000000000005b RBX: ffff88007ada17d0 RCX: 0000000000000a4a
Jan 26 10:08:31 bull-04 kernel: RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
Jan 26 10:08:31 bull-04 kernel: RBP: ffff88007d301db0 R08: ffff88007d301cd0 R09: 0000000000000000
Jan 26 10:08:31 bull-04 kernel: R10: 0000000000000000 R11: ffff88013a730400 R12: 0000000000000000
Jan 26 10:08:31 bull-04 kernel: R13: ffff88013a40d000 R14: 0000000000000293 R15: ffff88007ada17e8
Jan 26 10:08:31 bull-04 kernel: FS:  00007f5e323db700(0000) GS:ffff880082100000(0000) knlGS:0000000000000000
Jan 26 10:08:31 bull-04 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 26 10:08:31 bull-04 kernel: CR2: 0000000000e074b0 CR3: 0000000139db7000 CR4: 00000000000006e0
Jan 26 10:08:31 bull-04 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 26 10:08:31 bull-04 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 26 10:08:31 bull-04 kernel: Process ocfs2dc (pid: 2047, threadinfo ffff88007d300000, task ffff88007ab3c5c0)
Jan 26 10:08:31 bull-04 kernel: Stack:
Jan 26 10:08:31 bull-04 kernel: ffff880100000000 ffff88007ada17d0 ffff88007ada17d0 0000000000000000
Jan 26 10:08:31 bull-04 kernel: <0> ffff88007d301ee0 ffffffffa0209e98 0000000000000000 ffff88007ab3ca78
Jan 26 10:08:31 bull-04 kernel: <0> ffffffff816861f0 ffff88007d301e80 ffff88007d301e10 ffffffff00000000
Jan 26 10:08:31 bull-04 kernel: Call Trace:
Jan 26 10:08:31 bull-04 kernel: [<ffffffffa0209e98>] ocfs2_downconvert_thread+0x5cf/0x930 [ocfs2]
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
Jan 26 10:08:31 bull-04 kernel: [<ffffffffa02098c9>] ? ocfs2_downconvert_thread+0x0/0x930 [ocfs2]
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81074c7e>] kthread+0x7f/0x87
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81012cea>] child_rip+0xa/0x20
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81074bff>] ? kthread+0x0/0x87
Jan 26 10:08:31 bull-04 kernel: [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Jan 26 10:08:31 bull-04 kernel: Code: 00 41 b8 d0 0c 00 00 48 c7 c1 f0 bf 25 a0 65 8b 14 25 68 e3 00 00 48 c7 c7 b6 36 26 a0 48 63 d2 31 c0 44 89 24 24 e8 b6 a3 22 e1 <0f> 0b eb fe f6 05 fa 29 fb ff 08 74 4a f6 05 f9 29 fb ff 08 75
Jan 26 10:08:31 bull-04 kernel: RIP  [<ffffffffa020693d>] ocfs2_prepare_downconvert+0x93/0x11c [ocfs2]
Jan 26 10:08:31 bull-04 kernel: RSP <ffff88007d301d90>
Jan 26 10:08:31 bull-04 kernel: ---[ end trace 930397e8616715ba ]---


> Wondering if you build with CONFIG_OCFS2_DEBUG_MASKLOG.

Yes I am.

Dave
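
The error line in these logs reports both l_level and new_level as 0, which is
the NL level under the usual DLM numbering, so the dc thread was apparently
asked to downconvert a lock that was already at the lowest level. A minimal
sketch of the sanity check implied by the mlog message follows; the function
name and return convention are hypothetical, and the kernel hits BUG() rather
than returning a value.

```c
#include <assert.h>
#include <stdio.h>

enum { S_NL = 0, S_PR = 3, S_EX = 5 }; /* assumed DLM levels */

/*
 * A downconvert only makes sense if the target level is strictly lower
 * than the level currently held. Returns 1 if so, 0 otherwise.
 */
static int s_downconvert_is_valid(int cur_level, int new_level)
{
	if (cur_level <= new_level) {
		fprintf(stderr, "l_level (%d) <= new_level (%d)\n",
			cur_level, new_level);
		return 0;
	}
	return 1;
}
```

With the values from the trace, s_downconvert_is_valid(S_NL, S_NL) fails the
check, which matches the condition printed just before the BUG.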

* [Ocfs2-devel] dlmglue fixes
  2010-01-26 19:53       ` David Teigland
@ 2010-01-29  0:21         ` Sunil Mushran
  0 siblings, 0 replies; 9+ messages in thread
From: Sunil Mushran @ 2010-01-29  0:21 UTC (permalink / raw)
  To: ocfs2-devel

David Teigland wrote:
> Oops, yeah, I missed copying that:
>
> Jan 26 10:08:31 bull-02 kernel: (1995,1):ocfs2_prepare_downconvert:3280 ERROR: lockres->l_level (0) <= new_level (0)
> Jan 26 10:08:31 bull-02 kernel: ------------[ cut here ]------------
> Jan 26 10:08:31 bull-02 kernel: kernel BUG at fs/ocfs2/dlmglue.c:3281!

Coly ran into the same issue some time ago.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1178

I am going thru the traces.

* [Ocfs2-devel] dlmglue fixes
  2010-01-26 16:37   ` David Teigland
  2010-01-26 19:18     ` Sunil Mushran
@ 2010-01-26 22:57     ` Sunil Mushran
  1 sibling, 0 replies; 9+ messages in thread
From: Sunil Mushran @ 2010-01-26 22:57 UTC (permalink / raw)
  To: ocfs2-devel

David Teigland wrote:
> I ran http://people.redhat.com/~teigland/make_panic on three nodes for 15
> minutes without any problem, so that's a big improvement.
>
> Then I tried another little test on three nodes which quickly triggered a
> BUG, http://people.redhat.com/~teigland/alternate.c
>
> node1: alternate test 0 0 3
> node2: alternate test 0 1 3
> node3: alternate test 0 2 3
>   

I ran the same test on 3 x86 nodes with o2dlm. It is chugging along.

100324 1114
100327 1062
100330 1289
100333 1131
100336 1142
100339 1222
100342 1233
100345 2941
100348 1228
100351 1141
100354 1070
100357 1197
100360 1159
100363 1223
100366 1220

I'll let it run. I'll also do the same test on a ppc cluster. I've had better
luck reproducing such races on that one. Fingers crossed.
