* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
@ 2010-02-03 18:16 Sunil Mushran
2010-02-04 10:27 ` Wengang Wang
0 siblings, 1 reply; 10+ messages in thread
From: Sunil Mushran @ 2010-02-03 18:16 UTC (permalink / raw)
To: ocfs2-devel
This patch plugs a race between the downconvert thread and an unlock ast message.
Specifically, after the downconvert worker has done its task, the dc thread needs
to check whether an unlock ast made the downconvert moot.
Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@sus.com>
---
fs/ocfs2/dlmglue.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 1c4d704..28b3e21 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3384,6 +3384,7 @@ static int ocfs2_unblock_lock(struct ocfs2_super *osb,
unsigned long flags;
int blocking;
int new_level;
+ int level;
int ret = 0;
int set_lvb = 0;
unsigned int gen;
@@ -3503,6 +3504,7 @@ recheck:
* may sleep, so we save off a copy of what we're blocking as
* it may change while we're not holding the spin lock. */
blocking = lockres->l_blocking;
+ level = lockres->l_level;
spin_unlock_irqrestore(&lockres->l_lock, flags);
ctl->unblock_action = lockres->l_ops->downconvert_worker(lockres, blocking);
@@ -3511,7 +3513,7 @@ recheck:
goto leave;
spin_lock_irqsave(&lockres->l_lock, flags);
- if (blocking != lockres->l_blocking) {
+ if ((blocking != lockres->l_blocking) || (level != lockres->l_level)) {
/* If this changed underneath us, then we can't drop
* it just yet. */
goto recheck;
--
1.6.3.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
2010-02-03 18:16 [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message Sunil Mushran
@ 2010-02-04 10:27 ` Wengang Wang
2010-02-04 19:36 ` Sunil Mushran
0 siblings, 1 reply; 10+ messages in thread
From: Wengang Wang @ 2010-02-04 10:27 UTC (permalink / raw)
To: ocfs2-devel
Hi Sunil,
Just for my knowledage.
By "unlock ast message", do you meant
ocfs2_locking_ast()->ocfs2_generic_handle_downconvert_action()?
If yes,
if l_blocking did not changed before ocfs2_generic_handle_downconvert_action(),
when l_level is set with a lower value, l_blocking must change.
So why we need to check l_level?
regards,
wengang.
On 10-02-03 10:16, Sunil Mushran wrote:
> This patch plugs a race between the downconvert thread and an unlock ast message.
> Specifically, after the downconvert worker has done its task, the dc thread needs
> to check whether an unlock ast made the downconvert moot.
>
> Reported-by: David Teigland <teigland@redhat.com>
> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
> Acked-by: Mark Fasheh <mfasheh@sus.com>
> ---
> fs/ocfs2/dlmglue.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 1c4d704..28b3e21 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -3384,6 +3384,7 @@ static int ocfs2_unblock_lock(struct ocfs2_super *osb,
> unsigned long flags;
> int blocking;
> int new_level;
> + int level;
> int ret = 0;
> int set_lvb = 0;
> unsigned int gen;
> @@ -3503,6 +3504,7 @@ recheck:
> * may sleep, so we save off a copy of what we're blocking as
> * it may change while we're not holding the spin lock. */
> blocking = lockres->l_blocking;
> + level = lockres->l_level;
> spin_unlock_irqrestore(&lockres->l_lock, flags);
>
> ctl->unblock_action = lockres->l_ops->downconvert_worker(lockres, blocking);
> @@ -3511,7 +3513,7 @@ recheck:
> goto leave;
>
> spin_lock_irqsave(&lockres->l_lock, flags);
> - if (blocking != lockres->l_blocking) {
> + if ((blocking != lockres->l_blocking) || (level != lockres->l_level)) {
> /* If this changed underneath us, then we can't drop
> * it just yet. */
> goto recheck;
> --
> 1.6.3.3
>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
2010-02-04 10:27 ` Wengang Wang
@ 2010-02-04 19:36 ` Sunil Mushran
2010-02-05 6:01 ` Joel Becker
2010-02-05 16:24 ` David Teigland
0 siblings, 2 replies; 10+ messages in thread
From: Sunil Mushran @ 2010-02-04 19:36 UTC (permalink / raw)
To: ocfs2-devel
Wengang Wang wrote:
> By "unlock ast message", do you meant
> ocfs2_locking_ast()->ocfs2_generic_handle_downconvert_action()?
>
> If yes,
> if l_blocking did not changed before ocfs2_generic_handle_downconvert_action(),
> when l_level is set with a lower value, l_blocking must change.
> So why we need to check l_level?
I meant ocfs2_unlock_ast. Specifically cancel convert. That is one case
that does not change l_blocking directly.
However, that does not change the l_level too. So I am unsure what sequence
of asts and basts (multiple ofcourse) can lead to this situation.
But the patch looks reasonable even if I cannot state the precise scenario
that leads to it.
David is rerunning the test. We'll know the results by tomorrow.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
2010-02-04 19:36 ` Sunil Mushran
@ 2010-02-05 6:01 ` Joel Becker
2010-02-05 18:39 ` Sunil Mushran
2010-02-05 16:24 ` David Teigland
1 sibling, 1 reply; 10+ messages in thread
From: Joel Becker @ 2010-02-05 6:01 UTC (permalink / raw)
To: ocfs2-devel
On Thu, Feb 04, 2010 at 11:36:30AM -0800, Sunil Mushran wrote:
> I meant ocfs2_unlock_ast. Specifically cancel convert. That is one case
> that does not change l_blocking directly.
>
> However, that does not change the l_level too. So I am unsure what sequence
> of asts and basts (multiple ofcourse) can lead to this situation.
Why cancel convert? Why not an actual unlock. That's what I
thought you meant when you proposed the patch. After all, David's bug
was at DLM_LOCK_IV, not NL.
Cancel isn't in play here. Cancel only happens on the dc
thread, and the dc thread has gotten past BUSY. So the lock isn't busy
anymore. It does downconvert_worker in an unlocked state. When it
comes back to recheck, another thread can't have done a cancel.
That's why I asked about unlink. Imagine the dc thread is
handling a bast while the unlink thread has gotten to clear_inode.
Could ocfs2_drop_lock() be racing the downconvert?
Joel
--
Life's Little Instruction Book #173
"Be kinder than necessary."
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
2010-02-04 19:36 ` Sunil Mushran
2010-02-05 6:01 ` Joel Becker
@ 2010-02-05 16:24 ` David Teigland
1 sibling, 0 replies; 10+ messages in thread
From: David Teigland @ 2010-02-05 16:24 UTC (permalink / raw)
To: ocfs2-devel
On Thu, Feb 04, 2010 at 11:36:30AM -0800, Sunil Mushran wrote:
> Wengang Wang wrote:
> >By "unlock ast message", do you meant
> >ocfs2_locking_ast()->ocfs2_generic_handle_downconvert_action()?
> >
> >If yes,
> >if l_blocking did not changed before
> >ocfs2_generic_handle_downconvert_action(),
> >when l_level is set with a lower value, l_blocking must change.
> >So why we need to check l_level?
>
> I meant ocfs2_unlock_ast. Specifically cancel convert. That is one case
> that does not change l_blocking directly.
>
> However, that does not change the l_level too. So I am unsure what sequence
> of asts and basts (multiple ofcourse) can lead to this situation.
>
> But the patch looks reasonable even if I cannot state the precise scenario
> that leads to it.
>
> David is rerunning the test. We'll know the results by tomorrow.
Got this after about 10 hours (and I believe before updatedb would have
started running):
Feb 4 21:50:23 bull-05 kernel: (8034,2,ocfs2dc):ocfs2_prepare_downconvert:3280
ERROR: lockres->l_level (-1) <= new_level (0)
Feb 4 21:50:23 bull-05 kernel: kernel BUG at fs/ocfs2/dlmglue.c:3281!
kernel BUG@fs/ocfs2/dlmglue.c:3281!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:0d.0/0000:03:00.0/irq
CPU 2
Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath shpchp amd64_edac_mod edac_core i2c_nforce2 i2c_core k8temp tg3 serio_raw qla2xxx mptspi mptscsih ata_generic scsi_transport_fc pata_acpi mptbase scsi_transport_spi scsi_tgt sata_nv pata_amd [last unloaded: scsi_wait_scan]
Pid: 8034, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
RIP: 0010:[<ffffffffa01df58f>] [<ffffffffa01df58f>] ocfs2_prepare_downconvert+0x9c/0x131 [ocfs2]
RSP: 0018:ffff88007ceffda0 EFLAGS: 00010082
RAX: 0000000000000064 RBX: ffff8800505808d0 RCX: 00000000000023a6
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
RBP: ffff88007ceffdd0 R08: ffff88007ceffca0 R09: 0000000000000000
R10: 0000000000000000 R11: 000000107aa1fc00 R12: 0000000000000000
R13: 0000000000000293 R14: ffff880139fcd000 R15: ffff8800505808e8
FS: 00007f1627c76700(0000) GS:ffff880082000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fffe0020f08 CR3: 0000000139fb3000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ocfs2dc (pid: 8034, threadinfo ffff88007cefe000, task ffff88007cf3dd00)
Stack:
ffff8800ffffffff 0000000000000000 ffff88007ceffe80 ffff8800505808d0
<0> ffff8800505808d0 0000000000000000 ffff88007ceffee0 ffffffffa01e2e27
<0> ffff88007ceffe10 ffffffff8104e867 ffff88007d372d80 0000000000000000
Call Trace:
[<ffffffffa01e2e27>] ocfs2_downconvert_thread+0x674/0xa0e [ocfs2]
[<ffffffff8104e867>] ? finish_task_switch+0x58/0x77
[<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
[<ffffffffa01e27b3>] ? ocfs2_downconvert_thread+0x0/0xa0e [ocfs2]
[<ffffffff81074c7e>] kthread+0x7f/0x87
[<ffffffff81012cea>] child_rip+0xa/0x20
[<ffffffff81074bff>] ? kthread+0x0/0x87
[<ffffffff81012ce0>] ? child_rip+0x0/0x20
Code: 65 8b 14 25 68 e3 00 00 41 b9 d0 0c 00 00 48 63 d2 49 c7 c0 70 99 23 a0 48 c7 c7 25 12 24 a0 31 c0 44 89 64 24 08 e8 64 17 25 e1 <0f> 0b eb fe f6 05 48 95 fb ff 08 74 56 f6 05 47 95 fb ff 08 75
RIP [<ffffffffa01df58f>] ocfs2_prepare_downconvert+0x9c/0x131 [ocfs2]
RSP <ffff88007ceffda0>
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
2010-02-05 6:01 ` Joel Becker
@ 2010-02-05 18:39 ` Sunil Mushran
2010-02-05 19:33 ` Mark Fasheh
0 siblings, 1 reply; 10+ messages in thread
From: Sunil Mushran @ 2010-02-05 18:39 UTC (permalink / raw)
To: ocfs2-devel
Joel Becker wrote:
> Why cancel convert? Why not an actual unlock. That's what I
> thought you meant when you proposed the patch. After all, David's bug
> was at DLM_LOCK_IV, not NL.
> Cancel isn't in play here. Cancel only happens on the dc
> thread, and the dc thread has gotten past BUSY. So the lock isn't busy
> anymore. It does downconvert_worker in an unlocked state. When it
> comes back to recheck, another thread can't have done a cancel.
> That's why I asked about unlink. Imagine the dc thread is
> handling a bast while the unlink thread has gotten to clear_inode.
> Could ocfs2_drop_lock() be racing the downconvert?
Yes. I forgot the level was IV. So the full fix would be to ensure
the level is <= NL in the block below. ??
/*
* How can we block and yet be@NL? We were trying to upconvert
* from NL and got canceled. The code comes back here, and now
* we notice and clear BLOCKING.
*/
if (lockres->l_level == DLM_LOCK_NL) {
BUG_ON(lockres->l_ex_holders || lockres->l_ro_holders);
lockres->l_blocking = DLM_LOCK_NL;
lockres_clear_flags(lockres, OCFS2_LOCK_BLOCKED);
spin_unlock_irqrestore(&lockres->l_lock, flags);
goto leave;
}
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
2010-02-05 18:39 ` Sunil Mushran
@ 2010-02-05 19:33 ` Mark Fasheh
2010-02-05 23:37 ` Sunil Mushran
0 siblings, 1 reply; 10+ messages in thread
From: Mark Fasheh @ 2010-02-05 19:33 UTC (permalink / raw)
To: ocfs2-devel
On Fri, Feb 05, 2010 at 10:39:47AM -0800, Sunil Mushran wrote:
> Joel Becker wrote:
> > Why cancel convert? Why not an actual unlock. That's what I
> >thought you meant when you proposed the patch. After all, David's bug
> >was at DLM_LOCK_IV, not NL.
> > Cancel isn't in play here. Cancel only happens on the dc
> >thread, and the dc thread has gotten past BUSY. So the lock isn't busy
> >anymore. It does downconvert_worker in an unlocked state. When it
> >comes back to recheck, another thread can't have done a cancel.
> > That's why I asked about unlink. Imagine the dc thread is
> >handling a bast while the unlink thread has gotten to clear_inode.
> >Could ocfs2_drop_lock() be racing the downconvert?
Hmm, if busy is being cleared but the downconvert thread is still working on
it, then yes this could be a possibility - ocfs2_drop_lock only really waits
on the busy flag.
> Yes. I forgot the level was IV. So the full fix would be to ensure
> the level is <= NL in the block below. ??
>
>
> /*
> * How can we block and yet be at NL? We were trying to upconvert
> * from NL and got canceled. The code comes back here, and now
> * we notice and clear BLOCKING.
> */
> if (lockres->l_level == DLM_LOCK_NL) {
> BUG_ON(lockres->l_ex_holders || lockres->l_ro_holders);
> lockres->l_blocking = DLM_LOCK_NL;
> lockres_clear_flags(lockres, OCFS2_LOCK_BLOCKED);
> spin_unlock_irqrestore(&lockres->l_lock, flags);
> goto leave;
> }
I really want to say that we should prevent those two threads from racing in
the first place. How to do that though might be problematic. I'm not sure
simply spinning on OCFS2_LOCK_QUEUED might not be enough - that flag gets
cleared before the ->post_unlock callback...
--Mark
--
Mark Fasheh
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
2010-02-05 19:33 ` Mark Fasheh
@ 2010-02-05 23:37 ` Sunil Mushran
2010-02-09 20:55 ` David Teigland
0 siblings, 1 reply; 10+ messages in thread
From: Sunil Mushran @ 2010-02-05 23:37 UTC (permalink / raw)
To: ocfs2-devel
Mark Fasheh wrote:
> I really want to say that we should prevent those two threads from racing in
> the first place. How to do that though might be problematic. I'm not sure
> simply spinning on OCFS2_LOCK_QUEUED might not be enough - that flag gets
> cleared before the ->post_unlock callback...
David,
Please could you rerun with the following debug patch.
http://oss.oracle.com/~smushran/.dlmglue/0001-ocfs2-Patch-to-debug-hang-in-dlmglue-when-running-dl.patch
May make sense to fire it either later in the day or fire mlocate
a shortwhile after starting the test. That's to reduce the trace logs.
Thanks
Sunil
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
2010-02-05 23:37 ` Sunil Mushran
@ 2010-02-09 20:55 ` David Teigland
2010-02-09 21:49 ` Sunil Mushran
0 siblings, 1 reply; 10+ messages in thread
From: David Teigland @ 2010-02-09 20:55 UTC (permalink / raw)
To: ocfs2-devel
On Fri, Feb 05, 2010 at 03:37:11PM -0800, Sunil Mushran wrote:
> Mark Fasheh wrote:
> >I really want to say that we should prevent those two threads from racing
> >in
> >the first place. How to do that though might be problematic. I'm not sure
> >simply spinning on OCFS2_LOCK_QUEUED might not be enough - that flag gets
> >cleared before the ->post_unlock callback...
>
> David,
>
> Please could you rerun with the following debug patch.
>
> http://oss.oracle.com/~smushran/.dlmglue/0001-ocfs2-Patch-to-debug-hang-in-dlmglue-when-running-dl.patch
>
> May make sense to fire it either later in the day or fire mlocate
> a shortwhile after starting the test. That's to reduce the trace logs.
BUG hit just five minutes after starting make_panic and alternate on four
nodes. (I can send history before this if you need.)
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_locking_ast:1104 lock M0000000000000000097bda00000000, action 3, unlock 0, level 3, newlevel 0
Feb 9 14:09:52 bull-05 kernel: (7085,0,alternate):__ocfs2_cluster_lock:1424 lock M000000000000000003f00400000000, convert from 0 to 3
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000003f00400000000, action 2, unlock 0, level 0, newlevel 3
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_blocking_ast:1059 BAST fired for lockres M000000000000000003f00400000000, blocking 5, level 3 type Meta
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_generic_handle_bast:932 lockres M000000000000000003f00400000000, block 5, level 3, l_block 5, dwn 1
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000003f00400000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres M000000000000000003f00400000000, requeue = yes.
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000003f00400000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_prepare_downconvert:3300 lock M000000000000000003f00400000000, new_level = 0, l_blocking = 5
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_downconvert_lock:3320 lock M000000000000000003f00400000000, level 3 => 0
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres M000000000000000003f00400000000, requeue = no.
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000003f00400000000, action 3, unlock 0, level 3, newlevel 0
Feb 9 14:09:52 bull-05 kernel: (7085,0,alternate):__ocfs2_cluster_lock:1424 lock M000000000000000003f00400000000, convert from 0 to 3
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000003f00400000000, action 2, unlock 0, level 0, newlevel 3
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_blocking_ast:1059 BAST fired for lockres M000000000000000003f00400000000, blocking 5, level 3 type Meta
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_generic_handle_bast:932 lockres M000000000000000003f00400000000, block 5, level 3, l_block 5, dwn 1
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000003f00400000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres M000000000000000003f00400000000, requeue = yes.
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000003f00400000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_prepare_downconvert:3300 lock M000000000000000003f00400000000, new_level = 0, l_blocking = 5
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_downconvert_lock:3320 lock M000000000000000003f00400000000, level 3 => 0
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres M000000000000000003f00400000000, requeue = no.
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000003f00400000000, action 3, unlock 0, level 3, newlevel 0
Feb 9 14:09:52 bull-05 kernel: (7085,0,alternate):__ocfs2_cluster_lock:1424 lock M000000000000000003f00400000000, convert from 0 to 3
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000003f00400000000, action 2, unlock 0, level 0, newlevel 3
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_blocking_ast:1059 BAST fired for lockres M000000000000000003f00400000000, blocking 5, level 3 type Meta
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_generic_handle_bast:932 lockres M000000000000000003f00400000000, block 5, level 3, l_block 5, dwn 1
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000009732800000000, action 2, unlock 0, level 0, newlevel 5
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000003f00400000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres M000000000000000003f00400000000, requeue = yes.
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000003f00400000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_locking_ast:1104 lock N000000000003f028, action 1, unlock 0, level -1, newlevel 3
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_prepare_downconvert:3300 lock M000000000000000003f00400000000, new_level = 0, l_blocking = 5
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_downconvert_lock:3320 lock M000000000000000003f00400000000, level 3 => 0
Feb 9 14:09:52 bull-05 kernel: (7009,0,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres M000000000000000003f00400000000, requeue = no.
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000003f00400000000, action 3, unlock 0, level 3, newlevel 0
Feb 9 14:09:52 bull-05 kernel: (7085,0,alternate):__ocfs2_cluster_lock:1424 lock M000000000000000003f00400000000, convert from 0 to 3
Feb 9 14:09:52 bull-05 kernel: (7083,3,make_panic):__ocfs2_cluster_lock:1424 lock W0000000000000000056e1934638d85, convert from -1 to 5
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000003f00400000000, action 2, unlock 0, level 0, newlevel 3
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_blocking_ast:1059 BAST fired for lockres M000000000000000003f00400000000, blocking 5, level 3 type Meta
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_generic_handle_bast:932 lockres M000000000000000003f00400000000, block 5, level 3, l_block 5, dwn 1
Feb 9 14:09:52 bull-05 kernel: (6696,0,ocfs2_wq):__ocfs2_cluster_lock:1424 lock O000000000000000009732800000000, convert from 3 to 5
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_blocking_ast:1059 BAST fired for lockres N000000000003f020, blocking 5, level 3 type Dentry
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_generic_handle_bast:932 lockres N000000000003f020, block 5, level 3, l_block 5, dwn 1
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000003f00400000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres M000000000000000003f00400000000, requeue = yes.
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres N000000000003f020 blocked.
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_unblock_lock:3544 lock N000000000003f020, UNBLOCK_STOP_POST
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres N000000000003f020, requeue = no.
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_unlock_ast:3075 UNLOCK AST called on lock N000000000003f020, action = 2
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000003f00400000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_unlock_ast:3075 UNLOCK AST called on lock O000000000000000009732800000000, action = 2
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_blocking_ast:1059 BAST fired for lockres M000000000000000008f41000000000, blocking 3, level 5 type Meta
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_generic_handle_bast:932 lockres M000000000000000008f41000000000, block 3, level 5, l_block 3, dwn 1
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_unlock_ast:3075 UNLOCK AST called on lock M000000000000000009732800000000, action = 2
Feb 9 14:09:52 bull-05 kernel: (6696,0,ocfs2_wq):__ocfs2_cluster_lock:1424 lock M0000000000000000097bda00000000, convert from 0 to 5
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_blocking_ast:1059 BAST fired for lockres M000000000000000009732800000000, blocking 5, level -1 type Meta
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_generic_handle_bast:932 lockres M000000000000000009732800000000, block 5, level -1, l_block 5, dwn 1
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_prepare_downconvert:3300 lock M000000000000000003f00400000000, new_level = 0, l_blocking = 5
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_downconvert_lock:3320 lock M000000000000000003f00400000000, level 3 => 0
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres M000000000000000003f00400000000, requeue = no.
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000008f41000000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (7084,0,make_panic):__ocfs2_cluster_lock:1424 lock O000000000000000008e0d900000000, convert from -1 to 3
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_prepare_downconvert:3300 lock M000000000000000008f41000000000, new_level = 3, l_blocking = 3
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_downconvert_lock:3320 lock M000000000000000008f41000000000, level 5 => 3
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000003f00400000000, action 3, unlock 0, level 3, newlevel 0
Feb 9 14:09:52 bull-05 kernel: (7085,0,alternate):__ocfs2_cluster_lock:1424 lock M000000000000000003f00400000000, convert from 0 to 3
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000008f41000000000, action 3, unlock 0, level 5, newlevel 3
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_locking_ast:1104 lock M000000000000000003f00400000000, action 2, unlock 0, level 0, newlevel 3
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_blocking_ast:1059 BAST fired for lockres M000000000000000003f00400000000, blocking 5, level 3 type Meta
Feb 9 14:09:52 bull-05 kernel: (6983,2,dlm_astd):ocfs2_generic_handle_bast:932 lockres M000000000000000003f00400000000, block 5, level 3, l_block 5, dwn 1
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_process_blocked_lock:4028 lockres M000000000000000008f41000000000, requeue = no.
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_process_blocked_lock:4004 lockres M000000000000000009732800000000 blocked.
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):ocfs2_prepare_downconvert:3295 ERROR: lock M000000000000000009732800000000, lvl -1 <= 0, blcklst 1, mask 1, type 0, flags 0x104, hold 0 0, act 0 0, req 5, block 5, pgen 7
Feb 9 14:09:52 bull-05 kernel: ------------[ cut here ]------------
Feb 9 14:09:52 bull-05 kernel: kernel BUG@fs/ocfs2/dlmglue.c:3296!
Feb 9 14:09:52 bull-05 kernel: invalid opcode: 0000 [#1] SMP
Feb 9 14:09:52 bull-05 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:0d.0/0000:03:00.0/irq
Feb 9 14:09:52 bull-05 kernel: CPU 2
Feb 9 14:09:52 bull-05 kernel: Modules linked in: ocfs2_stack_user dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue sunrpc ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath amd64_edac_mod edac_core i2c_nforce2 shpchp k8temp i2c_core tg3 serio_raw qla2xxx mptspi mptscsih ata_generic scsi_transport_fc pata_acpi mptbase scsi_tgt scsi_transport_spi sata_nv pata_amd [last unloaded: scsi_wait_scan]
Feb 9 14:09:52 bull-05 kernel: Pid: 7009, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2
Feb 9 14:09:52 bull-05 kernel: RIP: 0010:[<ffffffffa01d9632>] [<ffffffffa01d9632>] ocfs2_prepare_downconvert+0x121/0x1c0 [ocfs2]
Feb 9 14:09:52 bull-05 kernel: RSP: 0018:ffff88007aaabd40 EFLAGS: 00010082
Feb 9 14:09:52 bull-05 kernel: RAX: 00000000000000c2 RBX: ffff880114d19098 RCX: 000000000000e8a4
Feb 9 14:09:52 bull-05 kernel: RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
Feb 9 14:09:52 bull-05 kernel: RBP: ffff88007aaabdc0 R08: 0007ffffffffffff R09: 0000000000000000
Feb 9 14:09:52 bull-05 kernel: R10: 0000000000000000 R11: ffff88002821dd00 R12: 0000000000000000
Feb 9 14:09:52 bull-05 kernel: R13: 0000000000000000 R14: ffff88007cd41000 R15: ffff880114d19068
Feb 9 14:09:52 bull-05 kernel: FS: 00007f6e99054710(0000) GS:ffff880082000000(0000) knlGS:0000000000000000
Feb 9 14:09:52 bull-05 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Feb 9 14:09:52 bull-05 kernel: CR2: 00007f9acad1a000 CR3: 000000013a47d000 CR4: 00000000000006e0
Feb 9 14:09:52 bull-05 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 9 14:09:52 bull-05 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 9 14:09:52 bull-05 kernel: Process ocfs2dc (pid: 7009, threadinfo ffff88007aaaa000, task ffff88007aad1740)
Feb 9 14:09:52 bull-05 kernel: Stack:
Feb 9 14:09:52 bull-05 kernel: ffff880114d19098 ffff8801ffffffff 0000000000000000 ffff880100000001
Feb 9 14:09:52 bull-05 kernel: <0> ffff880000000001 ffff880100000000 0000000000000104 ffffffff00000000
Feb 9 14:09:52 bull-05 kernel: <0> ffff880000000000 ffffffff00000000 ffff880100000000 ffff880100000005
Feb 9 14:09:52 bull-05 kernel: Call Trace:
Feb 9 14:09:52 bull-05 kernel: [<ffffffffa01dd337>] ocfs2_downconvert_thread+0x954/0xd02 [ocfs2]
Feb 9 14:09:52 bull-05 kernel: [<ffffffff8104e867>] ? finish_task_switch+0x58/0x77
Feb 9 14:09:52 bull-05 kernel: [<ffffffff81074f6b>] ? autoremove_wake_function+0x0/0x39
Feb 9 14:09:52 bull-05 kernel: [<ffffffffa01dc9e3>] ? ocfs2_downconvert_thread+0x0/0xd02 [ocfs2]
Feb 9 14:09:52 bull-05 kernel: [<ffffffff81074c7e>] kthread+0x7f/0x87
Feb 9 14:09:52 bull-05 kernel: [<ffffffff81012cea>] child_rip+0xa/0x20
Feb 9 14:09:52 bull-05 kernel: [<ffffffff81074bff>] ? kthread+0x0/0x87
Feb 9 14:09:52 bull-05 kernel: [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Feb 9 14:09:52 bull-05 kernel: Code: 43 18 89 7c 24 08 48 c7 c7 52 b7 23 a0 44 89 64 24 10 0f 94 c0 48 83 c3 48 0f b6 c0 48 89 1c 24 89 44 24 18 31 c0 e8 c1 76 25 e1 <0f> 0b eb fe 48 ba 00 00 08 00 02 00 00 00 48 85 15 99 94 fb ff
Feb 9 14:09:52 bull-05 kernel: RIP [<ffffffffa01d9632>] ocfs2_prepare_downconvert+0x121/0x1c0 [ocfs2]
Feb 9 14:09:52 bull-05 kernel: RSP <ffff88007aaabd40>
Feb 9 14:09:52 bull-05 kernel: ---[ end trace 3d047094e011f93c ]---
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message
2010-02-09 20:55 ` David Teigland
@ 2010-02-09 21:49 ` Sunil Mushran
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Mushran @ 2010-02-09 21:49 UTC (permalink / raw)
To: ocfs2-devel
David Teigland wrote:
> BUG hit just five minutes after starting make_panic and alternate on four
> nodes. (I can send history before this if you need.)
>
Feb 9 14:09:52 bull-05 kernel: (6983,0,dlm_astd):
ocfs2_locking_ast:1104 lock M000000000000000009732800000000, action 2,
unlock 0, level 0, newlevel 5
ocfs2_unlock_ast:3075 UNLOCK AST called on lock
M000000000000000009732800000000, action = 2 (DROP_LOCK)
ocfs2_blocking_ast:1059 BAST fired for lockres
M000000000000000009732800000000, blocking 5, level -1 type Meta
ocfs2_generic_handle_bast:932 lockres M000000000000000009732800000000,
block 5, level -1, l_block 5, dwn 1
Feb 9 14:09:52 bull-05 kernel: (7009,2,ocfs2dc):
ocfs2_process_blocked_lock:4004 lockres M000000000000000009732800000000
blocked.
ocfs2_prepare_downconvert:3295 ERROR: lock
M000000000000000009732800000000, lvl -1 <= 0, blcklst 1, mask 1, type 0,
flags 0x104, h
So the BAST fires after the UNLOCK AST (DROP_LOCK). That is what
we thought.
Why does ocfs2_generic_handle_bast() not take into account the current
lockres->l_level?
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2010-02-09 21:49 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-02-03 18:16 [Ocfs2-devel] [PATCH] ocfs2: Plugs race between the dc thread and an unlock ast message Sunil Mushran
2010-02-04 10:27 ` Wengang Wang
2010-02-04 19:36 ` Sunil Mushran
2010-02-05 6:01 ` Joel Becker
2010-02-05 18:39 ` Sunil Mushran
2010-02-05 19:33 ` Mark Fasheh
2010-02-05 23:37 ` Sunil Mushran
2010-02-09 20:55 ` David Teigland
2010-02-09 21:49 ` Sunil Mushran
2010-02-05 16:24 ` David Teigland
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.