* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock
@ 2010-01-06 8:34 Wengang Wang
2010-01-07 2:00 ` Joel Becker
0 siblings, 1 reply; 27+ messages in thread
From: Wengang Wang @ 2010-01-06 8:34 UTC (permalink / raw)
To: ocfs2-devel
there is deadlock possibility in __ocfs2_cluster_lock().
the case is like following.
#in time order
1) node A(the lock owner) is doing an up-convert(UC, PR->EX). it got the lock
but didn't check BUSY flag again.
2) node B requested an UC on the same lock resource. since node A has an EX on
the lockres, a bast is issued and OCFS2_LOCK_BLOCKED flag is set to the lockres
meanting that a DC is should be done. the DC asks for agreement of node A to
release EX. and for now node A refuses that(still in process of getting that
lock:) ).
3) the UC on node A runs into the phrase of checking OCFS2_LOCK_BLOCKED flag and
if it can continue without waiting for any DC. code is:
if (lockres->l_flags & OCFS2_LOCK_BLOCKED &&
!ocfs2_may_continue_on_blocked_lock(lockres, level)) {
analysis:
the BLOCKED flag is set in 2), and the UC can't continue to get the lock(
ocfs2_may_continue_on_blocked_lock() returns false), so it the DC to finish.
so both UC and DC are waiting for each other --a dead lock.
the deadlock happens when the UC is a PR->EX and there happen to be a DC queued
setting OCFS2_LOCK_BLOCKED flag when the UC checks that flag again.
fix:
fix in this patch is that if the EX lock is got, we don't check OCFS2_LOCK_BLOCKED
flag again. since we have got the EX lock, we don't need to wait for any DC to
finish. and we don't agree to DC the EX lock, so it' can't be DCed. --it's safe.
another solution could be in ocfs2_may_continue_on_blocked_lock().
currently it compares the wanted level and blocking level. we can change that
to check each dlm_lock attached to the lockres and ignore the one for local
then it works. but it's too complex and affects a lot. --give it up.
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index e18eadf..269fb47 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -1313,6 +1313,7 @@ static int __ocfs2_cluster_lock(struct ocfs2_super *osb,
unsigned long flags;
unsigned int gen;
int noqueue_attempted = 0;
+ int lock_done = 0;
mlog_entry_void();
@@ -1347,6 +1348,18 @@ again:
goto unlock;
}
+ if (lock_done) {
+ /* cluster lock completed already, don't wait for down-
+ * convertion to finish.
+ * if we do, in case another DC happen to be there(not on behalf
+ * of this lock request), and ocfs2_may_continue_on_blocked_lock()
+ * returns false(in case we want an EX), we go into deadlock:
+ * DC is waiting for our release of the EX(check_downconvert())
+ * and we are waiting for that DC.
+ */
+ goto update_holders;
+ }
+
if (lockres->l_flags & OCFS2_LOCK_BLOCKED &&
!ocfs2_may_continue_on_blocked_lock(lockres, level)) {
/* is the lock is currently blocked on behalf of
@@ -1414,9 +1427,11 @@ again:
catch_signals = 0;
/* wait for busy to clear and carry on */
+ lock_done = 1;
goto again;
}
+update_holders:
/* Ok, if we get here then we're good to go. */
ocfs2_inc_holders(lockres, level);
--
1.6.2.5
^ permalink raw reply related [flat|nested] 27+ messages in thread* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-06 8:34 [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock Wengang Wang @ 2010-01-07 2:00 ` Joel Becker 2010-01-07 5:22 ` Wengang Wang 2010-01-09 18:05 ` Wengang Wang 0 siblings, 2 replies; 27+ messages in thread From: Joel Becker @ 2010-01-07 2:00 UTC (permalink / raw) To: ocfs2-devel On Wed, Jan 06, 2010 at 04:34:44PM +0800, Wengang Wang wrote: > there is deadlock possibility in __ocfs2_cluster_lock(). > the case is like following. > > #in time order > 1) node A(the lock owner) is doing an up-convert(UC, PR->EX). it got the lock > but didn't check BUSY flag again. > > 2) node B requested an UC on the same lock resource. since node A has an EX on > the lockres, a bast is issued and OCFS2_LOCK_BLOCKED flag is set to the lockres > meanting that a DC is should be done. the DC asks for agreement of node A to > release EX. and for now node A refuses that(still in process of getting that > lock:) ). > > 3) the UC on node A runs into the phrase of checking OCFS2_LOCK_BLOCKED flag and > if it can continue without waiting for any DC. code is: > > if (lockres->l_flags & OCFS2_LOCK_BLOCKED && > !ocfs2_may_continue_on_blocked_lock(lockres, level)) { > > analysis: > the BLOCKED flag is set in 2), and the UC can't continue to get the lock( > ocfs2_may_continue_on_blocked_lock() returns false), so it the DC to finish. > > so both UC and DC are waiting for each other --a dead lock. 1) Do you have a test case? Have you seen this happen, or are you just postulating from reading the code? 2) It sure looks like you're right. 3) I hate our locking spaghetti! Joel -- "I don't know anything about music. In my line you don't have to." - Elvis Presley Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-07 2:00 ` Joel Becker @ 2010-01-07 5:22 ` Wengang Wang 2010-01-09 18:05 ` Wengang Wang 1 sibling, 0 replies; 27+ messages in thread From: Wengang Wang @ 2010-01-07 5:22 UTC (permalink / raw) To: ocfs2-devel Hi Joel, On 10-01-06 18:00, Joel Becker wrote: > On Wed, Jan 06, 2010 at 04:34:44PM +0800, Wengang Wang wrote: > > there is deadlock possibility in __ocfs2_cluster_lock(). > > the case is like following. > > > > #in time order > > 1) node A(the lock owner) is doing an up-convert(UC, PR->EX). it got the lock > > but didn't check BUSY flag again. > > > > 2) node B requested an UC on the same lock resource. since node A has an EX on > > the lockres, a bast is issued and OCFS2_LOCK_BLOCKED flag is set to the lockres > > meanting that a DC is should be done. the DC asks for agreement of node A to > > release EX. and for now node A refuses that(still in process of getting that > > lock:) ). > > > > 3) the UC on node A runs into the phrase of checking OCFS2_LOCK_BLOCKED flag and > > if it can continue without waiting for any DC. code is: > > > > if (lockres->l_flags & OCFS2_LOCK_BLOCKED && > > !ocfs2_may_continue_on_blocked_lock(lockres, level)) { > > > > analysis: > > the BLOCKED flag is set in 2), and the UC can't continue to get the lock( > > ocfs2_may_continue_on_blocked_lock() returns false), so it the DC to finish. > > > > so both UC and DC are waiting for each other --a dead lock. > > 1) Do you have a test case? Have you seen this happen, or are you just > postulating from reading the code? yes, I have test case for this. when I was testing freeze/thaw, it was hit. while, the code for freeze/thaw is not sent out yet :P the main cluster lock iterations is like this: 0) a cluster lock for freeze/thaw is added. it works like open lock on inode. so node A and node B both have PR lock on it. 1) node A want to freeze the fs, so requested for a EX. 2) node B released PR. and very immediately requests a PR waiting for thawing. 3) things described above happened. regards, wengang. > 2) It sure looks like you're right. > 3) I hate our locking spaghetti! ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-07 2:00 ` Joel Becker 2010-01-07 5:22 ` Wengang Wang @ 2010-01-09 18:05 ` Wengang Wang 2010-01-09 18:24 ` Sunil Mushran 2010-01-12 1:59 ` Joel Becker 1 sibling, 2 replies; 27+ messages in thread From: Wengang Wang @ 2010-01-09 18:05 UTC (permalink / raw) To: ocfs2-devel On 10-01-06 18:00, Joel Becker wrote: > On Wed, Jan 06, 2010 at 04:34:44PM +0800, Wengang Wang wrote: > > there is deadlock possibility in __ocfs2_cluster_lock(). > > the case is like following. > > > > #in time order > > 1) node A(the lock owner) is doing an up-convert(UC, PR->EX). it got the lock > > but didn't check BUSY flag again. > > > > 2) node B requested an UC on the same lock resource. since node A has an EX on > > the lockres, a bast is issued and OCFS2_LOCK_BLOCKED flag is set to the lockres > > meanting that a DC is should be done. the DC asks for agreement of node A to > > release EX. and for now node A refuses that(still in process of getting that > > lock:) ). > > > > 3) the UC on node A runs into the phrase of checking OCFS2_LOCK_BLOCKED flag and > > if it can continue without waiting for any DC. code is: > > > > if (lockres->l_flags & OCFS2_LOCK_BLOCKED && > > !ocfs2_may_continue_on_blocked_lock(lockres, level)) { > > > > analysis: > > the BLOCKED flag is set in 2), and the UC can't continue to get the lock( > > ocfs2_may_continue_on_blocked_lock() returns false), so it the DC to finish. > > > > so both UC and DC are waiting for each other --a dead lock. > > 1) Do you have a test case? Have you seen this happen, or are you just > postulating from reading the code? > 2) It sure looks like you're right. well, this doesn't sounds like a sob :) regards, wengang. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-09 18:05 ` Wengang Wang @ 2010-01-09 18:24 ` Sunil Mushran 2010-01-10 10:05 ` Joel Becker 2010-01-12 1:59 ` Joel Becker 1 sibling, 1 reply; 27+ messages in thread From: Sunil Mushran @ 2010-01-09 18:24 UTC (permalink / raw) To: ocfs2-devel This needs testing. Joel has forwarded it to Fabbione who has a testcase that we believe triggers this issue. If that works, I'll push this in the 1.6 testing tree. That way we'll know (a) the patch fixes the problem, and (b) does not break other things. This is a tricky function. We are just being careful. On Jan 9, 2010, at 10:05 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > On 10-01-06 18:00, Joel Becker wrote: >> On Wed, Jan 06, 2010 at 04:34:44PM +0800, Wengang Wang wrote: >>> there is deadlock possibility in __ocfs2_cluster_lock(). >>> the case is like following. >>> >>> #in time order >>> 1) node A(the lock owner) is doing an up-convert(UC, PR->EX). it >>> got the lock >>> but didn't check BUSY flag again. >>> >>> 2) node B requested an UC on the same lock resource. since node A >>> has an EX on >>> the lockres, a bast is issued and OCFS2_LOCK_BLOCKED flag is set >>> to the lockres >>> meanting that a DC is should be done. the DC asks for agreement of >>> node A to >>> release EX. and for now node A refuses that(still in process of >>> getting that >>> lock:) ). >>> >>> 3) the UC on node A runs into the phrase of checking >>> OCFS2_LOCK_BLOCKED flag and >>> if it can continue without waiting for any DC. code is: >>> >>> if (lockres->l_flags & OCFS2_LOCK_BLOCKED && >>> !ocfs2_may_continue_on_blocked_lock(lockres, level)) { >>> >>> analysis: >>> the BLOCKED flag is set in 2), and the UC can't continue to get >>> the lock( >>> ocfs2_may_continue_on_blocked_lock() returns false), so it the DC >>> to finish. >>> >>> so both UC and DC are waiting for each other --a dead lock. >> >> 1) Do you have a test case? Have you seen this happen, or are you >> just >> postulating from reading the code? >> 2) It sure looks like you're right. > well, this doesn't sounds like a sob :) > > regards, > wengang. > > _______________________________________________ > Ocfs2-devel mailing list > Ocfs2-devel at oss.oracle.com > http://oss.oracle.com/mailman/listinfo/ocfs2-devel ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-09 18:24 ` Sunil Mushran @ 2010-01-10 10:05 ` Joel Becker 2010-01-11 1:55 ` Wengang Wang 0 siblings, 1 reply; 27+ messages in thread From: Joel Becker @ 2010-01-10 10:05 UTC (permalink / raw) To: ocfs2-devel On Sat, Jan 09, 2010 at 10:24:10AM -0800, Sunil Mushran wrote: > This needs testing. Joel has forwarded it to Fabbione who has a > testcase that we believe triggers this issue. If that works, I'll > push this in the 1.6 testing tree. That way we'll know (a) the patch > fixes the problem, and (b) does not break other things. This is a > tricky function. We are just being careful. Wengang, Like Sunil said, this is a very critical part of the code. Our locking code is the biggest place for lockups and corruptions to happen. We always take the safe route here. We're working to test both of your patches. Joel -- Life's Little Instruction Book #198 "Feed a stranger's expired parking meter." Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-10 10:05 ` Joel Becker @ 2010-01-11 1:55 ` Wengang Wang 0 siblings, 0 replies; 27+ messages in thread From: Wengang Wang @ 2010-01-11 1:55 UTC (permalink / raw) To: ocfs2-devel Hi Joel and Sunil, Sure I agree with what is doing with the patches --It's just a reminder not to forget it :). regards, wengang. On 10-01-10 02:05, Joel Becker wrote: > On Sat, Jan 09, 2010 at 10:24:10AM -0800, Sunil Mushran wrote: > > This needs testing. Joel has forwarded it to Fabbione who has a > > testcase that we believe triggers this issue. If that works, I'll > > push this in the 1.6 testing tree. That way we'll know (a) the patch > > fixes the problem, and (b) does not break other things. This is a > > tricky function. We are just being careful. > > Wengang, > Like Sunil said, this is a very critical part of the code. Our > locking code is the biggest place for lockups and corruptions to happen. > We always take the safe route here. We're working to test both of your > patches. > > Joel > > -- > > Life's Little Instruction Book #198 > > "Feed a stranger's expired parking meter." > > Joel Becker > Principal Software Developer > Oracle > E-mail: joel.becker at oracle.com > Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-09 18:05 ` Wengang Wang 2010-01-09 18:24 ` Sunil Mushran @ 2010-01-12 1:59 ` Joel Becker 2010-01-12 3:58 ` Wengang Wang ` (2 more replies) 1 sibling, 3 replies; 27+ messages in thread From: Joel Becker @ 2010-01-12 1:59 UTC (permalink / raw) To: ocfs2-devel [Cc'd Mark and Dave] On Sun, Jan 10, 2010 at 02:05:22AM +0800, Wengang Wang wrote: > > 2) It sure looks like you're right. > well, this doesn't sounds like a sob :) With good reason, too. You've got a bug. Dave Teigland (copied now) found it while testing with fs/dlm. The short description is that you didn't notice the handling of async -EAGAIN. Not really your fault, there are no obvious comments. With o2dlm, you get -EAGAIN from the call to ocfs2_dlm_lock(). That's safe with your patch. However, with fs/dlm, EAGAIN comes asynchronously. ocfs2_dlm_lock() will return 0. When the ast fires, the ast notices the -EAGAIN and clears the locking state for a retry. In your patch, however, you take the 'lock_done' + !BUSY and immediately increment the holders - even though the lock hasn't been acquired! The fix is pretty simple, though - when lock_done is set, you know we've at least tried to upconvert. If the locking level is good, we know the upconvert succeeded. We can then go to inc_holders. I've attached a new version of your patch that does this. I've also modified the patch to reflect that the original problem you found is actually a livelock, not a deadlock. And I added a little comment about the async -EAGAIN. I've attached the full patch with my changes. Dave, please test my version (the attached one) instead of Wengang's. Mark, can you do a sanity check here? Here's the diff between your patch and mine: diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 2788b8e..eabe53e 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -1313,7 +1313,7 @@ static int __ocfs2_cluster_lock(struct ocfs2_super *osb, unsigned long flags; unsigned int gen; int noqueue_attempted = 0; - int lock_done = 0; + int lock_attempted = 0; mlog_entry_void(); @@ -1348,18 +1348,25 @@ again: goto unlock; } - if (lock_done) { - /* cluster lock completed already, don't wait for down- - * convertion to finish. - * if we do, in case another DC happen to be there(not on behalf - * of this lock request), and ocfs2_may_continue_on_blocked_lock() - * returns false(in case we want an EX), we go into deadlock: - * DC is waiting for our release of the EX(check_downconvert()) - * and we are waiting for that DC. + if (lock_attempted && (lockres->l_level >= level)) { + /* + * We've attempted to upconvert, and the lock now has + * a level we can work with. If we fell through to the + * next checks, we could spin in an upconvert->downconvert + * cycle as other nodes pounded us. Instead, we jump + * out and let the caller do some work. If a downconvert + * has come in, it will do its thing as soon as the caller + * is done. */ goto update_holders; } + /* + * This is either the first pass or our level wasn't good enough. + * Treat both cases as a first pass. + */ + lock_attempted = 0; + if (lockres->l_flags & OCFS2_LOCK_BLOCKED && !ocfs2_may_continue_on_blocked_lock(lockres, level)) { /* is the lock is currently blocked on behalf of @@ -1370,6 +1377,13 @@ again: } if (level > lockres->l_level) { + /* + * If the dlm notifies us of -EAGAIN asynchronously, it + * will have returned 0 from the call to ocfs2_dlm_lock() + * and given us -EAGAIN in the ast. The ast will clear + * out our state. We notice that here and return -EAGAIN + * to the caller. + */ if (noqueue_attempted > 0) { ret = -EAGAIN; goto unlock; @@ -1427,7 +1441,7 @@ again: catch_signals = 0; /* wait for busy to clear and carry on */ - lock_done = 1; + lock_attempted = 1; goto again; } Joel -- "If you are ever in doubt as to whether or not to kiss a pretty girl, give her the benefit of the doubt" -Thomas Carlyle Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-ocfs2-fix-__ocfs2_cluster_lock-livelock.patch Type: text/x-diff Size: 3646 bytes Desc: not available Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20100111/1264b243/attachment.bin ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-12 1:59 ` Joel Becker @ 2010-01-12 3:58 ` Wengang Wang 2010-01-12 11:18 ` Joel Becker 2010-01-12 19:47 ` David Teigland 2010-01-13 3:20 ` Wengang Wang 2 siblings, 1 reply; 27+ messages in thread From: Wengang Wang @ 2010-01-12 3:58 UTC (permalink / raw) To: ocfs2-devel Hi Joel, On 10-01-11 17:59, Joel Becker wrote: > [Cc'd Mark and Dave] > > On Sun, Jan 10, 2010 at 02:05:22AM +0800, Wengang Wang wrote: > > > 2) It sure looks like you're right. > > well, this doesn't sounds like a sob :) > > With good reason, too. You've got a bug. Dave Teigland (copied > now) found it while testing with fs/dlm. > The short description is that you didn't notice the handling of > async -EAGAIN. Not really your fault, there are no obvious comments. > With o2dlm, you get -EAGAIN from the call to ocfs2_dlm_lock(). That's > safe with your patch. However, with fs/dlm, EAGAIN comes > asynchronously. ocfs2_dlm_lock() will return 0. When the ast fires, > the ast notices the -EAGAIN and clears the locking state for a retry. > In your patch, however, you take the 'lock_done' + !BUSY and immediately > increment the holders - even though the lock hasn't been acquired! ah, yes. I fault. I should check more deeper. > The fix is pretty simple, though - when lock_done is set, you > know we've at least tried to upconvert. If the locking level is good, > we know the upconvert succeeded. We can then go to inc_holders. I've > attached a new version of your patch that does this. > I've also modified the patch to reflect that the original > problem you found is actually a livelock, not a deadlock. And I added a > little comment about the async -EAGAIN. > I've attached the full patch with my changes. Dave, please test > my version (the attached one) instead of Wengang's. Mark, can you do a > sanity check here? > Here's the diff between your patch and mine: snip... > Date: Wed, 6 Jan 2010 16:34:44 +0800 > Subject: [PATCH] ocfs2: fix __ocfs2_cluster_lock() livelock > > There is livelock possibility in __ocfs2_cluster_lock(). Here's what > happens: > > 1) node A(the lock owner) is doing an up-convert(UC, PR->EX). it gets > the lock but doesn't check BUSY again because the level is right. > > 2) node B requested an UC on the same lock resource. since node A has an > EX on the lockres, a bast is issued and OCFS2_LOCK_BLOCKED flag is set > to the lockres meanting that a DC is should be done. the DC asks for > agreement of node A to release EX. The downconvert starts because BUSY > is cleared by the ast. > > 3) the UC on node A runs into the check of OCFS2_LOCK_BLOCKED: > > if (lockres->l_flags & OCFS2_LOCK_BLOCKED && > !ocfs2_may_continue_on_blocked_lock(lockres, level)) { > > analysis: > the BLOCKED flag is set in 2), and the UC can't continue to get the lock > (ocfs2_may_continue_on_blocked_lock() returns false), so it waits on the > DC to finish. > > 4) The DC finishes, and BLOCKED is cleared. The UC on node A starts over > getting the UC, now from NL->EX. It asks node B for the lock. I am wondering why DC can finish. although BUSY is cleared. I think the check_downconvert() should return false since we have not done whatever we want to do. and then the DC should be requeued. #don't hate me if I am wrong/stupid :) regards, wengang. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-12 3:58 ` Wengang Wang @ 2010-01-12 11:18 ` Joel Becker 2010-01-12 12:04 ` Wengang Wang 0 siblings, 1 reply; 27+ messages in thread From: Joel Becker @ 2010-01-12 11:18 UTC (permalink / raw) To: ocfs2-devel On Tue, Jan 12, 2010 at 11:58:35AM +0800, Wengang Wang wrote: > On 10-01-11 17:59, Joel Becker wrote: > > Date: Wed, 6 Jan 2010 16:34:44 +0800 > > Subject: [PATCH] ocfs2: fix __ocfs2_cluster_lock() livelock > > > > There is livelock possibility in __ocfs2_cluster_lock(). Here's what > > happens: > > > > 1) node A(the lock owner) is doing an up-convert(UC, PR->EX). it gets > > the lock but doesn't check BUSY again because the level is right. > > > > 2) node B requested an UC on the same lock resource. since node A has an > > EX on the lockres, a bast is issued and OCFS2_LOCK_BLOCKED flag is set > > to the lockres meanting that a DC is should be done. the DC asks for > > agreement of node A to release EX. The downconvert starts because BUSY > > is cleared by the ast. > > > > 3) the UC on node A runs into the check of OCFS2_LOCK_BLOCKED: > > > > if (lockres->l_flags & OCFS2_LOCK_BLOCKED && > > !ocfs2_may_continue_on_blocked_lock(lockres, level)) { > > > > analysis: > > the BLOCKED flag is set in 2), and the UC can't continue to get the lock > > (ocfs2_may_continue_on_blocked_lock() returns false), so it waits on the > > DC to finish. > > > > 4) The DC finishes, and BLOCKED is cleared. The UC on node A starts over > > getting the UC, now from NL->EX. It asks node B for the lock. > > I am wondering why DC can finish. > although BUSY is cleared. I think the check_downconvert() should return > false since we have not done whatever we want to do. and then the DC should > be requeued. DC thread only waits on PENDING and holders. If BUSY, it will cancel the upconvert. If not BUSY, it will schedule a downconvert. There's nothing stopping the downconvert, in other words. > #don't hate me if I am wrong/stupid :) Won't hate you at all. And you're not stupid even if you are wrong. It's not easy code. Heck, you may be right and I might be wrong in the end. Remember, you found the bug in the first place! Joel -- Life's Little Instruction Book #450 "Don't be afraid to say, 'I need help.'" Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-12 11:18 ` Joel Becker @ 2010-01-12 12:04 ` Wengang Wang 2010-01-12 13:01 ` Joel Becker 0 siblings, 1 reply; 27+ messages in thread From: Wengang Wang @ 2010-01-12 12:04 UTC (permalink / raw) To: ocfs2-devel Hi Joel, On 10-01-12 03:18, Joel Becker wrote: > > > Date: Wed, 6 Jan 2010 16:34:44 +0800 > > > Subject: [PATCH] ocfs2: fix __ocfs2_cluster_lock() livelock > > > > > > There is livelock possibility in __ocfs2_cluster_lock(). Here's what > > > happens: > > > > > > 1) node A(the lock owner) is doing an up-convert(UC, PR->EX). it gets > > > the lock but doesn't check BUSY again because the level is right. > > > > > > 2) node B requested an UC on the same lock resource. since node A has an > > > EX on the lockres, a bast is issued and OCFS2_LOCK_BLOCKED flag is set > > > to the lockres meanting that a DC is should be done. the DC asks for > > > agreement of node A to release EX. The downconvert starts because BUSY > > > is cleared by the ast. > > > > > > 3) the UC on node A runs into the check of OCFS2_LOCK_BLOCKED: > > > > > > if (lockres->l_flags & OCFS2_LOCK_BLOCKED && > > > !ocfs2_may_continue_on_blocked_lock(lockres, level)) { > > > > > > analysis: > > > the BLOCKED flag is set in 2), and the UC can't continue to get the lock > > > (ocfs2_may_continue_on_blocked_lock() returns false), so it waits on the > > > DC to finish. > > > > > > 4) The DC finishes, and BLOCKED is cleared. The UC on node A starts over > > > getting the UC, now from NL->EX. It asks node B for the lock. > > > > I am wondering why DC can finish. > > although BUSY is cleared. I think the check_downconvert() should return > > false since we have not done whatever we want to do. and then the DC should > > be requeued. > > DC thread only waits on PENDING and holders. If BUSY, it will > cancel the upconvert. If not BUSY, it will schedule a downconvert. > There's nothing stopping the downconvert, in other words. I think I am talking about the scheduled downconvert. let me copy src here. the downconvert kernel thread does a call stack like this: ocfs2_downconvert_thread_do_work() -->ocfs2_process_blocked_lock -->ocfs2_unblock_lock() in ocfs2_unblock_lock() there are lines after checking for BUSY flag: if (lockres->l_ops->check_downconvert && !lockres->l_ops->check_downconvert(lockres, new_level)) goto leave_requeue; [snip...] leave_requeue: spin_unlock_irqrestore(&lockres->l_lock, flags); ctl->requeue = 1; if ctl->requeue is set, ocfs2_schedule_blocked_lock() will add the lockres to blocked_lock_list. the downconvert will then be rescheduled only if ocfs2_wake_downconvert_thread() is called somewhere. in this simple case: the "somewhere", I think, is in __ocfs2_cluster_unlock(). so before __ocfs2_cluster_unlock() is called(surely after __ocfs2_cluster_lock() returns), the downconvert couldn't be rescheduled. thus __ocfs2_cluster_lock() is waiting for DC to finish and the DC is waiting __ocfs2_cluster_unlock() to be called. --deadlock. so maybe you don't notice the lockres->l_ops->check_downconvert()? > > #don't hate me if I am wrong/stupid :) > > Won't hate you at all. And you're not stupid even if you are > wrong. It's not easy code. Heck, you may be right and I might be wrong > in the end. Remember, you found the bug in the first place! > great! regards, wengang. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-12 12:04 ` Wengang Wang @ 2010-01-12 13:01 ` Joel Becker 2010-01-12 14:30 ` Wengang Wang 0 siblings, 1 reply; 27+ messages in thread From: Joel Becker @ 2010-01-12 13:01 UTC (permalink / raw) To: ocfs2-devel On Tue, Jan 12, 2010 at 08:04:19PM +0800, Wengang Wang wrote: > On 10-01-12 03:18, Joel Becker wrote: > > DC thread only waits on PENDING and holders. If BUSY, it will > > cancel the upconvert. If not BUSY, it will schedule a downconvert. > > There's nothing stopping the downconvert, in other words. ... > in ocfs2_unblock_lock() there are lines after checking for BUSY flag: > > if (lockres->l_ops->check_downconvert > && !lockres->l_ops->check_downconvert(lockres, new_level)) > goto leave_requeue; ... > if ctl->requeue is set, ocfs2_schedule_blocked_lock() will add the > lockres to blocked_lock_list. the downconvert will then be rescheduled only > if ocfs2_wake_downconvert_thread() is called somewhere. > in this simple case: > the "somewhere", I think, is in __ocfs2_cluster_unlock(). > so before __ocfs2_cluster_unlock() is called(surely after > __ocfs2_cluster_lock() returns), the downconvert couldn't > be rescheduled. > thus __ocfs2_cluster_lock() is waiting for DC to finish and the DC is > waiting __ocfs2_cluster_unlock() to be called. --deadlock. > > so maybe you don't notice the lockres->l_ops->check_downconvert()? I notice it. I know what it checks. If the lock is not currently taken it will return "go ahead and downconvert". So we won't requeue, we'll downconvert. Specifically, look at ocfs2_check_meta_downconvert(). It merely checks the caching info. We can't have any cached data if we don't have the EX yet. Sunil has observed the livelock behavior (pinging between UC and DC) in the wild via traces. The UC process never leaves ocfs2_cluster_lock(). Your change fixes that. Joel -- "I don't even butter my bread; I consider that cooking." - Katherine Cebrian Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-12 13:01 ` Joel Becker @ 2010-01-12 14:30 ` Wengang Wang 2010-01-12 19:14 ` Sunil Mushran 0 siblings, 1 reply; 27+ messages in thread From: Wengang Wang @ 2010-01-12 14:30 UTC (permalink / raw) To: ocfs2-devel On 10-01-12 05:01, Joel Becker wrote: > > so maybe you don't notice the lockres->l_ops->check_downconvert()? > > I notice it. I know what it checks. If the lock is not > currently taken it will return "go ahead and downconvert". So we > won't requeue, we'll downconvert. Specifically, look at > ocfs2_check_meta_downconvert(). It merely checks the caching info. We > can't have any cached data if we don't have the EX yet. Is it a rule that If the lock is not currently taken it will return "go ahead and downconvert"? by checking ocfs2_check_meta_downconvert(), I think it ensures cache is checkpointed by JBD(2) before returning "go ahead and downconvert". maybe yes that you noticed that we can't have cache if we don't have EX, I think the main purpose is that it ensures that we finish things which we should finish before we release the lock. another evidence is that it doesn't check the ex_holders. another question is that you can find out ocfs2_check_meta_downconvert() checks if lock is taken(not explicitly), can you make such a conclusion that all check_downconvert() follows it? --sorry, I didn't check them and pushed it to you :P --I will check them too later. so still the question, is it a rule that check_downconvert() should return "go ahead and downconvert" if the lock is not currently taken? and also, what means "currently taken"? --ocfs2 layer(after ocfs2_cluster_lock() returns) or dlm level? though I guess you meant ocfs2 layer. regards, wengang. > Sunil has observed the livelock behavior (pinging between UC and > DC) in the wild via traces. The UC process never leaves > ocfs2_cluster_lock(). Your change fixes that. > ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-12 14:30 ` Wengang Wang @ 2010-01-12 19:14 ` Sunil Mushran 2010-01-13 6:15 ` Wengang Wang 0 siblings, 1 reply; 27+ messages in thread From: Sunil Mushran @ 2010-01-12 19:14 UTC (permalink / raw) To: ocfs2-devel Wengang Wang wrote: > Is it a rule that If the lock is not currently taken it will return "go > ahead and downconvert"? In short, yes, because of fairness. Not having holders is one of the checks. If there are incompatible holders then it will requeue the lockres in the blocked queue. The lockres will be re-processed when the dc thread is kicked the next time. If there are no holders and the journal is checkpointed (or check_downconvert succeeds), then yes, it will be downconvert. > by checking ocfs2_check_meta_downconvert(), I think it ensures cache is > checkpointed by JBD(2) before returning "go ahead and downconvert". > maybe yes that you noticed that we can't have cache if we don't have > EX, I think the main purpose is that it ensures that we finish things which we > should finish before we release the lock. another evidence is that it > doesn't check the ex_holders. check_downconvert() is an additional check. The holders check tells the fs whether it can and whether it needs to down convert or not. If answer to both qs is yes, then it sees whether there are any lock type specific checks. M and T lock types want the journal check pointed before a dc. > another question is that you can find out ocfs2_check_meta_downconvert() > checks if lock is taken(not explicitly), can you make such a conclusion > that all check_downconvert() follows it? --sorry, I didn't check them > and pushed it to you :P --I will check them too later. > > so still the question, is it a rule that check_downconvert() should > return "go ahead and downconvert" if the lock is not currently taken? > and also, what means "currently taken"? --ocfs2 layer(after > ocfs2_cluster_lock() returns) or dlm level? though I guess you meant > ocfs2 layer. I am confused by your qs. Have you read the description of check_downconvert()? /* * Allow a lock type to add checks to determine whether it is * safe to downconvert a lock. Return 0 to re-queue the * downconvert at a later time, nonzero to continue. * * For most locks, the default checks that there are no * incompatible holders are sufficient. * * Called with the lockres spinlock held. */ int (*check_downconvert)(struct ocfs2_lock_res *, int); Sunil ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-12 19:14 ` Sunil Mushran @ 2010-01-13 6:15 ` Wengang Wang 0 siblings, 0 replies; 27+ messages in thread From: Wengang Wang @ 2010-01-13 6:15 UTC (permalink / raw) To: ocfs2-devel On 10-01-12 11:14, Sunil Mushran wrote: > Wengang Wang wrote: >> Is it a rule that If the lock is not currently taken it will return "go >> ahead and downconvert"? > > In short, yes, because of fairness. > > Not having holders is one of the checks. If there are incompatible > holders then it will requeue the lockres in the blocked queue. The > lockres will be re-processed when the dc thread is kicked the > next time. If there are no holders and the journal is checkpointed > (or check_downconvert succeeds), then yes, it will be downconvert. > Ok, got it. >> by checking ocfs2_check_meta_downconvert(), I think it ensures cache is >> checkpointed by JBD(2) before returning "go ahead and downconvert". >> maybe yes that you noticed that we can't have cache if we don't have >> EX, I think the main purpose is that it ensures that we finish things which we >> should finish before we release the lock. another evidence is that it >> doesn't check the ex_holders. > > check_downconvert() is an additional check. The holders check tells the fs > whether it can and whether it needs to down convert or not. If answer to > both qs is yes, then it sees whether there are any lock type specific > checks. M and T lock types want the journal check pointed before a dc. yes, understood. the holders check is done in ocfs2_unblock_lock() and the check_downconvert() checks specific things. whether the problem is a deadlock or a livelock depends on whether the check_downconvert()(if there is) doesn't agree to dc in some condition. if it doesn't agree, it's a deadlock. otherwise, it's a livelock. > >> another question is that you can find out ocfs2_check_meta_downconvert() >> checks if lock is taken(not explicitly), can you make such a conclusion >> that all check_downconvert() follows it? --sorry, I didn't check them >> and pushed it to you :P --I will check them too later. >> >> so still the question, is it a rule that check_downconvert() should >> return "go ahead and downconvert" if the lock is not currently taken? >> and also, what means "currently taken"? --ocfs2 layer(after >> ocfs2_cluster_lock() returns) or dlm level? though I guess you meant >> ocfs2 layer. > > I am confused by your qs. Have you read the description of > check_downconvert()? > > > /* > * Allow a lock type to add checks to determine whether it is > * safe to downconvert a lock. Return 0 to re-queue the > * downconvert at a later time, nonzero to continue. > * > * For most locks, the default checks that there are no > * incompatible holders are sufficient. > * > * Called with the lockres spinlock held. > */ > int (*check_downconvert)(struct ocfs2_lock_res *, int); > > Sunil got, please ignore the above part. regards, wengang. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-12 1:59 ` Joel Becker 2010-01-12 3:58 ` Wengang Wang @ 2010-01-12 19:47 ` David Teigland 2010-01-13 3:20 ` Wengang Wang 2 siblings, 0 replies; 27+ messages in thread From: David Teigland @ 2010-01-12 19:47 UTC (permalink / raw) To: ocfs2-devel On Mon, Jan 11, 2010 at 05:59:46PM -0800, Joel Becker wrote: > I've attached the full patch with my changes. Dave, please test > my version (the attached one) instead of Wengang's. Your new patch fixes the mount, so I went on to test make_panic which is the test we never got to work: http://oss.oracle.com/pipermail/ocfs2-devel/2009-April/004313.html https://bugzilla.novell.com/show_bug.cgi?id=492055 It ran on three nodes for several minutes; much longer than it ever had before. It eventually triggered different BUG's on two of the nodes, rather than just getting stuck as it used to. I wasn't watching, so I don't know which of these came first. One node had: kernel BUG at fs/ocfs2/dlmglue.c:3567! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map CPU 3 Modules linked in: ocfs2_stack_user dlm ocfs2 jbd2 ocfs2_nodemanager configfs ocfs2_stackglue ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp autofs4 sunrpc ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi cpufreq_ondemand dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg serio_raw button tg3 libphy i2c_nforce2 i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod qla2xxx scsi_transport_fc shpchp mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 4019, comm: ocfs2dc Not tainted 2.6.32.3 #2 ProLiant DL145 G2 RIP: 0010:[<ffffffffa03a8148>] [<ffffffffa03a8148>] ocfs2_ci_checkpointed+0x7c/0xcb [ocfs2] RSP: 0018:ffff8800785b7dc0 EFLAGS: 00010002 RAX: 0000000000000001 RBX: 0000000000000411 RCX: ffff8800779b53b8 RDX: 000000000000d0cf RSI: 0000000000000001 RDI: 0000000000000000 RBP: ffff8800785b7df0 R08: ffffffffa03a810a R09: ffffffffa03aa010 R10: ffff88013f421e68 R11: ffff8800839d36c0 R12: ffffffffffffffff R13: ffff880078b73938 R14: 0000000000000000 R15: ffff880078b73368 FS: 00007f589e5126e0(0000) GS:ffff880083a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007f0598be6000 CR3: 0000000133608000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process ocfs2dc (pid: 4019, threadinfo ffff8800785b6000, task ffff8800779b4cc0) Stack: ffff88007856c000 0000000000000282 ffff880078b73368 0000000000000000 <0> ffff88007856c000 0000000000000282 ffff8800785b7e00 ffffffffa03ab4b0 <0> ffff8800785b7eb0 ffffffffa03aa16d ffff8800785b7eb0 ffffffff813528ed Call Trace: [<ffffffffa03ab4b0>] ocfs2_check_meta_downconvert+0x32/0x34 [ocfs2] [<ffffffffa03aa16d>] ocfs2_downconvert_thread+0x470/0x869 [ocfs2] [<ffffffff813528ed>] ? thread_return+0x3e/0xee [<ffffffff8105ff63>] ? autoremove_wake_function+0x0/0x3d [<ffffffffa03a9cfd>] ? ocfs2_downconvert_thread+0x0/0x869 [ocfs2] [<ffffffff8105fe55>] kthread+0x82/0x8d [<ffffffff8100cc1a>] child_rip+0xa/0x20 [<ffffffff8105fdb2>] ? kthreadd+0xc7/0xe8 [<ffffffff8105fdd3>] ? kthread+0x0/0x8d [<ffffffff8100cc10>] ? child_rip+0x0/0x20 Code: e0 45 85 f6 74 0a 41 83 fe 03 74 04 0f 0b eb fe 49 29 dc 49 f7 d4 4c 89 e0 48 c1 e8 3f 41 83 bf a0 00 00 00 05 74 08 84 c0 74 08 <0f> 0b eb fe 84 c0 75 07 b8 01 00 00 00 eb 33 4c 89 ef e8 07 ac RIP [<ffffffffa03a8148>] ocfs2_ci_checkpointed+0x7c/0xcb [ocfs2] RSP <ffff8800785b7dc0> Another node had: (2881,2):ocfs2_inode_lock_update:2224 ERROR: bug expression: inode->i_generation != le32_to_cpu(fe->i_generation) (2881,2):ocfs2_inode_lock_update:2224 ERROR: Invalid dinode 420382 disk generation: 1523484106 inode->i_generation: 1523484094 ------------[ cut here ]------------ kernel BUG at fs/ocfs2/dlmglue.c:2224! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map CPU 2 Modules linked in: ocfs2_stack_user dlm ocfs2 jbd2 ocfs2_nodemanager configfs ocfs2_stackglue ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp sunrpc ipv6 cpufreq_ondemand dm_multipath uinput serio_raw pcspkr sg qla2xxx scsi_transport_fc tg3 libphy i2c_nforce2 i2c_core button dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan] Pid: 2881, comm: make_panic Not tainted 2.6.32.3 #1 ProLiant DL145 G2 RIP: 0010:[<ffffffffa032e9a0>] [<ffffffffa032e9a0>] ocfs2_inode_lock_full_nested+0x850/0xd00 [ocfs2] RSP: 0018:ffff88007b55bce8 EFLAGS: 00010296 RAX: 0000000000000085 RBX: ffff880067c95000 RCX: 000000000000be01 RDX: ffff880083800000 RSI: 0000000000000001 RDI: 0000000000000003 RBP: ffff88007b55bd58 R08: 0000000000000092 R09: 0000000000000000 R10: 0000000000000003 R11: 0000000000018600 R12: ffff880066a8b2d8 R13: ffff880066a8b6e8 R14: ffff880066a8b840 R15: ffff880066a8b040 FS: 00007fa9c96b66f0(0000) GS:ffff880083800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f54f5021000 CR3: 000000007e82a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process make_panic (pid: 2881, threadinfo ffff88007b55a000, task ffff88007e9d40c0) Stack: ffffffff5ace85ca 000001015ace85be 00000000810f4897 0000000000000000 <0> 0000000000000000 000000017f59ece0 ffff880066a8b228 ffffffff81103794 <0> ffff880068122cb0 ffff880066a8b840 ffff880066a8b840 0000000000000026 Call Trace: [<ffffffff81103794>] ? mntput_no_expire+0x29/0xfd [<ffffffffa03354f1>] ocfs2_permission+0x78/0x16f [ocfs2] [<ffffffff810f3ea5>] inode_permission+0x6e/0x9e [<ffffffff810f593c>] may_open+0x9e/0x252 [<ffffffff810f8389>] do_filp_open+0x51f/0xa55 [<ffffffff81101c14>] ? alloc_fd+0x122/0x133 [<ffffffff810ea0b0>] do_sys_open+0x62/0x109 [<ffffffff810ea18a>] sys_open+0x20/0x22 [<ffffffff8100bc5b>] system_call_fastpath+0x16/0x1b Code: ff 48 c7 c1 b0 eb 37 a0 48 c7 c7 0b 6d 38 a0 65 8b 14 25 08 cd 00 00 89 44 24 08 8b 43 08 48 63 d2 89 04 24 31 c0 e8 37 cc 02 e1 <0f> 0b eb fe 48 83 7b 48 00 75 0a f6 43 2c 01 0f 85 bc 00 00 00 RIP [<ffffffffa032e9a0>] ocfs2_inode_lock_full_nested+0x850/0xd00 [ocfs2] RSP <ffff88007b55bce8> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-12 1:59 ` Joel Becker 2010-01-12 3:58 ` Wengang Wang 2010-01-12 19:47 ` David Teigland @ 2010-01-13 3:20 ` Wengang Wang 2010-01-13 7:08 ` Wengang Wang 2010-01-13 7:57 ` Joel Becker 2 siblings, 2 replies; 27+ messages in thread From: Wengang Wang @ 2010-01-13 3:20 UTC (permalink / raw) To: ocfs2-devel Hi Joel and David, I think the patch to fix the deadlock(or livelock) is not good enough yet. my original patch is based on that when we have requested the lock, we don't allow to downconvert it before we release it. but per the communications with Joel and Sunil, I found my base is wrong. so my original patch is not good. and Joel's patch has problem either. On 10-01-11 17:59, Joel Becker wrote: > [Cc'd Mark and Dave] > Signed-off-by: Joel Becker <joel.becker@oracle.com> > --- > fs/ocfs2/dlmglue.c | 29 +++++++++++++++++++++++++++++ > 1 files changed, 29 insertions(+), 0 deletions(-) > > diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c > index 90682a0..eabe53e 100644 > --- a/fs/ocfs2/dlmglue.c > +++ b/fs/ocfs2/dlmglue.c > @@ -1313,6 +1313,7 @@ static int __ocfs2_cluster_lock(struct ocfs2_super *osb, > unsigned long flags; > unsigned int gen; > int noqueue_attempted = 0; > + int lock_attempted = 0; > > mlog_entry_void(); > > @@ -1347,6 +1348,25 @@ again: > goto unlock; > } > > + if (lock_attempted && (lockres->l_level >= level)) { > + /* > + * We've attempted to upconvert, and the lock now has > + * a level we can work with. If we fell through to the > + * next checks, we could spin in an upconvert->downconvert > + * cycle as other nodes pounded us. Instead, we jump > + * out and let the caller do some work. If a downconvert > + * has come in, it will do its thing as soon as the caller > + * is done. > + */ > + goto update_holders; > + } before update_holders, the lock could be DCed(since no BUSY flag set by here). and even after update_holders, the lock could be DCed too. so that we get ocfs2_cluster_lock()(with holders increased) returned sucessfully but actually we don't hold the dlm lock. --thus more than one node is considering that they have the (EX) lock. that's could be the cause of what is happening observed by David. regards, wengang. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-13 3:20 ` Wengang Wang @ 2010-01-13 7:08 ` Wengang Wang 2010-01-13 7:57 ` Joel Becker 1 sibling, 0 replies; 27+ messages in thread From: Wengang Wang @ 2010-01-13 7:08 UTC (permalink / raw) To: ocfs2-devel On 10-01-13 11:20, Wengang Wang wrote: > Hi Joel and David, > > I think the patch to fix the deadlock(or livelock) is not good enough > yet. > my original patch is based on that when we have requested the lock, we > don't allow to downconvert it before we release it. > but per the communications with Joel and Sunil, I found my base is > wrong. so my original patch is not good. and Joel's patch has problem > either. > > On 10-01-11 17:59, Joel Becker wrote: > > [Cc'd Mark and Dave] > > Signed-off-by: Joel Becker <joel.becker@oracle.com> > > --- > > fs/ocfs2/dlmglue.c | 29 +++++++++++++++++++++++++++++ > > 1 files changed, 29 insertions(+), 0 deletions(-) > > > > diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c > > index 90682a0..eabe53e 100644 > > --- a/fs/ocfs2/dlmglue.c > > +++ b/fs/ocfs2/dlmglue.c > > @@ -1313,6 +1313,7 @@ static int __ocfs2_cluster_lock(struct ocfs2_super *osb, > > unsigned long flags; > > unsigned int gen; > > int noqueue_attempted = 0; > > + int lock_attempted = 0; > > > > mlog_entry_void(); > > > > @@ -1347,6 +1348,25 @@ again: > > goto unlock; > > } > > > > + if (lock_attempted && (lockres->l_level >= level)) { > > + /* > > + * We've attempted to upconvert, and the lock now has > > + * a level we can work with. If we fell through to the > > + * next checks, we could spin in an upconvert->downconvert > > + * cycle as other nodes pounded us. Instead, we jump > > + * out and let the caller do some work. If a downconvert > > + * has come in, it will do its thing as soon as the caller > > + * is done. > > + */ > > + goto update_holders; > > + } > > before update_holders, the lock could be DCed(since no BUSY flag set by > here). > > and even after update_holders, the lock could be DCed too. > > so that we get ocfs2_cluster_lock()(with holders increased) returned sucessfully > but actually we don't hold the dlm lock. --thus more than one node is > considering that they have the (EX) lock. > > that's could be the cause of what is happening observed by David. I think the main problem is that though the lock is got, the holders is not updated in time(within BUSY flag protection). a BUSY flag protects a single ocfs2_dlm_lock(). if a BUSY_INC_HD(also busy with updating holders) is used to protect a single ocfs2_dlm_lock() + holders updates against dc thread, so far I think it's the fix. comments? #patch attached. -------------- next part -------------- diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index c3ee146..d172c5f 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -1360,6 +1360,8 @@ again: goto update_holders; } + lockres_clear_flags(lockres, OCFS2_LOCK_BUSY_INC_HD); + /* * This is either the first pass or our level wasn't good enough. * Treat both cases as a first pass. @@ -1403,7 +1405,7 @@ again: } lockres->l_requested = level; - lockres_or_flags(lockres, OCFS2_LOCK_BUSY); + lockres_or_flags(lockres, OCFS2_LOCK_BUSY|OCFS2_LOCK_BUSY_INC_HD); gen = lockres_set_pending(lockres); spin_unlock_irqrestore(&lockres->l_lock, flags); @@ -1447,6 +1449,8 @@ again: update_holders: /* Ok, if we get here then we're good to go. */ ocfs2_inc_holders(lockres, level); + /* no harm even double clear OCFS2_LOCK_BUSY_INC_HD */ + lockres_clear_flags(lockres, OCFS2_LOCK_BUSY_INC_HD); ret = 0; unlock: @@ -3392,7 +3396,7 @@ static int ocfs2_unblock_lock(struct ocfs2_super *osb, BUG_ON(!(lockres->l_flags & OCFS2_LOCK_BLOCKED)); recheck: - if (lockres->l_flags & OCFS2_LOCK_BUSY) { + if (lockres->l_flags & (OCFS2_LOCK_BUSY|OCFS2_LOCK_BUSY_INC_HD)) { /* XXX * This is a *big* race. The OCFS2_LOCK_PENDING flag * exists entirely for one reason - another thread has set diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index d963d86..a3d23ef 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -136,6 +136,7 @@ enum ocfs2_unlock_action { #define OCFS2_LOCK_PENDING (0x00000400) /* This lockres is pending a call to dlm_lock. Only exists with BUSY set. */ +#define OCFS2_LOCK_BUSY_INC_HD (0x00000800) /* busy with increase holders */ struct ocfs2_lock_res_ops; ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-13 3:20 ` Wengang Wang 2010-01-13 7:08 ` Wengang Wang @ 2010-01-13 7:57 ` Joel Becker 2010-01-13 9:23 ` Wengang Wang 2010-01-13 19:14 ` Sunil Mushran 1 sibling, 2 replies; 27+ messages in thread From: Joel Becker @ 2010-01-13 7:57 UTC (permalink / raw) To: ocfs2-devel On Wed, Jan 13, 2010 at 11:20:49AM +0800, Wengang Wang wrote: > before update_holders, the lock could be DCed(since no BUSY flag set by > here). > > and even after update_holders, the lock could be DCed too. > > so that we get ocfs2_cluster_lock()(with holders increased) returned sucessfully > but actually we don't hold the dlm lock. --thus more than one node is > considering that they have the (EX) lock. You make a good point. I don't like the solution you propose, though. Another flag that's almost the same? Eww! There's got to be a better way. Joel -- Life's Little Instruction Book #80 "Slow dance" Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-13 7:57 ` Joel Becker @ 2010-01-13 9:23 ` Wengang Wang 2010-01-13 9:31 ` Joel Becker 2010-01-13 19:14 ` Sunil Mushran 1 sibling, 1 reply; 27+ messages in thread From: Wengang Wang @ 2010-01-13 9:23 UTC (permalink / raw) To: ocfs2-devel On 10-01-12 23:57, Joel Becker wrote: > On Wed, Jan 13, 2010 at 11:20:49AM +0800, Wengang Wang wrote: > > before update_holders, the lock could be DCed(since no BUSY flag set by > > here). > > > > and even after update_holders, the lock could be DCed too. > > > > so that we get ocfs2_cluster_lock()(with holders increased) returned sucessfully > > but actually we don't hold the dlm lock. --thus more than one node is > > considering that they have the (EX) lock. > > You make a good point. I don't like the solution you propose, > though. Another flag that's almost the same? Eww! There's got to be a > better way. cool! expecting your patch... regards, wengang. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-13 9:23 ` Wengang Wang @ 2010-01-13 9:31 ` Joel Becker 2010-01-13 9:36 ` Wengang Wang 0 siblings, 1 reply; 27+ messages in thread From: Joel Becker @ 2010-01-13 9:31 UTC (permalink / raw) To: ocfs2-devel On Wed, Jan 13, 2010 at 05:23:32PM +0800, Wengang Wang wrote: > On 10-01-12 23:57, Joel Becker wrote: > > On Wed, Jan 13, 2010 at 11:20:49AM +0800, Wengang Wang wrote: > > > before update_holders, the lock could be DCed(since no BUSY flag set by > > > here). > > > > > > and even after update_holders, the lock could be DCed too. > > > > > > so that we get ocfs2_cluster_lock()(with holders increased) returned sucessfully > > > but actually we don't hold the dlm lock. --thus more than one node is > > > considering that they have the (EX) lock. > > > > You make a good point. I don't like the solution you propose, > > though. Another flag that's almost the same? Eww! There's got to be a > > better way. > cool! expecting your patch... Don't have one yet. We can't rely on the spinlock, because we have to sleep on BUSY. We relied on the spinblock before because we allowed the UC->DC->UC->DC livelock to happen. Joel -- "Up and down that road in our worn out shoes, Talking bout good things and singing the blues." Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-13 9:31 ` Joel Becker @ 2010-01-13 9:36 ` Wengang Wang 0 siblings, 0 replies; 27+ messages in thread From: Wengang Wang @ 2010-01-13 9:36 UTC (permalink / raw) To: ocfs2-devel On 10-01-13 01:31, Joel Becker wrote: > > > > before update_holders, the lock could be DCed(since no BUSY flag set by > > > > here). > > > > > > > > and even after update_holders, the lock could be DCed too. > > > > > > > > so that we get ocfs2_cluster_lock()(with holders increased) returned sucessfully > > > > but actually we don't hold the dlm lock. --thus more than one node is > > > > considering that they have the (EX) lock. > > > > > > You make a good point. I don't like the solution you propose, > > > though. Another flag that's almost the same? Eww! There's got to be a > > > better way. > > cool! expecting your patch... > > Don't have one yet. We can't rely on the spinlock, because we > have to sleep on BUSY. We relied on the spinblock before because we > allowed the UC->DC->UC->DC livelock to happen. > then I feel it's going to be a big change. so that the author shouldn't be me, add me as something like bug-found-by if you like :) regards, wengang. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-13 7:57 ` Joel Becker 2010-01-13 9:23 ` Wengang Wang @ 2010-01-13 19:14 ` Sunil Mushran 2010-01-14 1:01 ` Joel Becker 2010-01-21 15:30 ` David Teigland 1 sibling, 2 replies; 27+ messages in thread From: Sunil Mushran @ 2010-01-13 19:14 UTC (permalink / raw) To: ocfs2-devel Joel Becker wrote: > On Wed, Jan 13, 2010 at 11:20:49AM +0800, Wengang Wang wrote: >> before update_holders, the lock could be DCed(since no BUSY flag set by >> here). >> >> and even after update_holders, the lock could be DCed too. >> >> so that we get ocfs2_cluster_lock()(with holders increased) returned sucessfully >> but actually we don't hold the dlm lock. --thus more than one node is >> considering that they have the (EX) lock. > > You make a good point. I don't like the solution you propose, > though. Another flag that's almost the same? Eww! There's got to be a > better way. I don't see how we can resolve this without another flag. It could be a mirror image of the PENDING flag. PENDING takes care of the gap between BUSY set and calling dlm_lock(). This flag has to take care of the gap between BUSY clear and inc_holders. Am working on improving Wengang's patch. Mark is strangely silent. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-13 19:14 ` Sunil Mushran @ 2010-01-14 1:01 ` Joel Becker 2010-01-14 4:12 ` Wengang Wang 2010-01-21 15:30 ` David Teigland 1 sibling, 1 reply; 27+ messages in thread From: Joel Becker @ 2010-01-14 1:01 UTC (permalink / raw) To: ocfs2-devel On Wed, Jan 13, 2010 at 11:14:48AM -0800, Sunil Mushran wrote: > Joel Becker wrote: > > On Wed, Jan 13, 2010 at 11:20:49AM +0800, Wengang Wang wrote: > >> before update_holders, the lock could be DCed(since no BUSY flag set by > >> here). > >> > >> and even after update_holders, the lock could be DCed too. > >> > >> so that we get ocfs2_cluster_lock()(with holders increased) returned sucessfully > >> but actually we don't hold the dlm lock. --thus more than one node is > >> considering that they have the (EX) lock. > > > > You make a good point. I don't like the solution you propose, > > though. Another flag that's almost the same? Eww! There's got to be a > > better way. > > I don't see how we can resolve this without another flag. It could > be a mirror image of the PENDING flag. PENDING takes care of the gap > between BUSY set and calling dlm_lock(). This flag has to take care > of the gap between BUSY clear and inc_holders. > > Am working on improving Wengang's patch. I don't see anything yet, but PENDING is a complex enough flag as-is. Another flag and we're even less understandable ;-( If we have to do it, and I assume you'll set the flag in the ast when you clear BUSY, I say call it UPCONVERT_FINISHING. Oh, wait. This means that we can drop the lock_attemped/lock_done state variable from wengang's original change. Right before ocfs2_cluster_lock checks for BLOCKING, it can just check flags&UPCONVERT_FINISHING and jump to inc_holders. Maybe this is in wengang's update and I didn't notice. > Mark is strangely silent. He's probably busy. Joel -- One look at the From: understanding has blossomed .procmailrc grows - Alexander Viro Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-14 1:01 ` Joel Becker @ 2010-01-14 4:12 ` Wengang Wang 0 siblings, 0 replies; 27+ messages in thread From: Wengang Wang @ 2010-01-14 4:12 UTC (permalink / raw) To: ocfs2-devel Hi, On 10-01-13 17:01, Joel Becker wrote: > On Wed, Jan 13, 2010 at 11:14:48AM -0800, Sunil Mushran wrote: > > Joel Becker wrote: > > > On Wed, Jan 13, 2010 at 11:20:49AM +0800, Wengang Wang wrote: > > >> before update_holders, the lock could be DCed(since no BUSY flag set by > > >> here). > > >> > > >> and even after update_holders, the lock could be DCed too. > > >> > > >> so that we get ocfs2_cluster_lock()(with holders increased) returned sucessfully > > >> but actually we don't hold the dlm lock. --thus more than one node is > > >> considering that they have the (EX) lock. > > > > > > You make a good point. I don't like the solution you propose, > > > though. Another flag that's almost the same? Eww! There's got to be a > > > better way. > > > > I don't see how we can resolve this without another flag. It could > > be a mirror image of the PENDING flag. PENDING takes care of the gap > > between BUSY set and calling dlm_lock(). This flag has to take care > > of the gap between BUSY clear and inc_holders. > > > > Am working on improving Wengang's patch. > thanks! > I don't see anything yet, but PENDING is a complex enough flag > as-is. Another flag and we're even less understandable ;-( If we have > to do it, and I assume you'll set the flag in the ast when you clear > BUSY, I say call it UPCONVERT_FINISHING. > Oh, wait. This means that we can drop the > lock_attemped/lock_done state variable from wengang's original change. > Right before ocfs2_cluster_lock checks for BLOCKING, it can just check > flags&UPCONVERT_FINISHING and jump to inc_holders. Maybe this is in > wengang's update and I didn't notice. > yes, we can drop that variable. and I didn't notice it in my update :P regards, wengang. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-13 19:14 ` Sunil Mushran 2010-01-14 1:01 ` Joel Becker @ 2010-01-21 15:30 ` David Teigland 2010-01-21 15:41 ` Sunil Mushran 1 sibling, 1 reply; 27+ messages in thread From: David Teigland @ 2010-01-21 15:30 UTC (permalink / raw) To: ocfs2-devel On Wed, Jan 13, 2010 at 11:14:48AM -0800, Sunil Mushran wrote: > Joel Becker wrote: > >On Wed, Jan 13, 2010 at 11:20:49AM +0800, Wengang Wang wrote: > >>before update_holders, the lock could be DCed(since no BUSY flag set by > >>here). > >> > >>and even after update_holders, the lock could be DCed too. > >> > >>so that we get ocfs2_cluster_lock()(with holders increased) returned > >>sucessfully > >>but actually we don't hold the dlm lock. --thus more than one node is > >>considering that they have the (EX) lock. > > > > You make a good point. I don't like the solution you propose, > >though. Another flag that's almost the same? Eww! There's got to be a > >better way. > > I don't see how we can resolve this without another flag. It could > be a mirror image of the PENDING flag. PENDING takes care of the gap > between BUSY set and calling dlm_lock(). This flag has to take care > of the gap between BUSY clear and inc_holders. > > Am working on improving Wengang's patch. I'm afraid I lost track of which patches I should be trying. Is there something that we expect should solve the known problems, even if it's not a final version? Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock 2010-01-21 15:30 ` David Teigland @ 2010-01-21 15:41 ` Sunil Mushran 0 siblings, 0 replies; 27+ messages in thread From: Sunil Mushran @ 2010-01-21 15:41 UTC (permalink / raw) To: ocfs2-devel Yes, we expect it to fix the issue. I'll send the patches after I get to work. On Jan 21, 2010, at 7:30 AM, David Teigland <teigland@redhat.com> wrote: > On Wed, Jan 13, 2010 at 11:14:48AM -0800, Sunil Mushran wrote: >> Joel Becker wrote: >>> On Wed, Jan 13, 2010 at 11:20:49AM +0800, Wengang Wang wrote: >>>> before update_holders, the lock could be DCed(since no BUSY flag >>>> set by >>>> here). >>>> >>>> and even after update_holders, the lock could be DCed too. >>>> >>>> so that we get ocfs2_cluster_lock()(with holders increased) >>>> returned >>>> sucessfully >>>> but actually we don't hold the dlm lock. --thus more than one >>>> node is >>>> considering that they have the (EX) lock. >>> >>> You make a good point. I don't like the solution you propose, >>> though. Another flag that's almost the same? Eww! There's got >>> to be a >>> better way. >> >> I don't see how we can resolve this without another flag. It could >> be a mirror image of the PENDING flag. PENDING takes care of the gap >> between BUSY set and calling dlm_lock(). This flag has to take care >> of the gap between BUSY clear and inc_holders. >> >> Am working on improving Wengang's patch. > > I'm afraid I lost track of which patches I should be trying. Is there > something that we expect should solve the known problems, even if > it's not > a final version? > Dave > ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2010-01-21 15:41 UTC | newest] Thread overview: 27+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-01-06 8:34 [Ocfs2-devel] [PATCH] ocfs2: fix __ocfs2_cluster_lock() dead lock Wengang Wang 2010-01-07 2:00 ` Joel Becker 2010-01-07 5:22 ` Wengang Wang 2010-01-09 18:05 ` Wengang Wang 2010-01-09 18:24 ` Sunil Mushran 2010-01-10 10:05 ` Joel Becker 2010-01-11 1:55 ` Wengang Wang 2010-01-12 1:59 ` Joel Becker 2010-01-12 3:58 ` Wengang Wang 2010-01-12 11:18 ` Joel Becker 2010-01-12 12:04 ` Wengang Wang 2010-01-12 13:01 ` Joel Becker 2010-01-12 14:30 ` Wengang Wang 2010-01-12 19:14 ` Sunil Mushran 2010-01-13 6:15 ` Wengang Wang 2010-01-12 19:47 ` David Teigland 2010-01-13 3:20 ` Wengang Wang 2010-01-13 7:08 ` Wengang Wang 2010-01-13 7:57 ` Joel Becker 2010-01-13 9:23 ` Wengang Wang 2010-01-13 9:31 ` Joel Becker 2010-01-13 9:36 ` Wengang Wang 2010-01-13 19:14 ` Sunil Mushran 2010-01-14 1:01 ` Joel Becker 2010-01-14 4:12 ` Wengang Wang 2010-01-21 15:30 ` David Teigland 2010-01-21 15:41 ` Sunil Mushran
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.