* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
From: Junxiao Bi @ 2012-07-17  7:10 UTC
To: ocfs2-devel

In the target node of the dlm lock migration, the logic that finds the
local dlm lock is wrong: it must not change the loop variable "lock"
inside the list_for_each_entry loop. Doing so can cause a NULL pointer
dereference crash.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: stable@vger.kernel.org
---
 fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 01ebfd0..0b9cc88 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
     u8 from = O2NM_MAX_NODES;
     unsigned int added = 0;
     __be64 c;
+    int found;
 
     mlog(0, "running %d locks for this lockres\n", mres->num_locks);
     for (i=0; i<mres->num_locks; i++) {
@@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
             /* MIGRATION ONLY! */
             BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
 
+            found = 0;
             spin_lock(&res->spinlock);
             for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
                 tmpq = dlm_list_idx_to_ptr(res, j);
                 list_for_each_entry(lock, tmpq, list) {
-                    if (lock->ml.cookie != ml->cookie)
-                        lock = NULL;
-                    else
+                    if (lock->ml.cookie == ml->cookie) {
+                        found = 1;
                         break;
+                    }
                 }
-                if (lock)
+                if (found)
                     break;
             }
 
             /* lock is always created locally first, and
              * destroyed locally last. it must be on the list */
-            if (!lock) {
+            if (!found) {
                 c = ml->cookie;
                 mlog(ML_ERROR, "Could not find local lock "
                      "with cookie %u:%llu, node %u, "
-- 
1.7.9.5
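
The pitfall the patch fixes is a general property of list_for_each_entry():
the loop cursor is never NULL after the loop, and assigning NULL to the
cursor inside the loop makes the macro's step expression dereference a NULL
pointer. The sketch below is a minimal userspace illustration, not kernel
code: the demo_lock structure, the cookie value 42, and the locally
re-declared list helpers are all made up for the example. It shows why
checking the cursor after the loop is unreliable and how a separate "found"
flag behaves.

#include <stdio.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)                          \
    for (pos = container_of((head)->next, typeof(*pos), member);        \
         &pos->member != (head);                                        \
         pos = container_of(pos->member.next, typeof(*pos), member))

struct demo_lock {
    unsigned long long cookie;
    struct list_head list;
};

int main(void)
{
    struct list_head granted = LIST_HEAD_INIT(granted); /* empty list */
    struct demo_lock *lock;
    int found = 0;

    /*
     * Buggy pattern: rely on the loop cursor after the loop.  On an
     * empty list the body never runs, so "lock" is left pointing at the
     * list head disguised as a demo_lock -- non-NULL, but not a valid
     * object.  (And if a non-matching entry were present, the
     * "lock = NULL" assignment would make the macro's step expression
     * dereference NULL.)
     */
    list_for_each_entry(lock, &granted, list) {
        if (lock->cookie != 42ULL)
            lock = NULL;
        else
            break;
    }
    printf("buggy check: lock = %p (non-NULL, yet no entry exists)\n",
           (void *)lock);

    /* Patched pattern: record the lookup result in a separate flag. */
    list_for_each_entry(lock, &granted, list) {
        if (lock->cookie == 42ULL) {
            found = 1;
            break;
        }
    }
    printf("fixed check: found = %d\n", found);
    return 0;
}

Built with gcc (the re-declared macro uses the typeof extension), the first
check prints a non-NULL pointer even though the list is empty, while the
flag correctly reports that nothing was found.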
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
From: Sunil Mushran @ 2012-07-17 19:49 UTC
To: ocfs2-devel

On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao.bi@oracle.com> wrote:
> In the target node of the dlm lock migration, the logic that finds the
> local dlm lock is wrong: it must not change the loop variable "lock"
> inside the list_for_each_entry loop. [...]

https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47

We had decided to go back to list_for_each().
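
The list_for_each() shape Sunil refers to sidesteps the problem because the
iterator is a plain struct list_head pointer and the result lives in a
separate variable, so a NULL result reliably means "not found". The
following is only a sketch of that pattern, with made-up names (demo_lock,
find_lock) and locally re-declared helpers; it is not the actual change
committed to dlmrecovery.c.

#include <stdio.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_entry(ptr, type, member) container_of(ptr, type, member)

#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

struct demo_lock {
    unsigned long long cookie;
    struct list_head list;
};

/* Look up a lock by cookie.  The iterator and the result are separate
 * variables, so returning NULL reliably means "not found". */
static struct demo_lock *find_lock(struct list_head *head,
                                   unsigned long long cookie)
{
    struct list_head *iter;
    struct demo_lock *lock = NULL;

    list_for_each(iter, head) {
        struct demo_lock *tmp = list_entry(iter, struct demo_lock, list);

        if (tmp->cookie == cookie) {
            lock = tmp;
            break;
        }
    }
    return lock;
}

int main(void)
{
    struct list_head granted = LIST_HEAD_INIT(granted); /* empty list */

    /* Prints a NULL pointer: the lookup fails safely on an empty list. */
    printf("lookup on empty list: %p\n",
           (void *)find_lock(&granted, 42ULL));
    return 0;
}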
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
From: Junxiao Bi @ 2012-07-18  1:36 UTC
To: ocfs2-devel

Hi Sunil,

On 07/18/2012 03:49 AM, Sunil Mushran wrote:
> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao.bi@oracle.com> wrote:
>> In the target node of the dlm lock migration, the logic that finds the
>> local dlm lock is wrong: [...]
>
> https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
>
> We had decided to go back to list_for_each().

OK, thank you. It's OK to revert it back for an introduced bug, but I
think you should cc the stable branch.
[parent not found: <CAEeiSHXpcU6xXeDzP3nA8jGDnoit-NtZHM2A73hya_9c01Y_mg@mail.gmail.com>]
[parent not found: <50076428.2040908@oracle.com>]
[parent not found: <CAEeiSHV+TVsnwqnsi0u4r=ucBoddo8wD8DcqbsCn1UoA3xjtdg@mail.gmail.com>]
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
From: Junxiao Bi @ 2014-02-24  9:07 UTC
To: ocfs2-devel

Hi,

On 07/19/2012 09:59 AM, Sunil Mushran wrote:
> Different issues.
>
> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao.bi@oracle.com> wrote:
>> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>> This bug was detected during code audit. Never seen a crash. If it
>>> does hit, then we have bigger problems. So no point posting to stable.

I read a lot of the dlm recovery code recently, and I found that this bug
can happen in the following scenario.

node 1:                                     migrate target node x:

dlm_unregister_domain()
 dlm_migrate_all_locks()
  dlm_empty_lockres()
   select node x as the migrate target node
   since there is a node x lock on the granted list
   dlm_migrate_lockres()
    dlm_mark_lockres_migrating() {
     wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
                                            <<< a node x unlock may happen here,
                                                so the res->granted list can be empty
     dlm_lockres_release_ast(dlm, res);
    }
    dlm_send_one_lockres()
                                            dlm_process_recovery_data() {
                                              tmpq is the res->granted list and is empty
                                              list_for_each_entry(lock, tmpq, list) {
                                                if (lock->ml.cookie != ml->cookie)
                                                  lock = NULL;
                                                else
                                                  break;
                                              }
                                              lock will be invalid here
                                              if (lock->ml.node != ml->node)
                                                BUG()  --> crash here
                                            }

Thanks,
Junxiao.

>> Our customer can reproduce it. Also I saw you were assigned a similar
>> bug before, see https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220,
>> is it the same BUG?
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
From: Srinivas Eeda @ 2014-02-24 23:30 UTC
To: ocfs2-devel

Junxiao, thanks for looking into this issue. Please see my comment below.

On 02/24/2014 01:07 AM, Junxiao Bi wrote:
> I read a lot of the dlm recovery code recently, and I found that this bug
> can happen in the following scenario.
>
> node 1:                                     migrate target node x:
>
> dlm_unregister_domain()
>  dlm_migrate_all_locks()
>   dlm_empty_lockres()
>    select node x as the migrate target node
>    since there is a node x lock on the granted list
>    dlm_migrate_lockres()
>     dlm_mark_lockres_migrating() {
>      wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
>                                             <<< a node x unlock may happen here,
>                                                 so the res->granted list can be empty

If the unlock request got sent at this point and the request was
*processed*, the lock must have been removed from the granted_list. If the
request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set
in dlm_lockres_release_ast would make the dlm_unlock handler return
DLM_MIGRATING to the caller (in this case node x). So I don't see how the
granted_list could hold a stale lock. Am I missing something?

I do think the race you point out below exists, but I am not sure it is
caused by the sequence described above.

>      dlm_lockres_release_ast(dlm, res);
>     }
>     dlm_send_one_lockres()
>                                             dlm_process_recovery_data() {
>                                               tmpq is the res->granted list and is empty
>                                               list_for_each_entry(lock, tmpq, list) {
>                                                 if (lock->ml.cookie != ml->cookie)
>                                                   lock = NULL;
>                                                 else
>                                                   break;
>                                               }
>                                               lock will be invalid here
>                                               if (lock->ml.node != ml->node)
>                                                 BUG()  --> crash here
>                                             }
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
From: Junxiao Bi @ 2014-02-25  1:54 UTC
To: ocfs2-devel

Hi Srini,

On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comment below.
>
> If the unlock request got sent at this point and the request was
> *processed*, the lock must have been removed from the granted_list. If the
> request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set
> in dlm_lockres_release_ast would make the dlm_unlock handler return
> DLM_MIGRATING to the caller (in this case node x). So I don't see how the
> granted_list could hold a stale lock. Am I missing something?

I agree the granted_list will not hold a stale lock. The issue is triggered
when there are no locks at all on the granted_list: on the migrate target
node, the granted_list is also empty after the unlock. Then, because of the
wrong use of list_for_each_entry in the code below, "lock" ends up non-NULL
even though the granted_list is empty. That pointer is invalid, so
lock->ml.node != ml->node may be true and trigger the BUG.

    for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
        tmpq = dlm_list_idx_to_ptr(res, j);
        list_for_each_entry(lock, tmpq, list) {
            if (lock->ml.cookie != ml->cookie)
                lock = NULL;
            else
                break;
        }
        if (lock)
            break;
    }

    /* lock is always created locally first, and
     * destroyed locally last. it must be on the list */
    if (!lock) {
        c = ml->cookie;
        BUG();
    }

    if (lock->ml.node != ml->node) {
        c = lock->ml.cookie;
        c = ml->cookie;
        BUG();
    }

Thanks,
Junxiao.
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
From: Junxiao Bi @ 2014-02-25  2:14 UTC
To: ocfs2-devel

On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comment below.
>
> If the unlock request got sent at this point and the request was
> *processed*, the lock must have been removed from the granted_list. If the
> request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set
> in dlm_lockres_release_ast would make the dlm_unlock handler return
> DLM_MIGRATING to the caller (in this case node x). So I don't see how the
> granted_list could hold a stale lock. Am I missing something?
>
> I do think the race you point out below exists, but I am not sure it is
> caused by the sequence described above.

Outside the window that runs from setting the RES_BLOCK_DIRTY flag and the
wait_event() through to dlm_lockres_release_ast(), the granted_list cannot
be empty, since wait_event() waits until dlm_thread clears the dirty flag,
and the shuffle-list step there will move another lock onto the granted
list. After the window, the DLM_LOCK_RES_MIGRATING flag stops other nodes
from unlocking locks on the granted list. So I think this window is what
causes the empty granted list and the migrate-target panic. I don't see any
other harm from it, since the migrate target node will shuffle the lists
and send the AST messages later.

Thanks,
Junxiao.
Thread overview: 7+ messages
2012-07-17 7:10 [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash Junxiao Bi
2012-07-17 19:49 ` Sunil Mushran
2012-07-18 1:36 ` Junxiao Bi
[not found] ` <CAEeiSHXpcU6xXeDzP3nA8jGDnoit-NtZHM2A73hya_9c01Y_mg@mail.gmail.com>
[not found] ` <50076428.2040908@oracle.com>
[not found] ` <CAEeiSHV+TVsnwqnsi0u4r=ucBoddo8wD8DcqbsCn1UoA3xjtdg@mail.gmail.com>
2014-02-24 9:07 ` Junxiao Bi
2014-02-24 23:30 ` Srinivas Eeda
2014-02-25 1:54 ` Junxiao Bi
2014-02-25 2:14 ` Junxiao Bi