* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
@ 2012-07-17 7:10 Junxiao Bi
2012-07-17 19:49 ` Sunil Mushran
0 siblings, 1 reply; 7+ messages in thread
From: Junxiao Bi @ 2012-07-17 7:10 UTC (permalink / raw)
To: ocfs2-devel
On the target node of a dlm lock migration, the logic that looks up
the local dlm lock is wrong: it must not change the loop variable
"lock" inside the list_for_each_entry loop. Doing so causes a
NULL-pointer dereference crash.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: stable at vger.kernel.org
---
fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 01ebfd0..0b9cc88 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
u8 from = O2NM_MAX_NODES;
unsigned int added = 0;
__be64 c;
+ int found;
mlog(0, "running %d locks for this lockres\n", mres->num_locks);
for (i=0; i<mres->num_locks; i++) {
@@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
/* MIGRATION ONLY! */
BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
+ found = 0;
spin_lock(&res->spinlock);
for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
tmpq = dlm_list_idx_to_ptr(res, j);
list_for_each_entry(lock, tmpq, list) {
- if (lock->ml.cookie != ml->cookie)
- lock = NULL;
- else
+ if (lock->ml.cookie == ml->cookie) {
+ found = 1;
break;
+ }
}
- if (lock)
+ if (found)
break;
}
/* lock is always created locally first, and
* destroyed locally last. it must be on the list */
- if (!lock) {
+ if (!found) {
c = ml->cookie;
mlog(ML_ERROR, "Could not find local lock "
"with cookie %u:%llu, node %u, "
--
1.7.9.5
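
For readers unfamiliar with the pitfall this patch fixes, here is a minimal
user-space sketch of it. This is illustrative code only, not the ocfs2
sources; the fake_lock structure and the find_buggy()/find_fixed() helpers
are made-up names. It re-implements just enough of the kernel's list macros
to show why clearing the cursor inside list_for_each_entry() is unsafe and
why the found-flag form used in the patch is not.

/*
 * Minimal sketch of the pitfall, assuming nothing beyond libc.
 * list_for_each_entry() advances the cursor via cursor->member.next,
 * so overwriting the cursor inside the body corrupts the iteration.
 * (typeof is a GNU extension, as in the kernel's own macro.)
 */
#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)				\
	for (pos = container_of((head)->next, typeof(*pos), member);	\
	     &pos->member != (head);					\
	     pos = container_of(pos->member.next, typeof(*pos), member))

struct fake_lock {		/* hypothetical stand-in for struct dlm_lock */
	unsigned long long cookie;
	struct list_head list;
};

/* Old pattern: clearing the cursor mid-loop.  If the first entry does not
 * match, the next iteration step dereferences NULL->list.next and crashes;
 * if the list is empty, "lock" is left at container_of(head), a non-NULL
 * garbage pointer, so a later NULL check cannot catch it. */
static struct fake_lock *find_buggy(struct list_head *head,
				    unsigned long long cookie)
{
	struct fake_lock *lock;

	list_for_each_entry(lock, head, list) {
		if (lock->cookie != cookie)
			lock = NULL;		/* corrupts the iterator */
		else
			break;
	}
	return lock;				/* unreliable */
}

/* Pattern used by the patch: leave the cursor alone and track a flag. */
static struct fake_lock *find_fixed(struct list_head *head,
				    unsigned long long cookie)
{
	struct fake_lock *lock;
	int found = 0;

	list_for_each_entry(lock, head, list) {
		if (lock->cookie == cookie) {
			found = 1;
			break;
		}
	}
	return found ? lock : NULL;
}

int main(void)
{
	struct list_head head = { &head, &head };
	struct fake_lock a = { .cookie = 7 };

	/* build a one-entry list: head <-> a */
	a.list.next = &head;
	a.list.prev = &head;
	head.next = &a.list;
	head.prev = &a.list;

	printf("fixed lookup, cookie 7 (present): %p\n",
	       (void *)find_fixed(&head, 7));
	printf("fixed lookup, cookie 8 (missing): %p\n",
	       (void *)find_fixed(&head, 8));

	/* find_buggy(&head, 8) would crash: the only entry mismatches, the
	 * cursor is set to NULL, and the loop increment then reads
	 * NULL->list.next. */
	return 0;
}

Built with gcc, the two find_fixed() calls print a valid pointer and NULL;
calling find_buggy() with a non-matching cookie instead segfaults on the
loop's increment step, which is the NULL-pointer crash the commit message
describes.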
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
2012-07-17 7:10 [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash Junxiao Bi
@ 2012-07-17 19:49 ` Sunil Mushran
2012-07-18 1:36 ` Junxiao Bi
0 siblings, 1 reply; 7+ messages in thread
From: Sunil Mushran @ 2012-07-17 19:49 UTC (permalink / raw)
To: ocfs2-devel
On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao.bi@oracle.com> wrote:
> In the target node of the dlm lock migration, the logic to find
> the local dlm lock is wrong, it shouldn't change the loop variable
> "lock" in the list_for_each_entry loop. This will cause a NULL-pointer
> accessing crash.
>
> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
> Cc: stable at vger.kernel.org
> ---
> fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 01ebfd0..0b9cc88 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt
> *dlm,
> u8 from = O2NM_MAX_NODES;
> unsigned int added = 0;
> __be64 c;
> + int found;
>
> mlog(0, "running %d locks for this lockres\n", mres->num_locks);
> for (i=0; i<mres->num_locks; i++) {
> @@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct
> dlm_ctxt *dlm,
> /* MIGRATION ONLY! */
> BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
>
> + found = 0;
> spin_lock(&res->spinlock);
> for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST;
> j++) {
> tmpq = dlm_list_idx_to_ptr(res, j);
> list_for_each_entry(lock, tmpq, list) {
> - if (lock->ml.cookie != ml->cookie)
> - lock = NULL;
> - else
> + if (lock->ml.cookie == ml->cookie)
> {
> + found = 1;
> break;
> + }
> }
> - if (lock)
> + if (found)
> break;
> }
>
> /* lock is always created locally first, and
> * destroyed locally last. it must be on the list
> */
> - if (!lock) {
> + if (!found) {
> c = ml->cookie;
> mlog(ML_ERROR, "Could not find local lock "
> "with cookie %u:%llu, node
> %u, "
>
https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
We had decided to go back to list_for_each().
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
2012-07-17 19:49 ` Sunil Mushran
@ 2012-07-18 1:36 ` Junxiao Bi
[not found] ` <CAEeiSHXpcU6xXeDzP3nA8jGDnoit-NtZHM2A73hya_9c01Y_mg@mail.gmail.com>
0 siblings, 1 reply; 7+ messages in thread
From: Junxiao Bi @ 2012-07-18 1:36 UTC (permalink / raw)
To: ocfs2-devel
Hi Sunil,
On 07/18/2012 03:49 AM, Sunil Mushran wrote:
> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao.bi@oracle.com
> <mailto:junxiao.bi@oracle.com>> wrote:
>
> In the target node of the dlm lock migration, the logic to find
> the local dlm lock is wrong, it shouldn't change the loop variable
> "lock" in the list_for_each_entry loop. This will cause a NULL-pointer
> accessing crash.
>
> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com
> <mailto:junxiao.bi@oracle.com>>
> Cc: stable at vger.kernel.org <mailto:stable@vger.kernel.org>
> ---
> fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 01ebfd0..0b9cc88 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct
> dlm_ctxt *dlm,
> u8 from = O2NM_MAX_NODES;
> unsigned int added = 0;
> __be64 c;
> + int found;
>
> mlog(0, "running %d locks for this lockres\n",
> mres->num_locks);
> for (i=0; i<mres->num_locks; i++) {
> @@ -1793,22 +1794,23 @@ static int
> dlm_process_recovery_data(struct dlm_ctxt *dlm,
> /* MIGRATION ONLY! */
> BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
>
> + found = 0;
> spin_lock(&res->spinlock);
> for (j = DLM_GRANTED_LIST; j <=
> DLM_BLOCKED_LIST; j++) {
> tmpq = dlm_list_idx_to_ptr(res, j);
> list_for_each_entry(lock, tmpq,
> list) {
> - if (lock->ml.cookie !=
> ml->cookie)
> - lock = NULL;
> - else
> + if (lock->ml.cookie ==
> ml->cookie) {
> + found = 1;
> break;
> + }
> }
> - if (lock)
> + if (found)
> break;
> }
>
> /* lock is always created locally first, and
> * destroyed locally last. it must be on
> the list */
> - if (!lock) {
> + if (!found) {
> c = ml->cookie;
> mlog(ML_ERROR, "Could not find
> local lock "
> "with cookie
> %u:%llu, node %u, "
>
>
>
> https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
>
>
> We had decided to go back to list_for_each().
OK, thank you. It's OK to revert it back since the bug was introduced by
that change, but I think you should cc the stable branch.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
[not found] ` <CAEeiSHV+TVsnwqnsi0u4r=ucBoddo8wD8DcqbsCn1UoA3xjtdg@mail.gmail.com>
@ 2014-02-24 9:07 ` Junxiao Bi
2014-02-24 23:30 ` Srinivas Eeda
0 siblings, 1 reply; 7+ messages in thread
From: Junxiao Bi @ 2014-02-24 9:07 UTC (permalink / raw)
To: ocfs2-devel
Hi,
On 07/19/2012 09:59 AM, Sunil Mushran wrote:
> Different issues.
>
> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao.bi@oracle.com
> <mailto:junxiao.bi@oracle.com>> wrote:
>
> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>> This bug was detected during code audit. Never seen a crash. If
>> it does hit,
>> then we have bigger problems. So no point posting to stable.
>
I have been reading a lot of the dlm recovery code recently, and I found
that this bug can happen in the following scenario.
node 1 (migrate source):
  dlm_unregister_domain()
    dlm_migrate_all_locks()
      dlm_empty_lockres()
        selects node x as the migrate target node,
        since there is a node x lock on the granted list
        dlm_migrate_lockres()
          dlm_mark_lockres_migrating() {
            wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
            <<< a node x unlock may happen here; res->granted can become empty
            dlm_lockres_release_ast(dlm, res);
          }
          dlm_send_one_lockres()

migrate target node x:
  dlm_process_recovery_data() {
      tmpq is the res->granted list and is empty
      list_for_each_entry(lock, tmpq, list) {
              if (lock->ml.cookie != ml->cookie)
                      lock = NULL;
              else
                      break;
      }
      lock will be invalid here
      if (lock->ml.node != ml->node)
              BUG();   --> crash here
  }
Thanks,
Junxiao.
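
To make the empty-list step in the scenario concrete, here is a minimal
user-space sketch; it is illustrative only, and the trimmed-down list macros
and the fake_lock structure are stand-ins rather than the ocfs2 definitions.
Walking an empty list with list_for_each_entry() leaves the cursor at
container_of(head), which is non-NULL, so the old "if (!lock)" check passes
and the later lock->ml.node comparison reads garbage.

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)				\
	for (pos = container_of((head)->next, typeof(*pos), member);	\
	     &pos->member != (head);					\
	     pos = container_of(pos->member.next, typeof(*pos), member))

struct fake_lock {		/* stand-in for struct dlm_lock */
	unsigned long long cookie;
	unsigned char node;
	struct list_head list;
};

int main(void)
{
	struct list_head granted = { &granted, &granted };  /* empty list */
	struct fake_lock *lock;

	list_for_each_entry(lock, &granted, list) {
		if (lock->cookie == 0x1234ULL)
			break;		/* body never runs: list is empty */
	}

	/* The loop initialised the cursor from the head itself, so "lock"
	 * is container_of(&granted): non-NULL, but not a real entry. */
	printf("cursor = %p, head = %p, cursor is %sNULL\n",
	       (void *)lock, (void *)&granted, lock ? "not " : "");

	/* Reading lock->node here would read whatever sits near "granted"
	 * on the stack, which is why the node comparison in the kernel
	 * code can mismatch and hit BUG(). */
	return 0;
}

With the found flag (or the original list_for_each() form), the empty-list
case falls through to the explicit "could not find local lock" error path
instead of dereferencing a bogus pointer.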
>
> Our customer can reproduce it. Also I saw you were assigned a
> similar bug before, see
> https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
> same BUG?
>>
>> On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi
>> <junxiao.bi at oracle.com <mailto:junxiao.bi@oracle.com>> wrote:
>>
>> Hi Sunil,
>>
>> On 07/18/2012 03:49 AM, Sunil Mushran wrote:
>>> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi
>>> <junxiao.bi at oracle.com <mailto:junxiao.bi@oracle.com>> wrote:
>>>
>>> In the target node of the dlm lock migration, the logic
>>> to find
>>> the local dlm lock is wrong, it shouldn't change the
>>> loop variable
>>> "lock" in the list_for_each_entry loop. This will cause
>>> a NULL-pointer
>>> accessing crash.
>>>
>>> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com
>>> <mailto:junxiao.bi@oracle.com>>
>>> Cc: stable at vger.kernel.org <mailto:stable@vger.kernel.org>
>>> ---
>>> fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++-----
>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c
>>> b/fs/ocfs2/dlm/dlmrecovery.c
>>> index 01ebfd0..0b9cc88 100644
>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>> @@ -1762,6 +1762,7 @@ static int
>>> dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>> u8 from = O2NM_MAX_NODES;
>>> unsigned int added = 0;
>>> __be64 c;
>>> + int found;
>>>
>>> mlog(0, "running %d locks for this lockres\n",
>>> mres->num_locks);
>>> for (i=0; i<mres->num_locks; i++) {
>>> @@ -1793,22 +1794,23 @@ static int
>>> dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>> /* MIGRATION ONLY! */
>>> BUG_ON(!(mres->flags &
>>> DLM_MRES_MIGRATION));
>>>
>>> + found = 0;
>>> spin_lock(&res->spinlock);
>>> for (j = DLM_GRANTED_LIST; j <=
>>> DLM_BLOCKED_LIST; j++) {
>>> tmpq =
>>> dlm_list_idx_to_ptr(res, j);
>>>
>>> list_for_each_entry(lock, tmpq, list) {
>>> - if
>>> (lock->ml.cookie != ml->cookie)
>>> - lock = NULL;
>>> - else
>>> + if
>>> (lock->ml.cookie == ml->cookie) {
>>> + found = 1;
>>> break;
>>> + }
>>> }
>>> - if (lock)
>>> + if (found)
>>> break;
>>> }
>>>
>>> /* lock is always created
>>> locally first, and
>>> * destroyed locally last. it
>>> must be on the list */
>>> - if (!lock) {
>>> + if (!found) {
>>> c = ml->cookie;
>>> mlog(ML_ERROR, "Could
>>> not find local lock "
>>> "with
>>> cookie %u:%llu, node %u, "
>>>
>>>
>>>
>>> https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
>>>
>>>
>>> We had decided to go back to list_for_each().
>>
>> OK, thank you. It's OK to revert it back for a introduced
>> bug. But I think you'd better cc stable branch.
>>
>>
>
>
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
2014-02-24 9:07 ` Junxiao Bi
@ 2014-02-24 23:30 ` Srinivas Eeda
2014-02-25 1:54 ` Junxiao Bi
2014-02-25 2:14 ` Junxiao Bi
0 siblings, 2 replies; 7+ messages in thread
From: Srinivas Eeda @ 2014-02-24 23:30 UTC (permalink / raw)
To: ocfs2-devel
Junxiao, thanks for looking into this issue. Please see my comment below
On 02/24/2014 01:07 AM, Junxiao Bi wrote:
> Hi,
>
> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
>> Different issues.
>>
>> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao.bi@oracle.com
>> <mailto:junxiao.bi@oracle.com>> wrote:
>>
>> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>> This bug was detected during code audit. Never seen a crash. If
>>> it does hit,
>>> then we have bigger problems. So no point posting to stable.
>>
> I read a lot of dlm recovery code recently, I found this bug could
> happen at the following scenario.
>
> node 1: migrate target node x:
> dlm_unregister_domain()
> dlm_migrate_all_locks()
> dlm_empty_lockres()
> select node x as migrate target node
> since there is a node x lock on the granted list.
> dlm_migrate_lockres()
> dlm_mark_lockres_migrating() {
> wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
> <<< node x unlock may happen here, res->granted list can be empty.
If the unlock request got sent at this point, and the request was
*processed*, the lock must have been removed from the granted_list. If the
request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING flag set in
dlm_lockres_release_ast would make the dlm_unlock handler return
DLM_MIGRATING to the caller (in this case node x). So I don't see how the
granted_list could end up with a stale lock. Am I missing something?

I do think the race you point to below exists, but I am not sure it is
caused by the race described above.
> dlm_lockres_release_ast(dlm, res);
> }
> dlm_send_one_lockres()
> dlm_process_recovery_data() {
> tmpq is
> res->granted list and is empty.
> list_for_each_entry(lock, tmpq, list) {
> if
> (lock->ml.cookie != ml->cookie)
> lock = NULL;
> else
> break;
> }
> lock will be
> invalid here.
> if (lock->ml.node
> != ml->node)
> BUG() -->
> crash here.
> }
>
> Thanks,
> Junxiao.
>>
>> Our customer can reproduce it. Also I saw you were assigned a
>> similar bug before, see
>> https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
>> same BUG?
>>>
>>> On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi
>>> <junxiao.bi at oracle.com <mailto:junxiao.bi@oracle.com>> wrote:
>>>
>>> Hi Sunil,
>>>
>>> On 07/18/2012 03:49 AM, Sunil Mushran wrote:
>>>> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi
>>>> <junxiao.bi at oracle.com <mailto:junxiao.bi@oracle.com>> wrote:
>>>>
>>>> In the target node of the dlm lock migration, the logic
>>>> to find
>>>> the local dlm lock is wrong, it shouldn't change the
>>>> loop variable
>>>> "lock" in the list_for_each_entry loop. This will cause
>>>> a NULL-pointer
>>>> accessing crash.
>>>>
>>>> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com
>>>> <mailto:junxiao.bi@oracle.com>>
>>>> Cc: stable at vger.kernel.org <mailto:stable@vger.kernel.org>
>>>> ---
>>>> fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++-----
>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c
>>>> b/fs/ocfs2/dlm/dlmrecovery.c
>>>> index 01ebfd0..0b9cc88 100644
>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>> @@ -1762,6 +1762,7 @@ static int
>>>> dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>> u8 from = O2NM_MAX_NODES;
>>>> unsigned int added = 0;
>>>> __be64 c;
>>>> + int found;
>>>>
>>>> mlog(0, "running %d locks for this lockres\n",
>>>> mres->num_locks);
>>>> for (i=0; i<mres->num_locks; i++) {
>>>> @@ -1793,22 +1794,23 @@ static int
>>>> dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>> /* MIGRATION ONLY! */
>>>> BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
>>>>
>>>> + found = 0;
>>>> spin_lock(&res->spinlock);
>>>> for (j = DLM_GRANTED_LIST; j <=
>>>> DLM_BLOCKED_LIST; j++) {
>>>> tmpq =
>>>> dlm_list_idx_to_ptr(res, j);
>>>> list_for_each_entry(lock, tmpq, list) {
>>>> - if
>>>> (lock->ml.cookie != ml->cookie)
>>>> - lock = NULL;
>>>> - else
>>>> + if
>>>> (lock->ml.cookie == ml->cookie) {
>>>> + found = 1;
>>>> break;
>>>> + }
>>>> }
>>>> - if (lock)
>>>> + if (found)
>>>> break;
>>>> }
>>>>
>>>> /* lock is always created
>>>> locally first, and
>>>> * destroyed locally last. it
>>>> must be on the list */
>>>> - if (!lock) {
>>>> + if (!found) {
>>>> c = ml->cookie;
>>>> mlog(ML_ERROR, "Could not find local lock "
>>>> "with cookie %u:%llu, node %u, "
>>>>
>>>>
>>>>
>>>> https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
>>>>
>>>>
>>>> We had decided to go back to list_for_each().
>>>
>>> OK, thank you. It's OK to revert it back for a introduced
>>> bug. But I think you'd better cc stable branch.
>>>
>>>
>>
>>
>>
>
>
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
2014-02-24 23:30 ` Srinivas Eeda
@ 2014-02-25 1:54 ` Junxiao Bi
2014-02-25 2:14 ` Junxiao Bi
1 sibling, 0 replies; 7+ messages in thread
From: Junxiao Bi @ 2014-02-25 1:54 UTC (permalink / raw)
To: ocfs2-devel
Hi Srini,
On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comment below
>
> On 02/24/2014 01:07 AM, Junxiao Bi wrote:
>> Hi,
>>
>> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
>>> Different issues.
>>>
>>> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao.bi@oracle.com
>>> <mailto:junxiao.bi@oracle.com>> wrote:
>>>
>>> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>>> This bug was detected during code audit. Never seen a crash. If
>>>> it does hit,
>>>> then we have bigger problems. So no point posting to stable.
>>>
>> I read a lot of dlm recovery code recently, I found this bug could
>> happen at the following scenario.
>>
>> node 1: migrate target
>> node x:
>> dlm_unregister_domain()
>> dlm_migrate_all_locks()
>> dlm_empty_lockres()
>> select node x as migrate target node
>> since there is a node x lock on the granted list.
>> dlm_migrate_lockres()
>> dlm_mark_lockres_migrating() {
>> wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
>> <<< node x unlock may happen here, res->granted list can be empty.
> If the unlock request got sent at this point, and if the request was
> *processed*, lock must have been removed from the granted_list. If the
> request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING set
> in dlm_lockres_release_ast would make dlm_unlock handler to return
> DLM_MIGRATING to the caller (in this case node x). So I don't see how
> granted_list could have stale lock. Am I missing something ?
I agree the granted_list will not have a stale lock. The issue is triggered
when there are no locks on the granted_list. On the migrate target node, the
granted_list is also empty after the unlock. Then, due to the wrong use of
list_for_each_entry in the following code, "lock" ends up non-NULL even
though the granted_list is empty. That lock pointer is invalid, so
lock->ml.node != ml->node can be true and trigger the BUG.
	for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
		tmpq = dlm_list_idx_to_ptr(res, j);
		list_for_each_entry(lock, tmpq, list) {
			if (lock->ml.cookie != ml->cookie)
				lock = NULL;
			else
				break;
		}
		if (lock)
			break;
	}

	/* lock is always created locally first, and
	 * destroyed locally last. it must be on the list */
	if (!lock) {
		c = ml->cookie;
		BUG();
	}

	if (lock->ml.node != ml->node) {
		c = lock->ml.cookie;
		c = ml->cookie;
		BUG();
	}
Thanks,
Junxiao.
>
> I do think there is such race that you pointed below exist, but I am
> not sure if it was due to the above race described.
>
>> dlm_lockres_release_ast(dlm, res);
>> }
>> dlm_send_one_lockres()
>>
>> dlm_process_recovery_data() {
>> tmpq is
>> res->granted list and is empty.
>>
>> list_for_each_entry(lock, tmpq, list) {
>> if
>> (lock->ml.cookie != ml->cookie)
>> lock = NULL;
>> else
>> break;
>> }
>> lock will be
>> invalid here.
>> if (lock->ml.node
>> != ml->node)
>> BUG() -->
>> crash here.
>> }
>>
>> Thanks,
>> Junxiao.
>>>
>>> Our customer can reproduce it. Also I saw you were assigned a
>>> similar bug before, see
>>> https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
>>> same BUG?
>>>>
>>>> On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi
>>>> <junxiao.bi at oracle.com <mailto:junxiao.bi@oracle.com>> wrote:
>>>>
>>>> Hi Sunil,
>>>>
>>>> On 07/18/2012 03:49 AM, Sunil Mushran wrote:
>>>>> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi
>>>>> <junxiao.bi at oracle.com <mailto:junxiao.bi@oracle.com>> wrote:
>>>>>
>>>>> In the target node of the dlm lock migration, the
>>>>> logic to find
>>>>> the local dlm lock is wrong, it shouldn't change the
>>>>> loop variable
>>>>> "lock" in the list_for_each_entry loop. This will
>>>>> cause a NULL-pointer
>>>>> accessing crash.
>>>>>
>>>>> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com
>>>>> <mailto:junxiao.bi@oracle.com>>
>>>>> Cc: stable at vger.kernel.org <mailto:stable@vger.kernel.org>
>>>>> ---
>>>>> fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++-----
>>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c
>>>>> b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> index 01ebfd0..0b9cc88 100644
>>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> @@ -1762,6 +1762,7 @@ static int
>>>>> dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>>> u8 from = O2NM_MAX_NODES;
>>>>> unsigned int added = 0;
>>>>> __be64 c;
>>>>> + int found;
>>>>>
>>>>> mlog(0, "running %d locks for this lockres\n",
>>>>> mres->num_locks);
>>>>> for (i=0; i<mres->num_locks; i++) {
>>>>> @@ -1793,22 +1794,23 @@ static int
>>>>> dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>>> /* MIGRATION ONLY! */
>>>>> BUG_ON(!(mres->flags &
>>>>> DLM_MRES_MIGRATION));
>>>>>
>>>>> + found = 0;
>>>>> spin_lock(&res->spinlock);
>>>>> for (j = DLM_GRANTED_LIST; j
>>>>> <= DLM_BLOCKED_LIST; j++) {
>>>>> tmpq =
>>>>> dlm_list_idx_to_ptr(res, j);
>>>>>
>>>>> list_for_each_entry(lock, tmpq, list) {
>>>>> - if
>>>>> (lock->ml.cookie != ml->cookie)
>>>>> - lock =
>>>>> NULL;
>>>>> - else
>>>>> + if
>>>>> (lock->ml.cookie == ml->cookie) {
>>>>> + found = 1;
>>>>> break;
>>>>> + }
>>>>> }
>>>>> - if (lock)
>>>>> + if (found)
>>>>> break;
>>>>> }
>>>>>
>>>>> /* lock is always created
>>>>> locally first, and
>>>>> * destroyed locally last. it
>>>>> must be on the list */
>>>>> - if (!lock) {
>>>>> + if (!found) {
>>>>> c = ml->cookie;
>>>>> mlog(ML_ERROR, "Could
>>>>> not find local lock "
>>>>> "with
>>>>> cookie %u:%llu, node %u, "
>>>>>
>>>>>
>>>>>
>>>>> https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
>>>>>
>>>>>
>>>>> We had decided to go back to list_for_each().
>>>>
>>>> OK, thank you. It's OK to revert it back for a introduced
>>>> bug. But I think you'd better cc stable branch.
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
2014-02-24 23:30 ` Srinivas Eeda
2014-02-25 1:54 ` Junxiao Bi
@ 2014-02-25 2:14 ` Junxiao Bi
1 sibling, 0 replies; 7+ messages in thread
From: Junxiao Bi @ 2014-02-25 2:14 UTC (permalink / raw)
To: ocfs2-devel
On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comment below
>
> On 02/24/2014 01:07 AM, Junxiao Bi wrote:
>> Hi,
>>
>> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
>>> Different issues.
>>>
>>> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao.bi@oracle.com
>>> <mailto:junxiao.bi@oracle.com>> wrote:
>>>
>>> On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>>> This bug was detected during code audit. Never seen a crash. If
>>>> it does hit,
>>>> then we have bigger problems. So no point posting to stable.
>>>
>> I read a lot of dlm recovery code recently, I found this bug could
>> happen at the following scenario.
>>
>> node 1: migrate target
>> node x:
>> dlm_unregister_domain()
>> dlm_migrate_all_locks()
>> dlm_empty_lockres()
>> select node x as migrate target node
>> since there is a node x lock on the granted list.
>> dlm_migrate_lockres()
>> dlm_mark_lockres_migrating() {
>> wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
>> <<< node x unlock may happen here, res->granted list can be empty.
> If the unlock request got sent at this point, and if the request was
> *processed*, lock must have been removed from the granted_list. If the
> request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING set
> in dlm_lockres_release_ast would make dlm_unlock handler to return
> DLM_MIGRATING to the caller (in this case node x). So I don't see how
> granted_list could have stale lock. Am I missing something ?
>
> I do think there is such race that you pointed below exist, but I am
> not sure if it was due to the above race described.
Outside the window from setting the RES_BLOCK_DIRTY flag and wait_event() to
dlm_lockres_release_ast(), the granted_list cannot be empty, since
wait_event() waits until dlm_thread clears the dirty flag, and the
shuffle-list step there moves another lock onto the granted list. After the
window, the DLM_MIGRATING flag stops other nodes' unlocks from touching the
granted list. So I think an unlock landing inside this window is what causes
the empty granted list and the panic on the migrate target. I don't see any
other harm from it, since the migrate target node will shuffle the list and
send the AST message later.
Thanks,
Junxiao.
>
>> dlm_lockres_release_ast(dlm, res);
>> }
>> dlm_send_one_lockres()
>>
>> dlm_process_recovery_data() {
>> tmpq is
>> res->granted list and is empty.
>>
>> list_for_each_entry(lock, tmpq, list) {
>> if
>> (lock->ml.cookie != ml->cookie)
>> lock = NULL;
>> else
>> break;
>> }
>> lock will be
>> invalid here.
>> if (lock->ml.node
>> != ml->node)
>> BUG() -->
>> crash here.
>> }
>>
>> Thanks,
>> Junxiao.
>>>
>>> Our customer can reproduce it. Also I saw you were assigned a
>>> similar bug before, see
>>> https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
>>> same BUG?
>>>>
>>>> On Tue, Jul 17, 2012 at 6:36 PM, Junxiao Bi
>>>> <junxiao.bi at oracle.com <mailto:junxiao.bi@oracle.com>> wrote:
>>>>
>>>> Hi Sunil,
>>>>
>>>> On 07/18/2012 03:49 AM, Sunil Mushran wrote:
>>>>> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi
>>>>> <junxiao.bi at oracle.com <mailto:junxiao.bi@oracle.com>> wrote:
>>>>>
>>>>> In the target node of the dlm lock migration, the
>>>>> logic to find
>>>>> the local dlm lock is wrong, it shouldn't change the
>>>>> loop variable
>>>>> "lock" in the list_for_each_entry loop. This will
>>>>> cause a NULL-pointer
>>>>> accessing crash.
>>>>>
>>>>> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com
>>>>> <mailto:junxiao.bi@oracle.com>>
>>>>> Cc: stable at vger.kernel.org <mailto:stable@vger.kernel.org>
>>>>> ---
>>>>> fs/ocfs2/dlm/dlmrecovery.c | 12 +++++++-----
>>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c
>>>>> b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> index 01ebfd0..0b9cc88 100644
>>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> @@ -1762,6 +1762,7 @@ static int
>>>>> dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>>> u8 from = O2NM_MAX_NODES;
>>>>> unsigned int added = 0;
>>>>> __be64 c;
>>>>> + int found;
>>>>>
>>>>> mlog(0, "running %d locks for this lockres\n",
>>>>> mres->num_locks);
>>>>> for (i=0; i<mres->num_locks; i++) {
>>>>> @@ -1793,22 +1794,23 @@ static int
>>>>> dlm_process_recovery_data(struct dlm_ctxt *dlm,
>>>>> /* MIGRATION ONLY! */
>>>>> BUG_ON(!(mres->flags &
>>>>> DLM_MRES_MIGRATION));
>>>>>
>>>>> + found = 0;
>>>>> spin_lock(&res->spinlock);
>>>>> for (j = DLM_GRANTED_LIST; j
>>>>> <= DLM_BLOCKED_LIST; j++) {
>>>>> tmpq =
>>>>> dlm_list_idx_to_ptr(res, j);
>>>>>
>>>>> list_for_each_entry(lock, tmpq, list) {
>>>>> - if
>>>>> (lock->ml.cookie != ml->cookie)
>>>>> - lock =
>>>>> NULL;
>>>>> - else
>>>>> + if
>>>>> (lock->ml.cookie == ml->cookie) {
>>>>> + found = 1;
>>>>> break;
>>>>> + }
>>>>> }
>>>>> - if (lock)
>>>>> + if (found)
>>>>> break;
>>>>> }
>>>>>
>>>>> /* lock is always created
>>>>> locally first, and
>>>>> * destroyed locally last. it
>>>>> must be on the list */
>>>>> - if (!lock) {
>>>>> + if (!found) {
>>>>> c = ml->cookie;
>>>>> mlog(ML_ERROR, "Could
>>>>> not find local lock "
>>>>> "with
>>>>> cookie %u:%llu, node %u, "
>>>>>
>>>>>
>>>>>
>>>>> https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
>>>>>
>>>>>
>>>>> We had decided to go back to list_for_each().
>>>>
>>>> OK, thank you. It's OK to revert it back for a introduced
>>>> bug. But I think you'd better cc stable branch.
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread (newest: 2014-02-25  2:14 UTC)
Thread overview: 7+ messages
2012-07-17 7:10 [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash Junxiao Bi
2012-07-17 19:49 ` Sunil Mushran
2012-07-18 1:36 ` Junxiao Bi
[not found] ` <CAEeiSHXpcU6xXeDzP3nA8jGDnoit-NtZHM2A73hya_9c01Y_mg@mail.gmail.com>
[not found] ` <50076428.2040908@oracle.com>
[not found] ` <CAEeiSHV+TVsnwqnsi0u4r=ucBoddo8wD8DcqbsCn1UoA3xjtdg@mail.gmail.com>
2014-02-24 9:07 ` Junxiao Bi
2014-02-24 23:30 ` Srinivas Eeda
2014-02-25 1:54 ` Junxiao Bi
2014-02-25 2:14 ` Junxiao Bi