ocfs2-devel.oss.oracle.com archive mirror
* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
@ 2012-07-17  7:10 Junxiao Bi
  2012-07-17 19:49 ` Sunil Mushran
  0 siblings, 1 reply; 7+ messages in thread
From: Junxiao Bi @ 2012-07-17  7:10 UTC
  To: ocfs2-devel

On the target node of a dlm lock migration, the logic that finds the
local dlm lock is wrong: the loop variable "lock" must not be modified
inside the list_for_each_entry loop, because the macro advances the
cursor via lock->list.next. Setting it to NULL causes a NULL-pointer
dereference crash on the next iteration.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: stable@vger.kernel.org
---
 fs/ocfs2/dlm/dlmrecovery.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 01ebfd0..0b9cc88 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -1762,6 +1762,7 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 	u8 from = O2NM_MAX_NODES;
 	unsigned int added = 0;
 	__be64 c;
+	int found;
 
 	mlog(0, "running %d locks for this lockres\n", mres->num_locks);
 	for (i=0; i<mres->num_locks; i++) {
@@ -1793,22 +1794,23 @@ static int dlm_process_recovery_data(struct dlm_ctxt *dlm,
 			/* MIGRATION ONLY! */
 			BUG_ON(!(mres->flags & DLM_MRES_MIGRATION));
 
+			found = 0;
 			spin_lock(&res->spinlock);
 			for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
 				tmpq = dlm_list_idx_to_ptr(res, j);
 				list_for_each_entry(lock, tmpq, list) {
-					if (lock->ml.cookie != ml->cookie)
-						lock = NULL;
-					else
+					if (lock->ml.cookie == ml->cookie) {
+						found = 1;
 						break;
+					}
 				}
-				if (lock)
+				if (found)
 					break;
 			}
 
 			/* lock is always created locally first, and
 			 * destroyed locally last.  it must be on the list */
-			if (!lock) {
+			if (!found) {
 				c = ml->cookie;
 				mlog(ML_ERROR, "Could not find local lock "
 					       "with cookie %u:%llu, node %u, "
-- 
1.7.9.5
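
A minimal userspace sketch of the failure mode described above. The list
helpers below are simplified stand-ins for the kernel's <linux/list.h>,
and "dlm_lock_demo" is a made-up stand-in for struct dlm_lock; the point
is that list_for_each_entry() advances its cursor through
pos->member.next, so assigning NULL to the cursor mid-loop dereferences
NULL on the very next iteration:

	#include <stddef.h>
	#include <stdio.h>

	/* Simplified stand-ins for the kernel's <linux/list.h>. */
	struct list_head { struct list_head *next, *prev; };

	#define container_of(ptr, type, member) \
		((type *)((char *)(ptr) - offsetof(type, member)))

	#define list_for_each_entry(pos, head, member)                        \
		for (pos = container_of((head)->next, typeof(*pos), member);  \
		     &pos->member != (head);                                  \
		     pos = container_of(pos->member.next, typeof(*pos), member))

	/* Hypothetical stand-in for struct dlm_lock. */
	struct dlm_lock_demo { unsigned long long cookie; struct list_head list; };

	int main(void)
	{
		struct list_head head = { &head, &head };
		struct dlm_lock_demo a = { .cookie = 1 };
		struct dlm_lock_demo *lock;

		/* link "a" as the only element on the list */
		head.next = head.prev = &a.list;
		a.list.next = a.list.prev = &head;

		list_for_each_entry(lock, &head, list) {
			if (lock->cookie != 2)  /* no match */
				lock = NULL;    /* bug: the advance step now does
						 * lock->list.next on a NULL pointer */
			else
				break;
		}
		/* never reached: the loop crashes while advancing past "a" */
		printf("unreachable\n");
		return 0;
	}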


* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
  2012-07-17  7:10 [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash Junxiao Bi
@ 2012-07-17 19:49 ` Sunil Mushran
  2012-07-18  1:36   ` Junxiao Bi
  0 siblings, 1 reply; 7+ messages in thread
From: Sunil Mushran @ 2012-07-17 19:49 UTC
  To: ocfs2-devel

On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao.bi@oracle.com> wrote:

>
> [snip: patch quoted in full]
>


https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47

We had decided to go back to list_for_each().
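
A sketch of the list_for_each() form being referred to (a reconstruction
of the direction, not the committed change; j, tmpq, res and ml are the
surrounding dlm_process_recovery_data() locals). A raw struct list_head
cursor drives the loop, so "lock" can be reset freely and is guaranteed
NULL when nothing matched, including on an empty list:

	struct dlm_lock *lock = NULL;
	struct list_head *iter;

	for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
		tmpq = dlm_list_idx_to_ptr(res, j);
		list_for_each(iter, tmpq) {
			lock = list_entry(iter, struct dlm_lock, list);
			if (lock->ml.cookie == ml->cookie)
				break;
			/* safe: "iter", not "lock", drives the loop */
			lock = NULL;
		}
		if (lock)
			break;
	}

With this shape the existing "if (!lock)" check after the loop works
again, because "lock" can only be non-NULL on a real match.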


* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
  2012-07-17 19:49 ` Sunil Mushran
@ 2012-07-18  1:36   ` Junxiao Bi
       [not found]     ` <CAEeiSHXpcU6xXeDzP3nA8jGDnoit-NtZHM2A73hya_9c01Y_mg@mail.gmail.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Junxiao Bi @ 2012-07-18  1:36 UTC
  To: ocfs2-devel

Hi Sunil,

On 07/18/2012 03:49 AM, Sunil Mushran wrote:
> On Tue, Jul 17, 2012 at 12:10 AM, Junxiao Bi <junxiao.bi@oracle.com
> <mailto:junxiao.bi@oracle.com>> wrote:
>
>     [snip: patch quoted in full]
>
> https://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=blobdiff;f=fs/ocfs2/dlm/dlmrecovery.c;h=c881be6043a8c27c26ee44d217fb8ecf1eb37e02;hp=01ebfd0bdad72264b99345378f0c6febe246503d;hb=13279667cc8bbaf901591dee96f762d4aab8b307;hpb=a5ae0116eb56ec7c128e84fe15646a5cb9a8cb47
>
>
> We had decided to go back to list_for_each().

OK, thank you. It's fine to revert it back, since this was an introduced
bug. But I think you'd better cc the stable branch.


* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
       [not found]         ` <CAEeiSHV+TVsnwqnsi0u4r=ucBoddo8wD8DcqbsCn1UoA3xjtdg@mail.gmail.com>
@ 2014-02-24  9:07           ` Junxiao Bi
  2014-02-24 23:30             ` Srinivas Eeda
  0 siblings, 1 reply; 7+ messages in thread
From: Junxiao Bi @ 2014-02-24  9:07 UTC
  To: ocfs2-devel

Hi,

On 07/19/2012 09:59 AM, Sunil Mushran wrote:
> Different issues.
>
> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao.bi@oracle.com
> <mailto:junxiao.bi@oracle.com>> wrote:
>
>     On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>     This bug was detected during code audit. Never seen a crash. If
>>     it does hit,
>>     then we have bigger problems. So no point posting to stable.
>
I have been reading a lot of the dlm recovery code recently, and I found
that this bug can happen in the following scenario.
node 1:                                             migrate target node x:
dlm_unregister_domain()
 dlm_migrate_all_locks()
  dlm_empty_lockres()
   select node x as migrate target node
   since there is a node x lock on the granted list.
   dlm_migrate_lockres()
    dlm_mark_lockres_migrating() {
     wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
<<< node x unlock may happen here, res->granted list can be empty.
     dlm_lockres_release_ast(dlm, res);
    }
    dlm_send_one_lockres()
                                                    dlm_process_recovery_data() {
                                                     tmpq is res->granted list
                                                     and is empty.
                                                     list_for_each_entry(lock, tmpq, list) {
                                                      if (lock->ml.cookie != ml->cookie)
                                                       lock = NULL;
                                                      else
                                                       break;
                                                     }
                                                     lock will be invalid here.
                                                     if (lock->ml.node != ml->node)
                                                      BUG() --> crash here.
                                                    }

Thanks,
Junxiao.
>
>     Our customer can reproduce it. Also I saw you were assigned a
>     similar bug before, see
>     https://oss.oracle.com/bugzilla/show_bug.cgi?id=1220, is it the
>     same BUG?
>>
>> [snip: quoted patch and earlier replies]
>>



* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
  2014-02-24  9:07           ` Junxiao Bi
@ 2014-02-24 23:30             ` Srinivas Eeda
  2014-02-25  1:54               ` Junxiao Bi
  2014-02-25  2:14               ` Junxiao Bi
  0 siblings, 2 replies; 7+ messages in thread
From: Srinivas Eeda @ 2014-02-24 23:30 UTC
  To: ocfs2-devel

Junxiao, thanks for looking into this issue. Please see my comment below

On 02/24/2014 01:07 AM, Junxiao Bi wrote:
> Hi,
>
> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
>> Different issues.
>>
>> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao.bi@oracle.com 
>> <mailto:junxiao.bi@oracle.com>> wrote:
>>
>>     On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>>     This bug was detected during code audit. Never seen a crash. If
>>>     it does hit,
>>>     then we have bigger problems. So no point posting to stable.
>>
> I read a lot of dlm recovery code recently, I found this bug could 
> happen at the following scenario.
>
> node 1:                                             migrate target node x:
> dlm_unregister_domain()
>  dlm_migrate_all_locks()
>   dlm_empty_lockres()
>    select node x as migrate target node
>    since there is a node x lock on the granted list.
>    dlm_migrate_lockres()
>     dlm_mark_lockres_migrating() {
>      wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
> <<< node x unlock may happen here, res->granted list can be empty.
If the unlock request got sent at this point, and if the request was
*processed*, the lock must have been removed from the granted_list. If
the request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING
flag set in dlm_lockres_release_ast would make the dlm_unlock handler
return DLM_MIGRATING to the caller (in this case node x). So I don't see
how the granted_list could hold a stale lock. Am I missing something?

I do think the race you point out below exists, but I am not sure it is
caused by the scenario described above.

>      dlm_lockres_release_ast(dlm, res);
>     }
>     dlm_send_one_lockres()
>                                        dlm_process_recovery_data() {
>                                         tmpq is res->granted list and is empty.
>                                         list_for_each_entry(lock, tmpq, list) {
>                                          if (lock->ml.cookie != ml->cookie)
>                                           lock = NULL;
>                                          else
>                                           break;
>                                         }
>                                         lock will be invalid here.
>                                         if (lock->ml.node != ml->node)
>                                          BUG() --> crash here.
>                                        }
>
> Thanks,
> Junxiao.
>>
>> [snip: quoted patch and earlier replies]



* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
  2014-02-24 23:30             ` Srinivas Eeda
@ 2014-02-25  1:54               ` Junxiao Bi
  2014-02-25  2:14               ` Junxiao Bi
  1 sibling, 0 replies; 7+ messages in thread
From: Junxiao Bi @ 2014-02-25  1:54 UTC
  To: ocfs2-devel

Hi Srini,

On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comment below
>
> On 02/24/2014 01:07 AM, Junxiao Bi wrote:
>> Hi,
>>
>> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
>>> Different issues.
>>>
>>> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao.bi@oracle.com
>>> <mailto:junxiao.bi@oracle.com>> wrote:
>>>
>>>     On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>>>     This bug was detected during code audit. Never seen a crash. If
>>>>     it does hit,
>>>>     then we have bigger problems. So no point posting to stable.
>>>
>> I read a lot of dlm recovery code recently, I found this bug could
>> happen at the following scenario.
>>
>> node 1:                                             migrate target node x:
>> dlm_unregister_domain()
>>  dlm_migrate_all_locks()
>>   dlm_empty_lockres()
>>    select node x as migrate target node
>>    since there is a node x lock on the granted list.
>>    dlm_migrate_lockres()
>>     dlm_mark_lockres_migrating() {
>>      wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
>> <<< node x unlock may happen here, res->granted list can be empty.
> If the unlock request got sent at this point, and if the request was
> *processed*, lock must have been removed from the granted_list. If the
> request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING set
> in dlm_lockres_release_ast would make dlm_unlock handler to return
> DLM_MIGRATING to the caller (in this case node x). So I don't see how
> granted_list could have stale lock. Am I missing something ?
I agree the granted_list will not hold a stale lock. The issue is
triggered when there are no locks on the granted_list. On the migrate
target node, the granted_list is also empty after the unlock. Then, due
to the wrong use of list_for_each_entry in the following code, "lock"
ends up non-NULL even though the granted_list is empty. The pointer is
invalid, so lock->ml.node != ml->node can be true and trigger the BUG().


                        for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
                                tmpq = dlm_list_idx_to_ptr(res, j);
                                list_for_each_entry(lock, tmpq, list) {
                                        if (lock->ml.cookie != ml->cookie)
                                                lock = NULL;
                                        else
                                                break;
                                }
                                if (lock)
                                        break;
                        }

                        /* lock is always created locally first, and
                         * destroyed locally last.  it must be on the list */
                        if (!lock) {
                                c = ml->cookie;
                                ...
                                BUG();
                        }

                        if (lock->ml.node != ml->node) {
                                c = lock->ml.cookie;
                                c = ml->cookie;
                                BUG();
                        }

Thanks,
Junxiao.
>
> [snip: remainder of quoted thread]



* [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash
  2014-02-24 23:30             ` Srinivas Eeda
  2014-02-25  1:54               ` Junxiao Bi
@ 2014-02-25  2:14               ` Junxiao Bi
  1 sibling, 0 replies; 7+ messages in thread
From: Junxiao Bi @ 2014-02-25  2:14 UTC
  To: ocfs2-devel

On 02/25/2014 07:30 AM, Srinivas Eeda wrote:
> Junxiao, thanks for looking into this issue. Please see my comment below
>
> On 02/24/2014 01:07 AM, Junxiao Bi wrote:
>> Hi,
>>
>> On 07/19/2012 09:59 AM, Sunil Mushran wrote:
>>> Different issues.
>>>
>>> On Wed, Jul 18, 2012 at 6:34 PM, Junxiao Bi <junxiao.bi@oracle.com
>>> <mailto:junxiao.bi@oracle.com>> wrote:
>>>
>>>     On 07/19/2012 12:36 AM, Sunil Mushran wrote:
>>>>     This bug was detected during code audit. Never seen a crash. If
>>>>     it does hit,
>>>>     then we have bigger problems. So no point posting to stable.
>>>
>> I read a lot of dlm recovery code recently, I found this bug could
>> happen at the following scenario.
>>
>> node 1:                                             migrate target node x:
>> dlm_unregister_domain()
>>  dlm_migrate_all_locks()
>>   dlm_empty_lockres()
>>    select node x as migrate target node
>>    since there is a node x lock on the granted list.
>>    dlm_migrate_lockres()
>>     dlm_mark_lockres_migrating() {
>>      wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
>> <<< node x unlock may happen here, res->granted list can be empty.
> If the unlock request got sent at this point, and if the request was
> *processed*, lock must have been removed from the granted_list. If the
> request was *not yet processed*, then the DLM_LOCK_RES_MIGRATING set
> in dlm_lockres_release_ast would make dlm_unlock handler to return
> DLM_MIGRATING to the caller (in this case node x). So I don't see how
> granted_list could have stale lock. Am I missing something ?
>
> I do think there is such race that you pointed below exist, but I am
> not sure if it was due to the above race described.
Outside the window from setting the RES_BLOCK_DIRTY flag and calling
wait_event() through to dlm_lockres_release_ast(), the granted_list
cannot be empty, since wait_event() waits until dlm_thread clears the
dirty flag, at which point the list shuffle moves another lock onto the
granted list. After the window, the DLM_MIGRATING flag stops other
nodes' unlocks against the granted list. So I think this window is what
empties the granted list and panics the migrate target; see the sketch
below. I don't see any other harm from it, since the migrate target node
will shuffle the list and send the ast message later.
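
In dlm_mark_lockres_migrating() terms, the window reads roughly as below
(a simplified sketch of the mainline flow, not verbatim kernel code; the
ast-reservation bookkeeping and error paths are omitted):

	spin_lock(&res->spinlock);
	res->state |= DLM_LOCK_RES_BLOCK_DIRTY;   /* block new dirtying */
	spin_unlock(&res->spinlock);

	/* wait for dlm_thread to finish any pending shuffle/dirty work */
	wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));

	/* <-- the window: DLM_LOCK_RES_MIGRATING is not set yet, so a
	 *     remote unlock from the target node can still be processed
	 *     here and leave res->granted empty */

	dlm_lockres_release_ast(dlm, res);        /* sets DLM_LOCK_RES_MIGRATING;
	                                           * after this, unlock handlers
	                                           * return DLM_MIGRATING instead */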

Thanks,
Junxiao.
>
> [snip: remainder of quoted thread]




Thread overview: 7+ messages
2012-07-17  7:10 [Ocfs2-devel] [PATCH] ocfs2: fix dlm lock migration crash Junxiao Bi
2012-07-17 19:49 ` Sunil Mushran
2012-07-18  1:36   ` Junxiao Bi
     [not found]     ` <CAEeiSHXpcU6xXeDzP3nA8jGDnoit-NtZHM2A73hya_9c01Y_mg@mail.gmail.com>
     [not found]       ` <50076428.2040908@oracle.com>
     [not found]         ` <CAEeiSHV+TVsnwqnsi0u4r=ucBoddo8wD8DcqbsCn1UoA3xjtdg@mail.gmail.com>
2014-02-24  9:07           ` Junxiao Bi
2014-02-24 23:30             ` Srinivas Eeda
2014-02-25  1:54               ` Junxiao Bi
2014-02-25  2:14               ` Junxiao Bi
