[PATCH] Fix: Sometimes mdmon throws core dump during reshape

* [PATCH] Fix: Sometimes mdmon throws core dump during reshape
@ 2011-09-05 10:39 Adam Kwolek
  2011-09-06 19:09 ` Williams, Dan J
  2011-09-07  4:07 ` NeilBrown
  0 siblings, 2 replies; 5+ messages in thread
From: Adam Kwolek @ 2011-09-05 10:39 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer

The problem was found while reshaping two volumes (raid0 and raid5) in a container.
Sometimes mdmon dumps core due to a NULL pointer dereference.

The problem occurs in the following scenario:
- managemon: is about to activate a spare (degraded raid4 volume == raid0 under takeover)
- managemon: detects the level change and signals the monitor (manage_member() calls replace_array())
- monitor: detects the raid4/5->raid0 transition and sets a->container to NULL
           to indicate array deactivation
- managemon: continues its work and tries to activate a spare (a->check_degraded is set).
             A NULL pointer is passed to the metadata handler activate_spare()
             and a core dump is generated.

To resolve this, managemon checks the a->container pointer again (after the
monitor kick) to learn whether the current array is being deactivated.

Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
---

 managemon.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/managemon.c b/managemon.c
index d020f82..3540dac 100644
--- a/managemon.c
+++ b/managemon.c
@@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mdstat,
 		}
 	}
 
+	/* we are after monitor kick,
+	 * so container field can be cleared - check it again
+	 */
+	if (a->container == NULL)
+		return;
+
 	/* We don't check the array while any update is pending, as it
 	 * might container a change (such as a spare assignment) which
 	 * could affect our decisions.
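
The effect of the added check can be illustrated with a small single-threaded
model. This is only a sketch under stated assumptions: the struct and function
names ending in _sketch, and the two monitor_* helpers, are hypothetical
stand-ins and not mdmon's real definitions. The "monitor kick" is modelled as a
plain callback so the outcome is deterministic.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for mdmon's structures. */
struct container_sketch {
	int spares_activated;
};

struct array_sketch {
	struct container_sketch *container; /* NULL: array is being deactivated */
	int check_degraded;
};

/* Stand-in for the monitor reacting to a raid4/5 -> raid0 transition. */
static void monitor_deactivates(struct array_sketch *a)
{
	a->container = NULL;
}

/* Stand-in for the monitor running without any level change. */
static void monitor_no_change(struct array_sketch *a)
{
	(void)a;
}

/* Stand-in for the metadata handler; it dereferences the container. */
static void activate_spare_sketch(struct container_sketch *c)
{
	c->spares_activated++;
}

/*
 * Returns 1 if spare activation was attempted, 0 if the array was skipped.
 * 'kick' models the point where managemon signals the monitor.
 */
static int manage_member_sketch(struct array_sketch *a,
				void (*kick)(struct array_sketch *))
{
	struct container_sketch *c;

	kick(a);

	/* The re-check from the patch: container may have been cleared. */
	c = a->container;
	if (c == NULL)
		return 0;

	if (a->check_degraded) {
		a->check_degraded = 0;
		activate_spare_sketch(c);
		return 1;
	}
	return 0;
}
```

Calling manage_member_sketch() with monitor_deactivates skips the array instead
of handing a NULL pointer to the handler; with monitor_no_change the spare
activation proceeds as before.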



* Re: [PATCH] Fix: Sometimes mdmon throws core dump during reshape
  2011-09-05 10:39 [PATCH] Fix: Sometimes mdmon throws core dump during reshape Adam Kwolek
@ 2011-09-06 19:09 ` Williams, Dan J
  2011-09-07  6:25   ` Kwolek, Adam
  2011-09-07  4:07 ` NeilBrown
  1 sibling, 1 reply; 5+ messages in thread
From: Williams, Dan J @ 2011-09-06 19:09 UTC (permalink / raw)
  To: Adam Kwolek; +Cc: neilb, linux-raid, ed.ciechanowski, wojciech.neubauer

On Mon, Sep 5, 2011 at 3:39 AM, Adam Kwolek <adam.kwolek@intel.com> wrote:
> Problem was found during reshaping 2 volumes /raid0 and raid5/ in container.
> Sometimes mdmon throws core dump due to NULL pointer exception.
>
> Problem occurs in scenario:
> - managemon: is about spare activation (degraded raid4 volume == raid0 under takeover)
> - managemon: detect level change and signals monitor (manage_member() calls replace_array())
> - monitor: detects transition raid4/5->raid0 and sets a->container to NULL
>           to indicate array deactivation

Maybe I have lost track of the reshape implementation but I don't see
where the monitor sets ->container to NULL during a reshape?  Do you
mean deactivate mdmon for the array after the reshape completes?

> - managemon : continues his work and tries to activate spare (a->check_degraded is set).
>              NULL pointer is passed to metadata handler activate_spare()
>              Core dump is generated.
>
> To resolve this situation managemon (after monitor kick) checks again
> a->container pointer to learn if current array is not to be deactivated.
[..]
> diff --git a/managemon.c b/managemon.c
> index d020f82..3540dac 100644
> --- a/managemon.c
> +++ b/managemon.c
> @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mdstat,
>                }
>        }
>
> +       /* we are after monitor kick,
> +        * so container field can be cleared - check it again
> +        */
> +       if (a->container == NULL)
> +               return;
> +

Isn't this still racy?  Because we don't wait for the monitor to run
before proceeding.
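
The concern about a remaining race can be sketched as a check-then-use window.
The model below is an illustration only: all names are hypothetical, and
whether the monitor can really run at this point in mdmon is exactly the open
question, not something this sketch establishes. The 'between' hook stands for
"the monitor thread happens to run now"; the function reports what it would
have used instead of dereferencing, so it never crashes.

```c
#include <stddef.h>

struct array_sketch {
	void *container;   /* NULL marks the array as deactivated */
};

enum outcome { SKIPPED, USED_VALID_POINTER, USED_NULL_POINTER };

/* Hypothetical hook that stands for "the monitor thread runs now". */
typedef void (*interleave_fn)(struct array_sketch *);

static void monitor_clears(struct array_sketch *a)
{
	a->container = NULL;
}

static void monitor_idle(struct array_sketch *a)
{
	(void)a;
}

/*
 * Check-then-use with a window in between. 'between' runs after the
 * NULL check and before the pointer is read again for use.
 */
static enum outcome check_then_use(struct array_sketch *a,
				   interleave_fn between)
{
	if (a->container == NULL)
		return SKIPPED;

	between(a);   /* the window the check does not cover */

	/* Report instead of dereferencing, so the sketch never crashes. */
	return a->container ? USED_VALID_POINTER : USED_NULL_POINTER;
}
```

If the pointer is cleared inside the window, the earlier check has already
passed and the NULL value is still reached. Closing such a window generally
needs real synchronization with the other thread rather than a second check.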
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH] Fix: Sometimes mdmon throws core dump during reshape
  2011-09-05 10:39 [PATCH] Fix: Sometimes mdmon throws core dump during reshape Adam Kwolek
  2011-09-06 19:09 ` Williams, Dan J
@ 2011-09-07  4:07 ` NeilBrown
  2011-09-07  6:36   ` Kwolek, Adam
  1 sibling, 1 reply; 5+ messages in thread
From: NeilBrown @ 2011-09-07  4:07 UTC (permalink / raw)
  To: Adam Kwolek
  Cc: linux-raid, dan.j.williams, ed.ciechanowski, wojciech.neubauer

On Mon, 05 Sep 2011 12:39:55 +0200 Adam Kwolek <adam.kwolek@intel.com> wrote:

> Problem was found during reshaping 2 volumes /raid0 and raid5/ in container.
> Sometimes mdmon throws core dump due to NULL pointer exception.
> 
> Problem occurs in scenario:
> - managemon: is about spare activation (degraded raid4 volume == raid0 under takeover)
> - managemon: detect level change and signals monitor (manage_member() calls replace_array())
> - monitor: detects transition raid4/5->raid0 and sets a->container to NULL
>            to indicate array deactivation
> - managemon : continues his work and tries to activate spare (a->check_degraded is set).
>               NULL pointer is passed to metadata handler activate_spare()
>               Core dump is generated.
> 
> To resolve this situation managemon (after monitor kick) checks again
> a->container pointer to learn if current array is not to be deactivated.

This looks like it might be the same bug as is fixed by
     Lukasz Dorau <lukasz.dorau@intel.com>
in
  Subject: [PATCH] FIX: Mdmon crashes after changing RAID level from 1 to 0

Does that look likely?

Thanks,
NeilBrown


> 
> Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
> ---
> 
>  managemon.c |    6 ++++++
>  1 files changed, 6 insertions(+), 0 deletions(-)
> 
> diff --git a/managemon.c b/managemon.c
> index d020f82..3540dac 100644
> --- a/managemon.c
> +++ b/managemon.c
> @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mdstat,
>  		}
>  	}
>  
> +	/* we are after monitor kick,
> +	 * so container field can be cleared - check it again
> +	 */
> +	if (a->container == NULL)
> +		return;
> +
>  	/* We don't check the array while any update is pending, as it
>  	 * might container a change (such as a spare assignment) which
>  	 * could affect our decisions.
> 



* RE: [PATCH] Fix: Sometimes mdmon throws core dump during reshape
  2011-09-06 19:09 ` Williams, Dan J
@ 2011-09-07  6:25   ` Kwolek, Adam
  0 siblings, 0 replies; 5+ messages in thread
From: Kwolek, Adam @ 2011-09-07  6:25 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: neilb@suse.de, linux-raid@vger.kernel.org, Ciechanowski, Ed,
	Neubauer, Wojciech



> -----Original Message-----
> From: Williams, Dan J [mailto:dan.j.williams@intel.com]
> Sent: Tuesday, September 06, 2011 9:10 PM
> To: Kwolek, Adam
> Cc: neilb@suse.de; linux-raid@vger.kernel.org; Ciechanowski, Ed;
> Neubauer, Wojciech
> Subject: Re: [PATCH] Fix: Sometimes mdmon throws core dump during
> reshape
> 
> On Mon, Sep 5, 2011 at 3:39 AM, Adam Kwolek <adam.kwolek@intel.com> wrote:
> > Problem was found during reshaping 2 volumes /raid0 and raid5/ in container.
> > Sometimes mdmon throws core dump due to NULL pointer exception.
> >
> > Problem occurs in scenario:
> > - managemon: is about spare activation (degraded raid4 volume == raid0 under takeover)
> > - managemon: detect level change and signals monitor (manage_member() calls replace_array())
> > - monitor: detects transition raid4/5->raid0 and sets a->container to NULL
> >           to indicate array deactivation
>
> Maybe I have lost track of the reshape implementation but I don't see
> where the monitor sets ->container to NULL during a reshape?  Do you
> mean deactivate mdmon for the array after the reshape completes?
>
> > - managemon : continues his work and tries to activate spare (a->check_degraded is set).
> >              NULL pointer is passed to metadata handler activate_spare()
> >              Core dump is generated.
> >
> > To resolve this situation managemon (after monitor kick) checks again
> > a->container pointer to learn if current array is not to be deactivated.

Yes, when takeover is used. On the one hand, mdmon tries to resolve the degradation "problem"
of the taken-over raid0; meanwhile, the backward takeover occurs.

BR
Adam

> [..]
> > diff --git a/managemon.c b/managemon.c
> > index d020f82..3540dac 100644
> > --- a/managemon.c
> > +++ b/managemon.c
> > @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mdstat,
> >                }
> >        }
> >
> > +       /* we are after monitor kick,
> > +        * so container field can be cleared - check it again
> > +        */
> > +       if (a->container == NULL)
> > +               return;
> > +
> 
> Isn't this still racy?  Because we don't wait for the monitor to run
> before proceeding.


* RE: [PATCH] Fix: Sometimes mdmon throws core dump during reshape
  2011-09-07  4:07 ` NeilBrown
@ 2011-09-07  6:36   ` Kwolek, Adam
  0 siblings, 0 replies; 5+ messages in thread
From: Kwolek, Adam @ 2011-09-07  6:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid@vger.kernel.org, Williams, Dan J, Ciechanowski, Ed,
	Grabowski, Grzegorz



> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.de]
> Sent: Wednesday, September 07, 2011 6:08 AM
> To: Kwolek, Adam
> Cc: linux-raid@vger.kernel.org; Williams, Dan J; Ciechanowski, Ed;
> Neubauer, Wojciech
> Subject: Re: [PATCH] Fix: Sometimes mdmon throws core dump during
> reshape
> 
> On Mon, 05 Sep 2011 12:39:55 +0200 Adam Kwolek <adam.kwolek@intel.com> wrote:
>
> > Problem was found during reshaping 2 volumes /raid0 and raid5/ in container.
> > Sometimes mdmon throws core dump due to NULL pointer exception.
> >
> > Problem occurs in scenario:
> > - managemon: is about spare activation (degraded raid4 volume == raid0 under takeover)
> > - managemon: detect level change and signals monitor (manage_member() calls replace_array())
> > - monitor: detects transition raid4/5->raid0 and sets a->container to NULL
> >            to indicate array deactivation
> > - managemon : continues his work and tries to activate spare (a->check_degraded is set).
> >               NULL pointer is passed to metadata handler activate_spare()
> >               Core dump is generated.
> >
> > To resolve this situation managemon (after monitor kick) checks again
> > a->container pointer to learn if current array is not to be deactivated.
>
> This looks like it might be the same bug as is fixed by
>      Lukasz Dorau <lukasz.dorau@intel.com>
> in
>   Subject: [PATCH] FIX: Mdmon crashes after changing RAID level from 1 to 0
> 
> Does that look likely?
> 
> Thanks,
> NeilBrown

This problem occurs very rarely; I got only a single reproduction with the patch you pointed to applied.
To solve it completely using only Lukasz's patch, the new array-monitoring deactivation mechanism
would have to be extended to every case, and the container field should never be used for deactivation.

Would you prefer that approach?


BR
Adam
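
The alternative mentioned above, not overloading the container pointer as a
deactivation marker, could look roughly like the sketch below. It is an
assumption-laden illustration: the enum, the 'state' field and both helper
functions are hypothetical, and it does not reproduce Lukasz's patch or mdmon's
real data structures.

```c
#include <stddef.h>

/* Hypothetical dedicated marker, separate from the container pointer. */
enum monitor_state { ARRAY_ACTIVE, ARRAY_DEACTIVATING };

struct array_sketch {
	void *container;            /* stays valid; never used as a flag */
	enum monitor_state state;   /* says whether the array is going away */
};

/* Monitor side: request deactivation without touching the pointer. */
static void request_deactivation(struct array_sketch *a)
{
	a->state = ARRAY_DEACTIVATING;
}

/* Manager side: decide from the marker, not from container == NULL. */
static int may_activate_spare(const struct array_sketch *a)
{
	return a->state == ARRAY_ACTIVE;
}
```

With this split, a handler that receives the container never sees NULL because
of deactivation. In real multi-threaded code the marker would still need proper
synchronization between the two threads.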


> 
> 
> >
> > Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
> > ---
> >
> >  managemon.c |    6 ++++++
> >  1 files changed, 6 insertions(+), 0 deletions(-)
> >
> > diff --git a/managemon.c b/managemon.c
> > index d020f82..3540dac 100644
> > --- a/managemon.c
> > +++ b/managemon.c
> > @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mdstat,
> >  		}
> >  	}
> >
> > +	/* we are after monitor kick,
> > +	 * so container field can be cleared - check it again
> > +	 */
> > +	if (a->container == NULL)
> > +		return;
> > +
> >  	/* We don't check the array while any update is pending, as it
> >  	 * might container a change (such as a spare assignment) which
> >  	 * could affect our decisions.
> >


