* array_state_store: 'inactive' versus 'clean' race
@ 2009-04-28  1:24 UTC

From: Dan Williams
To: NeilBrown
Cc: Jacek Danecki, Ed Ciechanowski, linux-raid

Hi Neil,

I am debugging what appears to be a race between mdadm and mdmon
manipulating md/array_state.  The following warnings were originally
triggered by the validation team in Poland.  I was not able to
reproduce it on my development system until I modified mdmon to hammer
on array_state, and can now produce the same failure signature:

------------[ cut here ]------------
WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
Hardware name:
sysfs: duplicate filename 'sync_action' can not be created
Modules linked in: raid10...
Supported: Yes
Pid: 8696, comm: mdmon Tainted: G X 2.6.29-6-default #1
Call Trace:
 [<ffffffff8020ff31>] try_stack_unwind+0x70/0x127
 [<ffffffff8020f0c0>] dump_trace+0x9a/0x2a6
 [<ffffffff8020fc82>] show_trace_log_lvl+0x4c/0x58
 [<ffffffff8020fc9e>] show_trace+0x10/0x12
 [<ffffffff804f5777>] dump_stack+0x72/0x7b
 [<ffffffff80248353>] warn_slowpath+0xb1/0xed
 [<ffffffff8032ac84>] sysfs_add_one+0x35/0x3d
 [<ffffffff8032a6d1>] sysfs_add_file_mode+0x57/0x8b
 [<ffffffff8032c3d0>] internal_create_group+0xea/0x174
 [<ffffffff8032c47b>] sysfs_create_group+0xe/0x13
 [<ffffffff80448bc2>] do_md_run+0x54d/0x856
 [<ffffffff80449130>] array_state_store+0x265/0x291
 [<ffffffff80444ce6>] md_attr_store+0x81/0xa9
 [<ffffffff8032a133>] sysfs_write_file+0xdf/0x114
 [<ffffffff802d6b4e>] vfs_write+0xae/0x157
 [<ffffffff802d6d05>] sys_write+0x4c/0xa5
 [<ffffffff8020c4aa>] system_call_fastpath+0x16/0x1b
 [<00007f1251cd3950>] 0x7f1251cd3950
---[ end trace a00c6d28b22a64ae ]---
md: cannot register extra attributes for md126
------------[ cut here ]------------
WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
Hardware name:
sysfs: duplicate filename 'rd3' can not be created
Modules linked in: raid10...
Supported: Yes
Pid: 8696, comm: mdmon Tainted: G W X 2.6.29-6-default #1
Call Trace:
 [<ffffffff8020ff31>] try_stack_unwind+0x70/0x127
 [<ffffffff8020f0c0>] dump_trace+0x9a/0x2a6
 [<ffffffff8020fc82>] show_trace_log_lvl+0x4c/0x58
 [<ffffffff8020fc9e>] show_trace+0x10/0x12
 [<ffffffff804f5777>] dump_stack+0x72/0x7b
 [<ffffffff80248353>] warn_slowpath+0xb1/0xed
 [<ffffffff8032ac84>] sysfs_add_one+0x35/0x3d
 [<ffffffff8032bb78>] sysfs_do_create_link+0xd3/0x141
 [<ffffffff8032bc01>] sysfs_create_link+0xe/0x11
 [<ffffffff80448ca7>] do_md_run+0x632/0x856
 [<ffffffff80449130>] array_state_store+0x265/0x291
 [<ffffffff80444ce6>] md_attr_store+0x81/0xa9

mdadm in another thread has just finished writing 'inactive' to
array_state, which has the effect of setting mddev->pers to NULL.
mdmon is still managing the array and, before noticing the 'inactive'
state, writes 'clean' as part of its normal operation.  The
array_state_store() call for mdmon notices that mddev->pers is not set
and calls do_md_run().

Is it the case that we only need array_state_store() to call
do_md_run() when performing initial assembly?  If so, it seems a flag
is needed to prevent reactivation before the old sysfs context is
destroyed.

Thanks,
Dan
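[For context, this is the branch of array_state_store() implicated by
the traces - a paraphrased sketch of the 2.6.29-era logic in
drivers/md/md.c, from memory rather than an exact quote:]

	case clean:
		if (mddev->pers) {
			/* normal case: a running array is being
			 * marked in_sync (or -EBUSY if writes are
			 * still pending) */
			...
		} else {
			/* ->pers is NULL, so the write is treated as
			 * a request to assemble and start the array.
			 * When mdadm's 'inactive' clears ->pers just
			 * before mdmon's 'clean' lands, this is the
			 * do_md_run() seen in the warnings above. */
			err = do_md_run(mddev);
		}
		break;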
* Re: array_state_store: 'inactive' versus 'clean' race
@ 2009-04-28  4:16 UTC

From: Neil Brown
To: Dan Williams
Cc: Jacek Danecki, Ed Ciechanowski, linux-raid

On Monday April 27, dan.j.williams@intel.com wrote:
> Hi Neil,
>
> I am debugging what appears to be a race between mdadm and mdmon
> manipulating md/array_state.  The following warnings were originally
> triggered by the validation team in Poland.  I was not able to
> reproduce it on my development system until I modified mdmon to
> hammer on array_state, and can now produce the same failure
> signature:
>
> ------------[ cut here ]------------
> WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
> Hardware name:
> sysfs: duplicate filename 'sync_action' can not be created
> Modules linked in: raid10...
...
>
> mdadm in another thread has just finished writing 'inactive' to
> array_state, which has the effect of setting mddev->pers to NULL.
> mdmon is still managing the array and, before noticing the 'inactive'
> state, writes 'clean' as part of its normal operation.  The
> array_state_store() call for mdmon notices that mddev->pers is not
> set and calls do_md_run().
>
> Is it the case that we only need array_state_store() to call
> do_md_run() when performing initial assembly?  If so, it seems a flag
> is needed to prevent reactivation before the old sysfs context is
> destroyed.

The problem seems to be that mdmon is saying "this was 'active' but now
it is 'clean'", but because 'inactive' was written by mdadm, md thinks
that mdmon is saying "this was inactive but should now be 'clean'",
which is a very different thing to say.

We are using 'array_state' as a means of communicating between mdadm
and mdmon (to say "this array is active, clear it up"), and between
mdmon and the kernel (to do the clean/active transition).  It isn't
hard to see that this can cause confusion.

What to do?

 - We could use a different mechanism to tell mdmon to stop the array,
   but that feels clumsy...

 - mdadm could 'clear' the array instead of just setting it 'inactive'
   - then marking it 'clean' would fail.  But then mdmon would not be
   able to extract the resync_start information, which is not ideal.

 - We could arrange that 'inactive' leaves the array in some state
   where 'clean' is not sufficient to reactivate it, which is
   essentially your suggestion.  e.g. we could clear the 'level' or
   'raid_disks'.  These would work, but again feel clumsy.

 - We could add an extra stage to the transition:
     - mdadm writes 'clean', then syncs with mdmon
     - mdadm writes 'inactive',
     - mdmon notices, cleans up, and writes 'clear'.
   This would avoid the current race, but there is nothing to stop
   mdmon restoring the array to 'active' if there is write activity.
   And if there is write activity, then something is accessing the
   array, so setting it to 'inactive' would fail.  So maybe mdadm
   could set the array to readonly, sync with mdmon, then set it to
   inactive.  This would probably work...
   However, currently you cannot set an array to readonly if it is
   active.  I don't like that and might get the courage to change it
   one day, so we wouldn't be able to depend on that behaviour.  And
   setting to readonly when you really mean "test if the array is
   unused and if so, stop it" isn't quite right, as it might one day
   cause errors for write requests.  We could transition through
   read_auto instead.
   But read_auto can spontaneously change to active (through
   write_pending), so we could still race with that.

 - Maybe.... we could change the rules so that "active->inactive" was
   not permitted: it would have to go through 'clean', 'readonly', or
   'read_auto' first.  Then mdadm could:
       set the array to 'clean'
       sync with mdmon (ping_monitor)
       set it to 'inactive'.
   If the last step fails, then the array is clearly still active.
   If we were to do this we would need to think about whether any
   other transitions needed to be outlawed.

 - We could go the other way and never allow inactive->clean
   transitions, requiring them to go through readonly or read_auto.
   I don't think that would be an improvement.

So I'm leaning towards:

 - a kernel change so that you can only set 'inactive' if the array is
   already clean or readonly, i.e.
       if (mddev->ro || mddev->in_sync)
   which is possibly the same as simply
       if (mddev->in_sync)

 - an mdadm change so that it doesn't just set 'inactive', but rather

       sysfs_set_str(mdi, NULL, "array_state", "clean");
       ping_monitor(mdi->text_version);
       sysfs_set_str(mdi, NULL, "array_state", "inactive");

 - making sure that if mdmon sees the array as 'clean', it will update
   its metadata etc so that it feels no further urge to mark the array
   clean.

Also, we need to fix the WARNING, which I think means moving

	list_for_each_entry(rdev, &mddev->disks, same_set)
		if (rdev->raid_disk >= 0) {
			char nm[20];
			sprintf(nm, "rd%d", rdev->raid_disk);
			sysfs_remove_link(&mddev->kobj, nm);
		}

in do_md_stop up into the

	case 0: /* disassemble */
	case 2: /* stop */

section.

My one concern about this conclusion is that it seems to sidestep the
core problem rather than address it.  The core problem is that setting
the state to 'clean' means a very different thing depending on which
side you come from, so an app which writes 'clean' might get more than
it bargained for.  All I'm doing is making that confusion avoidable
rather than impossible....

I guess I could invent a syntax: writing "a->b" to array_state sets
the state to 'b', but only if it was already in 'a'.  But that feels
needlessly complex.

So: do you agree with my leaning?  Or my concern?  Or both, or
neither?

Thanks,
NeilBrown
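[A minimal sketch of what the kernel half of that leaning might look
like in array_state_store() - hypothetical code, with the do_md_stop()
details abbreviated:]

	case inactive:
		/* Proposed: refuse to stop an array that is still
		 * active and dirty, so mdmon's racing 'clean' can
		 * never arrive after teardown has begun. */
		if (mddev->pers && !(mddev->ro || mddev->in_sync))
			err = -EBUSY;
		else
			err = do_md_stop(mddev, 2 /* stop */, 0);
		break;

[With that in place, mdadm's clean/ping_monitor/inactive sequence
either completes in order or fails visibly at the 'inactive' step.]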
* Re: array_state_store: 'inactive' versus 'clean' race
@ 2009-04-28 18:05 UTC

From: Dan Williams
To: Neil Brown
Cc: Danecki, Jacek, Ciechanowski, Ed, linux-raid

Neil Brown wrote:
> So I'm leaning towards:
>
>  - a kernel change so that you can only set 'inactive' if the array
>    is already clean or readonly, i.e.
>        if (mddev->ro || mddev->in_sync)
>    which is possibly the same as simply
>        if (mddev->in_sync)
>
>  - an mdadm change so that it doesn't just set 'inactive', but rather
>
>        sysfs_set_str(mdi, NULL, "array_state", "clean");
>        ping_monitor(mdi->text_version);
>        sysfs_set_str(mdi, NULL, "array_state", "inactive");

I recall that when implementing WaitClean there was a reason not to
write 'clean' from mdadm... /me looks in the logs.  Yes, from commit
0dd3ba30:

    "--wait-clean: shorten timeout

    Set the safemode timeout to a small value to get the array marked
    clean as soon as possible.  We don't write 'clean' directly as it
    may cause mdmon to miss a 'write-pending' event."

So is the concern that a thread will hang because we missed its
write-pending event, or will it receive an error after the device has
been torn down?

>  - making sure that if mdmon sees the array as 'clean', it will
>    update its metadata etc so that it feels no further urge to mark
>    the array clean.
>
> Also, we need to fix the WARNING, which I think means moving
>
> 	list_for_each_entry(rdev, &mddev->disks, same_set)
> 		if (rdev->raid_disk >= 0) {
> 			char nm[20];
> 			sprintf(nm, "rd%d", rdev->raid_disk);
> 			sysfs_remove_link(&mddev->kobj, nm);
> 		}

It would also involve a flush_scheduled_work() somewhere to make sure
the md_redundancy_group has been deleted.  But it occurs to me that
do_md_run() should still be failing in the case where userspace gets
the ordering of 'inactive' wrong...

> in do_md_stop up into the
>
> 	case 0: /* disassemble */
> 	case 2: /* stop */
>
> section.
>
> My one concern about this conclusion is that it seems to sidestep the
> core problem rather than address it.  The core problem is that
> setting the state to 'clean' means a very different thing depending
> on which side you come from, so an app which writes 'clean' might get
> more than it bargained for.  All I'm doing is making that confusion
> avoidable rather than impossible....
>
> I guess I could invent a syntax: writing "a->b" to array_state sets
> the state to 'b', but only if it was already in 'a'.  But that feels
> needlessly complex.
>
> So: do you agree with my leaning?  Or my concern?  Or both, or
> neither?

First let me say that I really appreciate it when you do these
implementation contingency brain dumps; it really helps to get all the
cards on the table (something I would do well to emulate).

I agree with your concern, and I am currently more in line with the
idea that the confusion should be impossible rather than avoidable.  I
now think it is a bug that the 'inactive' state has two meanings.  A
new state like 'shutdown' or 'defunct' would imply that 'clear' is the
only valid next state; all other writes return an error.

So it is not quite as ugly as randomly preventing some
'inactive->clean' transitions, in that userspace can see why its
attempt to write 'clean' is returning -EINVAL.

Thoughts?

Thanks,
Dan
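[To make that concrete, a hypothetical sketch of the proposed terminal
state in array_state_store() - 'defunct' and the mddev flag are
invented names here, not existing md states:]

	case defunct:
		/* mdadm: this array is going away; stop it and
		 * remember that only 'clear' may follow */
		err = do_md_stop(mddev, 2, 0);
		if (err == 0)
			mddev->defunct = 1;	/* invented flag */
		break;
	case clean:
	case active:
		if (mddev->defunct) {
			/* a late write from mdmon fails visibly
			 * instead of reassembling the array */
			err = -EINVAL;
			break;
		}
		/* ... existing handling ... */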
* Re: array_state_store: 'inactive' versus 'clean' race
@ 2009-04-29  1:14 UTC

From: Neil Brown
To: Dan Williams
Cc: Danecki, Jacek, Ciechanowski, Ed, linux-raid

On Tuesday April 28, dan.j.williams@intel.com wrote:
> Neil Brown wrote:
> > So I'm leaning towards:
> >
> >  - a kernel change so that you can only set 'inactive' if the
> >    array is already clean or readonly, i.e.
> >        if (mddev->ro || mddev->in_sync)
> >    which is possibly the same as simply
> >        if (mddev->in_sync)
> >
> >  - an mdadm change so that it doesn't just set 'inactive', but
> >    rather
> >
> >        sysfs_set_str(mdi, NULL, "array_state", "clean");
> >        ping_monitor(mdi->text_version);
> >        sysfs_set_str(mdi, NULL, "array_state", "inactive");
>
> I recall that when implementing WaitClean there was a reason not to
> write 'clean' from mdadm... /me looks in the logs.  Yes, from commit
> 0dd3ba30:
>
>     "--wait-clean: shorten timeout
>
>     Set the safemode timeout to a small value to get the array
>     marked clean as soon as possible.  We don't write 'clean'
>     directly as it may cause mdmon to miss a 'write-pending' event."
>
> So is the concern that a thread will hang because we missed its
> write-pending event, or will it receive an error after the device
> has been torn down?

I don't see that writing 'clean' can cause a 'write-pending' state to
be missed.  If the array is in 'write-pending', it is because some
thread is in md_write_start waiting for MD_CHANGE_CLEAN to be cleared,
so ->writes_pending is elevated, so an attempt to write 'clean' will
result in -EBUSY.  ???

> >  - making sure that if mdmon sees the array as 'clean', it will
> >    update its metadata etc so that it feels no further urge to
> >    mark the array clean.
> >
> > Also, we need to fix the WARNING, which I think means moving
> >
> > 	list_for_each_entry(rdev, &mddev->disks, same_set)
> > 		if (rdev->raid_disk >= 0) {
> > 			char nm[20];
> > 			sprintf(nm, "rd%d", rdev->raid_disk);
> > 			sysfs_remove_link(&mddev->kobj, nm);
> > 		}
>
> It would also involve a flush_scheduled_work() somewhere to make
> sure the md_redundancy_group has been deleted.  But it occurs to me
> that do_md_run() should still be failing in the case where userspace
> gets the ordering of 'inactive' wrong...

Uhmm..... I think we should split the removal of md_redundancy_group
out into a separate delayed_work.  But yes, we need to deal with that
somehow.

> > in do_md_stop up into the
> >
> > 	case 0: /* disassemble */
> > 	case 2: /* stop */
> >
> > section.
> >
> > My one concern about this conclusion is that it seems to sidestep
> > the core problem rather than address it.  The core problem is that
> > setting the state to 'clean' means a very different thing
> > depending on which side you come from, so an app which writes
> > 'clean' might get more than it bargained for.  All I'm doing is
> > making that confusion avoidable rather than impossible....
> >
> > I guess I could invent a syntax: writing "a->b" to array_state
> > sets the state to 'b', but only if it was already in 'a'.  But
> > that feels needlessly complex.
> >
> > So: do you agree with my leaning?  Or my concern?  Or both, or
> > neither?
>
> First let me say that I really appreciate it when you do these
> implementation contingency brain dumps; it really helps to get all
> the cards on the table (something I would do well to emulate).
> I agree with your concern, and I am currently more in line with the
> idea that the confusion should be impossible rather than avoidable.
> I now think it is a bug that the 'inactive' state has two meanings.
> A new state like 'shutdown' or 'defunct' would imply that 'clear' is
> the only valid next state; all other writes return an error.
>
> So it is not quite as ugly as randomly preventing some
> 'inactive->clean' transitions, in that userspace can see why its
> attempt to write 'clean' is returning -EINVAL.
>
> Thoughts?

I don't like the idea of a trap-door where once you get to 'shutdown'
the only way out is 'clear'.  But I could possibly live with a state
like 'shutdown' which can transition to either 'clear' or 'inactive',
but not directly to 'clean'.  I cannot think of an immediate use for
the 'shutdown'->'inactive' transition, but I wouldn't want to exclude
it.

But I really think the 'problem' is more around 'clean' than it is
around 'inactive'.  I think I would like writing 'clean' never to
start the array.  If we want to start the array in the 'clean' state,
then we should write something else.  I think we can just write
'active'; that will actually leave the array in a 'clean' state, I
think.  I'd need to check that...

So now I think we just arrange that 'clean' returns -EINVAL if
mddev->pers == NULL.  We never currently write 'clean' to start an
array - we either use RUN_ARRAY, or mdmon writes either read_auto or
active - so that is safe.

So we need to:

 - remove the rd%d links earlier
 - remove the redundancy group earlier
 - disable starting an array by writing 'clean'.

I think I've convinced myself.  Have I convinced you?

NeilBrown
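[The last item amounts to dropping the do_md_run() fallback from the
'clean' branch sketched after the first message - roughly this,
paraphrased rather than the actual patch:]

	case clean:
		if (mddev->pers) {
			/* existing path: mark the running array
			 * in_sync, or -EBUSY if writes are pending */
			...
		} else
			/* writing 'clean' no longer assembles a
			 * stopped array; use RUN_ARRAY or write
			 * 'active' instead */
			err = -EINVAL;
		break;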
* Re: array_state_store: 'inactive' versus 'clean' race
@ 2009-04-29  1:42 UTC

From: Dan Williams
To: Neil Brown
Cc: Danecki, Jacek, Ciechanowski, Ed, linux-raid

On Tue, Apr 28, 2009 at 6:14 PM, Neil Brown <neilb@suse.de> wrote:
> So we need to:
>
>  - remove the rd%d links earlier
>  - remove the redundancy group earlier
>  - disable starting an array by writing 'clean'.
>
> I think I've convinced myself.  Have I convinced you?

Yes, the simplicity of just turning off 'inactive->clean' transitions
is indeed convincing.
end of thread, other threads: [~2009-04-29  1:42 UTC | newest]

Thread overview: 5+ messages:
2009-04-28  1:24 array_state_store: 'inactive' versus 'clean' race Dan Williams
2009-04-28  4:16 ` Neil Brown
2009-04-28 18:05   ` Dan Williams
2009-04-29  1:14     ` Neil Brown
2009-04-29  1:42       ` Dan Williams