* array_state_store: 'inactive' versus 'clean' race
@ 2009-04-28  1:24 UTC

From: Dan Williams
To: NeilBrown
Cc: Jacek Danecki, Ed Ciechanowski, linux-raid

Hi Neil,

I am debugging what appears to be a race between mdadm and mdmon
manipulating md/array_state.  The following warnings were originally
triggered by the validation team in Poland.  I was not able to
reproduce it on my development system until I modified mdmon to hammer
on array_state, and can now produce the same failure signature:

------------[ cut here ]------------
WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
Hardware name:
sysfs: duplicate filename 'sync_action' can not be created
Modules linked in: raid10...
Supported: Yes
Pid: 8696, comm: mdmon Tainted: G X 2.6.29-6-default #1
Call Trace:
 [<ffffffff8020ff31>] try_stack_unwind+0x70/0x127
 [<ffffffff8020f0c0>] dump_trace+0x9a/0x2a6
 [<ffffffff8020fc82>] show_trace_log_lvl+0x4c/0x58
 [<ffffffff8020fc9e>] show_trace+0x10/0x12
 [<ffffffff804f5777>] dump_stack+0x72/0x7b
 [<ffffffff80248353>] warn_slowpath+0xb1/0xed
 [<ffffffff8032ac84>] sysfs_add_one+0x35/0x3d
 [<ffffffff8032a6d1>] sysfs_add_file_mode+0x57/0x8b
 [<ffffffff8032c3d0>] internal_create_group+0xea/0x174
 [<ffffffff8032c47b>] sysfs_create_group+0xe/0x13
 [<ffffffff80448bc2>] do_md_run+0x54d/0x856
 [<ffffffff80449130>] array_state_store+0x265/0x291
 [<ffffffff80444ce6>] md_attr_store+0x81/0xa9
 [<ffffffff8032a133>] sysfs_write_file+0xdf/0x114
 [<ffffffff802d6b4e>] vfs_write+0xae/0x157
 [<ffffffff802d6d05>] sys_write+0x4c/0xa5
 [<ffffffff8020c4aa>] system_call_fastpath+0x16/0x1b
 [<00007f1251cd3950>] 0x7f1251cd3950
---[ end trace a00c6d28b22a64ae ]---
md: cannot register extra attributes for md126
------------[ cut here ]------------
WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
Hardware name:
sysfs: duplicate filename 'rd3' can not be created
Modules linked in: raid10...
Supported: Yes
Pid: 8696, comm: mdmon Tainted: G W X 2.6.29-6-default #1
Call Trace:
 [<ffffffff8020ff31>] try_stack_unwind+0x70/0x127
 [<ffffffff8020f0c0>] dump_trace+0x9a/0x2a6
 [<ffffffff8020fc82>] show_trace_log_lvl+0x4c/0x58
 [<ffffffff8020fc9e>] show_trace+0x10/0x12
 [<ffffffff804f5777>] dump_stack+0x72/0x7b
 [<ffffffff80248353>] warn_slowpath+0xb1/0xed
 [<ffffffff8032ac84>] sysfs_add_one+0x35/0x3d
 [<ffffffff8032bb78>] sysfs_do_create_link+0xd3/0x141
 [<ffffffff8032bc01>] sysfs_create_link+0xe/0x11
 [<ffffffff80448ca7>] do_md_run+0x632/0x856
 [<ffffffff80449130>] array_state_store+0x265/0x291
 [<ffffffff80444ce6>] md_attr_store+0x81/0xa9

mdadm in another thread has just finished writing 'inactive' to
array_state, which has the effect of setting mddev->pers to NULL.
mdmon is still managing the array and, before noticing the 'inactive'
state, writes 'clean' as part of its normal operation.  The
array_state_store() call for mdmon notices that mddev->pers is not set
and calls do_md_run().

Is it the case that we only need array_state_store() to call
do_md_run() when performing initial assembly?  If so, it seems a flag
is needed to prevent reactivation before the old sysfs context is
destroyed.

Thanks,
Dan
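[For context, this is the branch of array_state_store() implicated by
the traces - a paraphrased sketch of the 2.6.29-era logic in
drivers/md/md.c, from memory rather than an exact quote:]

	case clean:
		if (mddev->pers) {
			/* normal case: a running array is being
			 * marked in_sync (or -EBUSY if writes are
			 * still pending) */
			...
		} else {
			/* ->pers is NULL, so the write is treated as
			 * a request to assemble and start the array.
			 * When mdadm's 'inactive' clears ->pers just
			 * before mdmon's 'clean' lands, this is the
			 * do_md_run() seen in the warnings above. */
			err = do_md_run(mddev);
		}
		break;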
* Re: array_state_store: 'inactive' versus 'clean' race
@ 2009-04-28  4:16 UTC

From: Neil Brown
To: Dan Williams
Cc: Jacek Danecki, Ed Ciechanowski, linux-raid

On Monday April 27, dan.j.williams@intel.com wrote:
> Hi Neil,
>
> I am debugging what appears to be a race between mdadm and mdmon
> manipulating md/array_state.  The following warnings were originally
> triggered by the validation team in Poland.  I was not able to
> reproduce it on my development system until I modified mdmon to
> hammer on array_state, and can now produce the same failure
> signature:
>
> ------------[ cut here ]------------
> WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
> Hardware name:
> sysfs: duplicate filename 'sync_action' can not be created
> Modules linked in: raid10...
...
>
> mdadm in another thread has just finished writing 'inactive' to
> array_state, which has the effect of setting mddev->pers to NULL.
> mdmon is still managing the array and, before noticing the 'inactive'
> state, writes 'clean' as part of its normal operation.  The
> array_state_store() call for mdmon notices that mddev->pers is not
> set and calls do_md_run().
>
> Is it the case that we only need array_state_store() to call
> do_md_run() when performing initial assembly?  If so, it seems a flag
> is needed to prevent reactivation before the old sysfs context is
> destroyed.

The problem seems to be that mdmon is saying "this was 'active' but now
it is 'clean'", but because 'inactive' was written by mdadm, md thinks
that mdmon is saying "this was inactive but should now be 'clean'",
which is a very different thing to say.

We are using 'array_state' as a means of communicating between mdadm
and mdmon (to say "this array is active, clear it up"), and between
mdmon and the kernel (to do the clean/active transition).  It isn't
hard to see that this can cause confusion.

What to do?

 - We could use a different mechanism to tell mdmon to stop the array,
   but that feels clumsy...

 - mdadm could 'clear' the array instead of just setting it 'inactive'
   - then marking it 'clean' would fail.  But then mdmon would not be
   able to extract the resync_start information, which is not ideal.

 - We could arrange that 'inactive' leaves the array in some state
   where 'clean' is not sufficient to reactivate it, which is
   essentially your suggestion.  e.g. we could clear the 'level' or
   'raid_disks'.  These would work, but again feel clumsy.

 - We could add an extra stage to the transition:
     - mdadm writes 'clean', then syncs with mdmon
     - mdadm writes 'inactive',
     - mdmon notices, cleans up, and writes 'clear'.
   This would avoid the current race, but there is nothing to stop
   mdmon restoring the array to 'active' if there is write activity.
   And if there is write activity, then something is accessing the
   array, so setting it to 'inactive' would fail.  So maybe mdadm
   could set the array to readonly, sync with mdmon, then set it to
   inactive.  This would probably work...
   However, currently you cannot set an array to readonly if it is
   active.  I don't like that and might get the courage to change it
   one day, so we wouldn't be able to depend on that behaviour.  And
   setting to readonly when you really mean "test if the array is
   unused and if so, stop it" isn't quite right, as it might one day
   cause errors for write requests.  We could transition through
   read_auto instead.
   But read_auto can spontaneously change to active (through
   write_pending), so we could still race with that.

 - Maybe.... we could change the rules so that "active->inactive" was
   not permitted: it would have to go through 'clean', 'readonly', or
   'read_auto' first.  Then mdadm could:
       set the array to 'clean'
       sync with mdmon (ping_monitor)
       set it to 'inactive'.
   If the last step fails, then the array is clearly still active.
   If we were to do this we would need to think about whether any
   other transitions needed to be outlawed.

 - We could go the other way and never allow inactive->clean
   transitions, requiring them to go through readonly or read_auto.
   I don't think that would be an improvement.

So I'm leaning towards:

 - a kernel change so that you can only set 'inactive' if the array is
   already clean or readonly, i.e.
       if (mddev->ro || mddev->in_sync)
   which is possibly the same as simply
       if (mddev->in_sync)

 - an mdadm change so that it doesn't just set 'inactive', but rather

       sysfs_set_str(mdi, NULL, "array_state", "clean");
       ping_monitor(mdi->text_version);
       sysfs_set_str(mdi, NULL, "array_state", "inactive");

 - making sure that if mdmon sees the array as 'clean', it will update
   its metadata etc so that it feels no further urge to mark the array
   clean.

Also, we need to fix the WARNING, which I think means moving

	list_for_each_entry(rdev, &mddev->disks, same_set)
		if (rdev->raid_disk >= 0) {
			char nm[20];
			sprintf(nm, "rd%d", rdev->raid_disk);
			sysfs_remove_link(&mddev->kobj, nm);
		}

in do_md_stop up into the

	case 0: /* disassemble */
	case 2: /* stop */

section.

My one concern about this conclusion is that it seems to sidestep the
core problem rather than address it.  The core problem is that setting
the state to 'clean' means a very different thing depending on which
side you come from, so an app which writes 'clean' might get more than
it bargained for.  All I'm doing is making that confusion avoidable
rather than impossible....

I guess I could invent a syntax: writing "a->b" to array_state sets
the state to 'b', but only if it was already in 'a'.  But that feels
needlessly complex.

So: do you agree with my leaning?  Or my concern?  Or both, or
neither?

Thanks,
NeilBrown
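[A minimal sketch of what the kernel half of that leaning might look
like in array_state_store() - hypothetical code, with the do_md_stop()
details abbreviated:]

	case inactive:
		/* Proposed: refuse to stop an array that is still
		 * active and dirty, so mdmon's racing 'clean' can
		 * never arrive after teardown has begun. */
		if (mddev->pers && !(mddev->ro || mddev->in_sync))
			err = -EBUSY;
		else
			err = do_md_stop(mddev, 2 /* stop */, 0);
		break;

[With that in place, mdadm's clean/ping_monitor/inactive sequence
either completes in order or fails visibly at the 'inactive' step.]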
* Re: array_state_store: 'inactive' versus 'clean' race
@ 2009-04-28 18:05 UTC

From: Dan Williams
To: Neil Brown
Cc: Danecki, Jacek, Ciechanowski, Ed, linux-raid

Neil Brown wrote:
> So I'm leaning towards:
>
>  - a kernel change so that you can only set 'inactive' if the array
>    is already clean or readonly, i.e.
>        if (mddev->ro || mddev->in_sync)
>    which is possibly the same as simply
>        if (mddev->in_sync)
>
>  - an mdadm change so that it doesn't just set 'inactive', but rather
>
>        sysfs_set_str(mdi, NULL, "array_state", "clean");
>        ping_monitor(mdi->text_version);
>        sysfs_set_str(mdi, NULL, "array_state", "inactive");

I recall that when implementing WaitClean there was a reason not to
write 'clean' from mdadm... /me looks in the logs.  Yes, from commit
0dd3ba30:

    "--wait-clean: shorten timeout

    Set the safemode timeout to a small value to get the array marked
    clean as soon as possible.  We don't write 'clean' directly as it
    may cause mdmon to miss a 'write-pending' event."

So is the concern that a thread will hang because we missed its
write-pending event, or will it receive an error after the device has
been torn down?

>  - making sure that if mdmon sees the array as 'clean', it will
>    update its metadata etc so that it feels no further urge to mark
>    the array clean.
>
> Also, we need to fix the WARNING, which I think means moving
>
> 	list_for_each_entry(rdev, &mddev->disks, same_set)
> 		if (rdev->raid_disk >= 0) {
> 			char nm[20];
> 			sprintf(nm, "rd%d", rdev->raid_disk);
> 			sysfs_remove_link(&mddev->kobj, nm);
> 		}

It would also involve a flush_scheduled_work() somewhere to make sure
the md_redundancy_group has been deleted.  But it occurs to me that
do_md_run() should still be failing in the case where userspace gets
the ordering of 'inactive' wrong...

> in do_md_stop up into the
>
> 	case 0: /* disassemble */
> 	case 2: /* stop */
>
> section.
>
> My one concern about this conclusion is that it seems to sidestep the
> core problem rather than address it.  The core problem is that
> setting the state to 'clean' means a very different thing depending
> on which side you come from, so an app which writes 'clean' might get
> more than it bargained for.  All I'm doing is making that confusion
> avoidable rather than impossible....
>
> I guess I could invent a syntax: writing "a->b" to array_state sets
> the state to 'b', but only if it was already in 'a'.  But that feels
> needlessly complex.
>
> So: do you agree with my leaning?  Or my concern?  Or both, or
> neither?

First let me say that I really appreciate it when you do these
implementation contingency brain dumps; it really helps to get all the
cards on the table (something I would do well to emulate).

I agree with your concern, and I am currently more in line with the
idea that the confusion should be impossible rather than avoidable.  I
now think it is a bug that the 'inactive' state has two meanings.  A
new state like 'shutdown' or 'defunct' would imply that 'clear' is the
only valid next state; all other writes return an error.

So it is not quite as ugly as randomly preventing some
'inactive->clean' transitions, in that userspace can see why its
attempt to write 'clean' is returning -EINVAL.

Thoughts?

Thanks,
Dan
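[To make that concrete, a hypothetical sketch of the proposed terminal
state in array_state_store() - 'defunct' and the mddev flag are
invented names here, not existing md states:]

	case defunct:
		/* mdadm: this array is going away; stop it and
		 * remember that only 'clear' may follow */
		err = do_md_stop(mddev, 2, 0);
		if (err == 0)
			mddev->defunct = 1;	/* invented flag */
		break;
	case clean:
	case active:
		if (mddev->defunct) {
			/* a late write from mdmon fails visibly
			 * instead of reassembling the array */
			err = -EINVAL;
			break;
		}
		/* ... existing handling ... */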
* Re: array_state_store: 'inactive' versus 'clean' race
@ 2009-04-29  1:14 UTC

From: Neil Brown
To: Dan Williams
Cc: Danecki, Jacek, Ciechanowski, Ed, linux-raid

On Tuesday April 28, dan.j.williams@intel.com wrote:
> Neil Brown wrote:
> > So I'm leaning towards:
> >
> >  - a kernel change so that you can only set 'inactive' if the
> >    array is already clean or readonly, i.e.
> >        if (mddev->ro || mddev->in_sync)
> >    which is possibly the same as simply
> >        if (mddev->in_sync)
> >
> >  - an mdadm change so that it doesn't just set 'inactive', but
> >    rather
> >
> >        sysfs_set_str(mdi, NULL, "array_state", "clean");
> >        ping_monitor(mdi->text_version);
> >        sysfs_set_str(mdi, NULL, "array_state", "inactive");
>
> I recall that when implementing WaitClean there was a reason not to
> write 'clean' from mdadm... /me looks in the logs.  Yes, from commit
> 0dd3ba30:
>
>     "--wait-clean: shorten timeout
>
>     Set the safemode timeout to a small value to get the array
>     marked clean as soon as possible.  We don't write 'clean'
>     directly as it may cause mdmon to miss a 'write-pending' event."
>
> So is the concern that a thread will hang because we missed its
> write-pending event, or will it receive an error after the device
> has been torn down?

I don't see that writing 'clean' can cause a 'write-pending' state to
be missed.  If the array is in 'write-pending', it is because some
thread is in md_write_start waiting for MD_CHANGE_CLEAN to be cleared,
so ->writes_pending is elevated, so an attempt to write 'clean' will
result in -EBUSY.  ???

> >  - making sure that if mdmon sees the array as 'clean', it will
> >    update its metadata etc so that it feels no further urge to
> >    mark the array clean.
> >
> > Also, we need to fix the WARNING, which I think means moving
> >
> > 	list_for_each_entry(rdev, &mddev->disks, same_set)
> > 		if (rdev->raid_disk >= 0) {
> > 			char nm[20];
> > 			sprintf(nm, "rd%d", rdev->raid_disk);
> > 			sysfs_remove_link(&mddev->kobj, nm);
> > 		}
>
> It would also involve a flush_scheduled_work() somewhere to make
> sure the md_redundancy_group has been deleted.  But it occurs to me
> that do_md_run() should still be failing in the case where userspace
> gets the ordering of 'inactive' wrong...

Uhmm..... I think we should split the removal of md_redundancy_group
out into a separate delayed_work.  But yes, we need to deal with that
somehow.

> > in do_md_stop up into the
> >
> > 	case 0: /* disassemble */
> > 	case 2: /* stop */
> >
> > section.
> >
> > My one concern about this conclusion is that it seems to sidestep
> > the core problem rather than address it.  The core problem is that
> > setting the state to 'clean' means a very different thing
> > depending on which side you come from, so an app which writes
> > 'clean' might get more than it bargained for.  All I'm doing is
> > making that confusion avoidable rather than impossible....
> >
> > I guess I could invent a syntax: writing "a->b" to array_state
> > sets the state to 'b', but only if it was already in 'a'.  But
> > that feels needlessly complex.
> >
> > So: do you agree with my leaning?  Or my concern?  Or both, or
> > neither?
>
> First let me say that I really appreciate it when you do these
> implementation contingency brain dumps; it really helps to get all
> the cards on the table (something I would do well to emulate).
> I agree with your concern, and I am currently more in line with the
> idea that the confusion should be impossible rather than avoidable.
> I now think it is a bug that the 'inactive' state has two meanings.
> A new state like 'shutdown' or 'defunct' would imply that 'clear' is
> the only valid next state; all other writes return an error.
>
> So it is not quite as ugly as randomly preventing some
> 'inactive->clean' transitions, in that userspace can see why its
> attempt to write 'clean' is returning -EINVAL.
>
> Thoughts?

I don't like the idea of a trap-door where once you get to 'shutdown'
the only way out is 'clear'.  But I could possibly live with a state
like 'shutdown' which can transition to either 'clear' or 'inactive',
but not directly to 'clean'.  I cannot think of an immediate use for
the 'shutdown'->'inactive' transition, but I wouldn't want to exclude
it.

But I really think the 'problem' is more around 'clean' than it is
around 'inactive'.  I think I would like writing 'clean' never to
start the array.  If we want to start the array in the 'clean' state,
then we should write something else.  I think we can just write
'active'; that will actually leave the array in a 'clean' state, I
think.  I'd need to check that...

So now I think we just arrange that 'clean' returns -EINVAL if
mddev->pers == NULL.  We never currently write 'clean' to start an
array - we either use RUN_ARRAY, or mdmon writes either read_auto or
active - so that is safe.

So we need to:

 - remove the rd%d links earlier
 - remove the redundancy group earlier
 - disable starting an array by writing 'clean'.

I think I've convinced myself.  Have I convinced you?

NeilBrown
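[The last item amounts to dropping the do_md_run() fallback from the
'clean' branch sketched after the first message - roughly this,
paraphrased rather than the actual patch:]

	case clean:
		if (mddev->pers) {
			/* existing path: mark the running array
			 * in_sync, or -EBUSY if writes are pending */
			...
		} else
			/* writing 'clean' no longer assembles a
			 * stopped array; use RUN_ARRAY or write
			 * 'active' instead */
			err = -EINVAL;
		break;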
* Re: array_state_store: 'inactive' versus 'clean' race
@ 2009-04-29  1:42 UTC

From: Dan Williams
To: Neil Brown
Cc: Danecki, Jacek, Ciechanowski, Ed, linux-raid

On Tue, Apr 28, 2009 at 6:14 PM, Neil Brown <neilb@suse.de> wrote:
> So we need to:
>
>  - remove the rd%d links earlier
>  - remove the redundancy group earlier
>  - disable starting an array by writing 'clean'.
>
> I think I've convinced myself.  Have I convinced you?

Yes, the simplicity of just turning off 'inactive->clean' transitions
is indeed convincing.
end of thread, other threads: [~2009-04-29  1:42 UTC | newest]

Thread overview: 5+ messages:
2009-04-28  1:24 array_state_store: 'inactive' versus 'clean' race Dan Williams
2009-04-28  4:16 ` Neil Brown
2009-04-28 18:05   ` Dan Williams
2009-04-29  1:14     ` Neil Brown
2009-04-29  1:42       ` Dan Williams