From: Martin Wilck <mwilck@arcor.de>
To: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Subject: Re: Suspicious test failure - mdmon misses recovery events on loop devices
Date: Mon, 29 Jul 2013 22:39:43 +0200 [thread overview]
Message-ID: <51F6D30F.7060007@arcor.de> (raw)
In-Reply-To: <20130729165550.53c33bf3@notabene.brown>
[-- Attachment #1: Type: text/plain, Size: 2241 bytes --]
On 07/29/2013 08:55 AM, NeilBrown wrote:
> Hi Martin.
>
> I don't think the state change needs to happen while mdmon is in the select
> call. It just need to happen between one call to read_and_act, and the next.
> And everything happens between one call and the next...
>
> If sync_action is 'recovery' one time and then something else that isn't
> 'idle' the next time, then that would cause the transition to get lost.
> Can that ever happen? Do you see a particular transition that bypasses
> 'idle'?
> It is possible there is some race here...
>
> I'll try out your test script can see if I can reproduce it.
I don't see "recover" followed by anything else but "idle". But what I
do observe is that the "recover" status isn't seen at all.
I've attached a comparison of good and bad case for the same test.
BAD CASE: mdmon processes the metadata update for the first member and
writes the metadata. when it calls read_and_act after that, the recovery
on the first array is already finished, and it will call set_disk, and
call sync_metadata() again. This double metadata write takes a long
time. Meanwhile, the manager sent the update for the second member and
started the recovery on it. When mdmon comes down to processing this
update, the recovery on the 2nd array is already finished, and it never
sees "recover" or "frozen" state on it. Consequently, it doesn't realize
that there ever was a recovery.
The example is somehow pathological because the test arrays are
unrealistically small. In normal situations we'd expect recovery to take
much longer than 2 meta data writes. However, it doesn't feel good to
know that this can happen.
My current idea to solve this is yet another separate thread just for
monitoring kernel state changes. Don't have it ready yet, though.
Martin
>
>
>
>>
>> Martin
>>
>> PS: In that context, reading mdmon-design.txt, is it allowed at all to
>> add dprintf() messages in the code path called by mdmon? That would also
>> affect some DDF methods where I currently have lots of debug code.
>
> Yes, you can have dprintf messages anywhere. However if debugging is
> enabled, then I don't promise that mdmon will even try to survive low memory
> conditions.
>
> NeilBrown
[-- Attachment #2: fail-and-add_goodbad.txt --]
[-- Type: text/plain, Size: 804 bytes --]
"*" means monitor wakeup
array (1) is the RAID10, (0) the RAID5
GOOD (2)
Monitor Manager
activate_spare(1)
* 1:frozen 0: idle
* process_update(1)
... activate_spare(0)
sync_metadata
1: recover 0: frozen
remove_old(md125)
* process_update(0)
sync_metadata
1: recover 0: recover
* 1: recover 0: recover
remove_old(md126)
...
* 1: idle 0: recover
set_disk
sync_metadata
...
* 1: clean 0: clean
set_disk
sync_metadata
BAD
Monitor Manager
activate_spare(1)
* 1: frozen 0: idle
*
remove_old(md125)
1: recover 0: idle
* process_update(1)
... activate_spare(0)
sync_metadata
remove_old(md126)
1: idle 0: idle
set_disk
sync_metadata
* process_update(0)
sync_metadata
1: idle 0: idle (3s later)
next prev parent reply other threads:[~2013-07-29 20:39 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-07-26 20:58 Suspicious test failure - mdmon misses recovery events on loop devices Martin Wilck
2013-07-29 6:55 ` NeilBrown
2013-07-29 20:39 ` Martin Wilck [this message]
2013-07-29 20:42 ` Martin Wilck
2013-07-30 0:42 ` NeilBrown
2013-07-30 21:16 ` Martin Wilck
2013-07-30 21:18 ` [PATCH 00/10] Two bug fixes and a lot of debug code mwilck
2013-07-31 3:10 ` NeilBrown
2013-07-30 21:18 ` [PATCH 01/10] DDF: ddf_activate_spare: bugfix for 62ff3c40 mwilck
2013-07-30 21:18 ` [PATCH 02/10] DDF: log disk status changes more nicely mwilck
2013-07-30 21:18 ` [PATCH 03/10] DDF: ddf_process_update: log offsets for conf changes mwilck
2013-07-30 21:18 ` [PATCH 04/10] DDF: load_ddf_header: more error logging mwilck
2013-07-30 21:18 ` [PATCH 05/10] DDF: ddf_set_disk: add some debug messages mwilck
2013-07-30 21:18 ` [PATCH 06/10] monitor: read_and_act: log status when called mwilck
2013-07-31 2:59 ` NeilBrown
2013-07-31 5:28 ` Martin Wilck
2013-07-30 21:18 ` [PATCH 07/10] mdmon: wait_and_act: fix debug message for SIGUSR1 mwilck
2013-07-30 21:18 ` [PATCH 08/10] mdmon: manage_member: debug messages for array state mwilck
2013-07-30 21:18 ` [PATCH 09/10] mdmon: manage_member: fix race condition during slow meta data writes mwilck
2013-07-30 21:18 ` [PATCH 10/10] tests/10ddf-create-fail-rebuild: new unit test for DDF mwilck
2013-07-31 5:36 ` [PATCH] tests/env-ddf-template: helper for new unit test mwilck
2013-07-31 6:49 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=51F6D30F.7060007@arcor.de \
--to=mwilck@arcor.de \
--cc=linux-raid@vger.kernel.org \
--cc=neilb@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.