* sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-01-19 16:19 UTC
To: linux-raid

Greetings,
I am looking into a scenario in which the md raid5/6 array is resyncing
(e.g., after a fresh creation) and there is a drive failure. As written in
Neil's blog entry "Closing the RAID5 write hole"
(http://neil.brown.name/blog/20110614101708): "if a device fails during the
resync, md doesn't take special action - it just allows the array to be used
without a resync even though there could be corrupt data".

However, I noticed that at this point sb->resync_offset in the superblock is
not set to MaxSector. At this point, if a drive is added/re-added to the
array, then drive recovery starts, i.e., md assumes that data/parity on the
surviving drives are correct, and uses them to rebuild the new drive. This
state of data/parity being correct should be reflected as
sb->resync_offset==MaxSector, shouldn't it?

One issue that I ran into is the following: I reached a situation in which,
during array assembly, sb->resync_offset==sb->size. At this point, the
following code in mdadm assumes that the array is clean:

	info->array.state =
		(__le64_to_cpu(sb->resync_offset) >= __le64_to_cpu(sb->size))
		? 1 : 0;

As a result, mdadm lets the array assembly flow through fine to the kernel,
but in the kernel the following code refuses to start the array:

	if (mddev->degraded > dirty_parity_disks &&
	    mddev->recovery_cp != MaxSector) {

At this point, specifying --force to mdadm --assemble doesn't help, because
mdadm thinks that the array is clean (clean==1), and therefore doesn't do the
"force-array" update, which would knock off the sb->resync_offset value. So
there is no way to start the array, unless specifying the
start_dirty_degraded=1 kernel parameter.

So one question is: should mdadm compare sb->resync_offset to MaxSector and
not to sb->size? In the kernel code, resync_offset is always compared to
MaxSector.

Another question is: should sb->resync_offset be set to MaxSector by the
kernel as soon as it starts rebuilding a drive? I think this would be
consistent with what Neil wrote in the blog entry.

Here is the scenario to reproduce the issue I described:
# Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
# Fail drive D. Array aborts the resync and then immediately restarts it
  (it seems to checkpoint mddev->recovery_cp, but I am not sure that it
  restarts from that checkpoint).
# Re-add drive D to the array. It is added as a spare, array continues resyncing.
# Fail drive C. Array aborts the resync, and then starts rebuilding drive D.
  At this point sb->resync_offset is some valid value (usually 0, not
  MaxSector and not sb->size).
# Stop the array. At this point sb->resync_offset is sb->size in all the
  superblocks.

Another question I have: when exactly does md decide to update
sb->resync_offset in the superblock? I am playing with similar scenarios with
raid5, and sometimes I end up with MaxSector and sometimes with valid values.
From the code, it looks like only this logic updates it:

	if (mddev->in_sync)
		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
	else
		sb->resync_offset = cpu_to_le64(0);

except for resizing and setting through sysfs. But I don't understand how
this value should be managed in general.

Thanks!
Alex.
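To make the first question concrete: the suggestion amounts to mdadm
mirroring the kernel's sentinel test instead of comparing against the device
size. A minimal, self-contained sketch follows; MAX_SECTOR here is only a
stand-in for the kernel's MaxSector value, and the real mdadm code would
still apply the __le64_to_cpu() byte-swap to the on-disk field:

	#include <stdint.h>

	/* Stand-in for the kernel's MaxSector "no resync pending" sentinel. */
	#define MAX_SECTOR (~(uint64_t)0)

	/* Sketch of the proposed check: the array is clean only when
	 * resync_offset holds the sentinel, matching the kernel's
	 * "recovery_cp != MaxSector" test rather than comparing against
	 * sb->size. */
	static int array_is_clean(uint64_t resync_offset)
	{
		return resync_offset == MAX_SECTOR;
	}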
* Re: sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-01-24 14:25 UTC
To: NeilBrown; +Cc: linux-raid

Hello Neil,
I hope you can find some time to look at my doubts in the email below.
Meanwhile I realized I have more doubts about resync. Hopefully, you will be
able to give some information on those too...

# I am looking at the "--force" parameter for assembly, and also at the
"start_dirty_degraded" kernel parameter. They are actually very different:
"--force" marks the array as clean (sets sb->resync_offset=MaxSector), while
if start_dirty_degraded==1, the kernel actually starts resyncing the array.
For RAID5, it starts and stops immediately (correct?). But for RAID6 coming
up with one missing drive, the kernel will do the resync using the remaining
redundant drive. So start_dirty_degraded==1 is "better" than just forgetting
about the resync with "--force", isn't it? Because we will still have one
parity block correct.

Do you think the following logic is appropriate: always set the
start_dirty_degraded=1 kernel parameter. In mdadm, during assembly, detect
dirty+degraded, and if "--force" is not given - abort. If "--force" is given,
don't knock off sb->resync_offset (like the code does today), assemble the
array and let the kernel start the resync (if there is still a redundant
drive).

# I saw an explanation on the list that for RAID6 a full stripe is always
rewritten. Given this, I think I don't understand why the initial resync of
the array is needed. For those areas never written to, the parity may remain
incorrect, because reading data from there is not expected to return anything
meaningful. For those areas written to, the parity will be recalculated while
writing. So reading from those areas should have correct parity in degraded
mode. I must be missing something here for sure, can you tell me what?

Thanks,
Alex.

On Thu, Jan 19, 2012 at 6:19 PM, Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
> Greetings,
> I am looking into a scenario in which the md raid5/6 array is
> resyncing (e.g., after a fresh creation) and there is a drive failure.
> As written in Neil's blog entry "Closing the RAID5 write hole"
> (http://neil.brown.name/blog/20110614101708): "if a device fails
> during the resync, md doesn't take special action - it just allows the
> array to be used without a resync even though there could be corrupt
> data".
>
> However, I noticed that at this point sb->resync_offset in the
> superblock is not set to MaxSector. At this point, if a drive is
> added/re-added to the array, then drive recovery starts, i.e., md
> assumes that data/parity on the surviving drives are correct, and uses
> them to rebuild the new drive. This state of data/parity being correct
> should be reflected as sb->resync_offset==MaxSector, shouldn't it?
>
> One issue that I ran into is the following: I reached a situation in
> which, during array assembly, sb->resync_offset==sb->size. At this
> point, the following code in mdadm assumes that the array is clean:
>	info->array.state =
>		(__le64_to_cpu(sb->resync_offset) >= __le64_to_cpu(sb->size))
>		? 1 : 0;
> As a result, mdadm lets the array assembly flow through fine to the
> kernel, but in the kernel the following code refuses to start the
> array:
>	if (mddev->degraded > dirty_parity_disks &&
>	    mddev->recovery_cp != MaxSector) {
>
> At this point, specifying --force to mdadm --assemble doesn't help,
> because mdadm thinks that the array is clean (clean==1), and therefore
> doesn't do the "force-array" update, which would knock off the
> sb->resync_offset value. So there is no way to start the array, unless
> specifying the start_dirty_degraded=1 kernel parameter.
>
> So one question is: should mdadm compare sb->resync_offset to
> MaxSector and not to sb->size? In the kernel code, resync_offset is
> always compared to MaxSector.
>
> Another question is: should sb->resync_offset be set to MaxSector by
> the kernel as soon as it starts rebuilding a drive? I think this would
> be consistent with what Neil wrote in the blog entry.
>
> Here is the scenario to reproduce the issue I described:
> # Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
> # Fail drive D. Array aborts the resync and then immediately restarts
> it (it seems to checkpoint mddev->recovery_cp, but I am not sure
> that it restarts from that checkpoint).
> # Re-add drive D to the array. It is added as a spare, array continues resyncing.
> # Fail drive C. Array aborts the resync, and then starts rebuilding
> drive D. At this point sb->resync_offset is some valid value (usually
> 0, not MaxSector and not sb->size).
> # Stop the array. At this point sb->resync_offset is sb->size in all
> the superblocks.
>
> Another question I have: when exactly does md decide to update
> sb->resync_offset in the superblock? I am playing with similar
> scenarios with raid5, and sometimes I end up with MaxSector and
> sometimes with valid values. From the code, it looks like only this
> logic updates it:
>	if (mddev->in_sync)
>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>	else
>		sb->resync_offset = cpu_to_le64(0);
> except for resizing and setting through sysfs. But I don't understand
> how this value should be managed in general.
>
> Thanks!
> Alex.
* Re: sb->resync_offset value after resync failure
From: NeilBrown @ 2012-01-26 0:41 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Tue, 24 Jan 2012 16:25:04 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Hello Neil,
> I hope you can find some time to look at my doubts in the email below.
> Meanwhile I realized I have more doubts about resync. Hopefully, you
> will be able to give some information on those too...
>
> # I am looking at the "--force" parameter for assembly, and also at the
> "start_dirty_degraded" kernel parameter. They are actually very
> different: "--force" marks the array as clean (sets
> sb->resync_offset=MaxSector), while if start_dirty_degraded==1, the
> kernel actually starts resyncing the array. For RAID5, it starts and
> stops immediately (correct?). But for RAID6 coming up with one missing
> drive, the kernel will do the resync using the remaining redundant drive.
> So start_dirty_degraded==1 is "better" than just forgetting about the
> resync with "--force", isn't it? Because we will still have one parity
> block correct.

Correct.
Degraded RAID5 is either clean or faulty. There is no sense in which such an
array can be simply "dirty" as there is no redundancy.
Singly-degraded RAID6 can be "dirty" as it still has one disk of redundancy
which could be inconsistent with the rest. So there is an extra state that we
need to handle which we currently do not.

> Do you think the following logic is appropriate: always set the
> start_dirty_degraded=1 kernel parameter. In mdadm, during assembly,
> detect dirty+degraded, and if "--force" is not given - abort. If "--force"
> is given, don't knock off sb->resync_offset (like the code does today),
> assemble the array and let the kernel start the resync (if there is still
> a redundant drive).

I would rather export mddev->ok_start_degraded via sysfs. Then mdadm can
respond to the --force flag on a dirty/degraded RAID6 by either setting the
flag if it exists, or marking the array 'clean' and starting a 'repair'.

> # I saw an explanation on the list that for RAID6 a full stripe is
> always rewritten. Given this, I think I don't understand why the
> initial resync of the array is needed. For those areas
> never written to, the parity may remain incorrect, because reading
> data from there is not expected to return anything meaningful. For
> those areas written to, the parity will be
> recalculated while writing. So reading from those areas should have
> correct parity in degraded mode. I must be missing something here for
> sure, can you tell me what?

The initial sync isn't needed and you are free to specify --assume-clean if
you like. However I don't guarantee that RAID6 will always perform
reconstruct-write (rather than read-modify-write).
Also, most people scrub their arrays periodically and if you haven't done an
initial sync, the first time you do a scrub you will likely get a high
mismatch count, which might be confusing.
So I resync by default because it is safest. If you know what you are doing,
then explicitly disabling that is perfectly OK.

NeilBrown
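To make the sysfs suggestion concrete, here is a rough, untested sketch of
what such a per-array attribute could look like. It assumes md's usual
md_sysfs_entry pattern; the attribute name and the exact wiring are
assumptions for illustration, not an actual patch:

	/* Sketch only: expose mddev->ok_start_degraded per array, so mdadm
	 * could set it at assembly time instead of relying on the global
	 * start_dirty_degraded module parameter.  Not compile-tested. */
	static ssize_t
	ok_start_degraded_show(struct mddev *mddev, char *page)
	{
		return sprintf(page, "%d\n", mddev->ok_start_degraded);
	}

	static ssize_t
	ok_start_degraded_store(struct mddev *mddev, const char *buf, size_t len)
	{
		unsigned long n;

		if (kstrtoul(buf, 10, &n) || n > 1)
			return -EINVAL;
		mddev->ok_start_degraded = n;
		return len;
	}

	static struct md_sysfs_entry md_ok_start_degraded =
	__ATTR(ok_start_degraded, S_IRUGO|S_IWUSR,
	       ok_start_degraded_show, ok_start_degraded_store);

With something along these lines, mdadm could write 1 to the attribute before
running the array when --force is given on a dirty, singly-degraded RAID6.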
* Re: sb->resync_offset value after resync failure
From: NeilBrown @ 2012-01-26 0:51 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Thu, 19 Jan 2012 18:19:55 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Greetings,
> I am looking into a scenario in which the md raid5/6 array is
> resyncing (e.g., after a fresh creation) and there is a drive failure.
> As written in Neil's blog entry "Closing the RAID5 write hole"
> (http://neil.brown.name/blog/20110614101708): "if a device fails
> during the resync, md doesn't take special action - it just allows the
> array to be used without a resync even though there could be corrupt
> data".
>
> However, I noticed that at this point sb->resync_offset in the
> superblock is not set to MaxSector. At this point, if a drive is
> added/re-added to the array, then drive recovery starts, i.e., md
> assumes that data/parity on the surviving drives are correct, and uses
> them to rebuild the new drive. This state of data/parity being correct
> should be reflected as sb->resync_offset==MaxSector, shouldn't it?
>
> One issue that I ran into is the following: I reached a situation in
> which, during array assembly, sb->resync_offset==sb->size. At this
> point, the following code in mdadm assumes that the array is clean:
>	info->array.state =
>		(__le64_to_cpu(sb->resync_offset) >= __le64_to_cpu(sb->size))
>		? 1 : 0;
> As a result, mdadm lets the array assembly flow through fine to the
> kernel, but in the kernel the following code refuses to start the
> array:
>	if (mddev->degraded > dirty_parity_disks &&
>	    mddev->recovery_cp != MaxSector) {
>
> At this point, specifying --force to mdadm --assemble doesn't help,
> because mdadm thinks that the array is clean (clean==1), and therefore
> doesn't do the "force-array" update, which would knock off the
> sb->resync_offset value. So there is no way to start the array, unless
> specifying the start_dirty_degraded=1 kernel parameter.
>
> So one question is: should mdadm compare sb->resync_offset to
> MaxSector and not to sb->size? In the kernel code, resync_offset is
> always compared to MaxSector.

Yes, mdadm should be consistent with the kernel. Patches welcome.

> Another question is: should sb->resync_offset be set to MaxSector by
> the kernel as soon as it starts rebuilding a drive? I think this would
> be consistent with what Neil wrote in the blog entry.

Maybe every time we update ->curr_resync_completed we should update
->recovery_cp as well if it is below the new ->curr_resync_completed??

> Here is the scenario to reproduce the issue I described:
> # Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
> # Fail drive D. Array aborts the resync and then immediately restarts
> it (it seems to checkpoint mddev->recovery_cp, but I am not sure
> that it restarts from that checkpoint).
> # Re-add drive D to the array. It is added as a spare, array continues resyncing.
> # Fail drive C. Array aborts the resync, and then starts rebuilding
> drive D. At this point sb->resync_offset is some valid value (usually
> 0, not MaxSector and not sb->size).

Does it start the rebuilding from the start? I hope it does.

> # Stop the array. At this point sb->resync_offset is sb->size in all
> the superblocks.

At some point in there you had a RAID6 with two missing devices, so it is
either failed or completely in-sync. I guess we assume the latter.
Is that wrong?

> Another question I have: when exactly does md decide to update
> sb->resync_offset in the superblock? I am playing with similar
> scenarios with raid5, and sometimes I end up with MaxSector and
> sometimes with valid values. From the code, it looks like only this
> logic updates it:
>	if (mddev->in_sync)
>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>	else
>		sb->resync_offset = cpu_to_le64(0);
> except for resizing and setting through sysfs. But I don't understand
> how this value should be managed in general.

I'm not sure what you are asking here ... that code explains exactly when
resync_offset should be set, and how. What more is there to say?

NeilBrown

> Thanks!
> Alex.
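The "update ->recovery_cp along with ->curr_resync_completed" idea above
boils down to something like the following helper, called wherever the
checkpoint advances. This is only an illustrative sketch of the intent, not
an actual md patch:

	/* Sketch: whenever the resync checkpoint advances, pull recovery_cp
	 * forward too, so the recorded resync position never claims less
	 * progress than has actually been completed. */
	static void advance_recovery_cp(struct mddev *mddev)
	{
		if (mddev->recovery_cp < mddev->curr_resync_completed)
			mddev->recovery_cp = mddev->curr_resync_completed;
	}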
* Re: sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-02-01 20:56 UTC
To: NeilBrown; +Cc: linux-raid

Hi Neil,
based on your hints I dug some more into resync failure cases, and the
handling of sb->resync_offset and mddev->recovery_cp. Here are some
observations:

The only cases in which recovery_cp is set (except setting via sysfs, setting
bitmap bits via sysfs, etc.) are:
# on array creation, recovery_cp is set to 0 (and the bitmap is fully set)
# recovery_cp becomes MaxSector only if MD_RECOVERY_SYNC (with/without
  REQUESTED) completes successfully. On raid6 this may happen when a
  singly-degraded array completes resync.
# if resync does not complete successfully (MD_RECOVERY_INTR or crash), then
  recovery_cp remains valid (not MaxSector).

# The only real influence that recovery_cp seems to have is:
1) abort the assembly if ok_start_degraded is not set
2) when loading the superblock, it looks like recovery_cp may cause the
   beginning of the bitmap not to be loaded. I did not dig further into the
   bitmap at this point.
3) resume the resync in md_check_recovery() if recovery_cp is valid.

Are these observations valid?

With this scheme I saw several interesting issues:
# After resync is aborted/interrupted, recovery_cp is updated (either to
MaxSector or another value). However, the superblock is not updated at this
point. If there is no additional activity on the array, the superblock will
not be updated. I saw cases when resync completes fine, recovery_cp is set to
MaxSector, but is not persisted in the superblock. If I crash the machine at
this point, then after reboot the array is still considered dirty (has a
valid resync_offset in the superblock). Is there an option to force the
superblock update at this point?

# When resync aborts due to drive failure, then MD_RECOVERY_INTR is set, and
sync_request() returns mddev->dev_sectors - sector_nr. As a result,
recovery_cp is set to the device size, and that's what I see in my raid6
scenario. At this point three things can happen:
1) If there is an additional drive to resync (like in raid6), then the resync
   restarts (from sector 0)
2) If there is a spare, it starts rebuilding the spare, and, as a result,
   persists sb->resync_offset==sb->size in the superblock
3) Otherwise, it restarts the resync (due to a valid recovery_cp),
   immediately finishes it, and sets recovery_cp=MaxSector (but does not
   necessarily write the superblock right away).

So basically, we can have a rebuilding drive and sb->resync_offset==sb->size
in the superblock. It will be cleared only after the rebuild completes and
this "empty resync" happens. Are you (we?) comfortable with this behavior?
(Due to the mdadm issue I mentioned earlier, in this case mdadm thinks that
the array is clean, while the kernel thinks the array is dirty, while it's
actually rebuilding.)

Now I will answer your suggestions:

>> So one question is: should mdadm compare sb->resync_offset to
>> MaxSector and not to sb->size? In the kernel code, resync_offset is
>> always compared to MaxSector.
>
> Yes, mdadm should be consistent with the kernel. Patches welcome.

I think at this point it's better to consider the array as dirty in mdadm,
and either let the kernel set MaxSector in this "empty resync" or set it via
mdadm+force. Let's see if I can submit this trivial patch.

>> Another question is: should sb->resync_offset be set to MaxSector by
>> the kernel as soon as it starts rebuilding a drive? I think this would
>> be consistent with what Neil wrote in the blog entry.
>
> Maybe every time we update ->curr_resync_completed we should update
> ->recovery_cp as well if it is below the new ->curr_resync_completed??

I'm not sure it will help a lot. Except for not loading part of the bitmap
(which I am unsure about), the real value of recovery_cp does not really
matter. Is this correct?

>> Here is the scenario to reproduce the issue I described:
>> # Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
>> # Fail drive D. Array aborts the resync and then immediately restarts
>> it (it seems to checkpoint mddev->recovery_cp, but I am not sure
>> that it restarts from that checkpoint).
>> # Re-add drive D to the array. It is added as a spare, array continues resyncing.
>> # Fail drive C. Array aborts the resync, and then starts rebuilding
>> drive D. At this point sb->resync_offset is some valid value (usually
>> 0, not MaxSector and not sb->size).
>
> Does it start the rebuilding from the start? I hope it does.

Yes, it starts from j==0 (and then it goes according to the bitmap, I guess
and hope).

>> # Stop the array. At this point sb->resync_offset is sb->size in all
>> the superblocks.
>
> At some point in there you had a RAID6 with two missing devices, so it is
> either failed or completely in-sync. I guess we assume the latter.
> Is that wrong?

It's not that it's wrong, but if we assume the array is clean, then
resync_offset should be MaxSector in the superblock. But it's not, as I
showed earlier.

About exposing ok_start_degraded via sysfs: is it possible to push this value
through the RUN_ARRAY ioctl? I saw that it can carry an mdu_param_t, which
has a max_fault /* unused for now */ field. Perhaps this field can be used?

I have another question about md_check_recovery(), this code:

	if (mddev->safemode &&
	    !atomic_read(&mddev->writes_pending) &&
	    !mddev->in_sync &&
	    mddev->recovery_cp == MaxSector) {
		mddev->in_sync = 1;

Why is mddev->recovery_cp == MaxSector checked here? This basically means
that if we have a valid recovery_cp, then in_sync will never be set to 1, and
this code:

	if (mddev->in_sync)
		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
	else
		sb->resync_offset = cpu_to_le64(0);

should always set resync_offset to 0. I saw that "in_sync" basically tells
whether there are pending/in-flight writes.

Finally, what would you recommend doing if a raid5 resync fails? At this
point the kernel doesn't fail the array (as you pointed out in your blog),
but if the machine reboots, the array cannot be re-assembled without "force",
because it is dirty and has a failed drive. So it's kind of... inconsistent?

Thanks for your help,
Alex.
* Re: sb->resync_offset value after resync failure
From: NeilBrown @ 2012-02-01 22:19 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Wed, 1 Feb 2012 22:56:56 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Hi Neil,
> based on your hints I dug some more into resync failure cases, and the
> handling of sb->resync_offset and mddev->recovery_cp. Here are some
> observations:
>
> The only cases in which recovery_cp is set (except setting via sysfs,
> setting bitmap bits via sysfs, etc.) are:
> # on array creation, recovery_cp is set to 0 (and the bitmap is fully set)
> # recovery_cp becomes MaxSector only if MD_RECOVERY_SYNC (with/without
> REQUESTED) completes successfully. On raid6 this may happen when a
> singly-degraded array completes resync.
> # if resync does not complete successfully (MD_RECOVERY_INTR or
> crash), then recovery_cp remains valid (not MaxSector).
>
> # The only real influence that recovery_cp seems to have is:
> 1) abort the assembly if ok_start_degraded is not set
> 2) when loading the superblock, it looks like recovery_cp may cause
> the beginning of the bitmap not to be loaded. I did not dig further
> into the bitmap at this point.
> 3) resume the resync in md_check_recovery() if recovery_cp is valid.
>
> Are these observations valid?

Looks about right.
'recovery_cp' reflects the 'dirty' flag in the superblock.
'0' means the whole array is dirty. 'MaxSector' means none of the array is
dirty. Other numbers mean that part of the array is dirty, part isn't.
(I should really rename it to 'resync_cp' - 'cp' for 'checkpoint' of course).

> With this scheme I saw several interesting issues:
> # After resync is aborted/interrupted, recovery_cp is updated (either
> to MaxSector or another value). However, the superblock is not updated
> at this point. If there is no additional activity on the array, the
> superblock will not be updated. I saw cases when resync completes
> fine, recovery_cp is set to MaxSector, but is not persisted in the
> superblock. If I crash the machine at this point, then after reboot
> the array is still considered dirty (has a valid resync_offset in
> the superblock). Is there an option to force the superblock update at this
> point?

Maybe we just need to move the 'skip:' label to before:

	set_bit(MD_CHANGE_DEVS, &mddev->flags);

it is that flag which causes the superblock to be updated.

> # When resync aborts due to drive failure, then MD_RECOVERY_INTR is
> set, and sync_request() returns mddev->dev_sectors - sector_nr. As a
> result, recovery_cp is set to the device size, and that's what I see in my
> raid6 scenario. At this point three things can happen:
> 1) If there is an additional drive to resync (like in raid6), then the
> resync restarts (from sector 0)

It should really just keep resyncing from where it is up to, shouldn't it?

> 2) If there is a spare, it starts rebuilding the spare, and, as a
> result, persists sb->resync_offset==sb->size in the superblock

Hmm.. that's wrong too. It makes some sense for RAID5 as if there is a spare
recovering we must assume the rest of the array is in-sync. But not for
RAID6.

> 3) Otherwise, it restarts the resync (due to a valid recovery_cp),
> immediately finishes it, and sets recovery_cp=MaxSector (but does not
> necessarily write the superblock right away).

This bit seems right - assuming you are referring to the 'raid5, no spare'
case.

> So basically, we can have a rebuilding drive and
> sb->resync_offset==sb->size in the superblock. It will be cleared only
> after the rebuild completes and this "empty resync" happens.
> Are you (we?) comfortable with this behavior? (Due to the mdadm issue I
> mentioned earlier, in this case mdadm thinks that the array is clean,
> while the kernel thinks the array is dirty, while it's actually
> rebuilding.)

No, not really. There do seem to be some bugs there.

> Now I will answer your suggestions:
>
> >> So one question is: should mdadm compare sb->resync_offset to
> >> MaxSector and not to sb->size? In the kernel code, resync_offset is
> >> always compared to MaxSector.
> >
> > Yes, mdadm should be consistent with the kernel. Patches welcome.
>
> I think at this point it's better to consider the array as dirty in
> mdadm, and either let the kernel set MaxSector in this "empty resync"
> or set it via mdadm+force. Let's see if I can submit this trivial
> patch.

Sounds reasonable.

> >> Another question is: should sb->resync_offset be set to MaxSector by
> >> the kernel as soon as it starts rebuilding a drive? I think this would
> >> be consistent with what Neil wrote in the blog entry.
> >
> > Maybe every time we update ->curr_resync_completed we should update
> > ->recovery_cp as well if it is below the new ->curr_resync_completed??
>
> I'm not sure it will help a lot. Except for not loading part of the
> bitmap (which I am unsure about), the real value of recovery_cp does
> not really matter. Is this correct?

It is stored in the metadata so if you stop the array in the middle of a
resync, then start the array again, the resync starts from wherever it was up
to. So having a correct value certainly is useful.
(Having a value that is less than the correct value is not a big problem;
having a value that is too big would be bad.)

> >> Here is the scenario to reproduce the issue I described:
> >> # Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
> >> # Fail drive D. Array aborts the resync and then immediately restarts
> >> it (it seems to checkpoint mddev->recovery_cp, but I am not sure
> >> that it restarts from that checkpoint).
> >> # Re-add drive D to the array. It is added as a spare, array continues resyncing.
> >> # Fail drive C. Array aborts the resync, and then starts rebuilding
> >> drive D. At this point sb->resync_offset is some valid value (usually
> >> 0, not MaxSector and not sb->size).
> >
> > Does it start the rebuilding from the start? I hope it does.
>
> Yes, it starts from j==0 (and then it goes according to the bitmap, I
> guess and hope).
>
> >> # Stop the array. At this point sb->resync_offset is sb->size in all
> >> the superblocks.
> >
> > At some point in there you had a RAID6 with two missing devices, so it is
> > either failed or completely in-sync. I guess we assume the latter.
> > Is that wrong?
>
> It's not that it's wrong, but if we assume the array is clean, then
> resync_offset should be MaxSector in the superblock. But it's not, as
> I showed earlier.

Fair enough - it would be better if it were MaxSector in this case.

> About exposing ok_start_degraded via sysfs: is it possible to push
> this value through the RUN_ARRAY ioctl? I saw that it can carry an
> mdu_param_t, which has a max_fault /* unused for now */ field. Perhaps
> this field can be used?

I'm not keen on adding more functionality on to the ioctls. I'd much rather
see something via sysfs...
Not sure what is best though, I'd have to have a good think about it.

> I have another question about md_check_recovery(), this code:
>	if (mddev->safemode &&
>	    !atomic_read(&mddev->writes_pending) &&
>	    !mddev->in_sync &&
>	    mddev->recovery_cp == MaxSector) {
>		mddev->in_sync = 1;
> Why is mddev->recovery_cp == MaxSector checked here? This basically
> means that if we have a valid recovery_cp, then in_sync will never be
> set to 1, and this code:
>	if (mddev->in_sync)
>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>	else
>		sb->resync_offset = cpu_to_le64(0);
> should always set resync_offset to 0. I saw that "in_sync" basically
> tells whether there are pending/in-flight writes.

If recovery_cp != MaxSector, then the array is not considered to be 'clean'
so we don't want to mark it as 'clean'.
On a graceful shutdown we do get in_sync to 1 - in __md_stop_writes.
Maybe we could do it at other times if the array is idle - not sure.

> Finally, what would you recommend doing if a raid5 resync fails? At
> this point the kernel doesn't fail the array (as you pointed out in your
> blog), but if the machine reboots, the array cannot be re-assembled without
> "force", because it is dirty and has a failed drive. So it's kind
> of... inconsistent?

I don't have a good answer.
The logical thing to do is to cause the whole array to fail - but I think
that would be unpopular.
I'm happy to listen to well-reasoned suggestions, but I don't have any good
suggestion other than the current situation.

Thanks,
NeilBrown
* Re: sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-02-06 11:41 UTC
To: NeilBrown; +Cc: linux-raid

Hi Neil,

>> # After resync is aborted/interrupted, recovery_cp is updated (either
>> to MaxSector or another value). However, the superblock is not updated
>> at this point. If there is no additional activity on the array, the
>> superblock will not be updated. I saw cases when resync completes
>> fine, recovery_cp is set to MaxSector, but is not persisted in the
>> superblock. If I crash the machine at this point, then after reboot
>> the array is still considered dirty (has a valid resync_offset in
>> the superblock). Is there an option to force the superblock update at this
>> point?
>
> Maybe we just need to move the 'skip:' label to before:
>
>	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>
> it is that flag which causes the superblock to be updated.

Thanks for the hint, will try it. For production though I'm a bit afraid to
recompile my own kernel at this point.

>> # When resync aborts due to drive failure, then MD_RECOVERY_INTR is
>> set, and sync_request() returns mddev->dev_sectors - sector_nr. As a
>> result, recovery_cp is set to the device size, and that's what I see in my
>> raid6 scenario. At this point three things can happen:
>> 1) If there is an additional drive to resync (like in raid6), then the
>> resync restarts (from sector 0)
>
> It should really just keep resyncing from where it is up to, shouldn't it?

According to the code:

	else if (!mddev->bitmap)
		j = mddev->recovery_cp;

if the array has a bitmap, then it starts from j==0.

>> 2) If there is a spare, it starts rebuilding the spare, and, as a
>> result, persists sb->resync_offset==sb->size in the superblock
>
> Hmm.. that's wrong too. It makes some sense for RAID5 as if there is a spare
> recovering we must assume the rest of the array is in-sync. But not for
> RAID6.
>
>> 3) Otherwise, it restarts the resync (due to a valid recovery_cp),
>> immediately finishes it, and sets recovery_cp=MaxSector (but does not
>> necessarily write the superblock right away).
>
> This bit seems right - assuming you are referring to the 'raid5, no spare'
> case.

Yes, and also to raid6 with two drives failed during resync.

>> > Maybe every time we update ->curr_resync_completed we should update
>> > ->recovery_cp as well if it is below the new ->curr_resync_completed??
>>
>> I'm not sure it will help a lot. Except for not loading part of the
>> bitmap (which I am unsure about), the real value of recovery_cp does
>> not really matter. Is this correct?
>
> It is stored in the metadata so if you stop the array in the middle of a
> resync, then start the array again, the resync starts from wherever it was up
> to. So having a correct value certainly is useful.
> (Having a value that is less than the correct value is not a big problem;
> having a value that is too big would be bad.)

Neil, as I pointed out earlier, it looks like resync always starts from j==0
if there is a bitmap. So there is no real importance to the exact value of
recovery_cp (I think). Only whether it is MaxSector or not. Am I missing
something?

>> I have another question about md_check_recovery(), this code:
>>	if (mddev->safemode &&
>>	    !atomic_read(&mddev->writes_pending) &&
>>	    !mddev->in_sync &&
>>	    mddev->recovery_cp == MaxSector) {
>>		mddev->in_sync = 1;
>> Why is mddev->recovery_cp == MaxSector checked here? This basically
>> means that if we have a valid recovery_cp, then in_sync will never be
>> set to 1, and this code:
>>	if (mddev->in_sync)
>>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>>	else
>>		sb->resync_offset = cpu_to_le64(0);
>> should always set resync_offset to 0. I saw that "in_sync" basically
>> tells whether there are pending/in-flight writes.
>
> If recovery_cp != MaxSector, then the array is not considered to be 'clean'
> so we don't want to mark it as 'clean'.
> On a graceful shutdown we do get in_sync to 1 - in __md_stop_writes.
> Maybe we could do it at other times if the array is idle - not sure.

It looks like I am missing something. If 'in_sync'==0, then resync_offset
will always be 0 in the superblock. If 'in_sync'==1, then recovery_cp will be
in the superblock. But the code in md_check_recovery() only sets in_sync=1 if
there is no recovery_cp. If there is a valid recovery_cp, then in_sync will
not be set to 1 and the superblock will always contain 0. That's why I am
wondering why we are checking recovery_cp here? To end up having a valid
recovery_cp in the superblock... we need recovery_cp==MaxSector. Strange?

Can you please answer two more questions?

1) When testing the sb->resync_offset patch with the latest version of mdadm,
some of my unit tests failed due to a failure in "--add". Debugging with
strace reveals the following: write_bitmap1() calls awrite() with a 4096-byte
buffer that was allocated on the stack in this function, and is not aligned.
However, the fd in use was opened with O_RDWR|O_EXCL|O_DIRECT. As a result,
write() fails with errno==EINVAL. This appears to be a regression at least vs
mdadm-3.1.5, in which an aligned buffer was used.
Portions of strace:

	open("/dev/sdj", O_RDWR|O_EXCL|O_DIRECT) = 4
	...
	ioctl(4, BLKSSZGET, 0x7fffa70b98d4) = 0
	write(4, "bitm\4\0\0\0G*8\34m\245\261\253\34\341\312\3\214S\203\247\200\0\0\0\0\0\0\0"...,
	512) = -1 EINVAL (Invalid argument)

The buffer pointer at this point is 0x7fffa70b9930 (clearly not aligned to
512). Do you want me to perhaps prepare a patch which simply __aligns__ the
buffer on the stack?

2) Why, in md_check_recovery(), is rebuilding a spare preferable to resyncing
the parity, as in this code:

	} else if ((spares = remove_and_add_spares(mddev))) {
		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
	} else if (mddev->recovery_cp < MaxSector) {
		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
		clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);

This leads to a situation in which the array "remembers" that it has to
complete the resync (in sb->resync_offset), but after the spare completes
rebuilding, the bitmap is cleared, and the subsequent resync starts and
immediately completes. Was that decision to prefer rebuild over resync
intentional?

Thanks for your help,
Alex.
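For reference, "aligning the buffer on the stack" for an O_DIRECT write
usually means over-allocating the buffer and rounding the pointer up to the
sector size, roughly as below. This is an illustrative sketch only; the
function name and the fixed 512-byte write size are assumptions, not the
actual write_bitmap1() code:

	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>

	/* Sketch: copy the caller's data into a 512-byte-aligned region of
	 * an over-allocated stack buffer so that a sector-sized O_DIRECT
	 * write does not fail with EINVAL. */
	static ssize_t write_aligned_512(int fd, const void *src, size_t len)
	{
		char raw[4096 + 512];
		char *buf = (char *)(((uintptr_t)raw + 511) & ~(uintptr_t)511);

		memset(buf, 0, 512);
		memcpy(buf, src, len);		/* len <= 512 assumed in this sketch */
		return write(fd, buf, 512);	/* aligned pointer, sector-sized length */
	}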
* Re: sb->resync_offset value after resync failure
From: NeilBrown @ 2012-02-07 0:55 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Mon, 6 Feb 2012 13:41:57 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Hi Neil,
>
> >> # After resync is aborted/interrupted, recovery_cp is updated (either
> >> to MaxSector or another value). However, the superblock is not updated
> >> at this point. If there is no additional activity on the array, the
> >> superblock will not be updated. I saw cases when resync completes
> >> fine, recovery_cp is set to MaxSector, but is not persisted in the
> >> superblock. If I crash the machine at this point, then after reboot
> >> the array is still considered dirty (has a valid resync_offset in
> >> the superblock). Is there an option to force the superblock update at this
> >> point?
> >
> > Maybe we just need to move the 'skip:' label to before:
> >
> >	set_bit(MD_CHANGE_DEVS, &mddev->flags);
> >
> > it is that flag which causes the superblock to be updated.
>
> Thanks for the hint, will try it. For production though I'm a bit
> afraid to recompile my own kernel at this point.
>
> >> # When resync aborts due to drive failure, then MD_RECOVERY_INTR is
> >> set, and sync_request() returns mddev->dev_sectors - sector_nr. As a
> >> result, recovery_cp is set to the device size, and that's what I see in my
> >> raid6 scenario. At this point three things can happen:
> >> 1) If there is an additional drive to resync (like in raid6), then the
> >> resync restarts (from sector 0)
> >
> > It should really just keep resyncing from where it is up to, shouldn't it?
>
> According to the code:
>	else if (!mddev->bitmap)
>		j = mddev->recovery_cp;
> if the array has a bitmap, then it starts from j==0.

Good point. When there is a bitmap the recovery_cp number is irrelevant as
the bitmap stores equivalent information with much finer detail.

> >> 2) If there is a spare, it starts rebuilding the spare, and, as a
> >> result, persists sb->resync_offset==sb->size in the superblock

I just looked into this and tested it and I don't agree.
->resync_offset does not get set to sb->size. It gets set to the value that
curr_resync got up to, which is correct ... or fairly close; it should really
be curr_resync_completed.

So I don't think there is a big problem there.

> > Hmm.. that's wrong too. It makes some sense for RAID5 as if there is a spare
> > recovering we must assume the rest of the array is in-sync. But not for
> > RAID6.
> >
> >> 3) Otherwise, it restarts the resync (due to a valid recovery_cp),
> >> immediately finishes it, and sets recovery_cp=MaxSector (but does not
> >> necessarily write the superblock right away).
> >
> > This bit seems right - assuming you are referring to the 'raid5, no spare'
> > case.
>
> Yes, and also to raid6 with two drives failed during resync.

Yes.

> >> > Maybe every time we update ->curr_resync_completed we should update
> >> > ->recovery_cp as well if it is below the new ->curr_resync_completed??
> >>
> >> I'm not sure it will help a lot. Except for not loading part of the
> >> bitmap (which I am unsure about), the real value of recovery_cp does
> >> not really matter. Is this correct?
> >
> > It is stored in the metadata so if you stop the array in the middle of a
> > resync, then start the array again, the resync starts from wherever it was up
> > to. So having a correct value certainly is useful.
> > (Having a value that is less than the correct value is not a big problem;
> > having a value that is too big would be bad.)
>
> Neil, as I pointed out earlier, it looks like resync always starts
> from j==0 if there is a bitmap. So there is no real importance to
> the exact value of recovery_cp (I think). Only whether it is MaxSector
> or not. Am I missing something?

Yes, when you have a bitmap recovery_cp is much less important.

> >> I have another question about md_check_recovery(), this code:
> >>	if (mddev->safemode &&
> >>	    !atomic_read(&mddev->writes_pending) &&
> >>	    !mddev->in_sync &&
> >>	    mddev->recovery_cp == MaxSector) {
> >>		mddev->in_sync = 1;
> >> Why is mddev->recovery_cp == MaxSector checked here? This basically
> >> means that if we have a valid recovery_cp, then in_sync will never be
> >> set to 1, and this code:
> >>	if (mddev->in_sync)
> >>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
> >>	else
> >>		sb->resync_offset = cpu_to_le64(0);
> >> should always set resync_offset to 0. I saw that "in_sync" basically
> >> tells whether there are pending/in-flight writes.
> >
> > If recovery_cp != MaxSector, then the array is not considered to be 'clean'
> > so we don't want to mark it as 'clean'.
> > On a graceful shutdown we do get in_sync to 1 - in __md_stop_writes.
> > Maybe we could do it at other times if the array is idle - not sure.
>
> It looks like I am missing something. If 'in_sync'==0, then
> resync_offset will always be 0 in the superblock. If 'in_sync'==1,
> then recovery_cp will be in the superblock. But the code in
> md_check_recovery() only sets in_sync=1 if there is no recovery_cp. If
> there is a valid recovery_cp, then in_sync will not be set to 1 and
> the superblock will always contain 0. That's why I am wondering why we
> are checking recovery_cp here? To end up having a valid recovery_cp in
> the superblock... we need recovery_cp==MaxSector. Strange?

->resync_offset is only really meaningful and useful after a clean shutdown
of the array. The clean-shutdown code path uses a different rule for setting
in_sync to 1. In that case you do get a useful value written to
->resync_offset.

> Can you please answer two more questions?
>
> 1) When testing the sb->resync_offset patch with the latest version of
> mdadm, some of my unit tests failed due to a failure in "--add".
> Debugging with strace reveals the following: write_bitmap1() calls
> awrite() with a 4096-byte buffer that was allocated on the stack in
> this function, and is not aligned. However, the fd in use was opened
> with O_RDWR|O_EXCL|O_DIRECT. As a result, write() fails with
> errno==EINVAL. This appears to be a regression at least vs mdadm-3.1.5,
> in which an aligned buffer was used.
> Portions of strace:
>	open("/dev/sdj", O_RDWR|O_EXCL|O_DIRECT) = 4
>	...
>	ioctl(4, BLKSSZGET, 0x7fffa70b98d4) = 0
>	write(4, "bitm\4\0\0\0G*8\34m\245\261\253\34\341\312\3\214S\203\247\200\0\0\0\0\0\0\0"...,
>	512) = -1 EINVAL (Invalid argument)
> The buffer pointer at this point is 0x7fffa70b9930 (clearly not aligned to 512).
> Do you want me to perhaps prepare a patch which simply __aligns__ the
> buffer on the stack?

Hmmm... I messed that up, didn't I?
I think I'd rather fix awrite so that it *always* does an aligned write.
I'll make a patch - thanks.

> 2) Why, in md_check_recovery(), is rebuilding a spare preferable to
> resyncing the parity, as in this code:
>	} else if ((spares = remove_and_add_spares(mddev))) {
>		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
>		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
>		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
>		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>	} else if (mddev->recovery_cp < MaxSector) {
>		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
>		clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>
> This leads to a situation in which the array "remembers" that it has to
> complete the resync (in sb->resync_offset), but after the spare completes
> rebuilding, the bitmap is cleared, and the subsequent resync starts and
> immediately completes. Was that decision to prefer rebuild over resync
> intentional?

Yes.
In most cases rebuilding the spare will bring the array back into sync anyway.
Also rebuilding a spare will do more to improve redundancy than a resync.

There might be room for improvement here for RAID6 - if the array is dirty
and degraded we should both resync and recover at the same time ... and it is
possible that we already do. I'd need to look more closely at the code to be
sure.

Thanks,
NeilBrown
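One way awrite() can "always do an aligned write" is to bounce the caller's
buffer through sector-aligned scratch storage, roughly as below. This is a
sketch of the idea only; the real mdadm awrite() has a different interface,
and this assumes writes of at most 4096 bytes:

	#include <string.h>
	#include <unistd.h>

	/* Sketch: always issue the write from an aligned bounce buffer, so
	 * callers may pass unaligned memory even on an O_DIRECT fd.  The
	 * length must still be a multiple of the sector size. */
	static ssize_t awrite_sketch(int fd, const void *buf, size_t len)
	{
		static char bounce[4096] __attribute__((aligned(4096)));

		if (len > sizeof(bounce))
			return -1;	/* keep the sketch simple */
		memcpy(bounce, buf, len);
		return write(fd, bounce, len);
	}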
* Re: sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-02-07 9:11 UTC
To: NeilBrown; +Cc: linux-raid

Hi,

>> >> 2) If there is a spare, it starts rebuilding the spare, and, as a
>> >> result, persists sb->resync_offset==sb->size in the superblock
>
> I just looked into this and tested it and I don't agree.
> ->resync_offset does not get set to sb->size. It gets set to the value that
> curr_resync got up to, which is correct ... or fairly close; it should really
> be curr_resync_completed.
>
> So I don't think there is a big problem there.

In the scenario I mentioned, I clearly end up having
sb->resync_offset==sb->size (and it actually remains this way until the spare
rebuild completes and the "empty resync" completes). But I am using 2.6.38-8;
perhaps in your kernel this does not happen anymore. If you're interested in
investigating this further, I can send you a repro script.

Thanks for your help!