* MDADM RAID 6 Bad Superblock after reboot
From: Sean R. Funk @ 2017-10-18 18:14 UTC
To: linux-raid

Hi there,

After attempting to add a GPU to a VM running on a CentOS 7 KVM host I have,
the machine forcibly rebooted. Upon reboot, my /dev/md0 RAID 6 XFS array
would not start.

Background:

Approximately 3 weeks ago I added 3 additional 3 TB HDDs to my existing
5-disk array, and grew it using the *raw* disks as opposed to the
partitions. Everything appeared to be working fine (raw disks were my
mistake; it had been a year since I last expanded this array and I simply
forgot the steps) until last night. When I added the GPU via VMM, the host
itself rebooted.

Unfortunately, the machine has no network access at the moment and I can
only provide pictures of the text displayed on the screen. The system is
booting into emergency mode, and it is failing because the /dev/md0 array
isn't starting (and then NFS fails, etc.).

Smartctl shows no errors on any of the disks, and mdadm --examine shows no
superblocks on the 3 disks I added. The array is in the inactive state, and
it shows only 5 disks.

To add to that, I had apparently grown the array while SELinux was enforcing
rather than permissive - so there was an audit log entry of mdadm trying to
modify /etc/mdadm.conf. I'm guessing it was trying to update the
configuration file to reflect the new drive configuration.

Smartctl shows each drive is fine, and the first 5 drives have equal event
counts. I'm presuming the data is all still intact.

Any advice on how to proceed? Thanks!
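For context, the kind of expansion described above - adding three disks as
spares and reshaping the RAID 6 from 5 to 8 devices - is normally done with
something like the commands below. The device names, the use of whole disks,
and the mount point are illustrative assumptions, not commands taken from
Sean's actual history:

    # add the three new disks, then reshape the array to use them
    mdadm --add /dev/md0 /dev/sdf /dev/sdg /dev/sdh
    mdadm --grow /dev/md0 --raid-devices=8
    # watch reshape progress
    cat /proc/mdstat
    # afterwards, grow the XFS filesystem (it must be mounted)
    xfs_growfs /srv/storage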
* Re: MDADM RAID 6 Bad Superblock after reboot
From: Wols Lists @ 2017-10-18 19:40 UTC
To: Sean R. Funk, linux-raid

On 18/10/17 19:14, Sean R. Funk wrote:
>
> Hi there,

Hi,

First responding ...

> After attempting to add a GPU to a VM running on a CentOS 7 KVM host I
> have, the machine forcibly rebooted. Upon reboot, my /dev/md0 RAID 6 XFS
> array would not start.
>
> Background:
>
> Approximately 3 weeks ago I added 3 additional 3 TB HDDs to my existing
> 5-disk array, and grew it using the *raw* disks as opposed to the
> partitions. Everything appeared to be working fine (raw disks were my
> mistake; it had been a year since I last expanded this array and I simply
> forgot the steps) until last night. When I added the GPU via VMM, the
> host itself rebooted.

Raw disks shouldn't make any difference - mdadm/raid couldn't care less.
Mixing is not recommended primarily because it confuses the sysadmin - not
a good idea.

> Unfortunately, the machine has no network access at the moment and I can
> only provide pictures of the text displayed on the screen. The system is
> booting into emergency mode, and it is failing because the /dev/md0 array
> isn't starting (and then NFS fails, etc.).

I'm guessing :-) that this means the array is degraded, so it won't
assemble/run, and that is what is knocking out the system.

> Smartctl shows no errors on any of the disks, and mdadm --examine shows
> no superblocks on the 3 disks I added. The array is in the inactive
> state, and it shows only 5 disks.

What does --detail tell us about the array?

> To add to that, I had apparently grown the array while SELinux was
> enforcing rather than permissive - so there was an audit log entry of
> mdadm trying to modify /etc/mdadm.conf. I'm guessing it was trying to
> update the configuration file to reflect the new drive configuration.

Are you sure the three drives were added? SELinux has a habit of causing
havoc. Did the available space on the array increase? Did you check?

> Smartctl shows each drive is fine, and the first 5 drives have equal
> event counts. I'm presuming the data is all still intact.
>
> Any advice on how to proceed? Thanks!

Firstly, make sure SELinux didn't interfere with the grow. My guess is that
the add failed because SELinux blocked it, and in reality you've still got
a five-drive array; it just thinks it's an eight-drive array, so when the
system rebooted it said "five drives out of eight? Not enough!" and stopped.

The experts will chime in with more info, but (a) don't do anything that
alters the disks, and (b) investigate that scenario, i.e. that SELinux
prevented the grow from actually occurring.

If I'm right, recovery is hopefully a simple matter of disabling SELinux
and re-assembling the array, either reverting the grow or firing it off so
it can actually run and complete. It certainly doesn't look like a
disastrous scenario at the moment.

Cheers,
Wol
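To answer Wol's questions about the array's state, the usual read-only
checks would be something along these lines (the mount point is just a
placeholder, not known from the thread):

    cat /proc/mdstat            # device count, [UU_...] status, any reshape progress
    mdadm --detail /dev/md0     # "Raid Devices", "Array Size", per-member state
    df -h /srv/storage          # did the filesystem actually get the extra space?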
* Re: MDADM RAID 6 Bad Superblock after reboot
From: sfunk1x @ 2017-10-19  1:52 UTC
To: Wols Lists, linux-raid

On 10/18/2017 12:40 PM, Wols Lists wrote:
> What does --detail tell us about the array?

https://imgur.com/a/Own0W

Apologies for the imgur link, but this was the easiest way to communicate
--detail.

> Are you sure the three drives were added? SELinux has a habit of causing
> havoc. Did the available space on the array increase? Did you check?

Yeah. The array took about 17 hours to reshape with the three drives added
(I had expected over 24, as that had been my experience when adding the 5th
drive long ago), and I immediately started using the extra space. The extra
8+ TB showed up in --detail as well as in df, and my guests could see the
extra space.

The SELinux audit log, however, was very clear about mdadm not being able
to edit the conf. And it's true - the conf file did not have the extra
drives added. I've since audited and applied a rule to allow editing of the
conf file, but the system is currently in permissive mode until the array
is back online. I can disable SELinux entirely if needed.

> Firstly, make sure SELinux didn't interfere with the grow. My guess is
> that the add failed because SELinux blocked it, and in reality you've
> still got a five-drive array; it just thinks it's an eight-drive array,
> so when the system rebooted it said "five drives out of eight? Not
> enough!" and stopped.

I could see this being the case - on reboot, the configuration would
specify 5 drives instead of 8. In addition, the system was not rebooted
after the array had been grown - I just kept it running and put it to work.
Lesson learned.

> If I'm right, recovery is hopefully a simple matter of disabling SELinux
> and re-assembling the array, either reverting the grow or firing it off
> so it can actually run and complete.

Thanks for the vote of confidence in not losing data here. As I mentioned
above, I've set SELinux into permissive mode. I'm sort of at a loss as to
what to do next. Since sd[fgh] don't have any superblock data, can I try to
bring the array back online with 5 drives, then re-add the last three, in
the hope that they sync? There obviously haven't been any writes to the
last three drives since the hard system reboot, so I'd hope their event
counts would be in sync?

Thanks,
-Sean
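For what it's worth, the usual way to confirm and work around an SELinux
denial like the one described is roughly the following. This is a generic
recipe, not the exact steps Sean took, and the policy module name is
arbitrary:

    getenforce                                    # Enforcing / Permissive
    setenforce 0                                  # permissive until the next boot
    ausearch -m avc -c mdadm                      # show the denials logged for mdadm
    ausearch -m avc -c mdadm | audit2allow -M my-mdadm
    semodule -i my-mdadm.pp                       # install the generated policy module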
* Re: MDADM RAID 6 Bad Superblock after reboot
From: Wols Lists @ 2017-10-19 10:58 UTC
To: sfunk1x, linux-raid

On 19/10/17 02:52, sfunk1x wrote:
> Thanks for the vote of confidence in not losing data here. As I mentioned
> above, I've set SELinux into permissive mode. I'm sort of at a loss as to
> what to do next. Since sd[fgh] don't have any superblock data, can I try
> to bring the array back online with 5 drives, then re-add the last three,
> in the hope that they sync? There obviously haven't been any writes to
> the last three drives since the hard system reboot, so I'd hope their
> event counts would be in sync?

Okay. The first thing I'd try is to temporarily get rid of mdadm.conf - you
don't actually need it. Then do "mdadm --stop /dev/md0" followed by
"mdadm --assemble --scan". That tells mdadm to make a best effort to get
everything running.

*Always* do an "mdadm --stop" between attempts, because the quickest way to
mess up a recovery is to have the wreckage of a previous attempt lying
around.

If that doesn't work, then try assembling the array explicitly, i.e.
listing all the drives. I can't remember whether order is important, but so
long as you don't specify --force it will either work or fail; it won't do
any damage.

Beyond this point I'm getting out of my depth - I don't have the broad
experience - so I'll have to step back and let others chime in, but there's
a good chance this will get your array back.

Cheers,
Wol
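Spelled out, the sequence Wol is suggesting would look roughly like the
sketch below. Which names to list for the three new members (whole disks or
first partitions) depends on how they were actually added, so both variants
are shown as guesses:

    mv /etc/mdadm.conf /etc/mdadm.conf.bak    # set it aside, don't delete it
    mdadm --stop /dev/md0
    mdadm --assemble --scan
    # if the scan doesn't work, stop again and list the members explicitly:
    mdadm --stop /dev/md0
    mdadm --assemble /dev/md0 /dev/sd[abcde]1 /dev/sd[fgh]     # if whole disks were used
    mdadm --assemble /dev/md0 /dev/sd[abcde]1 /dev/sd[fgh]1    # if partitions were used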
* Re: MDADM RAID 6 Bad Superblock after reboot
From: Sean R. Funk @ 2017-10-19 14:11 UTC
To: Wols Lists, linux-raid

On October 19, 2017 3:58:48 AM PDT, Wols Lists <antlists@youngman.org.uk> wrote:
>
> Okay. The first thing I'd try is to temporarily get rid of mdadm.conf -
> you don't actually need it. Then do "mdadm --stop /dev/md0" followed by
> "mdadm --assemble --scan".

OK, so doing the stop and then the assemble scan (after moving mdadm.conf
out of the way) gets me:

    /dev/md/0 assembled from 5 drives - not enough to start the array.
    No arrays found in config file or automatically.

> If that doesn't work, then try assembling the array explicitly, i.e.
> listing all the drives. I can't remember whether order is important, but
> so long as you don't specify --force it will either work or fail; it
> won't do any damage.

Looks like it failed. I ran:

    mdadm --assemble /dev/md0 /dev/sd[abcde]1 /dev/sd[fgh]

Due to having to use a cell phone to respond I didn't type everything out
here, but that's the idea. Running that, I got:

    No super block found on sdf (expected magic a92b4efc, got 00000000)
    No RAID superblock on sdf
    sdf has no superblock - assembly aborted

This is where I've been stuck. I read a post a couple of nights ago about
zeroing the superblock on drives that report no superblock, but it seems
they are already zeroed? My lack of experience here isn't helping.
* Re: MDADM RAID 6 Bad Superblock after reboot
From: Wols Lists @ 2017-10-19 14:28 UTC
To: Sean R. Funk, linux-raid, Phil Turmel, NeilBrown

On 19/10/17 15:11, Sean R. Funk wrote:
> This is where I've been stuck. I read a post a couple of nights ago about
> zeroing the superblock on drives that report no superblock, but it seems
> they are already zeroed? My lack of experience here isn't helping.

Bummer. This is where I'll have to leave it. Did SELinux prevent the
superblocks from being written? But in that case it should also have
prevented them from being updated. I'm out of my depth.

There's some information out there about doing diagnostics with hexdump:

https://raid.wiki.kernel.org/index.php/Advanced_data_recovery

That page is currently a work in progress, but maybe it'll give you a few
clues. You'll probably be asked to boot into a rescue CD, if you can, and
you'll need to have a copy of the latest mdadm on hand. Any chance of
running lsdrv from a rescue CD?

What I will say is: do NOT do anything that will write to the drives unless
an expert is guiding you. You've had a reshape happen, and that will have
moved data around on the disks. If you destroy the existing superblocks
they will be a *nightmare* to recreate, as things are no longer in the
default locations.

Best of luck.

Cheers,
Wol
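One common way to follow that "don't write to the drives" advice while
still experimenting - a technique often suggested on this list - is to put
a copy-on-write overlay over each member and assemble from the overlays.
A rough single-disk sketch, with the file size, names, and loop device all
chosen arbitrarily:

    truncate -s 10G /tmp/sdf.ovl      # sparse file to absorb any writes
    losetup /dev/loop0 /tmp/sdf.ovl
    dmsetup create sdf_ovl --table \
      "0 $(blockdev --getsz /dev/sdf) snapshot /dev/sdf /dev/loop0 N 8"
    # experiment against /dev/mapper/sdf_ovl instead of /dev/sdf;
    # tear down with: dmsetup remove sdf_ovl; losetup -d /dev/loop0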
* Re: MDADM RAID 6 Bad Superblock after reboot
From: NeilBrown @ 2017-10-19 21:17 UTC
To: sfunk1x, Wols Lists, linux-raid

On Wed, Oct 18 2017, sfunk1x wrote:

> On 10/18/2017 12:40 PM, Wols Lists wrote:
>> What does --detail tell us about the array?
>
> https://imgur.com/a/Own0W
>
> Apologies for the imgur link, but this was the easiest way to communicate
> --detail.

Anything is better than nothing. Can we get a photo of "mdadm --examine"
output on one of the devices (e.g. /dev/sda1)?

Also, what mdadm version (mdadm -V) and kernel version (uname -r)?

>> Are you sure the three drives were added? SELinux has a habit of causing
>> havoc. Did the available space on the array increase? Did you check?
>
> Yeah. The array took about 17 hours to reshape with the three drives
> added (I had expected over 24, as that had been my experience when adding
> the 5th drive long ago), and I immediately started using the extra space.
> The extra 8+ TB showed up in --detail as well as in df, and my guests
> could see the extra space.
>
> The SELinux audit log, however, was very clear about mdadm not being able
> to edit the conf. And it's true - the conf file did not have the extra
> drives added. I've since audited and applied a rule to allow editing of
> the conf file, but the system is currently in permissive mode until the
> array is back online. I can disable SELinux entirely if needed.

mdadm never tries to edit mdadm.conf. It does modify files in /run/mdadm
(or maybe /var/run/mdadm). Can we get a photo of that audit log?

I'm very suspicious of these new drives appearing to have no metadata.
Can you run

    od -x /dev/sdf | head
    od -x /dev/sdf | grep a92b

and provide a photo of the output?

NeilBrown
* Re: MDADM RAID 6 Bad Superblock after reboot
From: sfunk1x @ 2017-10-20  0:43 UTC
To: NeilBrown, Wols Lists, linux-raid

On 10/19/2017 02:17 PM, NeilBrown wrote:

> Anything is better than nothing. Can we get a photo of "mdadm --examine"
> output on one of the devices (e.g. /dev/sda1)?
>
> Also, what mdadm version (mdadm -V) and kernel version (uname -r)?

https://imgur.com/x2rZEl5

I included an --examine of sda1 and sdf.

> mdadm never tries to edit mdadm.conf. It does modify files in /run/mdadm
> (or maybe /var/run/mdadm). Can we get a photo of that audit log?

I was mistaken, I think. It looks like mdadm tried to read (??) the file
and the SELinux context on the file was not set properly:

https://imgur.com/5UvFExp

In order to "fix" the issue, I ran sealert against the audit.log and
followed its instructions, producing a rules file.

> I'm very suspicious of these new drives appearing to have no metadata.
> Can you run
>
>     od -x /dev/sdf | head
>     od -x /dev/sdf | grep a92b
>
> and provide a photo of the output?

grep'ing for a92b scrolled off the screen, so I grabbed a portion of it,
Ctrl-C'd it, and included the head output:

https://imgur.com/WDcAYBc

Thanks for the help so far.
* Re: MDADM RAID 6 Bad Superblock after reboot
From: NeilBrown @ 2017-10-20  1:50 UTC
To: sfunk1x, Wols Lists, linux-raid

On Thu, Oct 19 2017, sfunk1x wrote:

> On 10/19/2017 02:17 PM, NeilBrown wrote:
>
>> Anything is better than nothing. Can we get a photo of "mdadm --examine"
>> output on one of the devices (e.g. /dev/sda1)?
>>
>> Also, what mdadm version (mdadm -V) and kernel version (uname -r)?
>
> https://imgur.com/x2rZEl5
>
> I included an --examine of sda1 and sdf.

Thanks. sdf looks like it might have a GPT partition table on it; the
"od -x" output supports that. What does "fdisk -l /dev/sdf" show? What
about

    mdadm --examine /dev/sdf*

?

>> mdadm never tries to edit mdadm.conf. It does modify files in /run/mdadm
>> (or maybe /var/run/mdadm). Can we get a photo of that audit log?
>
> I was mistaken, I think. It looks like mdadm tried to read (??) the file
> and the SELinux context on the file was not set properly:
>
> https://imgur.com/5UvFExp

As you say, it was reading. That makes sense, but it shouldn't be fatal.

> In order to "fix" the issue, I ran sealert against the audit.log and
> followed its instructions, producing a rules file.
>
>> I'm very suspicious of these new drives appearing to have no metadata.
>> Can you run
>>
>>     od -x /dev/sdf | head
>>     od -x /dev/sdf | grep a92b
>>
>> and provide a photo of the output?
>
> grep'ing for a92b scrolled off the screen, so I grabbed a portion of it,
> Ctrl-C'd it, and included the head output:

I should have suggested "| head" at the end of the 'grep' command. Maybe
make it

    od -x /dev/sdf | grep '4efc a92b' | head

That will look specifically for md metadata.

Thanks,
NeilBrown

> https://imgur.com/WDcAYBc
>
> Thanks for the help so far.
* Re: MDADM RAID 6 Bad Superblock after reboot
From: sfunk1x @ 2017-10-22 21:00 UTC
To: NeilBrown, Wols Lists, linux-raid

On 10/19/2017 06:50 PM, NeilBrown wrote:

> Thanks. sdf looks like it might have a GPT partition table on it; the
> "od -x" output supports that. What does "fdisk -l /dev/sdf" show? What
> about
>
>     mdadm --examine /dev/sdf*
>
> ?

fdisk -l /dev/sdf:

Disk /dev/sdf: 3000.6 GB, 3000592982016 bytes, 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: 5B742241-93F1-4FE1-9CA6-525F759E38B1

#         Start          End    Size  Type            Name
 1         2048   5860533134    2.7T  Linux RAID

mdadm --examine /dev/sdf*:

/dev/sdf:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
mdadm: No md superblock detected on /dev/sdf1.

> I should have suggested "| head" at the end of the 'grep' command. Maybe
> make it
>
>     od -x /dev/sdf | grep '4efc a92b' | head
>
> That will look specifically for md metadata.

It's actually still running, but this is the output so far:

[root@hemlock ~]# od -x /dev/sdf | grep '4efc a92b' | head
142401762260 7984 d854 643a 3c8f dd3a a8de 4efc a92b
156441541400 48f9 9523 3539 de0a d1f4 4efc a92b 60a8
214717071120 ac68 a62e 441c c0f8 85c1 cab7 4efc a92b
367526264660 dc8f 4f52 4efc a92b fd99 4744 1d6e c59f
515240166540 4f6a c1eb 5309 4efc a92b d0ad b3ee 11a0
575452271660 b669 7eeb 4efc a92b ebf5 c069 c78d 82d3
1026007653000 913f 4efc a92b 0b3d 5e94 9f7a 80a5 6e0c
1104130763160 7267 f0e5 e9a4 a9e0 0b85 5b3b 4efc a92b
1107535224640 cde4 efa8 557c 01a1 1abb b885 4efc a92b
1167200747300 49dd f3e2 4efc a92b a6e7 f1af 7af1 e6c6
* Re: MDADM RAID 6 Bad Superblock after reboot
From: NeilBrown @ 2017-10-22 22:40 UTC
To: sfunk1x, Wols Lists, linux-raid

On Sun, Oct 22 2017, sfunk1x wrote:

> On 10/19/2017 06:50 PM, NeilBrown wrote:
>
>> Thanks. sdf looks like it might have a GPT partition table on it; the
>> "od -x" output supports that. What does "fdisk -l /dev/sdf" show? What
>> about
>>
>>     mdadm --examine /dev/sdf*
>>
>> ?
>
> fdisk -l /dev/sdf:
>
> Disk /dev/sdf: 3000.6 GB, 3000592982016 bytes, 5860533168 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
> Disk label type: gpt
> Disk identifier: 5B742241-93F1-4FE1-9CA6-525F759E38B1
>
> #         Start          End    Size  Type            Name
>  1         2048   5860533134    2.7T  Linux RAID

This seems to suggest that you did prepare to use sdf1 (rather than sdf) in
the array, but it isn't conclusive proof.

> mdadm --examine /dev/sdf*:
>
> /dev/sdf:
>    MBR Magic : aa55
> Partition[0] :   4294967295 sectors at            1 (type ee)
> mdadm: No md superblock detected on /dev/sdf1.

No metadata on sdf1 - sad. That is where I was hoping to find it.

>> I should have suggested "| head" at the end of the 'grep' command. Maybe
>> make it
>>
>>     od -x /dev/sdf | grep '4efc a92b' | head
>>
>> That will look specifically for md metadata.
>
> It's actually still running, but this is the output so far:
>
> [root@hemlock ~]# od -x /dev/sdf | grep '4efc a92b' | head
> 142401762260 7984 d854 643a 3c8f dd3a a8de 4efc a92b
> 156441541400 48f9 9523 3539 de0a d1f4 4efc a92b 60a8
> 214717071120 ac68 a62e 441c c0f8 85c1 cab7 4efc a92b
> 367526264660 dc8f 4f52 4efc a92b fd99 4744 1d6e c59f
> 515240166540 4f6a c1eb 5309 4efc a92b d0ad b3ee 11a0
> 575452271660 b669 7eeb 4efc a92b ebf5 c069 c78d 82d3
> 1026007653000 913f 4efc a92b 0b3d 5e94 9f7a 80a5 6e0c
> 1104130763160 7267 f0e5 e9a4 a9e0 0b85 5b3b 4efc a92b
> 1107535224640 cde4 efa8 557c 01a1 1abb b885 4efc a92b
> 1167200747300 49dd f3e2 4efc a92b a6e7 f1af 7af1 e6c6

No md metadata to be found at all. We would need to see a line with an
address ending in "000" and data starting

    4efc a92b 0001 0000 ...

Nothing like that here.

Maybe this isn't really the drive you think it is? Could someone or
something have swapped drives? Maybe you or someone zeroed the metadata
(mdadm --zero-superblock /dev/sdf1)?

There is no way that md or mdadm could have misbehaved, even when affected
by SELinux, to produce this result. Something else must have happened -
maybe something you don't know about, or something you don't think is
relevant. But there must be something.

Sorry I cannot be more helpful.

NeilBrown
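To make the expected pattern concrete: with v1.2 metadata (the usual
default on a system of this vintage - an assumption here, since the
--examine output is only visible in the photos) the superblock lives 4 KiB
from the start of the member device, so a targeted look at a known-good
member - device name illustrative - would be:

    od -x -j 4096 -N 32 /dev/sda1

and the first output line should begin with the magic and version words
Neil describes, i.e. "4efc a92b 0001 0000", at byte offset 4096 (0010000 in
od's default octal addressing). The magic is 0xa92b4efc stored
little-endian, and od -x prints 16-bit words, which is why it appears as
the two words "4efc a92b" in the dump. The absence of any such line
anywhere on /dev/sdf is what makes the zeroed-metadata theory plausible.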