* fs corruption recovery
From: Allison Henderson @ 2015-03-19 0:56 UTC
To: linux-ext4; +Cc: jane, marcel.dufour

Hi all,

I've had some internal folks contact me for help with some customers that are having file system corruption woes. It's been so long since I've done any work on the ext3/4 code that it's hard for me to advise, so I told them I would run the situation by the folks on these mailing lists to see if I can generate some more ideas for them.

They have a 17 TB ext3 file system on RHEL 6.5. Upon reboot, the system was not able to come up and reported errors with the superblock. Right now, getting the machine to boot is not as critical as recovering the customer data. They are able to boot a rescue disk to run fsck, and they report that it ran for a short while and showed a lot of inode errors, but eventually it segfaulted. They can re-run the tool, and they were able to progress further on repeated runs, but they do not seem to be able to get further than about 75%.

They do not have the fsck core at this point in time, but I'm guessing the tool is running out of memory on a file system that large, and they say they are using an old fsck (from 2010). They report having run fsck successfully on large file systems in the past, but normally the machine has 24GB, and this one has only 16GB due to a bad DIMM. The plan at the moment is for them to fix the bad DIMM and try the latest fsck.

So the question they had, and that I am hoping to get help with, is: are there any other options they can try for data recovery? I am hoping that the extra memory and the updated fsck might be able to complete, but I'm not sure what has changed in the tool since then. I can assist them in collecting more information/cores. Any help is appreciated! Thx!

Allison Henderson

* Re: fs corruption recovery
From: Andreas Dilger @ 2015-03-19 0:59 UTC
To: Allison Henderson
Cc: linux-ext4@vger.kernel.org, jane@us.ibm.com, marcel.dufour@ca.ibm.com

I think that running a 17TB filesystem on ext3 is a recipe for disaster. They should use ext4 for anything larger than 16TB.

Upgrading e2fsprogs to the latest 1.42.12 is also strongly advised.

Cheers, Andreas

* Re: fs corruption recovery
From: Eric Sandeen @ 2015-03-19 21:52 UTC
To: Andreas Dilger, Allison Henderson
Cc: linux-ext4@vger.kernel.org, jane@us.ibm.com, marcel.dufour@ca.ibm.com

On 3/18/15 7:59 PM, Andreas Dilger wrote:
> I think that running a 17TB filesystem on ext3 is a recipe for disaster. They should use ext4 for anything larger than 16TB.

Not only that - impossible, unless you have > 4k page sizes and > 4k blocks.

# mkfs.ext3: Size of device fsfile too big to be expressed in 32 bits using a blocksize of 4096.

Are they doing something clever on PPC w/ 64k blocks?

-Eric
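
For reference, the 32-bit limit behind the mkfs.ext3 message above can be checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this note, and it assumes the block count simply has to fit in an unsigned 32-bit integer. It also shows why the units matter, since a "17 TB" figure quoted in decimal terabytes is smaller than 16 TiB.

```python
# Rough check of the 32-bit block-number ceiling discussed in this thread.
# Assumption: the filesystem's block count must fit in an unsigned 32-bit
# integer. Function names are hypothetical, chosen for this sketch only.
BLOCK_NUMBER_BITS = 32

TIB = 2 ** 40   # binary terabyte (tebibyte)
TB = 10 ** 12   # decimal terabyte


def max_fs_bytes(block_size: int) -> int:
    """Approximate upper bound on filesystem size for a given block size."""
    return (2 ** BLOCK_NUMBER_BITS) * block_size


def fits_in_32_bit_blocks(fs_bytes: int, block_size: int) -> bool:
    """True if the number of blocks needed fits in a 32-bit block count."""
    blocks_needed = -(-fs_bytes // block_size)  # ceiling division
    return blocks_needed <= 2 ** BLOCK_NUMBER_BITS - 1


if __name__ == "__main__":
    for bs in (4096, 16384, 65536):
        print(f"{bs:>6}-byte blocks: max about {max_fs_bytes(bs) // TIB} TiB")
    print("17 TiB with 4 KiB blocks fits:", fits_in_32_bit_blocks(17 * TIB, 4096))
    print("17 TB  with 4 KiB blocks fits:", fits_in_32_bit_blocks(17 * TB, 4096))
```

With 4 KiB blocks the ceiling works out to 16 TiB, so whether the reported size can be real depends on the block size and on whether it was measured in binary or decimal units.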

* Re: fs corruption recovery
From: Theodore Ts'o @ 2015-03-20 1:47 UTC
To: Andreas Dilger
Cc: Allison Henderson, linux-ext4@vger.kernel.org, jane@us.ibm.com, marcel.dufour@ca.ibm.com

On Wed, Mar 18, 2015 at 06:59:52PM -0600, Andreas Dilger wrote:
> I think that running a 17TB filesystem on ext3 is a recipe for disaster. They should use ext4 for anything larger than 16TB.

It's not *possible* to have a 17TB file system with ext3. Something must be very wrong there. 16TB is the maximum you can have before you end up overflowing a 32-bit block number. Unless this is a PowerPC with a 16K block size or some such?

If e2fsck is segfaulting, then I would certainly try getting the latest version of e2fsprogs, just in case the problem isn't just that it's running out of memory. Also if recovering customer data is the most important thing, the first thing they should do is make an image copy of the file system, since it's possible that incorrect use of e2fsck, or an old/buggy version of e2fsck could make things work.

In particular, if they are seeing errors with multiply claimed inodes, it's likely that part of the inode table was written to the wrong place, and sometimes a skilled human being can get more data than simply using e2fsck -y and praying. At the end of the day the question is how much is the customer data worth and how much effort is the customer / IBM willing to invest in trying to get every last bit of data back?

- Ted
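
The "image copy first" advice above amounts to copying the device block for block before any repair tool writes to it. In practice one would normally reach for an existing tool such as dd or GNU ddrescue; the Python sketch below only illustrates the idea. The function name and the zero-fill policy for unreadable chunks are assumptions made for this note, not anything proposed in the thread.

```python
import os


def copy_image(src_path: str, dst_path: str, chunk_size: int = 1024 * 1024) -> int:
    """Copy src_path to dst_path chunk by chunk.

    Chunks that cannot be read are replaced by zeros so the copy keeps its
    original offsets. Returns the number of bytes that were zero-filled.
    """
    bad_bytes = 0
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)  # total size of the file or device
        src.seek(0)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = src.read(want)
            except OSError:
                # Unreadable region: keep the layout, note the loss, move on.
                data = b"\0" * want
                bad_bytes += want
                src.seek(offset + want)
            if not data:
                break  # unexpected end of input
            dst.write(data)
            offset += len(data)
    return bad_bytes
```

The destination needs at least as much free space as the source, and all further experiments (including e2fsck runs) would then be made against the copy rather than the original.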

* Re: fs corruption recovery
From: Allison Henderson @ 2015-03-20 5:47 UTC
To: Theodore Ts'o, Andreas Dilger
Cc: linux-ext4@vger.kernel.org, jane@us.ibm.com, marcel.dufour@ca.ibm.com

Hi all,

Sorry for the delay, our email servers went down for a bit after I sent the email. I will work with Marcel to find the block size, page size and arch. It is my understanding that they have a contract with this customer to maintain this data, so there is pressure to recover it. Unfortunately the product mirrored the fs corruption to the backup device before the corruption was discovered. I've been told that I was the only person they could find left that had some background with ext3/4, so I have an inkling that the "skilled human being" might end up being me, even though it's been a while since I've worked with it. :-) Maybe I could poke into the inode table and see what I can figure out. We will be sure to make image backups though. Thx a bunch for the feedback, we really appreciate the help! I will keep folks updated when I have more info. Thx!

Allison Henderson

* Re: fs corruption recovery
From: Darrick J. Wong @ 2015-03-20 18:45 UTC
To: Allison Henderson
Cc: Theodore Ts'o, Andreas Dilger, linux-ext4@vger.kernel.org, jane@us.ibm.com, marcel.dufour@ca.ibm.com

On Thu, Mar 19, 2015 at 10:47:17PM -0700, Allison Henderson wrote:
> On 03/19/2015 06:47 PM, Theodore Ts'o wrote:
> >If e2fsck is segfaulting, then I would certainly try getting the
> >latest version of e2fsprogs, just in case the problem isn't just that
> >it's running out of memory. Also if recovering customer data is the
> >most important thing, the first thing they should do is make an image
> >copy of the file system, since it's possible that incorrect use of
> >e2fsck, or an old/buggy version of e2fsck could make things work.

...make things *worse*.

> Hi all,
>
> Sorry for the delay, our email servers went down for a bit after I
> sent the email. I will work with Marcel to find the block size,
> page size and arch. It is my understanding that they have a

Just guessing PPC, in which case you'll really want an e2fsck released after the giant heaps of bugfixes I've sent over the last year. There were a lot of bugs that only show up on big-endian systems, which probably don't get much testing nowadays.

Even if it's a 17179869184 byte ext3 FS on x86, you're probably still better off with a less buggy e2fsck. There are a number of fixes to prevent the crosslinked file fixer and the directory fixer from doing insane things to the FS.

> contract with this customer to maintain this data, so there is
> pressure to recover it. Unfortunately the product mirrored the fs
> corruption to the backup device before the corruption was
> discovered. I've been told that I was the only person they could
> find left that had some background with ext3/4, so I have an inkling

Yep. ;)

> that the "skilled human being" might end up being me, even though
> it's been a while since I've worked with it. :-) Maybe I could poke
> into the inode table and see what I can figure out. We will be sure
> to make image backups though. Thx a bunch for the feedback, we
> really appreciate the help! I will keep folks updated when I have
> more info. Thx!

If you have LVM or other volume management, please take a snapshot and fsck the snapshot first, so you can capture a log of what happens without blasting away at existing data.

--D
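
The snapshot-first suggestion above could look roughly like the following. This is a sketch under stated assumptions: the volume path /dev/vg0/data, the snapshot name and the 50G snapshot size are hypothetical placeholders, and the helper only builds and prints the command lines so they can be reviewed before anything is run.

```python
import shlex


def snapshot_fsck_plan(origin_lv, snap_name="fsck_snap", snap_size="50G"):
    """Return the commands (as argv lists) for a snapshot-first e2fsck run.

    origin_lv, snap_name and snap_size are hypothetical placeholders; the
    snapshot must be large enough to absorb everything e2fsck writes.
    """
    vg_dir = origin_lv.rsplit("/", 1)[0]   # e.g. /dev/vg0
    snap_dev = f"{vg_dir}/{snap_name}"     # e.g. /dev/vg0/fsck_snap
    return [
        # Copy-on-write snapshot: repairs land in the snapshot, not the origin.
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap_name, origin_lv],
        # Forced check answering "yes" to every question, on the snapshot only.
        ["e2fsck", "-f", "-y", snap_dev],
    ]


if __name__ == "__main__":
    # Print the plan for review instead of executing it.
    for argv in snapshot_fsck_plan("/dev/vg0/data"):
        print(shlex.join(argv))
```

Redirecting the e2fsck output to a file on a different disk gives the log of what the repair would do, and the snapshot can be discarded afterwards if the result looks wrong.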