* Corrupted files
@ 2014-09-09 15:21 Leslie Rhorer
2014-09-09 15:50 ` Sean Caron
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-09 15:21 UTC (permalink / raw)
To: xfs
Hello,
I have an issue with my primary RAID array. I have 13T of data on the
array, and I suffered a major array failure. I was able to rebuild the
array, but some data was lost. Of course I have backups, so after
running xfs_repair, I ran an rsync job to recover the lost data. Most
of it was recovered, but there are several files that cannot be read,
deleted, or overwritten. I have run xfs_repair several times, but
every attempt to access these files still reports "cannot stat
XXXXXXXX: Structure needs cleaning". I don't need to recover the
data directly, as it does reside on the backup, but I need to clear the
file structure so I can write the files back to the filesystem. How do
I proceed?
* Re: Corrupted files
2014-09-09 15:21 Corrupted files Leslie Rhorer
@ 2014-09-09 15:50 ` Sean Caron
2014-09-09 16:03 ` Sean Caron
2014-09-09 16:08 ` Emmanuel Florac
2014-09-09 22:06 ` Dave Chinner
2 siblings, 1 reply; 35+ messages in thread
From: Sean Caron @ 2014-09-09 15:50 UTC (permalink / raw)
To: Leslie Rhorer, Sean Caron; +Cc: xfs@oss.sgi.com

Hi Leslie,

If you have a full backup, I would STRONGLY recommend wiping the old
filesystem and restoring your backups onto a totally fresh XFS, rather
than repairing the original filesystem and then filling in the blanks
from backup with a file-diff tool like rsync.

You will probably hear various opinions here about xfs_repair; my
personal opinion is that xfs_repair is a program made available for the
unwary to further scramble their data and make a hash of the
filesystem... In my first-hand experience managing ~7 PB of XFS storage
and growing, I have NEVER seen xfs_repair (yes, even the "newest
version") do anything positive. It's basically a data scrambler.

At this point, you will never get this filesystem back to anything I'd
consider a production-grade, trustworthy data repository. Any further
runs of xfs_repair will either do nothing or make the situation worse.
Fortunately you followed best practice and kept backups, so you don't
really need xfs_repair anyway, right?

Best,

Sean

P.S. No backups? Still don't even think about running xfs_repair.
ESPECIALLY don't think about running xfs_repair. Try mounting
read-only; if that doesn't work, mount read-only with log recovery
disabled (the "norecovery" mount option) and scavenge what you can.
Write off the rest; that's the cost of doing business without backups.
Running xfs_repair (especially as a first-line step) will only make it
worse, and on big filesystems the run time can extend to weeks...
Don't keep your users down any longer than you need to running a
program that won't really help you. Just scavenge it, reformat, and
turn it back around.
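P.P.S. The scavenge itself is nothing fancy. Roughly this (a sketch
only; the device and destination paths are examples, adjust to taste):

    # try a plain read-only mount first
    mount -o ro /dev/md0 /mnt/rescue

    # if that dies replaying the journal, skip log recovery entirely
    mount -o ro,norecovery /dev/md0 /mnt/rescue

    # copy out whatever is readable; the files that error out are the
    # ones you write off
    rsync -av /mnt/rescue/ /mnt/spare-array/rescue/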
* Re: Corrupted files
2014-09-09 15:50 ` Sean Caron
@ 2014-09-09 16:03 ` Sean Caron
2014-09-09 22:24 ` Eric Sandeen
0 siblings, 1 reply; 35+ messages in thread
From: Sean Caron @ 2014-09-09 16:03 UTC (permalink / raw)
To: Leslie Rhorer, Sean Caron; +Cc: xfs@oss.sgi.com

OK, let me retract just a tiny fraction of what I said originally;
thinking about it further, there was _one_ time I was able to use
xfs_repair to successfully recover a "lightly bruised" XFS and return
it to service. But in that case the fault was very minor, and I always
check first with:

    xfs_repair [-L] -n -v <filesystem>

and give the output a good looking-over before proceeding further. If
it won't run without zeroing the log, you can take that as a sign that
things are getting dire... I wouldn't bother to run xfs_repair "for
real" if the trial output looked even slightly non-trivial, in cases
of underlying array failure or massive filesystem corruption, and I'd
never run it without mounting and scavenging first (unless I had a
very recent full backup).

Barring rare cases, xfs_repair is bad juju.

Best,

Sean
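P.S. Spelled out, the cautious sequence looks something like this
(illustrative; the device name is just an example):

    # no-modify dry run: writes nothing, only reports what it *would* do
    xfs_repair -n -v /dev/md0 2>&1 | tee /tmp/repair-dryrun.log

    # read that log over carefully; if it is long or scary, stop here
    # and scavenge instead

    # only with a good backup (or after scavenging) run it for real
    xfs_repair -v /dev/md0

    # -L zeroes a dirty log, discarding the metadata updates held in
    # it; treat it strictly as a last resort
    xfs_repair -L -v /dev/md0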
* Re: Corrupted files
2014-09-09 16:03 ` Sean Caron
@ 2014-09-09 22:24 ` Eric Sandeen
2014-09-09 22:57 ` Sean Caron
2014-09-10 0:48 ` Leslie Rhorer
0 siblings, 2 replies; 35+ messages in thread
From: Eric Sandeen @ 2014-09-09 22:24 UTC (permalink / raw)
To: Sean Caron, Leslie Rhorer; +Cc: xfs@oss.sgi.com

On 9/9/14 11:03 AM, Sean Caron wrote:
> Barring rare cases, xfs_repair is bad juju.

No, it's not. It is the appropriate tool to use for filesystem repair.

But it is not the appropriate tool for recovery from mangled storage.

I've actually been running a filesystem fuzzer over xfs images,
randomly corrupting data and testing repair, 1000s of times over. It
does remarkably well.

If you scramble your raid, which means your block device is no longer
an xfs filesystem, but is instead a random tangle of bits and pieces
of other things, of course xfs_repair won't do well, but it's not the
right tool for the job at that stage.

-Eric
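(For anyone curious, a loop with the same general shape as that
fuzz-and-repair test is easy to sketch at home. This is only an
illustration of the idea, not the actual harness used above; paths and
iteration counts are arbitrary:)

    # fuzz a small scratch image and let repair chew on it, repeatedly
    for i in $(seq 1 100); do
        # fresh 512 MB filesystem image each round
        mkfs.xfs -f -d file,name=/tmp/scratch.img,size=512m >/dev/null

        # stomp three random 512-byte sectors with garbage
        for j in 1 2 3; do
            dd if=/dev/urandom of=/tmp/scratch.img bs=512 count=1 \
               seek=$(( (RANDOM * 32768 + RANDOM) % 1048576 )) \
               conv=notrunc 2>/dev/null
        done

        # -f tells xfs_repair the target is an image file, not a device
        xfs_repair -f /tmp/scratch.img >/dev/null 2>&1 \
            || echo "round $i: repair exited nonzero"
    done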
* Re: Corrupted files
2014-09-09 22:24 ` Eric Sandeen
@ 2014-09-09 22:57 ` Sean Caron
2014-09-10 1:00 ` Roger Willcocks
2014-09-10 5:09 ` Eric Sandeen
2014-09-10 0:48 ` Leslie Rhorer
1 sibling, 2 replies; 35+ messages in thread
From: Sean Caron @ 2014-09-09 22:57 UTC (permalink / raw)
To: Eric Sandeen, Sean Caron; +Cc: Leslie Rhorer, xfs@oss.sgi.com

Hey, just sharing some hard-won (believe me) professional experience.
I have seen xfs_repair take a bad situation and make it worse many
times. I don't know that a filesystem fuzzer or any other simulation
can ever truly reproduce users absolutely pounding the tar out of a
system. There seems to be a real disconnect between what developers
are able to test and observe directly, and what happens in a very
high-throughput production environment.

Best,

Sean
* Re: Corrupted files
2014-09-09 22:57 ` Sean Caron
@ 2014-09-10 1:00 ` Roger Willcocks
2014-09-10 1:23 ` Leslie Rhorer
2014-09-10 5:09 ` Eric Sandeen
1 sibling, 1 reply; 35+ messages in thread
From: Roger Willcocks @ 2014-09-10 1:00 UTC (permalink / raw)
To: Sean Caron; +Cc: Roger Willcocks, Eric Sandeen, Leslie Rhorer, xfs@oss.sgi.com

I normally watch quietly from the sidelines, but I think it's
important to get some balance here; our customers between them run
many hundreds of multi-terabyte arrays, and when something goes badly
awry it generally falls to me to sort it out. In my experience,
xfs_repair does exactly what it says on the tin.

I can recall only a couple of instances where we elected to reformat
and reload from backups, and both were due to human error: somebody
deleted the wrong raid unit during routine maintenance and then tried
to fix it up themselves.

In theory, of course, xfs_repair shouldn't be needed if the write
barriers work properly (it's a journalled filesystem), but low-level
corruption does creep in due to power failures and kernel crashes, and
it is this which xfs_repair is intended to address; not massive data
corruption due to failed hardware or careless users.

--
Roger
* Re: Corrupted files
2014-09-10 1:00 ` Roger Willcocks
@ 2014-09-10 1:23 ` Leslie Rhorer
0 siblings, 0 replies; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10 1:23 UTC (permalink / raw)
To: Roger Willcocks, Sean Caron; +Cc: Eric Sandeen, xfs@oss.sgi.com

On 9/9/2014 8:00 PM, Roger Willcocks wrote:
> I normally watch quietly from the sidelines, but I think it's
> important to get some balance here

That is almost always wise advice. Shooting from the hip often has
regrettable consequences, yet being too cautious can have its down
side, too. In this case, things are working very well at the moment,
and the apparent issues are reasonably small, so there is no need for
panic.

> our customers between them run many hundreds of multi-terabyte
> arrays, and when something goes badly awry it generally falls to me
> to sort it out. In my experience, xfs_repair does exactly what it
> says on the tin.

I couldn't say. This is only the second time I have ever had an array
drop, and the first time it was completely unrecoverable. Less than 5
minutes after I had started a reshape from RAID5 to RAID6, there was a
protracted power outage. I shut down the system cleanly and restarted
the reshape after the outage. The recovery had only been running a few
minutes when the system suffered a kernel panic; I never did find out
why. Every single structure on the array larger than the stripe size
(16K, I think) was garbage.

> it is this which xfs_repair is intended to address; not massive data
> corruption due to failed hardware or careless users.

Oh, yeah, like losing 3 out of 8 drives in the array after a drive
controller replacement...
* Re: Corrupted files
2014-09-09 22:57 ` Sean Caron
2014-09-10 1:00 ` Roger Willcocks
@ 2014-09-10 5:09 ` Eric Sandeen
1 sibling, 0 replies; 35+ messages in thread
From: Eric Sandeen @ 2014-09-10 5:09 UTC (permalink / raw)
To: Sean Caron; +Cc: Leslie Rhorer, xfs@oss.sgi.com

On 9/9/14 5:57 PM, Sean Caron wrote:
> Hey, just sharing some hard-won (believe me) professional
> experience. I have seen xfs_repair take a bad situation and make it
> worse many times. [...]

Fair enough, but I don't want to let stand an assertion that you
should avoid xfs_repair at all (most) costs.

It, like almost any software, has some bugs, but they don't get fixed
if they don't get well reported. We do our best to improve it when we
get useful reports from users - usually including a metadata dump -
and we beat on it as best we can in the lab.

"pounding the tar out of a filesystem" should not, in general, require
an xfs_repair run. ;)

Yes, it's always good advice to do a dry run before committing to a
repair, in case something goes off the rails. But most times I've seen
things go very very badly were when the storage device under the
filesystem was no longer consistent, and the filesystem really had no
pieces to pick up.

-Eric
* Re: Corrupted files
2014-09-09 22:24 ` Eric Sandeen
2014-09-09 22:57 ` Sean Caron
@ 2014-09-10 0:48 ` Leslie Rhorer
2014-09-10 1:10 ` Roger Willcocks
1 sibling, 1 reply; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10 0:48 UTC (permalink / raw)
To: Eric Sandeen, Sean Caron; +Cc: xfs@oss.sgi.com

On 9/9/2014 5:24 PM, Eric Sandeen wrote:
> No, it's not. It is the appropriate tool to use for filesystem
> repair.
>
> But it is not the appropriate tool for recovery from mangled storage.

It's not all that mangled. Of the more than 52,000 files on the backup
server array, only 5758 were missing from the primary array, and most
of those were lost to the corruption of just a couple of directories,
where every file in the directory was lost along with the directory
itself. Several directories and a scattering of individual files had
been deliberately deleted prior to the failure but not yet purged from
the backup. Most were small files; only 29 were larger than 1G. All
5758 were easily recovered. The only ones remaining at issue are 3
files which cannot be read, written, or deleted. The rest have been
read and their checksums successfully computed and compared. With only
50K files in question, I am confident any checksum collisions are of
insignificant probability. Someone is going to have to do a lot of
talking to convince me rsync can read two copies of what should be the
same data and come up with the same checksum value for both, while
other applications would be able to successfully read one of the files
and not the other.

I really don't think Draconian measures are required. Even if it turns
out they are, the existence of the backup allows for a good deal of
fiddling with the main filesystem before one is compelled to give up
and start fresh. This especially since a small amount of the data on
the main array had not yet been backed up to the secondary array.
These e-mails, for example. The rsync job that backs up the main array
runs every morning at 04:00, so files created that day were not backed
up, and for safety I have since changed the backup array filesystem to
read-only, so nothing created since then is backed up.

> I've actually been running a filesystem fuzzer over xfs images,
> randomly corrupting data and testing repair, 1000s of times over.
> It does remarkably well.
>
> If you scramble your raid [...] of course xfs_repair won't do well,
> but it's not the right tool for the job at that stage.

This is nowhere near that stage. A few sectors here and there were
lost because 3 drives were kicked from the array while write
operations were underway. I had to force-reassemble the array, which
lost some data. The vast majority of the data is clearly intact,
including most of the filesystem structures. Far less than 1% of the
data was lost or corrupted.
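P.S. For anyone wanting to replicate the comparison: rsync can do the
checksum sweep by itself. This is the general shape of it (paths
illustrative, not my exact job):

    # -r recurse; -c force full checksums instead of size/mtime tests;
    # -n dry run, so it only lists the files that differ
    rsync -rcn --out-format='%n' /Backup/RAID/ /RAID/ > /tmp/differs.txt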
* Re: Corrupted files
2014-09-10 0:48 ` Leslie Rhorer
@ 2014-09-10 1:10 ` Roger Willcocks
2014-09-10 1:31 ` Leslie Rhorer
0 siblings, 1 reply; 35+ messages in thread
From: Roger Willcocks @ 2014-09-10 1:10 UTC (permalink / raw)
To: Leslie Rhorer; +Cc: Sean Caron, Roger Willcocks, Eric Sandeen, xfs@oss.sgi.com

On 10 Sep 2014, at 01:48, Leslie Rhorer <lrhorer@mygrande.net> wrote:

> The only ones remaining at issue are 3 files which cannot be read,
> written, or deleted.

The most straightforward fix would be to note down the inode numbers
of the three files and then use xfs_db to clear the inodes; then run
xfs_repair again. See:

http://xfs.org/index.php/XFS_FAQ#Q:_How_to_get_around_a_bad_inode_repair_is_unable_to_clean_up

but before that, try running the latest (3.2.1, I think) xfs_repair.

--
Roger
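The sequence is roughly this (a sketch only; double-check the FAQ
recipe and xfs_db(8) before writing anything, and substitute your own
inode numbers and device):

    # with the filesystem still mounted, readdir (unlike stat) works,
    # so plain ls -i recovers the inode numbers of the stuck files
    ls -i /RAID/path/to/broken/directory

    # then unmount and inspect the suspect inode, read-only first
    xfs_db -r -c 'inode <inum>' -c 'print' /dev/md0

    # clearing is done in expert (-x) mode with the 'write' commands
    # given in the FAQ entry above, followed by another repair pass
    xfs_db -x /dev/md0
    xfs_repair /dev/md0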
* Re: Corrupted files
2014-09-10 1:10 ` Roger Willcocks
@ 2014-09-10 1:31 ` Leslie Rhorer
2014-09-10 14:24 ` Emmanuel Florac
0 siblings, 1 reply; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10 1:31 UTC (permalink / raw)
To: Roger Willcocks; +Cc: Sean Caron, Eric Sandeen, xfs@oss.sgi.com

On 9/9/2014 8:10 PM, Roger Willcocks wrote:
> The most straightforward fix would be to note down the inode numbers
> of the three files and then use xfs_db to clear the inodes; then run
> xfs_repair again.

That sounds reasonable. If no one has any sounder advice, I think I
will try that.

> but before that, try running the latest (3.2.1, I think) xfs_repair.

I am always hesitant to run anything outside the distro package; I've
had problems in the past with doing so. 3.1.7 is pretty close, so
unless there is a really solid reason to use 3.2.1 over 3.1.7, I think
I will stick with the distro version and try the above. Can you or
anyone else give a reason why 3.2.1 would work when 3.1.7 would not?
More importantly, is there some reason 3.1.7 would make things worse
where 3.2.1 would not? If not, then I can always try 3.1.7 first and
fall back to 3.2.1 if that does not help.
* Re: Corrupted files
2014-09-10 1:31 ` Leslie Rhorer
@ 2014-09-10 14:24 ` Emmanuel Florac
2014-09-10 14:49 ` Sean Caron
0 siblings, 1 reply; 35+ messages in thread
From: Emmanuel Florac @ 2014-09-10 14:24 UTC (permalink / raw)
To: Leslie Rhorer; +Cc: Eric Sandeen, Roger Willcocks, Sean Caron, xfs@oss.sgi.com

On Tue, 09 Sep 2014 20:31:07 -0500, Leslie Rhorer
<lrhorer@mygrande.net> wrote:

> More importantly, is there some reason 3.1.7 would make things worse
> where 3.2.1 would not? If not, then I can always try 3.1.7 first and
> fall back to 3.2.1 if that does not help.

I don't know about these particular versions; however, in the past I
have confirmed that a later version of xfs_repair performed much
better (in particular, it salvaged more files into lost+found).

At some point in the distant past, some versions of xfs_repair were
buggy and would happily throw away TB of perfectly sane data... I had
this very problem once, on Christmas eve in 2005, IIRC :/

--
Emmanuel Florac | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02
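(For reference, trying a newer xfs_repair without disturbing the
distro package is fairly low-risk, since the freshly built binary can
be run straight from the source tree in no-modify mode. Roughly, with
the repo URL and tag as they stood around this time, so verify both
before use:)

    git clone git://oss.sgi.com/xfs/cmds/xfsprogs.git
    cd xfsprogs
    git checkout v3.2.1
    make
    # run the new binary in dry-run mode directly, no install needed
    ./repair/xfs_repair -n -v /dev/md0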
* Re: Corrupted files
2014-09-10 14:24 ` Emmanuel Florac
@ 2014-09-10 14:49 ` Sean Caron
0 siblings, 0 replies; 35+ messages in thread
From: Sean Caron @ 2014-09-10 14:49 UTC (permalink / raw)
To: Emmanuel Florac, Sean Caron; +Cc: Roger Willcocks, Eric Sandeen, Leslie Rhorer, xfs@oss.sgi.com

I don't want to bloviate too much and drag this completely off topic,
especially since the OP's query is resolved, but please allow me just
one anecdote :)

Earlier this year, I had one of our project file servers (450 TB) go
down. It didn't go down because the array spuriously lost a bunch of
disks; it was simply your usual sort of Linux kernel panic... you go
to the console and it's just a black screen and unresponsive, or maybe
you can see the tail end of a backtrace and it's unresponsive. So, OK,
issue a quick remote IPMI reboot of the machine, it comes up... I'm in
single user mode, bringing up each sub-RAID6 in our RAID60 by hand, no
problem. Bring up the top-level RAID0. OK. Then I go to mount the
XFS... no go. Apparently the log somehow got corrupted in the crash?
So I try to mount ro, no dice, but I _can_ mount ro,norecovery, and I
see good files here! Thank goodness. I start scavenging to a spare
host...

A few weeks later, after the scavenge was done, I did a few xfs_repair
runs just for the sake of experimentation. Using both in dry-run mode,
I tried the version that shipped with Ubuntu 12.04, as well as the
latest xfs_repair I could pull from the source tree. I redirected the
output of both runs to file and watched them with 'tail -f'. Diffing
the output when they were done, it didn't look like they were behaving
much differently. Both files held thousands or tens of thousands of
lines of output: bad this, bad that... (I always run in verbose mode.)

Since the filesystem was hosed anyway and I was going to rebuild it, I
decided to let the new xfs_repair run "for real", just to see what
would happen, for kicks. And who knows? Maybe I could recover even
more than I already had...? (I wasn't just totally wasting time.) I
think it took maybe a week for it to run on a 450 TB volume? At least
a week. Maybe I was being a teensy bit hyperbolic in my previous
descriptions of runtime, LOL.

After it was done? ... almost everything was obliterated. I had tens
of millions of zero-length files, and tens of millions of bits of
anonymous scrambled junk in lost+found. So, I chuckled a bit (thankful
for my hard-won previous experience) before reformatting the array and
copying back the results of my scavenging.

Just by ro-mounting and copying what I could, I was able to save
around 90% of the data by volume on the array (it was a little more
than half full when it failed... ~290 TB? There was only ~30 TB that I
couldn't salvage); good clean files that passed validation from their
respective users. I think 80-90% recovery rates are very commonly
achievable just by mounting ro,norecovery and getting what you can
with cp -R or rsync, given that there wasn't grievous failure of the
underlying storage system.

If I had depended on xfs_repair, or blithely run it as a first line of
response as the documentation might intimate (hey, it's called
xfs_repair, right?), like people casually run fsck or CHKDSK... I
would have been hosed, big time.

Best,

Sean
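P.S. The dry-run comparison is trivial to reproduce if you ever find
yourself in the same spot (a sketch; the binary locations are
examples):

    # distro binary
    xfs_repair -n -v /dev/md0 > /tmp/repair-distro.log 2>&1

    # freshly built binary, straight out of the xfsprogs source tree
    ~/src/xfsprogs/repair/xfs_repair -n -v /dev/md0 > /tmp/repair-new.log 2>&1

    diff -u /tmp/repair-distro.log /tmp/repair-new.log | less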
* Re: Corrupted files
2014-09-09 15:21 Corrupted files Leslie Rhorer
2014-09-09 15:50 ` Sean Caron
@ 2014-09-09 16:08 ` Emmanuel Florac
2014-09-09 22:06 ` Dave Chinner
2 siblings, 0 replies; 35+ messages in thread
From: Emmanuel Florac @ 2014-09-09 16:08 UTC (permalink / raw)
To: Leslie Rhorer; +Cc: xfs

On Tue, 09 Sep 2014 10:21:37 -0500, Leslie Rhorer
<lrhorer@mygrande.net> wrote:

> I have tried running xfs_repair several times, but any attempt to
> access these files still reports "cannot stat XXXXXXXX: Structure
> needs cleaning".

I won't agree with Sean here(1). Most of the time xfs_repair ends with
the expected result; however, many distros (particularly CentOS) ship
positively ancient versions. You'd do better to grab a recent version
(3.1 or later).

(1) In particular on the "run for weeks" part. I've never had
xfs_repair take more than a couple of hours, even on badly damaged
filesystems in the hundreds-of-terabytes range.

--
Emmanuel Florac | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02
* Re: Corrupted files
2014-09-09 15:21 Corrupted files Leslie Rhorer
2014-09-09 15:50 ` Sean Caron
2014-09-09 16:08 ` Emmanuel Florac
@ 2014-09-09 22:06 ` Dave Chinner
2014-09-10 1:12 ` Leslie Rhorer
2 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2014-09-09 22:06 UTC (permalink / raw)
To: Leslie Rhorer; +Cc: xfs

On Tue, Sep 09, 2014 at 10:21:37AM -0500, Leslie Rhorer wrote:
> I have an issue with my primary RAID array. I have 13T of data on
> the array, and I suffered a major array failure. [...] I need to
> clear the file structure so I can write the files back to the
> filesystem. How do I proceed?

Firstly, more information is required, namely versions and actual
error messages:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

dmesg, in particular, should tell us what the corruption being
encountered is when stat fails.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Corrupted files
2014-09-09 22:06 ` Dave Chinner
@ 2014-09-10 1:12 ` Leslie Rhorer
2014-09-10 1:25 ` Sean Caron
2014-09-10 1:53 ` Dave Chinner
0 siblings, 2 replies; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10 1:12 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 9/9/2014 5:06 PM, Dave Chinner wrote:
> Firstly, more information is required, namely versions and actual
> error messages:

Indubitably:

RAID-Server:/# xfs_repair -V
xfs_repair version 3.1.7
RAID-Server:/# uname -r
3.2.0-4-amd64

4.0 GHz FX-8350 eight-core processor

RAID-Server:/# cat /proc/meminfo /proc/mounts /proc/partitions
MemTotal:        8099916 kB
MemFree:         5786420 kB
Buffers:          112684 kB
Cached:           457020 kB
SwapCached:            0 kB
Active:           521800 kB
Inactive:         457268 kB
Active(anon):     276648 kB
Inactive(anon):   140180 kB
Active(file):     245152 kB
Inactive(file):   317088 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      12623740 kB
SwapFree:       12623740 kB
Dirty:                20 kB
Writeback:             0 kB
AnonPages:        409488 kB
Mapped:            47576 kB
Shmem:              7464 kB
Slab:             197100 kB
SReclaimable:     112644 kB
SUnreclaim:        84456 kB
KernelStack:        2560 kB
PageTables:         8468 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    16673696 kB
Committed_AS:    1010172 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      339140 kB
VmallocChunk:   34359395308 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       65532 kB
DirectMap2M:     5120000 kB
DirectMap1G:     3145728 kB
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=10240k,nr_inodes=1002653,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=809992k,mode=755 0 0
/dev/disk/by-uuid/fa5c404a-bfcb-43de-87ed-e671fda1ba99 / ext4 rw,relatime,errors=remount-ro,user_xattr,barrier=1,data=ordered 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=4144720k 0 0
/dev/md1 /boot ext2 rw,relatime,errors=continue 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
Backup:/Backup /Backup nfs rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.51,mountvers=3,mountport=39597,mountproto=tcp,local_lock=none,addr=192.168.1.51 0 0
Backup:/var/www /var/www/backup nfs rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.51,mountvers=3,mountport=39597,mountproto=tcp,local_lock=none,addr=192.168.1.51 0 0
/dev/md0 /RAID xfs rw,relatime,attr2,delaylog,sunit=2048,swidth=12288,noquota 0 0
major minor  #blocks  name

   8        0  125034840 sda
   8        1      96256 sda1
   8        2  112305152 sda2
   8        3   12632064 sda3
   8       16  125034840 sdb
   8       17      96256 sdb1
   8       18  112305152 sdb2
   8       19   12632064 sdb3
   8       48 3907018584 sdd
   8       32 3907018584 sdc
   8       64 1465138584 sde
   8       80 1465138584 sdf
   8       96 1465138584 sdg
   8      112 3907018584 sdh
   8      128 3907018584 sdi
   8      144 3907018584 sdj
   8      160 3907018584 sdk
   9        1      96192 md1
   9        2  112239488 md2
   9        3   12623744 md3
   9        0 23441319936 md0
   9       10 4395021312 md10

RAID-Server:/# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1] [raid0]
md10 : active raid0 sdf[0] sde[2] sdg[1]
      4395021312 blocks super 1.2 512k chunks

md0 : active raid6 md10[12] sdc[13] sdk[10] sdj[11] sdi[15] sdh[8] sdd[9]
      23441319936 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [8/7] [UUU_UUUU]
      bitmap: 29/30 pages [116KB], 65536KB chunk

md3 : active (auto-read-only) raid1 sda3[0] sdb3[1]
      12623744 blocks super 1.2 [3/2] [UU_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md2 : active raid1 sda2[0] sdb2[1]
      112239488 blocks super 1.2 [3/2] [UU_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md1 : active raid1 sda1[0] sdb1[1]
      96192 blocks [3/2] [UU_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

Six of the drives are 4T spindles (a mixture of makes and models). The
three drives comprising md10 are WD 1.5T green drives; they are in
place to take over the function of one of the kicked 4T drives. Md1,
md2, and md3 are not data drives and are not suffering any issue.

I'm not sure what is meant by "write cache status" in this context.
The machine has been rebooted more than once during recovery, and the
FS has been unmounted and xfs_repair run several times.

I don't know what the acronym BBWC stands for.

RAID-Server:/# xfs_info /dev/md0
meta-data=/dev/md0               isize=256    agcount=43, agsize=137356288 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=5860329984, imaxpct=5
         =                       sunit=256    swidth=1536 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The system performs just fine, other than the aforementioned, with
loads in excess of 3Gbps. That is internal only; the LAN link is only
1Gbps, so no external request exceeds about 950Mbps.

> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> dmesg, in particular, should tell us what the corruption being
> encountered is when stat fails.

RAID-Server:/# ls "/RAID/DVD/Big Sleep, The (1945)/VIDEO_TS/VTS_01_1.VOB"
ls: cannot access /RAID/DVD/Big Sleep, The (1945)/VIDEO_TS/VTS_01_1.VOB: Structure needs cleaning
RAID-Server:/# dmesg | tail -n 30
...
[192173.363981] XFS (md0): corrupt dinode 41006, extent total = 1, nblocks = 0.
[192173.363988] ffff8802338b8e00: 49 4e 81 b6 02 02 00 00 00 00 03 e8 00 00 03 e8  IN..............
[192173.363996] XFS (md0): Internal error xfs_iformat(1) at line 319 of file /build/linux-eKuxrT/linux-3.2.60/fs/xfs/xfs_inode.c.  Caller 0xffffffffa0509318
[192173.363999]
[192173.364062] Pid: 10813, comm: ls Not tainted 3.2.0-4-amd64 #1 Debian 3.2.60-1+deb7u3
[192173.364065] Call Trace:
[192173.364097]  [<ffffffffa04d3731>] ? xfs_corruption_error+0x54/0x6f [xfs]
[192173.364134]  [<ffffffffa0509318>] ? xfs_iread+0x9f/0x177 [xfs]
[192173.364170]  [<ffffffffa0508efa>] ? xfs_iformat+0xe3/0x462 [xfs]
[192173.364204]  [<ffffffffa0509318>] ? xfs_iread+0x9f/0x177 [xfs]
[192173.364240]  [<ffffffffa0509318>] ? xfs_iread+0x9f/0x177 [xfs]
[192173.364268]  [<ffffffffa04d6ebe>] ? xfs_iget+0x37c/0x56c [xfs]
[192173.364300]  [<ffffffffa04e13b4>] ? xfs_lookup+0xa4/0xd3 [xfs]
[192173.364328]  [<ffffffffa04d9e5a>] ? xfs_vn_lookup+0x3f/0x7e [xfs]
[192173.364344]  [<ffffffff81102de9>] ? d_alloc_and_lookup+0x3a/0x60
[192173.364357]  [<ffffffff8110388d>] ? walk_component+0x219/0x406
[192173.364370]  [<ffffffff81104721>] ? path_lookupat+0x7c/0x2bd
[192173.364383]  [<ffffffff81036628>] ? should_resched+0x5/0x23
[192173.364396]  [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c
[192173.364408]  [<ffffffff8110497e>] ? do_path_lookup+0x1c/0x87
[192173.364420]  [<ffffffff81106407>] ? user_path_at_empty+0x47/0x7b
[192173.364434]  [<ffffffff813533d8>] ? do_page_fault+0x30a/0x345
[192173.364448]  [<ffffffff810d6a04>] ? mmap_region+0x353/0x44a
[192173.364460]  [<ffffffff810fe45a>] ? vfs_fstatat+0x32/0x60
[192173.364471]  [<ffffffff810fe590>] ? sys_newstat+0x12/0x2b
[192173.364483]  [<ffffffff813509f5>] ? page_fault+0x25/0x30
[192173.364495]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
[192173.364503] XFS (md0): Corruption detected. Unmount and run xfs_repair

That last line, by the way, is why I ran umount and xfs_repair.
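P.S. Since dmesg names the inode, it can also be examined directly
with xfs_db (a read-only sketch; run it with the filesystem
unmounted):

    # dump the on-disk core of the inode dmesg is complaining about;
    # the mismatch should be visible in core.nextents vs. core.nblocks
    xfs_db -r -c 'inode 41006' -c 'print' /dev/md0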
* Re: Corrupted files
2014-09-10 1:12 ` Leslie Rhorer
@ 2014-09-10 1:25 ` Sean Caron
2014-09-10 1:43 ` Leslie Rhorer
2014-09-10 1:53 ` Dave Chinner
1 sibling, 1 reply; 35+ messages in thread
From: Sean Caron @ 2014-09-10 1:25 UTC (permalink / raw)
To: Leslie Rhorer, Sean Caron; +Cc: xfs@oss.sgi.com

Hi Leslie,

You really don't want to be running "green" anything in an array...
that is a ticking time bomb just waiting to go off... let me tell
you... At my installation, a predecessor had procured a large number
of green drives because they were very inexpensive, and regrets were
had by all. Lousy performance, lots of spurious ejection/RAID
gremlins, and the failure rate on the WDC Greens is just appalling...

BBWC stands for Battery Backed Write Cache; this is a feature of
hardware RAID cards. It is just what it says on the tin: a bit
(usually half a gig, or a gig, or two...) of nonvolatile cache that
retains writes to the array in case of power failure, etc. If you have
BBWC enabled but your battery is dead, bad things can happen. Not
applicable for JBOD software RAID.

I hold firm to my beliefs on xfs_repair :) As I say, you'll see a
variety of opinions here.

Best,

Sean
* Re: Corrupted files
2014-09-10 1:25 ` Sean Caron
@ 2014-09-10 1:43 ` Leslie Rhorer
2014-09-10 14:31 ` Emmanuel Florac
0 siblings, 1 reply; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10 1:43 UTC (permalink / raw)
To: Sean Caron; +Cc: xfs@oss.sgi.com

On 9/9/2014 8:25 PM, Sean Caron wrote:
> You really don't want to be running "green" anything in an array...
> that is a ticking time bomb just waiting to go off...

The alternative is nothing at all. I am not a company, just a guy with
a couple of arrays at his house. Not a rich guy, either. I've had
these arrays since 2001 with only one other mass drive failure; that
one was recoverable, and those were not "green" drives, either. (Four
Seagate drives all suddenly decided they did not want to be part of
the array, so md kicked all four simultaneously. After that, they
would not stay up as part of the array long enough to be mounted. I
was able to read all four with dd_rescue and get the array back online
without a single lost file.)

Note also these arrays are not usually under any sort of massive load.
The bulk of the data is video files, written once at about 80MBps and
then read back one at a time at about 4MBps.

> Lousy performance, lots of spurious ejection/RAID gremlins, and the
> failure rate on the WDC Greens is just appalling...

None of the failed drives were WD Green. All three, and the previous
four, were Seagate. I realize that is not a large statistical sample.

> BBWC stands for Battery Backed Write Cache; this is a feature of
> hardware RAID cards

Ah, yes. This array does not have a BBWC controller. The backup array
does, actually, but the battery backup is disabled.

> If you have BBWC enabled but your battery is dead, bad things can
> happen. Not applicable for JBOD software RAID.

Exactly. All the arrays are JBOD / mdadm.
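(The dd_rescue step generalizes: when drives are dropping out of an
array but are still mostly readable, clone each one onto a healthy
spare before reassembling. With GNU ddrescue, a close cousin of the
dd_rescue tool named above, the recipe looks roughly like this; device
names are illustrative:)

    # copy the failing disk to a fresh one, keeping a map file so an
    # interrupted run can resume where it left off
    ddrescue -f /dev/sdX /dev/sdY /root/sdX.map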
* Re: Corrupted files
2014-09-10 1:43 ` Leslie Rhorer
@ 2014-09-10 14:31 ` Emmanuel Florac
2014-09-10 14:52 ` Grozdan
` (3 more replies)
0 siblings, 4 replies; 35+ messages in thread
From: Emmanuel Florac @ 2014-09-10 14:31 UTC (permalink / raw)
To: Leslie Rhorer; +Cc: Sean Caron, xfs@oss.sgi.com

On Tue, 09 Sep 2014 20:43:08 -0500, Leslie Rhorer
<lrhorer@mygrande.net> wrote:

> None of the failed drives were WD Green. All three, and the previous
> four, were Seagate. I realize that is not a large statistical
> sample.

If you're interested in large statistical samples: out of a grand
total of 4000 1 TB Seagate Barracuda ES.2 drives, I had to replace
2100 over the course of 3 years. I still have a couple of hundred of
these unfortunate pieces of crap in service, and they still account
for the vast majority of unexpected RAID malfunctions, urgent
replacements, late-night calls, and other "interesting side
activities".

I wouldn't buy anything labeled Seagate nowadays. Their drives have
been the worst train wreck since the dreaded 9 GB Micropolis back in
1994 (or was it 1995?).

--
Emmanuel Florac | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02
* Re: Corrupted files
2014-09-10 14:31 ` Emmanuel Florac
@ 2014-09-10 14:52 ` Grozdan
2014-09-10 15:12 ` Emmanuel Florac
1 sibling, 1 reply; 35+ messages in thread
From: Grozdan @ 2014-09-10 14:52 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Sean Caron, Leslie Rhorer, xfs@oss.sgi.com

On Wed, Sep 10, 2014 at 4:31 PM, Emmanuel Florac
<eflorac@intellique.com> wrote:

> If you're interested in large statistical samples: out of a grand
> total of 4000 1 TB Seagate Barracuda ES.2 drives, I had to replace
> 2100 over the course of 3 years. [...]

Funny, because our servers (105 of them) have all run on Seagate
drives for a few years now, and I have yet to see one fail or cause
other problems. But then again, we use Constellation disks, not
Barracudas. At home I also use both Barracudas and Constellations, and
have likewise yet to see a problem with them.

--
Yours truly
* Re: Corrupted files
2014-09-10 14:52 ` Grozdan
@ 2014-09-10 15:12 ` Emmanuel Florac
2014-09-10 15:32 ` Grozdan
0 siblings, 1 reply; 35+ messages in thread
From: Emmanuel Florac @ 2014-09-10 15:12 UTC (permalink / raw)
To: Grozdan; +Cc: Sean Caron, Leslie Rhorer, xfs@oss.sgi.com

On Wed, 10 Sep 2014 16:52:27 +0200, Grozdan <neutrino8@gmail.com>
wrote:

> Funny, because our servers (105 of them) have all run on Seagate
> drives for a few years now, and I have yet to see one fail or cause
> other problems. But then again, we use Constellation disks, not
> Barracudas.

Yes, we replaced most failed Barracudas with Constellations at a later
stage (because the "certified repaired" Barracudas aren't any
better...), and these work fine so far. However, why would I give
Seagate my hard-earned money after they cost me so dearly for years? :)

--
Emmanuel Florac | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02
* Re: Corrupted files
  2014-09-10 15:12 ` Emmanuel Florac
@ 2014-09-10 15:32 ` Grozdan
  0 siblings, 0 replies; 35+ messages in thread
From: Grozdan @ 2014-09-10 15:32 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Sean Caron, Leslie Rhorer, xfs@oss.sgi.com

On Wed, Sep 10, 2014 at 5:12 PM, Emmanuel Florac
<eflorac@intellique.com> wrote:
> On Wed, 10 Sep 2014 16:52:27 +0200,
> Grozdan <neutrino8@gmail.com> wrote:
>
>> Funny, because our servers (105 of them) have all been running on
>> Seagate drives for a few years now, and I have yet to see one fail or
>> cause other problems. But then again, we use Constellation disks, not
>> Barracudas. At home I also use both Barracudas and Constellations, and
>> I have yet to see a problem with them either.
>>
>
> Yes, we replaced most failed Barracudas with Constellations at a later
> stage (because the "certified repaired" Barracudas aren't any
> better...), and these work fine so far. However, why would I give
> Seagate my hard-earned money after they cost me so dearly for years? :)

Oh, you are correct about the money. If it had happened to us, I'd
think twice about that too. The biggest problems we've had thus far
were with Samsung disks. I haven't seen such a high failure rate in all
my life: about 70% of the 100 disks we got failed within a year. Too
bad Seagate took them over. I can only hope that Seagate's
manufacturing and QA don't suffer because of that.

-- 
Yours truly

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Corrupted files
  2014-09-10 14:31 ` Emmanuel Florac
  2014-09-10 14:52 ` Grozdan
@ 2014-09-10 14:54 ` Sean Caron
  2014-09-10 23:18 ` Leslie Rhorer
  2014-09-11 13:24 ` Greg Freemyer
  3 siblings, 0 replies; 35+ messages in thread
From: Sean Caron @ 2014-09-10 14:54 UTC (permalink / raw)
To: Emmanuel Florac, Sean Caron; +Cc: Leslie Rhorer, xfs@oss.sgi.com

[-- Attachment #1.1: Type: text/plain, Size: 2182 bytes --]

I am probably overseeing a similar number (3-4000) of Hitachi A7K3000s,
A7K2000s and WDC RE4s, and I see maybe a few failures a month. When we
are building a new machine and get a fresh shipment in, we see perhaps
a 10% failure rate right out of the box. Those that survive the burn-in
usually do pretty well.

Man, you have my sympathy with that failure rate in excess of 50%...
even the WDC Greens weren't THAT bad (although it probably got close as
we got closer and closer to EOLing them... and they had been moved to
third-tier "backup storage" status by that point). Thankfully they are
gone now, LOL.

You're right, especially in large installations, it's critical to do
your homework on drives, pick a good candidate, validate it and then
run with them. Even with the good ones, you've gotta keep a watchful
eye... "when you buy them in bulk, they fail in bulk".

Best,

Sean

On Wed, Sep 10, 2014 at 10:31 AM, Emmanuel Florac
<eflorac@intellique.com> wrote:

> On Tue, 09 Sep 2014 20:43:08 -0500,
> Leslie Rhorer <lrhorer@mygrande.net> wrote:
>
> > 	None of the failed drives were WD green. All three and the
> > previous four were Seagate. I realize that is not a large
> > statistical sample.
> >
>
> If you're interested in large statistical samples: out of a grand total
> of 4000 1 TB Seagate Barracuda ES2 drives, I had to replace 2100 over
> the course of 3 years. I still have a couple of hundred of these
> unfortunate pieces of crap in service, and they still represent the
> vast majority of unexpected RAID malfunctions, urgent replacements,
> late night calls and other "interesting side activities".
>
> I wouldn't buy anything labeled Seagate nowadays. Their drives have
> been the worst train wreck since the dreaded 9 GB Micropolis back in
> 1994 (or was it 1995?).
>

[-- Attachment #1.2: Type: text/html, Size: 2947 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
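The burn-in Sean mentions can be approximated with stock Linux tools.
A minimal sketch, assuming smartmontools and e2fsprogs are installed;
/dev/sdX is a placeholder, and it must be a drive holding no data yet,
since the badblocks write test is destructive:

    smartctl -t long /dev/sdX    # start an extended SMART self-test
    smartctl -a /dev/sdX         # read the results once it completes
    badblocks -wsv /dev/sdX      # destructive write+verify pass over the whole drive

Drives that come through a pass like this without reallocated or
pending sectors are reasonable candidates for an array; the exact
recipe varies from shop to shop.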
* Re: Corrupted files
  2014-09-10 14:31 ` Emmanuel Florac
  2014-09-10 14:52 ` Grozdan
  2014-09-10 14:54 ` Sean Caron
@ 2014-09-10 23:18 ` Leslie Rhorer
  2014-09-11 13:24 ` Greg Freemyer
  3 siblings, 0 replies; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10 23:18 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Sean Caron, xfs@oss.sgi.com

On 9/10/2014 9:31 AM, Emmanuel Florac wrote:
> On Tue, 09 Sep 2014 20:43:08 -0500,
> Leslie Rhorer <lrhorer@mygrande.net> wrote:
>
>> 	None of the failed drives were WD green. All three and the
>> previous four were Seagate. I realize that is not a large
>> statistical sample.
>>
>
> If you're interested in large statistical samples: out of a grand total
> of 4000 1 TB Seagate Barracuda ES2 drives, I had to replace 2100 over
> the course of 3 years.

	That's a good-sized statistical sample. Oddly enough, perhaps, the
ones that failed on me were also 1T Barracuda drives, and my failure
rate was 40%.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Corrupted files
  2014-09-10 14:31 ` Emmanuel Florac
  ` (2 preceding siblings ...)
  2014-09-10 23:18 ` Leslie Rhorer
@ 2014-09-11 13:24 ` Greg Freemyer
  2014-09-12  7:06 ` Emmanuel Florac
  3 siblings, 1 reply; 35+ messages in thread
From: Greg Freemyer @ 2014-09-11 13:24 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Sean Caron, Leslie Rhorer, xfs@oss.sgi.com

On Wed, Sep 10, 2014 at 10:31 AM, Emmanuel Florac
<eflorac@intellique.com> wrote:
> On Tue, 09 Sep 2014 20:43:08 -0500,
> Leslie Rhorer <lrhorer@mygrande.net> wrote:
>
>> 	None of the failed drives were WD green. All three and the
>> previous four were Seagate. I realize that is not a large
>> statistical sample.
>>
>
> If you're interested in large statistical samples: out of a grand total
> of 4000 1 TB Seagate Barracuda ES2 drives, I had to replace 2100 over
> the course of 3 years. I still have a couple of hundred of these
> unfortunate pieces of crap in service, and they still represent the
> vast majority of unexpected RAID malfunctions, urgent replacements,
> late night calls and other "interesting side activities".
>
> I wouldn't buy anything labeled Seagate nowadays. Their drives have
> been the worst train wreck since the dreaded 9 GB Micropolis back in
> 1994 (or was it 1995?).

I buy about 100 drives a year, but I don't work them very hard. Just
lots of data to store, and I need to keep my data sets segregated for
legal reasons. I don't use RAID, just lots of individual disks, with
most data maintained redundantly.

About 4 years ago (or maybe 5), Seagate had a catastrophic drive
situation. I can remember buying a batch of 10 drives and having 8 of
them fail in the first 2 months. The bad part was they mostly survived
a 10-hour burn-in, so they tended to fail with real data on them. I
had one case (at a minimum) that summer where I put the data on 3
different Seagate drives and all 3 failed. Fortunately, I was able to
swap the disk controller card from one of the working drives into one
of the dead drives and recover the data.

Regardless, ignoring the summer of discontent, I find Seagate to be my
preferred drives.

FYI: In June I bought 30 or so WD Elements drives to try them out.
These are not the green drives, just bare-bones WD drives. None of
them were DOA, but 3 failed within 4 weeks, so a 10% failure rate in
the first month. Only one of them had unique data on it, so I had to
recreate that data. Fortunately, the source of the data was still
available. All of those drives have been pulled out of routine
service.

Greg

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Corrupted files
  2014-09-11 13:24 ` Greg Freemyer
@ 2014-09-12  7:06 ` Emmanuel Florac
  0 siblings, 0 replies; 35+ messages in thread
From: Emmanuel Florac @ 2014-09-12  7:06 UTC (permalink / raw)
To: Greg Freemyer; +Cc: Sean Caron, Leslie Rhorer, xfs@oss.sgi.com

On Thu, 11 Sep 2014 09:24:04 -0400, you wrote:

> Regardless, ignoring the summer of discontent, I find Seagate to be my
> preferred drives.

Nowadays I only buy HGST drives. The 3 TB models aren't as reliable as
the 1, 2, 4 and 6 TB ones, but generally speaking the failure rate is
extremely low (on the order of a few failures a year among several
thousand units).

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Corrupted files
  2014-09-10  1:12 ` Leslie Rhorer
  2014-09-10  1:25 ` Sean Caron
@ 2014-09-10  1:53 ` Dave Chinner
  2014-09-10  3:10 ` Leslie Rhorer
  1 sibling, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2014-09-10  1:53 UTC (permalink / raw)
To: Leslie Rhorer; +Cc: xfs

On Tue, Sep 09, 2014 at 08:12:38PM -0500, Leslie Rhorer wrote:
> On 9/9/2014 5:06 PM, Dave Chinner wrote:
> >Firstly, more information is required, namely versions and actual
> >error messages:
>
> 	Indubitably:
>
> RAID-Server:/# xfs_repair -V
> xfs_repair version 3.1.7
> RAID-Server:/# uname -r
> 3.2.0-4-amd64

Ok, so a relatively old xfs_repair. That's important - read on....

> 4.0 GHz FX-8350 eight core processor
>
> RAID-Server:/# cat /proc/meminfo /proc/mounts /proc/partitions
> MemTotal:        8099916 kB
....
> /dev/md0 /RAID xfs
> rw,relatime,attr2,delaylog,sunit=2048,swidth=12288,noquota 0 0

FWIW, you don't need sunit=2048,swidth=12288 in the mount options -
they are stored on disk and the mount options are only necessary to
change the on-disk values.

> Personalities : [raid6] [raid5] [raid4] [raid1] [raid0]
> md10 : active raid0 sdf[0] sde[2] sdg[1]
>       4395021312 blocks super 1.2 512k chunks
>
> md0 : active raid6 md10[12] sdc[13] sdk[10] sdj[11] sdi[15] sdh[8] sdd[9]
>       23441319936 blocks super 1.2 level 6, 1024k chunk, algorithm 2
>       [8/7] [UUU_UUUU]
>       bitmap: 29/30 pages [116KB], 65536KB chunk
>
> md3 : active (auto-read-only) raid1 sda3[0] sdb3[1]
>       12623744 blocks super 1.2 [3/2] [UU_]
>       bitmap: 1/1 pages [4KB], 65536KB chunk
>
> md2 : active raid1 sda2[0] sdb2[1]
>       112239488 blocks super 1.2 [3/2] [UU_]
>       bitmap: 1/1 pages [4KB], 65536KB chunk
>
> md1 : active raid1 sda1[0] sdb1[1]
>       96192 blocks [3/2] [UU_]
>       bitmap: 1/1 pages [4KB], 65536KB chunk
>
> unused devices: <none>
>
> 	Six of the drives are 4T spindles (a mixture of makes and models).
> The three drives comprising MD10 are WD 1.5T green drives. These
> are in place to take over the function of one of the kicked 4T
> drives. Md1, 2, and 3 are not data drives and are not suffering any
> issue.

Ok, that's creative. But when you need another drive in the array
and you don't have the right spares.... ;)

> 	I'm not sure what is meant by "write cache status" in this context.
> The machine has been rebooted more than once during recovery and the
> FS has been umounted and xfs_repair run several times.

Start here and read the next few entries:

http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F

> 	I don't know for what the acronym BBWC stands.

"battery backed write cache". If you're not using a hardware RAID
controller, it's unlikely you have one. The difference between a
drive write cache and a BBWC is that the BBWC is non-volatile - it
does not get lost when power drops.

> RAID-Server:/# xfs_info /dev/md0
> meta-data=/dev/md0           isize=256    agcount=43, agsize=137356288 blks
>          =                   sectsz=512   attr=2
> data     =                   bsize=4096   blocks=5860329984, imaxpct=5
>          =                   sunit=256    swidth=1536 blks
> naming   =version 2          bsize=4096   ascii-ci=0
> log      =internal           bsize=4096   blocks=521728, version=2
>          =                   sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none               extsz=4096   blocks=0, rtextents=0

Ok, that all looks pretty good, and the sunit/swidth match the mount
options you set so you definitely don't need the mount options...

> 	The system performs just fine, other than the aforementioned, with
> loads in excess of 3Gbps. That is internal only. The LAN link is
> only 1Gbps, so no external request exceeds about 950Mbps.

> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >
> > dmesg, in particular, should tell us what the corruption being
> > encountered is when stat fails.
>
> RAID-Server:/# ls "/RAID/DVD/Big Sleep, The (1945)/VIDEO_TS/VTS_01_1.VOB"
> ls: cannot access /RAID/DVD/Big Sleep, The (1945)/VIDEO_TS/VTS_01_1.VOB:
> Structure needs cleaning
> RAID-Server:/# dmesg | tail -n 30
> ...
> [192173.363981] XFS (md0): corrupt dinode 41006, extent total = 1, nblocks = 0.
> [192173.363988] ffff8802338b8e00: 49 4e 81 b6 02 02 00 00 00 00 03 e8 00 00 03 e8  IN..............
> [192173.363996] XFS (md0): Internal error xfs_iformat(1) at line 319 of
> file /build/linux-eKuxrT/linux-3.2.60/fs/xfs/xfs_inode.c. Caller
> 0xffffffffa0509318
> [192173.363999]
> [192173.364062] Pid: 10813, comm: ls Not tainted 3.2.0-4-amd64 #1 Debian 3.2.60-1+deb7u3
> [192173.364065] Call Trace:
> [192173.364097]  [<ffffffffa04d3731>] ? xfs_corruption_error+0x54/0x6f [xfs]
> [192173.364134]  [<ffffffffa0509318>] ? xfs_iread+0x9f/0x177 [xfs]
> [192173.364170]  [<ffffffffa0508efa>] ? xfs_iformat+0xe3/0x462 [xfs]
> [192173.364204]  [<ffffffffa0509318>] ? xfs_iread+0x9f/0x177 [xfs]
> [192173.364240]  [<ffffffffa0509318>] ? xfs_iread+0x9f/0x177 [xfs]
> [192173.364268]  [<ffffffffa04d6ebe>] ? xfs_iget+0x37c/0x56c [xfs]
> [192173.364300]  [<ffffffffa04e13b4>] ? xfs_lookup+0xa4/0xd3 [xfs]
> [192173.364328]  [<ffffffffa04d9e5a>] ? xfs_vn_lookup+0x3f/0x7e [xfs]
> [192173.364344]  [<ffffffff81102de9>] ? d_alloc_and_lookup+0x3a/0x60
> [192173.364357]  [<ffffffff8110388d>] ? walk_component+0x219/0x406
> [192173.364370]  [<ffffffff81104721>] ? path_lookupat+0x7c/0x2bd
> [192173.364383]  [<ffffffff81036628>] ? should_resched+0x5/0x23
> [192173.364396]  [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c
> [192173.364408]  [<ffffffff8110497e>] ? do_path_lookup+0x1c/0x87
> [192173.364420]  [<ffffffff81106407>] ? user_path_at_empty+0x47/0x7b
> [192173.364434]  [<ffffffff813533d8>] ? do_page_fault+0x30a/0x345
> [192173.364448]  [<ffffffff810d6a04>] ? mmap_region+0x353/0x44a
> [192173.364460]  [<ffffffff810fe45a>] ? vfs_fstatat+0x32/0x60
> [192173.364471]  [<ffffffff810fe590>] ? sys_newstat+0x12/0x2b
> [192173.364483]  [<ffffffff813509f5>] ? page_fault+0x25/0x30
> [192173.364495]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
> [192173.364503] XFS (md0): Corruption detected. Unmount and run xfs_repair
>
> 	That last line, by the way, is why I ran umount and xfs_repair.

Right, that's the correct thing to do, but sometimes there are
issues that repair doesn't handle properly. This *was* one of them,
and it was fixed by commit e1f43b4 ("repair: update extent count
after zapping duplicate blocks") which was added to xfs_repair
v3.1.8.

IOWs, upgrading xfsprogs to the latest release and re-running
xfs_repair should fix this error.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
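For anyone following along: the drive write cache state Dave is asking
about can be inspected from Linux. A minimal sketch, assuming ATA
drives and the hdparm utility; /dev/sdX is a placeholder, and SAS/SCSI
drives would need sdparm or a vendor tool instead:

    hdparm -W /dev/sdX      # report whether the volatile write cache is enabled
    hdparm -W 0 /dev/sdX    # disable it -- the safe setting without a BBWC
    hdparm -W 1 /dev/sdX    # re-enable it

With no battery behind the cache, anything still sitting in it is lost
on a power failure, which is exactly the failure mode the FAQ entry
above describes.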
* Re: Corrupted files
  2014-09-10  1:53 ` Dave Chinner
@ 2014-09-10  3:10 ` Leslie Rhorer
  2014-09-10  3:33 ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10  3:10 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 9/9/2014 8:53 PM, Dave Chinner wrote:
> On Tue, Sep 09, 2014 at 08:12:38PM -0500, Leslie Rhorer wrote:
>> On 9/9/2014 5:06 PM, Dave Chinner wrote:
>>> Firstly, more information is required, namely versions and actual
>>> error messages:
>>
>> Indubitably:
>>
>> RAID-Server:/# xfs_repair -V
>> xfs_repair version 3.1.7
>> RAID-Server:/# uname -r
>> 3.2.0-4-amd64
>
> Ok, so a relatively old xfs_repair. That's important - read on....

	OK, a good reason is a good reason.

>> 4.0 GHz FX-8350 eight core processor
>>
>> RAID-Server:/# cat /proc/meminfo /proc/mounts /proc/partitions
>> MemTotal:        8099916 kB
> ....
>> /dev/md0 /RAID xfs
>> rw,relatime,attr2,delaylog,sunit=2048,swidth=12288,noquota 0 0
>
> FWIW, you don't need sunit=2048,swidth=12288 in the mount options -
> they are stored on disk and the mount options are only necessary to
> change the on-disk values.

	They aren't. Those were created automatically, whether at creation
time or at mount time, I don't know, but the filesystem was created
with

mkfs.xfs /dev/md0

and fstab contains:

/dev/md0        /RAID   xfs     rw      1       2

>> 	Six of the drives are 4T spindles (a mixture of makes and models).
>> The three drives comprising MD10 are WD 1.5T green drives. These
>> are in place to take over the function of one of the kicked 4T
>> drives. Md1, 2, and 3 are not data drives and are not suffering any
>> issue.
>
> Ok, that's creative. But when you need another drive in the array
> and you don't have the right spares.... ;)

	Yes, but I wasn't really expecting to need 3 spares this soon or
suddenly. These are fairly new drives, and with 33% of the array being
parity, the sudden need for 3 extra drives just is not too likely.
That, plus I have quite a few 1.5 and 1.0T drives lying around in case
of sudden emergency. This isn't the first time I've replaced a single
drive temporarily with a RAID0. The performance is actually better, of
course, and for the 3 or 4 days it takes to get a new drive, it's
really not an issue. Since I have a full online backup system plus a
regularly updated off-site backup, the risk is quite minimal. This is
an exercise in mild inconvenience, not an emergency failure. If this
were a commercial system, it would be another matter, but I know for a
fact there are a very large number of home NAS solutions in place that
are less robust than this one. I personally know quite a few people
who never do backups, at all.

>> 	I'm not sure what is meant by "write cache status" in this context.
>> The machine has been rebooted more than once during recovery and the
>> FS has been umounted and xfs_repair run several times.
>
> Start here and read the next few entries:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F

	I knew that, but I still don't see the relevance in this context.
There is no battery backup on the drive controller or the drives, and
the drives have all been powered down and back up several times.
Anything in any cache right now would be from some operation in the
last few minutes, not four days ago.

>> 	I don't know for what the acronym BBWC stands.
>
> "battery backed write cache". If you're not using a hardware RAID
> controller, it's unlikely you have one.

	See my previous. I do have one (a 3Ware 9650E, given to me by a
friend when his company switched to zfs for their server). It's not on
this system. This array is on a HighPoint RocketRAID 2722.

> The difference between a
> drive write cache and a BBWC is that the BBWC is non-volatile - it
> does not get lost when power drops.

	Yeah, I'm aware, thanks. I just didn't cotton to the acronym.

>> RAID-Server:/# xfs_info /dev/md0
>> meta-data=/dev/md0           isize=256    agcount=43, agsize=137356288 blks
>>          =                   sectsz=512   attr=2
>> data     =                   bsize=4096   blocks=5860329984, imaxpct=5
>>          =                   sunit=256    swidth=1536 blks
>> naming   =version 2          bsize=4096   ascii-ci=0
>> log      =internal           bsize=4096   blocks=521728, version=2
>>          =                   sectsz=512   sunit=8 blks, lazy-count=1
>> realtime =none               extsz=4096   blocks=0, rtextents=0
>
> Ok, that all looks pretty good, and the sunit/swidth match the mount
> options you set so you definitely don't need the mount options...

	Yeah, I didn't set them. What did, I don't really know for certain.
See above.

>> [192173.364460]  [<ffffffff810fe45a>] ? vfs_fstatat+0x32/0x60
>> [192173.364471]  [<ffffffff810fe590>] ? sys_newstat+0x12/0x2b
>> [192173.364483]  [<ffffffff813509f5>] ? page_fault+0x25/0x30
>> [192173.364495]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
>> [192173.364503] XFS (md0): Corruption detected. Unmount and run xfs_repair
>>
>> 	That last line, by the way, is why I ran umount and xfs_repair.
>
> Right, that's the correct thing to do, but sometimes there are
> issues that repair doesn't handle properly. This *was* one of them,
> and it was fixed by commit e1f43b4 ("repair: update extent count
> after zapping duplicate blocks") which was added to xfs_repair
> v3.1.8.
>
> IOWs, upgrading xfsprogs to the latest release and re-running
> xfs_repair should fix this error.

	OK. I'll scarf the source and compile. All I need is to git clone
git://oss.sgi.com/xfs/xfs and git://oss.sgi.com/xfs/cmds/xfsprogs,
right?

	I've never used git on a package maintained in my distro. Will I
have issues when I upgrade to Debian Jessie in a few months, since
this is not being managed by apt / dpkg? It looks like Jessie has
3.2.1 of xfsprogs.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Corrupted files
  2014-09-10  3:10 ` Leslie Rhorer
@ 2014-09-10  3:33 ` Dave Chinner
  2014-09-10  4:14 ` Leslie Rhorer
  2014-09-10  4:51 ` Leslie Rhorer
  0 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2014-09-10  3:33 UTC (permalink / raw)
To: Leslie Rhorer; +Cc: xfs

On Tue, Sep 09, 2014 at 10:10:45PM -0500, Leslie Rhorer wrote:
> On 9/9/2014 8:53 PM, Dave Chinner wrote:
> >On Tue, Sep 09, 2014 at 08:12:38PM -0500, Leslie Rhorer wrote:
> >>On 9/9/2014 5:06 PM, Dave Chinner wrote:
> >>>Firstly, more information is required, namely versions and actual
> >>>error messages:
> >>
> >> Indubitably:
> >>
> >>RAID-Server:/# xfs_repair -V
> >>xfs_repair version 3.1.7
> >>RAID-Server:/# uname -r
> >>3.2.0-4-amd64
> >
> >Ok, so a relatively old xfs_repair. That's important - read on....
>
> OK, a good reason is a good reason.
>
> >>4.0 GHz FX-8350 eight core processor
> >>
> >>RAID-Server:/# cat /proc/meminfo /proc/mounts /proc/partitions
> >>MemTotal:        8099916 kB
> >....
> >>/dev/md0 /RAID xfs
> >>rw,relatime,attr2,delaylog,sunit=2048,swidth=12288,noquota 0 0
> >
> >FWIW, you don't need sunit=2048,swidth=12288 in the mount options -
> >they are stored on disk and the mount options are only necessary to
> >change the on-disk values.
>
> They aren't. Those were created automatically, whether at creation
> time or at mount time, I don't know, but the filesystem was created
> with

Ah, my mistake. Normally it's only mount options in that code - I
forgot that we report sunit/swidth unconditionally if it is set in
the superblock.

> >> I'm not sure what is meant by "write cache status" in this context.
> >>The machine has been rebooted more than once during recovery and the
> >>FS has been umounted and xfs_repair run several times.
> >
> >Start here and read the next few entries:
> >
> >http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F
>
> I knew that, but I still don't see the relevance in this context.
> There is no battery backup on the drive controller or the drives,
> and the drives have all been powered down and back up several times.
> Anything in any cache right now would be from some operation in the
> last few minutes, not four days ago.

There is no direct relevance to your situation, but for a lot of
other common problems it definitely is. That's why we ask people to
report it with all the other information about their system.

> >> I don't know for what the acronym BBWC stands.
> >
> >"battery backed write cache". If you're not using a hardware RAID
> >controller, it's unlikely you have one.
>
> See my previous. I do have one (a 3Ware 9650E, given to me by a
> friend when his company switched to zfs for their server). It's not
> on this system. This array is on a HighPoint RocketRAID 2722.

Ok. We have seen over time that those 3ware controllers can do
strange things in error conditions - we've had reports of entire
hardware luns dying and being completely unrecoverable after a
disk was kicked out due to an error. I can't comment on the
highpoint controller - either not many people use them or they just
don't report problems if they do. Either way, if you aren't running
the latest firmware I'd suggest updating it, as these problems were
typically fixed by newer firmware releases.

> >>[192173.364460]  [<ffffffff810fe45a>] ? vfs_fstatat+0x32/0x60
> >>[192173.364471]  [<ffffffff810fe590>] ? sys_newstat+0x12/0x2b
> >>[192173.364483]  [<ffffffff813509f5>] ? page_fault+0x25/0x30
> >>[192173.364495]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
> >>[192173.364503] XFS (md0): Corruption detected. Unmount and run xfs_repair
> >>
> >> That last line, by the way, is why I ran umount and xfs_repair.
> >
> >Right, that's the correct thing to do, but sometimes there are
> >issues that repair doesn't handle properly. This *was* one of them,
> >and it was fixed by commit e1f43b4 ("repair: update extent count
> >after zapping duplicate blocks") which was added to xfs_repair
> >v3.1.8.
> >
> >IOWs, upgrading xfsprogs to the latest release and re-running
> >xfs_repair should fix this error.
>
> OK. I'll scarf the source and compile. All I need is to git clone
> git://oss.sgi.com/xfs/xfs and git://oss.sgi.com/xfs/cmds/xfsprogs,
> right?

Just clone git://oss.sgi.com/xfs/cmds/xfsprogs and check out the
v3.2.1 tag and build that.

> I've never used git on a package maintained in my distro. Will I
> have issues when I upgrade to Debian Jessie in a few months, since
> this is not being managed by apt / dpkg? It looks like Jessie has
> 3.2.1 of xfsprogs.

If you're using Debian you can build Debian packages directly from
the git tree via "make deb" (I use it all the time for pushing
new builds to my test machines), and so when you upgrade to Jessie it
should just replace your custom-built package correctly...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
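Putting Dave's instructions together, the whole sequence is roughly the
following. This is a sketch assuming a Debian box with git and the
usual build dependencies (build-essential, autotools and friends)
already installed; the repository URL and the v3.2.1 tag are the ones
given above:

    git clone git://oss.sgi.com/xfs/cmds/xfsprogs
    cd xfsprogs
    git checkout v3.2.1
    make          # builds the tools inside the source tree
    make deb      # or, on Debian, roll .deb packages instead

The "make deb" target is the one Dave refers to; it produces packages
that dpkg/apt can track, which is why a later distro upgrade replaces
them cleanly.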
* Re: Corrupted files
  2014-09-10  3:33 ` Dave Chinner
@ 2014-09-10  4:14 ` Leslie Rhorer
  2014-09-10  4:22 ` Leslie Rhorer
  1 sibling, 1 reply; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10  4:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 9/9/2014 10:33 PM, Dave Chinner wrote:
> There is no direct relevance to your situation, but for a lot of
> other common problems it definitely is. That's why we ask people to
> report it with all the other information about their system.

	Yeah, understood.

> Ok. We have seen over time that those 3ware controllers can do
> strange things in error conditions - we've had reports of entire
> hardware luns dying and being completely unrecoverable after a
> disk was kicked out due to an error.

	Oof. That's not good. It's stable right now. I'm considering a
different controller at some point. I may accelerate that process.

> I can't comment on the
> highpoint controller - either not many people use them or they just
> don't report problems if they do. Either way, if you aren't running
> the latest firmware I'd suggest updating it, as these problems were
> typically fixed by newer firmware releases.

	As a matter of fact, I was going to do just that. I have to reboot
the system in DOS (of all things), since they don't have a linux
loader. I've got to arrange a convenient time.

>> OK. I'll scarf the source and compile. All I need is to git clone
>> git://oss.sgi.com/xfs/xfs and git://oss.sgi.com/xfs/cmds/xfsprogs,
>> right?
>
> Just clone git://oss.sgi.com/xfs/cmds/xfsprogs and check out the
> v3.2.1 tag and build that.

	OK, I'm doing something wrong, I think. It's been over a decade
since I compiled a kernel. It makes me a little nervous.

>> I've never used git on a package maintained in my distro. Will I
>> have issues when I upgrade to Debian Jessie in a few months, since
>> this is not being managed by apt / dpkg? It looks like Jessie has
>> 3.2.1 of xfsprogs.
>
> If you're using Debian you can build Debian packages directly from
> the git tree via "make deb" (I use it all the time for pushing

	Um, is that make deb-pkg, perhaps? I'm not seeing a "deb" in the
package targets.

> new builds to my test machines), and so when you upgrade to Jessie it
> should just replace your custom-built package correctly...

	`make deb` finds no install target. If I run `make menuconfig` it
complains about there being no ncurses. Libncurses5 is installed, and I
don't know what else I should get. `make oldconfig` seems to work. Am I
headed in the right direction? There are quite a few configuration
targets, and I am not sure which one to choose. There are also a number
of questions asked by the oldconfig target (and presumably the same for
other config targets), and I'm unsure how to answer. I definitely don't
want to make an error and potentially wind up with an unbootable
system.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Corrupted files
  2014-09-10  4:14 ` Leslie Rhorer
@ 2014-09-10  4:22 ` Leslie Rhorer
  2014-09-10 14:34 ` Emmanuel Florac
  0 siblings, 1 reply; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10  4:22 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

>>> OK. I'll scarf the source and compile. All I need is to git clone
>>> git://oss.sgi.com/xfs/xfs and git://oss.sgi.com/xfs/cmds/xfsprogs,
>>> right?
>>
>> Just clone git://oss.sgi.com/xfs/cmds/xfsprogs and check out the
>> v3.2.1 tag and build that.

	Oops! Hold on. I didn't read that closely enough. You were saying I
only need to compile xfsprogs. That's working.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Corrupted files
  2014-09-10  4:22 ` Leslie Rhorer
@ 2014-09-10 14:34 ` Emmanuel Florac
  0 siblings, 0 replies; 35+ messages in thread
From: Emmanuel Florac @ 2014-09-10 14:34 UTC (permalink / raw)
To: Leslie Rhorer; +Cc: xfs

On Tue, 09 Sep 2014 23:22:03 -0500,
Leslie Rhorer <lrhorer@mygrande.net> wrote:

> 	Oops! Hold on. I didn't read that closely enough. You were saying I
> only need to compile xfsprogs. That's working.
>

You don't need to install the resulting binaries either. xfs_repair
will happily run from the source directory:

./xfs_repair /dev/blah ...

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
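Concretely, the freshly built binary sits in the repair/ subdirectory
of the xfsprogs source tree. A cautious sequence, assuming /dev/md0 is
the filesystem from earlier in the thread and it is unmounted first:

    umount /RAID
    cd xfsprogs/repair
    ./xfs_repair -n /dev/md0    # -n: no-modify mode, only reports what would be fixed
    ./xfs_repair /dev/md0       # then the real repair

The -n dry run is a cheap way to confirm the new version recognizes
the damage before letting it write anything to the device.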
* Re: Corrupted files
  2014-09-10  3:33 ` Dave Chinner
  2014-09-10  4:14 ` Leslie Rhorer
@ 2014-09-10  4:51 ` Leslie Rhorer
  2014-09-10  5:23 ` Dave Chinner
  1 sibling, 1 reply; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-10  4:51 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 9/9/2014 10:33 PM, Dave Chinner wrote:
> On Tue, Sep 09, 2014 at 10:10:45PM -0500, Leslie Rhorer wrote:
>> On 9/9/2014 8:53 PM, Dave Chinner wrote:
>>> On Tue, Sep 09, 2014 at 08:12:38PM -0500, Leslie Rhorer wrote:
>>>> On 9/9/2014 5:06 PM, Dave Chinner wrote:
>>>>> Firstly, more information is required, namely versions and actual
>>>>> error messages:
>>>>
>>>> Indubitably:
>>>>
>>>> RAID-Server:/# xfs_repair -V
>>>> xfs_repair version 3.1.7
>>>> RAID-Server:/# uname -r
>>>> 3.2.0-4-amd64
>>>
>>> Ok, so a relatively old xfs_repair. That's important - read on....
>>
>> OK, a good reason is a good reason.
>>
>>>> 4.0 GHz FX-8350 eight core processor
>>>>
>>>> RAID-Server:/# cat /proc/meminfo /proc/mounts /proc/partitions
>>>> MemTotal:        8099916 kB
>>> ....
>>>> /dev/md0 /RAID xfs
>>>> rw,relatime,attr2,delaylog,sunit=2048,swidth=12288,noquota 0 0
>>>
>>> FWIW, you don't need sunit=2048,swidth=12288 in the mount options -
>>> they are stored on disk and the mount options are only necessary to
>>> change the on-disk values.
>>
>> They aren't. Those were created automatically, whether at creation
>> time or at mount time, I don't know, but the filesystem was created
>> with
>
> Ah, my mistake. Normally it's only mount options in that code - I
> forgot that we report sunit/swidth unconditionally if it is set in
> the superblock.
>
>>>> I'm not sure what is meant by "write cache status" in this context.
>>>> The machine has been rebooted more than once during recovery and the
>>>> FS has been umounted and xfs_repair run several times.
>>>
>>> Start here and read the next few entries:
>>>
>>> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F
>>
>> I knew that, but I still don't see the relevance in this context.
>> There is no battery backup on the drive controller or the drives,
>> and the drives have all been powered down and back up several times.
>> Anything in any cache right now would be from some operation in the
>> last few minutes, not four days ago.
>
> There is no direct relevance to your situation, but for a lot of
> other common problems it definitely is. That's why we ask people to
> report it with all the other information about their system.
>
>>>> I don't know for what the acronym BBWC stands.
>>>
>>> "battery backed write cache". If you're not using a hardware RAID
>>> controller, it's unlikely you have one.
>>
>> See my previous. I do have one (a 3Ware 9650E, given to me by a
>> friend when his company switched to zfs for their server). It's not
>> on this system. This array is on a HighPoint RocketRAID 2722.
>
> Ok. We have seen over time that those 3ware controllers can do
> strange things in error conditions - we've had reports of entire
> hardware luns dying and being completely unrecoverable after a
> disk was kicked out due to an error. I can't comment on the
> highpoint controller - either not many people use them or they just
> don't report problems if they do. Either way, if you aren't running
> the latest firmware I'd suggest updating it, as these problems were
> typically fixed by newer firmware releases.
>
>>>> [192173.364460]  [<ffffffff810fe45a>] ? vfs_fstatat+0x32/0x60
>>>> [192173.364471]  [<ffffffff810fe590>] ? sys_newstat+0x12/0x2b
>>>> [192173.364483]  [<ffffffff813509f5>] ? page_fault+0x25/0x30
>>>> [192173.364495]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
>>>> [192173.364503] XFS (md0): Corruption detected. Unmount and run xfs_repair
>>>>
>>>> That last line, by the way, is why I ran umount and xfs_repair.
>>>
>>> Right, that's the correct thing to do, but sometimes there are
>>> issues that repair doesn't handle properly. This *was* one of them,
>>> and it was fixed by commit e1f43b4 ("repair: update extent count
>>> after zapping duplicate blocks") which was added to xfs_repair
>>> v3.1.8.
>>>
>>> IOWs, upgrading xfsprogs to the latest release and re-running
>>> xfs_repair should fix this error.
>>
>> OK. I'll scarf the source and compile. All I need is to git clone
>> git://oss.sgi.com/xfs/xfs and git://oss.sgi.com/xfs/cmds/xfsprogs,
>> right?
>
> Just clone git://oss.sgi.com/xfs/cmds/xfsprogs and check out the
> v3.2.1 tag and build that.
>
>> I've never used git on a package maintained in my distro. Will I
>> have issues when I upgrade to Debian Jessie in a few months, since
>> this is not being managed by apt / dpkg? It looks like Jessie has
>> 3.2.1 of xfsprogs.
>
> If you're using Debian you can build Debian packages directly from
> the git tree via "make deb" (I use it all the time for pushing
> new builds to my test machines), and so when you upgrade to Jessie it
> should just replace your custom-built package correctly...
>
> Cheers,
>
> Dave.

	Thanks a ton, Dave (and everyone else who helped). That seems to
have worked just fine. The three grunged entries are gone and the
system is happily copying over the backups. Now I'll run another rsync
with checksum to make sure everything is good before putting the
backup into production. I'm also going to upgrade the controller BIOS
just in case.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
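The checksum compare Leslie describes can be done as an rsync dry run,
so nothing on either side is modified. A sketch -- /backup/RAID is a
placeholder for the backup mount point; /RAID is the restored array as
above:

    rsync -avcn /backup/RAID/ /RAID/

The -c flag forces full-content checksums instead of the default
size/mtime comparison, and -n (dry run) lists any differing or missing
files without transferring them.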
* Re: Corrupted files
  2014-09-10  4:51 ` Leslie Rhorer
@ 2014-09-10  5:23 ` Dave Chinner
  2014-09-11  5:47 ` Leslie Rhorer
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2014-09-10  5:23 UTC (permalink / raw)
To: Leslie Rhorer; +Cc: xfs

On Tue, Sep 09, 2014 at 11:51:42PM -0500, Leslie Rhorer wrote:
> On 9/9/2014 10:33 PM, Dave Chinner wrote:
> >On Tue, Sep 09, 2014 at 10:10:45PM -0500, Leslie Rhorer wrote:
> >>On 9/9/2014 8:53 PM, Dave Chinner wrote:
> >>>On Tue, Sep 09, 2014 at 08:12:38PM -0500, Leslie Rhorer wrote:
> >>>>On 9/9/2014 5:06 PM, Dave Chinner wrote:
> >> I've never used git on a package maintained in my distro. Will I
> >>have issues when I upgrade to Debian Jessie in a few months, since
> >>this is not being managed by apt / dpkg? It looks like Jessie has
> >>3.2.1 of xfsprogs.
> >
> >If you're using Debian you can build Debian packages directly from
> >the git tree via "make deb" (I use it all the time for pushing
> >new builds to my test machines), and so when you upgrade to Jessie it
> >should just replace your custom-built package correctly...
>
> Thanks a ton, Dave (and everyone else who helped). That seems to
> have worked just fine. The three grunged entries are gone and the
> system is happily copying over the backups. Now I'll run another
> rsync with checksum to make sure everything is good before putting
> the backup into production. I'm also going to upgrade the
> controller BIOS just in case.

Good to hear. Hopefully everything will check out. Just yell if you
need more help. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Corrupted files
  2014-09-10  5:23 ` Dave Chinner
@ 2014-09-11  5:47 ` Leslie Rhorer
  0 siblings, 0 replies; 35+ messages in thread
From: Leslie Rhorer @ 2014-09-11  5:47 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On 9/10/2014 12:23 AM, Dave Chinner wrote:
> On Tue, Sep 09, 2014 at 11:51:42PM -0500, Leslie Rhorer wrote:
>> On 9/9/2014 10:33 PM, Dave Chinner wrote:
>>> On Tue, Sep 09, 2014 at 10:10:45PM -0500, Leslie Rhorer wrote:
>>>> On 9/9/2014 8:53 PM, Dave Chinner wrote:
>>>>> On Tue, Sep 09, 2014 at 08:12:38PM -0500, Leslie Rhorer wrote:
>>>>>> On 9/9/2014 5:06 PM, Dave Chinner wrote:
>>>> I've never used git on a package maintained in my distro. Will I
>>>> have issues when I upgrade to Debian Jessie in a few months, since
>>>> this is not being managed by apt / dpkg? It looks like Jessie has
>>>> 3.2.1 of xfsprogs.
>>>
>>> If you're using Debian you can build Debian packages directly from
>>> the git tree via "make deb" (I use it all the time for pushing
>>> new builds to my test machines), and so when you upgrade to Jessie it
>>> should just replace your custom-built package correctly...
>>
>> Thanks a ton, Dave (and everyone else who helped). That seems to
>> have worked just fine. The three grunged entries are gone and the
>> system is happily copying over the backups. Now I'll run another
>> rsync with checksum to make sure everything is good before putting
>> the backup into production. I'm also going to upgrade the
>> controller BIOS just in case.
>
> Good to hear. Hopefully everything will check out. Just yell if you
> need more help. ;)

	Thanks. The rsync compare just finished on the non-volatile areas
of the file system without a single mismatch and no missing files.
That's good enough for me.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 35+ messages in thread
end of thread, other threads:[~2014-09-12  7:06 UTC | newest]

Thread overview: 35+ messages
2014-09-09 15:21 Corrupted files Leslie Rhorer
2014-09-09 15:50 ` Sean Caron
2014-09-09 16:03 ` Sean Caron
2014-09-09 22:24 ` Eric Sandeen
2014-09-09 22:57 ` Sean Caron
2014-09-10  1:00 ` Roger Willcocks
2014-09-10  1:23 ` Leslie Rhorer
2014-09-10  5:09 ` Eric Sandeen
2014-09-10  0:48 ` Leslie Rhorer
2014-09-10  1:10 ` Roger Willcocks
2014-09-10  1:31 ` Leslie Rhorer
2014-09-10 14:24 ` Emmanuel Florac
2014-09-10 14:49 ` Sean Caron
2014-09-09 16:08 ` Emmanuel Florac
2014-09-09 22:06 ` Dave Chinner
2014-09-10  1:12 ` Leslie Rhorer
2014-09-10  1:25 ` Sean Caron
2014-09-10  1:43 ` Leslie Rhorer
2014-09-10 14:31 ` Emmanuel Florac
2014-09-10 14:52 ` Grozdan
2014-09-10 15:12 ` Emmanuel Florac
2014-09-10 15:32 ` Grozdan
2014-09-10 14:54 ` Sean Caron
2014-09-10 23:18 ` Leslie Rhorer
2014-09-11 13:24 ` Greg Freemyer
2014-09-12  7:06 ` Emmanuel Florac
2014-09-10  1:53 ` Dave Chinner
2014-09-10  3:10 ` Leslie Rhorer
2014-09-10  3:33 ` Dave Chinner
2014-09-10  4:14 ` Leslie Rhorer
2014-09-10  4:22 ` Leslie Rhorer
2014-09-10 14:34 ` Emmanuel Florac
2014-09-10  4:51 ` Leslie Rhorer
2014-09-10  5:23 ` Dave Chinner
2014-09-11  5:47 ` Leslie Rhorer