* XFS filesystem corruption
@ 2013-03-06 15:08 Julien FERRERO
From: Julien FERRERO @ 2013-03-06 15:08 UTC (permalink / raw)
To: xfs
Hi,
I am migrating a video streaming server from Linux kernel 2.6.18 (an
8-year-old kernel...) to 2.6.35 (a 2-year-old one...). Unfortunately,
I don't have a choice of kernel version, since some proprietary
external modules require this specific version.
We use XFS for the filesystem, and the layout is as follows:
H/W RAID 5 (/dev/sda) > mdadm linear RAID (/dev/md0) > XFS filesystem
(/mountpoint).
The allocated size of the fs is 1.5 TB.
Since we migrated to 2.6.35, we have started to experience very rare
and random filesystem corruption. Some files or directories suddenly
become inaccessible. For instance, the /bin/ls command returns:
?????????? ? ? ? ? ? 4988d60d-2ee5-4ee6-9a16-6f7f5f28f412.xml
and I cannot open the file (No such file or directory).
I had a look at the FAQ and tried remounting the fs with the
"inode64" option, but it did not change anything; I get the exact
same result.
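
Roughly what was attempted, for reference; the mountpoint name is
illustrative:

# mount -o remount,inode64 /mountpoint
# grep /mountpoint /proc/mounts    # verify the option took effect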
If I run "xfs_repair" on my system, I get the following output:
--------------8<--------------8<--------------
# xfs_repair -n /dev/md0
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- scan filesystem freespace and inode maps...
agi unlinked bucket 62 is 190 in ag 1 (inode=134217918)
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
b52feb70: Badness in key lookup (length)
bp=(bno 60387336, len 16384 bytes) key=(bno 60387336, len 8192 bytes)
- agno = 1
imap claims a free inode 134217858 is in use, would correct imap and clear inode
imap claims a free inode 134217859 is in use, would correct imap and clear inode
imap claims a free inode 134217860 is in use, would correct imap and clear inode
imap claims a free inode 134217863 is in use, would correct imap and clear inode
imap claims a free inode 134217864 is in use, would correct imap and clear inode
imap claims a free inode 134217866 is in use, would correct imap and clear inode
imap claims a free inode 134217867 is in use, would correct imap and clear inode
imap claims a free inode 134217869 is in use, would correct imap and clear inode
imap claims a free inode 134217915 is in use, would correct imap and clear inode
imap claims a free inode 134217916 is in use, would correct imap and clear inode
imap claims a free inode 140493888 is in use, would correct imap and clear inode
imap claims a free inode 140493894 is in use, would correct imap and clear inode
imap claims a free inode 140493896 is in use, would correct imap and clear inode
imap claims a free inode 140493897 is in use, would correct imap and clear inode
imap claims a free inode 140493898 is in use, would correct imap and clear inode
imap claims a free inode 140493899 is in use, would correct imap and clear inode
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
- agno = 31
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
entry "10fdb8cd-b48a-4d2a-8ff4-19516e6a3b06.xml" at block 0 offset 544
in directory inode 134217856 references free inode 140493896
would clear inode number in entry at offset 544...
entry "9e6727ff-9fd6-466a-aa30-c7aabdd67646.xml" at block 0 offset 600
in directory inode 134217856 references free inode 140493898
entry "tmp" at block 0 offset 112 in directory inode 128 references
free inode 140493888
would clear inode number in entry at offset 600...
would clear inode number in entry at offset 112...
entry "5ff59379-e982-4d4e-b87a-cb194ea6cfd8.xml" at block 0 offset 632
in directory inode 134217856 references free inode 140493899
entry "tmp" at block 0 offset 3984 in directory inode 128 references
free inode 135
would clear inode number in entry at offset 632...
would clear inode number in entry at offset 3984...
entry "b8078379-d8ee-4af0-9ed4-2c94479a7a51.xml" in shortform
directory 131 references free inode 135
would have junked entry "b8078379-d8ee-4af0-9ed4-2c94479a7a51.xml" in
directory inode 131
entry "4988d60d-2ee5-4ee6-9a16-6f7f5f28f412.xml" in shortform
directory 131 references free inode 135
would have junked entry "4988d60d-2ee5-4ee6-9a16-6f7f5f28f412.xml" in
directory inode 131
- agno = 2
entry "87280c00-3b60-46ec-9d65-937db364a7b9" at block 2 offset 16 in
directory inode 268435584 references free inode 135
would clear inode number in entry at offset 16...
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- agno = 10
- agno = 11
- agno = 12
- agno = 13
- agno = 14
- agno = 15
- agno = 16
- agno = 17
- agno = 18
- agno = 19
- agno = 20
- agno = 21
- agno = 22
- agno = 23
- agno = 24
- agno = 25
- agno = 26
- agno = 27
- agno = 28
- agno = 29
- agno = 30
entry "up2" in shortform directory 4026531968 references free inode 135
would have junked entry "up2" in directory inode 4026531968
- agno = 31
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
entry "tmp" in directory inode 128 points to free inode 140493888,
would junk entry
entry "tmp" in directory inode 128 points to free inode 135, would junk entry
bad hash table for directory inode 128 (no data entry): would rebuild
entry "56b3f51e-4912-4e43-99ed-2204aa8a68f2.xml" in shortform
directory inode 131 points to free inode 135would junk entry
entry "4988d60d-2ee5-4ee6-9a16-6f7f5f28f412.xml" in shortform
directory inode 131 points to free inode 135would junk entry
entry "ca073ec3-5d59-4306-a8a6-67c2e0d79c81.xml" in directory inode
134217856 points to free inode 140493896, would junk entry
entry "ea5dd270-06a0-4e25-8cbf-0a37b2dad755.xml" in directory inode
134217856 points to free inode 140493898, would junk entry
entry "dd745300-48fa-46e5-b5c7-a4ba5e820353.xml" in directory inode
134217856 points to free inode 140493899, would junk entry
entry "3a092246-f8ea-4cb6-9758-f0d73253f368.xml" in dir 134217856
points to an already connected directory inode 140493909
would clear entry "3a092246-f8ea-4cb6-9758-f0d73253f368.xml"
bad hash table for directory inode 134217856 (no data entry): would rebuild
entry "87280c00-3b60-46ec-9d65-937db364a7b9" in directory inode
268435584 points to free inode 135, would junk entry
bad hash table for directory inode 268435584 (no data entry): would rebuild
entry "up2" in shortform directory inode 4026531968 points to free
inode 135would junk entry
- traversal finished ...
- moving disconnected inodes to lost+found ...
disconnected dir inode 134217861, would move to lost+found
disconnected inode 134217865, would move to lost+found
disconnected inode 134217868, would move to lost+found
disconnected inode 134217918, would move to lost+found
disconnected inode 134217919, would move to lost+found
disconnected inode 140493900, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 128 nlinks from 12 to 8
would have reset inode 134217918 nlinks from -1 to 1
would have reset inode 268435584 nlinks from 192 to 191
would have reset inode 4026531968 nlinks from 4 to 3
No modify flag set, skipping filesystem flush and exiting.
--------------8<--------------8<--------------
The filesystem was originally created with the command:
# mkfs.xfs -f -l size=32m /dev/md0
and the mount options in fstab are "defaults" (rw,relatime,attr2,noquota).
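
For completeness, the effective mount options and filesystem geometry
can be confirmed on a running unit like this (output omitted here):

# xfs_info /mountpoint
# grep /mountpoint /proc/mounts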
We know the problem is not related to the H/W RAID: we also have a
unit with a corrupted fs on a single drive (though the linear RAID
layer is still present there).
I am totally stuck, and I really don't know how to reproduce the
corruption. I only know that units are routinely power-cycled by
operators while the fs is still mounted (no proper shutdown / reboot).
My guess is that the fs journal should handle this case and avoid such
corruption.
Any help would be appreciated.
Thank you.
-- Julian
* Re: XFS filesystem corruption
From: Emmanuel Florac @ 2013-03-06 15:15 UTC (permalink / raw)
To: Julien FERRERO; +Cc: xfs

On Wed, 6 Mar 2013 16:08:59 +0100, you wrote:

> I am totally stuck, and I really don't know how to reproduce the
> corruption. I only know that units are routinely power-cycled by
> operators while the fs is still mounted (no proper shutdown /
> reboot). My guess is that the fs journal should handle this case and
> avoid such corruption.

Wrong guess. It may or may not work, depending on a long list of
parameters, but fundamentally, not shutting the machine down properly
is asking for problems and corruption. The problem will be tragically
aggravated if your hardware RAID doesn't have a battery-backed cache.

-- 
------------------------------------------------------------------------
Emmanuel Florac | Direction technique | Intellique |
<eflorac@intellique.com> | +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: XFS filesystem corruption
From: Julien FERRERO @ 2013-03-06 16:16 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: xfs

Hi Emmanuel,

2013/3/6 Emmanuel Florac <eflorac@intellique.com>:
> Wrong guess. It may or may not work, depending on a long list of
> parameters, but fundamentally, not shutting the machine down properly
> is asking for problems and corruption. The problem will be tragically
> aggravated if your hardware RAID doesn't have a battery-backed cache.

OK, but our server spends 95% of its time reading data and only 5%
writing it. We have a case of a server that had not written anything
at the time of failure (or at any point during that uptime session).
Moreover, the failures affect files that were open read-only, or not
accessed at all, at the time of failure. I don't think the H/W RAID is
the issue, since we see the same corruption on other setups without
H/W RAID.

Does the "ls" output with "???" look like fs corruption?
* Re: XFS filesystem corruption
From: Ric Wheeler @ 2013-03-06 16:47 UTC (permalink / raw)
To: Julien FERRERO; +Cc: xfs

On 03/06/2013 11:16 AM, Julien FERRERO wrote:
> OK, but our server spends 95% of its time reading data and only 5%
> writing it. We have a case of a server that had not written anything
> at the time of failure (or at any point during that uptime session).
> Moreover, the failures affect files that were open read-only, or not
> accessed at all, at the time of failure. I don't think the H/W RAID
> is the issue, since we see the same corruption on other setups
> without H/W RAID.
>
> Does the "ls" output with "???" look like fs corruption?

Caching can hold dirty data in a volatile cache for a very long time.
Even if you open a file in "read-only" mode, you still generate a fair
amount of writes to storage. You can use blktrace or a similar tool to
see just how much data is actually written.

As mentioned earlier, you must always unmount cleanly as a best
practice. An operator who powers off machines with mounted file
systems needs to be educated - or let go :)

Ric
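
A minimal sketch of that measurement; the device name and the capture
duration are only examples:

# blktrace -d /dev/md0 -o md0trace -w 60    # capture block IO events for 60s
# blkparse -i md0trace | grep -c ' W '      # rough count of write events

Even a "read-mostly" box usually shows metadata, log and other
background write traffic here.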
* Re: XFS filesystem corruption
From: Emmanuel Florac @ 2013-03-06 22:21 UTC (permalink / raw)
To: Julien FERRERO; +Cc: xfs

On Wed, 6 Mar 2013 17:16:31 +0100, you wrote:

> I don't think the H/W RAID is the issue, since we see the same
> corruption on other setups without H/W RAID.

HW RAID may exacerbate the problem. XFS is absolutely, definitely not
"brutal power off" safe. All Linux systems from this century are
perfectly able to turn themselves off properly at a single press of
the power button; the only safe options are educating the users or
mounting the filesystem read-only.

And yes, the garbled ls output is characteristic of filesystem
corruption.

-- 
------------------------------------------------------------------------
Emmanuel Florac | Direction technique | Intellique |
<eflorac@intellique.com> | +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: XFS filesystem corruption
From: Ric Wheeler @ 2013-03-06 23:12 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Julien FERRERO, xfs

On 03/06/2013 05:21 PM, Emmanuel Florac wrote:
> HW RAID may exacerbate the problem. XFS is absolutely, definitely not
> "brutal power off" safe.
[...]
> And yes, the garbled ls output is characteristic of filesystem
> corruption.

We actually test brutal power-off for xfs, ext4 and other file
systems. If your storage is configured properly and you have barriers
enabled, they all pass without corruption.

What hardware RAID cards can do is hide a volatile write cache -
either on the RAID HBA itself or, even worse, on the backend disks
behind the card. SATA disks tend to default to write cache enabled and
need to be checked especially carefully (SAS drives tend to ship with
the write cache disabled by default).

ric
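
For drives the kernel addresses directly, the write cache state can be
queried and changed with hdparm; the device name is an example:

# hdparm -W /dev/sda     # report the drive's write-caching state
# hdparm -W0 /dev/sda    # turn the drive's write cache off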
* Re: XFS filesystem corruption
From: Julien FERRERO @ 2013-03-07 13:15 UTC (permalink / raw)
To: Ric Wheeler; +Cc: xfs

> We actually test brutal power-off for xfs, ext4 and other file
> systems. If your storage is configured properly and you have barriers
> enabled, they all pass without corruption.
>
> What hardware RAID cards can do is hide a volatile write cache -
> either on the RAID HBA itself or, even worse, on the backend disks
> behind the card. SATA disks tend to default to write cache enabled
> and need to be checked especially carefully (SAS drives tend to ship
> with the write cache disabled by default).

The write cache is supposed to be disabled on the H/W RAID (according
to hdparm), and barriers are correctly enabled, since XFS does not
report any warning at mount.

The odd thing is that we never saw this with kernel 2.6.18, where
barriers weren't yet available. Another difference is the "unwritten
extent" mkfs option, which we used to set to 0 by default. We can no
longer change this setting, according to an old thread I found:
"unwritten extents on linux are generally a bad idea, this option
should not be used.". Unfortunately, the engineer who chose this
setting is no longer working with us...
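
For reference, the barrier check relied on here is roughly the
following; on these kernels XFS prints a "Disabling barriers" style
warning at mount time if the trial barrier write fails (the exact
wording varies by kernel version, so treat the pattern as an
assumption):

# dmesg | grep -i barrier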
* Re: XFS filesystem corruption
From: Ric Wheeler @ 2013-03-07 13:40 UTC (permalink / raw)
To: Julien FERRERO; +Cc: xfs

On 03/07/2013 08:15 AM, Julien FERRERO wrote:
> The write cache is supposed to be disabled on the H/W RAID (according
> to hdparm), and barriers are correctly enabled, since XFS does not
> report any warning at mount.

hdparm shows you the devices that the card exposes, not the state of
the write cache on the drives behind it. You need special vendor tools
for that; the LSI controllers, for example, have the MegaRAID tools.

Until your IO stack is properly configured, you really don't need to
worry about the file system options :)

ric
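
As a sketch only - MegaCli flag spellings differ between versions, so
verify these against your card's documentation before relying on them:

# MegaCli64 -LDGetProp -DskCache -LAll -aAll     # query backend disk cache policy
# MegaCli64 -LDSetProp -DisDskCache -LAll -aAll  # disable backend disk caches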
* Re: XFS filesystem corruption
From: Dave Chinner @ 2013-03-07 23:22 UTC (permalink / raw)
To: Julien FERRERO; +Cc: Ric Wheeler, xfs

On Thu, Mar 07, 2013 at 02:15:31PM +0100, Julien FERRERO wrote:
> The odd thing is that we never saw this with kernel 2.6.18, where
> barriers weren't yet available.

Yes they were. XFS had barrier support added in 2.6.15.

> Another difference is the "unwritten extent" mkfs option, which we
> used to set to 0 by default. We can no longer change this setting,
> according to an old thread I found: "unwritten extents on linux are
> generally a bad idea, this option should not be used.".

Yes, that would have been me who said that. I started seeing lots of
boy-racer "tweak your filesystem to go faster" blogs, ranked high up
in Google results, recommending that unwritten extents be turned off,
with numbers to prove that it improved performance.

There were two common things wrong with these blogs:

	1. None of them mentioned that turning off unwritten extents
	exposes stale data to users, i.e. a whopping great big
	security hole.

	2. They reported significant performance improvements for
	workloads that *didn't use unwritten extents* when they set
	this flag, i.e. they mistook run-to-run variability of the
	benchmark for a performance improvement. Benchmarking 101
	fail.

When you get people who do not understand what they are doing giving
bad advice in the first 10 hits of a Google search about
optimising/tuning XFS filesystems, it's a major concern, and so I took
steps to ensure you can't turn off unwritten extents with mkfs...

> Unfortunately, the engineer who chose this setting is no longer
> working with us...

It sounds like he read one too many of those blogs, because if mysql
is triggering speculative preallocation, it is not using unwritten
extents....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: XFS filesystem corruption
From: Julien FERRERO @ 2013-03-08 10:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: Ric Wheeler, xfs

> Yes they were. XFS had barrier support added in 2.6.15.

The XFS ran on top of a software RAID0 or software RAID5, depending on
the product. In both cases, my understanding is that neither software
RAID0 nor software RAID5 supported barriers in 2.6.18.

> Yes, that would have been me who said that. I started seeing lots of
> boy-racer "tweak your filesystem to go faster" blogs, ranked high up
> in Google results, recommending that unwritten extents be turned off,
> with numbers to prove that it improved performance.
[...]
> When you get people who do not understand what they are doing giving
> bad advice in the first 10 hits of a Google search about
> optimising/tuning XFS filesystems, it's a major concern, and so I
> took steps to ensure you can't turn off unwritten extents with
> mkfs...

Thanks for the explanation.
* Re: XFS filesystem corruption
From: Martin Steigerwald @ 2013-03-12 9:57 UTC (permalink / raw)
To: xfs; +Cc: Ric Wheeler, Julien FERRERO

On Friday, 8 March 2013, Dave Chinner wrote:
> On Thu, Mar 07, 2013 at 02:15:31PM +0100, Julien FERRERO wrote:
>> The odd thing is that we never saw this with kernel 2.6.18, where
>> barriers weren't yet available.
>
> Yes they were. XFS had barrier support added in 2.6.15.

I thought this was 2.6.16? Or was that the kernel where it became
usable because the generic write barrier part was merged, while the
XFS side was ready earlier?

I still remember the XFS filesystem crashes I had back then: they went
away when I disabled the write cache of the drive in my ThinkPad T42,
and were solved with 2.6.17, whereas 2.6.17.7 fixed a directory
corruption issue introduced with 2.6.17. Thus I always recommended at
least 2.6.17.7 when using write barriers with XFS.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: XFS filesystem corruption
From: Stan Hoeppner @ 2013-03-08 8:39 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Julien FERRERO, xfs

On 3/6/2013 5:12 PM, Ric Wheeler wrote:
> We actually test brutal power-off for xfs, ext4 and other file
> systems. If your storage is configured properly and you have barriers
> enabled, they all pass without corruption.

Something none of us has mentioned WRT write barriers is that while
the filesystem structure may avoid corruption when the power is cut,
files may still be corrupted under any or all of these conditions:

1. unwritten data still in the buffer cache
2. drive caches enabled
3. BBWC not working properly

If the techs are determined to hard-cut power because they don't have
the time or the knowledge to do a clean shutdown, it may be well worth
your time and effort to write a script and teach the field techs to
execute it before flipping the master switch. The script would run as
root, or you'd need to do some sudo foo within, and would contain
something like:

#! /bin/sh
sync                               # flush dirty data to disk
echo 2 > /proc/sys/vm/drop_caches  # free reclaimable dentries and inodes
echo "Ready for power down."

This will flush pending writes in the buffer cache to disk, and
assumes of course that drive caches are disabled and/or that BBWC, if
present, is functioning properly. It also assumes no applications are
still actively writing files, in which case you're screwed regardless.
It's not a perfect solution and there's no guarantee you won't suffer
file corruption, but it greatly improves your odds.

-- 
Stan
* Re: XFS filesystem corruption
From: Julien FERRERO @ 2013-03-08 10:17 UTC (permalink / raw)
To: stan; +Cc: Ric Wheeler, xfs

> If the techs are determined to hard-cut power because they don't have
> the time or the knowledge to do a clean shutdown, it may be well
> worth your time and effort to write a script and teach the field
> techs to execute it before flipping the master switch.
[...]
> It's not a perfect solution and there's no guarantee you won't suffer
> file corruption, but it greatly improves your odds.

Thank you, that's the plan indeed: educate our customers, and minimize
failures with such a script / recommendation.
* Re: XFS filesystem corruption
From: Ric Wheeler @ 2013-03-08 12:20 UTC (permalink / raw)
To: stan; +Cc: Julien FERRERO, xfs

On 03/08/2013 03:39 AM, Stan Hoeppner wrote:
> Something none of us has mentioned WRT write barriers is that while
> the filesystem structure may avoid corruption when the power is cut,
> files may still be corrupted under any or all of these conditions:
>
> 1. unwritten data still in the buffer cache

This is true only for user data, not the file system metadata. We
should always be able to drop power without seeing corruption (like
the garbled ls output).

> 2. drive caches enabled

Write barriers will take care of drives with the write cache enabled,
as long as a hardware RAID card is not sitting in the middle and
misleading us.

> 3. BBWC not working properly

This should not be a worry. If the battery (or, in more modern cards,
the flash backup) is not working, a good card will flip into
write-through caching. Slow, but safe. Note that the write cache state
on the drives is still a question mark - that normally needs to be
disabled.

> #! /bin/sh
> sync
> echo 2 > /proc/sys/vm/drop_caches
> echo "Ready for power down."

For file system *metadata* consistency, you should never have to do
this if the stack is properly configured. The application data will
still be lost. Also, if there are active writers, this is inherently
racy. A better script would unmount the file systems :)

Ric
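
A minimal sketch of that improved script, assuming a single data
filesystem at an illustrative mountpoint; it refuses to declare the
unit safe if the unmount fails:

#! /bin/sh
sync
if umount /mountpoint; then
    echo "Filesystem unmounted. Safe to power down."
else
    echo "Unmount FAILED - do NOT cut power." >&2
    exit 1
fi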
* Re: XFS filesystem corruption
From: Stan Hoeppner @ 2013-03-08 18:59 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Julien FERRERO, xfs

On 3/8/2013 6:20 AM, Ric Wheeler wrote:
> On 03/08/2013 03:39 AM, Stan Hoeppner wrote:
>> On 3/6/2013 5:12 PM, Ric Wheeler wrote:
>>> We actually test brutal power-off for xfs, ext4 and other file
>>> systems. If your storage is configured properly and you have
>>> barriers enabled, they all pass without corruption.

I think you missed the context. Please reread this:

>> Something none of us has mentioned WRT write barriers is that while
>> the filesystem structure may avoid corruption when the power is cut,
>> files may still be corrupted under any or all of these conditions:

I made it very clear I was discussing file corruption here, not
filesystem corruption. You already covered that base. I was
specifically addressing the fact that XFS performs barriers on
metadata writes but not on file data writes.

> Also, if there are active writers, this is inherently racy. A better
> script would unmount the file systems :)

Yes, a umount would be even better.

-- 
Stan
* Re: XFS filesystem corruption
From: Dave Chinner @ 2013-03-09 9:11 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Julien FERRERO, Ric Wheeler, xfs

On Fri, Mar 08, 2013 at 12:59:22PM -0600, Stan Hoeppner wrote:
> I made it very clear I was discussing file corruption here, not
> filesystem corruption. You already covered that base. I was
> specifically addressing the fact that XFS performs barriers on
> metadata writes but not on file data writes.

Actually, you're not correct there, either, Stan. ;)

XFS only issues cache flushes/FUA writes for log IO. Metadata IO is
done exactly the same way as data IO - without barriers. It's because
metadata lost in drive caches at the time of a crash is rewritten by
journal replay that filesystem corruption does not occur.

As it is, if the application uses direct IO (likely, as it sounds like
video capture/editing/playout here), then log IO will also ensure that
the data written by the app is on disk (i.e. that's the mechanism by
which fsync works). Hence even the assumption that there will be data
loss depends on how the application is doing its IO....

>> Also, if there are active writers, this is inherently racy. A
>> better script would unmount the file systems :)
>
> Yes, a umount would be even better.

Change the BIOS so that the power button does not cause a power-down,
so the OS can capture the button event and trigger an orderly
shutdown. Laptops use power button events for all sorts of different
things (e.g. suspend rather than shutdown), and you can do exactly the
same sort of event-triggered shutdown on any server or desktop...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
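
A sketch of such an event-triggered shutdown using acpid; the file
paths and the event pattern are distro-dependent assumptions:

# /etc/acpi/events/powerbtn - map the power button to a handler script
event=button/power.*
action=/etc/acpi/powerbtn.sh

# /etc/acpi/powerbtn.sh - orderly shutdown instead of a hard cut
#!/bin/sh
/sbin/shutdown -h now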
* Re: XFS filesystem corruption
From: Stan Hoeppner @ 2013-03-09 18:51 UTC (permalink / raw)
To: Dave Chinner; +Cc: Julien FERRERO, Ric Wheeler, xfs

On 3/9/2013 3:11 AM, Dave Chinner wrote:
> Actually, you're not correct there, either, Stan. ;)

With "either" you're implying I was incorrect twice, and I wasn't -
not in whole anyway, maybe in part. ;)

> XFS only issues cache flushes/FUA writes for log IO. Metadata IO is
> done exactly the same way as data IO - without barriers. It's because
> metadata lost in drive caches at the time of a crash is rewritten by
> journal replay that filesystem corruption does not occur.

Technical semantics. Geeze, give the non-dev a break now and then. ;)

Does everyone remember the transitive property of equality from math
class decades ago? It states "if A=B and B=C, then A=C". Thus, if
barrier writes to the journal protect the journal, and the journal
protects metadata, then barrier writes to the journal protect
metadata. I had a detail incorrect, but not the big picture. And I'd
bet the OP is more interested in the big picture. So surely I'd get a
B or a C here, but certainly not an F.

> As it is, if the application uses direct IO (likely, as it sounds
> like video capture/editing/playout here), then log IO will also
> ensure that the data written by the app is on disk (i.e. that's the
> mechanism by which fsync works).

So this would be an interesting upside-down case for XFS, as the file
data may be intact but the filesystem gets corrupted - the opposite of
the design point.

> Hence even the assumption that there will be data loss depends on how
> the application is doing its IO....

I didn't assume there _will_ be data loss. I'm simply trying to help
the guy think about covering all the bases, which is the smart thing
to do, is it not? I've never designed any system with the "assumption"
that pulling the plug is the standard mode of system shutdown. ;) I
doubt anyone else here has either. So we're all working a bit "outside
the box" here, yes?

> Change the BIOS so that the power button does not cause a power-down,
> so the OS can capture the button event and trigger an orderly
> shutdown.

Dare I say "Dave, you're incorrect". ;) The OP already stated that all
the gear in the vans, whatever that is, is controlled by a master
switch, probably something like an 8-outlet surge protector/power
strip, and the techs power down all the gear with this one switch. So
this solution doesn't work either. I think someone already suggested
it upstream in the thread.

This is one of those classic cases of computers being injected into a
field application where the users are so used to dumb/analog devices
that they simply can't or won't adapt, resist, or take a long time to
assimilate. It reminds me of a similar case some time ago...

When I ordered my first ADSL circuit back in ~2000, it took SW Bell 6
weeks to get it working. The field techs had been trained in, and
worked in, a 70+-year-old analog phone world, and these guys are the
antithesis of technical folks. In my case, the port on the brand-new
Alcatel DSLAM was defective. It took 4 weeks and a dozen different
techs to finally diagnose it, and another 2 weeks of "paperwork" to
reassign my circuit to another DSLAM port, though the bureaucracy
issue wasn't the techs' fault. From what I understand, it took about 2
years for these guys to become proficient with DSL installations.

Let's hope for the OP's sake that it doesn't take two years for his
guys to learn and adapt to this "new" digital recording system. I put
"new" in quotes because, having worked for SGI, Dave, you know this
direct-to-disk recording technology has been around for over a decade.

-- 
Stan
* Re: XFS filesystem corruption
From: Dave Chinner @ 2013-03-10 22:45 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Julien FERRERO, Ric Wheeler, xfs

On Sat, Mar 09, 2013 at 12:51:25PM -0600, Stan Hoeppner wrote:
> With "either" you're implying I was incorrect twice, and I wasn't -
> not in whole anyway, maybe in part. ;)

The "either" was in reference to you correcting someone else...

> Technical semantics. Geeze, give the non-dev a break now and
> then. ;)

It's the technical semantics that matter when it comes to behaviour at
power loss. That's why I pick on "technical semantics" - it makes your
analysis and understanding of problems better, and that means there's
less for me to do in future ;)

> Does everyone remember the transitive property of equality from math
> class decades ago? It states "if A=B and B=C, then A=C". Thus, if
> barrier writes to the journal protect the journal, and the journal
> protects metadata, then barrier writes to the journal protect
> metadata.

Yup, but the devil is in the detail - we don't protect individual
metadata writes at all, and that difference is significant enough to
comment on.... :P

> I had a detail incorrect, but not the big picture. And I'd bet the OP
> is more interested in the big picture. So surely I'd get a B or a C
> here, but certainly not an F.

Certainly a B+ - like I said, I'm being picky because you seem to
understand the details once explained... :)

> So this would be an interesting upside-down case for XFS, as the file
> data may be intact but the filesystem gets corrupted - the opposite
> of the design point.

Well, if barriers are working correctly, then there won't be any
filesystem corruption, either...

> Dare I say "Dave, you're incorrect". ;)

Heh. Not so much incorrect as "unaware of the entire scope". I browsed
the thread and didn't pick up on this little detail...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: XFS filesystem corruption
From: Stan Hoeppner @ 2013-03-10 23:54 UTC (permalink / raw)
To: Dave Chinner; +Cc: Julien FERRERO, Ric Wheeler, xfs

On 3/10/2013 5:45 PM, Dave Chinner wrote:
> The "either" was in reference to you correcting someone else...

I wasn't attempting to correct Ric on the technicals, as that's simply
not possible, me being a user talking to a dev. That would be really
presumptuous on my part, not to mention dumb. I had made a point about
file data corruption, and he replied talking about metadata
corruption. My "correction" was simply to clarify that I was talking
about file data, not metadata.

> It's the technical semantics that matter when it comes to behaviour
> at power loss. That's why I pick on "technical semantics" - it makes
> your analysis and understanding of problems better, and that means
> there's less for me to do in future ;)

I do my best to grab the low-hanging fruit when I can, so you guys can
concentrate on more important stuff.

> Yup, but the devil is in the detail - we don't protect individual
> metadata writes at all, and that difference is significant enough to
> comment on.... :P

Elaborate on this a bit, if you have time. I was under the impression
that all directory updates were journaled first.

> Certainly a B+ - like I said, I'm being picky because you seem to
> understand the details once explained... :)

Usually. ;) Sometimes it takes a couple of sessions before it fully
sinks in. I must say I've learned a tremendous amount from the devs on
this list, and I'm grateful that you specifically, Dave, have taken
the time to "tutor" me, and others, over the last couple of years.

> Well, if barriers are working correctly, then there won't be any
> filesystem corruption, either...

OK, see, this is the odd part here. The OP didn't seem to have this
metadata corruption issue with the old 2.6.18 kernel, at least I think
that's the one he mentioned. Then he switched to 2.6.35. IIRC there
were a number of commits around that time, and some regressions. I
also recall 2.6.35 is not a long-term stable kernel; I'd guess there
were reasons for that. So I'm wondering if there was a bug/regression
relating to XFS metadata in 2.6.35, corrected in .36 or later and
simply not backported. It seems to ring a bell, vaguely, but I have no
idea where or how to search for such information.

> Heh. Not so much incorrect as "unaware of the entire scope". I
> browsed the thread and didn't pick up on this little detail...

I know. That was a bit of a cheap shot, hence the judicious use of
quotes and winkies. ;) I knew you'd missed it, or you'd not have
mentioned the ACPI soft power switch option.

-- 
Stan
* Re: XFS filesystem corruption
From: Dave Chinner @ 2013-03-11 0:50 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Julien FERRERO, Ric Wheeler, xfs

On Sun, Mar 10, 2013 at 06:54:57PM -0500, Stan Hoeppner wrote:
> Elaborate on this a bit, if you have time. I was under the impression
> that all directory updates were journaled first.

That's correct - they are all journalled.

But journalling is done at the transactional level, not at that of
individual metadata changes. IOWs, journalled changes do not contain
the same information as a metadata buffer write - they contain both
more and less information than a metadata buffer write.

They contain more information in that there is change-atomicity
information in the journal for recovery purposes, i.e. how the
individual change relates to changes in other related metadata
objects. This information is needed in the journal so that log
recovery knows to apply either all the changes in a checkpoint or none
of them, if this journal checkpoint (or a previous one) is incomplete.

They contain less information in that the changes to a metadata object
are stored as a diff in the journal rather than as a complete copy of
the object. This is done to reduce the amount of journal space and
memory required to track and store all of the changes in the
checkpoint.

Hence what is written to the journal is quite different from what is
written during metadata writeback, in both contents and method. It is
the atomicity information in the journal that we know got synchronised
to disk (via the FUA/cache flush) that lets us get away with lazily
writing back metadata buffers in any order we please, without needing
FUA/cache flushes...

So, yes, you are correct in that the journalling protects metadata.
However, the distinction I'm making is that the journal writes contain
different information and have different constraints compared to
individual metadata object writeback, and therefore are not the "same
thing" and do not require the same protection from power-loss/crash
events...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: XFS filesystem corruption
From: Stan Hoeppner @ 2013-03-11 9:29 UTC (permalink / raw)
To: Dave Chinner; +Cc: Julien FERRERO, Ric Wheeler, xfs

On 3/10/2013 7:50 PM, Dave Chinner wrote:
> That's correct - they are all journalled.
>
> But journalling is done at the transactional level, not at that of
> individual metadata changes. IOWs, journalled changes do not contain
> the same information as a metadata buffer write - they contain both
> more and less information than a metadata buffer write.
[...]

Forget the power loss issue for a moment. If I'm digesting this
correctly, it seems quite an accomplishment that you got delaylog
working at all, let alone as extremely well as it does. Given what you
state above, it would seem there is quite a bit of complexity involved
in tracking these metadata change relationships and modifying the
checkpoint information accordingly. I would think that as you merge
multiple traditional XFS log writes into a single write, the
relationship information would also need to be modified. Or do I lack
sufficient understanding at this point to digest this?

> Hence what is written to the journal is quite different from what is
> written during metadata writeback, in both contents and method. It is
> the atomicity information in the journal that we know got
> synchronised to disk (via the FUA/cache flush) that lets us get away
> with lazily writing back metadata buffers in any order we please,
> without needing FUA/cache flushes...

This makes me wonder... for a given metadata write into an AG, is the
amount of data in the corresponding journal write typically greater or
less? You stated above it is both more and less, but I don't know
whether you meant that qualitatively or quantitatively, or both. I'm
wondering: if the log write bytes are typically significantly lower,
and we know we can recreate a lost metadata write from the journal
data during a recovery....

Given that CPU is so much faster than disk, would it be plausible to
do all metadata writes in a lazy fashion through the relevant sections
of the recovery code, or something along those lines? Make "recovery"
the standard method for metadata writes? I'm not talking about
replacing the log journal, but replacing the metadata write method
with something akin to a portion of the journal recovery routine. In
other words, could we use the delaylog concept of doing more work with
fewer IOs to achieve a similar performance gain for metadata
writeback? Or is XFS metadata writeback already fully optimized WRT
IOs, bandwidth, latency, etc.? Or is this simply a crazy idea from a
member of the peanut gallery who has just enough knowledge to be a
nuisance but lacks enough to make real contributions? Probably the
latter. ;)

> So, yes, you are correct in that the journalling protects metadata.
> However, the distinction I'm making is that the journal writes
> contain different information and have different constraints compared
> to individual metadata object writeback, and therefore are not the
> "same thing" and do not require the same protection from
> power-loss/crash events...

Thanks, Dave, for continuing to take the time to teach. I've passed on
much of what I've learned from you to many others outside this list.

-- 
Stan
* Re: XFS filesystem corruption
2013-03-11 9:29 ` Stan Hoeppner
@ 2013-03-11 22:45 ` Dave Chinner
0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2013-03-11 22:45 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Julien FERRERO, Ric Wheeler, xfs

On Mon, Mar 11, 2013 at 04:29:47AM -0500, Stan Hoeppner wrote:
> On 3/10/2013 7:50 PM, Dave Chinner wrote:
> > On Sun, Mar 10, 2013 at 06:54:57PM -0500, Stan Hoeppner wrote:
> >> On 3/10/2013 5:45 PM, Dave Chinner wrote:
> >>>> Does everyone remember the transitive property of equality from math
> >>>> class decades ago? It states "If A=B and B=C then A=C". Thus if
> >>>> barrier writes to the journal protect the journal, and the journal
> >>>> protects metadata, then barrier writes to the journal protect metadata.
> >>>
> >>> Yup, but the devil is in the detail - we don't protect individual
> >>> metadata writes at all and that difference is significant enough to
> >>> comment on.... :P
> >>
> >> Elaborate on this a bit, if you have time. I was under the impression
> >> that all directory updates were journaled first.
> >
> > That's correct - they are all journalled.
> >
> > But journalling is done at the transactional level, not at that of
> > individual metadata changes. IOWs, journalled changes do not contain
> > the same information as a metadata buffer write - they contain both
> > more and less information than a metadata buffer write.
> >
> > They contain more information in that there is change atomicity
> > information in the journal for recovery purposes, i.e. how the
> > individual change relates to changes in other related metadata
> > objects. This information is needed in the journal so that log
> > recovery knows to apply either all the changes in a checkpoint or
> > none of them if this journal checkpoint (or a previous one) is
> > incomplete.
> >
> > They contain less information in that the changes to a metadata
> > object are stored as a diff in the journal rather than as a complete
> > copy of the object. This is done to reduce the amount of journal
> > space and memory required to track and store all of the changes in
> > the checkpoint.
>
> Forget the power loss issue for a moment. If I'm digesting this
> correctly, it seems quite an accomplishment that you got delaylog
> working at all, let alone as extremely well as it does. Given what
> you state above, it would seem there is quite a bit of complexity
> involved in tracking these metadata change relationships and
> modifying the checkpoint information accordingly.

Yes, there is.

> I would think that as you merge multiple traditional XFS log writes
> into a single write, the relationship information would also need to
> be modified as well. Or do I lack sufficient understanding at this
> point to digest this?

Relationship information is inherent in the checkpoint method due to
a feature that has been built into the XFS transaction/journalling
code from day zero: relogging. This is described in all its glory in
Documentation/filesystems/xfs-delayed-logging-design.txt....

> > Hence what is written to the journal is quite different to what is
> > written during metadata writeback, in both contents and method. It is
> > the atomicity information in the journal that we know got
> > synchronised to disk (via the FUA/cache flush) that enables us to get
> > away with lazily writing back metadata buffers in any order we please
> > without needing FUA/cache flushes...
>
> This makes me wonder... for a given metadata write into an AG, is the
> amount of data in the corresponding journal write typically greater
> or less?

Typically less - buffer changes are logged into a dirty bitmap with a
resolution of 128 bytes. Hence a single byte change will record a
single dirty bit, which means 128 bytes will be logged. Both the
dirty bitmap and the 128 byte regions are written to the journal. So
in this case, less is written to the journal.

However, because of relogging, the more a buffer gets modified, the
larger the number of dirty regions, and so for buffers that are
repeatedly modified we typically end up logging them entirely,
including the bitmap and other information. In this case, more is
written to the journal than would be written by metadata writeback...

The difference with delayed logging is that the frequency of journal
writes goes way down, so the fact that we typically log more per
object into the journal is greatly outweighed by the fact that the
objects are logged orders of magnitude less often....

> You stated above it is both more and less, but I don't know if you
> meant that qualitatively or quantitatively, or both. I'm wondering:
> if log write bytes are typically significantly lower, and we know we
> can recreate a lost metadata write from the journal data during
> recovery... given that CPU is so much faster than disk, would it be
> plausible to do all metadata writes in a lazy fashion through the
> relevant sections of the recovery code, or something along these
> lines?

We already do that.

> Make 'recovery' the standard method for metadata writes? I'm not
> talking about replacing the log journal, but replacing the metadata
> write method with something akin to a portion of the journal
> recovery routine.

To do that, you need an unbounded log size, i.e. you are talking
about a log structured filesystem rather than a traditional
journalled filesystem. The problem with log structured filesystems is
that you trade off write side speed and latency for read side speed
and latency. When you get large data sets, or frequently modified
data sets larger than can be cached in memory, log structured
filesystems perform terribly because the metadata needs
reconstructing or regathering every time it is read. IOWs, log
structured filesystems simply don't scale to large sizes or large
scale data sets effectively.

> In other words, could we make use of the delaylog concept of doing
> more work with fewer IOs to achieve a similar performance gain for
> metadata writeback? Or is XFS metadata writeback already fully
> optimized WRT IOs, bandwidth, latency, etc.?

I wouldn't say it's fully optimised (nothing ever is), but metadata
writeback is almost completely decoupled from the transactional
modification side of the filesystem, and so we can do far larger
scale optimisation of writeback order than other filesystems. Hence
there are relatively few latency/bandwidth/IOPS issues with metadata
writeback.

For example, have a look at slide 24 of this presentation:

http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf

and note how much IO XFS is doing for metadata writeback at a given
metadata performance compared to ext4. Take away message: "XFS has
the lowest IOPS rate at a given modification rate - both ext4 and
BTRFS are IO bound at higher thread counts."

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
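To put rough numbers on the 128-byte dirty-region logging described
above, here is a small illustrative shell sketch. The arithmetic is
deliberately simplified: the real log format also carries the bitmap
itself, headers and transaction framing on top of these figures.

--------------8<--------------8<--------------
#!/bin/sh
# Illustration only (not XFS code): a 4 KiB metadata buffer logged
# at 128-byte dirty-region resolution.
bufsize=4096
region=128
regions=$((bufsize / region))
echo "regions per buffer:        $regions"
echo "1-byte change logs about:  $region bytes (one dirty region)"
echo "fully dirty buffer logs:   $((regions * region)) bytes (whole buffer)"
echo "metadata writeback writes: $bufsize bytes either way"
--------------8<--------------8<--------------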
* Re: XFS filesystem corruption
2013-03-10 23:54 ` Stan Hoeppner
2013-03-11 0:50 ` Dave Chinner
@ 2013-03-11 9:25 ` Julien FERRERO
2013-03-12 10:54 ` Emmanuel Florac
1 sibling, 1 reply; 31+ messages in thread
From: Julien FERRERO @ 2013-03-11 9:25 UTC (permalink / raw)
To: stan; +Cc: Ric Wheeler, xfs

> Ok, see, this is the odd part here. The OP didn't seem to have this
> metadata corruption issue with the old 2.6.18 kernel, at least I think
> that's the one he mentioned. Then he switched to 2.6.35. IIRC there
> were a number of commits around that time and some regressions. I also
> recall 2.6.35 is not a long term stable kernel. I'd guess there were
> reasons for that. So, I'm wondering if there was a bug/regression
> relating to XFS metadata in 2.6.35 corrected in .36 or later and simply
> not backported. Seems to ring a bell, vaguely. I have no idea
> where/how to search for such information.

That is the main reason I asked. I googled for regressions / issues
with XFS in 2.6.35 but didn't find anything. My hope was that someone
from this mailing list would remember it (if such a regression did
exist, of course).
* Re: XFS filesystem corruption
2013-03-11 9:25 ` Julien FERRERO
@ 2013-03-12 10:54 ` Emmanuel Florac
0 siblings, 0 replies; 31+ messages in thread
From: Emmanuel Florac @ 2013-03-12 10:54 UTC (permalink / raw)
To: Julien FERRERO; +Cc: Ric Wheeler, stan, xfs

On Mon, 11 Mar 2013 10:25:13 +0100, you wrote:

> That is the main reason I asked. I googled for regressions / issues
> with XFS in 2.6.35 but didn't find anything. My hope was that someone
> from this mailing list would remember it (if such a regression did
> exist, of course).

I don't, but I had at least one serious corruption under 2.6.35.13
(though it was related to hard drive woes).

--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique | Intellique
<eflorac@intellique.com> | +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: XFS filesystem corruption
2013-03-08 8:39 ` Stan Hoeppner
2013-03-08 10:17 ` Julien FERRERO
2013-03-08 12:20 ` Ric Wheeler
@ 2013-03-12 10:42 ` Martin Steigerwald
2013-03-12 22:16 ` Stan Hoeppner
2 siblings, 1 reply; 31+ messages in thread
From: Martin Steigerwald @ 2013-03-12 10:42 UTC (permalink / raw)
To: xfs, stan; +Cc: Julien FERRERO, Ric Wheeler

On Friday, 8 March 2013, Stan Hoeppner wrote:
> If the techs are determined to hard cut power because they don't have
> the time or the knowledge to do a clean shutdown, it may be well worth
> your time/effort to write a script and teach the field techs to
> execute it before flipping the master switch. Your simple script would
> run as root, or you'd need to do some sudo foo within, and would
> contain something like:
>
> #! /bin/sh
> sync
> echo 2 > /proc/sys/vm/drop_caches
> echo "Ready for power down."

mount -o remount,ro /your/mount/point

One can at least try. Maybe some "service stop" commands before that.

But then, if using a script like this, why not just type "halt"? Heck,
Linux kernel / userspace / distro developers have prepared safe
shutdown already, so why not use it?

Another idea: on Debian, Ctrl-Alt-Delete on a TTY usually triggers a
shutdown:

# What to do when CTRL-ALT-DEL is pressed.
ca:12345:ctrlaltdel:/sbin/shutdown -t1 -a -r now

Then you plug a keyboard into the server and tell the local admins to
just press Ctrl-Alt-Del in order to shut down the server instead of
using the power button.

But heck, even just pressing the power button for a short period of
time should work. In Debian it does. So you can just tap the power
button.

So I don't see much of a reason not to shut down the server properly.

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
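Pulling the suggestions in this message together, a field-tech
"prepare for power cut" script might look like the sketch below. The
service name and mount point are placeholders for whatever the actual
system runs; adjust to taste.

--------------8<--------------8<--------------
#!/bin/sh
# Hypothetical pre-power-cut script combining the suggestions above.
/etc/init.d/videoserver stop       # stop whatever writes to the fs (assumed name)
sync                               # flush dirty data to disk
mount -o remount,ro /mountpoint    # quiesce the XFS filesystem
echo "Ready for power down."
# ...or skip all of the above and simply do a clean shutdown:
# shutdown -h now
--------------8<--------------8<--------------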
* Re: XFS filesystem corruption
2013-03-12 10:42 ` Martin Steigerwald
@ 2013-03-12 22:16 ` Stan Hoeppner
0 siblings, 0 replies; 31+ messages in thread
From: Stan Hoeppner @ 2013-03-12 22:16 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Julien FERRERO, Ric Wheeler, xfs

On 3/12/2013 5:42 AM, Martin Steigerwald wrote:
> On Friday, 8 March 2013, Stan Hoeppner wrote:
>> If the techs are determined to hard cut power because they don't have
>> the time or the knowledge to do a clean shutdown, it may be well
>> worth your time/effort to write a script and teach the field techs to
>> execute it before flipping the master switch. Your simple script
>> would run as root, or you'd need to do some sudo foo within, and
>> would contain something like:
>>
>> #! /bin/sh
>> sync
>> echo 2 > /proc/sys/vm/drop_caches
>> echo "Ready for power down."
>
> mount -o remount,ro /your/mount/point
>
> One can at least try. Maybe some "service stop" commands before that.
>
> But then, if using a script like this, why not just type "halt"?

...

The real solution to the OP's problem has nothing to do with XFS,
buffer flushing, or Linux shutdown modes. The problem is power, and
the automation of cutting power. And the solution is rather simple.

Put a small UPS in the van backing the server, connect it via USB or
serial, and configure upsmon. When the crews hit the master switch,
AC to the UPS is lost, and upsmon then performs a clean shutdown of
the server. The crews do nothing more than they currently do, and
they don't have to wait on anything.

If the previously described "master" switch *is* currently a UPS,
simply install another smaller unit inline to the server. Disable the
audible alarm on the small unit, as by default it will screech
continuously while AC input is absent. At the end of the day they
simply flip the switch on this little UPS so it's not running
overnight/weekends (though with no load it probably wouldn't drain
the battery; this is just a safety precaution). At the start of the
next day, they flip the master switch, then the little UPS switch.

I've not laid eyes on the vans/power circuits/gear in question, so
I'm making educated guesses. There may be even better/easier ways to
do it. But one way or another, a properly configured UPS/upsmon setup
is the way to go if the desire is to easily control everything with
power switches.

--
Stan
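A minimal upsmon configuration for such a setup might look like the
sketch below, assuming NUT (Network UPS Tools) is the monitoring
software in use; the UPS name, user and password are placeholders.

--------------8<--------------8<--------------
# /etc/nut/upsmon.conf -- minimal sketch; all values are placeholders
# Watch one locally attached UPS that supplies the server.
MONITOR vanups@localhost 1 upsmon_user secret master
MINSUPPLIES 1
# Run when the UPS goes on battery and reaches low charge.
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQ 5
--------------8<--------------8<--------------

With that in place, upsmon runs SHUTDOWNCMD once the UPS reports
on-battery plus low-battery; the shutdown path can be exercised
without actually cutting power via "upsmon -c fsd".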
* Re: XFS filesystem corruption
2013-03-06 15:08 XFS filesystem corruption Julien FERRERO
2013-03-06 15:15 ` Emmanuel Florac
@ 2013-03-07 3:56 ` Stan Hoeppner
2013-03-07 13:04 ` Julien FERRERO
1 sibling, 1 reply; 31+ messages in thread
From: Stan Hoeppner @ 2013-03-07 3:56 UTC (permalink / raw)
To: Julien FERRERO; +Cc: xfs

On 3/6/2013 9:08 AM, Julien FERRERO wrote:
> The filesystem was originally created with the command:
> # mkfs.xfs -f -l size=32m /dev/md0

It may be unrelated to your corruption problem, but I'm curious why
you are specifying a 32MB log section instead of letting mkfs.xfs
make the log size decision.

> corruption. I only know that units are used to be power cycle by
> operator while the fs is still mounted (no proper shutdown / reboot).
> My guess is the fs journal shall handle this case and avoid such
> corruption.

As others have stated, this operator needs to be flogged and educated.
A computer-based video ingestion/playback system with disk storage and
a complex filesystem is not a tape deck. You can't simply power it off
as if it were a tape deck.

I would assume based on your description that this is a mobile storage
system, often moved from one location to another, probably in a van,
and this is why the operator simply hits the power switch? Live news
crew type application or similar?

--
Stan
* Re: XFS filesystem corruption
2013-03-07 3:56 ` Stan Hoeppner
@ 2013-03-07 13:04 ` Julien FERRERO
2013-03-07 13:32 ` Stan Hoeppner
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Julien FERRERO @ 2013-03-07 13:04 UTC (permalink / raw)
To: stan; +Cc: xfs

> It may be unrelated to your corruption problem, but I'm curious why
> you are specifying a 32MB log section instead of letting mkfs.xfs
> make the log size decision.

I honestly don't know; the rebuild script was written 8 years ago by
an engineer who has since left the company.

Is 32MB a short log for 1.5 TB of data?

> I would assume based on your description that this is a mobile storage
> system, often moved from one location to another, probably in a van,
> and this is why the operator simply hits the power switch? Live news
> crew type application or similar?

Correct. Moreover, the common usage is to power off all the equipment
(including ours) from a general power switch.
* Re: XFS filesystem corruption
2013-03-07 13:04 ` Julien FERRERO
@ 2013-03-07 13:32 ` Stan Hoeppner
2013-03-10 2:50 ` Eric Sandeen
2013-03-10 22:11 ` Dave Chinner
2 siblings, 0 replies; 31+ messages in thread
From: Stan Hoeppner @ 2013-03-07 13:32 UTC (permalink / raw)
To: Julien FERRERO; +Cc: xfs

On 3/7/2013 7:04 AM, Julien FERRERO wrote:
>> It may be unrelated to your corruption problem, but I'm curious why
>> you are specifying a 32MB log section instead of letting mkfs.xfs
>> make the log size decision.
>
> I honestly don't know; the rebuild script was written 8 years ago by
> an engineer who has since left the company.
>
> Is 32MB a short log for 1.5 TB of data?

The log holds journalled metadata. So if you're capturing a frame of
video per file, or 24 or 60 frames per file, and thus are writing
lots of files, 32MB may be too small. I'm not an expert here; Dave C.
would be better able to answer this. But this is a very minor problem
compared to...

> Moreover, the common usage is to power off all the equipment
> (including ours) from a general power switch.

this. Have the crews been hard cutting power to these XFS boxen for
the 8 years you mention above? And this filesystem corruption problem
and/or corrupted files is just now cropping up? That's hard to
believe. There may be a bug in 2.6.35 that exacerbates this that's
been fixed in later versions--2.6.35 is not a long term stable
kernel--odd that a vendor would choose it for long term use.

If you never had this problem before, I can only guess that
previously you were using hardware RAID controllers with BBWC having
sufficient battery hours of cache power to survive until the next
power on, at which point the BBWC RAID dumped the data to the disks.
If you switched from that solution to non-BBWC RAID, or to Linux
software RAID, that might explain why you're seeing corruption now
and did not previously. And even with BBWC RAID, hard cutting power
to the system is still not a smart thing to do.

For this kind of environment, if field techs are going to hard cut
power no matter what you tell them, then you simply MUST get LSI (or
possibly other) RAID cards with the flash backed write cache. This
doesn't rely on batteries, so the cache is never volatile, and can
sit overnight, or for days or weeks, without losing the data in the
write cache.

--
Stan
* Re: XFS filesystem corruption
2013-03-07 13:04 ` Julien FERRERO
2013-03-07 13:32 ` Stan Hoeppner
@ 2013-03-10 2:50 ` Eric Sandeen
2013-03-10 22:11 ` Dave Chinner
2 siblings, 0 replies; 31+ messages in thread
From: Eric Sandeen @ 2013-03-10 2:50 UTC (permalink / raw)
To: Julien FERRERO; +Cc: stan, xfs

On 3/7/13 7:04 AM, Julien FERRERO wrote:
>> It may be unrelated to your corruption problem, but I'm curious why
>> you are specifying a 32MB log section instead of letting mkfs.xfs
>> make the log size decision.
>
> I honestly don't know; the rebuild script was written 8 years ago by
> an engineer who has since left the company.
>
> Is 32MB a short log for 1.5 TB of data?

$ mkfs.xfs -dfile,name=fsfile,size=1536g
meta-data=fsfile                 isize=256    agcount=4, agsize=100663296 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=402653184, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=196608, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The default would be 768M with current xfsprogs, so I'd say yes, it's
short.

You might do well to re-examine any old, crufty "engineer left a while
ago" tunings. Defaults are defaults for a reason; if you don't know
why you're tuning something, it may well be the wrong choice.

-Eric
* Re: XFS filesystem corruption
2013-03-07 13:04 ` Julien FERRERO
2013-03-07 13:32 ` Stan Hoeppner
2013-03-10 2:50 ` Eric Sandeen
@ 2013-03-10 22:11 ` Dave Chinner
2 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2013-03-10 22:11 UTC (permalink / raw)
To: Julien FERRERO; +Cc: stan, xfs

On Thu, Mar 07, 2013 at 02:04:32PM +0100, Julien FERRERO wrote:
> > It may be unrelated to your corruption problem, but I'm curious why
> > you are specifying a 32MB log section instead of letting mkfs.xfs
> > make the log size decision.
>
> I honestly don't know; the rebuild script was written 8 years ago by
> an engineer who has since left the company.
>
> Is 32MB a short log for 1.5 TB of data?

Depends on your workload. And to tell the truth, the tiny log is
probably the only reason that your filesystems have gone this long
without corruption, as the small size will force frequent log writes
and hence issue cache flushes regularly....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
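For reference, the log size of an existing filesystem can be read
back with xfs_info, and the only way to get a larger log is to
re-make the filesystem. A sketch using the device and mount point
from this thread:

--------------8<--------------8<--------------
# Log size in filesystem blocks (multiply by bsize for bytes;
# Eric's example above shows blocks=196608 at bsize=4096, i.e. 768M):
xfs_info /mountpoint | grep '^log'

# To get the mkfs.xfs default log instead of -l size=32m, the
# filesystem must be recreated (this destroys all data - back up first):
mkfs.xfs -f /dev/md0
--------------8<--------------8<--------------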
end of thread, other threads:[~2013-03-12 22:16 UTC | newest]

Thread overview: 31+ messages:
2013-03-06 15:08 XFS filesystem corruption Julien FERRERO
2013-03-06 15:15 ` Emmanuel Florac
2013-03-06 16:16 ` Julien FERRERO
2013-03-06 16:47 ` Ric Wheeler
2013-03-06 22:21 ` Emmanuel Florac
2013-03-06 23:12 ` Ric Wheeler
2013-03-07 13:15 ` Julien FERRERO
2013-03-07 13:40 ` Ric Wheeler
2013-03-07 23:22 ` Dave Chinner
2013-03-08 10:16 ` Julien FERRERO
2013-03-12 9:57 ` Martin Steigerwald
2013-03-08 8:39 ` Stan Hoeppner
2013-03-08 10:17 ` Julien FERRERO
2013-03-08 12:20 ` Ric Wheeler
2013-03-08 18:59 ` Stan Hoeppner
2013-03-09 9:11 ` Dave Chinner
2013-03-09 18:51 ` Stan Hoeppner
2013-03-10 22:45 ` Dave Chinner
2013-03-10 23:54 ` Stan Hoeppner
2013-03-11 0:50 ` Dave Chinner
2013-03-11 9:29 ` Stan Hoeppner
2013-03-11 22:45 ` Dave Chinner
2013-03-11 9:25 ` Julien FERRERO
2013-03-12 10:54 ` Emmanuel Florac
2013-03-12 10:42 ` Martin Steigerwald
2013-03-12 22:16 ` Stan Hoeppner
2013-03-07 3:56 ` Stan Hoeppner
2013-03-07 13:04 ` Julien FERRERO
2013-03-07 13:32 ` Stan Hoeppner
2013-03-10 2:50 ` Eric Sandeen
2013-03-10 22:11 ` Dave Chinner