* Random file system corruption in 3.17 (not BTRFS related...?)
@ 2014-10-14 16:54 Robert White
2014-10-14 17:22 ` David Arendt
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Robert White @ 2014-10-14 16:54 UTC (permalink / raw)
To: linux-btrfs
Howdy,
So I run several gentoo systems and I upgraded two of them to kernel 3.17.0
One using BTRFS for root.
One using ext3 for root (via the ext4 driver)
_Both_ systems exhibited strange behavior (long pauses and then hangs
requiring hard-power) within several hours. Both then had random
filesystem damage.
On the BTRFS system much of my browser settings for firefox were
trashed, particularly the cookies and saved configurations for add-ons
(like which sites had scripts enabled/disabled in no-script) etc.
On the ext3/4 system there were several corruptions including a
pipe/special file with a large non-zero size that required I do a "fsck
-fyD /dev/sda3" to repair. (one comment from fsck was that the
pipe/special file "looked like a directory" or some such)
So I can say that corruption is taking place, but I suspect it is _not_
happening in the BTRFS specific code.
(ASIDE: both systems are older amd64 using built-in radeon display
hardware.)
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-14 16:54 Random file system corruption in 3.17 (not BTRFS related...?) Robert White
@ 2014-10-14 17:22 ` David Arendt
2014-10-14 20:06 ` Robert White
2014-10-14 22:35 ` Duncan
2014-10-15 7:08 ` Juan Orti Alcaine
2 siblings, 1 reply; 13+ messages in thread
From: David Arendt @ 2014-10-14 17:22 UTC (permalink / raw)
To: Robert White, linux-btrfs
I didn't notice a corruption on other filesystems with kernel 3.17.0.
Also I didn't experience any hangs except when trying to mount a
corrupted btrfs but this was causing a hang within less than 10 seconds.
It could be that your problem is unrelated and that the corruption you
are experiencing is due to an unrelated hang followed by a hard
powerdown. Have you been able to capture any btrfs related kernel panics?
On 10/14/14 6:54 PM, Robert White wrote:
> Howdy,
>
> So I run several gentoo systems and I upgraded two of them to kernel
> 3.17.0
>
> One using BTRFS for root.
> One using ext3 for root (via the ext4 driver)
>
> _Both_ systems exhibited strange behavior (long pauses and then hangs
> requiring hard-power) within several hours. Both then had random
> filesystem damage.
>
> On the BTRFS system much of my browser settings for firefox were
> trashed, particularly the cookies and saved configurations for
> add-ons (like which sites had scripts enabled/disabled in no-script) etc.
>
> On the ext3/4 system there were several corruptions including a
> pipe/special file with a large non-zero size that required I do a
> "fsck -fyD /dev/sda3" to repair. (one comment from fsck was that the
> pipe/special file "looked like a directory" or some such)
>
> So I can say that corruption is taking place, but I suspect it is
> _not_ happening in the BTRFS specific code.
>
> (ASIDE: both systems are older amd64 using built-in radeon display
> hardware.)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-14 17:22 ` David Arendt
@ 2014-10-14 20:06 ` Robert White
0 siblings, 0 replies; 13+ messages in thread
From: Robert White @ 2014-10-14 20:06 UTC (permalink / raw)
To: David Arendt, linux-btrfs
On 10/14/2014 10:22 AM, David Arendt wrote:
> I didn't notice a corruption on other filesystems with kernel 3.17.0.
> Also I didn't experience any hangs except when trying to mount a
> corrupted btrfs but this was causing a hang within less than 10 seconds.
> It could be that your problem is unrelated and that the corruption you
> are experiencing is due to an unrelated hang followed by a hard
> powerdown. Have you been able to capture any btrfs related kernel panics?
My installation is _not_ well suited for capturing panics.
I have not been able to capture any panics on either system and I had
to just switch back to 3.16.3 as the two systems were my firewall (ext4)
and my primary laptop (BTRFS). I didn't want to grind them up with
repeated crashes and corruptions.
I only let the firewall fault once before switching back. The laptop
faulted and hung twice under 3.17.0 before I switched it back, thinking
it was a radeon graphics driver issue. Then I logged into the firewall
via ssh to check something and, three shell commands or so in, it went
to lunch (but the firewall layer was still passing packets).
The only actual sign of filesystem corruption on the laptop was the
sudden absence or corruption of the (sqlite3 format) history and
settings files. But firefox was the only thing I'd been actively using.
Given the way the firewall jammed up and died, the kind of corruption
(special files don't get updated that much, let alone to link up a
directory), and the fact that it ran fine as a firewall for several
hours and then died as soon as I touched the file system, I suspect that
there is something fishy in dcache or the vnode layers.
It was too much too soon on two otherwise stable systems.
I offered this email here because I noticed that people were seeing
"BTRFS corruption" with 3.17 and I'd seen both BTRFS and EXT4
corruption, which suggests that BTRFS _isn't_ particularly culpable.
^ permalink raw reply [flat|nested] 13+ messages in thread
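Since the damaged Firefox files mentioned above are in sqlite3 format, one way to tell "file is gone or unreadable" from "file opens but is internally inconsistent" is SQLite's own `PRAGMA integrity_check`. The following is only a minimal sketch using Python's standard `sqlite3` module; the helper name `sqlite_integrity` is made up for illustration, and which files to check is an assumption about your own profile.

```python
import sqlite3
from pathlib import Path


def sqlite_integrity(db_path):
    """Return the rows reported by PRAGMA integrity_check for one file.

    A healthy database yields the single row "ok". A file that cannot be
    opened or is not an SQLite database yields one "error: ..." entry.
    (Hypothetical helper, not part of Firefox or btrfs tooling.)
    """
    try:
        # Open read-only so the check cannot modify an already damaged file.
        uri = Path(db_path).resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
        return [row[0] for row in rows]
    except sqlite3.DatabaseError as exc:
        return [f"error: {exc}"]
```

Calling it on a copy of a profile database (for example a copy of `places.sqlite` or `cookies.sqlite`) while Firefox is not running avoids lock contention; anything other than `["ok"]` suggests restoring that file from backup.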
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-14 16:54 Random file system corruption in 3.17 (not BTRFS related...?) Robert White
2014-10-14 17:22 ` David Arendt
@ 2014-10-14 22:35 ` Duncan
2014-10-15 7:08 ` Juan Orti Alcaine
2 siblings, 0 replies; 13+ messages in thread
From: Duncan @ 2014-10-14 22:35 UTC (permalink / raw)
To: linux-btrfs
Robert White posted on Tue, 14 Oct 2014 09:54:51 -0700 as excerpted:
> On the BTRFS system much of my browser settings for firefox were
> trashed, particularly the cookies and saved configurations for add-ons
> (like which sites had scripts enabled/disabled in no-script) etc.
FWIW, this reply is more toward the firefox corruption than the
why-particulars of the crash.
The prefs.js file in the profile dir holds addon settings and seems to
be particularly sensitive to corruption. At least here, firefox has
created several backups, prefs-1.js thru prefs-7.js, I suppose at
upgrade.
The first time I lost settings I restored prefs-7.js (the
newest/largest of the backups) as prefs.js, and only lost a few settings
that I had changed since the last upgrade, which had changed the firefox
interface so I had to change my settings accordingly.
The time or two since then that I hard-crashed and lost my addons, I
was able to replace the prefs.js file from a recent /home backup.
Anyway, it's the prefs.js file that you want to restore. Whether it's
from the last prefs-N.js backup that firefox did, or from your own
backup, prefs.js is it.
As for cookies, history, etc. I didn't notice them going corrupt. I do
run raid1 btrfs and after a crash, do a scrub, which may recover some
files. And I run tight enough security that most cookies are
session-only (and no third-party), so that file won't be written to
much, which probably saves it. I don't know about history. Maybe it was
corrupted and I simply didn't notice it, as I don't use history that
often.
--
Duncan - List replies preferred. No HTML msgs.
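The restore described above (take the newest prefs-N.js and put it back as prefs.js) could be scripted roughly as follows. This is a sketch under assumptions: it treats the highest N as the newest backup, which matches the prefs-1.js thru prefs-7.js description but is not something Firefox documents here, and the function names and the `prefs.js.corrupt` file name are invented for illustration.

```python
import re
import shutil
from pathlib import Path

# Matches prefs-1.js, prefs-7.js, ... but not prefs.js itself.
BACKUP_PATTERN = re.compile(r"^prefs-(\d+)\.js$")


def newest_prefs_backup(profile_dir):
    """Return the highest-numbered prefs-N.js in profile_dir, or None."""
    candidates = []
    for entry in Path(profile_dir).iterdir():
        match = BACKUP_PATTERN.match(entry.name)
        if match and entry.is_file():
            candidates.append((int(match.group(1)), entry))
    if not candidates:
        return None
    # Compare numerically so that prefs-10.js would beat prefs-7.js.
    return max(candidates, key=lambda item: item[0])[1]


def restore_prefs(profile_dir):
    """Copy the chosen backup over prefs.js, keeping the damaged file aside."""
    profile = Path(profile_dir)
    backup = newest_prefs_backup(profile)
    if backup is None:
        raise FileNotFoundError("no prefs-N.js backup found")
    current = profile / "prefs.js"
    if current.exists():
        # Keep the damaged file for later inspection instead of deleting it.
        shutil.copy2(current, profile / "prefs.js.corrupt")
    shutil.copy2(backup, current)
    return backup
```

It is presumably best run with Firefox closed, so the restored file is not overwritten on exit; the profile directory path is whatever your installation uses and is deliberately not guessed here.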
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-14 16:54 Random file system corruption in 3.17 (not BTRFS related...?) Robert White
2014-10-14 17:22 ` David Arendt
2014-10-14 22:35 ` Duncan
@ 2014-10-15 7:08 ` Juan Orti Alcaine
2014-10-15 8:53 ` Duncan
2014-10-15 13:46 ` Josef Bacik
2 siblings, 2 replies; 13+ messages in thread
From: Juan Orti Alcaine @ 2014-10-15 7:08 UTC (permalink / raw)
To: linux-btrfs
On 2014-10-14 18:54, Robert White wrote:
> Howdy,
>
> So I run several gentoo systems and I upgraded two of them to kernel
> 3.17.0
>
> One using BTRFS for root.
> One using ext3 for root (via the ext4 driver)
>
> _Both_ systems exhibited strange behavior (long pauses and then hangs
> requiring hard-power) within several hours. Both then had random
> filesystem damage.
>
> On the BTRFS system much of my browser settings for firefox were
> trashed, particularly the cookies and saved configurations for
> add-ons (like which sites had scripts enabled/disabled in no-script)
> etc.
>
> On the ext3/4 system there were several corruptions including a
> pipe/special file with a large non-zero size that required I do a
> "fsck -fyD /dev/sda3" to repair. (one comment from fsck was that the
> pipe/special file "looked like a directory" or some such)
>
> So I can say that corruption is taking place, but I suspect it is
> _not_ happening in the BTRFS specific code.
>
> (ASIDE: both systems are older amd64 using built-in radeon display
> hardware.)
>
I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.
My system has 4 drives and 2 fs:
- 1 SSD in single
- 3 HDD in RAID1
I do readonly snapshots every hour of all the subvolumes, so I have
hundreds of snapshots.
Now I'm back in 3.16.4 without any problems. I'm trying to reproduce my
setup in a virtual machine. If the corruption happens again, I'll send
you more data on this problem.
--
Juan Orti
https://miceliux.com
^ permalink raw reply [flat|nested] 13+ messages in thread
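For readers trying to reproduce a setup like the one above, an hourly snapshot job boils down to one `btrfs subvolume snapshot` call per subvolume, with `-r` making the snapshot read-only. The sketch below only builds the argument list and does not run anything; the naming scheme, the snapshot directory, and the function name are assumptions, not details from the message.

```python
from datetime import datetime
from pathlib import PurePosixPath


def snapshot_command(subvolume, snapshot_dir, when, readonly=True):
    """Build the argv for one timestamped btrfs snapshot (nothing is executed).

    The "<name>-YYYYmmdd-HHMM" naming is a hypothetical convention.
    """
    # "/" has an empty final component, so fall back to a fixed label.
    label = PurePosixPath(subvolume).name or "root"
    dest = str(PurePosixPath(snapshot_dir) / f"{label}-{when:%Y%m%d-%H%M}")
    argv = ["btrfs", "subvolume", "snapshot"]
    if readonly:
        argv.append("-r")  # read-only snapshot
    argv += [subvolume, dest]
    return argv
```

Passing `readonly=False` yields the writable variant, which makes it easy to compare the two snapshot types in a test VM.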
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-15 7:08 ` Juan Orti Alcaine
@ 2014-10-15 8:53 ` Duncan
2014-10-15 13:46 ` Josef Bacik
1 sibling, 0 replies; 13+ messages in thread
From: Duncan @ 2014-10-15 8:53 UTC (permalink / raw)
To: linux-btrfs
Juan Orti Alcaine posted on Wed, 15 Oct 2014 09:08:14 +0200 as excerpted:
> I've also experienced Btrfs corruptions with 3.17.0
> I do readonly snapshots every hour of all the subvolumes, so I have
> hundreds of snapshots.
That's a known issue with read-only snapshots in 3.17.0. There's quite
a thread on the list about it.
So I'd suggest either turning off read-only snapshots on 3.17 (which
I'm running here without snapshots, no problem), possibly switching to
writable snapshots as they don't seem to trigger the problem, or as you
mentioned doing already, going back to 3.16.x (x>2 due to another bug,
latest should be good), until the read-only snapshots issue with 3.17.0
is traced down and fixed.
Given the approximately two kernel cycles it took for the widely
reproduced but rather difficult to trace compression-related bug in 3.15
to be reported in 3.15 and traced and fixed in 3.17-rc2 and 3.16.2, I'd
guess a fix for this similarly widely reproduced
read-only-snapshot-related bug should be no later than 3.19-rc3 and
3.18.3, possibly rather earlier if it proves easier to trace, especially
since this one seems to have been reported and recognized as widely
occurring a bit faster than the compression-related bug.
But with testing, etc, it's still likely to be late in the 3.18-rc
cycle before mainline commit, so it'll probably be rather late in the
3.17.x stable cycle, if it makes it at all. Unless it gets picked as a
long-term support kernel, the full 3.17 stable cycle might in fact be
blacklisted for btrfs due to this bug, much like the full 3.15 stable
cycle ended up being blacklisted due to the compression-related bug.
So either switch your snapshots to writable if it's not going to
interfere with your use-case, or stay on the 3.16.x, x>2, stable series
until the problem is fixed, hopefully with 3.18.0, tho it might be
3.18.2 or so. I seriously doubt it'll be longer than that, because it's
a well reproduced bug which makes it both high priority and easy to
test fixes for.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 13+ messages in thread
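The version advice above can be written down as a small check. This is only a sketch of the rule of thumb as stated in this message (3.15 series affected by the compression bug, 3.16.x fine for x>2, 3.17.0 risky with read-only snapshots); the return codes and function names are invented, and it makes no claim about kernels the message does not mention.

```python
import re


def parse_kernel_version(release):
    """Parse '3.16.4-gentoo' into (3, 16, 4); a missing patch level counts as 0."""
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def btrfs_advice(release, uses_readonly_snapshots):
    """Map a kernel release to a hypothetical advice code based on this thread."""
    version = parse_kernel_version(release)
    if version[:2] == (3, 15):
        return "avoid-series"          # compression-related bug
    if version[:2] == (3, 16):
        # "x>2 due to another bug"
        return "ok" if version[2] > 2 else "upgrade-within-series"
    if version == (3, 17, 0) and uses_readonly_snapshots:
        return "avoid-ro-snapshots"    # read-only snapshot issue
    # Anything else, including later 3.17.x, is not settled by this thread.
    return "no-advice"
```

On a running system the release string would typically come from `platform.release()` or `uname -r`.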
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-15 7:08 ` Juan Orti Alcaine
2014-10-15 8:53 ` Duncan
@ 2014-10-15 13:46 ` Josef Bacik
2014-10-15 14:05 ` Juan Orti Alcaine
1 sibling, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2014-10-15 13:46 UTC (permalink / raw)
To: Juan Orti Alcaine, linux-btrfs
On 10/15/2014 03:08 AM, Juan Orti Alcaine wrote:
> On 2014-10-14 18:54, Robert White wrote:
>> Howdy,
>>
>> So I run several gentoo systems and I upgraded two of them to kernel
>> 3.17.0
>>
>> One using BTRFS for root.
>> One using ext3 for root (via the ext4 driver)
>>
>> _Both_ systems exhibited strange behavior (long pauses and then hangs
>> requiring hard-power) within several hours. Both then had random
>> filesystem damage.
>>
>> On the BTRFS system much of my browser settings for firefox were
>> trashed, particularly the cookies and saved configurations for
>> add-ons (like which sites had scripts enabled/disabled in no-script)
>> etc.
>>
>> On the ext3/4 system there were several corruptions including a
>> pipe/special file with a large non-zero size that required I do a
>> "fsck -fyD /dev/sda3" to repair. (one comment from fsck was that the
>> pipe/special file "looked like a directory" or some such)
>>
>> So I can say that corruption is taking place, but I suspect it is
>> _not_ happening in the BTRFS specific code.
>>
>> (ASIDE: both systems are older amd64 using built-in radeon display
>> hardware.)
>>
>
> I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
> It has happened two times, each one after a clean reinstall and a wipe
> of the old fs. In less than a day, both installations got corrupted and
> the filesystems went readonly. When listing the contents, I saw many
> directories with question marks.
>
> My system has 4 drives and 2 fs:
> - 1 SSD in single
> - 3 HDD in RAID1
Did it happen on both fs'es or just one? Thanks,
Josef
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-15 13:46 ` Josef Bacik
@ 2014-10-15 14:05 ` Juan Orti Alcaine
2014-10-15 14:30 ` Josef Bacik
0 siblings, 1 reply; 13+ messages in thread
From: Juan Orti Alcaine @ 2014-10-15 14:05 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs
On 2014-10-15 15:46, Josef Bacik wrote:
> On 10/15/2014 03:08 AM, Juan Orti Alcaine wrote:
>> I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
>> It has happened two times, each one after a clean reinstall and a wipe
>> of the old fs. In less than a day, both installations got corrupted and
>> the filesystems went readonly. When listing the contents, I saw many
>> directories with question marks.
>>
>> My system has 4 drives and 2 fs:
>> - 1 SSD in single
>> - 3 HDD in RAID1
>
> Did it happen on both fs'es or just one? Thanks,
>
> Josef
Both filesystems were corrupted. I have / in the SSD and /home in the
HDDs.
I didn't notice anything while working with the system, I only
discovered the problem when booting up after the second or third reboot
and seeing the service failing to start. Could it be something related
to the mount/umount logic?
--
Juan Orti
https://miceliux.com
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-15 14:05 ` Juan Orti Alcaine
@ 2014-10-15 14:30 ` Josef Bacik
2014-10-15 14:34 ` Juan Orti Alcaine
2014-10-15 19:30 ` Rich Freeman
0 siblings, 2 replies; 13+ messages in thread
From: Josef Bacik @ 2014-10-15 14:30 UTC (permalink / raw)
To: Juan Orti Alcaine; +Cc: linux-btrfs
On 10/15/2014 10:05 AM, Juan Orti Alcaine wrote:
> On 2014-10-15 15:46, Josef Bacik wrote:
>> On 10/15/2014 03:08 AM, Juan Orti Alcaine wrote:
>>> I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
>>> It has happened two times, each one after a clean reinstall and a wipe
>>> of the old fs. In less than a day, both installations got corrupted and
>>> the filesystems went readonly. When listing the contents, I saw many
>>> directories with question marks.
>>>
>>> My system has 4 drives and 2 fs:
>>> - 1 SSD in single
>>> - 3 HDD in RAID1
>>
>> Did it happen on both fs'es or just one? Thanks,
>>
>> Josef
>
> Both filesystems were corrupted. I have / in the SSD and /home in the HDDs.
>
> I didn't notice anything while working with the system, I only
> discovered the problem when booting up after the second or third reboot
> and seeing the service failing to start. Could it be something related
> to the mount/umount logic?
>
We've found it, the Fedora guys are reverting the bad patch now, we'll
get the fix sent back to stable shortly. Sorry about that.
Josef
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-15 14:30 ` Josef Bacik
@ 2014-10-15 14:34 ` Juan Orti Alcaine
2014-10-15 19:30 ` Rich Freeman
1 sibling, 0 replies; 13+ messages in thread
From: Juan Orti Alcaine @ 2014-10-15 14:34 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs
On 2014-10-15 16:30, Josef Bacik wrote:
> On 10/15/2014 10:05 AM, Juan Orti Alcaine wrote:
>> On 2014-10-15 15:46, Josef Bacik wrote:
>>> On 10/15/2014 03:08 AM, Juan Orti Alcaine wrote:
>>>> I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21
>>>> alpha).
>>>> It has happened two times, each one after a clean reinstall and a
>>>> wipe of the old fs. In less than a day, both installations got
>>>> corrupted and the filesystems went readonly. When listing the
>>>> contents, I saw many directories with question marks.
>>>>
>>>> My system has 4 drives and 2 fs:
>>>> - 1 SSD in single
>>>> - 3 HDD in RAID1
>>>
>>> Did it happen on both fs'es or just one? Thanks,
>>>
>>> Josef
>>
>> Both filesystems were corrupted. I have / in the SSD and /home in the
>> HDDs.
>>
>> I didn't notice anything while working with the system, I only
>> discovered the problem when booting up after the second or third
>> reboot and seeing the service failing to start. Could it be something
>> related to the mount/umount logic?
>>
>
> We've found it, the Fedora guys are reverting the bad patch now, we'll
> get the fix sent back to stable shortly. Sorry about that.
Thanks to you. Fortunately I have good backups.
--
Juan Orti
https://miceliux.com
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-15 14:30 ` Josef Bacik
2014-10-15 14:34 ` Juan Orti Alcaine
@ 2014-10-15 19:30 ` Rich Freeman
2014-10-15 20:20 ` Josef Bacik
1 sibling, 1 reply; 13+ messages in thread
From: Rich Freeman @ 2014-10-15 19:30 UTC (permalink / raw)
To: Josef Bacik; +Cc: Juan Orti Alcaine, Btrfs BTRFS
On Wed, Oct 15, 2014 at 10:30 AM, Josef Bacik <jbacik@fb.com> wrote:
> We've found it, the Fedora guys are reverting the bad patch now, we'll get
> the fix sent back to stable shortly. Sorry about that.
After reverting this commit, can the bad snapshots be
deleted/repaired/etc without wiping and restoring the entire
filesystem? Copying 2.3TB of data isn't a particularly fast
operation...
--
Rich
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-15 19:30 ` Rich Freeman
@ 2014-10-15 20:20 ` Josef Bacik
2014-10-17 16:26 ` Filipe David Manana
0 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2014-10-15 20:20 UTC (permalink / raw)
To: Rich Freeman; +Cc: Juan Orti Alcaine, Btrfs BTRFS
On 10/15/2014 03:30 PM, Rich Freeman wrote:
> On Wed, Oct 15, 2014 at 10:30 AM, Josef Bacik <jbacik@fb.com> wrote:
>> We've found it, the Fedora guys are reverting the bad patch now, we'll get
>> the fix sent back to stable shortly. Sorry about that.
>
> After reverting this commit, can the bad snapshots be
> deleted/repaired/etc without wiping and restoring the entire
> filesystem? Copying 2.3TB of data isn't a particularly fast
> operation...
>
I would certainly like to make fsck repair this sort of problem, let me
reproduce the corruption locally and then make fsck fix it and then you
can use that. Thanks,
Josef
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Random file system corruption in 3.17 (not BTRFS related...?)
2014-10-15 20:20 ` Josef Bacik
@ 2014-10-17 16:26 ` Filipe David Manana
0 siblings, 0 replies; 13+ messages in thread
From: Filipe David Manana @ 2014-10-17 16:26 UTC (permalink / raw)
To: Josef Bacik; +Cc: Rich Freeman, Juan Orti Alcaine, Btrfs BTRFS
On Wed, Oct 15, 2014 at 9:20 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 10/15/2014 03:30 PM, Rich Freeman wrote:
>>
>> On Wed, Oct 15, 2014 at 10:30 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>
>>> We've found it, the Fedora guys are reverting the bad patch now, we'll
>>> get
>>> the fix sent back to stable shortly. Sorry about that.
>>
>>
>> After reverting this commit, can the bad snapshots be
>> deleted/repaired/etc without wiping and restoring the entire
>> filesystem? Copying 2.3TB of data isn't a particularly fast
>> operation...
>>
>
> I would certainly like to make fsck repair this sort of problem, let me
> reproduce the corruption locally and then make fsck fix it and then you can
> use that. Thanks,
I just sent out a patch for fsck to fix this issue - i.e. bad read-only
snapshots (inaccessible without errors, impossible to delete, etc).
It fixes the snapshots if, and only if, you haven't run fsck in repair
mode (--repair) before, as that would touch back references and other
metadata, since it didn't expect root items to be incorrect (which is
essentially what the snapshots bug caused).
The patch is this one: https://patchwork.kernel.org/patch/5098331/
Also, if you have errors accessing files through a path that doesn't
contain any of the read-only snapshots, it's possible that it's the
corruption bug we had in 3.17 - bad extent map manipulation, that
manifests itself in several ways (e.g. reports:
http://www.spinics.net/lists/linux-btrfs/msg38045.html and
http://www.spinics.net/lists/linux-btrfs/msg37567.html).
Anyway, if you run into further issues, please report them.
thanks
>
> Josef
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
^ permalink raw reply [flat|nested] 13+ messages in thread
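Because the fix described above only helps if fsck has not already been run in repair mode, a wrapper that keeps `--repair` strictly opt-in may be worth having while waiting for the patched tool. The sketch below only builds an argument list for `btrfs check` (which inspects without modifying unless `--repair` is given) and runs nothing; the function name and the `confirmed` safeguard are illustrative assumptions, and it presumes a btrfs-progs build that actually contains the patch.

```python
def btrfs_check_command(device, repair=False, confirmed=False):
    """Build an argv list for `btrfs check`; read-only unless repair is confirmed.

    Hypothetical helper: it never executes the command itself.
    """
    argv = ["btrfs", "check"]
    if repair:
        if not confirmed:
            # Repair mode rewrites metadata, so require an explicit second flag.
            raise ValueError("--repair modifies metadata; pass confirmed=True")
        argv.append("--repair")
    argv.append(device)
    return argv
```

The device should be unmounted before the resulting command is run, and a read-only pass first gives a record of what the tool reports before anything is changed.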