From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ig0-f179.google.com ([209.85.213.179]:36355 "EHLO
 mail-ig0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1753189AbcELPnn (ORCPT ); Thu, 12 May 2016 11:43:43 -0400
Received: by mail-ig0-f179.google.com with SMTP id lr7so57394957igb.1
 for ; Thu, 12 May 2016 08:43:42 -0700 (PDT)
Subject: Re: btrfs ate my data in just two days, after a fresh install.
 ram and disk are ok. it still mounts, but I cannot repair
To: Niccolò Belli, linux-btrfs@vger.kernel.org
References: <3bf4a554-e3b8-44e2-b8e7-d08889dcffed@linuxsystems.it>
 <20160505174854.GA1012@vader.dhcp.thefacebook.com>
 <585760e0-7d18-4fa0-9974-62a3d7561aee@linuxsystems.it>
 <2cd5aca36f853f3c9cf1d46c2f133aa3@linuxsystems.it>
 <799cf552-4612-56c5-b44d-59458119e2b0@gmail.com>
 <52f0c710-d695-443d-b6d5-266e3db634f8@linuxsystems.it>
 <20160509162940.GC15597@hungrycats.org>
Cc: Clemens Eisserer, Patrik Lundquist, Chris Murphy, Qu Wenruo,
 Omar Sandoval, Zygo Blaxell, 1i5t5.duncan@cox.net
From: "Austin S. Hemmelgarn"
Message-ID:
Date: Thu, 12 May 2016 11:43:38 -0400
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On 2016-05-12 10:35, Niccolò Belli wrote:
> On Monday, 9 May 2016 18:29:41 CEST, Zygo Blaxell wrote:
>> Did you also check the data matches the backup? btrfs check will only
>> look at the metadata, which is 0.1% of what you've copied. From what
>> you've written, there should be a lot of errors in the data too. If you
>> have incorrect data but btrfs scrub finds no incorrect checksums, then
>> your storage layer is probably fine and we have to look at CPU, host RAM,
>> and software as possible culprits.
>>
>> The logs you've posted so far indicate that bad metadata (e.g.
>> negative item lengths, nonsense transids in metadata references but
>> sane transids in the referred pages) is getting into otherwise valid
>> and well-formed btrfs metadata pages. Since these pages are protected
>> by checksums, the corruption can't be originating in the storage
>> layer--if it was, the pages should be rejected as they are read from
>> disk, before btrfs even looks at them, and the insane transid should be
>> the "found" one not the "expected" one. That suggests there is either
>> RAM corruption happening _after_ the data is read from disk (i.e. while
>> the pages are cached in RAM), or a severe software bug in the kernel
>> you're running.
>
> When doing the btrfs check I also always do a btrfs scrub and it never
> found any error. Once it didn't manage to finish the scrub because of:
> BTRFS critical (device dm-0): corrupt leaf, slot offset bad:
> block=670597120,root=1, slot=6
> and btrfs scrub status reported "was aborted after 00:00:10".
>
> Talking about scrub, I created a systemd timer to run scrub hourly and I
> noticed 2 *uncorrectable* errors suddenly appeared on my system. So I
> immediately re-ran the scrub just to confirm it and then I rebooted into
> the Arch live USB and ran btrfs check: the metadata were perfect. So
> I ran btrfs scrub from the live USB and there were no errors at all!
> I rebooted into my system and ran scrub once again and the
> uncorrectable errors were really gone! It happened two times in the
> past few days.

This would indicate to me that you've either got bad RAM (most likely),
or some other hardware component is not working correctly. It's not
unusual for hardware issues to be intermittent.

>
>> Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
>> maintains your kernel had a bad day and merged a patch they should
>> not have.
>
> Almost no patches get applied by the Arch kernel team:
> https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux
> At the moment the only one is a harmless
> "change-default-console-loglevel.patch".
>
>> Try a minimal configuration with as few drivers as possible loaded,
>> especially GPU drivers and anything from the staging subdirectory--when
>> these drivers have bugs, they ruin everything.
>
> The Arch kernel team is quite conservative regarding staging/experimental
> features; I remember they rejected some config patches I submitted
> because of this.
> Anyway I will try to blacklist as many kernel modules as I can. Maybe
> blacklisting the GPU is too much, because if I can't actually use my
> laptop it will be much more difficult to reproduce the issue.

Disable the GPU driver, but make sure you have the VGA_CONSOLE config
enabled, and you should be fine (you'll just get an 80x25 text-mode
console instead of a high-resolution one).

>
>> Try memtest86+ which has a few more/different tests than memtest86.
>> I have encountered RAM modules that pass memtest86 but fail memtest86+
>> and vice versa.
>>
>> Try memtester, a memory tester that runs as a Linux process, so it can
>> detect corruption caused when device drivers spray data randomly into
>> RAM, or when the CPU thermal controls are influenced by Linux (an
>> overheating CPU-to-RAM bridge can really ruin your day, and some of the
>> dumber laptop designs rely on the OS for thermal management).
>>
>> Try running more than one memory testing process, in case there is a bug
>> in your hardware that affects interactions between multiple cores
>> (memtest is single-threaded). You can run memtest86 inside a kvm (e.g.
>> kvm -m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
>>
>> Kernel compiles are a bad way to test RAM. I've successfully built
>> kernels on hosts with known RAM failures.
>> The kernels don't always work properly, but it's quite rare to see a
>> build fail outright.
>
> I didn't use memtest86+ because of the lack of EFI support, but I just
> tried the shiny new memtest86 7.0 beta with improved tests for 12+ hours
> without issues.
> I also ran "memtester 4G" and "systester-cli -gausslg 64M -threads 4
> -turns 100000" together for 12 hours without any issue, so I think both
> my RAM and CPU are OK.

That's probably a good indication of the CPU and the MB being OK, but
not necessarily the RAM. There are two other options for testing the RAM
that haven't been mentioned yet (and which I hadn't thought of myself
until now):

1. If you have access to Windows, try the Windows Memory Diagnostic.
   This runs yet another slightly different set of tests from memtest86
   and memtest86+, so it may catch issues they don't. You can start this
   directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI
   from the EFI system partition.
2. This is a Dell system. If you still have the utility partition which
   Dell ships all their pre-provisioned systems with, that should have a
   hardware diagnostics tool. I doubt that this will find anything (it's
   part of their QA procedure AFAICT), but it's probably worth trying, as
   the memory testing in that uses yet another slightly different
   implementation of the typical tests. You can usually find this in the
   boot interrupt menu accessed by hitting F12 before the boot-loader
   loads.

>
> I can think of only two possible culprits now (correct me if I'm wrong):
> 1) A btrfs bug
> 2) Another module screwing things up

It could still be the disk (not likely, but possible) or the storage
controller. If you have a spare disk, I'd suggest trying with that
(assuming of course it doesn't void your warranty).

>
> I can do nothing about btrfs bugs so I will try to hunt the second
> option.
> This is the list of modules I'm running:
>
> lsmod | awk '$4 == ""' | awk '{print $1}' | sort
>
> 8250_dw
> ac
> acpi_als
> acpi_pad
> aesni_intel
> ahci
> algif_skcipher
> ansi_cprng
> arc4
> atkbd
> battery
> bnep
> btrfs
> btusb
> cdc_ether
> cmac
> coretemp
> crc32c_intel
> crc32_pclmul
> crct10dif_pclmul
> dell_laptop
> dell_wmi
> dm_crypt
> drbg
> ecb
> elan_i2c
> evdev
> ext4
> fan
> fjes
> ghash_clmulni_intel
> gpio_lynxpoint
> hid_generic
> hid_multitouch
> hmac
> i2c_designware_platform
> i2c_hid
> i2c_i801
> i915
> input_leds
> int3400_thermal
> int3402_thermal
> int3403_thermal
> intel_hid
> intel_pch_thermal
> intel_powerclamp
> intel_rapl
> ip_tables
> iTCO_wdt
> iwlmvm
> jitterentropy_rng
> joydev
> kvm_intel
> lpc_ich
> mac_hid
> mei_me
> mos7720
> mousedev
> msr
> nls_cp437
> nls_iso8859_1
> nvram
> pcspkr
> pl2303
> processor
> processor_thermal_device
> psmouse
> r8152
> rfcomm
> rtsx_pci_ms
> rtsx_pci_sdmmc
> sch_fq_codel
> sdhci_acpi
> sd_mod
> serio_raw
> sha256_ssse3
> shpchp
> snd_hda_codec_hdmi
> snd_hda_intel
> snd_soc_ssm4567
> snd_soc_sst_acpi
> snd_soc_sst_broadwell
> spi_pxa2xx_platform
> thermal
> tpm_crb
> tpm_tis
> uas
> usbhid
> uvcvideo
> vfat
> visor
> x86_pkg_temp_thermal
> xhci_pci
>
> I will try to blacklist as many as I can while still keeping a somewhat
> usable system and see if I can reproduce it. If I am not able to
> reproduce it anymore then the hunt will begin.
> It will not be a fun one, as I already experienced with hid-multitouch,
> which gave me random kernel hangs at boot ONLY if loaded early into the
> initramfs:
> https://bugzilla.kernel.org/show_bug.cgi?id=105251

Based on what you've got listed for modules, I'd expect the absolute
minimum for a usable test system to be:

ac
acpi_als (you can probably remove this, it's for the ambient light sensor)
acpi_pad
ahci
atkbd
battery
btrfs
coretemp
dell_laptop
dell_wmi
elan_i2c
evdev
ext4
fan
gpio_lynxpoint
hid_generic
hid_multitouch
i2c_i801
i915 (this is your GPU module, you should still have a usable text
  console if this isn't loaded)
int3400_thermal
int3402_thermal
int3403_thermal
intel_hid
intel_pch_thermal
intel_powerclamp
intel_rapl
ip_tables (if you have no firewall configured, you can safely blacklist
  this)
iwlmvm (you might try removing this, but you will have no wifi without it)
lpc_ich
mousedev
nvram (you might be able to remove this, I don't remember if the dell
  modules depend on it or not)
processor
processor_thermal_device
psmouse
r8152 (you can try removing this too, but you will have no ethernet
  without it)
sch_fq_codel
serio_raw
spi_pxa2xx_platform
thermal
usbhid
vfat (if you avoid mounting your EFI system partition, you can probably
  pull this out)
x86_pkg_temp_thermal
xhci_pci

Note that this assumes you aren't testing on dm-crypt. Make absolutely
certain, though, that you don't remove any of the *thermal modules, the
fan module, or the dell modules; not having those may result in hardware
damage.
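
For what it's worth, here is a rough sketch of how the blacklist could be
generated from the two lists, rather than typed by hand. It assumes one
module name per line in each input file. The function name gen_blacklist
and the file names in the usage comment are made up for illustration. The
filter that refuses to blacklist thermal, fan, and dell modules is just a
guard reflecting the warning above, not a complete safety check.

```shell
#!/bin/sh
# Sketch only: print modprobe.d "blacklist" lines for every loaded module
# that is not on a keep list. gen_blacklist and the file names below are
# hypothetical and not taken from this thread.

# gen_blacklist LOADED KEEP
#   LOADED: file with one loaded module name per line
#   KEEP:   file with one module name per line that must stay loaded
gen_blacklist() {
    loaded_sorted=$(mktemp) || return 1
    keep_sorted=$(mktemp) || { rm -f "$loaded_sorted"; return 1; }

    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort -u "$1" > "$loaded_sorted"
    LC_ALL=C sort -u "$2" > "$keep_sorted"

    # comm -23 keeps names found only in the loaded list. The grep is a
    # guard: never emit thermal, fan, or dell modules, even if they were
    # left off the keep list by mistake.
    LC_ALL=C comm -23 "$loaded_sorted" "$keep_sorted" \
        | grep -Ev 'thermal|^fan$|^dell_' \
        | sed 's/^/blacklist /'

    rm -f "$loaded_sorted" "$keep_sorted"
}

# Possible usage (paths are examples):
#   lsmod | awk 'NR > 1 {print $1}' > loaded.txt
#   gen_blacklist loaded.txt keep.txt > /etc/modprobe.d/test-blacklist.conf
```

Keep in mind that a "blacklist" line in modprobe.d only stops a module
from being auto-loaded through its aliases; a module that is requested
explicitly, or pulled in as a dependency of another module, will still
load, so check lsmod again after rebooting.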