Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair


From: "Niccolò Belli" <darkbasic@linuxsystems.it>
To: <linux-btrfs@vger.kernel.org>
Cc: Clemens Eisserer <linuxhippy@gmail.com>,
	"Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Patrik Lundquist <patrik.lundquist@gmail.com>,
	Chris Murphy <lists@colorremedies.com>,
	Qu Wenruo <quwenruo@cn.fujitsu.com>,
	Omar Sandoval <osandov@osandov.com>
Subject: Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
Date: Mon, 09 May 2016 16:53:13 +0200
Message-ID: <52f0c710-d695-443d-b6d5-266e3db634f8@linuxsystems.it>
In-Reply-To: <799cf552-4612-56c5-b44d-59458119e2b0@gmail.com>

On Sunday, 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
> Are you using any power management tweaks?

Yes, as stated in my very first post, I use TLP with
SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the bug
even without TLP. Also, for the past week I've always been on AC.

On Monday, 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> Memtest doesn't replicate typical usage patterns very well.  My 
> usual testing for RAM involves not just memtest, but also 
> booting into a LiveCD (usually SystemRescueCD), pulling down a 
> copy of the kernel source, and then running as many concurrent 
> kernel builds as cores, each with as many make jobs as cores (so 
> if you've got a quad core CPU (or a dual core with 
> hyperthreading), it would be running 4 builds with -j4 passed to 
> make).  GCC seems to have memory usage patterns that reliably 
> trigger memory errors that aren't caught by memtest, so this 
> generally gives good results.

Building a kernel with 4 concurrent threads is not an issue for my system;
in fact I compile a lot and I have never had any issue.
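
For reference, the test Austin describes is heavier than a single -j4
build: it runs as many builds as there are cores, each with as many make
jobs as cores. A minimal sketch of that plan follows. It assumes one
already configured copy of the kernel source per build under a
hypothetical $BASE directory (linux-1, linux-2, ...); those names are not
from this thread. It only prints the commands, it does not start them.

```shell
#!/bin/sh
# Hypothetical layout: $BASE/linux-1 ... $BASE/linux-N, one configured
# kernel source tree per concurrent build (an assumption, not from the thread).
BASE="${BASE:-$HOME/kbuild}"
# One build per core, and one make job per core inside each build.
CORES="${CORES:-$(nproc)}"

# Print one make command per core; nothing is executed here.
build_plan() {
    i=1
    while [ "$i" -le "$CORES" ]; do
        echo "make -C $BASE/linux-$i -j$CORES"
        i=$((i + 1))
    done
}

build_plan
```

To actually run the test, each printed command would be started in the
background and followed by a final wait, so that on a quad core 4 builds
with -j4 each are running at the same time.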

On Monday, 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> On a similar note, badblocks doesn't replicate filesystem like 
> access patterns, it just runs sequentially through the entire 
> disk.  This isn't as likely to give bad results, but it's still 
> important to know.  In particular, try running it over a dmcrypt 
> volume a couple of times (preferably with a different key each 
> time, pulling keys from /dev/urandom works well for this), as 
> that will result in writing different data.  For what it's 
> worth, when I'm doing initial testing of new disks, I always use 
> ddrescue to copy /dev/zero over the whole disk, then do it twice 
> through dmcrypt with different keys, copying from the disk to 
> /dev/null after each pass.  This gives random data on disk as a 
> starting point (which is good if you're going to use dmcrypt), 
> and usually triggers reallocation of any bad sectors as early as 
> possible.

While trying to find a common denominator for my issue I made lots of
backups of /dev/mapper/cryptroot and restored them into
/dev/mapper/cryptroot dozens of times (triggering a 150GB+ random data
write every time) without any issue (after restoring the backup I always
check the partition with btrfs check). So the disk doesn't seem to be the
culprit.

On Monday, 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> 1. If you have an eSATA port, try plugging your hard disk in 
> there and see if things work.  If that works but having the hard 
> drive plugged in internally doesn't, then the issue is probably 
> either that specific SATA port (in which case your chip-set is 
> bad and you should get a new system), or the SATA connector 
> itself (or the wiring, but that's not as likely when it's traces 
> on a PCB).  Normally I'd suggest just swapping cables and SATA 
> ports, but that's not really possible with a laptop.
> 2. If you have access to a reasonably large flash drive, or to 
> a USB to SATA adapter, try that as well, if it works on that but 
> not internally (or on an eSATA port), you've probably got a bad 
> SATA controller, and should get a new system.

My laptop doesn't have an eSATA port, and my only external drive that is
big enough is currently used for daily backups, since I fear data loss.

On Monday, 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
> 3. Try things without dmcrypt.  Adding extra layers makes it 
> harder to determine what is actually wrong.  If it works without 
> dmcrypt, try using different parameters for the encryption 
> (different ciphers is what I would try first).  If it works 
> reliably without dmcrypt, then it's either a bug in dmcrypt 
> (which I don't think is very likely), or it's bad interaction 
> between dmcrypt and BTRFS.  If it works with some encryption 
> parameters but not others, then that will help narrow down where 
> the issue is.

On Sunday, 8 May 2016 01:35:16 CEST, Chris Murphy wrote:
> You're making the troubleshooting unnecessarily difficult by
> continuing to use non-default options. *shrug*
>
> Every single layer you add complicates the setup and troubleshooting.
> Of course all of it should work together, many people do. But you're
> the one having the problem so in order to demonstrate whether this is
> a software bug or hardware problem, you need to test it with the most
> basic setup possible --> btrfs on plain partitions and default mount
> options.

I will try to recap because you obviously missed my previous e-mail: I
managed to replicate the irrecoverable corruption bug even with default
options and no dmcrypt at all. It was somewhat more difficult to
replicate with default options, so I started to play with different
combinations to find out whether something increased the chances of
getting corruption. I have the feeling that "autodefrag" increases the
chances of corruption, but I'm not 100% sure about it. Anyway,
triggering a reinstall of all packages with "pacaur -S $(pacman -Qe)"
gives a high chance of irrecoverable corruption. That command simply
extracts the tarballs from the cache and overwrites the already
installed files. It doesn't write lots of data (after reinstallation my
system is still quite small, just a few GBs) but it seems to be enough
to displease the filesystem.

To avoid losing my data, every time I power on or reboot my laptop I first
boot from an external drive and run btrfs check on /dev/mapper/cryptroot.
If it's still sane I back up /dev/mapper/cryptroot to an external SSD with
dd; otherwise I restore the previous copy from the SSD into
/dev/mapper/cryptroot.
I cannot keep up such an annoying workflow for long, so I really hope
someone will manage to track the bug down soon.
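
The check-then-backup-or-restore routine above can be sketched roughly as
follows. This is only an illustration, not the exact commands I run: the
image path on the SSD and the dd options are assumptions, and it relies on
btrfs check exiting non-zero when it finds problems. The script only
prints the dd command it would choose, so nothing gets overwritten by
accident.

```shell
#!/bin/sh
# Device from this thread; the image path on the SSD is hypothetical.
DEVICE="${DEVICE:-/dev/mapper/cryptroot}"
IMAGE="${IMAGE:-/mnt/ssd/cryptroot.img}"
# Check command, overridable so the logic can be tried without a real device.
CHECK="${CHECK:-btrfs check}"

# Print "backup" if the filesystem checks clean, "restore" otherwise.
decide_action() {
    if $CHECK "$1" >/dev/null 2>&1; then
        echo backup
    else
        echo restore
    fi
}

# Print the dd command matching the decision (not executed).
plan() {
    case "$(decide_action "$DEVICE")" in
        backup)  echo "dd if=$DEVICE of=$IMAGE bs=4M conv=fsync" ;;
        restore) echo "dd if=$IMAGE of=$DEVICE bs=4M conv=fsync" ;;
    esac
}

plan
```

Printing the command instead of running it is deliberate: the restore
direction overwrites the whole device, so it is worth a manual look
before pasting it into a root shell.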

Thanks for your help, I really appreciate it.
Niccolò
