* FS corruption when mounting non-degraded after mounting degraded
@ 2016-01-19 11:28 Rian Hunter
2016-01-19 13:54 ` Duncan
2016-01-21 16:26 ` Rich Freeman
0 siblings, 2 replies; 8+ messages in thread
From: Rian Hunter @ 2016-01-19 11:28 UTC (permalink / raw)
To: linux-btrfs
In my raid6 setup, a disk was soft-failing on me. I pulled the disk,
inserted a new one, mounted degraded, then did btrfs-replace while
running some RW jobs on the FS.
My jobs were taking too long. It seems like raid6 btrfs-replace without
the source disk is not very fast. So I unmounted the FS, inserted the
soft-failing disk again, remounted normally (non-degraded) and restarted
the (now much faster) btrfs-replace.
I checked on the status sometime later and there were hundreds if not
thousands of "transid verify failure" messages in my dmesg.
Additionally the btrfs-replace operation had outright failed.
After removing the soft-failing disk, and mounting degraded again, it now
seemed that some files in the FS were corrupted and in some instances
accessing certain files would actually cause the kernel to loop
indefinitely while eating up memory.
Nothing was corrupted before I mounted the soft-failing disk in
non-degraded mode. This leads me to believe that btrfs doesn't
intelligently handle remounting normally previously degraded arrays. Can
anyone confirm this?
I would highly recommend some sort of fast failure or DWIM behavior for
this, e.g.:
$ mount -o degraded /dev/sda1 /mnt
$ touch /mnt/newfile.txt
$ umount /mnt
$ # plug in other device, e.g. /dev/sdb
$ mount /dev/sda1 /mnt
STDERR: Ignoring /dev/sdb1, FS has changed since mounting degraded
without it.
Alternatively:
STDERR: Cannot mount file system with device /dev/sdb1, FS has changed
since mounting degraded without it.
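
Until something like that exists, a stale member can at least be spotted
by hand before the non-degraded mount. A rough sketch, assuming
btrfs-progs' btrfs-show-super (newer progs expose the same data via
"btrfs inspect-internal dump-super") and placeholder device names:

# compare superblock generations across all members before mounting
for dev in /dev/sda1 /dev/sdb1; do
    printf '%s ' "$dev"
    btrfs-show-super "$dev" | grep '^generation'
done
# a member whose generation lags the others missed writes while the FS
# was mounted degraded; don't mount it rw together with the newer ones
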
Rian
^ permalink raw reply [flat|nested] 8+ messages in thread

* Re: FS corruption when mounting non-degraded after mounting degraded
2016-01-19 11:28 FS corruption when mounting non-degraded after mounting degraded Rian Hunter
@ 2016-01-19 13:54 ` Duncan
2016-01-21 16:26 ` Rich Freeman
1 sibling, 0 replies; 8+ messages in thread
From: Duncan @ 2016-01-19 13:54 UTC (permalink / raw)
To: linux-btrfs

Rian Hunter posted on Tue, 19 Jan 2016 03:28:53 -0800 as excerpted:

> Nothing was corrupted before I mounted the soft-failing disk in
> non-degraded mode. This leads me to believe that btrfs doesn't
> intelligently handle remounting normally previously degraded arrays.
> Can anyone confirm this?

Two things:

1) Critical: I see absolutely no mention of the kernel version you
were/are using.

Btrfs raid56 mode is reasonably new and was only nominally complete with
kernel 3.19. Both 3.19 and 4.0 still had bugs in raid56 mode that became
known quickly, however, and should be out of use now as neither one was
an LTS kernel, and obviously anything 3.18 and earlier shouldn't be used
for raid56 mode either, as it wasn't even complete at that point.

Beyond that, the recommendation has always been to wait for a while if
you need anything like stable, with my own recommendation being at least
a year, five release cycles, after full-feature introduction, and then
looking at reports on the list, before considering it stable. So LTS 4.1
still isn't stable for raid56 mode, 4.2 is already out of the
last-couple-current-kernel-releases window, and 4.3, while to my
knowledge not having any critical known raid56 bugs, is still within
that year-long stabilization window, which means with 4.4 out you should
be on it, unless there's a specific bug in it preventing that at this
point.

The just-released 4.4 is the first kernel that could meet the
year-minimum-to-stabilize criteria, and it's LTS, which means it has the
/potential/ of being the first kernel on which I'd consider btrfs raid56
mode as stable as the rest of btrfs. But 4.4 is still new enough that we
don't /know/ that yet, and while it's an LTS and hopefully will
eventually be reasonably stable for raid56 mode, there's no reason
beyond the simple first-release-after-1-year timing to suggest that it
actually /is/ at that level yet, as it simply hasn't been out long
enough to know.

So, particularly with btrfs raid56 mode, running the latest kernel is
absolutely critical, and if you're not on kernel 4.4 yet, you definitely
should be if you're running raid56 mode. And even 4.4 shouldn't really
be considered raid56-mode stable yet. It might be, but we simply don't
have the data for it one way or the other yet. And given that you didn't
mention the kernel version at all, where kernel version is so critical,
a reasonable assumption would be that you didn't do due-diligence
research before deploying on raid56 and know nothing about it, so who
knows /what/ old kernel you might be using?

Meanwhile, btrfs itself is, as frequently stated on this list,
"stabilizING, but not entirely stable and mature yet." IOW, even for
single-device btrfs or the more mature raid0/1/10 modes, btrfs isn't
"production ready stable", even if it's stable /enough/ that many both
on this list and in the wider Linux distro world are using it as a
"daily driver", hopefully with backups available and ready, since it
/is/ still stabilizing and maturing.
Of course the sysadmin's rule of backups, even for mature filesystems,
says in its simplest form that if you don't have at least one backup,
then by your actions you're defining the data as worth less than the
time and resources necessary to do the backup, despite any
verbal/written claims to the contrary. With btrfs not yet fully stable
and btrfs raid56 even less so, that means on btrfs raid56 you REALLY
have backups, or you REALLY are declaring the data of trivial value, at
best.

So by all means, use btrfs if you have appropriate backups and are
willing to use them, but don't expect btrfs raid56 mode to be stable
yet, particularly on kernels before the just-released 4.4, because it
simply isn't.

2) Specifically regarding your posted point, I don't personally use
raid56 yet, and don't understand its limitations as well as I do the
raid1 mode I do use, but it wouldn't surprise me if adding an old device
after the filesystem has moved on could indeed cause problems,
particularly in a rebuild-from-degraded environment.

Btrfs raid1 mode has a similar but more limited don't-do-this case -- it
should be OK in the situation you mentioned, but it's strongly
recommended not to separate the two copies, mount each one writable
separately, and then try to use them together, as that's an invitation
to corruption. Rather, if raid1 components must be separated, care
should be taken to mount just one degraded,rw if it's at all planned to
mount them together, undegraded, again (and if that's done, a scrub is
needed, at minimum, to catch the older device's data back up to the
newer one(s)). In the event that both /do/ get separately mounted rw,
the only way to properly use them combined again is to wipe one or the
other and add it again as a new device.

And with parity rebuilds there's a generally known, non-btrfs-specific
rebuild hole where parity rebuilds can corrupt data under the wrong
circumstances. I've never taken the time to fully dig into the technical
details, so I don't claim to fully understand the hole and thus don't
know for sure whether you triggered it, but I know it's there. And
because btrfs does per-chunk raid rather than whole-device raid, it's
possible you triggered it in the btrfs case even if you wouldn't have in
the general case. But I'll defer to the more knowledgeable for the
specifics there.

Meanwhile... these facts are no secret. The non-btrfs-specific parity
rebuild hole is well known, as is the fact that btrfs raid56 mode isn't
yet mature. Anyone doing due-diligence pre-deployment research should
have come across the latter repeatedly on this list as well as on the
btrfs wiki, and the former repeatedly in general discussions of
parity-raid.

So I'd say, restore from the backup you certainly had if the data was of
any value at all, and call it good. And if you didn't have that backup,
well, your actions declared the data of trivial value at best, so be
happy, because you saved what your actions defined as of higher value:
the time and resources you would have put into the backup had you done
it. So you can still call it good! =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply [flat|nested] 8+ messages in thread
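
(For the raid1 case Duncan describes above, the recovery roughly looks
like the sketch below; device names and mount point are placeholders,
and this is just a restatement of the advice, not an official procedure.)

# Case A: only one copy was ever mounted degraded,rw -- mount both
# together again, then scrub so the stale copy catches up
mount /dev/sda1 /mnt
btrfs scrub start -B /mnt

# Case B: both copies were mounted rw separately -- keep one, wipe the
# other, and re-add it as a brand new device
mount -o degraded /dev/sda1 /mnt
wipefs -a /dev/sdb1
btrfs device add /dev/sdb1 /mnt
btrfs device delete missing /mnt   # drop the stale identity, re-replicate
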
* Re: FS corruption when mounting non-degraded after mounting degraded
2016-01-19 11:28 FS corruption when mounting non-degraded after mounting degraded Rian Hunter
2016-01-19 13:54 ` Duncan
@ 2016-01-21 16:26 ` Rich Freeman
2016-01-21 17:15 ` Chris Murphy
1 sibling, 1 reply; 8+ messages in thread
From: Rich Freeman @ 2016-01-21 16:26 UTC (permalink / raw)
To: Rian Hunter; +Cc: Btrfs BTRFS

On Tue, Jan 19, 2016 at 6:28 AM, Rian Hunter <rian@thelig.ht> wrote:
> In my raid6 setup, a disk was soft-failing on me. I pulled the disk,
> inserted a new one, mounted degraded, then did btrfs-replace while
> running some RW jobs on the FS.
>
> My jobs were taking too long. It seems like raid6 btrfs-replace without
> the source disk is not very fast. So I unmounted the FS, inserted the
> soft-failing disk again, remounted normally (non-degraded) and restarted
> the (now much faster) btrfs-replace.
>
> I checked on the status sometime later and there were hundreds if not
> thousands of "transid verify failure" messages in my dmesg. Additionally
> the btrfs-replace operation had outright failed.

I think the bottom line here is the same thing that everybody runs into
when they have LVM snapshots of btrfs devices.

Btrfs doesn't record any kind of timestamp or generation number or
anything like that when it touches data on a drive, or if it does, it
isn't granular enough or it isn't being used when mounting drives to
ensure consistency. So btrfs sees a bunch of partitions that at one time
were consistent and assumes that they must still be consistent. If they
aren't consistent, almost anything can go wrong.

The bottom line is that any time you mount btrfs in anything other than
read-only mode, any partitions that may have been associated with that
filesystem which aren't mounted should be treated like the plague. Don't
let btrfs ever mount them again, because it will be all too willing to
try, and you'll likely lose data if it happens.

There was a long thread on this not long ago and I don't know what the
final outcome of it was. It would be ideal if btrfs could detect that a
partition is out-of-date and not attempt to mount it, except perhaps as
read-only in some kind of recovery mode.

--
Rich

^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: FS corruption when mounting non-degraded after mounting degraded
2016-01-21 16:26 ` Rich Freeman
@ 2016-01-21 17:15 ` Chris Murphy
2016-01-21 22:25 ` Rian Hunter
0 siblings, 1 reply; 8+ messages in thread
From: Chris Murphy @ 2016-01-21 17:15 UTC (permalink / raw)
To: Rich Freeman; +Cc: Rian Hunter, Btrfs BTRFS

On Thu, Jan 21, 2016 at 9:26 AM, Rich Freeman <r-btrfs@thefreemanclan.net> wrote:
> On Tue, Jan 19, 2016 at 6:28 AM, Rian Hunter <rian@thelig.ht> wrote:
>> In my raid6 setup, a disk was soft-failing on me. I pulled the disk,
>> inserted a new one, mounted degraded, then did btrfs-replace while
>> running some RW jobs on the FS.
>>
>> My jobs were taking too long. It seems like raid6 btrfs-replace without
>> the source disk is not very fast. So I unmounted the FS, inserted the
>> soft-failing disk again, remounted normally (non-degraded) and restarted
>> the (now much faster) btrfs-replace.
>>
>> I checked on the status sometime later and there were hundreds if not
>> thousands of "transid verify failure" messages in my dmesg. Additionally
>> the btrfs-replace operation had outright failed.
>
> I think the bottom line here is the same thing that everybody runs
> into when they have LVM snapshots of btrfs devices.

Seems unrelated. The LVM and dd concern is an *identical* member device
appearing (two or more instances of the same volume UUID and dev UUID).
In this case it's a member device with a (truly) unique dev UUID. So
there should be no such confusion.

> Btrfs doesn't record any kind of timestamp or generation number or
> anything like that when it touches data on a drive, or if it does, it
> isn't granular enough or it isn't being used when mounting drives to
> ensure consistency.

Except it does. And the superblock on every device includes the dev UUID
of all other devices. So even with a device missing, or a device missing
and reappearing while that same device is being replaced, the file
system has information available.

While a member device is being replaced, it isn't invalid. It's still
available for reading from for its own replacement. Hence why 'btrfs
replace start' has an -r option. So it makes sense that the rebuild goes
faster once it's reconnected, because now that drive is available for
normal reads, the volume isn't degraded anymore, so data doesn't have to
be reconstructed from parity.

But none of this explains why corruption happened. So I'd say it's a
bug. The question is, is it discretely reproducible? Once there are
concise reproduction steps, it's much more likely a dev can reproduce
and debug what the problem is.

Maybe there's a problem with the reintroduction of a previously missing
device in the middle of its being replaced. The devid is different from
dev.uuid, but they are mapped somehow, and mainly it's the devid being
used by the filesystem chunk tree. So maybe confusion happens where
there are suddenly two of devid 2, even though they have different dev
UUIDs.

--
Chris Murphy

^ permalink raw reply [flat|nested] 8+ messages in thread
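
(For anyone who wants to look at this mapping themselves, a sketch of
how the per-device superblock fields mentioned above can be inspected;
/dev/dm-7 and /mnt are placeholders, and btrfs-show-super is the
btrfs-progs tool of that era -- newer progs expose the same data as
"btrfs inspect-internal dump-super".)

# devid -> block device mapping as the kernel currently sees it
btrfs filesystem show /mnt
# per-device superblock: generation plus dev_item.devid / dev_item.uuid
btrfs-show-super /dev/dm-7 | grep -E '^generation|dev_item\.devid|dev_item\.uuid'
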
* Re: FS corruption when mounting non-degraded after mounting degraded
2016-01-21 17:15 ` Chris Murphy
@ 2016-01-21 22:25 ` Rian Hunter
2016-01-21 23:51 ` Chris Murphy
0 siblings, 1 reply; 8+ messages in thread
From: Rian Hunter @ 2016-01-21 22:25 UTC (permalink / raw)
To: Chris Murphy; +Cc: Rich Freeman, Btrfs BTRFS

On Thu, 21 Jan 2016, Chris Murphy wrote:
> But none of this explains why corruption happened. So I'd say it's a
> bug. The question is, is it discretely reproducible? Once there are
> concise reproduction steps, it's much more likely a dev can reproduce
> and debug what the problem is.
>
> Maybe there's a problem with the reintroduction of a previously
> missing device in the middle of its being replaced. The devid is
> different from dev.uuid, but they are mapped somehow, and mainly it's
> the devid being used by the filesystem chunk tree. So maybe confusion
> happens where there are suddenly two of devid 2, even though they have
> different dev UUIDs.

Let me put the sequence of events into more discrete steps so that any
interested dev may reproduce the bug. For the record, I maintain daily
backups and specifically made a backup before the events described, so
I was not adversely affected by this bug except through lost time.

Kernel uname -a:
Linux nori 4.3.0-1-amd64 #1 SMP Debian 4.3.3-5 (2016-01-04) x86_64 GNU/Linux

Start state: Normally functioning raid6 array. Device FOO intermittently
fails and requires a power cycle to work again. This has happened 25-50
times in the past with no irresolvable data corruption. The decision has
finally been made to replace it.

* Unmount raid6 FS
* Disconnect array.
* Physically remove device FOO from array, add new device BAR to array.
* Connect array
* Mount raid6 array with "-o degraded"
* Run "btrfs replace start 2 /dev/BAR /mnt"
* Start VMs on FS
* Machine freezes (not sure why)
* Restart machine
* Mount raid6 array with "-o degraded"
* Replace job continues automatically
* Start VMs on FS
* After an hour: VMs have not started up yet (seeing hung-task warnings
  in kernel). "btrfs replace status /mnt" shows 0.1% done
* Cancel replace: "btrfs replace cancel /mnt"
* Unmount raid6 FS
* Disconnect array
* Physically add device FOO back to array
* Reconnect array
* Mount raid6 array normally (no "-o degraded")
* Run "btrfs replace start 2 /dev/BAR /mnt"
* Start VMs on FS
* VM startup is still too slow; kill all VMs except for a critical one,
  VMFOO. Let VMFOO attempt to start.
* After an hour: lots of "parent transid verify failed" in dmesg, I get
  nervous. I attribute it to re-adding the disk. Also seeing "hung-task"
  warnings corresponding to the starting of VMFOO.
* Forcibly stop VMFOO.
* Cancel replace: "btrfs replace cancel /mnt"
* Unmount raid6 FS
* Disconnect array
* Physically remove device FOO from array
* Reconnect array
* Mount raid6 array with "-o degraded"
* Run "btrfs replace start 2 /dev/BAR /mnt"
* After an hour: Replace operation was automatically cancelled, lots of
  "parent transid verify failed" in dmesg again.
* Run "btrfs scrub"; "btrfs scrub status" shows millions of
  unrecoverable errors
* Cancel "btrfs scrub"
* At this point I'm convinced this FS is in a very broken state and I
  try to salvage whatever data could have changed since beginning the
  process.
* Tried rsync'ing VMFOO off disk: lots of "parent transid verify failed"
  in dmesg, rsync reporting "stale file handle" and it seems stalled,
  iostat reporting no disk IO, all CPUs burning in system time (not user
  mode), system memory usage steadily rising.
* Unmount raid6 array
* Mount raid6 array with "-o degraded", fails: the FS would not appear
  mounted for some reason (no message in dmesg, but there was nothing
  under the mount point, and no failure from the "mount" command)
* Restart machine
* Mount raid6 array with "-o degraded", success
* Tried rsync'ing VMFOO off disk again: lots of "parent transid verify
  failed" in dmesg, rsync reporting "stale file handle", seems stalled,
  iostat reporting no disk IO, all CPUs burning, system memory usage
  steadily rising.
* Tried rsync'ing other VMs: saw additional "parent transid verify"
  messages in the kernel, but the rsync completed successfully (though
  there was some corruption).
* Tried rsync'ing some other non-VM data: everything proceeded normally
  and invoked no "parent transid verify failed" errors.

After this I completely reformatted the FS and restored it from backup,
so unfortunately I don't have any way to investigate the remnants of
what happened. When I was trying to figure out what happened, some
observations stood out:

* VMFOO's data was in a very bad state
* Yet the other VMs were in a bad, but less bad, state
* The non-VM data was uncorrupted
* The errors first occurred during replacing after mounting the FS
  normally after previously mounting it with "-o degraded"

From a black box perspective, this led me to believe that the corruption
happened during the replace operation after mounting normally after
first mounting with "-o degraded." Of course, knowledge of the internals
could easily verify this.

HTH! I'm not on this mailing list so please reply directly to me.

^ permalink raw reply [flat|nested] 8+ messages in thread
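
(The same sequence condensed into shell form, in case it helps anyone
trying to reproduce; SURVIVOR, FOO, BAR and /mnt are placeholders, and
the VM workload is what actually generates the writes.)

mount -o degraded /dev/SURVIVOR /mnt      # FOO absent, new blank BAR attached
btrfs replace start 2 /dev/BAR /mnt       # rebuild devid 2 onto BAR
btrfs replace status /mnt                 # stalls around 0.1% under VM load
btrfs replace cancel /mnt
umount /mnt
# physically reattach the flaky device FOO, then:
mount /dev/SURVIVOR /mnt                  # normal, non-degraded mount
btrfs replace start 2 /dev/BAR /mnt       # much faster, but "parent transid
                                          #   verify failed" soon follows
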
* Re: FS corruption when mounting non-degraded after mounting degraded
2016-01-21 22:25 ` Rian Hunter
@ 2016-01-21 23:51 ` Chris Murphy
2016-01-22 1:21 ` Rian Hunter
0 siblings, 1 reply; 8+ messages in thread
From: Chris Murphy @ 2016-01-21 23:51 UTC (permalink / raw)
To: Rian Hunter; +Cc: Btrfs BTRFS

On Thu, Jan 21, 2016 at 3:25 PM, Rian Hunter <rian@thelig.ht> wrote:
>
> Start state: Normally functioning raid6 array. Device FOO intermittently
> fails and requires a power cycle to work again. This has happened 25-50
> times in the past with no irresolvable data corruption.

For each drive, what are the results:
# smartctl -l scterc /dev/sdX
# cat /sys/block/sdX/device/timeout

> * Unmount raid6 FS
> * Disconnect array.
> * Physically remove device FOO from array, add new device BAR to array.
> * Connect array
> * Mount raid6 array with "-o degraded"
> * Run "btrfs replace start 2 /dev/BAR /mnt"
> * Start VMs on FS
> * Machine freezes (not sure why)
> * Restart machine
> * Mount raid6 array with "-o degraded"
> * Replace job continues automatically
> * Start VMs on FS
> * After an hour: VMs have not started up yet (seeing hung-task warnings
>   in kernel). "btrfs replace status /mnt" shows 0.1% done

Do you have dmesg output that includes sysrq+w at the time of the hung
task warnings? That's pretty much always requested by devs in these
cases.

> * Cancel replace: "btrfs replace cancel /mnt"
> * Unmount raid6 FS
> * Disconnect array
> * Physically add device FOO back to array
> * Reconnect array
> * Mount raid6 array normally (no "-o degraded")
> * Run "btrfs replace start 2 /dev/BAR /mnt"

Hmm. Open question whether 'btrfs replace cancel' actually marked
/dev/BAR as wiped. If it doesn't, then this 2nd replace start should
have failed unless you had used -f, or you used wipefs -a. If it's not
wiped by any of the above, then I'd expect it's possible things get
messy.

> * Mount raid6 array with "-o degraded"
> * Run "btrfs replace start 2 /dev/BAR /mnt"
> * After an hour: Replace operation was automatically cancelled, lots of
>   "parent transid verify failed" in dmesg again.
> * Run "btrfs scrub"; "btrfs scrub status" shows millions of
>   unrecoverable errors

Some others (no devs, however) have disagreed with me on this, so take
it with a grain of salt, but I don't understand the rationale of running
scrub on degraded arrays. The first order of business is to get it
non-degraded. If that can't be done, scrub is pointless. Of course it's
going to produce millions of unrecoverable errors. There's a device
missing, so I'd expect many unrecoverable errors during a degraded
scrub.

> * Cancel "btrfs scrub"
> * At this point I'm convinced this FS is in a very broken state and I
>   try to salvage whatever data could have changed since beginning the
>   process.

Agreed. Certainly not reassuring.

> From a black box perspective, this led me to believe that the
> corruption happened during the replace operation after mounting
> normally after first mounting with "-o degraded." Of course,
> knowledge of the internals could easily verify this.

Filesystems are really difficult, so even knowledge of the internals
doesn't guarantee the devs will understand where the problem first
started in this case.

--
Chris Murphy

^ permalink raw reply [flat|nested] 8+ messages in thread
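
(On the wiped-or-not question, a quick way to check is sketched below,
assuming standard util-linux wipefs and the placeholder name /dev/BAR.)

wipefs /dev/BAR        # list any filesystem signatures left on the target
wipefs -a /dev/BAR     # clear them explicitly before reusing the device
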
* Re: FS corruption when mounting non-degraded after mounting degraded
2016-01-21 23:51 ` Chris Murphy
@ 2016-01-22 1:21 ` Rian Hunter
2016-01-22 3:38 ` Chris Murphy
0 siblings, 1 reply; 8+ messages in thread
From: Rian Hunter @ 2016-01-22 1:21 UTC (permalink / raw)
To: Chris Murphy; +Cc: Rich Freeman, Btrfs BTRFS

On Thu, 21 Jan 2016, Chris Murphy wrote:
> On Thu, Jan 21, 2016 at 3:25 PM, Rian Hunter <rian@thelig.ht> wrote:
>> Start state: Normally functioning raid6 array. Device FOO intermittently
>> fails and requires a power cycle to work again. This has happened 25-50
>> times in the past with no irresolvable data corruption.
>
> For each drive, what are the results:
> # smartctl -l scterc /dev/sdX
> # cat /sys/block/sdX/device/timeout

I no longer have the failing drive in my array, and it would be a bit
hard to replace it now that all drive bays are full. It was the same
make/model as my other drives. Notably, the SMART attribute
"Command_Timeout" was in the millions when I removed the device, and
that is supposed to be highly correlated with drive failure.

"SCT Error Recovery Control command not supported" is printed for all my
devices for the "smartctl" command.

# cat /sys/block/sd{a,b,c,d,e,f,g,h}/device/timeout
30
30
30
30
30
30
30
30

>> * Unmount raid6 FS
>> * Disconnect array.
>> * Physically remove device FOO from array, add new device BAR to array.
>> * Connect array
>> * Mount raid6 array with "-o degraded"
>> * Run "btrfs replace start 2 /dev/BAR /mnt"
>> * Start VMs on FS
>> * Machine freezes (not sure why)
>> * Restart machine
>> * Mount raid6 array with "-o degraded"
>> * Replace job continues automatically
>> * Start VMs on FS
>> * After an hour: VMs have not started up yet (seeing hung-task warnings
>>   in kernel). "btrfs replace status /mnt" shows 0.1% done
>
> Do you have dmesg output that includes sysrq+w at the time of the hung
> task warnings? That's pretty much always requested by devs in these
> cases.

This is dmesg before the freeze:

Jan 18 15:15:40 nori kernel: [1019147.178187] INFO: task btrfs-transacti:24186 blocked for more than 120 seconds.
Jan 18 15:15:40 nori kernel: [1019147.178196] Tainted: G W 4.3.0-1-amd64 #1
Jan 18 15:15:40 nori kernel: [1019147.178199] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 18 15:15:40 nori kernel: [1019147.178202] btrfs-transacti D ffffffff810a0850 0 24186 2 0x00000000
Jan 18 15:15:40 nori kernel: [1019147.178211] ffff88034726e0c0 0000000000000046 ffff88046c6ab040 ffff88046ab84000
Jan 18 15:15:40 nori kernel: [1019147.178217] ffff8803b844ed08 ffff8803b844ecd0 ffff88046ab83de8 ffff88042488fb70
Jan 18 15:15:40 nori kernel: [1019147.178222] 0000000000000000 ffffffff8157d6af ffff8803b844e000 ffffffffa038362d
Jan 18 15:15:40 nori kernel: [1019147.178228] Call Trace:
Jan 18 15:15:40 nori kernel: [1019147.178240] [<ffffffff8157d6af>] ? schedule+0x2f/0x70
Jan 18 15:15:40 nori kernel: [1019147.178284] [<ffffffffa038362d>] ? btrfs_scrub_pause+0x8d/0x100 [btrfs]
Jan 18 15:15:40 nori kernel: [1019147.178291] [<ffffffff810b1520>] ? wake_atomic_t_function+0x60/0x60
Jan 18 15:15:40 nori kernel: [1019147.178322] [<ffffffffa03270fa>] ? btrfs_commit_transaction+0x47a/0xad0 [btrfs]
Jan 18 15:15:40 nori kernel: [1019147.178350] [<ffffffffa03277e7>] ? start_transaction+0x97/0x590 [btrfs]
Jan 18 15:15:40 nori kernel: [1019147.178376] [<ffffffffa03223b0>] ? transaction_kthread+0x220/0x240 [btrfs]
Jan 18 15:15:40 nori kernel: [1019147.178400] [<ffffffffa0322190>] ?
btrfs_cleanup_transaction+0x540/0x540 [btrfs] Jan 18 15:15:40 nori kernel: [1019147.178407] [<ffffffff8109044d>] ? kthread+0xcd/0xf0 Jan 18 15:15:40 nori kernel: [1019147.178414] [<ffffffff81090380>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:15:40 nori kernel: [1019147.178420] [<ffffffff815816cf>] ? ret_from_fork+0x3f/0x70 Jan 18 15:15:40 nori kernel: [1019147.178426] [<ffffffff81090380>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:17:40 nori kernel: [1019266.951118] INFO: task btrfs-transacti:24186 blocked for more than 120 seconds. Jan 18 15:17:40 nori kernel: [1019266.951126] Tainted: G W 4.3.0-1-amd64 #1 Jan 18 15:17:40 nori kernel: [1019266.951129] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 15:17:40 nori kernel: [1019266.951133] btrfs-transacti D ffffffff810a0850 0 24186 2 0x00000000 Jan 18 15:17:40 nori kernel: [1019266.951141] ffff88034726e0c0 0000000000000046 ffff88046c6ab040 ffff88046ab84000 Jan 18 15:17:40 nori kernel: [1019266.951147] ffff8803b844ed08 ffff8803b844ecd0 ffff88046ab83de8 ffff88042488fb70 Jan 18 15:17:40 nori kernel: [1019266.951152] 0000000000000000 ffffffff8157d6af ffff8803b844e000 ffffffffa038362d Jan 18 15:17:40 nori kernel: [1019266.951158] Call Trace: Jan 18 15:17:40 nori kernel: [1019266.951170] [<ffffffff8157d6af>] ? schedule+0x2f/0x70 Jan 18 15:17:40 nori kernel: [1019266.951213] [<ffffffffa038362d>] ? btrfs_scrub_pause+0x8d/0x100 [btrfs] Jan 18 15:17:40 nori kernel: [1019266.951219] [<ffffffff810b1520>] ? wake_atomic_t_function+0x60/0x60 Jan 18 15:17:40 nori kernel: [1019266.951251] [<ffffffffa03270fa>] ? btrfs_commit_transaction+0x47a/0xad0 [btrfs] Jan 18 15:17:40 nori kernel: [1019266.951278] [<ffffffffa03277e7>] ? start_transaction+0x97/0x590 [btrfs] Jan 18 15:17:40 nori kernel: [1019266.951304] [<ffffffffa03223b0>] ? transaction_kthread+0x220/0x240 [btrfs] Jan 18 15:17:40 nori kernel: [1019266.951328] [<ffffffffa0322190>] ? btrfs_cleanup_transaction+0x540/0x540 [btrfs] Jan 18 15:17:40 nori kernel: [1019266.951335] [<ffffffff8109044d>] ? kthread+0xcd/0xf0 Jan 18 15:17:40 nori kernel: [1019266.951341] [<ffffffff81090380>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:17:40 nori kernel: [1019266.951348] [<ffffffff815816cf>] ? ret_from_fork+0x3f/0x70 Jan 18 15:17:40 nori kernel: [1019266.951354] [<ffffffff81090380>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:17:40 nori kernel: [1019266.951368] INFO: task lxc-start:25363 blocked for more than 120 seconds. Jan 18 15:17:40 nori kernel: [1019266.951371] Tainted: G W 4.3.0-1-amd64 #1 Jan 18 15:17:40 nori kernel: [1019266.951374] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 15:17:40 nori kernel: [1019266.951376] lxc-start D ffffffff810a0850 0 25363 1 0x00000000 Jan 18 15:17:40 nori kernel: [1019266.951382] ffff8801d8319140 0000000000000082 ffff88046c6f8080 ffff88002f49c000 Jan 18 15:17:40 nori kernel: [1019266.951387] ffff8803dab7e1f0 ffff880237740b98 ffff8803dab7e1f0 0000000000000001 Jan 18 15:17:40 nori kernel: [1019266.951392] ffff8801d8319140 ffffffff8157d6af ffff88042488fb70 ffffffffa032611d Jan 18 15:17:40 nori kernel: [1019266.951397] Call Trace: Jan 18 15:17:40 nori kernel: [1019266.951402] [<ffffffff8157d6af>] ? schedule+0x2f/0x70 Jan 18 15:17:40 nori kernel: [1019266.951424] [<ffffffffa032611d>] ? wait_current_trans.isra.21+0xcd/0x110 [btrfs] Jan 18 15:17:40 nori kernel: [1019266.951429] [<ffffffff810b1520>] ? wake_atomic_t_function+0x60/0x60 Jan 18 15:17:40 nori kernel: [1019266.951453] [<ffffffffa0327b66>] ? 
start_transaction+0x416/0x590 [btrfs] Jan 18 15:17:40 nori kernel: [1019266.951481] [<ffffffffa0336309>] ? btrfs_create+0x49/0x1f0 [btrfs] Jan 18 15:17:40 nori kernel: [1019266.951488] [<ffffffff811d856a>] ? vfs_create+0xba/0x120 Jan 18 15:17:40 nori kernel: [1019266.951494] [<ffffffff811db9d0>] ? path_openat+0x1120/0x1260 Jan 18 15:17:40 nori kernel: [1019266.951498] [<ffffffff811d6d00>] ? terminate_walk+0x40/0xc0 Jan 18 15:17:40 nori kernel: [1019266.951503] [<ffffffff811dc0e1>] ? filename_lookup+0xf1/0x180 Jan 18 15:17:40 nori kernel: [1019266.951509] [<ffffffff811dcca1>] ? do_filp_open+0x91/0x100 Jan 18 15:17:40 nori kernel: [1019266.951516] [<ffffffff811cc4e9>] ? do_sys_open+0x139/0x220 Jan 18 15:17:40 nori kernel: [1019266.951522] [<ffffffff81581332>] ? system_call_fast_compare_end+0xc/0x67 Relavant dmesg after the restart: Jan 18 15:24:06 nori kernel: [ 205.325598] BTRFS: device fsid f3b9a0ff-85f4-40d0-a130-04db96d92836 devid 5 transid 67149 /dev/dm-3 Jan 18 15:24:19 nori kernel: [ 218.385633] BTRFS: device fsid f3b9a0ff-85f4-40d0-a130-04db96d92836 devid 4 transid 67149 /dev/dm-4 Jan 18 15:24:24 nori kernel: [ 223.122442] BTRFS: device fsid f3b9a0ff-85f4-40d0-a130-04db96d92836 devid 1 transid 67149 /dev/dm-5 Jan 18 15:24:30 nori kernel: [ 228.995711] BTRFS: device fsid f3b9a0ff-85f4-40d0-a130-04db96d92836 devid 3 transid 67149 /dev/dm-6 Jan 18 15:24:43 nori kernel: [ 242.305643] BTRFS: device fsid f3b9a0ff-85f4-40d0-a130-04db96d92836 devid 0 transid 67149 /dev/dm-7 Jan 18 15:24:44 nori kernel: [ 243.036998] BTRFS info (device dm-7): allowing degraded mounts Jan 18 15:24:44 nori kernel: [ 243.037009] BTRFS info (device dm-7): disk space caching is enabled Jan 18 15:24:44 nori kernel: [ 243.037013] BTRFS: has skinny extents Jan 18 15:24:44 nori kernel: [ 243.038806] BTRFS warning (device dm-7): devid 2 uuid 4ac0186c-ad43-448c-855b-df265ae31975 is missing Jan 18 15:24:48 nori kernel: [ 247.008168] BTRFS: continuing dev_replace from <missing disk> (devid 2) to /dev/dm-7 @0% Jan 18 15:28:41 nori kernel: [ 479.247257] device vethAG2VFQ entered promiscuous mode Jan 18 15:28:41 nori kernel: [ 479.247302] IPv6: ADDRCONF(NETDEV_UP): vethAG2VFQ: link is not ready Jan 18 15:28:42 nori kernel: [ 480.263321] eth0: renamed from vethGCO23Y Jan 18 15:28:42 nori kernel: [ 480.289101] IPv6: ADDRCONF(NETDEV_CHANGE): vethAG2VFQ: link becomes ready Jan 18 15:28:42 nori kernel: [ 480.289150] lxc-bridge-nat: port 1(vethAG2VFQ) entered forwarding state Jan 18 15:28:42 nori kernel: [ 480.289172] lxc-bridge-nat: port 1(vethAG2VFQ) entered forwarding state Jan 18 15:48:41 nori kernel: [ 1676.618159] INFO: task btrfs-transacti:2351 blocked for more than 120 seconds. Jan 18 15:48:41 nori kernel: [ 1676.618167] Not tainted 4.3.0-1-amd64 #1 Jan 18 15:48:41 nori kernel: [ 1676.618170] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 15:48:41 nori kernel: [ 1676.618174] btrfs-transacti D ffffffff810a08c0 0 2351 2 0x00000000 Jan 18 15:48:41 nori kernel: [ 1676.618182] ffff88044c282100 0000000000000046 ffff88046c6a9180 ffff88044c36c000 Jan 18 15:48:41 nori kernel: [ 1676.618188] ffff88044c36ba50 7fffffffffffffff ffffffff815818d0 ffff88044c36bac8 Jan 18 15:48:41 nori kernel: [ 1676.618193] ffffea000fdaa880 ffffffff8158113f 0000000000000000 ffffffff81583ee9 Jan 18 15:48:41 nori kernel: [ 1676.618199] Call Trace: Jan 18 15:48:41 nori kernel: [ 1676.618211] [<ffffffff815818d0>] ? bit_wait+0x40/0x40 Jan 18 15:48:41 nori kernel: [ 1676.618216] [<ffffffff8158113f>] ? 
schedule+0x2f/0x70 Jan 18 15:48:41 nori kernel: [ 1676.618223] [<ffffffff81583ee9>] ? schedule_timeout+0x189/0x250 Jan 18 15:48:41 nori kernel: [ 1676.618262] [<ffffffffa038d930>] ? __raid56_parity_write+0x30/0x40 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618295] [<ffffffffa038eb6f>] ? raid56_parity_write+0x15f/0x170 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618324] [<ffffffffa038b87e>] ? btrfs_bio_counter_sub+0x1e/0x50 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618355] [<ffffffffa0347e0b>] ? btrfs_map_bio+0xfb/0x330 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618360] [<ffffffff815818d0>] ? bit_wait+0x40/0x40 Jan 18 15:48:41 nori kernel: [ 1676.618365] [<ffffffff815807cd>] ? io_schedule_timeout+0x9d/0x100 Jan 18 15:48:41 nori kernel: [ 1676.618370] [<ffffffff81581901>] ? bit_wait_io+0x31/0x50 Jan 18 15:48:41 nori kernel: [ 1676.618375] [<ffffffff815814d8>] ? __wait_on_bit+0x58/0x90 Jan 18 15:48:41 nori kernel: [ 1676.618402] [<ffffffffa031e042>] ? btrfs_submit_bio_hook+0xe2/0x190 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618408] [<ffffffff8115e646>] ? wait_on_page_bit+0xc6/0xe0 Jan 18 15:48:41 nori kernel: [ 1676.618414] [<ffffffff810b15d0>] ? autoremove_wake_function+0x40/0x40 Jan 18 15:48:41 nori kernel: [ 1676.618445] [<ffffffffa033ce52>] ? extent_write_cache_pages.isra.30.constprop.50+0x252/0x360 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618475] [<ffffffffa033e0ad>] ? extent_writepages+0x5d/0x90 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618503] [<ffffffffa0320660>] ? btrfs_real_readdir+0x530/0x530 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618509] [<ffffffff81160465>] ? __filemap_fdatawrite_range+0xc5/0x100 Jan 18 15:48:41 nori kernel: [ 1676.618537] [<ffffffffa0331dfb>] ? btrfs_fdatawrite_range+0x1b/0x50 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618567] [<ffffffffa035e397>] ? __btrfs_write_out_cache.isra.25+0x407/0x450 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618596] [<ffffffffa035f891>] ? btrfs_write_out_cache+0x91/0xe0 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618619] [<ffffffffa030adea>] ? btrfs_start_dirty_block_groups+0x1ea/0x3e0 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618644] [<ffffffffa031be44>] ? btrfs_commit_transaction+0x1c4/0xad0 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618668] [<ffffffffa031c7e7>] ? start_transaction+0x97/0x590 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618690] [<ffffffffa03173b0>] ? transaction_kthread+0x220/0x240 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618711] [<ffffffffa0317190>] ? btrfs_cleanup_transaction+0x540/0x540 [btrfs] Jan 18 15:48:41 nori kernel: [ 1676.618718] [<ffffffff810904bd>] ? kthread+0xcd/0xf0 Jan 18 15:48:41 nori kernel: [ 1676.618724] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:48:41 nori kernel: [ 1676.618730] [<ffffffff8158514f>] ? ret_from_fork+0x3f/0x70 Jan 18 15:48:41 nori kernel: [ 1676.618736] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:50:41 nori kernel: [ 1796.383713] INFO: task btrfs-transacti:2351 blocked for more than 120 seconds. Jan 18 15:50:41 nori kernel: [ 1796.383721] Not tainted 4.3.0-1-amd64 #1 Jan 18 15:50:41 nori kernel: [ 1796.383724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Jan 18 15:50:41 nori kernel: [ 1796.383727] btrfs-transacti D ffffffff810a08c0 0 2351 2 0x00000000 Jan 18 15:50:41 nori kernel: [ 1796.383736] ffff88044c282100 0000000000000046 ffff88046c6a9180 ffff88044c36c000 Jan 18 15:50:41 nori kernel: [ 1796.383742] ffff88044c36ba50 7fffffffffffffff ffffffff815818d0 ffff88044c36bac8 Jan 18 15:50:41 nori kernel: [ 1796.383747] ffffea000fdaa880 ffffffff8158113f 0000000000000000 ffffffff81583ee9 Jan 18 15:50:41 nori kernel: [ 1796.383752] Call Trace: Jan 18 15:50:41 nori kernel: [ 1796.383764] [<ffffffff815818d0>] ? bit_wait+0x40/0x40 Jan 18 15:50:41 nori kernel: [ 1796.383770] [<ffffffff8158113f>] ? schedule+0x2f/0x70 Jan 18 15:50:41 nori kernel: [ 1796.383776] [<ffffffff81583ee9>] ? schedule_timeout+0x189/0x250 Jan 18 15:50:41 nori kernel: [ 1796.383817] [<ffffffffa038d930>] ? __raid56_parity_write+0x30/0x40 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.383853] [<ffffffffa038eb6f>] ? raid56_parity_write+0x15f/0x170 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.383884] [<ffffffffa038b87e>] ? btrfs_bio_counter_sub+0x1e/0x50 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.383916] [<ffffffffa0347e0b>] ? btrfs_map_bio+0xfb/0x330 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.383921] [<ffffffff815818d0>] ? bit_wait+0x40/0x40 Jan 18 15:50:41 nori kernel: [ 1796.383926] [<ffffffff815807cd>] ? io_schedule_timeout+0x9d/0x100 Jan 18 15:50:41 nori kernel: [ 1796.383931] [<ffffffff81581901>] ? bit_wait_io+0x31/0x50 Jan 18 15:50:41 nori kernel: [ 1796.383936] [<ffffffff815814d8>] ? __wait_on_bit+0x58/0x90 Jan 18 15:50:41 nori kernel: [ 1796.383963] [<ffffffffa031e042>] ? btrfs_submit_bio_hook+0xe2/0x190 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.383970] [<ffffffff8115e646>] ? wait_on_page_bit+0xc6/0xe0 Jan 18 15:50:41 nori kernel: [ 1796.383976] [<ffffffff810b15d0>] ? autoremove_wake_function+0x40/0x40 Jan 18 15:50:41 nori kernel: [ 1796.384008] [<ffffffffa033ce52>] ? extent_write_cache_pages.isra.30.constprop.50+0x252/0x360 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384040] [<ffffffffa033e0ad>] ? extent_writepages+0x5d/0x90 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384068] [<ffffffffa0320660>] ? btrfs_real_readdir+0x530/0x530 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384074] [<ffffffff81160465>] ? __filemap_fdatawrite_range+0xc5/0x100 Jan 18 15:50:41 nori kernel: [ 1796.384103] [<ffffffffa0331dfb>] ? btrfs_fdatawrite_range+0x1b/0x50 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384133] [<ffffffffa035e397>] ? __btrfs_write_out_cache.isra.25+0x407/0x450 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384162] [<ffffffffa035f891>] ? btrfs_write_out_cache+0x91/0xe0 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384186] [<ffffffffa030adea>] ? btrfs_start_dirty_block_groups+0x1ea/0x3e0 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384212] [<ffffffffa031be44>] ? btrfs_commit_transaction+0x1c4/0xad0 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384235] [<ffffffffa031c7e7>] ? start_transaction+0x97/0x590 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384258] [<ffffffffa03173b0>] ? transaction_kthread+0x220/0x240 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384280] [<ffffffffa0317190>] ? btrfs_cleanup_transaction+0x540/0x540 [btrfs] Jan 18 15:50:41 nori kernel: [ 1796.384287] [<ffffffff810904bd>] ? kthread+0xcd/0xf0 Jan 18 15:50:41 nori kernel: [ 1796.384293] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:50:41 nori kernel: [ 1796.384300] [<ffffffff8158514f>] ? ret_from_fork+0x3f/0x70 Jan 18 15:50:41 nori kernel: [ 1796.384305] [<ffffffff810903f0>] ? 
kthread_create_on_node+0x190/0x190 Jan 18 15:54:41 nori kernel: [ 2035.918749] INFO: task btrfs-transacti:2351 blocked for more than 120 seconds. Jan 18 15:54:41 nori kernel: [ 2035.918757] Not tainted 4.3.0-1-amd64 #1 Jan 18 15:54:41 nori kernel: [ 2035.918760] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 15:54:41 nori kernel: [ 2035.918763] btrfs-transacti D ffffffff810a08c0 0 2351 2 0x00000000 Jan 18 15:54:41 nori kernel: [ 2035.918772] ffff88044c282100 0000000000000046 ffff88046c6f6140 ffff88044c36c000 Jan 18 15:54:41 nori kernel: [ 2035.918777] ffff88046a970d08 ffff88046a970cd0 ffff88044c36bde8 ffff880450288d10 Jan 18 15:54:41 nori kernel: [ 2035.918782] 0000000000000000 ffffffff8158113f ffff88046a970000 ffffffffa037862d Jan 18 15:54:41 nori kernel: [ 2035.918788] Call Trace: Jan 18 15:54:41 nori kernel: [ 2035.918800] [<ffffffff8158113f>] ? schedule+0x2f/0x70 Jan 18 15:54:41 nori kernel: [ 2035.918844] [<ffffffffa037862d>] ? btrfs_scrub_pause+0x8d/0x100 [btrfs] Jan 18 15:54:41 nori kernel: [ 2035.918851] [<ffffffff810b1590>] ? wake_atomic_t_function+0x60/0x60 Jan 18 15:54:41 nori kernel: [ 2035.918881] [<ffffffffa031c0fa>] ? btrfs_commit_transaction+0x47a/0xad0 [btrfs] Jan 18 15:54:41 nori kernel: [ 2035.918907] [<ffffffffa031c7e7>] ? start_transaction+0x97/0x590 [btrfs] Jan 18 15:54:41 nori kernel: [ 2035.918932] [<ffffffffa03173b0>] ? transaction_kthread+0x220/0x240 [btrfs] Jan 18 15:54:41 nori kernel: [ 2035.918954] [<ffffffffa0317190>] ? btrfs_cleanup_transaction+0x540/0x540 [btrfs] Jan 18 15:54:41 nori kernel: [ 2035.918961] [<ffffffff810904bd>] ? kthread+0xcd/0xf0 Jan 18 15:54:41 nori kernel: [ 2035.918968] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:54:41 nori kernel: [ 2035.918975] [<ffffffff8158514f>] ? ret_from_fork+0x3f/0x70 Jan 18 15:54:41 nori kernel: [ 2035.918980] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:54:41 nori kernel: [ 2035.918999] INFO: task gzip:5211 blocked for more than 120 seconds. Jan 18 15:54:41 nori kernel: [ 2035.919002] Not tainted 4.3.0-1-amd64 #1 Jan 18 15:54:41 nori kernel: [ 2035.919004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 15:54:41 nori kernel: [ 2035.919007] gzip D ffffffff810a08c0 0 5211 5173 0x00000100 Jan 18 15:54:41 nori kernel: [ 2035.919012] ffff8803f941f0c0 0000000000000086 ffff88046c6a8140 ffff8803de66c000 Jan 18 15:54:41 nori kernel: [ 2035.919018] ffff88044c2991f0 ffff8803f4f3eae0 ffff88044c2991f0 0000000000000001 Jan 18 15:54:41 nori kernel: [ 2035.919023] ffff8803f941f0c0 ffffffff8158113f ffff880450288d10 ffffffffa031b11d Jan 18 15:54:41 nori kernel: [ 2035.919028] Call Trace: Jan 18 15:54:41 nori kernel: [ 2035.919033] [<ffffffff8158113f>] ? schedule+0x2f/0x70 Jan 18 15:54:41 nori kernel: [ 2035.919056] [<ffffffffa031b11d>] ? wait_current_trans.isra.21+0xcd/0x110 [btrfs] Jan 18 15:54:41 nori kernel: [ 2035.919061] [<ffffffff810b1590>] ? wake_atomic_t_function+0x60/0x60 Jan 18 15:54:41 nori kernel: [ 2035.919083] [<ffffffffa031cb66>] ? start_transaction+0x416/0x590 [btrfs] Jan 18 15:54:41 nori kernel: [ 2035.919088] [<ffffffff811e5935>] ? __d_lookup+0x75/0x150 Jan 18 15:54:41 nori kernel: [ 2035.919111] [<ffffffffa031da0f>] ? __unlink_start_trans+0x1f/0xb0 [btrfs] Jan 18 15:54:41 nori kernel: [ 2035.919135] [<ffffffffa0325d23>] ? btrfs_unlink+0x23/0xa0 [btrfs] Jan 18 15:54:41 nori kernel: [ 2035.919141] [<ffffffff811d9667>] ? 
vfs_unlink+0xe7/0x190 Jan 18 15:54:41 nori kernel: [ 2035.919147] [<ffffffff811dcaa9>] ? do_unlinkat+0x299/0x2d0 Jan 18 15:54:41 nori kernel: [ 2035.919154] [<ffffffff81584db2>] ? system_call_fast_compare_end+0xc/0x67 Jan 18 15:56:41 nori kernel: [ 2155.710465] INFO: task btrfs-transacti:2351 blocked for more than 120 seconds. Jan 18 15:56:41 nori kernel: [ 2155.710473] Not tainted 4.3.0-1-amd64 #1 Jan 18 15:56:41 nori kernel: [ 2155.710475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 15:56:41 nori kernel: [ 2155.710478] btrfs-transacti D ffffffff810a08c0 0 2351 2 0x00000000 Jan 18 15:56:41 nori kernel: [ 2155.710487] ffff88044c282100 0000000000000046 ffff88046c6f6140 ffff88044c36c000 Jan 18 15:56:41 nori kernel: [ 2155.710493] ffff88046a970d08 ffff88046a970cd0 ffff88044c36bde8 ffff880450288d10 Jan 18 15:56:41 nori kernel: [ 2155.710498] 0000000000000000 ffffffff8158113f ffff88046a970000 ffffffffa037862d Jan 18 15:56:41 nori kernel: [ 2155.710504] Call Trace: Jan 18 15:56:41 nori kernel: [ 2155.710516] [<ffffffff8158113f>] ? schedule+0x2f/0x70 Jan 18 15:56:41 nori kernel: [ 2155.710559] [<ffffffffa037862d>] ? btrfs_scrub_pause+0x8d/0x100 [btrfs] Jan 18 15:56:41 nori kernel: [ 2155.710566] [<ffffffff810b1590>] ? wake_atomic_t_function+0x60/0x60 Jan 18 15:56:41 nori kernel: [ 2155.710597] [<ffffffffa031c0fa>] ? btrfs_commit_transaction+0x47a/0xad0 [btrfs] Jan 18 15:56:41 nori kernel: [ 2155.710624] [<ffffffffa031c7e7>] ? start_transaction+0x97/0x590 [btrfs] Jan 18 15:56:41 nori kernel: [ 2155.710650] [<ffffffffa03173b0>] ? transaction_kthread+0x220/0x240 [btrfs] Jan 18 15:56:41 nori kernel: [ 2155.710674] [<ffffffffa0317190>] ? btrfs_cleanup_transaction+0x540/0x540 [btrfs] Jan 18 15:56:41 nori kernel: [ 2155.710681] [<ffffffff810904bd>] ? kthread+0xcd/0xf0 Jan 18 15:56:41 nori kernel: [ 2155.710688] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:56:41 nori kernel: [ 2155.710695] [<ffffffff8158514f>] ? ret_from_fork+0x3f/0x70 Jan 18 15:56:41 nori kernel: [ 2155.710700] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:56:41 nori kernel: [ 2155.710719] INFO: task gzip:5211 blocked for more than 120 seconds. Jan 18 15:56:41 nori kernel: [ 2155.710722] Not tainted 4.3.0-1-amd64 #1 Jan 18 15:56:41 nori kernel: [ 2155.710724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 15:56:41 nori kernel: [ 2155.710727] gzip D ffffffff810a08c0 0 5211 5173 0x00000100 Jan 18 15:56:41 nori kernel: [ 2155.710732] ffff8803f941f0c0 0000000000000086 ffff88046c6a8140 ffff8803de66c000 Jan 18 15:56:41 nori kernel: [ 2155.710737] ffff88044c2991f0 ffff8803f4f3eae0 ffff88044c2991f0 0000000000000001 Jan 18 15:56:41 nori kernel: [ 2155.710742] ffff8803f941f0c0 ffffffff8158113f ffff880450288d10 ffffffffa031b11d Jan 18 15:56:41 nori kernel: [ 2155.710747] Call Trace: Jan 18 15:56:41 nori kernel: [ 2155.710753] [<ffffffff8158113f>] ? schedule+0x2f/0x70 Jan 18 15:56:41 nori kernel: [ 2155.710776] [<ffffffffa031b11d>] ? wait_current_trans.isra.21+0xcd/0x110 [btrfs] Jan 18 15:56:41 nori kernel: [ 2155.710781] [<ffffffff810b1590>] ? wake_atomic_t_function+0x60/0x60 Jan 18 15:56:41 nori kernel: [ 2155.710804] [<ffffffffa031cb66>] ? start_transaction+0x416/0x590 [btrfs] Jan 18 15:56:41 nori kernel: [ 2155.710809] [<ffffffff811e5935>] ? __d_lookup+0x75/0x150 Jan 18 15:56:41 nori kernel: [ 2155.710833] [<ffffffffa031da0f>] ? 
__unlink_start_trans+0x1f/0xb0 [btrfs] Jan 18 15:56:41 nori kernel: [ 2155.710859] [<ffffffffa0325d23>] ? btrfs_unlink+0x23/0xa0 [btrfs] Jan 18 15:56:41 nori kernel: [ 2155.710868] [<ffffffff811d9667>] ? vfs_unlink+0xe7/0x190 Jan 18 15:56:41 nori kernel: [ 2155.710874] [<ffffffff811dcaa9>] ? do_unlinkat+0x299/0x2d0 Jan 18 15:56:41 nori kernel: [ 2155.710882] [<ffffffff81584db2>] ? system_call_fast_compare_end+0xc/0x67 Jan 18 15:58:41 nori kernel: [ 2275.507816] INFO: task btrfs-transacti:2351 blocked for more than 120 seconds. Jan 18 15:58:41 nori kernel: [ 2275.507824] Not tainted 4.3.0-1-amd64 #1 Jan 18 15:58:41 nori kernel: [ 2275.507826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 15:58:41 nori kernel: [ 2275.507830] btrfs-transacti D ffffffff810a08c0 0 2351 2 0x00000000 Jan 18 15:58:41 nori kernel: [ 2275.507838] ffff88044c282100 0000000000000046 ffff88046c6f6140 ffff88044c36c000 Jan 18 15:58:41 nori kernel: [ 2275.507844] ffff88046a970d08 ffff88046a970cd0 ffff88044c36bde8 ffff880450288d10 Jan 18 15:58:41 nori kernel: [ 2275.507849] 0000000000000000 ffffffff8158113f ffff88046a970000 ffffffffa037862d Jan 18 15:58:41 nori kernel: [ 2275.507855] Call Trace: Jan 18 15:58:41 nori kernel: [ 2275.507866] [<ffffffff8158113f>] ? schedule+0x2f/0x70 Jan 18 15:58:41 nori kernel: [ 2275.507909] [<ffffffffa037862d>] ? btrfs_scrub_pause+0x8d/0x100 [btrfs] Jan 18 15:58:41 nori kernel: [ 2275.507916] [<ffffffff810b1590>] ? wake_atomic_t_function+0x60/0x60 Jan 18 15:58:41 nori kernel: [ 2275.507946] [<ffffffffa031c0fa>] ? btrfs_commit_transaction+0x47a/0xad0 [btrfs] Jan 18 15:58:41 nori kernel: [ 2275.507974] [<ffffffffa031c7e7>] ? start_transaction+0x97/0x590 [btrfs] Jan 18 15:58:41 nori kernel: [ 2275.508000] [<ffffffffa03173b0>] ? transaction_kthread+0x220/0x240 [btrfs] Jan 18 15:58:41 nori kernel: [ 2275.508022] [<ffffffffa0317190>] ? btrfs_cleanup_transaction+0x540/0x540 [btrfs] Jan 18 15:58:41 nori kernel: [ 2275.508030] [<ffffffff810904bd>] ? kthread+0xcd/0xf0 Jan 18 15:58:41 nori kernel: [ 2275.508036] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:58:41 nori kernel: [ 2275.508043] [<ffffffff8158514f>] ? ret_from_fork+0x3f/0x70 Jan 18 15:58:41 nori kernel: [ 2275.508049] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 15:58:41 nori kernel: [ 2275.508067] INFO: task gzip:5211 blocked for more than 120 seconds. Jan 18 15:58:41 nori kernel: [ 2275.508070] Not tainted 4.3.0-1-amd64 #1 Jan 18 15:58:41 nori kernel: [ 2275.508072] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 15:58:41 nori kernel: [ 2275.508075] gzip D ffffffff810a08c0 0 5211 5173 0x00000100 Jan 18 15:58:41 nori kernel: [ 2275.508080] ffff8803f941f0c0 0000000000000086 ffff88046c6a8140 ffff8803de66c000 Jan 18 15:58:41 nori kernel: [ 2275.508086] ffff88044c2991f0 ffff8803f4f3eae0 ffff88044c2991f0 0000000000000001 Jan 18 15:58:41 nori kernel: [ 2275.508091] ffff8803f941f0c0 ffffffff8158113f ffff880450288d10 ffffffffa031b11d Jan 18 15:58:41 nori kernel: [ 2275.508095] Call Trace: Jan 18 15:58:41 nori kernel: [ 2275.508101] [<ffffffff8158113f>] ? schedule+0x2f/0x70 Jan 18 15:58:41 nori kernel: [ 2275.508123] [<ffffffffa031b11d>] ? wait_current_trans.isra.21+0xcd/0x110 [btrfs] Jan 18 15:58:41 nori kernel: [ 2275.508128] [<ffffffff810b1590>] ? wake_atomic_t_function+0x60/0x60 Jan 18 15:58:41 nori kernel: [ 2275.508152] [<ffffffffa031cb66>] ? 
start_transaction+0x416/0x590 [btrfs] Jan 18 15:58:41 nori kernel: [ 2275.508157] [<ffffffff811e5935>] ? __d_lookup+0x75/0x150 Jan 18 15:58:41 nori kernel: [ 2275.508181] [<ffffffffa031da0f>] ? __unlink_start_trans+0x1f/0xb0 [btrfs] Jan 18 15:58:41 nori kernel: [ 2275.508207] [<ffffffffa0325d23>] ? btrfs_unlink+0x23/0xa0 [btrfs] Jan 18 15:58:41 nori kernel: [ 2275.508213] [<ffffffff811d9667>] ? vfs_unlink+0xe7/0x190 Jan 18 15:58:41 nori kernel: [ 2275.508219] [<ffffffff811dcaa9>] ? do_unlinkat+0x299/0x2d0 Jan 18 15:58:41 nori kernel: [ 2275.508226] [<ffffffff81584db2>] ? system_call_fast_compare_end+0xc/0x67 Jan 18 16:00:41 nori kernel: [ 2395.299270] INFO: task btrfs-transacti:2351 blocked for more than 120 seconds. Jan 18 16:00:41 nori kernel: [ 2395.299278] Not tainted 4.3.0-1-amd64 #1 Jan 18 16:00:41 nori kernel: [ 2395.299281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 16:00:41 nori kernel: [ 2395.299284] btrfs-transacti D ffffffff810a08c0 0 2351 2 0x00000000 Jan 18 16:00:41 nori kernel: [ 2395.299293] ffff88044c282100 0000000000000046 ffff88046c6f6140 ffff88044c36c000 Jan 18 16:00:41 nori kernel: [ 2395.299299] ffff88046a970d08 ffff88046a970cd0 ffff88044c36bde8 ffff880450288d10 Jan 18 16:00:41 nori kernel: [ 2395.299304] 0000000000000000 ffffffff8158113f ffff88046a970000 ffffffffa037862d Jan 18 16:00:41 nori kernel: [ 2395.299309] Call Trace: Jan 18 16:00:41 nori kernel: [ 2395.299322] [<ffffffff8158113f>] ? schedule+0x2f/0x70 Jan 18 16:00:41 nori kernel: [ 2395.299365] [<ffffffffa037862d>] ? btrfs_scrub_pause+0x8d/0x100 [btrfs] Jan 18 16:00:41 nori kernel: [ 2395.299371] [<ffffffff810b1590>] ? wake_atomic_t_function+0x60/0x60 Jan 18 16:00:41 nori kernel: [ 2395.299402] [<ffffffffa031c0fa>] ? btrfs_commit_transaction+0x47a/0xad0 [btrfs] Jan 18 16:00:41 nori kernel: [ 2395.299429] [<ffffffffa031c7e7>] ? start_transaction+0x97/0x590 [btrfs] Jan 18 16:00:41 nori kernel: [ 2395.299455] [<ffffffffa03173b0>] ? transaction_kthread+0x220/0x240 [btrfs] Jan 18 16:00:41 nori kernel: [ 2395.299478] [<ffffffffa0317190>] ? btrfs_cleanup_transaction+0x540/0x540 [btrfs] Jan 18 16:00:41 nori kernel: [ 2395.299485] [<ffffffff810904bd>] ? kthread+0xcd/0xf0 Jan 18 16:00:41 nori kernel: [ 2395.299492] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 16:00:41 nori kernel: [ 2395.299499] [<ffffffff8158514f>] ? ret_from_fork+0x3f/0x70 Jan 18 16:00:41 nori kernel: [ 2395.299505] [<ffffffff810903f0>] ? kthread_create_on_node+0x190/0x190 Jan 18 16:00:41 nori kernel: [ 2395.299522] INFO: task gzip:5211 blocked for more than 120 seconds. Jan 18 16:00:41 nori kernel: [ 2395.299526] Not tainted 4.3.0-1-amd64 #1 Jan 18 16:00:41 nori kernel: [ 2395.299528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 18 16:00:41 nori kernel: [ 2395.299530] gzip D ffffffff810a08c0 0 5211 5173 0x00000100 Jan 18 16:00:41 nori kernel: [ 2395.299536] ffff8803f941f0c0 0000000000000086 ffff88046c6a8140 ffff8803de66c000 Jan 18 16:00:41 nori kernel: [ 2395.299541] ffff88044c2991f0 ffff8803f4f3eae0 ffff88044c2991f0 0000000000000001 Jan 18 16:00:41 nori kernel: [ 2395.299546] ffff8803f941f0c0 ffffffff8158113f ffff880450288d10 ffffffffa031b11d Jan 18 16:00:41 nori kernel: [ 2395.299551] Call Trace: Jan 18 16:00:41 nori kernel: [ 2395.299557] [<ffffffff8158113f>] ? schedule+0x2f/0x70 Jan 18 16:00:41 nori kernel: [ 2395.299580] [<ffffffffa031b11d>] ? wait_current_trans.isra.21+0xcd/0x110 [btrfs] Jan 18 16:00:41 nori kernel: [ 2395.299585] [<ffffffff810b1590>] ? 
wake_atomic_t_function+0x60/0x60
Jan 18 16:00:41 nori kernel: [ 2395.299608] [<ffffffffa031cb66>] ? start_transaction+0x416/0x590 [btrfs]
Jan 18 16:00:41 nori kernel: [ 2395.299614] [<ffffffff811e5935>] ? __d_lookup+0x75/0x150
Jan 18 16:00:41 nori kernel: [ 2395.299637] [<ffffffffa031da0f>] ? __unlink_start_trans+0x1f/0xb0 [btrfs]
Jan 18 16:00:41 nori kernel: [ 2395.299662] [<ffffffffa0325d23>] ? btrfs_unlink+0x23/0xa0 [btrfs]
Jan 18 16:00:41 nori kernel: [ 2395.299669] [<ffffffff811d9667>] ? vfs_unlink+0xe7/0x190
Jan 18 16:00:41 nori kernel: [ 2395.299674] [<ffffffff811dcaa9>] ? do_unlinkat+0x299/0x2d0
Jan 18 16:00:41 nori kernel: [ 2395.299681] [<ffffffff81584db2>] ? system_call_fast_compare_end+0xc/0x67

After that last line, I cancelled the replace and unmounted the FS, and
then proceeded to start it again with the previously missing drive.

>> * Cancel replace: "btrfs replace cancel /mnt"
>> * Unmount raid6 FS
>> * Disconnect array
>> * Physically add device FOO back to array
>> * Reconnect array
>> * Mount raid6 array normally (no "-o degraded")
>> * Run "btrfs replace start 2 /dev/BAR /mnt"
>
> Hmm. Open question whether 'btrfs replace cancel' actually marked
> /dev/BAR as wiped. If it doesn't, then this 2nd replace start should
> have failed unless you had used -f, or you used wipefs -a. If it's not
> wiped by any of the above, then I'd expect it's possible things get
> messy.

I'm fairly positive that it must have marked it as wiped, because I
don't remember it failing or having to use -f.

> Some others (no devs, however) have disagreed with me on this, so take
> it with a grain of salt, but I don't understand the rationale of
> running scrub on degraded arrays. The first order of business is to
> get it non-degraded. If that can't be done, scrub is pointless. Of
> course it's going to produce millions of unrecoverable errors. There's
> a device missing, so I'd expect many unrecoverable errors during a
> degraded scrub.

Yes, I searched and found your opinion on this. I was fairly confident
that the "unrecoverable errors" in the scrub were due to it being
degraded. I did the scrub because I felt I was running out of options.

My only suggestion, after going through this experience, would be that
btrfs scrub fail fast and return a hard error like "Cannot scrub
degraded mount." That would have saved me some time.

>> From a black box perspective, this led me to believe that the
>> corruption happened during the replace operation after mounting
>> normally after first mounting with "-o degraded." Of course,
>> knowledge of the internals could easily verify this.
>
> Filesystems are really difficult, so even knowledge of the internals
> doesn't guarantee the devs will understand where the problem first
> started in this case.

Fair point. I have experience writing a file system myself and I agree.
Though sometimes it helps when you have a non-zero mental model of the
internals.

^ permalink raw reply [flat|nested] 8+ messages in thread
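
(A tiny userspace guard along the lines Rian proposes, until btrfs-progs
grows the check itself; the degraded-mount detection via findmnt is an
assumption about one workable approach, not an existing btrfs feature.)

#!/bin/sh
# refuse to scrub a filesystem that is mounted with -o degraded
MNT=${1:-/mnt}
if findmnt -no OPTIONS "$MNT" | tr ',' '\n' | grep -qx degraded; then
    echo "Cannot scrub degraded mount: $MNT" >&2
    exit 1
fi
exec btrfs scrub start -B "$MNT"
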
* Re: FS corruption when mounting non-degraded after mounting degraded
2016-01-22 1:21 ` Rian Hunter
@ 2016-01-22 3:38 ` Chris Murphy
0 siblings, 0 replies; 8+ messages in thread
From: Chris Murphy @ 2016-01-22 3:38 UTC (permalink / raw)
To: Rian Hunter; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Jan 21, 2016 at 6:21 PM, Rian Hunter <rian@thelig.ht> wrote:
>
> "SCT Error Recovery Control command not supported" is printed for all my
> devices for the "smartctl" command.
>
> # cat /sys/block/sd{a,b,c,d,e,f,g,h}/device/timeout
> 30
> 30
> 30
> 30
> 30
> 30
> 30
> 30

This combination is a misconfigured setup. The SCT ERC value must be
lower than the kernel block device timeout value. When SCT ERC is not
supported, that means the value is unknown, but in effect it means it
can be stratospheric; 120+ seconds is not unheard of. So your use case
must be prepared for non-deterministic, rare (hopefully), total stalls
of 2-3 minutes.

If your use case can't ever tolerate such a hang while a drive is in
deep recovery doing a read on a marginal/bad sector, then the drives are
disqualified for the use case: you have to get drives that have
configurable SCT ERC, and do something like 'smartctl -l scterc,70,70'
on every device at startup, since it's not a persistent setting.

In the meantime you need to change the block device timeout for all
drives to something nutty like 160. This also is not a persistent
setting. This will permit a drive with a bad sector to actually report a
read failure rather than the SATA link being reset by the kernel. The
kernel uses this timeout value to know when to give up waiting for a
hung drive, and the drive is hanging because it's in deep read recovery,
so its whole queue just stalls until there's a read error or the read
succeeds.

What will suck is that some of these deep recoveries will work, so
instead of getting a read error, which is a prerequisite for Btrfs (or
md raid or lvm raid or any raid) to fix the slow sector, the drive just
gets slower and slower when it has one or more frequently used bad
blocks *until* it hits the drive's own internal timeout. The workaround
for this (instead of waiting for that sector to get bad enough to
actually fail to read) is to do a full balance. That rewrites all data,
and sectors with recent writes are fast. Any persistently bad sectors at
write time will be remapped by drive firmware in favor of reserve
sectors.

>> Do you have dmesg output that includes sysrq+w at the time of the hung
>> task warnings? That's pretty much always requested by devs in these
>> cases.
>
> This is dmesg before the freeze:

https://www.kernel.org/doc/Documentation/sysrq.txt

If you get a hung task message, enable sysrq, then write w to the
trigger path, then capture dmesg to a file and post that to a bug
report.

> Yes, I searched and found your opinion on this. I was fairly confident
> that the "unrecoverable errors" in the scrub were due to it being
> degraded. I did the scrub because I felt I was running out of options.
>
> My only suggestion, after going through this experience, would be that
> btrfs scrub fail fast and return a hard error like "Cannot scrub
> degraded mount." That would have saved me some time.

Yes, that's my position also, until a dev tells me to shush. It'd
probably be a pretty straightforward patch for btrfs-progs to have scrub
first check for a degraded mount and fail with the error message you
propose if it is; David might accept it! But if he doesn't, he'll
probably suggest a better way to handle such cases if there is one.

Btrfs is always doing a passive on-the-fly scrub anyway. An active read
and write scrub, where every single stripe *will* have a read error that
the scrub code has to do something with, just seems to me like taxing an
array that's already under duress. Maybe a compromise is a -f flag that
forces scrub anyway? Maybe it's a valid stress test, but I still remain
unconvinced it's a good production task.

>>> From a black box perspective, this led me to believe that the
>>> corruption happened during the replace operation after mounting
>>> normally after first mounting with "-o degraded." Of course,
>>> knowledge of the internals could easily verify this.
>>
>> Filesystems are really difficult, so even knowledge of the internals
>> doesn't guarantee the devs will understand where the problem first
>> started in this case.
>
> Fair point. I have experience writing a file system myself and I agree.
> Though sometimes it helps when you have a non-zero mental model of the
> internals.

Well, at least they can look at the call traces, and maybe ask some
intelligent questions. I can't do that yet.

--
Chris Murphy

^ permalink raw reply [flat|nested] 8+ messages in thread
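
(A sketch of the boot-time tuning Chris describes above; the 70/70 and
180 values and the device list are illustrative, neither setting
survives a reboot, and the "not supported" string is matched because
that is what smartctl prints for drives without configurable SCT ERC.)

#!/bin/sh
# run once at boot (e.g. from rc.local); adjust the device list to taste
for dev in sda sdb sdc sdd sde sdf sdg sdh; do
    if smartctl -l scterc /dev/$dev | grep -q 'not supported'; then
        # no configurable ERC: give the kernel a long timeout so a drive
        # in deep recovery reports a read error instead of a link reset
        echo 180 > /sys/block/$dev/device/timeout
    else
        # ERC is configurable: cap recovery at 7.0 seconds, well under
        # the default 30 second kernel block-device timeout
        smartctl -q silent -l scterc,70,70 /dev/$dev
        echo 30 > /sys/block/$dev/device/timeout
    fi
done
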
end of thread, other threads: [~2016-01-22 3:38 UTC | newest]

Thread overview: 8+ messages
2016-01-19 11:28 FS corruption when mounting non-degraded after mounting degraded Rian Hunter
2016-01-19 13:54 ` Duncan
2016-01-21 16:26 ` Rich Freeman
2016-01-21 17:15 ` Chris Murphy
2016-01-21 22:25 ` Rian Hunter
2016-01-21 23:51 ` Chris Murphy
2016-01-22 1:21 ` Rian Hunter
2016-01-22 3:38 ` Chris Murphy