* parent transid verify failed on snapshot deletion
From: Roman Mamedov @ 2016-03-12 15:48 UTC (permalink / raw)
To: linux-btrfs
Hello,
The system was seemingly running just fine for days or weeks, then I
routinely deleted a bunch of old snapshots, and suddenly got hit with:
[Sat Mar 12 20:17:10 2016] BTRFS error (device dm-0): parent transid verify failed on 7483566862336 wanted 410578 found 404133
[Sat Mar 12 20:17:10 2016] BTRFS error (device dm-0): parent transid verify failed on 7483566862336 wanted 410578 found 404133
[Sat Mar 12 20:17:10 2016] ------------[ cut here ]------------
[Sat Mar 12 20:17:10 2016] WARNING: CPU: 0 PID: 217 at fs/btrfs/extent-tree.c:6549 __btrfs_free_extent.isra.67+0x2c2/0xd40 [btrfs]()
[Sat Mar 12 20:17:10 2016] BTRFS: Transaction aborted (error -5)
[Sat Mar 12 20:17:10 2016] Modules linked in: xt_tcpudp xt_multiport xt_limit xt_length xt_conntrack ip6t_rpfilter ipt_rpfilter ip6table_raw ip6table_mangle iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables x_tables cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative cfg80211 rfkill arc4 ecb md4 hmac nls_utf8 cifs dns_resolver fscache 8021q garp mrp bridge stp llc tcp_illinois ext4 crc16 mbcache jbd2 fuse kvm_amd kvm irqbypass serio_raw evdev pcspkr joydev snd_hda_codec_realtek k10temp snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep acpi_cpufreq sp5100_tco snd_pcm snd_timer tpm_tis snd tpm shpchp soundcore i2c_piix4 button processor btrfs dm_mod raid1 raid456
[Sat Mar 12 20:17:10 2016] async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic md_mod sg ata_generic sd_mod hid_generic usbhid hid uas usb_storage ohci_pci xhci_pci xhci_hcd r8169 mii sata_mv ahci libahci pata_atiixp ehci_pci ohci_hcd ehci_hcd libata usbcore usb_common scsi_mod
[Sat Mar 12 20:17:10 2016] CPU: 0 PID: 217 Comm: btrfs-cleaner Tainted: G W 4.4.4-rm1+ #108
[Sat Mar 12 20:17:10 2016] Hardware name: Gigabyte Technology Co., Ltd. GA-E350N-USB3/GA-E350N-USB3, BIOS F2 09/19/2011
[Sat Mar 12 20:17:10 2016] 0000000000000286 000000007223a131 ffff880406befa88 ffffffff81315721
[Sat Mar 12 20:17:10 2016] ffff880406befad0 ffffffffa03539b2 ffff880406befac0 ffffffff8107e735
[Sat Mar 12 20:17:10 2016] 0000000183c9c000 00000000fffffffb ffff88032dbc0e01 0000069c4f95b000
[Sat Mar 12 20:17:10 2016] Call Trace:
[Sat Mar 12 20:17:10 2016] [<ffffffff81315721>] dump_stack+0x63/0x82
[Sat Mar 12 20:17:10 2016] [<ffffffff8107e735>] warn_slowpath_common+0x95/0xe0
[Sat Mar 12 20:17:10 2016] [<ffffffff8107e7dc>] warn_slowpath_fmt+0x5c/0x80
[Sat Mar 12 20:17:10 2016] [<ffffffffa02b2e42>] __btrfs_free_extent.isra.67+0x2c2/0xd40 [btrfs]
[Sat Mar 12 20:17:10 2016] [<ffffffffa02b6f12>] __btrfs_run_delayed_refs+0x412/0x1230 [btrfs]
[Sat Mar 12 20:17:10 2016] [<ffffffff8133edad>] ? __percpu_counter_add+0x5d/0x80
[Sat Mar 12 20:17:10 2016] [<ffffffffa02bab4e>] btrfs_run_delayed_refs+0x7e/0x2b0 [btrfs]
[Sat Mar 12 20:17:10 2016] [<ffffffffa02cfd08>] btrfs_should_end_transaction+0x68/0x70 [btrfs]
[Sat Mar 12 20:17:10 2016] [<ffffffffa02b932d>] btrfs_drop_snapshot+0x45d/0x840 [btrfs]
[Sat Mar 12 20:17:10 2016] [<ffffffff815d1ee5>] ? __schedule+0x355/0xa30
[Sat Mar 12 20:17:10 2016] [<ffffffffa02d020d>] btrfs_clean_one_deleted_snapshot+0xbd/0x120 [btrfs]
[Sat Mar 12 20:17:10 2016] [<ffffffffa02c7e6d>] cleaner_kthread+0x17d/0x210 [btrfs]
[Sat Mar 12 20:17:10 2016] [<ffffffffa02c7cf0>] ? check_leaf+0x370/0x370 [btrfs]
[Sat Mar 12 20:17:10 2016] [<ffffffff8109db9a>] kthread+0xea/0x100
[Sat Mar 12 20:17:10 2016] [<ffffffff8109dab0>] ? kthread_park+0x60/0x60
[Sat Mar 12 20:17:10 2016] [<ffffffff815d6c4f>] ret_from_fork+0x3f/0x70
[Sat Mar 12 20:17:10 2016] [<ffffffff8109dab0>] ? kthread_park+0x60/0x60
[Sat Mar 12 20:17:10 2016] ---[ end trace 4a0a05309f1c27f4 ]---
[Sat Mar 12 20:17:10 2016] BTRFS: error (device dm-0) in __btrfs_free_extent:6549: errno=-5 IO failure
[Sat Mar 12 20:17:10 2016] BTRFS info (device dm-0): forced readonly
[Sat Mar 12 20:17:10 2016] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2927: errno=-5 IO failure
[Sat Mar 12 20:17:10 2016] pending csums is 103825408
Now this happens after each reboot too, causing the FS to be remounted read-only.
I wonder what's the best way to proceed here. Maybe try btrfs-zero-log? But
the difference of about 6 thousand between the transid numbers is concerning.
It's also puzzling why this happened in the first place; I don't think this
filesystem has had any crashes or storage device-related issues recently.
--
With respect,
Roman
* Re: parent transid verify failed on snapshot deletion
From: Roman Mamedov @ 2016-03-12 17:15 UTC (permalink / raw)
To: linux-btrfs
Hello,
btrfsck output:
# btrfsck /dev/alpha/lv1
Checking filesystem on /dev/alpha/lv1
UUID: 8cf8eff9-fd5a-4b6f-bb85-3f2df2f63c99
checking extents
parent transid verify failed on 7483566862336 wanted 410578 found 404133
parent transid verify failed on 7483566862336 wanted 410578 found 404133
parent transid verify failed on 7483566862336 wanted 410578 found 404133
parent transid verify failed on 7483566862336 wanted 410578 found 404133
Ignoring transid failure
bad block 7483566862336
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 7483566862336 wanted 410578 found 404133
Ignoring transid failure
checking free space cache
parent transid verify failed on 7483566862336 wanted 410578 found 404133
Ignoring transid failure
There is no free space entry for 6504947712-7537164288
cache appears valid but isnt 6463422464
found 2455135703350 bytes used err is -22
total csum bytes: 0
total tree bytes: 368590848
total fs tree bytes: 0
total extent tree bytes: 364605440
btree space waste bytes: 122267203
file data blocks allocated: 1294204928
referenced 1294204928
Seems like it should be safe to run --repair?
--
With respect,
Roman
* Re: parent transid verify failed on snapshot deletion
From: Duncan @ 2016-03-13 3:54 UTC (permalink / raw)
To: linux-btrfs
Roman Mamedov posted on Sat, 12 Mar 2016 20:48:47 +0500 as excerpted:
> I wonder what's the best way to proceed here. Maybe try btrfs-zero-log?
> But the difference of about 6 thousand between the transid numbers is concerning.
btrfs-zero-log is a very specific tool designed to fix a very specific
problem, and transid differences >1 are not it.
I read your followup, posting btrfs check output and wondering about
enabling --repair, as well.
As long as you have a backup, it shouldn't be a problem, even if it does
cause further damage (which it doesn't appear it will in your case).
If you don't have a backup it shouldn't be a problem either, since the
very fact that you don't have a backup indicates, by your actions, that
you consider the data at risk to be of less value than the time, effort and
resources necessary to have that backup in the first place. As such,
even if you lose the data, you saved what was obviously more important
to you than the data: the time, effort and resources that you would
otherwise have put into making and testing that backup, so you're still
coming out ahead. =:^)
Which means the only case not clearly covered is data worth having
backed up, which you do, but where the backup is somewhat stale: as long
as the risk was theoretical, you didn't consider the chance of losing the
data updated since the backup worth more than the cost of refreshing that
backup. But now that the theoretical chance has become reality, while loss
of that incremental data isn't earth-shattering in its consequences, you'd
prefer not to lose it if you can save it without too much trouble. That's
quite understandable, and it's the exact position I've been in myself a
couple of times.
In both cases where I did end up actually giving up on repair and
eventually blowing away the filesystem, btrfs restore (before that
blow-away) was able to get me back the incremental changes since my last
proper backup. If it hadn't worked I'd certainly have lost some work and
been less than happy, but as I _did_ have backups (which, by the fact
that I had them, showed I valued the data at risk at something above
trivial) that were simply somewhat stale, it wouldn't have been the end
of the world.
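(For reference, a minimal btrfs restore invocation looks roughly like the
following; the destination directory here is just an example, any path with
enough free space will do:

# mkdir -p /mnt/rescue
# btrfs restore -v /dev/alpha/lv1 /mnt/rescue   # copies out reachable file data, never writes to the source device

That won't fix the filesystem, of course, but it's a read-only way to grab
the post-backup increment first.)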
Of course in your case you _can_ mount, if only in read-only mode. So
take the opportunity you've been handed and update your backups, just in
case (and of course a would-be backup that hasn't been verified
readable/restorable can't really be considered a completed backup until
that verification is done). Then, even in the worst-case scenario, btrfs
check --repair can't do more than inconvenience you a bit if it makes the
problem worse instead of fixing it, since you'll have current backups and
will only need to blow away the filesystem and recreate it fresh in order
to restore them.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: parent transid verify failed on snapshot deletion
From: Roman Mamedov @ 2016-03-13 9:24 UTC (permalink / raw)
To: linux-btrfs
On Sat, 12 Mar 2016 22:15:24 +0500
Roman Mamedov <rm@romanrm.net> wrote:
> Seems like it should be safe to run --repair?
Well, this is unexpected: I ran --repair, and it did not do anything.
# btrfsck --repair /dev/alpha/lv1
enabling repair mode
Checking filesystem on /dev/alpha/lv1
UUID: 8cf8eff9-fd5a-4b6f-bb85-3f2df2f63c99
checking extents
parent transid verify failed on 7483566862336 wanted 410578 found 404133
parent transid verify failed on 7483566862336 wanted 410578 found 404133
parent transid verify failed on 7483566862336 wanted 410578 found 404133
parent transid verify failed on 7483566862336 wanted 410578 found 404133
Ignoring transid failure
bad block 7483566862336
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 7483566862336 wanted 410578 found 404133
Ignoring transid failure
Fixed 0 roots.
checking free space cache
parent transid verify failed on 7483566862336 wanted 410578 found 404133
Ignoring transid failure
There is no free space entry for 6504947712-7537164288
cache appears valid but isnt 6463422464
found 2455135691065 bytes used err is -22
total csum bytes: 0
total tree bytes: 368590848
total fs tree bytes: 0
total extent tree bytes: 364605440
btree space waste bytes: 122267201
file data blocks allocated: 1294204928
referenced 1294204928
# btrfsck /dev/alpha/lv1
Checking filesystem on /dev/alpha/lv1
UUID: 8cf8eff9-fd5a-4b6f-bb85-3f2df2f63c99
checking extents
parent transid verify failed on 7483566862336 wanted 410578 found 404133
parent transid verify failed on 7483566862336 wanted 410578 found 404133
parent transid verify failed on 7483566862336 wanted 410578 found 404133
parent transid verify failed on 7483566862336 wanted 410578 found 404133
Ignoring transid failure
bad block 7483566862336
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 7483566862336 wanted 410578 found 404133
Ignoring transid failure
checking free space cache
parent transid verify failed on 7483566862336 wanted 410578 found 404133
Ignoring transid failure
There is no free space entry for 6504947712-7537164288
cache appears valid but isnt 6463422464
found 2455135691065 bytes used err is -22
total csum bytes: 0
total tree bytes: 368590848
total fs tree bytes: 0
total extent tree bytes: 364605440
btree space waste bytes: 122267201
file data blocks allocated: 1294204928
referenced 1294204928
With "Errors found in extent allocation tree", I wonder if I should
try --init-extent-tree next.
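(If I do go that route, I assume the invocation is simply the following, run
against the same unmounted device:

# btrfsck --repair --init-extent-tree /dev/alpha/lv1

i.e. the same command as before, with the extent-tree rebuild added.)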
--
With respect,
Roman
* Re: parent transid verify failed on snapshot deletion
From: Duncan @ 2016-03-13 17:03 UTC (permalink / raw)
To: linux-btrfs
Roman Mamedov posted on Sun, 13 Mar 2016 14:24:28 +0500 as excerpted:
> With "Errors found in extent allocation tree", I wonder if I should try
> --init-extent-tree next.
With backups I'd try it, if only for the personal experience value and to
see what the result was. But that's certainly more intensive "surgery"
on the filesystem than --repair, and I'd only do it either for that
experience value or if I was seriously desperate to recover files, as I'd
not trust the filesystem's health after that intensive a surgery, and
would blow the filesystem away after I recovered what I needed, even if
it did appear to work successfully.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: parent transid verify failed on snapshot deletion
From: Roman Mamedov @ 2016-03-13 17:24 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Sun, 13 Mar 2016 17:03:54 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:
> With backups I'd try it, if only for the personal experience value and to
> see what the result was. But that's certainly more intensive "surgery"
> on the filesystem than --repair, and I'd only do it either for that
> experience value or if I was seriously desperate to recover files, as I'd
> not trust the filesystem's health after that intensive a surgery, and
> would blow the filesystem away after I recovered what I needed, even if
> it did appear to work successfully.
"Blowing away" a 6TB filesystem just because some block randomly went "bad",
without any explanation of why, or any guarantee that this won't happen again, is not
the best outcome. Sure, there might be no way to "guarantee" anything, but let's
at least figure out a robust way to recover from this failure state.
I'm running --init-extent-tree right now in a "what if" mode, using
the copy-on-write feature of 'nbd-server' (this way the original block device
is not modified, and all changes are saved in a separate file). It's been
running for a good 8 hours now, with 100% CPU use of btrfsck and very little
disk access. Unless I'm mistaken and something went majorly wrong, these
messages (100 MB worth of them by now) seem to indicate that it is indeed
making progress recreating the extent tree:
adding new data backref on 3282190336 parent 4315246948352 owner 0 offset 0 found 1
Backref 3282190336 root 256 owner 1187677 offset 4096 num_refs 0 not found in extent tree
Incorrect local backref count on 3282190336 root 256 owner 1187677 offset 4096 found 1 wanted 0 back 0x23496e40
Backref 3282190336 parent 4315038240768 owner 0 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 3282190336 parent 4315038240768 owner 0 offset 0 found 1 wanted 0 back 0x4b29f3a0
Backref 3282190336 parent 4315246948352 owner 0 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 3282190336 parent 4315246948352 owner 0 offset 0 found 1 wanted 0 back 0x4c330f60
backpointer mismatch on [3282190336 4096]
ref mismatch on [3282194432 32768] extent item 0, found 1
adding new data backref on 3282194432 parent 4309109956608 owner 0 offset 0 found 1
Backref 3282194432 parent 4309109956608 owner 0 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 3282194432 parent 4309109956608 owner 0 offset 0 found 1 wanted 0 back 0x52903a20
backpointer mismatch on [3282194432 32768]
ref mismatch on [3282227200 4096] extent item 0, found 1
Once it finishes I'll check that files are present and not corrupted, then I
will have to run it once more, this time "for real". Unfortunately this also
seems to scale worse than linearly, as the rate at which new log messages
appear has been slowing down considerably as it progresses.
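For completeness, the "what if" setup is roughly the following (the port
number and host name are placeholders here, and the exact nbd-server syntax
differs between versions):

On the machine with the disks:
# nbd-server 10809 /dev/alpha/lv1 -c    # -c = copy-on-write; writes go to a temporary diff file, the LV stays untouched

On the machine doing the check:
# nbd-client storagehost 10809 /dev/nbd8
# btrfsck --repair --init-extent-tree /dev/nbd8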
--
With respect,
Roman
* Re: parent transid verify failed on snapshot deletion
From: Chris Murphy @ 2016-03-13 20:10 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Duncan, Btrfs BTRFS
On Sun, Mar 13, 2016 at 11:24 AM, Roman Mamedov <rm@romanrm.net> wrote:
>
> "Blowing away" a 6TB filesystem just because some block randomly went "bad",
I'm going to guess it's a metadata block, and the profile is single.
Otherwise, if it were data it'd just be a corrupt file and you'd be
told which one is affected. And if metadata had more than one copy,
then it should recover from the copy. The exact nature of the loss
isn't clear, a kernel message for the time of the bad block message
might help but I'm going to guess again that it's a 4096 byte missing
block of metadata. Depending on what it is, that could be a pretty
serious hole for any file system.
> I'm running --init-extent-tree right now in a "what if" mode, using
> the copy-on-write feature of 'nbd-server' (this way the original block device
> is not modified, and all changes are saved in a separate file).
So it's a Btrfs on NBD with no replication either from Btrfs or the
storage backing it on the server? Offhand I'd say one of them needs
redundancy to avoid this very problem; otherwise it's just too easy
for even network corruption to cause a problem (NBD or iSCSI).
Not related to your problem, but I'm not sure whether, and how many times,
Btrfs retries corrupt reads. That is, the device returns the read command OK
(no error), but Btrfs detects corruption. Does it retry, or
immediately fail? For flash- and network-based Btrfs, it's possible the
result is intermittent, so it should try again.
> It's been
> running for a good 8 hours now, with 100% CPU use of btrfsck and very little
> disk access.
Yeah, btrfs check is very much RAM-intensive.
--
Chris Murphy
* Re: parent transid verify failed on snapshot deletion
From: Sylvain Joyeux @ 2016-03-13 20:54 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-btrfs
My unfortunate experience with these transid problems is that they (1)
randomly appear without warning and (2) --repair completely destroys
the filesystem. Right now I have two separate volumes on two separate
disks reporting that error, and --repair certainly destroyed the first
one. I am trying to see what I can restore from the second one before
I try --repair on it as well.
The frustrating part is that, in my case, these volumes are only used to
receive subvolumes and delete them. From an outsider's point of view,
it hardly seems like a very intensive workload.
Sylvain
2016-03-12 12:48 GMT-03:00 Roman Mamedov <rm@romanrm.net>:
> Hello,
>
> The system was seemingly running just fine for days or weeks, then I
> routinely deleted a bunch of old snapshots, and suddenly got hit with:
> [... kernel trace and rest of the original report snipped; see the first message above ...]
* Re: parent transid verify failed on snapshot deletion
2016-03-13 20:10 ` Chris Murphy
@ 2016-03-13 20:55 ` Roman Mamedov
2016-03-13 21:52 ` Chris Murphy
0 siblings, 1 reply; 12+ messages in thread
From: Roman Mamedov @ 2016-03-13 20:55 UTC (permalink / raw)
To: Chris Murphy; +Cc: Duncan, Btrfs BTRFS
On Sun, 13 Mar 2016 14:10:47 -0600
Chris Murphy <lists@colorremedies.com> wrote:
> I'm going to guess it's a metadata block, and the profile is single.
> Otherwise, if it were data it'd just be a corrupt file and you'd be
> told which one is affected. And if metadata had more than one copy,
> then it should recover from the copy. The exact nature of the loss
> isn't clear, a kernel message for the time of the bad block message
> might help but I'm going to guess again that it's a 4096 byte missing
> block of metadata. Depending on what it is, that could be a pretty
> serious hole for any file system.
Pretty sure the metadata is DUP on that FS.
Besides, the "bad" block (only going by btrfsck's lingo here, it's not the usual
"hard disk got a bad block" problem) is not entirely missing, just 6k transids
older than it should be(???). I saved this from before the btrfsck passes:
# btrfs-debug-tree -b 7483566862336 /dev/alpha/lv1 :(
node 7483566862336 level 3 items 95 free 26 generation 404133 owner 7
fs uuid 8cf8eff9-fd5a-4b6f-bb85-3f2df2f63c99
chunk uuid 4688dce4-89dd-43eb-a0f4-d10900535183
key (EXTENT_CSUM EXTENT_CSUM 1062973087744) block 4314139631616 (1053256746) gen 402032
key (EXTENT_CSUM EXTENT_CSUM 1091441795072) block 4314548232192 (1053356502) gen 402102
key (EXTENT_CSUM EXTENT_CSUM 1107647541248) block 7482607947776 (1826808581) gen 402791
key (EXTENT_CSUM EXTENT_CSUM 1176289222656) block 7482608832512 (1826808797) gen 402791
key (EXTENT_CSUM EXTENT_CSUM 1199852232704) block 7483421888512 (1827007297) gen 403882
key (EXTENT_CSUM EXTENT_CSUM 1252762054656) block 7483566968832 (1827042717) gen 404133
key (EXTENT_CSUM EXTENT_CSUM 1302207705088) block 7486122131456 (1827666536) gen 399086
key (EXTENT_CSUM EXTENT_CSUM 1342292983808) block 7486136766464 (1827670109) gen 399086
key (EXTENT_CSUM EXTENT_CSUM 1357230608384) block 7486143053824 (1827671644) gen 399088
key (EXTENT_CSUM EXTENT_CSUM 1374801608704) block 7486219661312 (1827690347) gen 399097
key (EXTENT_CSUM EXTENT_CSUM 1406541111296) block 7482936365056 (1826888761) gen 403108
key (EXTENT_CSUM EXTENT_CSUM 1425602490368) block 7482806996992 (1826857177) gen 402938
key (EXTENT_CSUM EXTENT_CSUM 1439588401152) block 7492133109760 (1829134060) gen 400631
key (EXTENT_CSUM EXTENT_CSUM 1471449923584) block 7486878142464 (1827851109) gen 399121
key (EXTENT_CSUM EXTENT_CSUM 1494641868800) block 7486882181120 (1827852095) gen 399121
key (EXTENT_CSUM EXTENT_CSUM 1511553085440) block 7492376141824 (1829193394) gen 400803
key (EXTENT_CSUM EXTENT_CSUM 1530452836352) block 7492377698304 (1829193774) gen 400803
key (EXTENT_CSUM EXTENT_CSUM 1557468987392) block 7544937934848 (1842025863) gen 401275
key (EXTENT_CSUM EXTENT_CSUM 1589122428928) block 7544937947136 (1842025866) gen 401275
key (EXTENT_CSUM EXTENT_CSUM 1623402835968) block 7544935043072 (1842025157) gen 401275
key (EXTENT_CSUM EXTENT_CSUM 1660158967808) block 7544935292928 (1842025218) gen 401275
key (EXTENT_CSUM EXTENT_CSUM 1686639628288) block 7544935317504 (1842025224) gen 401275
key (EXTENT_CSUM EXTENT_CSUM 1717318074368) block 7545404669952 (1842139812) gen 401300
key (EXTENT_CSUM EXTENT_CSUM 1755587174400) block 7544935378944 (1842025239) gen 401275
key (EXTENT_CSUM EXTENT_CSUM 1771312803840) block 7482802622464 (1826856109) gen 402938
key (EXTENT_CSUM EXTENT_CSUM 1792774889472) block 7545001177088 (1842041303) gen 401281
key (EXTENT_CSUM EXTENT_CSUM 1833762066432) block 7545013350400 (1842044275) gen 401278
key (EXTENT_CSUM EXTENT_CSUM 1848938086400) block 7545009430528 (1842043318) gen 401278
key (EXTENT_CSUM EXTENT_CSUM 1874773962752) block 7545013170176 (1842044231) gen 401278
key (EXTENT_CSUM EXTENT_CSUM 1912300650496) block 4309044703232 (1052012867) gen 401366
key (EXTENT_CSUM EXTENT_CSUM 1934921564160) block 4308804886528 (1051954318) gen 401354
key (EXTENT_CSUM EXTENT_CSUM 1951308283904) block 4310900432896 (1052465926) gen 401686
key (EXTENT_CSUM EXTENT_CSUM 1966261223424) block 4309153787904 (1052039499) gen 401376
key (EXTENT_CSUM EXTENT_CSUM 1985369530368) block 4311094611968 (1052513333) gen 401757
key (EXTENT_CSUM EXTENT_CSUM 2002212573184) block 4311279501312 (1052558472) gen 401766
key (EXTENT_CSUM EXTENT_CSUM 2031789600768) block 4311093194752 (1052512987) gen 401757
key (EXTENT_CSUM EXTENT_CSUM 2056985681920) block 4311095111680 (1052513455) gen 401757
key (EXTENT_CSUM EXTENT_CSUM 2086494728192) block 4310101364736 (1052270841) gen 401441
key (EXTENT_CSUM EXTENT_CSUM 2114637971456) block 4311356846080 (1052577355) gen 401773
key (EXTENT_CSUM EXTENT_CSUM 2138850193408) block 4313693347840 (1053147790) gen 401966
key (EXTENT_CSUM EXTENT_CSUM 2160176660480) block 4314105159680 (1053248330) gen 402026
key (EXTENT_CSUM EXTENT_CSUM 2191463452672) block 4313988440064 (1053219834) gen 402009
key (EXTENT_CSUM EXTENT_CSUM 2219386761216) block 4313964060672 (1053213882) gen 402005
key (EXTENT_CSUM EXTENT_CSUM 2277297422336) block 4314309550080 (1053298230) gen 402066
key (EXTENT_CSUM EXTENT_CSUM 2341651099648) block 4314278002688 (1053290528) gen 402058
key (EXTENT_CSUM EXTENT_CSUM 2385829801984) block 4314699358208 (1053393398) gen 402131
key (EXTENT_CSUM EXTENT_CSUM 2443256795136) block 4314533724160 (1053352960) gen 402102
key (EXTENT_CSUM EXTENT_CSUM 2473251045376) block 4314534068224 (1053353044) gen 402102
key (EXTENT_CSUM EXTENT_CSUM 2492309962752) block 4314533797888 (1053352978) gen 402102
key (EXTENT_CSUM EXTENT_CSUM 2541250543616) block 7491513913344 (1828982889) gen 367993
key (EXTENT_CSUM EXTENT_CSUM 2624366092288) block 4314533789696 (1053352976) gen 402102
key (EXTENT_CSUM EXTENT_CSUM 2661959823360) block 4314533863424 (1053352994) gen 402102
key (EXTENT_CSUM EXTENT_CSUM 2722339299328) block 4314643193856 (1053379686) gen 402118
key (EXTENT_CSUM EXTENT_CSUM 2769931730944) block 4314614272000 (1053372625) gen 402114
key (EXTENT_CSUM EXTENT_CSUM 2795646136320) block 4314612932608 (1053372298) gen 402114
key (EXTENT_CSUM EXTENT_CSUM 2843763052544) block 4314612928512 (1053372297) gen 402114
key (EXTENT_CSUM EXTENT_CSUM 2902613557248) block 4314614157312 (1053372597) gen 402114
key (EXTENT_CSUM EXTENT_CSUM 2968288628736) block 4314614329344 (1053372639) gen 402114
key (EXTENT_CSUM EXTENT_CSUM 3134623027200) block 7492569567232 (1829240617) gen 400840
key (EXTENT_CSUM EXTENT_CSUM 3384253874176) block 7268773081088 (1774602803) gen 402786
key (EXTENT_CSUM EXTENT_CSUM 3434919317504) block 7268782407680 (1774605080) gen 402786
key (EXTENT_CSUM EXTENT_CSUM 3589271453696) block 7482801561600 (1826855850) gen 402938
key (EXTENT_CSUM EXTENT_CSUM 3610059431936) block 7482801238016 (1826855771) gen 402938
key (EXTENT_CSUM EXTENT_CSUM 3632980488192) block 4310713114624 (1052420194) gen 379864
key (EXTENT_CSUM EXTENT_CSUM 3662123552768) block 7482802126848 (1826855988) gen 402938
key (EXTENT_CSUM EXTENT_CSUM 3693896204288) block 7482802315264 (1826856034) gen 402938
key (EXTENT_CSUM EXTENT_CSUM 3731483045888) block 7483428696064 (1827008959) gen 403882
key (EXTENT_CSUM EXTENT_CSUM 3890200125440) block 7483526922240 (1827032940) gen 404055
key (EXTENT_CSUM EXTENT_CSUM 3924815777792) block 7483418935296 (1827006576) gen 403882
key (EXTENT_CSUM EXTENT_CSUM 3953528250368) block 4314230116352 (1053278837) gen 402051
key (EXTENT_CSUM EXTENT_CSUM 3978332045312) block 4314185465856 (1053267936) gen 402046
key (EXTENT_CSUM EXTENT_CSUM 3999411937280) block 4314513797120 (1053348095) gen 402097
key (EXTENT_CSUM EXTENT_CSUM 4022030766080) block 4309417017344 (1052103764) gen 401401
key (EXTENT_CSUM EXTENT_CSUM 4328173846528) block 4314038706176 (1053232106) gen 402015
key (EXTENT_CSUM EXTENT_CSUM 4388483334144) block 4314774265856 (1053411686) gen 402142
key (EXTENT_CSUM EXTENT_CSUM 4492224630784) block 7483410653184 (1827004554) gen 403881
key (EXTENT_CSUM EXTENT_CSUM 4540637818880) block 4314122088448 (1053252463) gen 402032
key (EXTENT_CSUM EXTENT_CSUM 4614089646080) block 4314448781312 (1053332222) gen 402086
key (EXTENT_CSUM EXTENT_CSUM 4720340647936) block 7483409018880 (1827004155) gen 403881
key (EXTENT_CSUM EXTENT_CSUM 4736819306496) block 4310925000704 (1052471924) gen 401688
key (EXTENT_CSUM EXTENT_CSUM 4755398365184) block 4314130493440 (1053254515) gen 402030
key (EXTENT_CSUM EXTENT_CSUM 4774954143744) block 7492586037248 (1829244638) gen 400843
key (EXTENT_CSUM EXTENT_CSUM 4805973180416) block 7492582633472 (1829243807) gen 400842
key (EXTENT_CSUM EXTENT_CSUM 4837741899776) block 7492538318848 (1829232988) gen 400835
key (EXTENT_CSUM EXTENT_CSUM 4871764180992) block 7492545794048 (1829234813) gen 400836
key (EXTENT_CSUM EXTENT_CSUM 4919789879296) block 7492521549824 (1829228894) gen 400832
key (EXTENT_CSUM EXTENT_CSUM 4956089876480) block 7492569387008 (1829240573) gen 400840
key (EXTENT_CSUM EXTENT_CSUM 5004070121472) block 7268872728576 (1774627131) gen 402787
key (EXTENT_CSUM EXTENT_CSUM 5065431572480) block 4314616324096 (1053373126) gen 402114
key (EXTENT_CSUM EXTENT_CSUM 5090921189376) block 7492482326528 (1829219318) gen 400825
key (EXTENT_CSUM EXTENT_CSUM 5132048932864) block 4310321446912 (1052324572) gen 309501
key (EXTENT_CSUM EXTENT_CSUM 5180942753792) block 4310060957696 (1052260976) gen 394444
key (EXTENT_CSUM EXTENT_CSUM 5232640884736) block 4310459052032 (1052358167) gen 394446
key (EXTENT_CSUM EXTENT_CSUM 5270016761856) block 7492586082304 (1829244649) gen 400843
key (EXTENT_CSUM EXTENT_CSUM 5298670948352) block 7483409063936 (1827004166) gen 403881
> > I'm running --init-extent-tree right now in a "what if" mode, using
> > the copy-on-write feature of 'nbd-server' (this way the original block device
> > is not modified, and all changes are saved in a separate file).
>
> So it's a Btrfs on NBD with no replication either from Btrfs or the
> storage backing it on the server? Offhand I'd say one of them needs
> redundancy to avoid this very problem; otherwise it's just too easy
> for even network corruption to cause a problem (NBD or iSCSI).
Its normal mode of operation is to be mounted locally; I only use NBD right now in the
recovery process, primarily for that nifty COW feature, and also to mount the FS on a more
powerful machine than its local one.
As for reliability, both Ethernet and TCP/IP have checksums for transferred content;
in the case of AoE this might be more of a concern, since it can only rely on the former, but NBD and
iSCSI should be rather safe.
--
With respect,
Roman
* Re: parent transid verify failed on snapshot deletion
From: Chris Murphy @ 2016-03-13 21:52 UTC (permalink / raw)
To: Roman Mamedov, Btrfs BTRFS
On Sun, Mar 13, 2016 at 2:55 PM, Roman Mamedov <rm@romanrm.net> wrote:
> On Sun, 13 Mar 2016 14:10:47 -0600
> Chris Murphy <lists@colorremedies.com> wrote:
>
>> I'm going to guess it's a metadata block, and the profile is single.
>> Otherwise, if it were data it'd just be a corrupt file and you'd be
>> told which one is affected. And if metadata had more than one copy,
>> then it should recover from the copy. The exact nature of the loss
>> isn't clear, a kernel message for the time of the bad block message
>> might help but I'm going to guess again that it's a 4096 byte missing
>> block of metadata. Depending on what it is, that could be a pretty
>> serious hole for any file system.
>
> Pretty sure the metadata is DUP on that FS.
Big difference. If it's single and the block is bad, it's uncertain if
it's something Btrfs should be able to recover from. If it's DUP then
it should be a non-factor. In either case, kernel messages would be a
lot more enlightening about what happened right before this. The call
trace really isn't that helpful in my opinion; all it tells us is that
Btrfs got confused.
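For the record, since the filesystem still mounts read-only, something like
this would settle the profile question (the mount point is just an example):

# btrfs filesystem df /mnt/point    # look for a "Metadata, DUP" vs "Metadata, single" line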
> I saved this from before the btrfsck passes:
>
> # btrfs-debug-tree -b 7483566862336 /dev/alpha/lv1 :(
> node 7483566862336 level 3 items 95 free 26 generation 404133 owner 7
> fs uuid 8cf8eff9-fd5a-4b6f-bb85-3f2df2f63c99
> chunk uuid 4688dce4-89dd-43eb-a0f4-d10900535183
> [... 95 EXTENT_CSUM keys snipped; identical to the dump quoted in the previous message ...]
Weird, I'm lost. That block address is bad, but btrfs-debug-tree shows
you it's a node pointing to a bunch of csum tree entries? If that
block is bad then I'd expect a lot more csum mismatches, since it couldn't
do csum tree lookups. Although it's somewhat consistent with the last
part of the call trace from when the filesystem went read-only:
> [Sat Mar 12 20:17:10 2016] pending csums is 103825408
I really think you need a minute's worth of kernel messages prior to
that time stamp.
--
Chris Murphy
* "Fixed", Re: parent transid verify failed on snapshot deletion
From: Roman Mamedov @ 2016-03-17 8:32 UTC (permalink / raw)
To: linux-btrfs
On Sat, 12 Mar 2016 20:48:47 +0500
Roman Mamedov <rm@romanrm.net> wrote:
> The system was seemingly running just fine for days or weeks, then I
> routinely deleted a bunch of old snapshots, and suddenly got hit with:
>
> [Sat Mar 12 20:17:10 2016] BTRFS error (device dm-0): parent transid verify failed on 7483566862336 wanted 410578 found 404133
> [Sat Mar 12 20:17:10 2016] BTRFS error (device dm-0): parent transid verify failed on 7483566862336 wanted 410578 found 404133
As I mentioned, the initial run of btrfsck --repair did not do anything to fix
this problem; I started btrfsck --repair --init-extent-tree, but it still had not
finished after 5 days, so I looked for other options.
While reviewing the btrfs-progs source, looking for some way to make btrfsck
do something about these transid failures, I spotted the tool called
btrfs-corrupt-block. At this point I was ready to accept some loss of data,
which I'd expect to be minor, if user-visible at all (after all, the
original backtrace happens in "btrfs_clean_one_deleted_snapshot", so
perhaps everything the "bad" block was storing related only to a snapshot
that's already been deleted).
I ran:
/root/btrfs-corrupt-block -l 7483566862336 /dev/nbd8
Btrfsck then finally reported something that inspired some hope:
checking extents
checksum verify failed on 7483566862336 found 295F0086 wanted 00000000
checksum verify failed on 7483566862336 found 295F0086 wanted 00000000
checksum verify failed on 7483566862336 found 295F0086 wanted 00000000
checksum verify failed on 7483566862336 found 295F0086 wanted 00000000
bytenr mismatch, want=7483566862336, have=0
deleting pointer to block 7483566862336
ref mismatch on [6504947712 118784] extent item 0, found 1
adding new data backref on 6504947712 parent 4311306919936 owner 0 offset 0 found 1
Backref 6504947712 parent 4311306919936 owner 0 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 6504947712 parent 4311306919936 owner 0 offset 0 found 1 wanted 0 back 0x57cfdff0
backpointer mismatch on [6504947712 118784]
...etc
After a few passes it settled into a state with no new errors reported (only
a few instances of "bad metadata crossing stripe boundary", but those also seem
to be commonly reported on filesystems that otherwise exhibit no issues).
Finally I was able to mount the FS with no backtrace occurring anymore -- the
btrfs-cleaner process then finished all the remaining snapshot deletion work,
freeing up 20GB or so. All data seems to be present, and selective checksum
verifications showed no corruption. Well, this machine is primarily a backup
server using rsync, so it should catch and fix up any losses.
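(For example, a dry-run checksum comparison against one of the rsync sources
would show which backed-up files differ from the live source; the host and
paths below are made up:

# rsync -anci sourcehost:/data/ /mnt/backups/sourcehost/data/    # -n dry run, -c full-checksum compare, -i itemize differences

Anything itemized there can then simply be re-transferred.)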
As a side note, for experiments with 'btrfsck --repair', 'btrfs-corrupt-block'
and my own patched versions of btrfsck, the technique of making writable CoW
snapshots of the whole block device has proved invaluable:
At first I used the nbd-server '-c' mode, but quickly discovered it to be
flaky: it seems to crash if the amount of changes gets over 150 MB or so, and
anyway its RAM usage seems to match "block device size / 1000", i.e. it
used 6 GB of RAM for a 6 TB filesystem. So in the end I switched to the
dm-snapshot target as described in [1]. One just has to remember never to
have both the snapshot and the original device visible, with one of them
mounted, on the same machine (this would confuse Btrfs with duplicate UUIDs);
for that, I used the same nbd-server (no longer using its built-in CoW),
exporting the writable snapshot over the network and mounting it on a
different server or VM.
[1]http://stackoverflow.com/questions/7582019/lvm-like-snapshot-on-a-normal-block-device
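Roughly, the dm-snapshot setup from [1] boils down to this (the CoW file size
and the names are illustrative):

# truncate -s 20G /tmp/lv1-cow.img     # sparse file that absorbs all writes to the snapshot
# losetup /dev/loop0 /tmp/lv1-cow.img
# dmsetup create lv1-snap --table "0 $(blockdev --getsz /dev/alpha/lv1) snapshot /dev/alpha/lv1 /dev/loop0 N 8"
# nbd-server 10810 /dev/mapper/lv1-snap    # export the writable snapshot for mounting on another machine

The "N 8" at the end means a non-persistent snapshot with 8-sector (4 KiB) chunks.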
--
With respect,
Roman
* Re: parent transid verify failed on snapshot deletion
From: Roman Mamedov @ 2016-03-17 8:39 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS
On Sun, 13 Mar 2016 15:52:52 -0600
Chris Murphy <lists@colorremedies.com> wrote:
> I really think you need a minute's worth of kernel messages prior to
> that time stamp.
There were no messages for a minute, or even (from memory) for many hours prior
to the crash. If there had been anything even remotely weird, block-device- or
FS-related, I would of course have included it with the original report.
--
With respect,
Roman