* transid failed / mount Problem on Linux pc6 3.17.2-1-ARCH #1 SMP PREEMPT Thu Oct 30 20:49:39 CET 2014 x86_64 GNU/Linux
@ 2014-11-11 11:13 Juergen Sauer
From: Juergen Sauer @ 2014-11-11 11:13 UTC
To: linux-btrfs
Hi!
This event occurred this morning: the archive machine was accidentally
kicked into hibernation.
After waking the machine, the archive btrfs filesystem was read-only;
after rebooting the system, the "archive" btrfs filesystem was not
mountable anymore.
I tried every recovery possibility I know; nothing worked.
Below I list the machine's problem output; it would be very ugly to
lose this data.
Do you have any further ideas on what I could try to recover my
archive filesystem?
The archive filesystem is a multi-device btrfs raid5.
System:
root@pc6:/usr/src/build/btrfs-progs# uname -a
Linux pc6 3.17.2-1-ARCH #1 SMP PREEMPT Thu Oct 30 20:49:39 CET 2014 x86_64 GNU/Linux
These btrfs tools were in use:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
root@pc6:/usr/src/build/btrfs-progs# ./btrfs fi show
parent transid verify failed on 209362944 wanted 293924 found 293922
parent transid verify failed on 209362944 wanted 293924 found 293922
Check tree block failed, want=209362944, have=209559552
parent transid verify failed on 209362944 wanted 293924 found 293922
Ignoring transid failure
Label: 'archiv' uuid: 48f71e09-6898-4665-bc61-bd7ca4ba4a24
Total devices 4 FS bytes used 3.35TiB
devid 1 size 1.70TiB used 726.69GiB path /dev/sdh3
devid 2 size 1.82TiB used 1.35TiB path /dev/sda1
devid 3 size 1.82TiB used 1.35TiB path /dev/sdj1
devid 4 size 1.82TiB used 1.35TiB path /dev/sdi1
Btrfs v3.17.1
mount -o ro,recovery -t btrfs /dev/sdh3 /mnt
mount: wrong filesystem type, bad options, the superblock of /dev/sdh3
is damaged, missing codepage, or other error
Sometimes the system log provides valuable information -
try dmesg | tail or similar
root@pc6:/usr/src/build/btrfs-progs# dmesg ...
[ 7116.746815] BTRFS info (device sdi1): enabling auto recovery
[ 7116.746820] BTRFS info (device sdi1): disk space caching is enabled
[ 7117.028008] verify_parent_transid: 6 callbacks suppressed
[ 7117.028013] parent transid verify failed on 209362944 wanted 293924 found 293922
[ 7117.028324] parent transid verify failed on 209362944 wanted 293924 found 293922
[ 7117.033188] parent transid verify failed on 244719616 wanted 293924 found 293922
[ 7117.033516] parent transid verify failed on 244719616 wanted 293924 found 293922
[ 7117.034114] BTRFS: bdev /dev/sda1 errs: wr 3, rd 0, flush 1, corrupt 0, gen 0
[ 7117.034557] parent transid verify failed on 209375232 wanted 293924 found 293914
[ 7117.034873] parent transid verify failed on 209375232 wanted 293924 found 293914
[ 7117.037358] parent transid verify failed on 245538816 wanted 293924 found 293922
[ 7117.037702] parent transid verify failed on 245538816 wanted 293924 found 293922
[ 7117.108132] parent transid verify failed on 253378560 wanted 293924 found 293914
[ 7117.108509] parent transid verify failed on 253378560 wanted 293924 found 293914
[ 7117.231038] BTRFS: bad tree block start 0 253911040
[ 7117.231052] BTRFS: Failed to read block groups: -5
[ 7117.290534] BTRFS: open_ctree failed
root@pc6:/usr/src/build/btrfs-progs# btrfs check --repair /dev/sdh3
enabling repair mode
parent transid verify failed on 209362944 wanted 293924 found 293922
parent transid verify failed on 209362944 wanted 293924 found 293922
Check tree block failed, want=209362944, have=209559552
parent transid verify failed on 209362944 wanted 293924 found 293922
Ignoring transid failure
parent transid verify failed on 247873536 wanted 293924 found 293922
parent transid verify failed on 247873536 wanted 293924 found 293922
Check tree block failed, want=247873536, have=248070144
parent transid verify failed on 247873536 wanted 293924 found 293922
Ignoring transid failure
leaf parent key incorrect 247873536
root@pc6:/usr/src/build/btrfs-progs# btrfs-zero-log /dev/sdh3
parent transid verify failed on 209362944 wanted 293924 found 293922
parent transid verify failed on 209362944 wanted 293924 found 293922
Check tree block failed, want=209362944, have=209559552
parent transid verify failed on 209362944 wanted 293924 found 293922
Ignoring transid failure
Unable to find block group for 0
extent-tree.c:289: find_search_start: Assertion `1` failed.
btrfs-zero-log[0x40c0ca]
btrfs-zero-log[0x410eb3]
btrfs-zero-log[0x410f6f]
btrfs-zero-log[0x403361]
btrfs-zero-log[0x403975]
btrfs-zero-log[0x408606]
btrfs-zero-log[0x409d8e]
btrfs-zero-log[0x402542]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f810af82040]
btrfs-zero-log[0x402653]
root@pc6:/usr/src/build/btrfs-progs# btrfs rescue chunk-recover /dev/sdi1
Segmentation fault (core dumped)
Any ideas what else I may check?
TIA ...
With kind regards,
Jürgen Sauer
--
Jürgen Sauer - automatiX GmbH
http://www.automatix.de/juergen_sauer_publickey.gpg
* Re: transid failed / mount Problem on Linux pc6 3.17.2-1-ARCH #1 SMP PREEMPT Thu Oct 30 20:49:39 CET 2014 x86_64 GNU/Linux
@ 2014-11-11 22:39 ` Duncan
From: Duncan @ 2014-11-11 22:39 UTC
To: linux-btrfs
Juergen Sauer posted on Tue, 11 Nov 2014 12:13:41 +0100 as excerpted:
> This event occurred this morning: the archive machine was accidentally
> kicked into hibernation.
>
> After waking the machine, the archive btrfs filesystem was read-only;
> after rebooting the system, the "archive" btrfs filesystem was not
> mountable anymore.
FWIW, I've had similar issues with both mdraid in the past, and with
btrfs now, with both hibernation and suspend-to-ram.
Tho after early experiences I switched to mdraid-1 some time in the past,
and now btrfs raid1 mode, which (even with the more mature mdraid) tends
to be more resilient than raid5 and faster than raid6. At least with
raid1, there's multiple copies of the data, and at least in my
experience, that dramatically increases the reliability of recovery from
temporary or permanent dropout of one device.
The general problem seems to be that in the resume process, some devices
wake up faster than others, and even "awake" devices don't necessarily
fully stabilize for a minute or two. Back on mdraid, I noticed some
devices coming up with model number strings and UIDs that would have
incorrect characters in some position, tho they'd stabilize over time.
Obviously, this plays havoc with kernel efforts to ensure the devices it
woke up to are the same devices it had when it went to sleep (either
suspend to ram or hibernate to disk).
And the same general problems continue to occur with the pair of SSDs I
have now, with suspend-to-ram instead of hibernate, while the original
devices I noticed the problem on were spinning rust of an entirely
different brand.
So it's not a btrfs-specific issue, or a device-specific issue, or a
motherboard-specific issue (since I've upgraded since I first saw it,
too), or a suspend/hibernate-type-specific issue. It's a general issue.
Tho I /have/ noticed on the current equipment that if I suspend for a
relatively short period, an hour or two, it seems to come back with
fewer problems than if I suspend for 6 hours or more -- say while I'm at
work or overnight. (FWIW, the old machine seemed to hibernate and
resume reasonably well other than this but couldn't reliably resume from
suspend, while the new machine is the opposite: I never got it to resume
from hibernation, but other than this, it reliably resumes from suspend.)
Unfortunately, the only reliable solution seems to be to fully shut down
instead of suspending or hibernating, and obviously, after running into
issues a few times, I eventually quit experimenting further. But the
fact that I'm running systemd on fast ssds now does ameliorate the
problem quite a bit, both due to faster booting and by making the lost
cache of a reboot far less of an issue, because reading the data back in
is so much faster on ssd.
So both suspend and hibernate seem to work better with single devices,
where one device being slower to stabilize won't be the issue it is with
raid (either mdraid or btrfs raid); raid just doesn't combine well with
suspend/hibernate. =:^(
Too bad, as being able to suspend and wake up right away was saving on
the electric bill. =:^(
So if it's really critical, as it arguably might be on an archive
machine, I'd consider pointing whatever suspend/hibernate triggers at
shutdown or reboot instead. If it's not possible to accidentally
hibernate the thing, because the trigger does a shutdown/reboot
instead, it can't be accidentally hibernated. =:^)
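FWIW, on a systemd machine like your Arch box, the blunt way to do that
is something like this (a sketch only, from memory -- check it against
your own setup before relying on it):

  # never enter any sleep state, no matter what asks for it
  systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

  # and/or remap the hardware triggers in /etc/systemd/logind.conf:
  #   HandleSuspendKey=poweroff
  #   HandleHibernateKey=poweroff
  #   HandleLidSwitch=ignore
  # then: systemctl restart systemd-logind

Masked that way, an accidental hibernate request simply fails instead of
taking the raid down with it.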
> I tried every recovery possibility I know; nothing worked.
>
> Below I list the machine's problem output; it would be very ugly to
> lose this data.
>
> Do you have any further ideas on what I could try to recover my
> archive filesystem?
>
> The archive filesystem is a multi-device btrfs raid5.
Btrfs raid5, or mdraid-5 with btrfs on top? Because it's common
knowledge that btrfs raid56 modes aren't yet fully implemented, and while
they work in normal operation, recovery from a lost device is iffy at
best because the code simply isn't complete for that yet. As such, a
raid5/6 mode btrfs is best effectively considered a raid0 in terms of
reliability: don't count on recovering anything if a single device is
lost, even temporarily. Depending on the circumstances it's not always
/quite/ that bad, but raid0 reliability, or more accurately the lack
thereof, is what you should plan for when you set up a btrfs raid5 or
raid6, because that's effectively what it is until the recovery code is
complete and tested. That way you won't be caught with critical data on
it if it does go south, any more than you would put critical data on a
raid0.
So I /hope/ you meant mdraid-5, with btrfs on top. With that, once the
mdraid level is recovered, you are basically looking at a standard btrfs
recovery as if it were a single device. That's still not a great
position to be in, as you are after all looking at a recovery with a
non-zero chance of failure, but at, let's call it, an 80% chance of
recovery versus a 10% chance, you're still in far better shape than with
btrfs raid5/6 at that point.
If you /did/ mean btrfs raid56 mode, then take a look at the raid56
information on the wiki and the links from there to additional
information on Marc MERLIN's site; he's the regular around here who has
done the most intensive testing of raid56 mode and has written about it
extensively. Other than getting one of the devs to take a personal
interest in your specific case, that's your best chance at recovery.
https://btrfs.wiki.kernel.org/index.php/RAID56
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* BTRFS Raid5/6 Recovery Problem after accidental hibernation
@ 2014-11-12 17:26 ` Juergen Sauer
From: Juergen Sauer @ 2014-11-12 17:26 UTC
To: Btrfs BTRFS
Was: Re: transid failed / mount Problem on Linux pc6 3.17.2-1-ARCH #1
SMP PREEMPT Thu Oct 30 20:49:39 CET 2014 x86_64 GNU/Linux
On 11.11.2014 at 23:39, Duncan wrote:
> Tho after early experiences I switched to mdraid-1 some time in the past,
[...]
> So if it's really critical, as it arguably might be on an archive
> machine, I'd consider pointing whatever suspend/hibernate triggers at
> shutdown or reboot instead. If it's not possible to accidentally
> hibernate the thing, because the trigger does a shutdown/reboot
> instead, it can't be accidentally hibernated. =:^)
:)
I hadn't realized before that hibernation is this flaky. It doesn't
matter to me anymore; I have disabled it.
It's the old story: a test machine ended up being used for production
because it worked so fantastically well. Btrfs in the raid5/6 scenario
really has great potential.
> Btrfs raid5, or mdraid-5 with btrfs on top?
It's not mdraid. It's really btrfs raid5/6.
> Because it's common
> knowledge that btrfs raid56 modes aren't yet fully implemented, and while
> they work in normal operation, recovery from a lost device is iffy at
> best because the code simply isn't complete for that yet.
OK, it seems I need to open my "emacs". :)
The recovery code for raid5/6 still has to grow to maturity. I'll see
what I can do, but my kernel-hacking skills are rather stale; I haven't
worked on the kernel for 17 years or so. I just code industrial
automation - crane robotics for cement plants (heavy metal coding, with
cranes > 50 tons and > 10 tons of material in the grip, all running on
Linux :))) ).
> As such, a
> raid5/6 mode btrfs is best effectively considered a raid0 in terms of
[...]
> If you /did/ mean btrfs raid56 mode, then take a look at the raid56
> information on the wiki and the links from there to additional
> information on Marc MERLIN's site,
[...]
> https://btrfs.wiki.kernel.org/index.php/RAID56
I'll follow that hint.
(And I have changed the topic.)
Thank you, Duncan. I see this goes really deep, but I hope that fixing
it may improve btrfs in general. I hope I reach the right contact.
Current Status:
# root@pc6:~# btrfs fi show /dev/sda1
# parent transid verify failed on 209362944 wanted 293924 found 293922
# parent transid verify failed on 209362944 wanted 293924 found 293922
# Check tree block failed, want=209362944, have=209559552
# parent transid verify failed on 209362944 wanted 293924 found 293922
# Ignoring transid failure
# Label: 'archiv' uuid: 48f71e09-6898-4665-bc61-bd7ca4ba4a24
# Total devices 4 FS bytes used 3.35TiB
# devid 1 size 1.70TiB used 726.69GiB path /dev/sdh3
# devid 2 size 1.82TiB used 1.35TiB path /dev/sda1
# devid 3 size 1.82TiB used 1.35TiB path /dev/sdj1
# devid 4 size 1.82TiB used 1.35TiB path /dev/sdi1
# Btrfs v3.17.1
What does "parent transid verify failed on 209362944 wanted 293924 found
293922" really mean?
Trying to mount ends up with:
root@pc6:~# mount -v -t btrfs -o ro,degraded LABEL=archiv /mnt
dmesg log excerpt:
[Nov12 09:44] BTRFS info (device sdi1): allowing degraded mounts
[ +0,000005] BTRFS info (device sdi1): disk space caching is enabled
[ +0,026335] verify_parent_transid: 6 callbacks suppressed
[ +0,000005] parent transid verify failed on 209362944 wanted 293924 found 293922
[ +0,000302] parent transid verify failed on 209362944 wanted 293924 found 293922
[ +0,003393] parent transid verify failed on 244719616 wanted 293924 found 293922
[ +0,000303] parent transid verify failed on 244719616 wanted 293924 found 293922
[ +0,000558] BTRFS: bdev /dev/sda1 errs: wr 3, rd 0, flush 1, corrupt 0, gen 0
[ +0,000402] parent transid verify failed on 209375232 wanted 293924 found 293914
[ +0,000275] parent transid verify failed on 209375232 wanted 293924 found 293914
[ +0,001366] parent transid verify failed on 245538816 wanted 293924 found 293922
[ +0,000269] parent transid verify failed on 245538816 wanted 293924 found 293922
[ +0,033266] parent transid verify failed on 253378560 wanted 293924 found 293914
[ +0,001869] parent transid verify failed on 253378560 wanted 293924 found 293914
[ +0,028920] BTRFS: bad tree block start 0 253911040
[ +0,000013] BTRFS: Failed to read block groups: -5
[ +0,047223] BTRFS: open_ctree failed
It would be great to recover this.
Is anybody from the raid5/6 developer team reading here?
TIA
With kind regards,
Jürgen Sauer
--
Jürgen Sauer - automatiX GmbH
GPG public key for signature verification:
http://www.automatix.de/juergen_sauer_publickey.gpg
* Re: BTRFS Raid5/6 Recovery Problem after accidental hibernation
@ 2014-11-13 8:49 ` Duncan
From: Duncan @ 2014-11-13 8:49 UTC
To: linux-btrfs
Juergen Sauer posted on Wed, 12 Nov 2014 18:26:56 +0100 as excerpted:
> Current Status:
> # root@pc6:~# btrfs fi show /dev/sda1
> # parent transid verify failed on 209362944 wanted 293924 found 293922
> # parent transid verify failed on 209362944 wanted 293924 found 293922
> What does "parent transid verify failed on 209362944 wanted 293924 found
> 293922" really mean?
The wanted 293924 found 293922 bit is simple enough, once you understand
what it's doing.
The basis of it is this: Every full btrfs commit, all the way up to the
root node, is called a generation. These are sequential, counting
upward, with transaction ID or transid being another term meaning exactly
the same thing as generation number. You'll see the two terms used
interchangeably in btrfs documentation and log output.
So what that's saying is that it wanted transid/generation 293924, but
what it found was from two commit transactions earlier, 293922.
I believe the 209362944 is the bytenr, basically the address in the
filesystem.
You can use btrfs-show-super to see the value of the current root
generation as stored in the superblocks, thus getting some idea how far
back this damage appears to be taking you. With commits every 30 seconds
by default if there's data being written, going back two commits as the
above seems to indicate isn't bad. If it's going back a few thousand,
things begin to look much worse.
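A minimal check, run against any one of the devices (the grep is just to
trim the output; exact field names vary a bit between btrfs-progs
versions):

  ./btrfs-show-super /dev/sdh3 | grep -i generation

Compare the generation it prints with the wanted/found numbers in the
transid errors above.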
The Restore wiki page might be of some help at this point. It discusses
find-root (now btrfs-find-root, I believe) as well, which can be used to
see what's going on with these root generations and, if the current one
is corrupt, to find a usable earlier generation. Note that while the
general idea remains, the way the tools work and what they output has
changed a bit since the page was written.
https://btrfs.wiki.kernel.org/index.php/Restore
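For example (a sketch; the output format of btrfs-find-root differs
between versions, but the idea is the same):

  ./btrfs-find-root /dev/sdh3

It scans the device for old tree roots and prints candidate
bytenr/generation pairs; you'd pick the newest generation that still
reads cleanly.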
As covered on that page, what restore does is let you attempt to retrieve
files off an unmounted filesystem that's too damaged to mount read-only
and copy them off properly. The files are restored to another (mounted)
filesystem, which does mean you need enough room on this other filesystem
in order to store them. If the current root is corrupted beyond help,
you can go back in time to previous roots, trying to find a good one,
using the output from find-root.
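Putting that together, the rough workflow is something like this (a
sketch only; <bytenr> stands for a root bytenr reported by
btrfs-find-root, /mnt/big for some other filesystem with enough free
space, and the -l and -D options are covered in the points below):

  ./btrfs restore -l -t <bytenr> /dev/sdh3           # list tree roots there
  ./btrfs restore -D -t <bytenr> /dev/sdh3 /mnt/big  # dry-run first
  ./btrfs restore -t <bytenr> /dev/sdh3 /mnt/big     # then restore for real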
As NOT covered or not covered well on that page, partly because it's
outdated and some of this didn't exist when it was written...
1) Restore's -D|--dry-run option can be very helpful, giving you some
idea of how many files it looks like it can restore when pointed at a
particular root.
2) The -l option, list tree roots, is also very helpful. The wiki page
says to pick the latest tree root with all or most filesystem trees
available; the -l option is how you actually see which trees are
available at a given root. (This part was quite confusing to me when I
was trying to use the page, until I discovered the -l option. Then it
all made sense!)
3) Be aware, when the files are restored, you do get your file data, but
they're written to the new location using the user (root) you're doing
the recovery as, using standard umask permissions. So you lose ownership
and permissions metadata on all restored files, and will need to go thru
and change it back to what you want/need, as necessary.
(Luckily when I did it, I had a backup; it just wasn't current. So I
was able to quickly hack up a script that looped over all the restored
filenames, matched them against those in the backup, and where they
matched, did a chown and a chmod using the backup file as a reference --
see the sketch after this list. The only ones I had to correct manually
were the new files that weren't in the backup.)
4) Similarly, when I did my restore, no symlinks were restored at all.
I make fair use of symlinks, so I had a fair number to recreate by
hand, and a somewhat broken setup until I got all the ones I normally
use back.
5) A different user reported that when he used restore some time ago, he
lost his really small files, too. He believed restore only restored
files that actually had extent data, while btrfs folds really small
files (a few KiB max; the exact size limit depends on certain variables
set at filesystem creation) directly into the metadata, so they don't
actually have data extents. However, I didn't see any of that on my
restore, so either it has been fixed, or perhaps one of my mount options
(autodefrag or compress=lzo, most likely) saved me, or perhaps I simply
didn't notice them, tho I doubt that, as many of the smallest files
would have been config files, I customize heavily, and I didn't notice
missing settings afterward.
6) On large directories, restore will think it got caught in a loop.
A few kernel and userspace cycles ago when I did my restore, it would
just bail out at that point, and I had to redo the same restore command
several times, writing to the same location (without overwrites), in
order to get all the files. When I stopped seeing that error and it
wouldn't restore any more files, I knew I had what I was going to get.
However, I think the behavior has changed now: it simply asks whether it
should continue, instead of bailing out without asking. There's
supposed to be an option to tell it to just keep continuing instead of
asking every few minutes as well, tho I've not actually used it, so I
don't know how it ends up working in practice.
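FWIW, the ownership-fixup script mentioned in point 3 was along these
lines (a from-memory sketch; /mnt/restore and /mnt/backup are made-up
paths standing in for wherever the restored files and the stale backup
actually live):

  #!/bin/bash
  # For every restored file that also exists in the (older) backup,
  # copy ownership and permissions over from the backup copy.
  cd /mnt/restore || exit 1
  find . -print0 | while IFS= read -r -d '' f; do
      ref="/mnt/backup/$f"
      if [ -e "$ref" ]; then
          chown --reference="$ref" -- "$f"
          chmod --reference="$ref" -- "$f"
      fi
  done

Anything newer than the backup still has to be fixed by hand, of course.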
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman