* transid failed / mount Problem on Linux pc6 3.17.2-1-ARCH #1 SMP PREEMPT Thu Oct 30 20:49:39 CET 2014 x86_64 GNU/Linux
@ 2014-11-11 11:13 Juergen Sauer
From: Juergen Sauer @ 2014-11-11 11:13 UTC
To: linux-btrfs
Hi!
This event occurred this morning: the archive machine was accidentally
kicked into hibernation.
After waking the machine, the archive btrfs filesystem was read-only;
after rebooting the system, the "archive" btrfs filesystem was not
mountable anymore.
I tried every recovery possibility I know; nothing worked.
Below I list the machine's problem output; it would be very ugly to
lose this data.
Do you have any further ideas on what I could try to recover my
archive filesystem?
The archive filesystem is a multi-device btrfs raid5.
System:
root@pc6:/usr/src/build/btrfs-progs# uname -a
Linux pc6 3.17.2-1-ARCH #1 SMP PREEMPT Thu Oct 30 20:49:39 CET 2014 x86_64 GNU/Linux
These btrfs tools were in use:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
root@pc6:/usr/src/build/btrfs-progs# ./btrfs fi show
parent transid verify failed on 209362944 wanted 293924 found 293922
parent transid verify failed on 209362944 wanted 293924 found 293922
Check tree block failed, want=209362944, have=209559552
parent transid verify failed on 209362944 wanted 293924 found 293922
Ignoring transid failure
Label: 'archiv' uuid: 48f71e09-6898-4665-bc61-bd7ca4ba4a24
Total devices 4 FS bytes used 3.35TiB
devid 1 size 1.70TiB used 726.69GiB path /dev/sdh3
devid 2 size 1.82TiB used 1.35TiB path /dev/sda1
devid 3 size 1.82TiB used 1.35TiB path /dev/sdj1
devid 4 size 1.82TiB used 1.35TiB path /dev/sdi1
Btrfs v3.17.1
mount -o ro,recovery -t btrfs /dev/sdh3 /mnt
mount: wrong filesystem type, bad options, the superblock of /dev/sdh3
is damaged, missing codepage, or other error
Sometimes the system log provides valuable information -
try dmesg | tail or similar
root@pc6:/usr/src/build/btrfs-progs# dmesg ...
[ 7116.746815] BTRFS info (device sdi1): enabling auto recovery
[ 7116.746820] BTRFS info (device sdi1): disk space caching is enabled
[ 7117.028008] verify_parent_transid: 6 callbacks suppressed
[ 7117.028013] parent transid verify failed on 209362944 wanted 293924 found 293922
[ 7117.028324] parent transid verify failed on 209362944 wanted 293924 found 293922
[ 7117.033188] parent transid verify failed on 244719616 wanted 293924 found 293922
[ 7117.033516] parent transid verify failed on 244719616 wanted 293924 found 293922
[ 7117.034114] BTRFS: bdev /dev/sda1 errs: wr 3, rd 0, flush 1, corrupt 0, gen 0
[ 7117.034557] parent transid verify failed on 209375232 wanted 293924 found 293914
[ 7117.034873] parent transid verify failed on 209375232 wanted 293924 found 293914
[ 7117.037358] parent transid verify failed on 245538816 wanted 293924 found 293922
[ 7117.037702] parent transid verify failed on 245538816 wanted 293924 found 293922
[ 7117.108132] parent transid verify failed on 253378560 wanted 293924 found 293914
[ 7117.108509] parent transid verify failed on 253378560 wanted 293924 found 293914
[ 7117.231038] BTRFS: bad tree block start 0 253911040
[ 7117.231052] BTRFS: Failed to read block groups: -5
[ 7117.290534] BTRFS: open_ctree failed
root@pc6:/usr/src/build/btrfs-progs# btrfs check --repair /dev/sdh3
enabling repair mode
parent transid verify failed on 209362944 wanted 293924 found 293922
parent transid verify failed on 209362944 wanted 293924 found 293922
Check tree block failed, want=209362944, have=209559552
parent transid verify failed on 209362944 wanted 293924 found 293922
Ignoring transid failure
parent transid verify failed on 247873536 wanted 293924 found 293922
parent transid verify failed on 247873536 wanted 293924 found 293922
Check tree block failed, want=247873536, have=248070144
parent transid verify failed on 247873536 wanted 293924 found 293922
Ignoring transid failure
leaf parent key incorrect 247873536
root@pc6:/usr/src/build/btrfs-progs# btrfs-zero-log /dev/sdh3
parent transid verify failed on 209362944 wanted 293924 found 293922
parent transid verify failed on 209362944 wanted 293924 found 293922
Check tree block failed, want=209362944, have=209559552
parent transid verify failed on 209362944 wanted 293924 found 293922
Ignoring transid failure
Unable to find block group for 0
extent-tree.c:289: find_search_start: Assertion `1` failed.
btrfs-zero-log[0x40c0ca]
btrfs-zero-log[0x410eb3]
btrfs-zero-log[0x410f6f]
btrfs-zero-log[0x403361]
btrfs-zero-log[0x403975]
btrfs-zero-log[0x408606]
btrfs-zero-log[0x409d8e]
btrfs-zero-log[0x402542]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f810af82040]
btrfs-zero-log[0x402653]
root@pc6:/usr/src/build/btrfs-progs# btrfs rescue chunk-recover /dev/sdi1
Segmentation fault (core dumped)
Any ideas what else I may check?
TIA ...
With kind regards,
Jürgen Sauer
--
Jürgen Sauer - automatiX GmbH
http://www.automatix.de/juergen_sauer_publickey.gpg
* Re: transid failed / mount Problem on Linux pc6 3.17.2-1-ARCH #1 SMP PREEMPT Thu Oct 30 20:49:39 CET 2014 x86_64 GNU/Linux
@ 2014-11-11 22:39 ` Duncan
From: Duncan @ 2014-11-11 22:39 UTC
To: linux-btrfs
Juergen Sauer posted on Tue, 11 Nov 2014 12:13:41 +0100 as excerpted:
> This event occurred this morning: the archive machine was accidentally
> kicked into hibernation.
>
> After waking the machine, the archive btrfs filesystem was read-only;
> after rebooting the system, the "archive" btrfs filesystem was not
> mountable anymore.
FWIW, I've had similar issues with both mdraid in the past, and with
btrfs now, with both hibernation and suspend-to-ram.
Tho after early experiences I switched to mdraid-1 some time in the past,
and now btrfs raid1 mode, which (even with the more mature mdraid) tends
to be more resilient than raid5 and faster than raid6. At least with
raid1, there's multiple copies of the data, and at least in my
experience, that dramatically increases the reliability of recovery from
temporary or permanent dropout of one device.
The general problem seems to be that in the resume process, some devices
wake up faster than others, and even "awake" devices don't necessarily
fully stabilize for a minute or two. Back on mdraid, I noticed some
devices coming up with model number strings and UIDs that would have
incorrect characters in some position, tho they'd stabilize over time.
Obviously, this plays havoc with kernel efforts to ensure the devices it
woke up to are the same devices it had when it went to sleep (either
suspend to ram or hibernate to disk).
And the same general problems continue to occur with the pair of SSDs I
have now, with suspend-to-ram instead of hibernate, while the original
devices I noticed the problem on were spinning rust of an entirely
different brand.
So it's not a btrfs-specific issue, or a device-specific issue, or a
motherboard-specific issue (since I've upgraded since I first saw it,
too), or a suspend/hibernate-type-specific issue. It's a general issue.
Tho I /have/ noticed on the current equipment that if I suspend for a
relatively short period, an hour or two, it seems to come back with
fewer problems than if I suspend for 6 hours or more -- say while I'm at
work or overnight. (FWIW, the old machine seemed to hibernate and
resume reasonably well other than this but couldn't reliably resume from
suspend, while the new machine is the opposite: I never got it to resume
from hibernation, but other than this, it reliably resumes from suspend.)
Unfortunately, the only reliable solution seems to be to fully shut down
instead of suspending or hibernating, and obviously, after running into
issues a few times, I eventually quit experimenting further. But the
fact that I'm running systemd on fast ssds now does ameliorate the
problem quite a bit, both due to faster booting and by making the lost
cache of a reboot far less of an issue, because reading the data back in
is so much faster on ssd.
So both suspend and hibernate seem to work better with single devices,
where one device being slower to stabilize won't be the issue it is with
raid (either mdraid or btrfs raid); raid just doesn't combine well with
suspend/hibernate. =:^(
Too bad, as being able to suspend and wake up right away was saving on
the electric bill. =:^(
So if it's really critical, as it arguably might be on an archive
machine, I'd consider pointing whatever suspend/hibernate triggers at
shutdown or reboot instead. If it's not possible to accidentally
hibernate the thing, because the trigger does a shutdown/reboot
instead, it can't be accidentally hibernated. =:^)
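FWIW, on a systemd machine like your Arch box, the blunt way to do that
is something like this (a sketch only, from memory -- check it against
your own setup before relying on it):

  # never enter any sleep state, no matter what asks for it
  systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

  # and/or remap the hardware triggers in /etc/systemd/logind.conf:
  #   HandleSuspendKey=poweroff
  #   HandleHibernateKey=poweroff
  #   HandleLidSwitch=ignore
  # then: systemctl restart systemd-logind

Masked that way, an accidental hibernate request simply fails instead of
taking the raid down with it.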
> I tried every recovery possibility I know; nothing worked.
>
> Below I list the machine's problem output; it would be very ugly to
> lose this data.
>
> Do you have any further ideas on what I could try to recover my
> archive filesystem?
>
> The archive filesystem is a multi-device btrfs raid5.
Btrfs raid5, or mdraid-5 with btrfs on top? Because it's common
knowledge that btrfs raid56 modes aren't yet fully implemented, and while
they work in normal operation, recovery from a lost device is iffy at
best because the code simply isn't complete for that yet. As such, a
raid5/6 mode btrfs is best effectively considered a raid0 in terms of
reliability: don't count on recovering anything if a single device is
lost, even temporarily. Depending on the circumstances it's not always
/quite/ that bad, but raid0 reliability, or more accurately the lack
thereof, is what you should plan for when you set up a btrfs raid5 or
raid6, because that's effectively what it is until the recovery code is
complete and tested. That way you won't be caught with critical data on
it if it does go south, any more than you would put critical data on a
raid0.
So I /hope/ you meant mdraid-5, with btrfs on top. With that, once the
mdraid level is recovered, you are basically looking at a standard btrfs
recovery as if it were a single device. That's still not a great
position to be in, as you are after all looking at a recovery with a
non-zero chance of failure, but at, let's call it, an 80% chance of
recovery versus a 10% chance, you're still in far better shape than with
btrfs raid5/6 at that point.
If you /did/ mean btrfs raid56 mode, then take a look at the raid56
information on the wiki and the links from there to additional
information on Marc MERLIN's site; he's the regular around here who has
done the most intensive testing of raid56 mode and has written about it
extensively. Other than getting one of the devs to take a personal
interest in your specific case, that's your best chance at recovery.
https://btrfs.wiki.kernel.org/index.php/RAID56
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* BTRFS Raid5/6 Recovery Problem after accidental hibernation
@ 2014-11-12 17:26 ` Juergen Sauer
From: Juergen Sauer @ 2014-11-12 17:26 UTC
To: Btrfs BTRFS
Was: Re: transid failed / mount Problem on Linux pc6 3.17.2-1-ARCH #1
SMP PREEMPT Thu Oct 30 20:49:39 CET 2014 x86_64 GNU/Linux
On 11.11.2014 at 23:39, Duncan wrote:
> Tho after early experiences I switched to mdraid-1 some time in the past,
[...]
> So if it's really critical, as it arguably might be on an archive
> machine, I'd consider pointing whatever suspend/hibernate triggers at
> shutdown or reboot instead. If it's not possible to accidentally
> hibernate the thing, because the trigger does a shutdown/reboot
> instead, it can't be accidentally hibernated. =:^)
:)
I hadn't realized before that hibernation is this flaky. It doesn't
matter to me anymore; I have disabled it.
It's the old story: a test machine ended up being used for production
because it worked so fantastically well. Btrfs in the raid5/6 scenario
really has great potential.
> Btrfs raid5, or mdraid-5 with btrfs on top?
It's not mdraid. It's really btrfs raid5/6.
> Because it's common
> knowledge that btrfs raid56 modes aren't yet fully implemented, and while
> they work in normal operation, recovery from a lost device is iffy at
> best because the code simply isn't complete for that yet.
OK, it seems I need to open my "emacs". :)
The recovery code for raid5/6 still has to grow to maturity. I'll see
what I can do, but my kernel-hacking skills are rather stale; I haven't
worked on the kernel for 17 years or so. I just code industrial
automation - crane robotics for cement plants (heavy metal coding, with
cranes > 50 tons and > 10 tons of material in the grip, all running on
Linux :))) ).
> As such, a
> raid5/6 mode btrfs is best effectively considered a raid0 in terms of
[...]
> If you /did/ mean btrfs raid56 mode, then take a look at the raid56
> information on the wiki and the links from there to additional
> information on Marc MERLIN's site,
[...]
> https://btrfs.wiki.kernel.org/index.php/RAID56
I'll follow that hint.
(And I have changed the topic.)
Thank you, Duncan. I see this goes really deep, but I hope that fixing
it may improve btrfs in general. I hope I reach the right contact.
Current Status:
# root@pc6:~# btrfs fi show /dev/sda1
# parent transid verify failed on 209362944 wanted 293924 found 293922
# parent transid verify failed on 209362944 wanted 293924 found 293922
# Check tree block failed, want=209362944, have=209559552
# parent transid verify failed on 209362944 wanted 293924 found 293922
# Ignoring transid failure
# Label: 'archiv' uuid: 48f71e09-6898-4665-bc61-bd7ca4ba4a24
# Total devices 4 FS bytes used 3.35TiB
# devid 1 size 1.70TiB used 726.69GiB path /dev/sdh3
# devid 2 size 1.82TiB used 1.35TiB path /dev/sda1
# devid 3 size 1.82TiB used 1.35TiB path /dev/sdj1
# devid 4 size 1.82TiB used 1.35TiB path /dev/sdi1
# Btrfs v3.17.1
What does "parent transid verify failed on 209362944 wanted 293924 found
293922" really mean?
Trying to mount ends up with:
root@pc6:~# mount -v -t btrfs -o ro,degraded LABEL=archiv /mnt
dmesg log excerpt:
[Nov12 09:44] BTRFS info (device sdi1): allowing degraded mounts
[ +0,000005] BTRFS info (device sdi1): disk space caching is enabled
[ +0,026335] verify_parent_transid: 6 callbacks suppressed
[ +0,000005] parent transid verify failed on 209362944 wanted 293924 found 293922
[ +0,000302] parent transid verify failed on 209362944 wanted 293924 found 293922
[ +0,003393] parent transid verify failed on 244719616 wanted 293924 found 293922
[ +0,000303] parent transid verify failed on 244719616 wanted 293924 found 293922
[ +0,000558] BTRFS: bdev /dev/sda1 errs: wr 3, rd 0, flush 1, corrupt 0, gen 0
[ +0,000402] parent transid verify failed on 209375232 wanted 293924 found 293914
[ +0,000275] parent transid verify failed on 209375232 wanted 293924 found 293914
[ +0,001366] parent transid verify failed on 245538816 wanted 293924 found 293922
[ +0,000269] parent transid verify failed on 245538816 wanted 293924 found 293922
[ +0,033266] parent transid verify failed on 253378560 wanted 293924 found 293914
[ +0,001869] parent transid verify failed on 253378560 wanted 293924 found 293914
[ +0,028920] BTRFS: bad tree block start 0 253911040
[ +0,000013] BTRFS: Failed to read block groups: -5
[ +0,047223] BTRFS: open_ctree failed
It would be great to recover this.
Is anybody from the raid5/6 developer team reading here?
TIA
With kind regards,
Jürgen Sauer
--
Jürgen Sauer - automatiX GmbH
GPG public key for signature verification:
http://www.automatix.de/juergen_sauer_publickey.gpg
* Re: BTRFS Raid5/6 Recovery Problem after accidental hibernation
@ 2014-11-13 8:49 ` Duncan
From: Duncan @ 2014-11-13 8:49 UTC
To: linux-btrfs
Juergen Sauer posted on Wed, 12 Nov 2014 18:26:56 +0100 as excerpted:
> Current Status:
> # root@pc6:~# btrfs fi show /dev/sda1
> # parent transid verify failed on 209362944 wanted 293924 found 293922
> # parent transid verify failed on 209362944 wanted 293924 found 293922
> What does "parent transid verify failed on 209362944 wanted 293924 found
> 293922" really mean?
The wanted 293924 found 293922 bit is simple enough, once you understand
what it's doing.
The basis of it is this: Every full btrfs commit, all the way up to the
root node, is called a generation. These are sequential, counting
upward, with transaction ID or transid being another term meaning exactly
the same thing as generation number. You'll see the two terms used
interchangeably in btrfs documentation and log output.
So what that's saying is that it wanted transid/generation 293924, but
what it found was from two commit transactions earlier, 293922.
I believe the 209362944 is the bytenr, basically the address in the
filesystem.
You can use btrfs-show-super to see the value of the current root
generation as stored in the superblocks, thus getting some idea how far
back this damage appears to be taking you. With commits every 30 seconds
by default if there's data being written, going back two commits as the
above seems to indicate isn't bad. If it's going back a few thousand,
things begin to look much worse.
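A minimal check, run against any one of the devices (the grep is just to
trim the output; exact field names vary a bit between btrfs-progs
versions):

  ./btrfs-show-super /dev/sdh3 | grep -i generation

Compare the generation it prints with the wanted/found numbers in the
transid errors above.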
The Restore wiki page might be of some help at this point. It discusses
find-root (now btrfs-find-root, I believe) as well, which can be used to
see what's going on with these root generations and, if the current one
is corrupt, to find a usable earlier generation. Note that while the
general idea remains, the way the tools work and what they output has
changed a bit since the page was written.
https://btrfs.wiki.kernel.org/index.php/Restore
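For example (a sketch; the output format of btrfs-find-root differs
between versions, but the idea is the same):

  ./btrfs-find-root /dev/sdh3

It scans the device for old tree roots and prints candidate
bytenr/generation pairs; you'd pick the newest generation that still
reads cleanly.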
As covered on that page, what restore does is let you attempt to retrieve
files off an unmounted filesystem that's too damaged to mount read-only
and copy them off properly. The files are restored to another (mounted)
filesystem, which does mean you need enough room on this other filesystem
in order to store them. If the current root is corrupted beyond help,
you can go back in time to previous roots, trying to find a good one,
using the output from find-root.
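Putting that together, the rough workflow is something like this (a
sketch only; <bytenr> stands for a root bytenr reported by
btrfs-find-root, /mnt/big for some other filesystem with enough free
space, and the -l and -D options are covered in the points below):

  ./btrfs restore -l -t <bytenr> /dev/sdh3           # list tree roots there
  ./btrfs restore -D -t <bytenr> /dev/sdh3 /mnt/big  # dry-run first
  ./btrfs restore -t <bytenr> /dev/sdh3 /mnt/big     # then restore for real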
As NOT covered or not covered well on that page, partly because it's
outdated and some of this didn't exist when it was written...
1) Restore's -D|--dry-run option can be very helpful, giving you some
idea of how many files it looks like it can restore when pointed at a
particular root.
2) The -l option, list tree roots, is also very helpful. The wiki page
says to pick the latest tree root with all or most filesystem trees
available; the -l option is how you actually see which trees are
available at a given root. (This part was quite confusing to me when I
was trying to use the page, until I discovered the -l option. Then it
all made sense!)
3) Be aware, when the files are restored, you do get your file data, but
they're written to the new location using the user (root) you're doing
the recovery as, using standard umask permissions. So you lose ownership
and permissions metadata on all restored files, and will need to go thru
and change it back to what you want/need, as necessary.
(Luckily when I did it, I had a backup; it just wasn't current. So I
was able to quickly hack up a script that looped over all the restored
filenames, matched them against those in the backup, and where they
matched, did a chown and a chmod using the backup file as a reference --
see the sketch after this list. The only ones I had to correct manually
were the new files that weren't in the backup.)
4) Similarly, when I did my restore, no symlinks were restored at all.
I make fair use of symlinks, so I had a fair number to recreate by
hand, and a somewhat broken setup until I got all the ones I normally
use back.
5) A different user reported that when he used restore some time ago, he
lost his really small files, too. He believed restore only restored
files that actually had extent data, while btrfs folds really small
files (a few KiB max; the exact size limit depends on certain variables
set at filesystem creation) directly into the metadata, so they don't
actually have data extents. However, I didn't see any of that on my
restore, so either it has been fixed, or perhaps one of my mount options
(autodefrag or compress=lzo, most likely) saved me, or perhaps I simply
didn't notice them, tho I doubt that, as many of the smallest files
would have been config files, I customize heavily, and I didn't notice
missing settings afterward.
6) On large directories, restore will think it got caught in a loop.
A few kernel and userspace cycles ago when I did my restore, it would
just bail out at that point, and I had to redo the same restore command
several times, writing to the same location (without overwrites), in
order to get all the files. When I stopped seeing that error and it
wouldn't restore any more files, I knew I had what I was going to get.
However, I think the behavior has changed now: it simply asks whether it
should continue, instead of bailing out without asking. There's
supposed to be an option to tell it to just keep continuing instead of
asking every few minutes as well, tho I've not actually used it, so I
don't know how it ends up working in practice.
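FWIW, the ownership-fixup script mentioned in point 3 was along these
lines (a from-memory sketch; /mnt/restore and /mnt/backup are made-up
paths standing in for wherever the restored files and the stale backup
actually live):

  #!/bin/bash
  # For every restored file that also exists in the (older) backup,
  # copy ownership and permissions over from the backup copy.
  cd /mnt/restore || exit 1
  find . -print0 | while IFS= read -r -d '' f; do
      ref="/mnt/backup/$f"
      if [ -e "$ref" ]; then
          chown --reference="$ref" -- "$f"
          chmod --reference="$ref" -- "$f"
      fi
  done

Anything newer than the backup still has to be fixed by hand, of course.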
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman