Date: Thu, 17 Sep 2015 12:44:06 +0200
Subject: btrfs progs 4.1.1 & 4.2 segfault on chunk-recover
From: Daniel Wiegert
To: linux-btrfs@vger.kernel.org

Hello guys,

I think I might have found a bug. Lots of text; I don't know what you want from me and what not, so I'll try to get almost everything into one mail. Please don't shoot me! :)

To make a long story somewhat short, this is roughly what happened to me (skip to **** if you don't care about the history):

Arch Linux, btrfs-progs 4.1.1 & 4.2, linux 4.1.6-1

Data, RAID5: total=3.11TiB, used=0.00B   <-- this one said the other day used=3.05TiB
System, RAID1: total=32.00MiB, used=0.00B
Metadata, RAID1: total=8.00GiB, used=144.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

Label: 'Isolinear'  uuid: 9bb3f369-f2a9-46be-8dde-1106ae740e36
	Total devices 9 FS bytes used 144.00KiB
	devid  7 size 2.73TiB used 541.12GiB path /dev/sdi
	devid  9 size 1.36TiB used 533.09GiB path /dev/sdd2
	devid 10 size 1.36TiB used 533.09GiB path /dev/sdg2
	devid 11 size 1.82TiB used 536.12GiB path /dev/sdj2
	devid 12 size 1.82TiB used 538.09GiB path /dev/sdh2
	devid 13 size 286.09GiB used 286.09GiB path /dev/sda3
	devid 14 size 286.09GiB used 286.09GiB path /dev/sdb3
	devid 15 size 372.61GiB used 372.61GiB path /dev/sdf1
	*** Some devices missing

Drive 8 was a 1.36TiB drive; 15 is the new drive I added to the system.

* One of 8 drives started to fail. SMART saw the error, but I had messed up my configuration and didn't get notified. It ran for 3-14 days before I realized.
* On the active, running system I tried btrfs dev del /dev/sd[failing]. It did not work (I think it was csum errors).
* I added one new disk for the RAID, rebooted, added the new disk to the array and tried balancing. Power failure and UPS failure after x hours.
* I rebooted and realized the failing drive was now dead. I could mount the filesystem degraded; some files gave me a kernel panic ( https://goo.gl/photos/UXrZj6YEUW3945b37 ), others read fine.
  - I was unable to dev del missing.

At this point I knew the system was probably broken beyond repair, so I just tried all the commands I could think of: check repair, check init-csum-tree etc. They ended in an endless loop. First the text scrolled very fast, with lots of CPU and not much disk I/O; after ~48h the text was slow, with lots of CPU and almost no disk I/O. The same type of message kept repeating (with new numbers):

-----
ref mismatch on [17959857729536 4096] extent item 0, found 1
adding new data backref on 17959857729536 parent 35277570539520 owner 0 offset 0 found 1
Backref 17959857729536 parent 35277570539520 owner 0 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 17959857729536 parent 35277570539520 owner 0 offset 0 found 1 wanted 0 back 0x145f7800
backpointer mismatch on [17959857729536 4096]
ref mismatch on [17959857733632 4096] extent item 0, found 1
adding new data backref on 17959857733632 parent 35277570785280 owner 0 offset 0 found 1
Backref 17959857733632 parent 35277570785280 owner 0 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 17959857733632 parent 35277570785280 owner 0 offset 0 found 1 wanted 0 back 0x145f7b90
backpointer mismatch on [17959857733632 4096]
-----

****

I found out that chunk-recover segfaults (4.1.1 & kdave 4.2). 4.1.1 said in bt:

#0 0x00000000004251bb in btrfs_new_device_extent_record ()
#1 0x00000000004301cb in ?? ()
#2 0x000000000043085d in ?? ()
#3 0x00007fd8071074a4 in start_thread () from /usr/lib/libpthread.so.0
#4 0x00007fd806e4513d in clone () from /usr/lib/libc.so.6

Not much help, so I compiled https://github.com/kdave/btrfs-progs and got this backtrace: http://pastebin.com/XqRrqAB5

I can repeat the segfault. I made two btrfs-image dumps; one is around 4MB, the other around 300MB, I think.

So, did I find a bug?

I can't find my logs from when the drive first started failing, or what it said when I tried to remove the broken drive. I might be able to try the setup again (I've got one more drive about to fail).

PS: I've tried to get alpine to work, but it won't accept my passwords. I hope the Gmail web client is OK for you guys; the OpenWrt dev team rejected my posts just because of this email client.

Best regards,
Daniel

end