From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from cn.fujitsu.com ([59.151.112.132]:16135 "EHLO heian.cn.fujitsu.com"
	rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP
	id S1750724AbcCWEQ1 (ORCPT ); Wed, 23 Mar 2016 00:16:27 -0400
Subject: Re: csum errors in VirtualBox VDI files
To: Kai Krakow ,
References: <20160322090342.595fefac@jupiter.sol.kaishome.de>
	<56F1068E.6050806@cn.fujitsu.com>
	<20160322194854.161e9c4c@jupiter.sol.kaishome.de>
From: Qu Wenruo
Message-ID: <56F21898.3020101@cn.fujitsu.com>
Date: Wed, 23 Mar 2016 12:16:24 +0800
MIME-Version: 1.0
In-Reply-To: <20160322194854.161e9c4c@jupiter.sol.kaishome.de>
Content-Type: text/plain; charset="utf-8"; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

Kai Krakow wrote on 2016/03/22 19:48 +0100:
> On Tue, 22 Mar 2016 16:47:10 +0800,
> Qu Wenruo wrote:
>
>> Hi,
>>
>> Kai Krakow wrote on 2016/03/22 09:03 +0100:
>>> Hello!
>>>
>>> Since one of the last kernel updates (I don't know which exactly),
>>> I'm experiencing csum errors within VDI files when running
>>> VirtualBox. A side effect of this is that, as soon as dmesg shows
>>> these errors, commands like "du" and "df" hang until reboot.
>>>
>>> I've now restored the file from backup but it happens over and over
>>> again.
>>>
>>> On another machine I'm also seeing errors with big files in the
>>> following scenario (apparently an older kernel, 4.1.x AFAIR):
>>>
>>> # ntfsclone --save /dev/md126p2 -o rescue.ntfs.img
>>>                    ^ big NTFS partition ^ file on btrfs
>>>
>>> results in a write error and the file system goes read-only.
>>
>> When it goes RO, there must be some warning in the kernel log.
>> Would you please paste the kernel log?
>
> Apparently, that system does not boot now due to errors in the bcache
> b-tree. That being said, it may well be some bcache error and not
> btrfs' fault. Apparently I couldn't catch the output; I was in a
> hurry. It said "write error" and had some backtrace. I will come
> back to this later.
>
> Let's go to the system I currently care about (the one with the
> always-breaking VDI file):
>
>>> Both systems have in common that they are using btrfs on bcache
>>> with compress=lzo,autodefrag,nossd,discard (mraid=1,draid=0 and
>>> mraid=1,draid=single).
>>>
>>> The system mentioned first is running Kernel 4.5.0 with Gentoo
>>> patch-set. I upgraded from the last 4.4.x kernel when I first
>>> experienced this problem. The first time the problem resulted in a
>>> duplicate extent which btrfsck wasn't able to fix; that's when I
>>> first restored from backup. But now I'm getting csum errors in this
>>> file over and over again, plus when rsync has run for backup, the
>>> system no longer responds to "du" and "df" commands - it just hangs.
>>>
>>> Known problem? Does it help if I send debug info? If so, please
>>> instruct.
>>>
>> Does btrfs check report anything wrong?
>
> After the error occurred?
>
> Yes, some text about the extent being compressed and that btrfs repair
> doesn't currently handle that case (I tried --repair as I have a
> backup). I simply decided not to investigate that further at that point
> but to delete and restore the affected file from backup. However, this
> is the message from dmesg (though I didn't catch the backtrace):
>
> btrfs_run_delayed_refs:2927: errno=-17 Object already exists

That's nice, at least we have some clue.

It's almost certainly either a bug in the btrfs kernel code, which
doesn't handle delayed refs well (low probability), or a corrupted fs
that produces something the kernel can't handle (I bet that's the case).

>
> After this, the system went RO and I had to reboot. I ran btrfs check
> and it reported a duplicate extent.

If you can post the btrfsck output, it would help a lot in locating the
problem and in improving btrfsck.

> I identified the file (using
> btrfs inspect and the inode number) as being the VDI file, and restored
> it. Afterwards, I upgraded from latest 4.4 to 4.5.
> I'm now watching more closely since this incident, and the file
> becomes damaged without any message in the kernel log when doing some
> more than usual IO in VirtualBox. When my backup script then runs over
> the file, I get errors about missing csums - the block is not readable.

If btrfsck reports no other problem after your fix, btrfs check
--init-csum-tree would handle this case.

> I now ran
> ddrescue, and replaced the file to get a current and slightly damaged
> VDI image back (my backup uses time rotation, so no problem). But
> running chkdsk in VirtualBox damages the VDI again.
>
> Regarding the other error on the other machine, I'm not completely
> convinced bcache ain't involved in this problem.
>
> As soon as I have "produced" csum errors again, I'll run btrfs check.
> Or should I do it now without forcing the csum error to occur?
>

If possible, please run btrfsck now and post all of its output.

Thanks,
Qu
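
A minimal sketch of how the requested btrfsck output could be captured in
full for posting. Assumptions not in the message above: the helper name
collect_btrfsck, the default log file name, the BTRFS override variable and
the example device path are all hypothetical. The filesystem is assumed to
be unmounted, and no --repair is passed, so the check stays in its default
read-only mode.

```shell
#!/bin/sh
# Sketch only: run a read-only "btrfs check" and keep everything it prints.
# collect_btrfsck, the log file name, and the BTRFS override are hypothetical
# names chosen for this example, not part of btrfs-progs.
collect_btrfsck() {
    dev="$1"                           # block device holding the filesystem
    log="${2:-btrfsck-output.txt}"     # file that receives stdout and stderr
    # No --repair: the default mode of "btrfs check" does not modify the fs.
    ${BTRFS:-btrfs} check "$dev" >"$log" 2>&1
    status=$?
    echo "btrfs check exited with status $status; output saved to $log"
    return "$status"
}

# Example (hypothetical device path; unmount the filesystem first):
#   collect_btrfsck /dev/bcache0
```

Running this against a real device needs root. Saving stderr together with
stdout matters here because the check may print its error messages on
either stream.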