From: Lionel Bouton
To: Qu Wenruo, linux-btrfs@vger.kernel.org
Date: Sat, 22 Jun 2024 10:51:04 +0200
Subject: Re: BUG: scrub reports uncorrectable csum errors linked to readable file (data: single)
X-Mailing-List: linux-btrfs@vger.kernel.org
References: <4525e502-c209-4672-ae32-68296436d204@gmx.com> <1df4ce53-8cf9-40b1-aa43-bf443947c833@bouton.name> <80456d11-9859-402c-a77c-5c3b98b755a5@gmx.com> <05fc8552-1b6f-4b6c-82b2-0cf716cc8e6b@bouton.name>
In-Reply-To: <05fc8552-1b6f-4b6c-82b2-0cf716cc8e6b@bouton.name>

On 18/06/2024 at 23:45, Lionel Bouton wrote:
> On 07/06/2024 at 01:30, Qu Wenruo wrote:
>>
>> On 2024/6/7 08:51, Lionel Bouton wrote:
>>> Hi,
>>>
>>> On 07/06/2024 at 01:05, Qu Wenruo wrote:
>>>>
>>>> On 2024/6/4 23:42, Lionel Bouton wrote:
>>>>> Hi,
>>>>>
>>>>> It seems one of our systems hit a (harmless?) bug while scrubbing.
>>>>>
>>>>> Kernel: 6.6.30 with Gentoo patches (sys-kernel/gentoo-sources-6.6.30 ebuild).
>>>>> btrfs-progs: 6.8.1
>>>>> The system is a VM using QEMU/KVM.
>>>>> The affected filesystem is directly on top of a single virtio-scsi block
>>>>> device (no partition, no LVM, ...) with a Ceph RBD backend.
>>>>> Profiles for metadata: dup, data: single.
>>>>>
>>>>> The scrub process is ongoing but it logged uncorrectable errors:
>>>>>
>>>>> # btrfs scrub status /mnt/store
>>>>> UUID:
>>>>> Scrub started:    Mon Jun  3 09:35:28 2024
>>>>> Status:           running
>>>>> Duration:         26:12:07
>>>>> Time left:        114:45:38
>>>>> ETA:              Sun Jun  9 06:33:13 2024
>>>>> Total to scrub:   18.84TiB
>>>>> Bytes scrubbed:   3.50TiB  (18.59%)
>>>>> Rate:             38.92MiB/s
>>>>> Error summary:    csum=61
>>>>>   Corrected:      0
>>>>>   Uncorrectable:  61
>>>>>   Unverified:     0
>>>>>
>>>>> According to the logs all errors are linked to a single file (renamed to
>>>>> in the attached log) and happened within a couple of seconds.
>>>>> Most errors are duplicates as there are many snapshots of the subvolume
>>>>> it is in. This system is mainly used as a backup server, storing copies
>>>>> of data for other servers and creating snapshots of them.
>>>>>
>>>>> I was about to overwrite the file to correct the problem by fetching it
>>>>> from its source, but I didn't get an error trying to read it
>>>>> (cat > /dev/null). I used diff and md5sum to double check: the file is fine.
>>>>>
>>>>> Result of stat on the file, in the subvolume used as base for the snapshots:
>>>>>   File:
>>>>>   Size: 1799195         Blocks: 3520       IO Block: 4096   regular file
>>>>> Device: 0,36    Inode: 6676106     Links: 1
>>>>> Access: (0644/-rw-r--r--)  Uid: ( 1000/)   Gid: ( 1000/)
>>>>> Access: 2018-02-15 05:22:38.506204046 +0100
>>>>> Modify: 2018-02-15 05:21:35.612914799 +0100
>>>>> Change: 2018-02-15 05:22:38.506204046 +0100
>>>>>  Birth: 2018-02-15 05:22:38.486203934 +0100
>>>>>
>>>>> AFAIK the kernel logged a warning for nearly each snapshot of the same
>>>>> subvolume (they were all created after the file); the subvolume from
>>>>> which snapshots are created is root 257 in the logs. What seems
>>>>> interesting to me, and might help diagnose this, is the pattern for
>>>>> offsets in the warnings:
>>>>> - There are 3 offsets appearing: 20480, 86016, 151552 (exactly 64k
>>>>>   between them).
>>>>> - Most snapshots seem to report a warning for each of them.
>>>>
>>>> This is how btrfs reports a corrupted data sector: it goes through
>>>> all inodes (including ones in different subvolumes) and reports the
>>>> error for each of them.
>>>>
>>>> So it really means that if you want to delete the file, you need to
>>>> remove all the inodes from all the different snapshots.
>>>>
>>>> So if you have a correct copy matching the csum, you can do a more
>>>> dangerous but more straightforward fix by directly overwriting
>>>> the contents on disk.
>>>
>>> I briefly considered doing just that... but then I found out that the
>>> scrub errors were themselves in error and the on-disk data matched
>>> the checksums. When I tried to read the file, not only did the
>>> filesystem not report an IO error (if I'm not mistaken it should if
>>> the csum doesn't match), but the file content matched the original
>>> file fetched from its source.
>>
>> Got it, this is really weird now.
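
As an aside, for anyone wanting to redo this kind of userspace check, a
rough sketch (the paths below are illustrative placeholders; dropping the
page cache first is optional but makes sure the read actually hits the
device rather than cached pages):

  # drop cached pages so the next read goes to the device (needs root)
  sync; echo 3 > /proc/sys/vm/drop_caches
  # a csum mismatch on any sector should surface here as an EIO plus a dmesg error
  cat /mnt/store/path/to/suspect-file > /dev/null
  # compare against a known-good copy fetched from the source
  md5sum /mnt/store/path/to/suspect-file /srv/source/copy-of-suspect-file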
>>
>> What scrub does is read the data from disk (without going through the
>> page cache) and verify it against the checksums.
>>
>> Would it be possible to run "btrfs check --check-data-csum" on the
>> unmounted/RO mounted fs?
>>
>> That would output the error for each corrupted sector (without
>> ratelimit), so that you can record them all.
>> And try to do logical-resolve to find each corrupted location?
>>
>> If btrfs check reports no error, it's 100% sure scrub is to blame.
>
> The results are in, no error reported by btrfs check:
>
> # btrfs check --check-data-csum /dev/sdb
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space tree
> [4/7] checking fs roots
> [5/7] checking csums against data
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> Opening filesystem to check...
> Checking filesystem on /dev/sdb
> UUID: 61e86d80-d6e4-4f9e-a312-885194c5e690
> found 20626845626377 bytes used, no error found
> total csum bytes: 19548745528
> total tree bytes: 83019579392
> total fs tree bytes: 47937191936
> total extent tree bytes: 8051245056
> btree space waste bytes: 18589648011
> file data blocks allocated: 22539270656000
>  referenced 21661227479040
>
> I'll mount the filesystem and run a scrub again to see if I can
> reproduce the problem. It should be noticeably quicker: we made updates
> to the Ceph cluster and should get approximately 2x the I/O bandwidth.
> I plan to keep the disk snapshot for at least several weeks, so if you
> want to test something else just say so.

The scrub is finished, here are the results:

UUID:             61e86d80-d6e4-4f9e-a312-885194c5e690
Scrub started:    Wed Jun 19 00:01:59 2024
Status:           finished
Duration:         81:04:21
Total to scrub:   18.83TiB
Rate:             67.67MiB/s
Error summary:    no errors found

So the scrub error isn't deterministic. I'll shut down the test VM for
now and keep the disk snapshot it uses for at least a couple of weeks in
case it is needed for further tests. The original filesystem is scrubbed
monthly; I'll reply to this message if another error shows up.

>
> Best regards,
> Lionel
>
>>
>> If btrfs check reports an error and logical-resolve fails to locate the
>> file and its position, it means the corruption is in bookend extents.
>>
>> If btrfs check reports an error and logical-resolve can locate the file
>> and position, it's a different problem then.
>>
>> Thanks,
>> Qu
>>
>>>
>>> The stat output shows that this file was not modified since its
>>> creation more than 6 years ago. This is what led me to report a bug
>>> in scrub.
>>>
>>>>
>>>>> - Some snapshots, including the original subvolume, have only logged
>>>>>   2 warnings, and one (root 151335) logged only one warning.
>>>>> - All of the snapshots reported a warning for offset 20480.
>>>>> - When several offsets are logged their order seems random.
>>>>
>>>> One problem with scrub is that it may ratelimit the output, thus we
>>>> cannot just rely on dmesg to know the damage.
>>>
>>> I wondered about this: I've read other threads where ratelimiting is
>>> mentioned but I was not sure if it could apply here. Thanks for
>>> clarifying.
>>>
>>>>>
>>>>> I'm attaching kernel logs beginning with the scrub start and ending
>>>>> with the last error. As of now there are no new errors/warnings even
>>>>> though the scrub is still ongoing, 15 hours after the last error. I
>>>>> didn't clean the log from balance logs linked to the same filesystem.
>>>>>
>>>>> Side-note for the curious, or in the unlikely case it could be linked
>>>>> to the problem:
>>>>> Historically this filesystem was mounted with ssd_spread without any
>>>>> issue (I guess several years ago it made sense to me reading the
>>>>> documentation and given the IO latencies I saw on the Ceph cluster).
>>>>> A recent kernel filled the whole available space with nearly empty
>>>>> block groups which seemed to re-appear each time we mounted with
>>>>> ssd_spread. We have switched to "ssd" since then and there is a
>>>>> mostly daily 90-minute balance to progressively get back to the
>>>>> previous state (this is the reason behind the balance-related logs).
>>>>> Having some space available for new block groups seems to be a good
>>>>> idea, but additionally, as we use Ceph RBD, we want it to be able to
>>>>> deallocate unused space: having many nearly empty block groups could
>>>>> waste resources (especially if the used space in these groups is in
>>>>> <4MB contiguous chunks, which is the default RBD object size).
>>>>>
>>>>> More information:
>>>>> The scrub is run monthly. This is the first time an error was ever
>>>>> reported. The previous scrub was run successfully at the beginning of
>>>>> May with a 6.6.13 kernel.
>>>>>
>>>>> There is a continuous defragmentation process running (it processes
>>>>> the whole filesystem slowly, ignoring snapshots, and defragments file
>>>>> by file based on a fragmentation estimation using filefrag -v). All
>>>>> defragmentations are logged and I can confirm the file for which this
>>>>> error was reported was not defragmented for at least a year (I
>>>>> checked because I wanted to rule out a possible race condition
>>>>> between defragmentation and scrub).
>>>>
>>>> I'm wondering if there is any direct IO conflicting with
>>>> buffered/defrag IO.
>>>>
>>>> It's known that conflicting buffered/direct IO can lead to contents
>>>> changing halfway through, and lead to a btrfs csum mismatch.
>>>> So far that's the only known cause of btrfs csum mismatches I can
>>>> think of.
>>>
>>> But here it seems there isn't an actual mismatch, as reading the file
>>> is possible and gives the data which was written to it 6 years ago.
>>> This seems to be an error in scrub (or a neutrino flipping a bit
>>> somewhere during the scrub). The VM runs on a server with ECC RAM so
>>> it is unlikely to be a simple bitflip, but when everything likely has
>>> been ruled out...
>>>
>>> Thanks for your feedback,
>>> Lionel
>>>
>>>> But 6.x kernels should all log a warning message for it.
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> In addition to the above, among the notable IO are:
>>>>> - a moderately IO intensive PostgreSQL replication subscriber,
>>>>> - occasional large synchronisations using Syncthing,
>>>>> - snapshot creations/removals occur approximately every 80 minutes.
>>>>>   The last snapshot operation was logged 31 minutes before the errors
>>>>>   (removals occur asynchronously but there was no unusual load at
>>>>>   this time that could point to a particularly long snapshot cleanup
>>>>>   process).
>>>>>
>>>>> To sum up, there are many processes accessing the filesystem, but
>>>>> historically it saw far higher IO load during scrubs than what was
>>>>> occurring here.
>>>>>
>>>>> Reproducing this might be difficult: the whole scrub can take a week
>>>>> depending on load.
>>>>> That said, I can easily prepare a kernel and/or new btrfs-progs
>>>>> binaries to launch scrubs or other non-destructive tasks on the
>>>>> week-end or at the end of the day (UTC+2 timezone).
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Lionel Bouton
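
As a reference for the logical-resolve step suggested earlier in the
thread: a rough sketch of how a logical address from a scrub/csum error
in dmesg can be mapped back to the files referencing it (the address,
inode number and mount point below are placeholders, not values from
this report):

  # resolve the "logical" value printed in the error line to file paths
  btrfs inspect-internal logical-resolve 1234567890 /mnt/store
  # if the message gives root/inode/offset instead, the inode can be
  # resolved to a path within the corresponding subvolume
  btrfs inspect-internal inode-resolve 12345 /mnt/store/subvolume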
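
And a last aside on the continuous defragmentation process described
above: a minimal sketch of the kind of filefrag-based check such a
process could rely on (the extent threshold and target extent size are
arbitrary illustrative values, not the actual tooling running on this
system):

  #!/bin/sh
  # rough fragmentation estimate: how many extents a file is split into
  f="$1"
  extents=$(filefrag "$f" | awk '{print $(NF-2)}')
  # only defragment files split into "too many" extents
  if [ "${extents:-0}" -gt 64 ]; then
      btrfs filesystem defragment -t 32M "$f"
  fi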