From: Qu Wenruo
To: Kai Krakow, Qu Wenruo
Cc: linux-btrfs, Oliver Wien
Subject: Re: btrfs crashes during routine btrfs-balance-least-used
Date: Mon, 15 Jul 2024 14:30:05 +0930
X-Mailing-List: linux-btrfs@vger.kernel.org
References: <0bedfc5f-4658-4d01-98b3-34bc14f736f3@gmx.com>

On 2024/7/15 13:59, Kai Krakow wrote:
> Hello Qu!
>
> Thanks for looking into this. We already started the restore, so we no
> longer have any access to the corrupted disk images.
>
> On Sun, 14 July 2024 at 23:54, Qu Wenruo wrote:
>>
>>
>>
>> On 2024/7/15 01:43, Kai Krakow wrote:
>>> Hello btrfs list!
>>>
>>> (also reported in irc)
>>>
>>> Our btrfs pool crashed during a routine btrfs-balance-least-used.
>>> Maybe of interest: bees is also running on this filesystem, snapper
>>> takes hourly snapshots with retention policy.
>>>
>>> I'm currently still collecting diagnostics, "btrfs check" log is
>>> already 3 GB and growing.
>>>
>>> The btrfs runs on three devices vd{c,e,f}1 with data=single meta=raid1.
>>>
>>> Here's an excerpt from dmesg (full log https://gist.tnonline.net/TE):
>>
>> Unfortunately the full log is not really full.
>>
>> There should be extent leaf dump, and after that dump, showing the
>> reason why we believe it's a problem.
>>
>> Is there any true full dmesg dump?
>
> Yes, sorry. The gist has been truncated - mea culpa. I repasted it:
> https://gist.tnonline.net/6Q

Thanks a lot! That contains (almost) all the info we need.

The offending bytenr is 402811572224, and in the dump we indeed have the
item for it:

[1143913.108184] item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79
[1143913.108185] extent refs 3 gen 3678544 flags 1
[1143913.108186] ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1

The last line is the problem.
Firstly, we shouldn't even have a root with such a high value.
Secondly, that root number 13835058055282163977 is a very weird value in
hex too (0xc000000000000109); the '0xc' part has two bits set, which
means it's definitely not a simple single bitflip.

Furthermore, the objectid value is also very weird (0xffffa11315e3).
No wonder the extent tree is not going to handle it correctly.

But I have no idea why this happened. The block passes its csum, so I'm
assuming it's runtime corruption.

[...]
>
>> The other thing is, does the server has ECC memory?
>> It's not uncommon to see bitflips causing various problems (almost
>> monthly reports).
>
> I don't know the exact hosting environment, we are inside of a QEMU
> VM. But I'm pretty sure it is ECC.

And considering it's a virtualized environment, I assume you do not have
any out-of-tree modules loaded?

>
> The disk images are hosted on DRBD, with two redundant remote block
> devices on NVMe RAID. Our VM runs on KVM/QEMU. We are not seeing DRBD
> from within the VM. Because the lower storage layer is redundant, we
> are not running a data raid profile in btrfs but we use multiple block
> devices because we are seeing better latency behavior that way.
>
>> If the machine doesn't have ECC memory, then a memtest would be preferable.
>
> I'll ask our data center operators about ECC but I'm pretty sure the
> answer will be: yes, it's ECC.
>
> We have been using their data centers for 20+ years and have never
> seen a bit flip or storage failure.

Yeah, I no longer think it's hardware corruption either.

>
> I wonder if parallel use of snapper (hourly, with thinning after 24h),
> bees (we are seeing dedup rates of 2:1 - 3:1 for some datasets in the
> hosting services)

Snapshot creation happens at a very specific point (at the end of
transaction commit), so it should not be related to balance operations.

> and btrfs-balance-least-used somehow triggered this.
> I remember some old reports where bees could trigger corruption in
> balance or scrub, and evading that by pausing if it detected it. I
> don't know if this is an issue any longer (kernel 6.6.30 LTS).

No recent bugs come to mind immediately, but even if there were one, the
corruption looks too unusual.

The item still matches the item size and ref count, but in the middle
the data got overwritten with what looks like garbage.

As the higher bytes of a u64 are stored at higher addresses in x86_64
memory (little-endian), the corruption appears to cover the following
bytes:

0                       8                       16
| le64 root             | le64 objectid         |
|09 01 00 00 00 00 00 c0|e3 15 13 a1 ff ff 00 00|
                      ====================

16                      24          28
| le64 offset           | le32 refs |
|00 90 da 04 00 00 00 00|01 00 00 00|

So far the corruption looks like it starts at byte 7 and ends at byte 14.

In theory, if you had kept the image, we could spend enough time to find
out the correct values, but so far it really looks like some garbage
filling the range.

For now what I can do is add extra checks (making sure the root number
seems valid), but that won't really catch all randomly corrupted data.

And as a last guess, did this VM experience any migration?
That's the only non-btrfs thing I can think of that could corrupt data
at runtime.

Thanks,
Qu

>
>
> Thanks,
> Kai
>
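
The byte layout discussed above can be cross-checked with a short Python
sketch. It assumes the extent data backref is laid out as le64 root,
le64 objectid, le64 offset, le32 count, as drawn in the diagram. The
helper names (pack_backref, hexdump) are illustrative only, and the
"intended root 265 (0x109)" is purely a guess used to count how many
bits would differ; the real original value is not known.

```python
import struct

# Values copied from the "ref#0" dmesg line quoted in the mail.
ROOT = 13835058055282163977
OBJECTID = 281473384125923
OFFSET = 81432576
COUNT = 1

# Assumed layout (from the diagram in the mail, not read from kernel
# headers): le64 root, le64 objectid, le64 offset, le32 count.
LAYOUT = "<QQQI"


def pack_backref(root, objectid, offset, count):
    """Return the little-endian bytes of one backref (hypothetical helper)."""
    return struct.pack(LAYOUT, root, objectid, offset, count)


def hexdump(raw):
    """Format bytes as space-separated two-digit hex (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in raw)


raw = pack_backref(ROOT, OBJECTID, OFFSET, COUNT)

print(hex(ROOT))            # 0xc000000000000109
print(hex(OBJECTID))        # 0xffffa11315e3
print(hex(OFFSET))          # 0x4da9000
print(hexdump(raw[0:8]))    # 09 01 00 00 00 00 00 c0
print(hexdump(raw[8:16]))   # e3 15 13 a1 ff ff 00 00
print(hexdump(raw[16:24]))  # 00 90 da 04 00 00 00 00
print(hexdump(raw[24:28]))  # 01 00 00 00

# Guess only: if the intended root had been 265 (0x109), count the bits
# that differ. More than one differing bit rules out a single bitflip.
GUESSED_ROOT = 0x109
flipped_bits = bin(ROOT ^ GUESSED_ROOT).count("1")
print(flipped_bits)         # 2
```

Even under that guess, the root alone differs by two bits, and the
objectid carries further implausible high bytes, which is consistent
with the "not a simple bitflip" conclusion.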