From: Luigi Fabio
Date: Wed, 7 Sep 2022 12:01:45 -0400
Subject: Ext4 filesystem recovery after mdraid failure -2
To: linux-ext4@vger.kernel.org

I am encountering an unusual problem after an mdraid failure. I'll summarise briefly and can provide further details as required.

First, the context. This is happening on a Debian 11 system, amd64 arch, with current updates (kernel 5.10.136-1, util-linux 2.36.1). The system has a 12-drive mdraid RAID5 for data, recently migrated to LSI 2308 HBAs. This is relevant because yesterday, at around 13:00 local (EST), four drives - an entire HBA channel - dropped from the RAID. Of course, mdraid didn't like that and stopped the arrays. I fell back on best practice and shut the system down first of all.

Further context: the filesystem on the array is ancient - I am vaguely proud of that - dating from 2001. It started as ext2, grew to ext3, then to ext4, and finally to ext4 with the 64bit feature. Because I am paranoid, I always mount ext4 with nodelalloc and data=journal; the journal is external, on a RAID1 of SSDs. I recently (within the last ~3 months) enabled metadata_csum, which is relevant to what follows - the filesystem had never had metadata_csum enabled before.
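For reference, the journal and mount setup is along these lines - a sketch from memory, with illustrative device names and mount points rather than the real ones:

    # External journal on the SSD RAID1 (illustrative device names):
    mke2fs -O journal_dev /dev/md/ssd-journal
    tune2fs -J device=/dev/md/ssd-journal /dev/md123

    # Checksums were added later, followed by a full fsck:
    tune2fs -O metadata_csum /dev/md123

    # /etc/fstab entry for the data array:
    /dev/md123  /data  ext4  nodelalloc,data=journal  0  2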
Upon reboot, the arrays would not reassemble - this is expected, because 4/12 drives were marked faulty. So I re-created the array using the same parameters as were used back when it was built. Unfortunately, I had a moment of stupidity and didn't specify metadata 0.90 in the re-create, so it was created with metadata 1.2... which writes its superblock near the beginning of the components, not at the end. I noticed, stopped the array again and re-created it with the correct 0.90, but the damage was done: the v1.2 superblock - 256 bytes plus 2 bytes per device, so 256 + 12*2 = 280 bytes, written 4 KiB from the start of each component, if I read the format documentation correctly - had been written onto each of the 12 components. Still, unless I am mistaken, this just means that at worst the second 4 KiB block of each component was damaged, which shouldn't be too bad. The only further possibility is that mdraid also zeroed out the 'blank space' it leaves AFTER the superblock and BEFORE the data, but according to the documentation it shouldn't do that.

In any case, I subsequently reassembled the array 'correctly', matching the previous order and settings, and I believe I got it right. I kept the array RO and tried fsck -n, which gave me this:

    ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
    fsck.ext4: Group descriptors look bad... trying backup blocks...

It then warns that it won't attempt journal recovery because it's in RO mode, and declares the fs clean - with a reasonable-looking number of files and blocks.

If I try to mount -t ext4 -o ro, I get:

    mount: /mnt: mount(2) system call failed: Structure needs cleaning.

So, before anything else, I tried fsck -nf to make sure that the REST of the filesystem is in one logical piece. THAT painted a very different picture. On pass 1, I get approximately 980k (almost 10^6) instances of

    Inode nnnnn passes checks, but checksum does not match inode

and ~2000 of

    Inode nnnnn contains garbage

plus some 'tree not optimised' messages, which are technically not errors, from what I understand. After ~11 hours it switches to pass 1B, telling me that inode 12 has a long list of duplicate blocks:

    Running additional passes to resolve blocks claimed by more than one inode...
    Pass 1B: Rescanning for multiply-claimed blocks
    Multiply-claimed block(s) in inode 12: 2928004133 [....]

And it ends, after the list of multiply-claimed blocks, with:

    e2fsck: aborted
    Error while scanning inodes (8193): Inode checksum does not match inode
    /dev/md123: ********** WARNING: Filesystem still has errors **********
    /dev/md123: ********** WARNING: Filesystem still has errors **********

So, what is my next step? I realise I should NOT have touched the original drives, and should instead have dd-ed images to a separate array and worked on those - but I believe the only writes that occurred were the mdraid superblocks.

Where do I go from here? Thanks!
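P.S. For precision, the sequence of commands was along these lines - reconstructed from memory, so the chunk size and device names are illustrative rather than a verbatim shell history:

    # The mistaken re-create: metadata defaulted to 1.2
    mdadm --create /dev/md123 --assume-clean --level=5 --raid-devices=12 \
          --chunk=64 /dev/sd[a-l]1

    # Stopped and re-created with the original 0.90 superblock format
    mdadm --stop /dev/md123
    mdadm --create /dev/md123 --assume-clean --metadata=0.90 --level=5 \
          --raid-devices=12 --chunk=64 /dev/sd[a-l]1

    # Everything since has been read-only
    mdadm --readonly /dev/md123
    fsck.ext4 -nf /dev/md123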
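P.P.S. Before experimenting any further, my plan is to work on copies rather than the original drives - something like the following, with illustrative paths:

    # Image each component onto scratch storage before any read-write attempt
    for d in /dev/sd[a-l]1; do
        dd if=$d of=/scratch/$(basename $d).img bs=1M conv=noerror,sync status=progress
    done

    # Attach the images to loop devices and experiment on those instead
    for i in /scratch/*.img; do losetup -f $i; done

If there is a better-established workflow for this situation, I'm happy to follow it instead.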