Date: Sun, 12 Jul 2020 10:22:31 +0100
From: Russell King - ARM Linux admin
To: linux-arm-kernel@lists.infradead.org
Subject: aarch64: ext4 metadata integrity regression in kernels >= 5.5 ?
Message-ID: <20200712092231.GQ1551@shell.armlinux.org.uk>

Hi,

Some will know that over the last six months, I've been seeing problems
on the LX2160A rev 1 with corrupted checksums on an EXT4 filesystem on an
NVMe drive.

I'm not certain exactly which kernels are affected, but I know that 5.1
seems to be fine, while 5.5 onwards seems affected, possibly 5.4 onwards,
maybe earlier.

The symptom is that the kernel will run for some random amount of time
(between a few days and a few months) and then EXT4 will complain with
"iget: checksum invalid" on the root filesystem, either during a
logrotate or a mandb rebuild.

Upon investigation with debugfs and hexdump, it appeared that a single
EXT4 inode in one sector contained an invalid 32-bit checksum. EXT4
splits the 32-bit checksum into two 16-bit halves and stores them in
separate locations in the inode; consequently, any read or update of the
checksum requires two separate reads or writes.

The problem initially seemed to correlate with powering the platform
down as the trigger, and it was suggested that the NVMe was at fault.

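
As an aside on the split checksum described above, here is a minimal Python sketch of how the two 16-bit halves combine into the stored 32-bit value. The byte offsets (0x7C for the low half, 0x80 for i_extra_isize, 0x82 for the high half) are taken from the kernel's ext4 on-disk layout documentation and should be checked against fs/ext4/ext4.h for the kernel in question. The function names are hypothetical, and the sketch only reads and writes the stored field; it does not compute the crc32c itself. The last lines show, purely as an illustration and not as a diagnosis, what the stored value looks like if only one half of a new checksum has landed.

```python
import struct

# Assumed offsets within a raw little-endian ext4 inode (see lead-in).
GOOD_OLD_INODE_SIZE = 128
CSUM_LO_OFF = 0x7C       # l_i_checksum_lo, inside osd2
EXTRA_ISIZE_OFF = 0x80   # i_extra_isize
CSUM_HI_OFF = 0x82       # i_checksum_hi, first field of the extended area


def read_stored_checksum(raw):
    """Return the checksum stored in a raw inode buffer (hypothetical helper)."""
    (lo,) = struct.unpack_from("<H", raw, CSUM_LO_OFF)
    if len(raw) <= GOOD_OLD_INODE_SIZE:
        return lo  # 128-byte inode: only the low half exists
    (extra,) = struct.unpack_from("<H", raw, EXTRA_ISIZE_OFF)
    if GOOD_OLD_INODE_SIZE + extra < CSUM_HI_OFF + 2:
        return lo  # extended area too small to hold the high half
    (hi,) = struct.unpack_from("<H", raw, CSUM_HI_OFF)
    return (hi << 16) | lo


def write_stored_checksum(raw, csum, halves=("lo", "hi")):
    """Store the selected halves of csum; each half is a separate write."""
    if "lo" in halves:
        struct.pack_into("<H", raw, CSUM_LO_OFF, csum & 0xFFFF)
    if "hi" in halves:
        struct.pack_into("<H", raw, CSUM_HI_OFF, csum >> 16)


# Illustration with made-up values: a 256-byte inode with i_extra_isize = 32.
inode = bytearray(256)
struct.pack_into("<H", inode, EXTRA_ISIZE_OFF, 32)
write_stored_checksum(inode, 0x11112222)                  # old checksum, both halves
write_stored_checksum(inode, 0xAAAABBBB, halves=("lo",))  # only the low half of the new one
print(hex(read_stored_checksum(inode)))                   # 0x1111bbbb: matches neither value
```

A value like this would fail verification even though the rest of the inode is intact, which is the kind of state that two separate 16-bit accesses make possible in principle.
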
However, a recent case disproved that theory: the problem appeared to
correct itself after I used "hdparm -f" on the drive. e2fsck then found
no errors on the filesystem, and I could remount the filesystem in
read/write mode. "hdparm -f" syncs the device and flushes the kernel
cache, which it also does when you use "hdparm -t" to measure disk
performance.

My next question was whether it was being caused by PCIe ordering
issues. I've since upgraded the machine to an LX2160A rev 2, which has
yet to show any symptoms of this.

However, the reason for this email is a troubling development with this
problem:

[7478798.720368] EXT4-fs error (device mmcblk0p1): ext4_lookup:1707: inode #157096: comm mandb: iget: checksum invalid
[7478798.729925] Aborting journal on device mmcblk0p1-8.
[7478798.734070] EXT4-fs (mmcblk0p1): Remounting filesystem read-only
[7478798.734589] EXT4-fs error (device mmcblk0p1): ext4_journal_check_start:84: Detected aborted journal

Running "e2fsck -n" on the system without having done anything gives:

Inode 13755 passes checks, but checksum does not match inode.  Fix? no
Inode 157096 passes checks, but checksum does not match inode.  Fix? no

amongst other errors, which are expected for a filesystem that is
normally "in-use". Using "hdparm -f" does not make these errors go away.

The offending inodes found by e2fsck correspond to:

/usr/share/man/nl/man1/apt-transport-mirror.1.gz
/lib/firmware/rtl_bt/rtl8723a_fw.bin

However, just like all the other instances, these would not have changed
recently except for atime updates.

There are a couple of important differences here:

- It is an Armada 8040 system (Clearfog GT-8K) running a 5.6 kernel,
  rather than the LX2160A.
- Its rootfs is on eMMC, not NVMe.

That seems to rule out the NVMe being a cause of the problem, and any
PCIe issues of the LX2160A rev 1.

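
For anyone wanting to check their own logs for the same signature, the following is a rough Python sketch that pulls the device, inode number and command name out of kernel messages, and the inode numbers out of "e2fsck -n" output. The regular expressions are based only on the messages quoted above, so other kernel or e2fsprogs versions may word things differently; the function names are hypothetical.

```python
import re

# Pattern for the kernel message quoted above (format assumed from that one sample).
KERNEL_RE = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): (?P<func>\w+):(?P<line>\d+): "
    r"inode #(?P<ino>\d+): comm (?P<comm>.+?): iget: checksum invalid"
)

# Pattern for the e2fsck message quoted above.
E2FSCK_RE = re.compile(
    r"Inode (?P<ino>\d+) passes checks, but checksum does not match inode"
)


def kernel_checksum_errors(text):
    """Return (device, inode, comm) for each 'iget: checksum invalid' message."""
    return [(m["dev"], int(m["ino"]), m["comm"]) for m in KERNEL_RE.finditer(text)]


def e2fsck_checksum_inodes(text):
    """Return the inode numbers e2fsck reported with a mismatched checksum."""
    return [int(m["ino"]) for m in E2FSCK_RE.finditer(text)]
```

The first function could be fed the output of dmesg or a saved kernel log, and the second the captured output of "e2fsck -n"; comparing the two inode lists shows whether e2fsck sees more affected inodes than the kernel has tripped over so far, as was the case here.
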
Another data point is that I'm also running an Armada 8040 system as a
VM host, which has over a year of uptime, so is on an older kernel (5.1).
This uses EXT4 for its rootfs as well, but is on a SATA SSD, and has not
shown any issues. The VMs it runs are on a later kernel (5.6), also with
EXT4, and have yet to display any symptoms.

The similarities are that the kernel is the same or a similar binary on
the failing systems (I've been running the same kernel config on both),
and that both are Cortex-A72, but slightly different revisions.

So, it's starting to feel like an aarch64 problem, potentially a locking
or ordering issue.

Due to how rare this issue is, investigating it is likely to be very
difficult. However, it seems to be very real, as the symptoms have now
been observed on two rather different aarch64 platforms.

Due to the amount of time required to test, it is very difficult to do
any kind of bisection, or test alternative kernels - it would take
months of runtime for a single test.

I'm chucking this out there so that if anyone else is seeing this
behaviour, they can shout and maybe confirm what I'm seeing.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!