* Data corruption with XFS on Debian 11 and 12 under heavy load.
From: Jose M Calhariz @ 2023-08-29 17:15 UTC
To: linux-xfs

Hi,

I have been chasing a data corruption problem under heavy load on 4
servers in my care. At first I suspected a hardware problem, because
it only happens with RAID 6 disks. So I reported it to Debian:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1032391

Further research pointed to XFS as the common pattern, not a
hardware issue. So I made an informal query to a friend in a software
house that relies heavily on XFS, asking for his thoughts on this
issue. He referred to several problems fixed in kernel 6.2 and a
discussion on this mailing list about backporting the fixes to the
6.1 kernel.

With this information I tried the latest kernel available at the time
in Debian testing, on top of Debian 12, and I could not reproduce the
problem. So I made another bug report:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040416

My questions to this mailing list:

  - Has anyone experienced corruption under heavy load on XFS, under
    Debian or with vanilla kernels?

  - Should I stop waiting for the fixes to be backported to vanilla
    6.1 and run the latest kernel from Debian testing anyway? Bear in
    mind that kernels from testing get fewer timely security updates
    than stable kernels, especially for security issues with limited
    disclosure.

I am happy to provide more info about my setup or my stability tests
that fail under XFS.

Kind regards
Jose M Calhariz

-- 
A false friend never curses at you

A true friend has already called you every swear word
that exists, and has even invented a few new ones
* Re: Data corruption with XFS on Debian 11 and 12 under heavy load.
From: Dave Chinner @ 2023-08-29 21:54 UTC
To: Jose M Calhariz; +Cc: linux-xfs

On Tue, Aug 29, 2023 at 06:15:36PM +0100, Jose M Calhariz wrote:
>
> Hi,
>
> I have been chasing a data corruption problem under heavy load on 4
> servers in my care. At first I suspected a hardware problem, because
> it only happens with RAID 6 disks. So I reported it to Debian:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1032391

Summary: corruption on HW RAID6, not on a separate HW RAID1 volume
on the same controller. A firmware update of the HW RAID controller
made the on-disk corruption on RAID6 volumes go away, but weird
compiler failures still occurred, indicating data corruption was
likely still occurring. Updating the kernel to "bookworm", which runs
a 6.3 kernel, didn't fix the problem.

This smells of corruption occurring on read IO, not on write IO, and
likely a hardware related problem given the change of behaviour with
a firmware update.

> Further research pointed to XFS as the common pattern, not a
> hardware issue. So I made an informal query to a friend in a software
> house that relies heavily on XFS, asking for his thoughts on this
> issue. He referred to several problems fixed in kernel 6.2 and a
> discussion on this mailing list about backporting the fixes to the
> 6.1 kernel.

I can't think of any bug fix we've been talking about backporting to
6.1 that might fix a data corruption? Anything that is a known data
corruption fix normally gets backported pretty quickly (e.g. the
corruption that could be triggered in 6.3.0-6.3.4 kernels had the
fix backported into 6.3.5 as soon as we identified the cause).

> With this information I tried the latest kernel available at the time
> in Debian testing, on top of Debian 12, and I could not reproduce the
> problem. So I made another bug report:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040416

Your test case of make -j 4096 fails on 6.1.27 but does not fail on
6.3.7, which is different behaviour from the above bug.

This time you have a kernel log from the hung task timer that
indicates XFS appears to be hung up waiting for an AGI lock during
inode allocation. This does not indicate any sort of corruption is
occurring - it means either the storage is really slow (i.e. waiting
for IO completion on either the AGI, or IO completion on whatever is
holding the AGI lock) or there has been a deadlock of some kind.
Either way, this sort of thing is not an indication of data
corruption.

You also don't mention what storage hardware this is on - is this
still on the HW RAID6 volumes that were causing issues that you
reported in the first bug above?

----

There's really nothing in either of these bug reports that indicates
that XFS is the root cause, whilst there's plenty of anecdotal
evidence from the first bug to point at storage hardware problems
being the cause.

So, which of these problems is easiest to reproduce on your
machines? Pick one of them and:

- describe the storage hardware stack (BBWC, RAID, caching strategy)
- describe the storage software stack (drbd, lvm, xfs_info for the
  filesystem, etc)
- cpus, memory, etc
- example of a corrupt data file vs a good file (i.e. what is the
  corrupt data that is appearing in the corrupt .o files?)
- find the minimum storage stack that reproduces the problem, and
  determine if the problem reproduces across different storage
  hardware in the same machine.
- if you have known bad and known good kernels, run a bisect and see
  where the problem goes away (e.g. which -rcX kernel between good
  and bad results in the problem going away).
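
[Editorial sketch, not part of the original message.] The "corrupt data file vs a good file" item in the list above could be approached along the following lines. This is a minimal Python illustration under stated assumptions: the helper names `diff_ranges`, `describe`, and `compare_files` are hypothetical, both files are assumed small enough to read into memory (as a .o file would be), and the 512/4096-byte alignment check is only a rough hint about whether damage follows sector or page boundaries.

```python
from typing import Iterator, List, Tuple


def diff_ranges(good: bytes, bad: bytes) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) byte ranges where the buffers differ.

    A length mismatch is reported as a trailing range (merged with a
    differing run that reaches the end of the shorter buffer).
    """
    common = min(len(good), len(bad))
    longest = max(len(good), len(bad))
    start = None
    for i in range(common):
        if good[i] != bad[i]:
            if start is None:
                start = i          # a differing run begins here
        elif start is not None:
            yield (start, i)       # the run ended at the previous byte
            start = None
    if start is not None:
        yield (start, longest)
    elif longest > common:
        yield (common, longest)


def describe(bad: bytes, start: int, end: int) -> str:
    """Summarise one differing range: offset, length, alignment, zero fill."""
    length = end - start
    if start % 4096 == 0 and length % 4096 == 0:
        alignment = "4096-byte aligned"
    elif start % 512 == 0 and length % 512 == 0:
        alignment = "512-byte aligned"
    else:
        alignment = "unaligned"
    chunk = bad[start:end]
    zeroed = bool(chunk) and not any(chunk)
    suffix = ", all zeroes in bad file" if zeroed else ""
    return f"offset {start:#x}, length {length}, {alignment}{suffix}"


def compare_files(good_path: str, bad_path: str) -> List[str]:
    """Read both files fully and describe every differing range."""
    with open(good_path, "rb") as f:
        good = f.read()
    with open(bad_path, "rb") as f:
        bad = f.read()
    return [describe(bad, s, e) for s, e in diff_ranges(good, bad)]
```

Calling `compare_files("good.o", "bad.o")` (placeholder paths) returns one line per damaged range. Block-aligned or zero-filled ranges would be consistent with a problem in the storage stack, although a sketch like this cannot prove where the corruption was introduced.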

> My questions to this mailing list:
>
>   - Has anyone experienced corruption under heavy load on XFS, under
>     Debian or with vanilla kernels?

No. I do long term kernel soak testing on my main workstation with
Debian kernels (i.e. months of uptime, daily use with hundreds of
browser tabs, tens of terminals, multiple VMs, lots of source tree
work, all on XFS filesystems). I've been running this kernel:

Linux devoid 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

on this machine for some months.

>   - Should I stop waiting for the fixes to be backported to vanilla
>     6.1 and run the latest kernel from Debian testing anyway? Bear in
>     mind that kernels from testing get fewer timely security updates
>     than stable kernels, especially for security issues with limited
>     disclosure.

There's nothing to "fix" or backport until we've done root cause
analysis on the failures and identified what is actually causing
your systems to fail.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: Data corruption with XFS on Debian 11 and 12 under heavy load.
From: Darrick J. Wong @ 2023-08-29 23:41 UTC
To: Jose M Calhariz; +Cc: linux-xfs

On Tue, Aug 29, 2023 at 06:15:36PM +0100, Jose M Calhariz wrote:
>
> Hi,
>
> I have been chasing a data corruption problem under heavy load on 4
> servers in my care. At first I suspected a hardware problem, because
> it only happens with RAID 6 disks. So I reported it to Debian:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1032391
>
> Further research pointed to XFS as the common pattern, not a
> hardware issue. So I made an informal query to a friend in a software
> house that relies heavily on XFS, asking for his thoughts on this
> issue. He referred to several problems fixed in kernel 6.2 and a
> discussion on this mailing list about backporting the fixes to the
> 6.1 kernel.
>
> With this information I tried the latest kernel available at the time
> in Debian testing, on top of Debian 12, and I could not reproduce the
> problem. So I made another bug report:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040416
>
> My questions to this mailing list:
>
>   - Has anyone experienced corruption under heavy load on XFS, under
>     Debian or with vanilla kernels?

Yes. There was a rash of corruption problems that got fixed in 6.2:

https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/tag/?h=xfs-6.2-merge-8

My guess with no other information is either the write invalidation
problem in iomap, or maybe COW extent allocations racing with the log.

Most of these haven't been backported to 6.1 because our only choices
as a community were (a) let a dumb bot shovel in patches with zero QA
or (b) try to scare up volunteers to backport things to LTS kernels.

(a) wasn't acceptable, but then with (b)...

>   - Should I stop waiting for the fixes to be backported to vanilla
>     6.1 and run the latest kernel from Debian testing anyway? Bear in
>     mind that kernels from testing get fewer timely security updates
>     than stable kernels, especially for security issues with limited
>     disclosure.

...there isn't really a designated 6.1 LTS backport engineer right now.
A couple of folks from Cloudflare, Amir Goldstein, and Ted Ts'o have
been sharing the work when they have spare time.

--D

> I am happy to provide more info about my setup or my stability tests
> that fail under XFS.
>
>
> Kind regards
> Jose M Calhariz
>
> --
> A false friend never curses at you
>
> A true friend has already called you every swear word
> that exists, and has even invented a few new ones