Data corruption with XFS on Debian 11 and 12 under heavy load.


* Data corruption with XFS on Debian 11 and 12 under heavy load.
@ 2023-08-29 17:15 Jose M Calhariz
  2023-08-29 21:54 ` Dave Chinner
  2023-08-29 23:41 ` Darrick J. Wong
  0 siblings, 2 replies; 3+ messages in thread
From: Jose M Calhariz @ 2023-08-29 17:15 UTC (permalink / raw)
  To: linux-xfs


Hi,

I have been chasing a data corruption problem under heavy load on 4
servers in my care.  At first I suspected a hardware problem because
it only happens with RAID 6 disks.  So I reported it to Debian:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1032391

Further research pointed to XFS as the common factor, not a hardware
issue.  So I informally asked a friend at a software house that relies
heavily on XFS for his thoughts on this issue.  He referred to several
problems fixed in kernel 6.2 and a discussion on this mailing list
about backporting the fixes to the 6.1 kernel.

With this information I tried the latest kernel available at that
time from Debian testing on Debian 12 and could not reproduce the
problem.  So I made another bug report:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040416

My questions to this mailing list:

  - Has anyone experienced corruption under heavy load on XFS, under
  Debian or with vanilla kernels?

  - Should I stop waiting for the fixes to be backported to vanilla
  6.1 and run the latest kernel from Debian testing anyway?  Bear in
  mind that kernels from testing receive security updates less
  promptly than stable kernels, especially for security issues with
  limited disclosure.

I am happy to provide more info about my setup or my stability tests
that fail under XFS.

Kind regards
Jose M Calhariz

-- 
--
A false friend never curses at you

A true friend has already called you every swear word
there is - and even invented a few new ones



* Re: Data corruption with XFS on Debian 11 and 12 under heavy load.
  2023-08-29 17:15 Data corruption with XFS on Debian 11 and 12 under heavy load Jose M Calhariz
@ 2023-08-29 21:54 ` Dave Chinner
  2023-08-29 23:41 ` Darrick J. Wong
  1 sibling, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2023-08-29 21:54 UTC (permalink / raw)
  To: Jose M Calhariz; +Cc: linux-xfs

On Tue, Aug 29, 2023 at 06:15:36PM +0100, Jose M Calhariz wrote:
> 
> Hi,
> 
> I have been chasing a data corruption problem under heavy load on 4
> servers in my care.  At first I suspected a hardware problem because
> it only happens with RAID 6 disks.  So I reported it to Debian:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1032391

Summary: corruption on HW RAID6, not on a separate HW RAID1 volume
on the same controller.

A firmware update of the HW RAID controller made the on-disk
corruption on RAID6 volumes go away, but weird compiler failures
still occurred, indicating that data corruption was likely still
occurring.

Updating the kernel to "bookworm", which runs a 6.3 kernel, didn't
fix the problem.

This smells of corruption occurring on read IO, not on write IO, and
likely a hardware related problem given the change of behaviour with
a firmware update.
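
One rough way to probe the read-side theory is to read the same file
several times and check that every pass returns identical data. The
sketch below is only an illustration under stated assumptions: the
function names are hypothetical, and the posix_fadvise() call is an
advisory hint, so the kernel may still serve some passes from the page
cache rather than from the RAID volume.

```python
import hashlib
import os
import sys


def read_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one full read of `path`."""
    digest = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        # Advisory only: ask the kernel to drop cached pages for this file
        # so the read is more likely to go back to the storage device.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    finally:
        os.close(fd)
    return digest.hexdigest()


def distinct_digests(path, passes=5):
    """Read `path` `passes` times and return the set of digests seen."""
    return {read_digest(path) for _ in range(passes)}


if __name__ == "__main__" and len(sys.argv) > 1:
    seen = distinct_digests(sys.argv[1])
    if len(seen) == 1:
        print("stable: every pass returned the same data")
    else:
        print("UNSTABLE: %d different digests for the same file" % len(seen))
```

If a file that nothing is writing to ever yields more than one digest,
the data is changing on the read path, which would point below the
filesystem rather than at it.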

> Further research pointed to XFS as the common factor, not a hardware
> issue.  So I informally asked a friend at a software house that relies
> heavily on XFS for his thoughts on this issue.  He referred to several
> problems fixed in kernel 6.2 and a discussion on this mailing list
> about backporting the fixes to the 6.1 kernel.

I can't think of any bug fix we've been talking about backporting to
6.1 that might fix a data corruption? Anything that is a known data
corruption fix normally gets backported pretty quickly (e.g. the
corruption that could be triggered in 6.3.0-6.3.4 kernels had the
fix backported into 6.3.5 as soon as we identified the cause).

> With this information I tried the latest kernel available at that
> time from Debian testing on Debian 12 and could not reproduce the
> problem.  So I made another bug report:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040416

Your test case of make -j 4096 fails on 6.1.27 but does not fail on
6.3.7, which is different behaviour from the above bug. This time
you have a kernel log from the hung task timer indicating that XFS
appears to be hung waiting for an AGI lock during inode allocation.

This does not indicate any sort of corruption is occurring - it
means either the storage is really slow (i.e. waiting for IO
completion on either the AGI, or IO completion on whatever is
holding the AGI lock) or there has been a deadlock of some kind.
Either way, this sort of thing is not an indication of data corruption.
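
To keep hangs and corruption apart when going through logs, something
like the following could list the hung task warnings in a saved dmesg
dump. It is a minimal sketch: the helper name is hypothetical, and the
pattern assumes the usual "INFO: task NAME:PID blocked for more than N
seconds" wording that the kernel's hung task detector prints.

```python
import re
import sys

# Matches the warning line printed by the kernel's hung task detector.
HUNG_TASK = re.compile(
    r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)


def hung_tasks(log_text):
    """Return a list of (task_name, pid, seconds) found in `log_text`."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HUNG_TASK.finditer(log_text)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects a file containing saved kernel log output.
    with open(sys.argv[1], "r", errors="replace") as log_file:
        for name, pid, seconds in hung_tasks(log_file.read()):
            print("%s (pid %d) blocked for more than %ds" % (name, pid, seconds))
```

Entries found this way only say that a task waited a long time; they
say nothing about whether any data on disk is wrong.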

You also don't mention what storage hardware this is on - is this
still on the HW RAID6 volumes that were causing issues that you
reported in the first bug above?

----

There's really nothing in either of these bug reports that indicates
that XFS is the root cause, whilst there's plenty of anecdotal
evidence from the first bug to point at storage hardware
problems being the cause.

So, which of these problems is easiest to reproduce on your
machines? Pick one of them and:

- describe the storage hardware stack (BBWC, RAID, caching strategy)
- describe the storage software stack (drbd, lvm, xfs_info for the
  filesystem, etc)
- cpus, memory, etc
- example of a corrupt data file vs a good file (i.e. what is the
  corrupt data that is appearing in the corrupt .o files?)
- find the minimum storage stack that reproduces the problem, and
  determine if the problem reproduces across different storage
  hardware in the same machine.
- if you have known bad and known good kernels, run a bisect and see
  where the problem goes away (e.g. which -rcX kernel between good
  and bad results in the problem going away).
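
For the corrupt-vs-good file comparison, a small script along these
lines would show where the two files differ. This is a sketch under
assumptions: the function name is hypothetical and the 4096 byte block
size is just a guess at a useful granularity. Runs of bad data that
start and end on block or stripe boundaries tend to look like a storage
problem, so the offsets matter as much as the contents.

```python
import sys


def differing_ranges(good_path, bad_path, block_size=4096):
    """Compare two files block by block.

    Returns a list of (offset, length) tuples, one per run of
    consecutive blocks whose contents differ.
    """
    ranges = []
    offset = 0
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        while True:
            a = good.read(block_size)
            b = bad.read(block_size)
            if not a and not b:
                break
            length = max(len(a), len(b))
            if a != b:
                if ranges and ranges[-1][0] + ranges[-1][1] == offset:
                    # Adjacent to the previous bad block: extend that run.
                    ranges[-1] = (ranges[-1][0], ranges[-1][1] + length)
                else:
                    ranges.append((offset, length))
            offset += length
    return ranges


if __name__ == "__main__" and len(sys.argv) > 2:
    for start, length in differing_ranges(sys.argv[1], sys.argv[2]):
        print("differs at offset %d, length %d bytes" % (start, length))
```

Run it with the known good file first and the corrupt file second, then
dump the reported ranges from both files with a hex viewer to see what
the bad data actually looks like.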

> My questions to this mailing list:
> 
>   - Has anyone experienced corruption under heavy load on XFS, under
>   Debian or with vanilla kernels?

No.

I do long term kernel soak testing on my main workstation with
debian kernels (i.e. months of uptime, daily use with hundreds of
browser tabs, tens of terminals, multiple VMs, lots of source tree
work), all on XFS filesystems. I've been running this kernel:

Linux devoid 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

on this machine for some months.

>   - Should I stop waiting for the fixes to be backported to vanilla
>   6.1 and run the latest kernel from Debian testing anyway?  Bear in
>   mind that kernels from testing receive security updates less
>   promptly than stable kernels, especially for security issues with
>   limited disclosure.

There's nothing to "fix" or backport until we've done root cause
analysis on the failures and identified what is actually causing
your systems to fail.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Data corruption with XFS on Debian 11 and 12 under heavy load.
  2023-08-29 17:15 Data corruption with XFS on Debian 11 and 12 under heavy load Jose M Calhariz
  2023-08-29 21:54 ` Dave Chinner
@ 2023-08-29 23:41 ` Darrick J. Wong
  1 sibling, 0 replies; 3+ messages in thread
From: Darrick J. Wong @ 2023-08-29 23:41 UTC (permalink / raw)
  To: Jose M Calhariz; +Cc: linux-xfs

On Tue, Aug 29, 2023 at 06:15:36PM +0100, Jose M Calhariz wrote:
> 
> Hi,
> 
> I have been chasing a data corruption problem under heavy load on 4
> servers in my care.  At first I suspected a hardware problem because
> it only happens with RAID 6 disks.  So I reported it to Debian:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1032391
> 
> Further research pointed to XFS as the common factor, not a hardware
> issue.  So I informally asked a friend at a software house that relies
> heavily on XFS for his thoughts on this issue.  He referred to several
> problems fixed in kernel 6.2 and a discussion on this mailing list
> about backporting the fixes to the 6.1 kernel.
> 
> With this information I tried the latest kernel available at that
> time from Debian testing on Debian 12 and could not reproduce the
> problem.  So I made another bug report:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040416
> 
> My questions to this mailing list:
> 
>   - Has anyone experienced corruption under heavy load on XFS, under
>   Debian or with vanilla kernels?

Yes.  There was a rash of corruption problems that got fixed in 6.2:
https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/tag/?h=xfs-6.2-merge-8

My guess with no other information is either the write invalidation
problem in iomap; or maybe COW extent allocations racing with the log.

Most of these haven't been backported to 6.1 because our only choices as
a community were (a) let a dumb bot shovel in patches with zero QA or
(b) try to scare up volunteers to backport things to LTS kernels.  (a)
wasn't acceptable, but then with (b)...

>   - Should I stop waiting for the fixes to be backported to vanilla
>   6.1 and run the latest kernel from Debian testing anyway?  Bear in
>   mind that kernels from testing receive security updates less
>   promptly than stable kernels, especially for security issues with
>   limited disclosure.

...there isn't really a designated 6.1 LTS backport engineer right now.
A couple folks from Cloudflare; Amir Goldstein; and Ted Ts'o have been
sharing the work when they have spare time.

--D

> I am happy to provide more info about my setup or my stability tests
> that fail under XFS.
> 
> 
> Kind regards
> Jose M Calhariz
> 
> -- 
> --
> A false friend never curses at you
> 
> A true friend has already called you every swear word
> there is - and even invented a few new ones


