Linux CIFS filesystem development
From: Mark A Whiting <whitingm@opentext.com>
To: Enzo Matsumiya <ematsumiya@suse.de>, Steve French <smfrench@gmail.com>
Cc: "Heckmann, Ilja" <heckmann@izw-berlin.de>,
	"linux-cifs@vger.kernel.org" <linux-cifs@vger.kernel.org>,
	"henrique.carvalho@suse.com" <henrique.carvalho@suse.com>
Subject: Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
Date: Thu, 27 Mar 2025 12:48:33 +0000	[thread overview]
Message-ID: <YT1PR01MB945191C652AEE173CEADBA3EB3A12@YT1PR01MB9451.CANPRD01.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <uildcjpqxzc5nckupgdeeifkrqwrau2qxuc2df2uxuyys3i2k2@iz2bmi6yojyu>

Hello,

>Hello,
>
>On 03/26, Steve French wrote:
>>Were you able to confirm that the problem started after 6.6.0 but 
>>regressed before 6.6.9 - any chance of narrowing the regression down by 
>>bisection?
>
>This looks similar to a bug we found in our v6.4-based SLES products.
>
>Bisecting it indicated the regression is
>d08089f649a0 "cifs: Change the I/O paths to use an iterator rather than a page list".
>
>The first good commit we found is a395726cf823 "cifs: fix data corruption in read after invalidate", but it was probably a bit before that (we didn't check further because we couldn't afford to backport all netfs modifications).
>
>This is the fix we used (rebased on top of v6.6.71 tag):
>https://git.exis.tech/linux.git/commit/?h=data_corruption_v6.x&id=8d4c40e084f3d132434d5d3d068175c8db59ce65

I tried following the link, but it gave me a "502 Bad Gateway" error. I also tried the link from my personal machine at home, in case my corporate network was blocking it, with the same result. I don't know how big the patch is; any chance you could just drop it in this thread?

>
>@Ilja @Mark could you test it with your reproducer please?
>@Steve can you try it with the reproducer mentioned in the commit message please?
>

I would be happy to try it out.

Thanks!
Mark

>
>Cheers,
>
>Enzo
>
>>On Wed, Mar 26, 2025 at 5:13 AM Heckmann, Ilja <heckmann@izw-berlin.de> wrote:
>>>
>>> We ran into what is probably the same problem with silent data corruption, which was only noticed thanks to our use of a data format with internal checksums. It also went away when mounting the share with "cache=none" while running kernel 6.6.9, but that had the side effect that no executables could be started from the share (I reported this in June 2024). This second problem was fixed in 6.10, but at the same time mounting with "cache=none" stopped helping against the data corruption issue. It persists to this day with kernel 6.12.8, although the frequency at which the problem manifests has gone down significantly.
>>>
>>> The way we test for it is by running a certain workload 100 times in a loop and counting the number of runs aborted because of errors. That number went down from about 10 per 100 runs with kernel 6.6.9 to about 1 per 100 runs with 6.12.8. Its non-deterministic nature, and our lack of in-house expertise to investigate the issue at the level Mark did, have kept us from reporting it so far. And while there is no way of knowing whether the issue we observe in 6.12.8 is the same one, I can at least confirm that a similar issue exists in more recent kernel versions as well.
>>>
>>> Best wishes,
>>> Ilja Heckmann
>>> ________________________________________
>>> Von: Mark A Whiting <whitingm@opentext.com>
>>> Gesendet: Dienstag, 25. März 2025 22:24:55
>>> An: linux-cifs@vger.kernel.org
>>> Betreff: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when 
>>> writing, x86_64, kernel 6.6.71
>>>
>>> Hello,
>>>
>>> I have discovered a data corruption issue with our application writing to a CIFS share. I believe this issue may be related to another report I saw on this mailing list, https://lore.kernel.org/linux-cifs/DFC1DAC5-5C6C-4DC2-807A-DAF12E4B7882@gmail.com/ . I understand that updating to a newer kernel would likely fix this issue. However, at the moment, that's not an option for us. In the long term we are looking to upgrade to 6.12, but I'm hoping to find a solution for our current 6.6 kernel.
>>>
>>> I have tested mounting with the "cache=none" option and that solves the problem, albeit with a very large performance hit.
>>>
>>> The platform is an embedded system. We're using an off-the-shelf COM Express Type 7 module with an Intel XEON D-1713NT processor. We're running a custom Linux system built using Buildroot, currently running the 6.6.71 kernel. I've tested the latest 6.6.84 kernel and the problem still exists there. Our application is writing large amounts of compressed data (4+ GB) to the network share. When I read back the data to verify it, I'm seeing small portions of the file that have been replaced with zeros.
>>>
>>> I've attacked the issue from several angles, starting with a TCP dump of a complete operation, from mounting, through the data transfer, to unmounting the network share. Through Wireshark I can see that there is no write command to the server covering the sections of the output that end up as zeros. This indicated to me that the CIFS kernel driver is failing to write out portions of the file.
>>>
>>> I then enabled all the CIFS debug info I could via cifsFYI and the kernel dynamic debug controls and tweaked the code to not rate limit the pr_debug calls. I could trace through the resulting logs and find pairs of cifs_write_begin() / cifs_write_end() that covered all the data including the sections that ultimately don't get written out. However, tracing through the smb2_async_writev() messages I again could not find any writes that covered the corrupt portions. At this point I began to suspect some kind of race condition within the cifs_writepages() function.
>>>
>>> I also analyzed the data corruption and noticed a pattern. It does not fail 100% of the time, and it does not always fail in the same place. This furthered my belief that it was some kind of non-deterministic data race. The corrupt data region is always less than a page in size (<4096 bytes), it's always zeros, and it always ends on a page boundary. Because I knew the expected format of the data, I could also tell that the corrupt data was always at the beginning of a write syscall by our application.
>>>
>>> I've attempted to read through the CIFS kernel code involved in this, but I've never worked in the VFS/filesystem layers before, and I'm having trouble following the intricacies of the page cache, page dirtying/cleaning, and writeback.
>>>
>>> My current best guess at what's happening is as follows:
>>>     * Our application writes out a buffer of data to the file on a CIFS share. This is compressed data that isn't nicely aligned, so the data does not end on a page boundary. This is a newly created file we are writing to, so this write extends the file's EOF to the end of the newly written data, which lands in the middle of a page in the cache.
>>>     * cifs_writepages() is invoked to write the cached data back to the server; it scans the cached pages and prepares to write out all of the dirty pages (including the final partial page).
>>>     * Our application performs another write. This extends the file and the beginning of this write falls into the end of the previous final partial cached page.
>>>     * cifs_writepages() finishes writing out the dirty pages, including the first portion of what it thought was the final partial page, and marks all pages as clean.
>>>     * On the next invocation of cifs_writepages(), it scans for dirty pages and skips the beginning of the second write because it thinks that page is clean. The following page is a completely new page and is dirty, so it starts a new write from that page. This would explain why the corruption is always at the beginning of our application's write and corrects itself at the next page boundary.
>>>
>>> I have yet to really prove this, but this type of race between dirty/clean pages would explain all the behavior I'm seeing. I'm hoping someone much more intimately familiar with the CIFS code can help point me in the right direction.
>>>
>>> I did try one quick-and-dirty fix: assuming it was a race, I applied the following patch. It adds a per-inode mutex that completely serializes the cifs_write_begin(), cifs_write_end(), and cifs_writepages() functions. This did seem to resolve the data corruption issue, but at the cost of occasional deadlocks when writing to CIFS files.
>>>
>>> > diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c
>>> > index bbb0ef18d7b8..6e2e273b9838 100644
>>> > --- a/fs/smb/client/cifsfs.c
>>> > +++ b/fs/smb/client/cifsfs.c
>>> > @@ -1659,6 +1659,7 @@ cifs_init_once(void *inode)
>>> >
>>> >       inode_init_once(&cifsi->netfs.inode);
>>> >       init_rwsem(&cifsi->lock_sem);
>>> > +     mutex_init(&cifsi->tbl_write_mutex);
>>> >  }
>>> >
>>> >  static int __init
>>> > diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h
>>> > index 43b42eca6780..4af4c5036d81 100644
>>> > --- a/fs/smb/client/cifsglob.h
>>> > +++ b/fs/smb/client/cifsglob.h
>>> > @@ -1606,6 +1606,17 @@ struct cifsInodeInfo {
>>> >       bool lease_granted; /* Flag to indicate whether lease or oplock is granted. */
>>> >       char *symlink_target;
>>> >       __u32 reparse_tag;
>>> > +
>>> > +     /* During development we discovered what we believe to be a race condition
>>> > +      * in the write caching behavior of cifs. Setting cache=none solved the
>>> > +      * issue but with an unacceptable performance hit. The following mutex was
>>> > +      * added to serialize the cifs_write_begin, cifs_write_end, and
>>> > +      * cifs_writepages functions in file.c. This appears to solve the issue
>>> > +      * without completely disabling caching.
>>> > +      *
>>> > +      * -Mark Whiting (whitingm@opentext.com)
>>> > +      */
>>> > +     struct mutex tbl_write_mutex;
>>> >  };
>>> >
>>> >  static inline struct cifsInodeInfo *
>>> > diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
>>> > index cb75b95efb70..d3bc652a7e65 100644
>>> > --- a/fs/smb/client/file.c
>>> > +++ b/fs/smb/client/file.c
>>> > @@ -3085,6 +3085,7 @@ static int cifs_writepages(struct address_space *mapping,
>>> >  {
>>> >       loff_t start, end;
>>> >       int ret;
>>> > +     mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex);
>>> >
>>> >       /* We have to be careful as we can end up racing with setattr()
>>> >        * truncating the pagecache since the caller doesn't take a lock here
>>> > @@ -3119,6 +3120,7 @@ static int cifs_writepages(struct address_space *mapping,
>>> >       }
>>> >
>>> >  out:
>>> > +     mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex);
>>> >       return ret;
>>> >  }
>>> >
>>> > @@ -3174,6 +3176,8 @@ static int cifs_write_end(struct file *file, struct address_space *mapping,
>>> >       struct folio *folio = page_folio(page);
>>> >       __u32 pid;
>>> >
>>> > +     mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex);
>>> > +
>>> >       if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
>>> >               pid = cfile->pid;
>>> >       else
>>> > @@ -3233,6 +3237,7 @@ static int cifs_write_end(struct file *file, struct address_space *mapping,
>>> >       /* Indication to update ctime and mtime as close is deferred */
>>> >       set_bit(CIFS_INO_MODIFIED_ATTR, &CIFS_I(inode)->flags);
>>> >
>>> > +     mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex);
>>> >       return rc;
>>> >  }
>>> >
>>> > @@ -4905,6 +4910,7 @@ static int cifs_write_begin(struct file *file, struct address_space *mapping,
>>> >       int rc = 0;
>>> >
>>> >       cifs_dbg(FYI, "write_begin from %lld len %d\n", (long long)pos, len);
>>> > +     mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex);
>>> >
>>> >  start:
>>> >       page = grab_cache_page_write_begin(mapping, index);
>>> > @@ -4965,6 +4971,7 @@ static int cifs_write_begin(struct file *file, struct address_space *mapping,
>>> >                  this will be written out by write_end so is fine */
>>> >       }
>>> >  out:
>>> > +     mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex);
>>> >       *pagep = page;
>>> >       return rc;
>>> >  }
>>>
>>> Here are some of the log excerpts for one of my test cases. In this file one of the corrupt regions starts at file offset 1,074,214,474 (0x4007364A), and was corrupt for 2,486 bytes, ending on a page boundary. First there is a section of the log trimmed to just the cifs_write_begin() / cifs_write_end() functions. You can see that there is a write shown at the exact offset/length of the corrupted data.
>>>
>>> > Mar 25 15:25:39 TX2 kernel: [  124.080900] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074212864 len 1610
>>> > Mar 25 15:25:39 TX2 kernel: [  124.080906] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 0000000086519afd from pos 1074212864 with 1610 bytes
>>> > Mar 25 15:25:39 TX2 kernel: [  124.080911] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074214474 len 2486
>>> > Mar 25 15:25:39 TX2 kernel: [  124.080916] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 0000000086519afd from pos 1074214474 with 2486 bytes
>>> > Mar 25 15:25:39 TX2 kernel: [  124.080917] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074216960 len 846
>>> > Mar 25 15:25:39 TX2 kernel: [  124.080924] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 00000000880cee03 from pos 1074216960 with 846 bytes
>>>
>>> Now here's a section of the log trimmed to just the smb2_async_writev() function. You can see writes covering the data immediately before and after the corrupted region, but there is no write to the corrupted region. I'm assuming the corrupted region is always zeros because the server is extending and zero-filling the file to the new write offset after the gap of the missing write.
>>>
>>> > Mar 25 15:25:39 TX2 kernel: [  123.829696] [1635] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1072214016 988260 bytes iter=f1464
>>> > Mar 25 15:25:39 TX2 kernel: [  124.081016] [1636] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1073201152 1013322 bytes iter=f764a
>>> ** Missing write: 1073201152 + 1013322 = 1074214474 **
>>> > Mar 25 15:25:39 TX2 kernel: [  124.083901] [1636] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1074216960 39564 bytes iter=9a8c
>>> > Mar 25 15:25:40 TX2 kernel: [  124.340557] [1637] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1074253824 1237843 bytes iter=12e353
>>>
>>> I can very easily reproduce this with our application. If anyone has suggestions to try, or additional logging/tracing they would like me to perform, please let me know. I can provide fuller, more detailed logs if desired, but they're quite large. I'll continue to read through the code, and if I find anything I will update you.
>>>
>>> Thanks,
>>> Mark Whiting
>>>
>>>
>>
>>
>>--
>>Thanks,
>>
>>Steve
>>

