Linux CIFS filesystem development
* [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
@ 2025-03-25 21:24 Mark A Whiting
  2025-03-26 10:11 ` AW: [[ EXT ]] " Heckmann, Ilja
  0 siblings, 1 reply; 29+ messages in thread
From: Mark A Whiting @ 2025-03-25 21:24 UTC (permalink / raw)
  To: linux-cifs@vger.kernel.org

Hello,

I have discovered a data corruption issue with our application writing to a CIFS share. I believe this issue may be related to another report I saw on this mailing list, https://lore.kernel.org/linux-cifs/DFC1DAC5-5C6C-4DC2-807A-DAF12E4B7882@gmail.com/. I understand that updating to a newer kernel would likely fix this issue. However, at the moment, that's not an option for us. In the long term we are looking to upgrade to 6.12 but I'm hoping to find a solution for our current 6.6 kernel.

I have tested mounting with the "cache=none" option and that solves the problem, albeit with a very large performance hit.
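For reference, the workaround mount looks roughly like this; the server, share, mount point, and credentials below are placeholders, and only cache=none is the relevant part:

```shell
# Workaround: bypass the client-side page cache entirely.
# Server/share/credentials are placeholders for our real setup.
mount -t cifs //fileserver/export /mnt/share \
    -o cache=none,vers=3.1.1,username=appuser,password=secret
```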

The platform is an embedded system. We're using an off-the-shelf COM Express Type 7 module with an Intel Xeon D-1713NT processor. We're running a custom Linux system built with Buildroot, currently on the 6.6.71 kernel. I've tested the latest 6.6.84 kernel and the problem still exists there. Our application writes large amounts of compressed data (4+ GB) to the network share. When I read the data back to verify it, I'm seeing small portions of the file that have been replaced with zeros.

I've attacked the issue from several angles, starting with a TCP dump of a complete operation: mounting, data transfer, and unmounting the network share. In Wireshark I can see that there is no write command to the server covering the sections of the output that end up as zeros. This indicated to me that the CIFS kernel driver is failing to write out portions of the file.
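For anyone who wants to reproduce the capture, the command was along these lines (interface name and server address are placeholders for our setup):

```shell
# Capture all SMB traffic (TCP port 445) for the full
# mount / write / unmount cycle, with no snap-length truncation.
tcpdump -i eth0 -s 0 -w cifs-capture.pcap \
    host 192.168.1.10 and tcp port 445
```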

I then enabled all the CIFS debug info I could via cifsFYI and the kernel dynamic debug controls and tweaked the code to not rate limit the pr_debug calls. I could trace through the resulting logs and find pairs of cifs_write_begin() / cifs_write_end() that covered all the data including the sections that ultimately don't get written out. However, tracing through the smb2_async_writev() messages I again could not find any writes that covered the corrupt portions. At this point I began to suspect some kind of race condition within the cifs_writepages() function.
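For reference, the debug knobs I poked look roughly like this (the paths are the standard procfs/debugfs ones; the exact bitmask value you want for cifsFYI may differ on your kernel):

```shell
# Enable verbose CIFS client debugging via the cifsFYI bitmask.
echo 7 > /proc/fs/cifs/cifsFYI
# Turn on all pr_debug() call sites in the CIFS client.
echo 'file fs/smb/client/* +p' > /sys/kernel/debug/dynamic_debug/control
# Raise the console log level so debug messages are visible.
echo 8 > /proc/sys/kernel/printk
```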

I also analyzed the data corruption and noticed a pattern. It does not fail 100% of the time, and it does not always fail in the same place. This furthered my belief that it was some kind of non-deterministic data race. The corrupt data region is always less than a page in size (<4096 bytes), it's always zeros, and it always ends on a page boundary. Because I knew the expected format of the data, I could also tell that the corrupt data was always at the beginning of a write syscall by our application.
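The check I ran over the read-back data can be sketched like this (the function names are my own; this is a standalone userspace checker, not anything from the kernel tree). It flags zero runs shorter than a page that end exactly on a 4096-byte boundary, which is the signature described above:

```python
PAGE_SIZE = 4096

def zero_runs(data):
    """Yield (offset, length) for each maximal run of zero bytes."""
    start = None
    for i, b in enumerate(data):
        if b == 0 and start is None:
            start = i
        elif b != 0 and start is not None:
            yield (start, i - start)
            start = None
    if start is not None:
        yield (start, len(data) - start)

def suspicious_runs(data, min_len=16):
    """Zero runs shorter than a page that end on a page boundary."""
    return [(off, ln) for off, ln in zero_runs(data)
            if min_len <= ln < PAGE_SIZE
            and (off + ln) % PAGE_SIZE == 0]
```

Running suspicious_runs() over a known-good copy of the file returns nothing; running it over a corrupted copy reports exactly the regions I describe above.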

I've attempted to read through the CIFS kernel code involved in this, but I've never worked in the VFS/filesystem layers before, and I'm having trouble following the intricacies of the page cache, page dirtying/cleaning, and writeback.

My current best guess at what's happening is as follows:
    * Our application writes a buffer of data to the file on the CIFS share. This is compressed data that isn't nicely aligned, so the write does not end on a page boundary. Since this is a newly created file, the write extends the file's EOF to the end of the newly written data, which falls in the middle of a page in the cache.
    * cifs_writepages() is invoked to write the cached data back to the server, it scans the cached pages and prepares to write out all the dirty pages (including the final partial page).
    * Our application performs another write. This extends the file and the beginning of this write falls into the end of the previous final partial cached page.
    * cifs_writepages() finishes writing out the dirty pages, including the first portion of what it thought was the final partial page, and marks all pages as clean.
    * On the next invocation of cifs_writepages(), it scans for dirty pages and skips the beginning of the second write because it thinks that page is clean. The following page is a completely new page and is dirty, so it starts a new write from that page. This would explain why the corruption is always at the beginning of our application's write and corrects itself at the next page boundary.
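To convince myself the interleaving is at least plausible, I modeled the suspected sequence as a toy userspace simulation. All of the names here are invented; this is not the actual kernel data structure, just the dirty-bit bookkeeping I suspect is going wrong:

```python
PAGE_SIZE = 4096

class ToyPageCache:
    """Toy model: tracks per-page dirty flags and a writeback that
    snapshots the dirty set before sending, as I suspect
    cifs_writepages() effectively does."""
    def __init__(self):
        self.dirty = set()
        self.written = []   # (page_index, valid_bytes) sent to "server"
        self.eof = 0

    def app_write(self, pos, length):
        # Dirty every page the write touches and advance EOF.
        for page in range(pos // PAGE_SIZE,
                          (pos + length - 1) // PAGE_SIZE + 1):
            self.dirty.add(page)
        self.eof = max(self.eof, pos + length)

    def writeback_scan(self):
        # Phase 1: scan and snapshot the dirty pages (and the EOF).
        return sorted(self.dirty), self.eof

    def writeback_finish(self, pages, eof_at_scan):
        # Phase 2: send data only up to the EOF seen at scan time,
        # then mark the pages clean -- even if the application
        # re-dirtied the tail page in between. The suspected lost update.
        for page in pages:
            valid = max(0, min(PAGE_SIZE, eof_at_scan - page * PAGE_SIZE))
            self.written.append((page, valid))
            self.dirty.discard(page)

cache = ToyPageCache()
cache.app_write(0, 6000)                  # write 1 ends mid-page 1
pages, eof_at_scan = cache.writeback_scan()
cache.app_write(6000, 6000)               # write 2 starts inside page 1
cache.writeback_finish(pages, eof_at_scan)
print(1 in cache.dirty)                   # False: page 1's new data is lost
```

After this sequence, only 1904 of page 1's bytes were ever sent, yet page 1 is marked clean, so bytes 6000-8191 are never written out. That is exactly the shape of the corruption I'm seeing.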

I have yet to really prove this, but this type of race between dirty/clean pages would explain all the behavior I'm seeing. I'm hoping someone much more intimately familiar with the CIFS code can help point me in the right direction.

I did try one quick-and-dirty fix. Assuming it was a race, I applied the following patch, which adds a per-inode mutex that completely serializes the cifs_write_begin(), cifs_write_end(), and cifs_writepages() functions. This did seem to resolve the data corruption, but at the cost of occasional deadlocks when writing to CIFS files.

> diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c
> index bbb0ef18d7b8..6e2e273b9838 100644
> --- a/fs/smb/client/cifsfs.c
> +++ b/fs/smb/client/cifsfs.c
> @@ -1659,6 +1659,7 @@ cifs_init_once(void *inode)
>  
>  	inode_init_once(&cifsi->netfs.inode);
>  	init_rwsem(&cifsi->lock_sem);
> +	mutex_init(&cifsi->tbl_write_mutex);
>  }
> 
>  static int __init
> diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h
> index 43b42eca6780..4af4c5036d81 100644
> --- a/fs/smb/client/cifsglob.h
> +++ b/fs/smb/client/cifsglob.h
> @@ -1606,6 +1606,17 @@ struct cifsInodeInfo {
>  	bool lease_granted; /* Flag to indicate whether lease or oplock is granted. */
>  	char *symlink_target;
>  	__u32 reparse_tag;
> +
> +	/* During development we discovered what we believe to be a race condition
> +	 * in the write caching behavior of cifs. Setting cache=none solved the
> +	 * issue but with an unacceptable performance hit. The following mutex was
> +	 * added to serialize the cifs_write_begin, cifs_write_end, and
> +	 * cifs_writepages functions in file.c. This appears to solve the issue
> +	 * without completely disabling caching.
> +	 *
> +	 * -Mark Whiting (whitingm@opentext.com)
> +	 */
> +	struct mutex tbl_write_mutex;
>  };
>  
>  static inline struct cifsInodeInfo *
> diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
> index cb75b95efb70..d3bc652a7e65 100644
> --- a/fs/smb/client/file.c
> +++ b/fs/smb/client/file.c
> @@ -3085,6 +3085,7 @@ static int cifs_writepages(struct address_space *mapping,
>  {
>  	loff_t start, end;
>  	int ret;
> +	mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex);
>  
>  	/* We have to be careful as we can end up racing with setattr()
>  	 * truncating the pagecache since the caller doesn't take a lock here
> @@ -3119,6 +3120,7 @@ static int cifs_writepages(struct address_space *mapping,
>  	}
>  
>  out:
> +	mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex);
>  	return ret;
>  }
>  
> @@ -3174,6 +3176,8 @@ static int cifs_write_end(struct file *file, struct address_space *mapping,
>  	struct folio *folio = page_folio(page);
>  	__u32 pid;
>  
> +	mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex);
> +
>  	if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
>  		pid = cfile->pid;
>  	else
> @@ -3233,6 +3237,7 @@ static int cifs_write_end(struct file *file, struct address_space *mapping,
>  	/* Indication to update ctime and mtime as close is deferred */
>  	set_bit(CIFS_INO_MODIFIED_ATTR, &CIFS_I(inode)->flags);
>  
> +	mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex);
>  	return rc;
>  }
>  
> @@ -4905,6 +4910,7 @@ static int cifs_write_begin(struct file *file, struct address_space *mapping,
>  	int rc = 0;
>  
>  	cifs_dbg(FYI, "write_begin from %lld len %d\n", (long long)pos, len);
> +	mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex);
>  
>  start:
>  	page = grab_cache_page_write_begin(mapping, index);
> @@ -4965,6 +4971,7 @@ static int cifs_write_begin(struct file *file, struct address_space *mapping,
>  		   this will be written out by write_end so is fine */
>  	}
>  out:
> +	mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex);
>  	*pagep = page;
>  	return rc;
>  }

Here are some of the log excerpts for one of my test cases. In this file one of the corrupt regions starts at file offset 1,074,214,474 (0x4007364A), and was corrupt for 2,486 bytes, ending on a page boundary. First there is a section of the log trimmed to just the cifs_write_begin() / cifs_write_end() functions. You can see that there is a write shown at the exact offset/length of the corrupted data.

> Mar 25 15:25:39 TX2 kernel: [  124.080900] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074212864 len 1610
> Mar 25 15:25:39 TX2 kernel: [  124.080906] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 0000000086519afd from pos 1074212864 with 1610 bytes
> Mar 25 15:25:39 TX2 kernel: [  124.080911] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074214474 len 2486
> Mar 25 15:25:39 TX2 kernel: [  124.080916] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 0000000086519afd from pos 1074214474 with 2486 bytes
> Mar 25 15:25:39 TX2 kernel: [  124.080917] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074216960 len 846
> Mar 25 15:25:39 TX2 kernel: [  124.080924] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 00000000880cee03 from pos 1074216960 with 846 bytes

Now here's a section of the log trimmed to just the smb2_async_writev() function. You can see writes covering the data immediately before and after the corrupted region, but there is no write to the corrupted region. I'm assuming the corrupted region is always zeros because the server is extending and zero-filling the file to the new write offset after the gap of the missing write.

> Mar 25 15:25:39 TX2 kernel: [  123.829696] [1635] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1072214016 988260 bytes iter=f1464
> Mar 25 15:25:39 TX2 kernel: [  124.081016] [1636] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1073201152 1013322 bytes iter=f764a
** Missing write: 1073201152 + 1013322 = 1074214474 **
> Mar 25 15:25:39 TX2 kernel: [  124.083901] [1636] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1074216960 39564 bytes iter=9a8c
> Mar 25 15:25:40 TX2 kernel: [  124.340557] [1637] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1074253824 1237843 bytes iter=12e353

I can very easily reproduce this with our application. If anyone has suggestions to try, or additional logging/tracing they would like me to perform, please let me know. I can provide more detailed, full logs if desired, but they're quite large. I'll continue to read through the code, and if I find anything I will update you.

Thanks,
Mark Whiting

