* [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
@ 2025-03-25 21:24 Mark A Whiting
2025-03-26 10:11 ` AW: [[ EXT ]] " Heckmann, Ilja
0 siblings, 1 reply; 29+ messages in thread
From: Mark A Whiting @ 2025-03-25 21:24 UTC (permalink / raw)
To: linux-cifs@vger.kernel.org
Hello,
I have discovered a data corruption issue with our application writing to a CIFS share. I believe this issue may be related to another report I saw on this mailing list, https://lore.kernel.org/linux-cifs/DFC1DAC5-5C6C-4DC2-807A-DAF12E4B7882@gmail.com/. I understand that updating to a newer kernel would likely fix this issue. However, at the moment, that's not an option for us. In the long term we are looking to upgrade to 6.12, but I'm hoping to find a solution for our current 6.6 kernel.
I have tested mounting with the "cache=none" option and that solves the problem, albeit with a very large performance hit.
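For reference, here is a minimal sketch of the workaround via the raw mount(2) syscall (the server, mount point, credentials, and the ip= address below are placeholders; in practice we mount via mount.cifs, which resolves the hostname and passes ip= for us):

/* Sketch only: mount a CIFS share with caching disabled (cache=none).
 * All names and the address below are placeholders. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	const char *opts = "username=user,password=secret,"
			   "ip=192.0.2.10,cache=none";

	if (mount("//server/share", "/mnt/share", "cifs", 0, opts)) {
		perror("mount");
		return 1;
	}
	return 0;
}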
The platform is an embedded system. We're using an off-the-shelf COM Express Type 7 module with an Intel Xeon D-1713NT processor. We're running a custom Linux system built using Buildroot, currently on the 6.6.71 kernel. I've tested the latest 6.6.84 kernel and the problem still exists there. Our application writes large amounts of compressed data (4+ GB) to the network share. When I read the data back to verify it, I'm seeing small portions of the file that have been replaced with zeros.
I've attacked the issue from several angles, starting with a TCP dump of a complete operation: mounting the network share, transferring the data, and unmounting. In Wireshark I can see that there is no write command to the server covering the sections of the output that end up as zeros. This indicated to me that the CIFS kernel driver is failing to write out portions of the file.
I then enabled all the CIFS debug info I could via cifsFYI and the kernel dynamic debug controls and tweaked the code to not rate limit the pr_debug calls. I could trace through the resulting logs and find pairs of cifs_write_begin() / cifs_write_end() that covered all the data including the sections that ultimately don't get written out. However, tracing through the smb2_async_writev() messages I again could not find any writes that covered the corrupt portions. At this point I began to suspect some kind of race condition within the cifs_writepages() function.
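In case it's useful to anyone, the knobs I enabled amount to the following (normally two echo commands from a shell; shown here as a small C helper, and assuming debugfs is mounted at /sys/kernel/debug):

/* Sketch: enable verbose CIFS logging. Equivalent to
 *   echo 7 > /proc/fs/cifs/cifsFYI
 *   echo 'file fs/smb/client/* +p' > /sys/kernel/debug/dynamic_debug/control
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* cifsFYI is a bitmask; 7 enables the most verbose FYI output */
	write_str("/proc/fs/cifs/cifsFYI", "7");
	/* turn on every pr_debug()/cifs_dbg() site in the cifs client */
	write_str("/sys/kernel/debug/dynamic_debug/control",
		  "file fs/smb/client/* +p");
	return 0;
}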
I also analyzed the data corruption and noticed a pattern. It does not fail 100% of the time, and it does not always fail in the same place. This furthered my belief that it was some kind of non-deterministic data race. The corrupt data region is always less than a page in size (<4096 bytes), it's always zeros, and it always ends on a page boundary. Because I knew the expected format of the data, I could also tell that the corrupt data was always at the beginning of a write syscall by our application.
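To scan the read-back files for this signature, I used logic along these lines (an illustrative sketch, not my exact tool, and slow but simple; compressed data can legitimately contain zero bytes, so this only flags candidates for inspection against the expected format):

/* Sketch: flag zero runs shorter than a page that end exactly on a
 * 4096-byte boundary - the corruption signature described above. */
#include <stdio.h>

#define PAGE_SZ 4096

int main(int argc, char **argv)
{
	FILE *f;
	long long off = 0, run_start = -1;
	int c;

	if (argc < 2 || !(f = fopen(argv[1], "rb"))) {
		perror("fopen");
		return 1;
	}
	while ((c = fgetc(f)) != EOF) {
		if (c == 0) {
			if (run_start < 0)
				run_start = off;	/* zero run begins */
		} else if (run_start >= 0) {
			long long len = off - run_start;

			/* run ended at 'off': page-aligned end, sub-page length */
			if (len < PAGE_SZ && off % PAGE_SZ == 0)
				printf("suspect zero run at %lld, len %lld\n",
				       run_start, len);
			run_start = -1;
		}
		off++;
	}
	fclose(f);
	return 0;
}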
I've attempted to read through the CIFS kernel code involved, but I've never worked in the VFS/filesystem layers before, and I'm having trouble following the intricacies of the page cache, page dirtying/cleaning, and writeback.
My current best guess at what's happening is as follows:
* Our application writes a buffer of data to the file on the CIFS share. This is compressed data that isn't nicely aligned, so it does not end on a page boundary. The file is newly created, so this write extends the file's EOF to the end of the newly written data, which lands in the middle of a page in the cache.
* cifs_writepages() is invoked to write the cached data back to the server; it scans the cached pages and prepares to write out all the dirty pages (including the final partial page).
* Our application performs another write. This extends the file, and the beginning of this write falls into the end of the previous final partial cached page.
* cifs_writepages() finishes writing out the dirty pages, including the first portion of what it thought was the final partial page, and marks all pages as clean.
* On the next invocation of cifs_writepages(), it scans for dirty pages and skips the beginning of the second write because it thinks that page is clean. The following page is a completely new page and is dirty, so it starts a new write from that page. This would explain why the corruption is always at the beginning of our application's write and corrects itself at the next page boundary.
I have yet to really prove this, but this type of race between dirty/clean pages would explain all the behavior I'm seeing. I'm hoping someone much more intimately familiar with the CIFS code can help point me in the right direction.
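For completeness, a reproducer in the spirit of our application's write pattern would look roughly like this (a sketch with a placeholder path, not our actual code; the key property is that writes are odd-sized, so chunk boundaries rarely coincide with page boundaries, and background writeback is left free to race with them):

/* Sketch: stream odd-sized writes to a new file on the CIFS mount and let
 * background writeback race with writes landing in the tail of a partial
 * page. Verification (re-reading and regenerating the same pattern) is
 * omitted for brevity. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TOTAL (4LL * 1024 * 1024 * 1024)	/* ~4 GB, as in our workload */

int main(void)
{
	static unsigned char buf[1 << 20];
	long long written = 0;
	int fd = open("/mnt/share/repro.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	srand(1);	/* fixed seed so a verifier can replay the chunk sizes */
	while (written < TOTAL) {
		/* odd-sized chunks so writes straddle page boundaries */
		size_t i, len = 64 * 1024 + (rand() % 4096) + 1;

		for (i = 0; i < len; i++)	/* pattern derived from offset */
			buf[i] = (unsigned char)(((written + i) * 2654435761u) >> 24);
		if (write(fd, buf, len) != (ssize_t)len) {
			perror("write");
			return 1;
		}
		written += len;
		/* no fsync: leave writeback timing to the kernel */
	}
	close(fd);
	return 0;
}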
I did try one quick-and-dirty fix: assuming it was a race, I applied the following patch, which adds a per-inode mutex that completely serializes the cifs_write_begin(), cifs_write_end(), and cifs_writepages() functions. This did seem to resolve the data corruption, but at the cost of occasional deadlocks when writing to CIFS files.
> diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c
> index bbb0ef18d7b8..6e2e273b9838 100644
> --- a/fs/smb/client/cifsfs.c
> +++ b/fs/smb/client/cifsfs.c
> @@ -1659,6 +1659,7 @@ cifs_init_once(void *inode)
>
> inode_init_once(&cifsi->netfs.inode);
> init_rwsem(&cifsi->lock_sem);
> + mutex_init(&cifsi->tbl_write_mutex);
> }
>
> static int __init
> diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h
> index 43b42eca6780..4af4c5036d81 100644
> --- a/fs/smb/client/cifsglob.h
> +++ b/fs/smb/client/cifsglob.h
> @@ -1606,6 +1606,17 @@ struct cifsInodeInfo {
> bool lease_granted; /* Flag to indicate whether lease or oplock is granted. */
> char *symlink_target;
> __u32 reparse_tag;
> +
> + /* During development we discovered what we believe to be a race condition
> + * in the write caching behavior of cifs. Setting cache=none solved the
> + * issue but with an unacceptable performance hit. The following mutex was
> + * added to serialize the cifs_write_begin, cifs_write_end, and
> + * cifs_writepages functions in file.c. This appears to solve the issue
> + * without completely disabling caching.
> + *
> + * -Mark Whiting (whitingm@opentext.com)
> + */
> + struct mutex tbl_write_mutex;
> };
>
> static inline struct cifsInodeInfo *
> diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
> index cb75b95efb70..d3bc652a7e65 100644
> --- a/fs/smb/client/file.c
> +++ b/fs/smb/client/file.c
> @@ -3085,6 +3085,7 @@ static int cifs_writepages(struct address_space *mapping,
> {
> loff_t start, end;
> int ret;
> + mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex);
>
> /* We have to be careful as we can end up racing with setattr()
> * truncating the pagecache since the caller doesn't take a lock here
> @@ -3119,6 +3120,7 @@ static int cifs_writepages(struct address_space *mapping,
> }
>
> out:
> + mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex);
> return ret;
> }
>
> @@ -3174,6 +3176,8 @@ static int cifs_write_end(struct file *file, struct address_space *mapping,
> struct folio *folio = page_folio(page);
> __u32 pid;
>
> + mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex);
> +
> if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
> pid = cfile->pid;
> else
> @@ -3233,6 +3237,7 @@ static int cifs_write_end(struct file *file, struct address_space *mapping,
> /* Indication to update ctime and mtime as close is deferred */
> set_bit(CIFS_INO_MODIFIED_ATTR, &CIFS_I(inode)->flags);
>
> + mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex);
> return rc;
> }
>
> @@ -4905,6 +4910,7 @@ static int cifs_write_begin(struct file *file, struct address_space *mapping,
> int rc = 0;
>
> cifs_dbg(FYI, "write_begin from %lld len %d\n", (long long)pos, len);
> + mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex);
>
> start:
> page = grab_cache_page_write_begin(mapping, index);
> @@ -4965,6 +4971,7 @@ static int cifs_write_begin(struct file *file, struct address_space *mapping,
> this will be written out by write_end so is fine */
> }
> out:
> + mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex);
> *pagep = page;
> return rc;
> }
Here are some log excerpts for one of my test cases. In this file, one of the corrupt regions starts at file offset 1,074,214,474 (0x4007364A) and runs for 2,486 bytes, ending on a page boundary (1,074,214,474 + 2,486 = 1,074,216,960 = 262,260 * 4,096). First is a section of the log trimmed to just the cifs_write_begin() / cifs_write_end() functions. You can see a write shown at the exact offset/length of the corrupted data.
> Mar 25 15:25:39 TX2 kernel: [ 124.080900] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074212864 len 1610
> Mar 25 15:25:39 TX2 kernel: [ 124.080906] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 0000000086519afd from pos 1074212864 with 1610 bytes
> Mar 25 15:25:39 TX2 kernel: [ 124.080911] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074214474 len 2486
> Mar 25 15:25:39 TX2 kernel: [ 124.080916] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 0000000086519afd from pos 1074214474 with 2486 bytes
> Mar 25 15:25:39 TX2 kernel: [ 124.080917] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074216960 len 846
> Mar 25 15:25:39 TX2 kernel: [ 124.080924] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 00000000880cee03 from pos 1074216960 with 846 bytes
Now here's a section of the log trimmed to just the smb2_async_writev() function. You can see writes covering the data immediately before and after the corrupted region, but no write to the corrupted region itself. I'm assuming the corrupted region is always zeros because the server extends and zero-fills the file up to the offset of the write that follows the gap left by the missing write.
> Mar 25 15:25:39 TX2 kernel: [ 123.829696] [1635] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1072214016 988260 bytes iter=f1464
> Mar 25 15:25:39 TX2 kernel: [ 124.081016] [1636] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1073201152 1013322 bytes iter=f764a
** Missing write: 1073201152 + 1013322 = 1074214474 **
> Mar 25 15:25:39 TX2 kernel: [ 124.083901] [1636] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1074216960 39564 bytes iter=9a8c
> Mar 25 15:25:40 TX2 kernel: [ 124.340557] [1637] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1074253824 1237843 bytes iter=12e353
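As a quick sanity check of that zero-fill assumption, the behavior is easy to demonstrate against any mount (a sketch; the path is a placeholder):

/* Sketch: a region skipped between two writes reads back as zeros,
 * because the second write extends the file past the gap. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	unsigned char buf[4096];
	int fd = open("/mnt/share/gap-demo.bin", O_CREAT | O_RDWR | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 0xAA, sizeof(buf));
	pwrite(fd, buf, sizeof(buf), 0);	/* data at [0, 4096) */
	pwrite(fd, buf, sizeof(buf), 8192);	/* skip [4096, 8192) entirely */
	pread(fd, buf, sizeof(buf), 4096);	/* read the never-written gap */
	printf("gap byte: 0x%02x\n", buf[0]);	/* prints 0x00 */
	close(fd);
	return 0;
}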
I can very easily reproduce this with our application. If anyone has suggestions to try, or additional logging/tracing they would like me to perform, please let me know. I can provide more detailed, full logs if desired, but they're quite large. I'll continue to read through the code and try to understand it; if I find anything, I will update the list.
Thanks,
Mark Whiting
^ permalink raw reply [flat|nested] 29+ messages in thread

* AW: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
2025-03-25 21:24 [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 Mark A Whiting
@ 2025-03-26 10:11 ` Heckmann, Ilja
2025-03-26 18:58 ` Steve French
0 siblings, 1 reply; 29+ messages in thread
From: Heckmann, Ilja @ 2025-03-26 10:11 UTC (permalink / raw)
To: Mark A Whiting, linux-cifs@vger.kernel.org

We ran into what is probably the same problem with silent data corruption; we only noticed it thanks to using a data format with internal checksums. It also went away when mounting the share with "cache=none" while running kernel 6.6.9, but that had the side effect that no executables could be started from the share (I reported this in June 2024). That second problem was fixed in 6.10, but at the same time mounting with "cache=none" stopped helping against the data corruption. It persists to this day, with kernel 6.12.8, although the frequency at which the problem manifests has gone down significantly.

The way we test for it is by running a certain workload 100 times in a loop and counting the number of runs aborted because of errors. That number went down from about 10 per 100 runs with kernel 6.6.9 to about 1 per 100 runs with 6.12.8. Its non-deterministic nature, and the lack of in-house expertise to investigate the issue at the level Mark did, stopped us from reporting it so far. And while there is no way of knowing that the issue we observe on 6.12.8 is the same one, I can at least confirm that a similar issue exists in more recent kernel versions as well.

Best wishes,
Ilja Heckmann
________________________________________
From: Mark A Whiting <whitingm@opentext.com>
Sent: Tuesday, 25 March 2025 22:24:55
To: linux-cifs@vger.kernel.org
Subject: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71

[...]

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
2025-03-26 10:11 ` AW: [[ EXT ]] " Heckmann, Ilja
@ 2025-03-26 18:58 ` Steve French
2025-03-26 21:13 ` Enzo Matsumiya
2025-03-27 10:09 ` AW: [[ EXT ]] " Heckmann, Ilja
0 siblings, 2 replies; 29+ messages in thread
From: Steve French @ 2025-03-26 18:58 UTC (permalink / raw)
To: Heckmann, Ilja; +Cc: Mark A Whiting, linux-cifs@vger.kernel.org

Were you able to confirm that the problem appeared after 6.6.0 but before 6.6.9 - any chance of narrowing the regression down by bisection?

On Wed, Mar 26, 2025 at 5:13 AM Heckmann, Ilja <heckmann@izw-berlin.de> wrote:
>
> We ran into what is probably the same problem with silent data corruption; we only noticed it thanks to using a data format with internal checksums. [...]

--
Thanks,

Steve

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
2025-03-26 18:58 ` Steve French
@ 2025-03-26 21:13 ` Enzo Matsumiya
2025-03-27 12:48 ` Mark A Whiting
0 siblings, 1 reply; 29+ messages in thread
From: Enzo Matsumiya @ 2025-03-26 21:13 UTC (permalink / raw)
To: Steve French
Cc: Heckmann, Ilja, Mark A Whiting, linux-cifs@vger.kernel.org, henrique.carvalho

Hello,

On 03/26, Steve French wrote:
>Were you able to confirm that the problem appeared after 6.6.0 but
>before 6.6.9 - any chance of narrowing the regression down by
>bisection?

This looks similar to a bug we found in our v6.4-based SLES products.

Bisecting it indicated the regression is d08089f649a0 "cifs: Change the I/O paths to use an iterator rather than a page list".

The first good commit we found is a395726cf823 "cifs: fix data corruption in read after invalidate", but the real fix probably landed a bit before that (we didn't check further because we couldn't afford to backport all the netfs modifications).

This is the fix we used (rebased on top of the v6.6.71 tag):
https://git.exis.tech/linux.git/commit/?h=data_corruption_v6.x&id=8d4c40e084f3d132434d5d3d068175c8db59ce65

@Ilja @Mark could you test it with your reproducer, please?
@Steve can you try it with the reproducer mentioned in the commit message, please?

Cheers,

Enzo

>On Wed, Mar 26, 2025 at 5:13 AM Heckmann, Ilja <heckmann@izw-berlin.de> wrote:
>>
>> We ran into what is probably the same problem with silent data corruption [...]

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
2025-03-26 21:13 ` Enzo Matsumiya
@ 2025-03-27 12:48 ` Mark A Whiting
2025-03-27 13:07 ` Enzo Matsumiya
0 siblings, 1 reply; 29+ messages in thread
From: Mark A Whiting @ 2025-03-27 12:48 UTC (permalink / raw)
To: Enzo Matsumiya, Steve French
Cc: Heckmann, Ilja, linux-cifs@vger.kernel.org, henrique.carvalho@suse.com

Hello,

>This looks similar to a bug we found in our v6.4-based SLES products.
>
>Bisecting it indicated the regression is
>d08089f649a0 "cifs: Change the I/O paths to use an iterator rather than a page list".
>
>This is the fix we used (rebased on top of the v6.6.71 tag):
>https://git.exis.tech/linux.git/commit/?h=data_corruption_v6.x&id=8d4c40e084f3d132434d5d3d068175c8db59ce65

I tried following the link, but it gave me a "502 Bad Gateway" error. I also tried the link from my personal machine at home, in case my corporate network was blocking it; same result. I don't know how big the patch is. Any chance you could just drop it in this thread?

>@Ilja @Mark could you test it with your reproducer, please?
>@Steve can you try it with the reproducer mentioned in the commit message, please?

I would be happy to try it out.

Thanks!
Mark

[...]
>>> >>> > Mar 25 15:25:39 TX2 kernel: [ 124.080900] [1567] >>> > cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from >>> > 1074212864 len 1610 Mar 25 15:25:39 TX2 kernel: [ 124.080906] >>> > [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end >>> > for page 0000000086519afd from pos 1074212864 with 1610 bytes Mar >>> > 25 15:25:39 TX2 kernel: [ 124.080911] [1567] >>> > cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from >>> > 1074214474 len 2486 Mar 25 15:25:39 TX2 kernel: [ 124.080916] >>> > [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end >>> > for page 0000000086519afd from pos 1074214474 with 2486 bytes Mar >>> > 25 15:25:39 TX2 kernel: [ 124.080917] [1567] >>> > cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from >>> > 1074216960 len 846 Mar 25 15:25:39 TX2 kernel: [ 124.080924] >>> > [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end >>> > for page 00000000880cee03 from pos 1074216960 with 846 bytes >>> >>> Now here's a section of the log trimmed to just the smb2_async_writev() function. You can see writes covering the data immediately before and after the corrupted region, but there is no write to the corrupted region. I'm assuming the corrupted region is always zeros because the server is extending and zero-filling the file to the new write offset after the gap of the missing write. >>> >>> > Mar 25 15:25:39 TX2 kernel: [ 123.829696] [1635] >>> > smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write >>> > at 1072214016 988260 bytes iter=f1464 Mar 25 15:25:39 TX2 kernel: [ >>> > 124.081016] [1636] smb2_async_writev:4945: CIFS: >>> > fs/smb/client/smb2pdu.c: async write at 1073201152 1013322 bytes >>> > iter=f764a >>> ** Missing write: 1073201152 + 1013322 = 1074214474 ** >>> > Mar 25 15:25:39 TX2 kernel: [ 124.083901] [1636] >>> > smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write >>> > at 1074216960 39564 bytes iter=9a8c Mar 25 15:25:40 TX2 kernel: [ >>> > 124.340557] [1637] smb2_async_writev:4945: CIFS: >>> > fs/smb/client/smb2pdu.c: async write at 1074253824 1237843 bytes >>> > iter=12e353 >>> >>> I can very easily reproduce this with our application. If anyone has any suggestions to try, additional logging / tracing they would like me to perform, please let me know. I can provide more detailed, full logs if desired, but they're quite large. I'll continue to read through the code and try to understand, if I find anything I will update you. >>> >>> Thanks, >>> Mark Whiting >>> >>> >> >> >>-- >>Thanks, >> >>Steve >> ^ permalink raw reply [flat|nested] 29+ messages in thread
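The corruption signature described above -- a sub-page run of zeros ending exactly on a page boundary -- is easy to scan for mechanically. Below is a minimal standalone C sketch (an illustration, not a tool from this thread); it assumes 4 KiB pages and scans byte-by-byte, so it is slow but simple. On a corrupt copy of the file from the log excerpt above it would flag the region at offset 1074214474: 2486 zero bytes ending at 1074216960, a page boundary.

/*
 * Minimal sketch -- not a tool from this thread. Scans a file for the
 * corruption signature described above: a run of zero bytes shorter
 * than one page (4096 bytes, matching the report) that ends exactly
 * on a page boundary. Printed offsets are candidates to compare
 * against a known-good copy of the data.
 */
#include <stdio.h>

#define PAGE_SIZE 4096LL

int main(int argc, char **argv)
{
	FILE *f;
	long long off = 0, run_start = -1;
	int c;

	if (argc != 2 || !(f = fopen(argv[1], "rb"))) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	while ((c = getc(f)) != EOF) {
		if (c == 0) {
			if (run_start < 0)
				run_start = off;	/* a zero run begins */
		} else if (run_start >= 0) {
			long long len = off - run_start;

			/* sub-page zero run ending on a page boundary? */
			if (len < PAGE_SIZE && off % PAGE_SIZE == 0)
				printf("candidate: offset %lld, %lld zero bytes\n",
				       run_start, len);
			run_start = -1;			/* the zero run ends */
		}
		off++;
	}
	fclose(f);
	return 0;
}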
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-03-27 12:48 ` Mark A Whiting @ 2025-03-27 13:07 ` Enzo Matsumiya 2025-03-27 19:31 ` [EXTERNAL] - " Mark A Whiting ` (2 more replies) 0 siblings, 3 replies; 29+ messages in thread From: Enzo Matsumiya @ 2025-03-27 13:07 UTC (permalink / raw) To: Mark A Whiting Cc: Steve French, Heckmann, Ilja, linux-cifs@vger.kernel.org, henrique.carvalho@suse.com [-- Attachment #1: Type: text/plain, Size: 805 bytes --] Hi Mark, On 03/27, Mark A Whiting wrote: >>This is the fix we used (rebased on top of v6.6.71 tag): >>https://git.exis.tech/linux.git/commit/?h=data_corruption_v6.x&id=8d4c40e084f3d132434d5d3d068175c8db59ce65 > >I tried following the link but it gave me a "502 Bad Gateway" error, I also tried the link on my personal machine at home in case it was my corporate network blocking things, same result. I don't know how big the patch is. Any chance you could just drop it in this thread? Yes, sorry about that, I'm having problems on that server. Patch is attached. >>@Ilja @Mark could you test it with your reproducer please? >>@Steve can you try it with the reproducer mentioned in the commit message please? >> > >I would be happy to try it out. Thanks, I'm eager to know the results. Cheers, Enzo [-- Attachment #2: smb-client-fix-corruption-in-cifs_extend_writeback.patch --] [-- Type: text/x-patch, Size: 8246 bytes --] From 8d4c40e084f3d132434d5d3d068175c8db59ce65 Mon Sep 17 00:00:00 2001 From: Enzo Matsumiya <ematsumiya@suse.de> Date: Wed, 26 Mar 2025 17:48:27 -0300 Subject: [PATCH] smb: client: fix corruption in cifs_extend_writeback cifs.ko writepages implementation will try to extend the write buffer size in order to issue less, but bigger write requests over the wire. The function responsible for doing so, cifs_extend_writeback, however, did not account for some important factors, and not handling some of those factors correctly lead to data corruption on writes coming through writepages. Such corrupt writes are very subtle and show no errors whatsoever on dmesg -- they can only be observed by comparing expected vs actual outputs. Easy reproducer: done | dd ibs=4194304 iflag=fullblock count=10240000 of=remotefile 8999946 <corrupt lines shows here> 'wc -l' is not really reliable as we've seen files with corrupt lines, but no missing ones. Of course, the corruption doesn't happen with cache=none mount option. Bug explanation: - Pointer arguments are updated before bound checking (actual root cause) @_len and @_count are updated with the current folio values before actually checking if the current values fit in their boundaries, so by the time the function exits, the caller (only cifs_write_back_from_locked_folio(), that BTW doesn't do any further checks) those arguments might have crossed bounds and extra data (zeroes) are added as padding. Later, with those offsets marked as 'done', the real actual data that should've been written into those offsets are skipped, making the final file corrupt. - Sync calls with ongoing writeback aren't sync Folios are tested for ongoing writeback (folio_test_writeback), but not handled directly for data-integrity sync syscalls (e.g. fsync() or msync()). When being called from those, and folio *is* under writeback, we MUST wait for the writeback to complete because those calls must guarantee the write went through. 
By simply bailing out of the function, the implementation relies on the timing/luck that no further errors happens later, and that the writeback indeed finished before returning. - Any failed checks to the folios in @xas would call xas_reset This means that whenever some/any folios were added to batch and processed, they are so again in further write calls because @xas, making upper layers do double work on it. This patch fixes the cases above, and also lessen the 'hard stop' conditions for cases where only a single folio is affected, but others in @xas can still be processed (more of a performance improvement). Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de> --- fs/smb/client/file.c | 147 ++++++++++++++++++++++++++++++------------- 1 file changed, 104 insertions(+), 43 deletions(-) diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c index cb75b95efb70..eddd0dab44ed 100644 --- a/fs/smb/client/file.c +++ b/fs/smb/client/file.c @@ -2719,15 +2719,17 @@ static void cifs_extend_writeback(struct address_space *mapping, loff_t start, int max_pages, loff_t max_len, - size_t *_len) + size_t *_len, + int sync_mode) { struct folio_batch batch; struct folio *folio; - unsigned int nr_pages; - pgoff_t index = (start + *_len) / PAGE_SIZE; - size_t len; - bool stop = true; - unsigned int i; + unsigned int nr_pages, i; + pgoff_t idx, index = (start + *_len) / PAGE_SIZE; + size_t len = *_len, flen; + bool stop = true, sync = (sync_mode != WB_SYNC_NONE); + long count = *_count; + int npages = max_pages; folio_batch_init(&batch); @@ -2742,59 +2744,110 @@ static void cifs_extend_writeback(struct address_space *mapping, stop = true; if (xas_retry(xas, folio)) continue; - if (xa_is_value(folio)) - break; - if (folio->index != index) { - xas_reset(xas); + if (xa_is_value(folio)) { + stop = false; break; } + if (folio_index(folio) != index) + goto xareset_next; - if (!folio_try_get(folio)) { - xas_reset(xas); - continue; - } - nr_pages = folio_nr_pages(folio); - if (nr_pages > max_pages) { - xas_reset(xas); - break; - } + if (!folio_try_get(folio)) + goto xareset_next; /* Has the page moved or been split? */ - if (unlikely(folio != xas_reload(xas))) { - folio_put(folio); - xas_reset(xas); - break; + if (unlikely(folio != xas_reload(xas) || folio->mapping != mapping)) { + stop = false; + goto put_next; } - if (!folio_trylock(folio)) { - folio_put(folio); - xas_reset(xas); - break; - } - if (!folio_test_dirty(folio) || - folio_test_writeback(folio)) { - folio_unlock(folio); - folio_put(folio); - xas_reset(xas); - break; + if (!folio_trylock(folio)) + goto put_next; + + nr_pages = folio_nr_pages(folio); + if (nr_pages > npages || nr_pages > count) + goto unlock_next; + + if (folio_test_writeback(folio)) { + /* + * For data-integrity syscalls (fsync(), msync()) we must wait for + * the I/O to complete on the page. + * For other cases (!sync), we can just skip this page, even if + * it's dirty. + */ + if (!sync) { + stop = false; + goto unlock_next; + } else { + folio_wait_writeback(folio); + + /* + * More I/O started meanwhile, bail out and write on the + * next call. + */ + if (WARN_ON_ONCE(folio_test_writeback(folio))) + goto unlock_next; + } } - max_pages -= nr_pages; - len = folio_size(folio); - stop = false; + /* + * We don't really have a boundary for index, so just check for overflow. 
+ */ + if (check_add_overflow(index, nr_pages, &idx)) + goto unlock_next; + + flen = folio_size(folio); - index += nr_pages; - *_count -= nr_pages; - *_len += len; - if (max_pages <= 0 || *_len >= max_len || *_count <= 0) - stop = true; + /* Store sum in @flen so we don't have to undo it in case of failure. */ + if (check_add_overflow(len, flen, &flen) || flen > max_len) + goto unlock_next; + + index = idx; + len = flen; + + /* + * @npages and @count have been checked earlier (and are signed), so we + * can just subtract them here. + */ + npages -= nr_pages; + count -= nr_pages; + /* + * This is not an error; it just means we _did_ add this current folio, but + * can't add any more to this batch, so break out of this loop to start + * processing this batch, but don't stop the outer loop in case there are + * more folios to be processed in @xas. + */ + stop = false; if (!folio_batch_add(&batch, folio)) break; + + /* + * Folios added to the batch must be left locked for the loop below. They + * will be unlocked right away and also folio_batch_release() will take + * care of putting them. + */ + continue; +unlock_next: + folio_unlock(folio); +put_next: + folio_put(folio); +xareset_next: if (stop) break; } + /* + * Only reset @xas if we get here because of one of the stopping conditions above, + * namely: + * - couldn't lock/get a folio (someone else was processing the same folio) + * - folio was in writeback for too long (sync call was writing the same folio) + * - out of bounds index, len, or count (the last processed folio was partial and + * we can't fit it in this write request, so it shall be processed in the next + * write) + */ + if (stop) + xas_reset(xas); + xas_pause(xas); rcu_read_unlock(); @@ -2815,6 +2868,13 @@ static void cifs_extend_writeback(struct address_space *mapping, folio_unlock(folio); } + /* + * By now, data has been updated/written out to @mapping, so from this point of + * view we're done and we can safely update @_len and @_count. + */ + *_len = len; + *_count = count; + folio_batch_release(&batch); cond_resched(); } while (!stop); @@ -2902,7 +2962,8 @@ static ssize_t cifs_write_back_from_locked_folio(struct address_space *mapping, if (max_pages > 0) cifs_extend_writeback(mapping, xas, &count, start, - max_pages, max_len, &len); + max_pages, max_len, &len, + wbc->sync_mode); } } len = min_t(unsigned long long, len, i_size - start); -- 2.48.1 ^ permalink raw reply related [flat|nested] 29+ messages in thread
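The first bullet of the bug explanation above boils down to a commit-after-validate idiom: accumulate into locals, bound-check with overflow-safe arithmetic, and only then write through the out-parameters. Here is a minimal userspace model of that idiom -- an illustration, not kernel code -- built on the GCC/Clang __builtin_add_overflow() that the kernel's check_add_overflow() is layered on:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Try to extend *_len by flen without exceeding max_len. */
static bool try_extend(size_t *_len, size_t flen, size_t max_len)
{
	size_t len;

	/* Validate first: reject on overflow or bound violation... */
	if (__builtin_add_overflow(*_len, flen, &len) || len > max_len)
		return false;

	/* ...and only then commit the new value to the caller. */
	*_len = len;
	return true;
}

int main(void)
{
	size_t len = 4096, max_len = 8192;

	/*
	 * The first extension fits exactly; the second is rejected and
	 * leaves len untouched -- the same behavior the patch gives
	 * @_len and @_count by validating through the @flen and @idx
	 * scratch values before committing.
	 */
	printf("extend by 4096: %s (len=%zu)\n",
	       try_extend(&len, 4096, max_len) ? "ok" : "rejected", len);
	printf("extend by 1: %s (len=%zu)\n",
	       try_extend(&len, 1, max_len) ? "ok" : "rejected", len);
	return 0;
}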
* RE: [EXTERNAL] - Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-03-27 13:07 ` Enzo Matsumiya @ 2025-03-27 19:31 ` Mark A Whiting 2025-03-31 19:47 ` Mark A Whiting 2025-11-07 12:54 ` Shyam Prasad N 2 siblings, 0 replies; 29+ messages in thread From: Mark A Whiting @ 2025-03-27 19:31 UTC (permalink / raw) To: Enzo Matsumiya Cc: Steve French, Heckmann, Ilja, linux-cifs@vger.kernel.org, henrique.carvalho@suse.com Hi Enzo, >Hi Mark, > >On 03/27, Mark A Whiting wrote: >>>This is the fix we used (rebased on top of v6.6.71 tag): >>>https://git.exis.tech/linux.git/commit/?h=data_corruption_v6.x&id=8d4c40e084f3d132434d5d3d068175c8db59ce65 >> >>I tried following the link but it gave me a "502 Bad Gateway" error, I also tried the link on my personal machine at home in case it was my corporate network blocking things, same result. I don't know how big the patch is. Any chance you could just drop it in this thread? > >Yes, sorry about that, I'm having problems on that server. >Patch is attached. > >>>@Ilja @Mark could you test it with your reproducer please? >>>@Steve can you try it with the reproducer mentioned in the commit message please? >>> >> >>I would be happy to try it out. > >Thanks, I'm eager to know the results. > > >Cheers, > >Enzo > This patch seems to have solved my issue. Before the patch I was seeing a job failure rate of about 50%. After applying the patch I've run 30 jobs in a row without a failure. I'm going to spend the afternoon getting this into a build I can send over to our QA and let them stress test it for a few days. They have all the fancy 10G networking equipment. I'll let you know if they discover any issues. Thank you for your help. Mark ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-03-27 13:07 ` Enzo Matsumiya 2025-03-27 19:31 ` [EXTERNAL] - " Mark A Whiting @ 2025-03-31 19:47 ` Mark A Whiting 2025-11-06 15:00 ` Bharath SM 2025-11-07 12:54 ` Shyam Prasad N 2 siblings, 1 reply; 29+ messages in thread From: Mark A Whiting @ 2025-03-31 19:47 UTC (permalink / raw) To: Enzo Matsumiya Cc: Steve French, Heckmann, Ilja, linux-cifs@vger.kernel.org, henrique.carvalho@suse.com Hi Enzo, I now have a couple of days of testing done. The good news is that we've run several terabytes of data through cifs and haven't had a single failure with the patch you provided. Now for the not-so-good news: even though we aren't seeing any data corruption or failures, we are regularly and very frequently hitting a WARN_ON in cifs_extend_writeback() on line 2866. >for (i = 0; i < folio_batch_count(&batch); i++) { > folio = batch.folios[i]; > /* The folio should be locked, dirty and not undergoing > * writeback from the loop above. > */ > if (!folio_clear_dirty_for_io(folio)) > WARN_ON(1); Reading through the folio_clear_dirty_for_io() function, it appears the only way this would happen is if the folio is clean, i.e., the dirty flag is not set. >if (folio_test_writeback(folio)) { > /* > * For data-integrity syscalls (fsync(), msync()) we must wait for > * the I/O to complete on the page. > * For other cases (!sync), we can just skip this page, even if > * it's dirty. > */ > if (!sync) { > stop = false; > goto unlock_next; > } else { > folio_wait_writeback(folio); Reading through your patch, unless I missed something, this folio_wait_writeback() call is the only addition that could affect the dirty flag indirectly. I'm assuming that when the current writeback completes it marks the folio clean; then, when the folio is added to the current batch and later checked, it's clean instead of dirty as expected. Since you wrote the patch: is this expected behavior, meaning that WARN_ON is no longer valid? Or is this something I should be worried about? Thanks, Mark Whiting ^ permalink raw reply [flat|nested] 29+ messages in thread
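For comparison, the write_cache_pages() loop in mm/page-writeback.c treats this situation as expected rather than warning: after waiting out writeback in data-integrity mode, a folio that is no longer dirty is simply skipped, because a concurrent writeback already wrote it out. A rough paraphrase of that core-mm pattern follows (not a verbatim copy, and not directly transplantable into cifs_extend_writeback(), where the folio run must stay contiguous):

	/* The folio is locked at this point in the walk. */
	if (folio_test_writeback(folio)) {
		if (wbc->sync_mode != WB_SYNC_NONE)
			folio_wait_writeback(folio);	/* may return with a clean folio */
		else
			goto continue_unlock;
	}

	/* A concurrent writeback already cleaned it: nothing left to do. */
	if (!folio_clear_dirty_for_io(folio))
		goto continue_unlock;

Under that reading, the WARN_ON in the patched batch loop can fire benignly whenever folio_wait_writeback() returns a folio that another writeback just cleaned, which matches the trace above -- though, as the discussion later in the thread shows, simply skipping folios inside the extension loop is not a safe fix on its own.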
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-03-31 19:47 ` Mark A Whiting @ 2025-11-06 15:00 ` Bharath SM 2025-11-06 15:20 ` Enzo Matsumiya ` (3 more replies) 0 siblings, 4 replies; 29+ messages in thread From: Bharath SM @ 2025-11-06 15:00 UTC (permalink / raw) To: Mark A Whiting, henrique.carvalho, Enzo Matsumiya, Steve French, Shyam Prasad, David Howells, Paulo Alcantara Cc: Heckmann, Ilja, linux-cifs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 2199 bytes --] Hi Enzo, We are noticing the similar behavior with the 6.6 kernel, can you please submit a patch to the 6.6 stable kernel. Hi Steve and David, Can you please review the attached patch from Enzo and share your comments. Thanks, Bharath On Mon, Mar 31, 2025 at 12:48 PM Mark A Whiting <whitingm@opentext.com> wrote: > > Hi Enzo, > > I now have a couple days of testing done. The good news is that we've run several terabytes of data through cifs and haven't had a single failure with the patch you provided. > > Now for the not as good news. Even though we aren't seeing any data corruption or failures. We are regularly and very frequently hitting a WARN_ON in cifs_extend_writeback() on line 2866. > > >for (i = 0; i < folio_batch_count(&batch); i++) { > > folio = batch.folios[i]; > > /* The folio should be locked, dirty and not undergoing > > * writeback from the loop above. > > */ > > if (!folio_clear_dirty_for_io(folio)) > > WARN_ON(1); > > Reading through the folio_clear_dirty_for_io() function it appears the only way this would happen is if the folio is clean, i.e., the dirty flag is not set. > > >if (folio_test_writeback(folio)) { > > /* > > * For data-integrity syscalls (fsync(), msync()) we must wait for > > * the I/O to complete on the page. > > * For other cases (!sync), we can just skip this page, even if > > * it's dirty. > > */ > > if (!sync) { > > stop = false; > > goto unlock_next; > > } else { > > folio_wait_writeback(folio); > > Reading through your patch, unless I missed something, this folio_wait_writeback() call is the only addition that could affect the dirty flag indirectly. I'm assuming that when the current writeback is complete it would mark the folio clean. Then when it's added to the current batch and later checked, it's clean instead of the dirty flag being set as expected. > > Since you wrote the patch, is this expected behavior and that WARN_ON isn't valid anymore? Or is this something I should be worried about? > > Thanks, > Mark Whiting > > > [-- Attachment #2: smb-client-fix-corruption-in-cifs_extend_writeback.patch --] [-- Type: application/octet-stream, Size: 8501 bytes --] From 8d4c40e084f3d132434d5d3d068175c8db59ce65 Mon Sep 17 00:00:00 2001 From: Enzo Matsumiya <ematsumiya@suse.de> Date: Wed, 26 Mar 2025 17:48:27 -0300 Subject: [PATCH] smb: client: fix corruption in cifs_extend_writeback cifs.ko writepages implementation will try to extend the write buffer size in order to issue less, but bigger write requests over the wire. The function responsible for doing so, cifs_extend_writeback, however, did not account for some important factors, and not handling some of those factors correctly lead to data corruption on writes coming through writepages. Such corrupt writes are very subtle and show no errors whatsoever on dmesg -- they can only be observed by comparing expected vs actual outputs. 
Easy reproducer: done | dd ibs=4194304 iflag=fullblock count=10240000 of=remotefile 8999946 <corrupt lines shows here> 'wc -l' is not really reliable as we've seen files with corrupt lines, but no missing ones. Of course, the corruption doesn't happen with cache=none mount option. Bug explanation: - Pointer arguments are updated before bound checking (actual root cause) @_len and @_count are updated with the current folio values before actually checking if the current values fit in their boundaries, so by the time the function exits, the caller (only cifs_write_back_from_locked_folio(), that BTW doesn't do any further checks) those arguments might have crossed bounds and extra data (zeroes) are added as padding. Later, with those offsets marked as 'done', the real actual data that should've been written into those offsets are skipped, making the final file corrupt. - Sync calls with ongoing writeback aren't sync Folios are tested for ongoing writeback (folio_test_writeback), but not handled directly for data-integrity sync syscalls (e.g. fsync() or msync()). When being called from those, and folio *is* under writeback, we MUST wait for the writeback to complete because those calls must guarantee the write went through. By simply bailing out of the function, the implementation relies on the timing/luck that no further errors happens later, and that the writeback indeed finished before returning. - Any failed checks to the folios in @xas would call xas_reset This means that whenever some/any folios were added to batch and processed, they are so again in further write calls because @xas, making upper layers do double work on it. This patch fixes the cases above, and also lessen the 'hard stop' conditions for cases where only a single folio is affected, but others in @xas can still be processed (more of a performance improvement). Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de> --- fs/smb/client/file.c | 147 ++++++++++++++++++++++++++++++------------- 1 file changed, 104 insertions(+), 43 deletions(-) diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c index cb75b95efb70..eddd0dab44ed 100644 --- a/fs/smb/client/file.c +++ b/fs/smb/client/file.c @@ -2719,15 +2719,17 @@ static void cifs_extend_writeback(struct address_space *mapping, loff_t start, int max_pages, loff_t max_len, - size_t *_len) + size_t *_len, + int sync_mode) { struct folio_batch batch; struct folio *folio; - unsigned int nr_pages; - pgoff_t index = (start + *_len) / PAGE_SIZE; - size_t len; - bool stop = true; - unsigned int i; + unsigned int nr_pages, i; + pgoff_t idx, index = (start + *_len) / PAGE_SIZE; + size_t len = *_len, flen; + bool stop = true, sync = (sync_mode != WB_SYNC_NONE); + long count = *_count; + int npages = max_pages; folio_batch_init(&batch); @@ -2742,59 +2744,110 @@ static void cifs_extend_writeback(struct address_space *mapping, stop = true; if (xas_retry(xas, folio)) continue; - if (xa_is_value(folio)) - break; - if (folio->index != index) { - xas_reset(xas); + if (xa_is_value(folio)) { + stop = false; break; } + if (folio_index(folio) != index) + goto xareset_next; - if (!folio_try_get(folio)) { - xas_reset(xas); - continue; - } - nr_pages = folio_nr_pages(folio); - if (nr_pages > max_pages) { - xas_reset(xas); - break; - } + if (!folio_try_get(folio)) + goto xareset_next; /* Has the page moved or been split? 
*/ - if (unlikely(folio != xas_reload(xas))) { - folio_put(folio); - xas_reset(xas); - break; + if (unlikely(folio != xas_reload(xas) || folio->mapping != mapping)) { + stop = false; + goto put_next; } - if (!folio_trylock(folio)) { - folio_put(folio); - xas_reset(xas); - break; - } - if (!folio_test_dirty(folio) || - folio_test_writeback(folio)) { - folio_unlock(folio); - folio_put(folio); - xas_reset(xas); - break; + if (!folio_trylock(folio)) + goto put_next; + + nr_pages = folio_nr_pages(folio); + if (nr_pages > npages || nr_pages > count) + goto unlock_next; + + if (folio_test_writeback(folio)) { + /* + * For data-integrity syscalls (fsync(), msync()) we must wait for + * the I/O to complete on the page. + * For other cases (!sync), we can just skip this page, even if + * it's dirty. + */ + if (!sync) { + stop = false; + goto unlock_next; + } else { + folio_wait_writeback(folio); + + /* + * More I/O started meanwhile, bail out and write on the + * next call. + */ + if (WARN_ON_ONCE(folio_test_writeback(folio))) + goto unlock_next; + } } - max_pages -= nr_pages; - len = folio_size(folio); - stop = false; + /* + * We don't really have a boundary for index, so just check for overflow. + */ + if (check_add_overflow(index, nr_pages, &idx)) + goto unlock_next; + + flen = folio_size(folio); - index += nr_pages; - *_count -= nr_pages; - *_len += len; - if (max_pages <= 0 || *_len >= max_len || *_count <= 0) - stop = true; + /* Store sum in @flen so we don't have to undo it in case of failure. */ + if (check_add_overflow(len, flen, &flen) || flen > max_len) + goto unlock_next; + + index = idx; + len = flen; + + /* + * @npages and @count have been checked earlier (and are signed), so we + * can just subtract them here. + */ + npages -= nr_pages; + count -= nr_pages; + /* + * This is not an error; it just means we _did_ add this current folio, but + * can't add any more to this batch, so break out of this loop to start + * processing this batch, but don't stop the outer loop in case there are + * more folios to be processed in @xas. + */ + stop = false; if (!folio_batch_add(&batch, folio)) break; + + /* + * Folios added to the batch must be left locked for the loop below. They + * will be unlocked right away and also folio_batch_release() will take + * care of putting them. + */ + continue; +unlock_next: + folio_unlock(folio); +put_next: + folio_put(folio); +xareset_next: if (stop) break; } + /* + * Only reset @xas if we get here because of one of the stopping conditions above, + * namely: + * - couldn't lock/get a folio (someone else was processing the same folio) + * - folio was in writeback for too long (sync call was writing the same folio) + * - out of bounds index, len, or count (the last processed folio was partial and + * we can't fit it in this write request, so it shall be processed in the next + * write) + */ + if (stop) + xas_reset(xas); + xas_pause(xas); rcu_read_unlock(); @@ -2815,6 +2868,13 @@ static void cifs_extend_writeback(struct address_space *mapping, folio_unlock(folio); } + /* + * By now, data has been updated/written out to @mapping, so from this point of + * view we're done and we can safely update @_len and @_count. 
+ */ + *_len = len; + *_count = count; + folio_batch_release(&batch); cond_resched(); } while (!stop); @@ -2902,7 +2962,8 @@ static ssize_t cifs_write_back_from_locked_folio(struct address_space *mapping, if (max_pages > 0) cifs_extend_writeback(mapping, xas, &count, start, - max_pages, max_len, &len); + max_pages, max_len, &len, + wbc->sync_mode); } } len = min_t(unsigned long long, len, i_size - start); -- 2.48.1 ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-11-06 15:00 ` Bharath SM @ 2025-11-06 15:20 ` Enzo Matsumiya 2025-11-06 15:51 ` Bharath SM 2025-11-10 15:56 ` David Howells 2025-11-06 16:22 ` David Howells ` (2 subsequent siblings) 3 siblings, 2 replies; 29+ messages in thread From: Enzo Matsumiya @ 2025-11-06 15:20 UTC (permalink / raw) To: Bharath SM Cc: Mark A Whiting, henrique.carvalho, Steve French, Shyam Prasad, David Howells, Paulo Alcantara, Heckmann, Ilja, linux-cifs@vger.kernel.org, yangerkun, yangerkun Hi Bharath, That patch is not actually mine, I just posted a rebased version. The original patch by Yang Erkun (Cc'd) has been resent and Cc stable (6.6 to 6.9): https://lore.kernel.org/linux-cifs/20250912014150.3057545-1-yangerkun@huawei.com/ I haven't followed up on its whereabouts though -- you might want to ping someone else involved in the process. On 11/06, Bharath SM wrote: >Hi Enzo, > >We are noticing the similar behavior with the 6.6 kernel, can you >please submit a patch to the 6.6 stable kernel. > >Hi Steve and David, > >Can you please review the attached patch from Enzo and share your comments. > >Thanks, >Bharath > >On Mon, Mar 31, 2025 at 12:48 PM Mark A Whiting <whitingm@opentext.com> wrote: >> >> Hi Enzo, >> >> I now have a couple days of testing done. The good news is that we've run several terabytes of data through cifs and haven't had a single failure with the patch you provided. >> >> Now for the not as good news. Even though we aren't seeing any data corruption or failures. We are regularly and very frequently hitting a WARN_ON in cifs_extend_writeback() on line 2866. >> >> >for (i = 0; i < folio_batch_count(&batch); i++) { >> > folio = batch.folios[i]; >> > /* The folio should be locked, dirty and not undergoing >> > * writeback from the loop above. >> > */ >> > if (!folio_clear_dirty_for_io(folio)) >> > WARN_ON(1); >> >> Reading through the folio_clear_dirty_for_io() function it appears the only way this would happen is if the folio is clean, i.e., the dirty flag is not set. >> >> >if (folio_test_writeback(folio)) { >> > /* >> > * For data-integrity syscalls (fsync(), msync()) we must wait for >> > * the I/O to complete on the page. >> > * For other cases (!sync), we can just skip this page, even if >> > * it's dirty. >> > */ >> > if (!sync) { >> > stop = false; >> > goto unlock_next; >> > } else { >> > folio_wait_writeback(folio); >> >> Reading through your patch, unless I missed something, this folio_wait_writeback() call is the only addition that could affect the dirty flag indirectly. I'm assuming that when the current writeback is complete it would mark the folio clean. Then when it's added to the current batch and later checked, it's clean instead of the dirty flag being set as expected. >> >> Since you wrote the patch, is this expected behavior and that WARN_ON isn't valid anymore? Or is this something I should be worried about? @Mark sorry I totally missed that reply of yours. I could never observe that WARN_ON() being triggered. That code path was mid-migration throughout the affected versions, so I think it depends a lot of which version you used with the patch. I Cc'd Yang (original patch author), maybe he knows more about that. Cheers, Enzo ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-11-06 15:20 ` Enzo Matsumiya @ 2025-11-06 15:51 ` Bharath SM 2025-11-06 16:03 ` Enzo Matsumiya 2025-11-10 15:56 ` David Howells 1 sibling, 1 reply; 29+ messages in thread From: Bharath SM @ 2025-11-06 15:51 UTC (permalink / raw) To: Enzo Matsumiya Cc: Mark A Whiting, henrique.carvalho, Steve French, Shyam Prasad, David Howells, Paulo Alcantara, Heckmann, Ilja, linux-cifs@vger.kernel.org, yangerkun, yangerkun On Thu, Nov 6, 2025 at 7:20 AM Enzo Matsumiya <ematsumiya@suse.de> wrote: > > Hi Bharath, > > That patch is not actually mine, I just posted a rebased version. > > The original patch by Yang Erkun (Cc'd) has been resent and Cc stable > (6.6 to 6.9): > https://lore.kernel.org/linux-cifs/20250912014150.3057545-1-yangerkun@huawei.com/ > I haven't followed up on its whereabouts though -- you might want to > ping someone else involved in the process. This looks to be a different patch for the page leak issue and doesn't look the same as what you attached here for corruption. And the page leak patch is already in a stable kernel. > > > On 11/06, Bharath SM wrote: > >Hi Enzo, > > > >We are noticing the similar behavior with the 6.6 kernel, can you > >please submit a patch to the 6.6 stable kernel. > > > >Hi Steve and David, > > > >Can you please review the attached patch from Enzo and share your comments. > > > >Thanks, > >Bharath > > > >On Mon, Mar 31, 2025 at 12:48 PM Mark A Whiting <whitingm@opentext.com> wrote: > >> > >> Hi Enzo, > >> > >> I now have a couple days of testing done. The good news is that we've run several terabytes of data through cifs and haven't had a single failure with the patch you provided. > >> > >> Now for the not as good news. Even though we aren't seeing any data corruption or failures. We are regularly and very frequently hitting a WARN_ON in cifs_extend_writeback() on line 2866. > >> > >> >for (i = 0; i < folio_batch_count(&batch); i++) { > >> > folio = batch.folios[i]; > >> > /* The folio should be locked, dirty and not undergoing > >> > * writeback from the loop above. > >> > */ > >> > if (!folio_clear_dirty_for_io(folio)) > >> > WARN_ON(1); > >> > >> Reading through the folio_clear_dirty_for_io() function it appears the only way this would happen is if the folio is clean, i.e., the dirty flag is not set. > >> > >> >if (folio_test_writeback(folio)) { > >> > /* > >> > * For data-integrity syscalls (fsync(), msync()) we must wait for > >> > * the I/O to complete on the page. > >> > * For other cases (!sync), we can just skip this page, even if > >> > * it's dirty. > >> > */ > >> > if (!sync) { > >> > stop = false; > >> > goto unlock_next; > >> > } else { > >> > folio_wait_writeback(folio); > >> > >> Reading through your patch, unless I missed something, this folio_wait_writeback() call is the only addition that could affect the dirty flag indirectly. I'm assuming that when the current writeback is complete it would mark the folio clean. Then when it's added to the current batch and later checked, it's clean instead of the dirty flag being set as expected. > >> > >> Since you wrote the patch, is this expected behavior and that WARN_ON isn't valid anymore? Or is this something I should be worried about? > > @Mark sorry I totally missed that reply of yours. > > I could never observe that WARN_ON() being triggered. > That code path was mid-migration throughout the affected versions, so I > think it depends a lot of which version you used with the patch. 
> > I Cc'd Yang (original patch author), maybe he knows more about that. > > > Cheers, > > Enzo ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-11-06 15:51 ` Bharath SM @ 2025-11-06 16:03 ` Enzo Matsumiya 0 siblings, 0 replies; 29+ messages in thread From: Enzo Matsumiya @ 2025-11-06 16:03 UTC (permalink / raw) To: Bharath SM Cc: Mark A Whiting, henrique.carvalho, Steve French, Shyam Prasad, David Howells, Paulo Alcantara, Heckmann, Ilja, linux-cifs@vger.kernel.org, yangerkun, yangerkun On 11/06, Bharath SM wrote: >On Thu, Nov 6, 2025 at 7:20 AM Enzo Matsumiya <ematsumiya@suse.de> wrote: >> >> Hi Bharath, >> >> That patch is not actually mine, I just posted a rebased version. >> >> The original patch by Yang Erkun (Cc'd) has been resent and Cc stable >> (6.6 to 6.9): >> https://lore.kernel.org/linux-cifs/20250912014150.3057545-1-yangerkun@huawei.com/ >> I haven't followed up on its whereabouts though -- you might want to >> ping someone else involved in the process. > >This looks to be a different patch for the page leak issue and doesn't >look the same as what you attached here for corruption. >And the page leak patch is already in a stable kernel. Eh, I confused things. But given Mark's reports, I don't think I should send it as is. >> On 11/06, Bharath SM wrote: >> >Hi Enzo, >> > >> >We are noticing the similar behavior with the 6.6 kernel, can you >> >please submit a patch to the 6.6 stable kernel. >> > >> >Hi Steve and David, >> > >> >Can you please review the attached patch from Enzo and share your comments. >> > >> >Thanks, >> >Bharath >> > >> >On Mon, Mar 31, 2025 at 12:48 PM Mark A Whiting <whitingm@opentext.com> wrote: >> >> >> >> Hi Enzo, >> >> >> >> I now have a couple days of testing done. The good news is that we've run several terabytes of data through cifs and haven't had a single failure with the patch you provided. >> >> >> >> Now for the not as good news. Even though we aren't seeing any data corruption or failures. We are regularly and very frequently hitting a WARN_ON in cifs_extend_writeback() on line 2866. >> >> >> >> >for (i = 0; i < folio_batch_count(&batch); i++) { >> >> > folio = batch.folios[i]; >> >> > /* The folio should be locked, dirty and not undergoing >> >> > * writeback from the loop above. >> >> > */ >> >> > if (!folio_clear_dirty_for_io(folio)) >> >> > WARN_ON(1); >> >> >> >> Reading through the folio_clear_dirty_for_io() function it appears the only way this would happen is if the folio is clean, i.e., the dirty flag is not set. >> >> >> >> >if (folio_test_writeback(folio)) { >> >> > /* >> >> > * For data-integrity syscalls (fsync(), msync()) we must wait for >> >> > * the I/O to complete on the page. >> >> > * For other cases (!sync), we can just skip this page, even if >> >> > * it's dirty. >> >> > */ >> >> > if (!sync) { >> >> > stop = false; >> >> > goto unlock_next; >> >> > } else { >> >> > folio_wait_writeback(folio); >> >> >> >> Reading through your patch, unless I missed something, this folio_wait_writeback() call is the only addition that could affect the dirty flag indirectly. I'm assuming that when the current writeback is complete it would mark the folio clean. Then when it's added to the current batch and later checked, it's clean instead of the dirty flag being set as expected. >> >> >> >> Since you wrote the patch, is this expected behavior and that WARN_ON isn't valid anymore? Or is this something I should be worried about? >> >> @Mark sorry I totally missed that reply of yours. >> >> I could never observe that WARN_ON() being triggered. 
>> That code path was mid-migration throughout the affected versions, so I >> think it depends a lot of which version you used with the patch. >> >> I Cc'd Yang (original patch author), maybe he knows more about that. >> >> >> Cheers, >> >> Enzo > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-11-06 15:20 ` Enzo Matsumiya 2025-11-06 15:51 ` Bharath SM @ 2025-11-10 15:56 ` David Howells 1 sibling, 0 replies; 29+ messages in thread From: David Howells @ 2025-11-10 15:56 UTC (permalink / raw) To: Enzo Matsumiya Cc: dhowells, Bharath SM, Mark A Whiting, henrique.carvalho, Steve French, Shyam Prasad, Paulo Alcantara, Heckmann, Ilja, linux-cifs@vger.kernel.org, yangerkun, yangerkun Enzo Matsumiya <ematsumiya@suse.de> wrote: > The original patch by Yang Erkun (Cc'd) has been resent and Cc stable > (6.6 to 6.9): > https://lore.kernel.org/linux-cifs/20250912014150.3057545-1-yangerkun@huawei.com/ On the Yang Erkun patch: Reviewed-by: David Howells <dhowells@redhat.com> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-11-06 15:00 ` Bharath SM 2025-11-06 15:20 ` Enzo Matsumiya @ 2025-11-06 16:22 ` David Howells 2025-11-06 16:46 ` Bharath SM 2025-11-10 16:00 ` David Howells 2025-11-11 9:22 ` David Howells 3 siblings, 1 reply; 29+ messages in thread From: David Howells @ 2025-11-06 16:22 UTC (permalink / raw) To: Bharath SM Cc: dhowells, Mark A Whiting, henrique.carvalho, Enzo Matsumiya, Steve French, Shyam Prasad, Paulo Alcantara, Heckmann, Ilja, linux-cifs@vger.kernel.org Bharath SM <bharathsm.hsk@gmail.com> wrote: > We are noticing the similar behavior with the 6.6 kernel, can you > please submit a patch to the 6.6 stable kernel. What range of kernels is this patch aimed at? Pre-netfslib conversion cifs only? David ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-11-06 16:22 ` David Howells @ 2025-11-06 16:46 ` Bharath SM 2025-11-06 16:49 ` Bharath SM 0 siblings, 1 reply; 29+ messages in thread From: Bharath SM @ 2025-11-06 16:46 UTC (permalink / raw) To: David Howells Cc: Mark A Whiting, henrique.carvalho, Enzo Matsumiya, Steve French, Shyam Prasad, Paulo Alcantara, Heckmann, Ilja, linux-cifs@vger.kernel.org On Thu, Nov 6, 2025 at 8:22 AM David Howells <dhowells@redhat.com> wrote: > > Bharath SM <bharathsm.hsk@gmail.com> wrote: > > > We are noticing the similar behavior with the 6.6 kernel, can you > > please submit a patch to the 6.6 stable kernel. > > What range of kernels is this patch aimed at? Pre-netfslib conversion cifs > only? By looking at the changes, I think it can only be applied to pre-netfslib-conversion kernels, as most of these functions were removed after that. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-11-06 16:46 ` Bharath SM @ 2025-11-06 16:49 ` Bharath SM 0 siblings, 0 replies; 29+ messages in thread From: Bharath SM @ 2025-11-06 16:49 UTC (permalink / raw) To: David Howells Cc: Mark A Whiting, henrique.carvalho, Enzo Matsumiya, Steve French, Shyam Prasad, Paulo Alcantara, Heckmann, Ilja, linux-cifs@vger.kernel.org On Thu, Nov 6, 2025 at 8:46 AM Bharath SM <bharathsm.hsk@gmail.com> wrote: > > On Thu, Nov 6, 2025 at 8:22 AM David Howells <dhowells@redhat.com> wrote: > > > > Bharath SM <bharathsm.hsk@gmail.com> wrote: > > > > > We are noticing the similar behavior with the 6.6 kernel, can you > > > please submit a patch to the 6.6 stable kernel. > > > > What range of kernels is this patch aimed at? Pre-netfslib conversion cifs > > only? > > By looking at changes I think it can be applied to Pre-netfslib conversion > as most of these functions were removed after that. But I do notice following warnings with patch: [Thu Nov 6 16:47:23 2025] ------------[ cut here ]------------ [Thu Nov 6 16:47:23 2025] WARNING: CPU: 7 PID: 61031 at /home/azureuser/jammy/fs/smb/client/file.c:2855 cifs_extend_writeback+0x3f8/0x5b0 [cifs] [Thu Nov 6 16:47:23 2025] Modules linked in: cifs(OE) cmac nls_utf8 cifs_arc4 nls_ucs2_utils cifs_md4 netfs tls nvme_fabrics xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_owner xt_tcpudp nft_compat nf_tables libcrc32c nfnetlink binfmt_misc nls_iso8859_1 kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 joydev aesni_intel hid_generic hid_hyperv crypto_simd serio_raw cryptd hv_netvsc hid hyperv_drm hyperv_keyboard dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel efi_pstore ip_tables x_tables autofs4 [last unloaded: cifs(OE)] [Thu Nov 6 16:47:23 2025] CPU: 7 PID: 61031 Comm: test.sh Tainted: G W OE 6.8.0-1041-azure #47~22.04.1-Ubuntu [Thu Nov 6 16:47:23 2025] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024 [Thu Nov 6 16:47:23 2025] RIP: 0010:cifs_extend_writeback+0x3f8/0x5b0 [cifs] [Thu Nov 6 16:47:23 2025] Code: 0f 84 8a 00 00 00 45 31 e4 45 89 e6 41 83 fc 0e 0f 87 38 01 00 00 4e 8b b4 f5 58 ff ff ff 4c 89 f7 e8 0c 60 79 de 84 c0 75 02 <0f> 0b 31 f6 4c 89 f7 41 83 c4 01 e8 d8 61 79 de 4c 89 f7 e8 40 83 [Thu Nov 6 16:47:23 2025] RSP: 0018:ffffcb4512a3f8c8 EFLAGS: 00010246 [Thu Nov 6 16:47:23 2025] RAX: 0000000000000000 RBX: ffffcb4512a3fa68 RCX: 0000000000000026 [Thu Nov 6 16:47:23 2025] RDX: 0000000000000000 RSI: 0000000000000006 RDI: fffff4c8c7532a40 [Thu Nov 6 16:47:23 2025] RBP: ffffcb4512a3f9d8 R08: 0000000000000048 R09: 0000000000000001 [Thu Nov 6 16:47:23 2025] R10: 000000000003acc0 R11: 0000000000000001 R12: 0000000000000007 [Thu Nov 6 16:47:23 2025] R13: 0000000000000000 R14: fffff4c8c7532a40 R15: 000000000000a8d5 [Thu Nov 6 16:47:23 2025] FS: 0000715b39af7740(0000) GS:ffff89825fd80000(0000) knlGS:0000000000000000 [Thu Nov 6 16:47:23 2025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Thu Nov 6 16:47:23 2025] CR2: 00007832fd266428 CR3: 0000000185532005 CR4: 0000000000b70ef0 [Thu Nov 6 16:47:23 2025] Call Trace: [Thu Nov 6 16:47:23 2025] <TASK> [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] ? __mod_memcg_lruvec_state+0xae/0x180 [Thu Nov 6 16:47:23 2025] ? 
__entry_text_end+0x1026c9/0x1026cd [Thu Nov 6 16:47:23 2025] cifs_writepages_region+0xabf/0xb80 [cifs] [Thu Nov 6 16:47:23 2025] cifs_writepages+0xf8/0x120 [cifs] [Thu Nov 6 16:47:23 2025] do_writepages+0xd1/0x1b0 [Thu Nov 6 16:47:23 2025] ? __xa_set_mark+0x65/0x90 [Thu Nov 6 16:47:23 2025] filemap_fdatawrite_wbc+0x75/0xa0 [Thu Nov 6 16:47:23 2025] __filemap_fdatawrite_range+0x54/0x70 [Thu Nov 6 16:47:23 2025] filemap_write_and_wait_range+0x51/0xb0 [Thu Nov 6 16:47:23 2025] cifs_flush+0x82/0x110 [cifs] [Thu Nov 6 16:47:23 2025] filp_flush+0x3c/0x90 [Thu Nov 6 16:47:23 2025] filp_close+0x15/0x30 [Thu Nov 6 16:47:23 2025] do_dup2+0x8a/0xe0 [Thu Nov 6 16:47:23 2025] ksys_dup3+0xa6/0x110 [Thu Nov 6 16:47:23 2025] __x64_sys_dup2+0x29/0xd0 [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] x64_sys_call+0x1dfe/0x20a0 [Thu Nov 6 16:47:23 2025] do_syscall_64+0x7c/0x160 [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] ? audit_reset_context.part.0.constprop.0+0x2a9/0x310 [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] ? syscall_exit_to_user_mode_prepare+0x117/0x160 [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] ? syscall_exit_to_user_mode+0x81/0x220 [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] ? do_syscall_64+0x88/0x160 [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] ? irqentry_exit_to_user_mode+0x7b/0x220 [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] ? irqentry_exit+0x1d/0x30 [Thu Nov 6 16:47:23 2025] ? srso_alias_return_thunk+0x5/0xfbef5 [Thu Nov 6 16:47:23 2025] entry_SYSCALL_64_after_hwframe+0x78/0x80 [Thu Nov 6 16:47:23 2025] RIP: 0033:0x715b3991508b [Thu Nov 6 16:47:23 2025] Code: 73 01 c3 48 8b 0d a5 4d 10 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 21 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4d 10 00 f7 d8 64 89 01 48 [Thu Nov 6 16:47:23 2025] RSP: 002b:00007ffcee3015d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000021 [Thu Nov 6 16:47:23 2025] RAX: ffffffffffffffda RBX: 000057ee9a27ff10 RCX: 0000715b3991508b [Thu Nov 6 16:47:23 2025] RDX: 000057ee8d0734d4 RSI: 0000000000000001 RDI: 000000000000000a [Thu Nov 6 16:47:23 2025] RBP: 00007ffcee301660 R08: 0000715b399d1460 R09: 000000000000000a [Thu Nov 6 16:47:23 2025] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009 [Thu Nov 6 16:47:23 2025] R13: 0000000000000007 R14: 000000000000000a R15: 0000000000000007 [Thu Nov 6 16:47:23 2025] </TASK> [Thu Nov 6 16:47:23 2025] ---[ end trace 0000000000000000 ]--- ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-11-06 15:00 ` Bharath SM 2025-11-06 15:20 ` Enzo Matsumiya 2025-11-06 16:22 ` David Howells @ 2025-11-10 16:00 ` David Howells 2025-11-11 9:22 ` David Howells 3 siblings, 0 replies; 29+ messages in thread From: David Howells @ 2025-11-10 16:00 UTC (permalink / raw) To: Bharath SM Cc: dhowells, Mark A Whiting, henrique.carvalho, Enzo Matsumiya, Steve French, Shyam Prasad, Paulo Alcantara, Heckmann, Ilja, linux-cifs@vger.kernel.org Bharath SM <bharathsm.hsk@gmail.com> wrote: > observed by comparing expected vs actual outputs. Easy reproducer: > > done | dd ibs=4194304 iflag=fullblock count=10240000 of=remotefile > 8999946 > <corrupt lines shows here> That reproducer would seem to be invalid shell. Were there lines preceding the "done" that need to be included? David ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-11-06 15:00 ` Bharath SM ` (2 preceding siblings ...) 2025-11-10 16:00 ` David Howells @ 2025-11-11 9:22 ` David Howells 2025-11-11 10:39 ` Shyam Prasad N 3 siblings, 1 reply; 29+ messages in thread From: David Howells @ 2025-11-11 9:22 UTC (permalink / raw) To: Bharath SM Cc: dhowells, Mark A Whiting, henrique.carvalho, Enzo Matsumiya, Steve French, Shyam Prasad, Paulo Alcantara, Heckmann, Ilja, linux-cifs@vger.kernel.org Okay, the patch isn't good from a quick scan of it. > + if (folio_test_writeback(folio)) { > + /* > + * For data-integrity syscalls (fsync(), msync()) we must wait for > + * the I/O to complete on the page. > + * For other cases (!sync), we can just skip this page, even if > + * it's dirty. > + */ > + if (!sync) { > + stop = false; > + goto unlock_next; > + } else { > + folio_wait_writeback(folio); You can't sleep here. The RCU read lock is held. There's no actual need to sleep here anyway - you can just stop and leave the function (well, set stop=true and break so that the accumulated batch is processed). The way the code is meant to work is that cifs_write_back_from_locked_folio() locks and waits for the first folio, then calls cifs_extend_writeback() to add more folios to the writeout - if it doesn't need to wait for them. You cannot skip any folios as the set has to be contiguous. If you skip one, you'll corrupt the file. Once a set of folios has been dispatched, cifs_writepages_begin() *should* begin with the next folio that hasn't been sent - quite possibly one just rejected by cifs_extend_writeback(). But at this point cifs_write_back_from_locked_folio() will wait for it. This should[*] work correctly, even in sync mode, because it should eventually wait for any folio that's already undergoing writeback - though it might be less efficient because if there are competing writebacks, they may end up forcing each other to produce very small writes (there's no writeback-vs-writeback locking apart from the individual folio locks). [*] At least as far as the design goes; that's not to say there isn't a bug in the implementation. That said, in sync mode, you might actually want cifs_extend_writeback() to wait - but in that case, you have to drop the RCU read lock before you do the wait and then reset the iteration correctly... and beware that doing that might advance[**] the iterator state. [**] It's possible that this is the actual cause of the bug - and that we're skipping the rejected folio because the xa_state isn't being correctly rewound. David ^ permalink raw reply [flat|nested] 29+ messages in thread
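The drop-the-lock-and-rewind dance described above has a standard shape for XArray walks: park the iterator with xas_pause() before leaving the RCU critical section, sleep, then re-enter and revalidate. A rough sketch of that shape -- an assumption about how a sync-mode wait could be structured, not a tested fix:

	/* Folio is locked and referenced; the RCU read lock is held for the walk. */
	if (folio_test_writeback(folio)) {
		if (!sync) {
			/* Dispatch what we batched so far; never skip the folio. */
			folio_unlock(folio);
			folio_put(folio);
			stop = true;
			break;
		}

		xas_pause(xas);		/* park the iterator state */
		rcu_read_unlock();	/* now it is legal to sleep */
		folio_wait_writeback(folio);
		rcu_read_lock();
		/*
		 * Everything must be revalidated here: the folio may have
		 * been truncated, split, or redirtied while we slept.
		 */
	}

Note the caveat in the message above: xas_pause() makes the next xas_next_entry() resume after the parked index, so the folio just waited on must still be handled before the walk continues, or it gets skipped -- which is exactly the [**] suspicion.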
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-11-11  9:22 ` David Howells
@ 2025-11-11 10:39 ` Shyam Prasad N
  2025-11-12 17:09 ` Shyam Prasad N
  0 siblings, 1 reply; 29+ messages in thread
From: Shyam Prasad N @ 2025-11-11 10:39 UTC (permalink / raw)
To: David Howells
Cc: Bharath SM, Mark A Whiting, henrique.carvalho, Enzo Matsumiya,
	Steve French, Paulo Alcantara, Heckmann, Ilja,
	linux-cifs@vger.kernel.org

On Tue, Nov 11, 2025 at 2:52 PM David Howells <dhowells@redhat.com> wrote:
>
> Okay, the patch isn't good from a quick scan of it.
>
> > +		if (folio_test_writeback(folio)) {
> > +			/*
> > +			 * For data-integrity syscalls (fsync(), msync()) we must wait for
> > +			 * the I/O to complete on the page.
> > +			 * For other cases (!sync), we can just skip this page, even if
> > +			 * it's dirty.
> > +			 */
> > +			if (!sync) {
> > +				stop = false;
> > +				goto unlock_next;
> > +			} else {
> > +				folio_wait_writeback(folio);
>
> You can't sleep here. The RCU read lock is held. There's no actual need to
> sleep here anyway - you can just stop and leave the function (well, set
> stop=true and break so that the accumulated batch is processed).
>
> The way the code is meant to work is that cifs_write_back_from_locked_folio()
> locks and waits for the first folio, then calls cifs_extend_writeback() to add
> more folios to the writeout - if it doesn't need to wait for them. You cannot
> skip any folios, as the set has to be contiguous. If you skip one, you'll
> corrupt the file.
>
> Once a set of folios has been dispatched, cifs_writepages_begin() *should*
> begin with the next folio that hasn't been sent - quite possibly one just
> rejected by cifs_extend_writeback(). But at that point
> cifs_write_back_from_locked_folio() will wait for it.
>
> This should[*] work correctly, even in sync mode, because it should eventually
> wait for any folio that's already undergoing writeback - though it might be
> less efficient, because if there are competing writebacks, they may end up
> forcing each other to produce very small writes (there's no
> writeback-vs-writeback locking apart from the individual folio locks).
>
> [*] At least as far as the design goes; that's not to say there isn't a bug in
> the implementation.
>
> That said, in sync mode, you might actually want cifs_extend_writeback() to
> wait - but in that case, you have to drop the RCU read lock before you do the
> wait and then reset the iteration correctly... and beware that doing that
> might advance[**] the iterator state.
>
> [**] It's possible that this is the actual cause of the bug - and that we're
> skipping the rejected folio because the xa_state hasn't been correctly
> rewound.
>
> David
>

Hi David,

Thanks for the detailed review and explanation.

Based on your explanation, the continue when folio_try_get() fails seems out
of place. Should that be a break instead?
https://elixir.bootlin.com/linux/v6.9.12/source/fs/smb/client/file.c#L2758
-------------------------
		if (!folio_try_get(folio)) {
			xas_reset(xas);
			continue;	<<<<<<<<<<<<<<<<
		}
-------------------------
However, that does not seem to be the cause of the corruption here.
But can it cause an unnecessary spin loop?

Enzo noted in his patch:
-------------------------
- Pointer arguments are updated before bound checking (actual root cause)
  @_len and @_count are updated with the current folio values before actually
  checking if the current values fit within their boundaries, so by the time
  the function exits, the caller (only cifs_write_back_from_locked_folio(),
  which BTW doesn't do any further checks) can see arguments that have crossed
  their bounds, and extra data (zeroes) is added as padding.
  Later, with those offsets marked as 'done', the real data that should have
  been written to those offsets is skipped, leaving the final file corrupt.
-------------------------
However, I have a hard time understanding why the zero padding would happen.

The bounds check does happen only after _count and _len are updated, which
means that _len can go above max_len and _count can go negative.
https://elixir.bootlin.com/linux/v6.9.12/source/fs/smb/client/file.c#L2793
-------------------------
		index += nr_pages;
		*_count -= nr_pages;	<<< updated first
		*_len += len;		<<< updated first
		if (max_pages <= 0 || *_len >= max_len || *_count <= 0)	<<< bounds check only afterwards
			stop = true;

		if (!folio_batch_add(&batch, folio))	<<< folio added to the batch regardless
			break;
		if (stop)
			break;
	}
-------------------------
But that would just mean that an extra folio could be added to the batch (and
marked for writeback), causing the batch to contain more len than max_len
(which is likely wsize).

If I follow the code further, adjust_credits can then be called with
wdata->bytes > wsize, and smb2_adjust_credits can then return -EOPNOTSUPP,
causing us to fail the write:
https://elixir.bootlin.com/linux/v6.9.12/source/fs/smb/client/smb2ops.c#L295
-------------------------
	if (credits->value < new_val) {
		trace_smb3_too_many_credits(server->CurrentMid,
				server->conn_id, server->hostname, 0,
				credits->value - new_val, 0);
		cifs_server_dbg(VFS, "request has less credits (%d) than required (%d)",
				credits->value, new_val);

		return -EOPNOTSUPP;
	}
-------------------------

However, Enzo's patch *does work* to prevent the data corruption, as verified
by Mark and Bharath. (There's a WARNING that gets thrown, but that's probably
due to a required check that Enzo missed in his change.)
What am I missing here?

-- 
Regards,
Shyam

^ permalink raw reply	[flat|nested] 29+ messages in thread
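For comparison, a minimal sketch of the reordering in question, with the
bounds checked before the folio is accounted and batched. The bound
expressions follow the draft patches later in the thread; treating this as
the final form is an assumption.
-------------------------
		len = folio_size(folio);

		/* Check the budgets *before* taking this folio: if it would
		 * overrun the page budget, the byte budget or the caller's
		 * remaining count, dispatch what we already have and leave
		 * it for the next writeback pass. */
		if ((max_pages - nr_pages) <= 0 ||
		    (*_len + len) >= max_len ||
		    (*_count - nr_pages) <= 0) {
			folio_unlock(folio);
			folio_put(folio);
			stop = true;
			break;
		}

		max_pages -= nr_pages;
		index += nr_pages;
		*_count -= nr_pages;
		*_len += len;

		if (!folio_batch_add(&batch, folio))
			break;
-------------------------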
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-11-11 10:39 ` Shyam Prasad N
@ 2025-11-12 17:09 ` Shyam Prasad N
  2025-11-12 18:14 ` David Howells
  0 siblings, 1 reply; 29+ messages in thread
From: Shyam Prasad N @ 2025-11-12 17:09 UTC (permalink / raw)
To: David Howells
Cc: Bharath SM, Mark A Whiting, henrique.carvalho, Enzo Matsumiya,
	Steve French, Paulo Alcantara, Heckmann, Ilja,
	linux-cifs@vger.kernel.org

On Tue, Nov 11, 2025 at 4:09 PM Shyam Prasad N <nspmangalore@gmail.com> wrote:
>
> On Tue, Nov 11, 2025 at 2:52 PM David Howells <dhowells@redhat.com> wrote:
> >
> > Okay, the patch isn't good from a quick scan of it.
> >
> > > +		if (folio_test_writeback(folio)) {
> > > +			/*
> > > +			 * For data-integrity syscalls (fsync(), msync()) we must wait for
> > > +			 * the I/O to complete on the page.
> > > +			 * For other cases (!sync), we can just skip this page, even if
> > > +			 * it's dirty.
> > > +			 */
> > > +			if (!sync) {
> > > +				stop = false;
> > > +				goto unlock_next;
> > > +			} else {
> > > +				folio_wait_writeback(folio);
> >
> > You can't sleep here. The RCU read lock is held. There's no actual need to
> > sleep here anyway - you can just stop and leave the function (well, set
> > stop=true and break so that the accumulated batch is processed).
> >
> > The way the code is meant to work is that cifs_write_back_from_locked_folio()
> > locks and waits for the first folio, then calls cifs_extend_writeback() to add
> > more folios to the writeout - if it doesn't need to wait for them. You cannot
> > skip any folios, as the set has to be contiguous. If you skip one, you'll
> > corrupt the file.
> >
> > Once a set of folios has been dispatched, cifs_writepages_begin() *should*
> > begin with the next folio that hasn't been sent - quite possibly one just
> > rejected by cifs_extend_writeback(). But at that point
> > cifs_write_back_from_locked_folio() will wait for it.
> >
> > This should[*] work correctly, even in sync mode, because it should eventually
> > wait for any folio that's already undergoing writeback - though it might be
> > less efficient, because if there are competing writebacks, they may end up
> > forcing each other to produce very small writes (there's no
> > writeback-vs-writeback locking apart from the individual folio locks).
> >
> > [*] At least as far as the design goes; that's not to say there isn't a bug in
> > the implementation.
> >
> > That said, in sync mode, you might actually want cifs_extend_writeback() to
> > wait - but in that case, you have to drop the RCU read lock before you do the
> > wait and then reset the iteration correctly... and beware that doing that
> > might advance[**] the iterator state.
> >
> > [**] It's possible that this is the actual cause of the bug - and that we're
> > skipping the rejected folio because the xa_state hasn't been correctly
> > rewound.
> >
> > David
> >
>
> Hi David,
>
> Thanks for the detailed review and explanation.
>
> Based on your explanation, the continue when folio_try_get() fails seems out
> of place. Should that be a break instead?
> https://elixir.bootlin.com/linux/v6.9.12/source/fs/smb/client/file.c#L2758
> -------------------------
> 		if (!folio_try_get(folio)) {
> 			xas_reset(xas);
> 			continue;	<<<<<<<<<<<<<<<<
> 		}
> -------------------------
> However, that does not seem to be the cause of the corruption here.
> But can it cause an unnecessary spin loop?
>
> Enzo noted in his patch:
> -------------------------
> - Pointer arguments are updated before bound checking (actual root cause)
>   @_len and @_count are updated with the current folio values before actually
>   checking if the current values fit within their boundaries, so by the time
>   the function exits, the caller (only cifs_write_back_from_locked_folio(),
>   which BTW doesn't do any further checks) can see arguments that have crossed
>   their bounds, and extra data (zeroes) is added as padding.
>   Later, with those offsets marked as 'done', the real data that should have
>   been written to those offsets is skipped, leaving the final file corrupt.
> -------------------------
> However, I have a hard time understanding why the zero padding would happen.

I think I understand what's going on here, and why Enzo's patch works to
prevent it.

I think what's going on is that cifs_extend_writeback can pick up a folio
dirtied by an extending write, while the writeback is clamped to a stale
local copy of i_size; that can cause a short write, yet the page is still
marked clean.

As an example, consider this scenario:
1. The first write to the file happens: offset 0, len 5k.
2. Writeback starts for the range (0-5k).
3. Writeback locks page 1 in cifs_writepages_begin, but does not lock page 2
   yet.
4. Page 2 is now written to by the next write, which extends the file by
   another 5k. Pages 2 and 3 are now marked dirty.
5. We now reach cifs_extend_writeback, where we extend the writeback to
   include the next folio (even though, given the stale i_size, it can only
   be partially written). We will mark page 2 for writeback.
6. But after exiting cifs_extend_writeback, we clamp the writeback to i_size,
   which was 5k when the writeback started. So we write only the first 1k of
   page 2.
7. We will nevertheless mark page 2 as flushed and clean. So the remaining
   contents of page 2 will not be written to the server (hence the hole of
   zeros in that gap, unless that range gets overwritten later).

With the proposed patch, the partially written folio is skipped in
cifs_extend_writeback, which should take care of the issue.

@David Howells / @Enzo Matsumiya / @Steve French:
The above is just a theory; I still need to prove it. Please review.
I'm working on a revised patch (doing the boundary checking like Enzo's
change did, but addressing David's review comments). Depending on how the
testing goes, I think we should be able to submit the patch tomorrow.

>
> The bounds check does happen only after _count and _len are updated, which
> means that _len can go above max_len and _count can go negative.
> https://elixir.bootlin.com/linux/v6.9.12/source/fs/smb/client/file.c#L2793
> -------------------------
> 		index += nr_pages;
> 		*_count -= nr_pages;	<<< updated first
> 		*_len += len;		<<< updated first
> 		if (max_pages <= 0 || *_len >= max_len || *_count <= 0)	<<< bounds check only afterwards
> 			stop = true;
>
> 		if (!folio_batch_add(&batch, folio))	<<< folio added to the batch regardless
> 			break;
> 		if (stop)
> 			break;
> 	}
> -------------------------
> But that would just mean that an extra folio could be added to the batch (and
> marked for writeback), causing the batch to contain more len than max_len
> (which is likely wsize).
>
> If I follow the code further, adjust_credits can then be called with
> wdata->bytes > wsize, and smb2_adjust_credits can then return -EOPNOTSUPP,
> causing us to fail the write:
> https://elixir.bootlin.com/linux/v6.9.12/source/fs/smb/client/smb2ops.c#L295
> -------------------------
> 	if (credits->value < new_val) {
> 		trace_smb3_too_many_credits(server->CurrentMid,
> 				server->conn_id, server->hostname, 0,
> 				credits->value - new_val, 0);
> 		cifs_server_dbg(VFS, "request has less credits (%d) than required (%d)",
> 				credits->value, new_val);
>
> 		return -EOPNOTSUPP;
> 	}
> -------------------------
>
> However, Enzo's patch *does work* to prevent the data corruption, as verified
> by Mark and Bharath. (There's a WARNING that gets thrown, but that's probably
> due to a required check that Enzo missed in his change.)
> What am I missing here?
>
> --
> Regards,
> Shyam

-- 
Regards,
Shyam

^ permalink raw reply	[flat|nested] 29+ messages in thread
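For reference, the clamp Shyam's scenario turns on sits at the end of
cifs_write_back_from_locked_folio(). The fragment below paraphrases the 6.9
structure (elisions and comments added) to show where the stale sample bites;
it is a sketch, not an exact copy of the function.
-------------------------
	/* i_size is sampled once, before the batch is extended ... */
	i_size = i_size_read(inode);
	...
	if (max_pages > 0)
		cifs_extend_writeback(mapping, xas, &count, start,
				      max_pages, max_len, &len);
	...
	/* ... but the dispatch length is clamped to that stale sample, so a
	 * folio dirtied by a racing extending write can be sent only
	 * partially - even though it has already been marked clean. */
	len = min_t(unsigned long long, len, i_size - start);
-------------------------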
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-11-12 17:09 ` Shyam Prasad N
@ 2025-11-12 18:14 ` David Howells
  2025-11-13 12:04 ` Shyam Prasad N
  0 siblings, 1 reply; 29+ messages in thread
From: David Howells @ 2025-11-12 18:14 UTC (permalink / raw)
To: Shyam Prasad N
Cc: dhowells, Bharath SM, Mark A Whiting, henrique.carvalho,
	Enzo Matsumiya, Steve French, Paulo Alcantara, Heckmann, Ilja,
	linux-cifs@vger.kernel.org

Shyam Prasad N <nspmangalore@gmail.com> wrote:

> 4. Page 2 is now written to by the next write, which extends the file by
>    another 5k. Pages 2 and 3 are now marked dirty.

It sounds like this is the crux of the problem. The caller,
cifs_write_back_from_locked_folio(), has a stale copy of ->i_size.

I think we need to do one or more of:

 (1) our copy of i_size should be updated by cifs_extend_writeback() whilst
     we hold the lock on a folio;

 (2) we should reject what appears (given the stale i_size) to be a partial
     folio if inode->i_size has changed;

 (3) cifs_extend_writeback() should just stop - and maybe abandon the current
     batch - if it sees that i_size has changed.

Note that abandoning the contents of the batch rather than doing the
flag-flipping loop is fine, as they're merely locked up to that point and we
don't have to reverse the PG_dirty -> PG_writeback transition.

David

^ permalink raw reply	[flat|nested] 29+ messages in thread
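As an aside, abandoning a gathered batch is cheap precisely because of the
ordering David describes. A hypothetical helper (not in the tree, sketched
here only to illustrate the point) would only need to unlock and release,
since the PG_dirty -> PG_writeback flip happens later, in the dispatch loop:
-------------------------
static void cifs_abandon_batch(struct folio_batch *batch)
{
	unsigned int i;

	/* Folios in a gathered-but-undispatched batch are merely locked
	 * and referenced; no writeback flags have been set yet. */
	for (i = 0; i < folio_batch_count(batch); i++)
		folio_unlock(batch->folios[i]);
	folio_batch_release(batch);	/* drops the references */
}
-------------------------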
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-11-12 18:14 ` David Howells
@ 2025-11-13 12:04 ` Shyam Prasad N
  2025-11-13 13:57 ` David Howells
  0 siblings, 1 reply; 29+ messages in thread
From: Shyam Prasad N @ 2025-11-13 12:04 UTC (permalink / raw)
To: David Howells
Cc: Bharath SM, Mark A Whiting, henrique.carvalho, Enzo Matsumiya,
	Steve French, Paulo Alcantara, Heckmann, Ilja,
	linux-cifs@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2413 bytes --]

Hi David,

Thanks for taking a look. Please review the draft patch attached.

On Wed, Nov 12, 2025 at 11:44 PM David Howells <dhowells@redhat.com> wrote:
>
> Shyam Prasad N <nspmangalore@gmail.com> wrote:
>
> > 4. Page 2 is now written to by the next write, which extends the file by
> >    another 5k. Pages 2 and 3 are now marked dirty.
>
> It sounds like this is the crux of the problem. The caller,
> cifs_write_back_from_locked_folio(), has a stale copy of ->i_size.
>
> I think we need to do one or more of:
>
>  (1) our copy of i_size should be updated by cifs_extend_writeback() whilst
>      we hold the lock on a folio;
>
>  (2) we should reject what appears (given the stale i_size) to be a partial
>      folio if inode->i_size has changed;
>
>  (3) cifs_extend_writeback() should just stop - and maybe abandon the
>      current batch - if it sees that i_size has changed.
>
> Note that abandoning the contents of the batch rather than doing the
> flag-flipping loop is fine, as they're merely locked up to that point and we
> don't have to reverse the PG_dirty -> PG_writeback transition.
>
> David
>

I'm contemplating whether the change to retrieve the updated i_size in
cifs_extend_writeback is really necessary.
With the attached patch, we ensure that we never include the last folio in
the extended writeback (if it is a partial write). It should, however, be
picked up for writeback in the next cifs_writepages_begin.
Even if i_size has increased, would we not pick up the updated size in the
following iteration and write back the last folio then?
If i_size has decreased, we already seem to have handling for that.

Also, regarding your suggestion to move the xas declaration into
cifs_writepages_begin: do you think this change is necessary? The way I see
it, xas_reset would reset the state to how it began, i.e. to whatever index
it was initialized to. If we don't make the change you suggested, we might do
a lot of unnecessary iterations in xas_find_marked (inside
cifs_writepages_begin):
https://elixir.bootlin.com/linux/v6.9.12/source/fs/smb/client/file.c#L2986
But I don't see any functional issues due to this. Do you agree?

Since this is for the stable trees, I'm a bit hesitant to make those changes
unless we have strong reason to believe they are necessary.

-- 
Regards,
Shyam

[-- Attachment #2: 0001-cifs-avoid-writeback-extension-beyond-the-len-specif.patch --]
[-- Type: application/octet-stream, Size: 2974 bytes --]

From ec0fb469671ac894dae99b7dff86a52ce1e4ab11 Mon Sep 17 00:00:00 2001
From: Shyam Prasad N <sprasad@microsoft.com>
Date: Thu, 13 Nov 2025 14:32:26 +0530
Subject: [PATCH] cifs: avoid writeback extension beyond the len specified

cifs_extend_writeback can pick up a folio dirtied by an extending write,
but the writeback is clamped to a stale i_size local variable, which can
cause short writes that nevertheless mark the page clean. This can cause
data corruption.

As an example, consider this scenario:
1. The first write to the file happens: offset 0, len 5k.
2. Writeback starts for the range (0-5k).
3. Writeback locks page 1 in cifs_writepages_begin, but does not lock
   page 2 yet.
4. Page 2 is now written to by the next write, which extends the file by
   another 5k. Pages 2 and 3 are now marked dirty.
5. We now reach cifs_extend_writeback, where we extend the writeback to
   include the next folio (even though it can only be partially written).
   We will mark page 2 for writeback.
6. But after exiting cifs_extend_writeback, we clamp the writeback to
   i_size, which was 5k when the writeback started. So we write only the
   first 1k of page 2.
7. We will nevertheless mark page 2 as flushed and clean. So the
   remaining contents of page 2 will not be written to the server (hence
   the hole in that gap, unless that range gets overwritten).

With this patch, the boundary check in cifs_extend_writeback happens
before the folio is added to the extended folio batch.

This fix also changes the error handling in cifs_extend_writeback when
folio_try_get fails: we now stop the extension instead of continuing.

Cc: stable@kernel.org # v6.3~v6.9
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Suggested-by: Enzo Matsumiya <ematsumiya@suse.de>
Reported-by: Mark A Whiting <whitingm@opentext.com>
---
 fs/smb/client/file.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 7a2b81fbd9cfd..1c583f970e177 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -2779,7 +2779,7 @@ static void cifs_extend_writeback(struct address_space *mapping,
 
 		if (!folio_try_get(folio)) {
 			xas_reset(xas);
-			continue;
+			break;
 		}
 		nr_pages = folio_nr_pages(folio);
 		if (nr_pages > max_pages) {
@@ -2807,15 +2807,24 @@ static void cifs_extend_writeback(struct address_space *mapping,
 			break;
 		}
 
-		max_pages -= nr_pages;
 		len = folio_size(folio);
+
+		/* boundary check before the folio is accounted */
+		if ((max_pages - nr_pages) <= 0 ||
+		    (*_len + len) >= max_len ||
+		    (*_count - nr_pages) <= 0) {
+			folio_unlock(folio);
+			folio_put(folio);
+			xas_reset(xas);
+			break;
+		}
+		stop = false;
+		max_pages -= nr_pages;
 		index += nr_pages;
 		*_count -= nr_pages;
 		*_len += len;
-		if (max_pages <= 0 || *_len >= max_len || *_count <= 0)
-			stop = true;
 
 		if (!folio_batch_add(&batch, folio))
 			break;
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-11-13 12:04 ` Shyam Prasad N
@ 2025-11-13 13:57 ` David Howells
  2025-11-14 9:27 ` Shyam Prasad N
  0 siblings, 1 reply; 29+ messages in thread
From: David Howells @ 2025-11-13 13:57 UTC (permalink / raw)
To: Shyam Prasad N
Cc: dhowells, Bharath SM, Mark A Whiting, henrique.carvalho,
	Enzo Matsumiya, Steve French, Paulo Alcantara, Heckmann, Ilja,
	linux-cifs@vger.kernel.org

Shyam Prasad N <nspmangalore@gmail.com> wrote:

> +
> +		/* boundary check before the folio is accounted */
> +		if ((max_pages - nr_pages) <= 0 ||
> +		    (*_len + len) >= max_len ||
> +		    (*_count - nr_pages) <= 0) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			xas_reset(xas);
> +			break;
> +		}
> +

I suspect that whilst this will work, it will mean that you can never append
a partial page to a write op; a partial page will always have to start a new
RPC call.

So if you do:

	fd = open("/cifs/foo");
	write(fd, buf, 0x1001);
	close(fd);

it will make two Write ops, one for 4K and one for 1 byte (give or take
PAGE_SIZE == 4096).

David

^ permalink raw reply	[flat|nested] 29+ messages in thread
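David's fragment, made compilable for anyone who wants to observe the
predicted RPC split. The path is a placeholder for a file on the mount under
test, and counting the Write ops is left to Wireshark or the smb3
tracepoints:
-------------------------
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[0x1001];	/* one page plus one byte */
	int fd = open("/cifs/foo", O_CREAT | O_TRUNC | O_WRONLY, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 0x5a, sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		perror("write");
	close(fd);	/* watch whether the flush produces one Write op or two */
	return 0;
}
-------------------------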
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-11-13 13:57 ` David Howells
@ 2025-11-14 9:27 ` Shyam Prasad N
  2025-11-14 10:49 ` David Howells
  0 siblings, 1 reply; 29+ messages in thread
From: Shyam Prasad N @ 2025-11-14 9:27 UTC (permalink / raw)
To: David Howells
Cc: Bharath SM, Mark A Whiting, henrique.carvalho, Enzo Matsumiya,
	Steve French, Paulo Alcantara, Heckmann, Ilja,
	linux-cifs@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1342 bytes --]

On Thu, Nov 13, 2025 at 7:27 PM David Howells <dhowells@redhat.com> wrote:
>
> Shyam Prasad N <nspmangalore@gmail.com> wrote:
>
> > +
> > +		/* boundary check before the folio is accounted */
> > +		if ((max_pages - nr_pages) <= 0 ||
> > +		    (*_len + len) >= max_len ||
> > +		    (*_count - nr_pages) <= 0) {
> > +			folio_unlock(folio);
> > +			folio_put(folio);
> > +			xas_reset(xas);
> > +			break;
> > +		}
> > +
>
> I suspect that whilst this will work, it will mean that you can never
> append a partial page to a write op; a partial page will always have to
> start a new RPC call.
>
> So if you do:
>
>	fd = open("/cifs/foo");
>	write(fd, buf, 0x1001);
>	close(fd);
>
> it will make two Write ops, one for 4K and one for 1 byte (give or take
> PAGE_SIZE == 4096).
>
> David
>

Spoke to David yesterday to discuss this further.

David: here's a patch with the changes you suggested. This fixes the
corruption and has not affected other write patterns in my testing.
Please let me know if I can add your "Reviewed-by" and submit to stable
today.

-- 
Regards,
Shyam

[-- Attachment #2: 0001-cifs-stop-writeback-extension-when-change-of-size-is.patch --]
[-- Type: application/octet-stream, Size: 3444 bytes --]

From c4494f7ad7c0dd46b67dc67058507278e56c9311 Mon Sep 17 00:00:00 2001
From: Shyam Prasad N <sprasad@microsoft.com>
Date: Fri, 14 Nov 2025 14:30:55 +0530
Subject: [PATCH] cifs: stop writeback extension when change of size is detected

cifs_extend_writeback can pick up a folio dirtied by an extending write,
but the writeback is clamped to a stale i_size local variable, which can
cause short writes that nevertheless mark the page clean. This can cause
data corruption.

As an example, consider this scenario:
1. The first write to the file happens: offset 0, len 5k.
2. Writeback starts for the range (0-5k).
3. Writeback locks page 1 in cifs_writepages_begin, but does not lock
   page 2 yet.
4. Page 2 is now written to by the next write, which extends the file by
   another 5k. Pages 2 and 3 are now marked dirty.
5. We now reach cifs_extend_writeback, where we extend the writeback to
   include the next folio (even though it can only be partially written).
   We will mark page 2 for writeback.
6. But after exiting cifs_extend_writeback, we clamp the writeback to
   i_size, which was 5k when the writeback started. So we write only the
   first 1k of page 2.
7. We will nevertheless mark page 2 as flushed and clean. So the
   remaining contents of page 2 will not be written to the server (hence
   the hole in that gap, unless that range gets overwritten).

With this patch, we make sure not to extend the writeback any further
when a change in the file size is detected.

This fix also changes the error handling in cifs_extend_writeback when
folio_try_get fails: we now stop the extension instead of continuing.

Cc: stable@kernel.org # v6.3~v6.9
Suggested-by: David Howells <dhowells@redhat.com>
Reported-by: Mark A Whiting <whitingm@opentext.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
---
 fs/smb/client/file.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 7a2b81fbd9cfd..cc2ba2b18c8d4 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -2747,8 +2747,10 @@ static void cifs_extend_writeback(struct address_space *mapping,
 				  loff_t start,
 				  int max_pages,
 				  loff_t max_len,
-				  size_t *_len)
+				  size_t *_len,
+				  unsigned long long i_size)
 {
+	struct inode *inode = mapping->host;
 	struct folio_batch batch;
 	struct folio *folio;
 	unsigned int nr_pages;
@@ -2779,7 +2781,7 @@ static void cifs_extend_writeback(struct address_space *mapping,
 
 		if (!folio_try_get(folio)) {
 			xas_reset(xas);
-			continue;
+			break;
 		}
 		nr_pages = folio_nr_pages(folio);
 		if (nr_pages > max_pages) {
@@ -2799,6 +2801,15 @@ static void cifs_extend_writeback(struct address_space *mapping,
 			xas_reset(xas);
 			break;
 		}
+
+		/* if file size is changing, stop extending */
+		if (i_size_read(inode) != i_size) {
+			folio_unlock(folio);
+			folio_put(folio);
+			xas_reset(xas);
+			break;
+		}
+
 		if (!folio_test_dirty(folio) ||
 		    folio_test_writeback(folio)) {
 			folio_unlock(folio);
@@ -2930,7 +2941,8 @@ static ssize_t cifs_write_back_from_locked_folio(struct address_space *mapping,
 
 		if (max_pages > 0)
 			cifs_extend_writeback(mapping, xas, &count, start,
-					      max_pages, max_len, &len);
+					      max_pages, max_len, &len,
+					      i_size);
 		}
 	}
 	len = min_t(unsigned long long, len, i_size - start);
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread
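For testing patches like the above, a reproducer along the lines of Mark's
workload can be sketched as follows. The mount point, chunk size and
iteration count are assumptions; the essential ingredient is that every
write() ends mid-page, so writeback keeps racing with the next extending
write:
-------------------------
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[5000];		/* deliberately not page-aligned */
	/* Placeholder path on the cifs mount being tested. */
	int fd = open("/mnt/cifs/testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 0xa5, sizeof(buf));

	/* ~5 GB of unaligned extending writes, as in the original report. */
	for (long i = 0; i < 1000000; i++) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			return 1;
		}
	}
	close(fd);
	/* Read the file back and scan for sub-page runs of zeros ending on
	 * a page boundary - the signature from Mark's report. */
	return 0;
}
-------------------------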
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-11-14 9:27 ` Shyam Prasad N
@ 2025-11-14 10:49 ` David Howells
  2025-11-14 10:57 ` Shyam Prasad N
  0 siblings, 1 reply; 29+ messages in thread
From: David Howells @ 2025-11-14 10:49 UTC (permalink / raw)
To: Shyam Prasad N
Cc: dhowells, Bharath SM, Mark A Whiting, henrique.carvalho,
	Enzo Matsumiya, Steve French, Paulo Alcantara, Heckmann, Ilja,
	linux-cifs@vger.kernel.org

Looks good.

Acked-by: David Howells <dhowells@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-11-14 10:49 ` David Howells
@ 2025-11-14 10:57 ` Shyam Prasad N
  0 siblings, 0 replies; 29+ messages in thread
From: Shyam Prasad N @ 2025-11-14 10:57 UTC (permalink / raw)
To: David Howells
Cc: Bharath SM, Mark A Whiting, henrique.carvalho, Enzo Matsumiya,
	Steve French, Paulo Alcantara, Heckmann, Ilja,
	linux-cifs@vger.kernel.org

On Fri, Nov 14, 2025 at 4:19 PM David Howells <dhowells@redhat.com> wrote:
>
> Looks good.
>
> Acked-by: David Howells <dhowells@redhat.com>
>

Thanks David.

-- 
Regards,
Shyam

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-03-27 13:07 ` Enzo Matsumiya
  2025-03-27 19:31 ` [EXTERNAL] - " Mark A Whiting
  2025-03-31 19:47 ` Mark A Whiting
@ 2025-11-07 12:54 ` Shyam Prasad N
  2025-11-10 4:37 ` Shyam Prasad N
  2 siblings, 1 reply; 29+ messages in thread
From: Shyam Prasad N @ 2025-11-07 12:54 UTC (permalink / raw)
To: Enzo Matsumiya
Cc: Mark A Whiting, Steve French, Heckmann, Ilja,
	linux-cifs@vger.kernel.org, henrique.carvalho@suse.com

On Thu, Mar 27, 2025 at 6:39 PM Enzo Matsumiya <ematsumiya@suse.de> wrote:
>
> Hi Mark,
>
> On 03/27, Mark A Whiting wrote:
> >>This is the fix we used (rebased on top of v6.6.71 tag):
> >>https://git.exis.tech/linux.git/commit/?h=data_corruption_v6.x&id=8d4c40e084f3d132434d5d3d068175c8db59ce65
> >
> >I tried following the link but it gave me a "502 Bad Gateway" error, I also
> >tried the link on my personal machine at home in case it was my corporate
> >network blocking things, same result. I don't know how big the patch is.
> >Any chance you could just drop it in this thread?
>
> Yes, sorry about that, I'm having problems on that server.
> Patch is attached.
>
> >>@Ilja @Mark could you test it with your reproducer please?
> >>@Steve can you try it with the reproducer mentioned in the commit message
> >>please?
> >
> >I would be happy to try it out.
>
> Thanks, I'm eager to know the results.
>
>
> Cheers,
>
> Enzo

Hi Enzo,

I reviewed the patch from my end.

I think you missed the test for non-dirty folios in your change:
-------------------------
-		if (!folio_test_dirty(folio) ||
-		    folio_test_writeback(folio)) {
+		if (folio_test_writeback(folio)) {
-------------------------
I think this is why Mark and Bharath saw the WARNING with this patch. It
happens when there's a clean folio in the extended range: with your patch,
such a folio gets added to the batch as well.

I also did not understand the reason for setting stop to false in these
cases:
-------------------------
+		if (xa_is_value(folio)) {
+			stop = false;
 			break;
 		}

+		if (unlikely(folio != xas_reload(xas) ||
+			     folio->mapping != mapping)) {
+			stop = false;
+			goto put_next;
 		}
-------------------------
It looks to me like it's a bug if we're hitting either of the above
conditions, i.e. the file mapping should always match and the folio should
always be a pointer in the file mapping.
Won't we end up in an infinite loop if something ever causes these to be
true?

The rest of the changes look good to me.

-- 
Regards,
Shyam

^ permalink raw reply	[flat|nested] 29+ messages in thread
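A sketch of the break Shyam is suggesting in place of the original continue.
Whether the failing folio_try_get() can actually persist long enough to spin
is left open in the thread, so this is illustrative rather than a verified
fix:
-------------------------
		if (!folio_try_get(folio)) {
			/*
			 * The reference grab failed (e.g. the refcount is
			 * frozen for a split or migration).  After
			 * xas_reset() the walk would re-find this same
			 * index, so `continue` can spin; `break` instead
			 * dispatches the batch gathered so far and lets a
			 * later writeback pass retry from here.
			 */
			xas_reset(xas);
			break;
		}
-------------------------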
* Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71
  2025-11-07 12:54 ` Shyam Prasad N
@ 2025-11-10 4:37 ` Shyam Prasad N
  0 siblings, 0 replies; 29+ messages in thread
From: Shyam Prasad N @ 2025-11-10 4:37 UTC (permalink / raw)
To: Enzo Matsumiya
Cc: Mark A Whiting, Steve French, Heckmann, Ilja,
	linux-cifs@vger.kernel.org, henrique.carvalho@suse.com

On Fri, Nov 7, 2025 at 6:24 PM Shyam Prasad N <nspmangalore@gmail.com> wrote:
>
> On Thu, Mar 27, 2025 at 6:39 PM Enzo Matsumiya <ematsumiya@suse.de> wrote:
> >
> > Hi Mark,
> >
> > On 03/27, Mark A Whiting wrote:
> > >>This is the fix we used (rebased on top of v6.6.71 tag):
> > >>https://git.exis.tech/linux.git/commit/?h=data_corruption_v6.x&id=8d4c40e084f3d132434d5d3d068175c8db59ce65
> > >
> > >I tried following the link but it gave me a "502 Bad Gateway" error, I
> > >also tried the link on my personal machine at home in case it was my
> > >corporate network blocking things, same result. I don't know how big the
> > >patch is. Any chance you could just drop it in this thread?
> >
> > Yes, sorry about that, I'm having problems on that server.
> > Patch is attached.
> >
> > >>@Ilja @Mark could you test it with your reproducer please?
> > >>@Steve can you try it with the reproducer mentioned in the commit
> > >>message please?
> > >
> > >I would be happy to try it out.
> >
> > Thanks, I'm eager to know the results.
> >
> >
> > Cheers,
> >
> > Enzo
>
> Hi Enzo,
>
> I reviewed the patch from my end.
>
> I think you missed the test for non-dirty folios in your change:
> -------------------------
> -		if (!folio_test_dirty(folio) ||
> -		    folio_test_writeback(folio)) {
> +		if (folio_test_writeback(folio)) {
> -------------------------
> I think this is why Mark and Bharath saw the WARNING with this patch. It
> happens when there's a clean folio in the extended range: with your patch,
> such a folio gets added to the batch as well.
>
> I also did not understand the reason for setting stop to false in these
> cases:
> -------------------------
> +		if (xa_is_value(folio)) {
> +			stop = false;
> 			break;
> 		}
>
> +		if (unlikely(folio != xas_reload(xas) ||
> +			     folio->mapping != mapping)) {
> +			stop = false;
> +			goto put_next;
> 		}
> -------------------------
> It looks to me like it's a bug if we're hitting either of the above
> conditions, i.e. the file mapping should always match and the folio should
> always be a pointer in the file mapping.
> Won't we end up in an infinite loop if something ever causes these to be
> true?
>
> The rest of the changes look good to me.
>
> --
> Regards,
> Shyam

Hi Enzo, David and Steve,

Can we please review the above and send out a final patch to stable?
Without this patch, the affected kernels have a serious data corruption bug.

-- 
Regards,
Shyam

^ permalink raw reply	[flat|nested] 29+ messages in thread
* AW: [[ EXT ]] Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 2025-03-26 18:58 ` Steve French 2025-03-26 21:13 ` Enzo Matsumiya @ 2025-03-27 10:09 ` Heckmann, Ilja 1 sibling, 0 replies; 29+ messages in thread From: Heckmann, Ilja @ 2025-03-27 10:09 UTC (permalink / raw) To: Steve French; +Cc: Mark A Whiting, linux-cifs@vger.kernel.org No, we only ran tests on whatever kernel version was available at the time on our Fedora machines. As I mentioned, our level of expertise in kernel-level stuff ... has room for improvement :) I'm not exactly sure how to set up a machine for bisection: What distro to start with? How to install a locally compiled kernel? How far in kernel version history could I go back without the rest of the system breaking? But if somebody could provide me with information or maybe even a complete step-by-step guide, I'd be willing to give it a try. ________________________________________ Von: Steve French <smfrench@gmail.com> Gesendet: Mittwoch, 26. März 2025 19:58:20 An: Heckmann, Ilja Cc: Mark A Whiting; linux-cifs@vger.kernel.org Betreff: [[ EXT ]] Re: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 Were you able to confirm that the problem started after 6.6.0 but regressed before 6.6.9 - any chance of narrowing the regression down by bisection? On Wed, Mar 26, 2025 at 5:13 AM Heckmann, Ilja <heckmann@izw-berlin.de> wrote: > > We ran into what probably is the same problem with silent data corruption that was only noticed thanks to using a data format with internal checksums. It also went away when mounting a share with "cache=none" while running the kernel 6.6.9, but that had the side-effect that no executables could be started from the share (I reported this in June 2024). This second problem was fixed in 6.10, but at the same time mounting with "cache=none" stopped helping against the data corruption issue. It persists until now, with kernel 6.12.8, although the frequency at which the problem manifests went down significantly. > > The way we test for it is by running a certain workload 100 times in a loop and counting the number of runs aborted because of errors. That number went down from about 10 per 100 runs with kernel 6.6.9 to about 1 per 100 runs with 6.12.8. Its non-deterministic nature and the lack of in-house expertise to investigate the issue at the same level as Mark did stopped us from reporting it so far. And while there is no way of knowing that the issue we observe in 6.12.8 is the same one, at least I can confirm that there is a similar issue in more recent kernel versions as well. > > Best wishes, > Ilja Heckmann > ________________________________________ > Von: Mark A Whiting <whitingm@opentext.com> > Gesendet: Dienstag, 25. März 2025 22:24:55 > An: linux-cifs@vger.kernel.org > Betreff: [[ EXT ]] [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 > > Hello, > > I have discovered a data corruption issue with our application writing to a CIFS share. I believe this issue may be related to another report I saw on this mailing list, https://lore.kernel.org/linux-cifs/DFC1DAC5-5C6C-4DC2-807A-DAF12E4B7882@gmail.com/. I understand that updating to a newer kernel would likely fix this issue. However, at the moment, that's not an option for us. In the long term we are looking to upgrade to 6.12 but I'm hoping to find a solution for our current 6.6 kernel. 
> > I have tested mounting with the "cache=none" option and that solves the problem, albeit with a very large performance hit. > > The platform is an embedded system. We're using an off-the-shelf COM Express Type 7 module with an Intel XEON D-1713NT processor. We're running a custom Linux system built using Buildroot, currently running the 6.6.71 kernel. I've tested the latest 6.6.84 kernel and the problem still exists there. Our application is writing large amounts of compressed data (4+ GB) to the network share. When I read back the data to verify it, I'm seeing small portions of the file that have been replaced with zeros. > > I've attacked the issue from several angles. Starting with a TCP dump of a complete operation from mounting, data transfer, to unmounting the network share. Through Wireshark I can see that there is no write command to the server covering the sections of the output that ends up as zeros. This indicated to me that the CIFS kernel driver is failing to write out portions of the file. > > I then enabled all the CIFS debug info I could via cifsFYI and the kernel dynamic debug controls and tweaked the code to not rate limit the pr_debug calls. I could trace through the resulting logs and find pairs of cifs_write_begin() / cifs_write_end() that covered all the data including the sections that ultimately don't get written out. However, tracing through the smb2_async_writev() messages I again could not find any writes that covered the corrupt portions. At this point I began to suspect some kind of race condition within the cifs_writepages() function. > > I also analyzed the data corruption and noticed a pattern. It does not fail 100% of the time, and it does not always fail in the same place. This furthered my belief that it was some kind of non-deterministic data race. The corrupt data region is always less than a page in size (<4096 bytes), it's always zeros, and it always ends on a page boundary. Because I knew the expected format of the data, I could also tell that the corrupt data was always at the beginning of a write syscall by our application. > > I've attempted to read through the CIFS kernel code involved in this. But I've never worked in the VFS/filesystem layers before. And I'm having trouble following / understanding the intricacies of the page cache, page dirtying/cleaning, and writeback. > > My current best guess at what's happening is as follows: > * Our application writes out a buffer of data to the file on a CIFS share, this is compressed data that isn't nicely aligned, the data does not end on a page boundary. This is a newly created file that we are writing to, so this write extends the files EOF to the end of the newly written data which is in the middle of a page in the cache. > * cifs_writepages() is invoked to write the cached data back to the server, it scans the cached pages and prepares to write out all the dirty pages (including the final partial page). > * Our application performs another write. This extends the file and the beginning of this write falls into the end of the previous final partial cached page. > * cifs_writepages() finishes writing out the dirty pages, including the first portion of what it thought was the final partial page, and marks all pages as clean. > * On the next invocation of cifs_writepages(), it scans for dirty pages and skips the beginning of the second write because it thinks that page is clean. The following page is a completely new page and is dirty, so it starts a new write from that page. 
This would explain why the corruption is always at the beginning of our application's write and corrects itself at the next page boundary. > > I have yet to really prove this, but this type of race between dirty/clean pages would explain all the behavior I'm seeing. I'm hoping someone much more intimately familiar with the CIFS code can help point me in the right direction. > > I did try one quick and dirty fix, assuming it was a race I applied the following patch. This added a per inode mutex that completely serialized the cifs_write_begin(), cifs_write_end(), and cifs_writepages() functions. This did seem to resolve the data corruption issue, but at the cost of occasional deadlocks writing to CIFS files. > > > diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c > > index bbb0ef18d7b8..6e2e273b9838 100644 > > --- a/fs/smb/client/cifsfs.c > > +++ b/fs/smb/client/cifsfs.c > > @@ -1659,6 +1659,7 @@ cifs_init_once(void *inode) > > > > inode_init_once(&cifsi->netfs.inode); > > init_rwsem(&cifsi->lock_sem); > > + mutex_init(&cifsi->tbl_write_mutex); > > } > > > > static int __init > > diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h > > index 43b42eca6780..4af4c5036d81 100644 > > --- a/fs/smb/client/cifsglob.h > > +++ b/fs/smb/client/cifsglob.h > > @@ -1606,6 +1606,17 @@ struct cifsInodeInfo { > > bool lease_granted; /* Flag to indicate whether lease or oplock is granted. */ > > char *symlink_target; > > __u32 reparse_tag; > > + > > + /* During development we discovered what we believe to be a race condition > > + * in the write caching behavior of cifs. Setting cache=none solved the > > + * issue but with an unacceptable performance hit. The following mutex was > > + * added to serialize the cifs_write_begin, cifs_write_end, and > > + * cifs_writepages functions in file.c. This appears to solve the issue > > + * without completely disabling caching. 
> > + * > > + * -Mark Whiting (whitingm@opentext.com) > > + */ > > + struct mutex tbl_write_mutex; > > }; > > > > static inline struct cifsInodeInfo * > > diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c > > index cb75b95efb70..d3bc652a7e65 100644 > > --- a/fs/smb/client/file.c > > +++ b/fs/smb/client/file.c > > @@ -3085,6 +3085,7 @@ static int cifs_writepages(struct address_space *mapping, > > { > > loff_t start, end; > > int ret; > > + mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex); > > > > /* We have to be careful as we can end up racing with setattr() > > * truncating the pagecache since the caller doesn't take a lock here > > @@ -3119,6 +3120,7 @@ static int cifs_writepages(struct address_space *mapping, > > } > > > > out: > > + mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex); > > return ret; > > } > > > > @@ -3174,6 +3176,8 @@ static int cifs_write_end(struct file *file, struct address_space *mapping, > > struct folio *folio = page_folio(page); > > __u32 pid; > > > > + mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex); > > + > > if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD) > > pid = cfile->pid; > > else > > @@ -3233,6 +3237,7 @@ static int cifs_write_end(struct file *file, struct address_space *mapping, > > /* Indication to update ctime and mtime as close is deferred */ > > set_bit(CIFS_INO_MODIFIED_ATTR, &CIFS_I(inode)->flags); > > > > + mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex); > > return rc; > > } > > > > @@ -4905,6 +4910,7 @@ static int cifs_write_begin(struct file *file, struct address_space *mapping, > > int rc = 0; > > > > cifs_dbg(FYI, "write_begin from %lld len %d\n", (long long)pos, len); > > + mutex_lock(&CIFS_I(mapping->host)->tbl_write_mutex); > > > > start: > > page = grab_cache_page_write_begin(mapping, index); > > @@ -4965,6 +4971,7 @@ static int cifs_write_begin(struct file *file, struct address_space *mapping, > > this will be written out by write_end so is fine */ > > } > > out: > > + mutex_unlock(&CIFS_I(mapping->host)->tbl_write_mutex); > > *pagep = page; > > return rc; > > } > > Here are some of the log excerpts for one of my test cases. In this file one of the corrupt regions starts at file offset 1,074,214,474 (0x4007364A), and was corrupt for 2,486 bytes, ending on a page boundary. First there is a section of the log trimmed to just the cifs_write_begin() / cifs_write_end() functions. You can see that there is a write shown at the exact offset/length of the corrupted data. > > > Mar 25 15:25:39 TX2 kernel: [ 124.080900] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074212864 len 1610 > > Mar 25 15:25:39 TX2 kernel: [ 124.080906] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 0000000086519afd from pos 1074212864 with 1610 bytes > > Mar 25 15:25:39 TX2 kernel: [ 124.080911] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074214474 len 2486 > > Mar 25 15:25:39 TX2 kernel: [ 124.080916] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 0000000086519afd from pos 1074214474 with 2486 bytes > > Mar 25 15:25:39 TX2 kernel: [ 124.080917] [1567] cifs_write_begin:4907: CIFS: fs/smb/client/file.c: write_begin from 1074216960 len 846 > > Mar 25 15:25:39 TX2 kernel: [ 124.080924] [1567] cifs_write_end:3182: CIFS: fs/smb/client/file.c: write_end for page 00000000880cee03 from pos 1074216960 with 846 bytes > > Now here's a section of the log trimmed to just the smb2_async_writev() function. 
You can see writes covering the data immediately before and after the corrupted region, but there is no write to the corrupted region. I'm assuming the corrupted region is always zeros because the server is extending and zero-filling the file to the new write offset after the gap of the missing write. > > > Mar 25 15:25:39 TX2 kernel: [ 123.829696] [1635] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1072214016 988260 bytes iter=f1464 > > Mar 25 15:25:39 TX2 kernel: [ 124.081016] [1636] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1073201152 1013322 bytes iter=f764a > ** Missing write: 1073201152 + 1013322 = 1074214474 ** > > Mar 25 15:25:39 TX2 kernel: [ 124.083901] [1636] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1074216960 39564 bytes iter=9a8c > > Mar 25 15:25:40 TX2 kernel: [ 124.340557] [1637] smb2_async_writev:4945: CIFS: fs/smb/client/smb2pdu.c: async write at 1074253824 1237843 bytes iter=12e353 > > I can very easily reproduce this with our application. If anyone has any suggestions to try, additional logging / tracing they would like me to perform, please let me know. I can provide more detailed, full logs if desired, but they're quite large. I'll continue to read through the code and try to understand, if I find anything I will update you. > > Thanks, > Mark Whiting > > -- Thanks, Steve ^ permalink raw reply [flat|nested] 29+ messages in thread
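Mark's technique for spotting the missing write - the previous offset plus
length not matching the next offset - is easy to mechanize. A hypothetical
checker over (offset, length) pairs parsed from the smb2_async_writev debug
lines might look like this; the sample values are taken from the excerpt
above and the log parsing is left out:
-------------------------
#include <stdio.h>

/* One parsed "async write at OFFSET LEN bytes" debug line. */
struct wr { unsigned long long off, len; };

int main(void)
{
	struct wr w[] = {
		{ 1072214016ULL,  988260ULL },
		{ 1073201152ULL, 1013322ULL },	/* ends at 1074214474 */
		{ 1074216960ULL,   39564ULL },	/* starts 2486 bytes later */
	};
	int n = sizeof(w) / sizeof(w[0]);

	for (int i = 1; i < n; i++) {
		unsigned long long end = w[i - 1].off + w[i - 1].len;

		if (w[i].off > end)
			printf("gap: %llu..%llu (%llu bytes never written)\n",
			       end, w[i].off, w[i].off - end);
	}
	return 0;
}
-------------------------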
end of thread, other threads:[~2025-11-14 10:58 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
2025-03-25 21:24 [BUG REPORT] cifs/smb data corruption when writing, x86_64, kernel 6.6.71 Mark A Whiting
2025-03-26 10:11 ` AW: [[ EXT ]] " Heckmann, Ilja
2025-03-26 18:58   ` Steve French
2025-03-26 21:13     ` Enzo Matsumiya
2025-03-27 12:48       ` Mark A Whiting
2025-03-27 13:07         ` Enzo Matsumiya
2025-03-27 19:31           ` [EXTERNAL] - " Mark A Whiting
2025-03-31 19:47           ` Mark A Whiting
2025-11-06 15:00           ` Bharath SM
2025-11-06 15:20             ` Enzo Matsumiya
2025-11-06 15:51               ` Bharath SM
2025-11-06 16:03                 ` Enzo Matsumiya
2025-11-10 15:56                 ` David Howells
2025-11-06 16:22             ` David Howells
2025-11-06 16:46               ` Bharath SM
2025-11-06 16:49                 ` Bharath SM
2025-11-10 16:00             ` David Howells
2025-11-11  9:22             ` David Howells
2025-11-11 10:39               ` Shyam Prasad N
2025-11-12 17:09                 ` Shyam Prasad N
2025-11-12 18:14                   ` David Howells
2025-11-13 12:04                     ` Shyam Prasad N
2025-11-13 13:57                       ` David Howells
2025-11-14  9:27                         ` Shyam Prasad N
2025-11-14 10:49                           ` David Howells
2025-11-14 10:57                             ` Shyam Prasad N
2025-11-07 12:54           ` Shyam Prasad N
2025-11-10  4:37             ` Shyam Prasad N
2025-03-27 10:09       ` AW: [[ EXT ]] " Heckmann, Ilja