linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 5.15] fuse: Fix race condition in writethrough path A race
@ 2025-10-09 11:06 guangming.zhao
  2025-10-09 22:11 ` Joanne Koong
  0 siblings, 1 reply; 31+ messages in thread
From: guangming.zhao @ 2025-10-09 11:06 UTC (permalink / raw)
  To: miklos; +Cc: linux-fsdevel, linux-kernel

The race occurs as follows:
1. A write operation locks a page, fills it with new data, marks it
   Uptodate, and then immediately unlocks it within fuse_fill_write_pages().
2. This opens a window before the new data is sent to the userspace daemon.
3. A concurrent read operation for the same page may decide to re-validate
   its cache from the daemon. The fuse_wait_on_page_writeback()
   mechanism does not protect this synchronous writethrough path.
4. The read request can be processed by the multi-threaded daemon *before*
   the write request, causing it to reply with stale data from its backend.
5. The read syscall returns this stale data to userspace, causing data
   verification to fail.

This can be reliably reproduced on a mainline kernel (e.g., 6.1.x)
using iogen and a standard multi-threaded libfuse passthrough filesystem.

Steps to Reproduce:
1. Mount a libfuse passthrough filesystem (must be multi-threaded):
   $ ./passthrough /path/to/mount_point

2. Run the iogen/doio test from LTP (Linux Test Project) with mixed
   read/write operations (example):
   $ /path/to/ltp/iogen -N iogen01 -i 120s -s read,write 500k:/path/to/mount_point/file1 | \
     /path/to/ltp/doio -N iogen01 -a -v -n 2 -k

3. A data comparison error similar to the following will be reported:
   *** DATA COMPARISON ERROR ***
   check_file(/path/to/mount_point/file1, ...) failed
   expected bytes:  X:3091346:gm-arco:doio*X:3091346
   actual bytes:    91346:gm-arco:doio*C:3091346:gm-

The fix is to delay unlocking the page until after the data has been
successfully sent to the daemon. This is achieved by moving the unlock
logic from fuse_fill_write_pages() to the completion path of
fuse_send_write_pages(), ensuring the page lock is held for the entire
critical section and serializing the operations correctly.

[Note for maintainers]
This patch is created and tested against the 5.15 kernel. I have observed
that recent kernels have migrated to using folios, and I am not confident
in porting this fix to the new folio-based code myself.

I am submitting this patch to clearly document the race condition and a
proven fix on an older kernel, in the hope that a developer more
familiar with the folio conversion can adapt it for the mainline tree.

Signed-off-by: guangming.zhao <giveme.gulu@gmail.com>
---
[root@gm-arco example]# uname -a
Linux gm-arco 6.16.8-arch3-1 #1 SMP PREEMPT_DYNAMIC Mon, 22 Sep 2025 22:08:35 +0000 x86_64 GNU/Linux
[root@gm-arco example]# ./passthrough /tmp/test/
[root@gm-arco example]# mkdir /tmp/test/yy
[root@gm-arco example]# /home/gm/code/ltp/testcases/kernel/fs/doio/iogen -N iogen01 -i 120s -s read,write 500b:/tmp/test/yy/kk1 1000b:/tmp/test/yy/kk2 | /home/gm/code/ltp/testcases/kernel/fs/doio/doio -N iogen01 -a -v -n 2 -k

iogen(iogen01) starting up with the following:

Out-pipe:              stdout
Iterations:            120 seconds
Seed:                  3091343
Offset-Mode:           sequential
Overlap Flag:          off
Mintrans:              1           (1 blocks)
Maxtrans:              131072      (256 blocks)
O_RAW/O_SSD Multiple:  (Determined by device)
Syscalls:              read write
Aio completion types:  none
Flags:                 buffered sync

Test Files:

Path                                          Length    iou   raw iou file
                                              (bytes) (bytes) (bytes) type
-----------------------------------------------------------------------------
/tmp/test/yy/kk1                               256000       1     512 regular
/tmp/test/yy/kk2                               512000       1     512 regular

doio(iogen01) (3091346) 17:43:50
---------------------
*** DATA COMPARISON ERROR ***
check_file(/tmp/test/yy/kk2, 116844, 106653, X:3091346:gm-arco:doio*, 23, 0) failed

Comparison fd is 3, with open flags 0
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 116844
    1st 32 expected bytes:  X:3091346:gm-arco:doio*X:3091346
    1st 32 actual bytes:    91346:gm-arco:doio*C:3091346:gm-
Request number 13873
syscall:  write(4, 02540107176414100, 106653)
          fd 4 is file /tmp/test/yy/kk2 - open flags are 04010001
          write done at file offset 116844 - pattern is X:3091346:gm-arco:doio*

doio(iogen01) (3091344) 17:43:50
---------------------
(parent) pid 3091346 exited because of data compare errors

 fs/fuse/file.c | 36 ++++++++++--------------------------
 1 file changed, 10 insertions(+), 26 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5c5ed58d9..a832c3122 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1098,7 +1098,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 	struct fuse_file *ff = file->private_data;
 	struct fuse_mount *fm = ff->fm;
 	unsigned int offset, i;
-	bool short_write;
 	int err;
 
 	for (i = 0; i < ap->num_pages; i++)
@@ -1113,26 +1112,21 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 	if (!err && ia->write.out.size > count)
 		err = -EIO;
 
-	short_write = ia->write.out.size < count;
 	offset = ap->descs[0].offset;
 	count = ia->write.out.size;
 	for (i = 0; i < ap->num_pages; i++) {
 		struct page *page = ap->pages[i];
 
-		if (err) {
-			ClearPageUptodate(page);
-		} else {
-			if (count >= PAGE_SIZE - offset)
-				count -= PAGE_SIZE - offset;
-			else {
-				if (short_write)
-					ClearPageUptodate(page);
-				count = 0;
-			}
-			offset = 0;
-		}
-		if (ia->write.page_locked && (i == ap->num_pages - 1))
-			unlock_page(page);
+        if (!err && !offset && count >= PAGE_SIZE)
+            SetPageUptodate(page);
+
+        if (count > PAGE_SIZE - offset)
+            count -= PAGE_SIZE - offset;
+        else
+            count = 0;
+        offset = 0;
+
+        unlock_page(page);
 		put_page(page);
 	}
 
@@ -1195,16 +1189,6 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
 		if (offset == PAGE_SIZE)
 			offset = 0;
 
-		/* If we copied full page, mark it uptodate */
-		if (tmp == PAGE_SIZE)
-			SetPageUptodate(page);
-
-		if (PageUptodate(page)) {
-			unlock_page(page);
-		} else {
-			ia->write.page_locked = true;
-			break;
-		}
 		if (!fc->big_writes)
 			break;
 	} while (iov_iter_count(ii) && count < fc->max_write &&
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-09 11:06 [PATCH 5.15] fuse: Fix race condition in writethrough path A race guangming.zhao
@ 2025-10-09 22:11 ` Joanne Koong
       [not found]   ` <CAFS-8+VcZn7WZgjV9pHz4c8DYHRdP0on6-er5fm9TZF9RAO0xQ@mail.gmail.com>
  0 siblings, 1 reply; 31+ messages in thread
From: Joanne Koong @ 2025-10-09 22:11 UTC (permalink / raw)
  To: guangming.zhao; +Cc: miklos, linux-fsdevel, linux-kernel

On Thu, Oct 9, 2025 at 4:09 AM guangming.zhao <giveme.gulu@gmail.com> wrote:
>

Hi Guangming,

> The race occurs as follows:
> 1. A write operation locks a page, fills it with new data, marks it
>    Uptodate, and then immediately unlocks it within fuse_fill_write_pages().
> 2. This opens a window before the new data is sent to the userspace daemon.
> 3. A concurrent read operation for the same page may decide to re-validate
>    its cache from the daemon. The fuse_wait_on_page_writeback()
>    mechanism does not protect this synchronous writethrough path.
> 4. The read request can be processed by the multi-threaded daemon *before*
>    the write request, causing it to reply with stale data from its backend.
> 5. The read syscall returns this stale data to userspace, causing data
>    verification to fail.

I don't think the issue is that the read returns stale data (the
client is responsible for synchronizing reads and writes, so if the
read is issued before the write has completed then it should be fine
that the read returned back stale data) but that the read will
populate the page cache with stale data (overwriting the write data in
the page cache), which makes later subsequent reads that are issued
after the write has completed return back stale data.

>
> This can be reliably reproduced on a mainline kernel (e.g., 6.1.x)
> using iogen and a standard multi-threaded libfuse passthrough filesystem.
>
> Steps to Reproduce:
> 1. Mount a libfuse passthrough filesystem (must be multi-threaded):
>    $ ./passthrough /path/to/mount_point
>
> 2. Run the iogen/doio test from LTP (Linux Test Project) with mixed
>    read/write operations (example):
>    $ /path/to/ltp/iogen -N iogen01 -i 120s -s read,write 500k:/path/to/mount_point/file1 | \
>      /path/to/ltp/doio -N iogen01 -a -v -n 2 -k
>
> 3. A data comparison error similar to the following will be reported:
>    *** DATA COMPARISON ERROR ***
>    check_file(/path/to/mount_point/file1, ...) failed
>    expected bytes:  X:3091346:gm-arco:doio*X:3091346
>    actual bytes:    91346:gm-arco:doio*C:3091346:gm-
>
> The fix is to delay unlocking the page until after the data has been
> successfully sent to the daemon. This is achieved by moving the unlock
> logic from fuse_fill_write_pages() to the completion path of
> fuse_send_write_pages(), ensuring the page lock is held for the entire
> critical section and serializing the operations correctly.
>
> [Note for maintainers]
> This patch is created and tested against the 5.15 kernel. I have observed
> that recent kernels have migrated to using folios, and I am not confident
> in porting this fix to the new folio-based code myself.
>
> I am submitting this patch to clearly document the race condition and a
> proven fix on an older kernel, in the hope that a developer more
> familiar with the folio conversion can adapt it for the mainline tree.
>
> Signed-off-by: guangming.zhao <giveme.gulu@gmail.com>
> ---
> [root@gm-arco example]# uname -a
> Linux gm-arco 6.16.8-arch3-1 #1 SMP PREEMPT_DYNAMIC Mon, 22 Sep 2025 22:08:35 +0000 x86_64 GNU/Linux
> [root@gm-arco example]# ./passthrough /tmp/test/
> [root@gm-arco example]# mkdir /tmp/test/yy
> [root@gm-arco example]# /home/gm/code/ltp/testcases/kernel/fs/doio/iogen -N iogen01 -i 120s -s read,write 500b:/tmp/test/yy/kk1 1000b:/tmp/test/yy/kk2 | /home/gm/code/ltp/testcases/kernel/fs/doio/doio -N iogen01 -a -v -n 2 -k
>
> iogen(iogen01) starting up with the following:
>
> Out-pipe:              stdout
> Iterations:            120 seconds
> Seed:                  3091343
> Offset-Mode:           sequential
> Overlap Flag:          off
> Mintrans:              1           (1 blocks)
> Maxtrans:              131072      (256 blocks)
> O_RAW/O_SSD Multiple:  (Determined by device)
> Syscalls:              read write
> Aio completion types:  none
> Flags:                 buffered sync
>
> Test Files:
>
> Path                                          Length    iou   raw iou file
>                                               (bytes) (bytes) (bytes) type
> -----------------------------------------------------------------------------
> /tmp/test/yy/kk1                               256000       1     512 regular
> /tmp/test/yy/kk2                               512000       1     512 regular
>
> doio(iogen01) (3091346) 17:43:50
> ---------------------
> *** DATA COMPARISON ERROR ***
> check_file(/tmp/test/yy/kk2, 116844, 106653, X:3091346:gm-arco:doio*, 23, 0) failed
>
> Comparison fd is 3, with open flags 0
> Corrupt regions follow - unprintable chars are represented as '.'
> -----------------------------------------------------------------
> corrupt bytes starting at file offset 116844
>     1st 32 expected bytes:  X:3091346:gm-arco:doio*X:3091346
>     1st 32 actual bytes:    91346:gm-arco:doio*C:3091346:gm-
> Request number 13873
> syscall:  write(4, 02540107176414100, 106653)
>           fd 4 is file /tmp/test/yy/kk2 - open flags are 04010001
>           write done at file offset 116844 - pattern is X:3091346:gm-arco:doio*
>
> doio(iogen01) (3091344) 17:43:50
> ---------------------
> (parent) pid 3091346 exited because of data compare errors
>
>  fs/fuse/file.c | 36 ++++++++++--------------------------
>  1 file changed, 10 insertions(+), 26 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 5c5ed58d9..a832c3122 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1098,7 +1098,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>         struct fuse_file *ff = file->private_data;
>         struct fuse_mount *fm = ff->fm;
>         unsigned int offset, i;
> -       bool short_write;
>         int err;
>
>         for (i = 0; i < ap->num_pages; i++)
> @@ -1113,26 +1112,21 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>         if (!err && ia->write.out.size > count)
>                 err = -EIO;
>
> -       short_write = ia->write.out.size < count;
>         offset = ap->descs[0].offset;
>         count = ia->write.out.size;
>         for (i = 0; i < ap->num_pages; i++) {
>                 struct page *page = ap->pages[i];
>
> -               if (err) {
> -                       ClearPageUptodate(page);
> -               } else {
> -                       if (count >= PAGE_SIZE - offset)
> -                               count -= PAGE_SIZE - offset;
> -                       else {
> -                               if (short_write)
> -                                       ClearPageUptodate(page);
> -                               count = 0;
> -                       }
> -                       offset = 0;
> -               }
> -               if (ia->write.page_locked && (i == ap->num_pages - 1))
> -                       unlock_page(page);
> +        if (!err && !offset && count >= PAGE_SIZE)
> +            SetPageUptodate(page);
> +
> +        if (count > PAGE_SIZE - offset)
> +            count -= PAGE_SIZE - offset;
> +        else
> +            count = 0;
> +        offset = 0;
> +
> +        unlock_page(page);
>                 put_page(page);
>         }
>
> @@ -1195,16 +1189,6 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
>                 if (offset == PAGE_SIZE)
>                         offset = 0;
>
> -               /* If we copied full page, mark it uptodate */
> -               if (tmp == PAGE_SIZE)
> -                       SetPageUptodate(page);
> -
> -               if (PageUptodate(page)) {
> -                       unlock_page(page);
> -               } else {
> -                       ia->write.page_locked = true;
> -                       break;
> -               }

I think this will run into the deadlock described here
https://lore.kernel.org/linux-fsdevel/CAHk-=wh9Eu-gNHzqgfvUAAiO=vJ+pWnzxkv+tX55xhGPFy+cOw@mail.gmail.com/,
so I think we would need a different solution. Maybe one idea is doing
something similar to what the fi->writectr bias does - that at least
seems simpler to me than having to unlock all the pages in the array
if we have to fault in the iov iter and then having to relock the
pages while making sure everything is all consistent.

Thanks,
Joanne

>                 if (!fc->big_writes)
>                         break;
>         } while (iov_iter_count(ii) && count < fc->max_write &&
> --
> 2.51.0
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
       [not found]   ` <CAFS-8+VcZn7WZgjV9pHz4c8DYHRdP0on6-er5fm9TZF9RAO0xQ@mail.gmail.com>
@ 2025-10-10  6:25     ` lu gu
  2025-10-10  8:46       ` Miklos Szeredi
  0 siblings, 1 reply; 31+ messages in thread
From: lu gu @ 2025-10-10  6:25 UTC (permalink / raw)
  To: Joanne Koong; +Cc: miklos, linux-fsdevel, linux-kernel

[Resend, plain-text only: previous version was rejected due to HTML formatting]
Hi Joanne,

Thank you again for the detailed explanation. I've been thinking
carefully about your feedback and the deadlock issue related to
holding the page lock.
I now understand that keeping a low-level page lock while waiting for
a userspace reply is dangerous and can easily lead to deadlocks with
the memory manager. So my original patch was  not the right approach.

Your explanation really filled in the missing piece for me. Now, I
understand the core problem sequence correctly:

    A concurrent read operation may trigger a GETATTR request when the
attribute cache expires.
    The result of this GETATTR can cause the kernel to invalidate the
page cache for the inode.
    As a result, the read operation ignores the (new) cached data and
fetches data again from the backend.
    The backend then returns stale data, which overwrites the updated
page cache contents, leading to the data mismatch I observed.

I’ve been thinking about how to implement proper synchronization
without holding the page lock. Here’s an idea I’d like your thoughts
on:

    1. Keep the early unlock_page() in fuse_fill_write_pages().

    2. In fuse_perform_write(), before sending the FUSE_WRITE request,
we insert an entry representing the affected page/offset range into a
new RB-tree on the fuse_inode. This structure, similar to how
writeback is tracked (e.g., fuse_page_is_writeback), will be named
sync_writes_in_flight.

    3. In the read path (e.g., fuse_readpage()), before issuing a
FUSE_READ, check this tree for any overlapping ranges.

    4. If an overlap is found, the read operation waits on a wait
queue associated with this synchronization structure.

    5. When the FUSE_WRITE completes in fuse_perform_write(), remove
the entry from the tree and wake up any waiting readers.

This should serialize backend reads and writes for overlapping ranges
while avoiding the deadlock risk, since the waiting happens at the
FUSE layer rather than under a page lock.

Does this approach sound reasonable to you? I’d really appreciate your
feedback on whether this design makes sense, or if you see any
potential pitfalls I’ve missed.

Thanks,
guangming.zhao

On Fri, Oct 10, 2025 at 1:17 PM lu gu <giveme.gulu@gmail.com> wrote:
>
> Hi Joanne,
>
> Thank you again for the detailed explanation. I've been thinking carefully about your feedback and the deadlock issue related to holding the page lock.
> I now understand that keeping a low-level page lock while waiting for a userspace reply is dangerous and can easily lead to deadlocks with the memory manager. So my original patch was  not the right approach.
>
> Your explanation really filled in the missing piece for me. Now, I understand the core problem sequence correctly:
>
>     A concurrent read operation may trigger a GETATTR request when the attribute cache expires.
>     The result of this GETATTR can cause the kernel to invalidate the page cache for the inode.
>     As a result, the read operation ignores the (new) cached data and fetches data again from the backend.
>     The backend then returns stale data, which overwrites the updated page cache contents, leading to the data mismatch I observed.
>
> I’ve been thinking about how to implement proper synchronization without holding the page lock. Here’s an idea I’d like your thoughts on:
>
>     1. Keep the early unlock_page() in fuse_fill_write_pages().
>
>     2. In fuse_perform_write(), before sending the FUSE_WRITE request, we insert an entry representing the affected page/offset range into a new RB-tree on the fuse_inode. This structure, similar to how writeback is tracked (e.g., fuse_page_is_writeback), will be named sync_writes_in_flight.
>
>     3. In the read path (e.g., fuse_readpage()), before issuing a FUSE_READ, check this tree for any overlapping ranges.
>
>     4. If an overlap is found, the read operation waits on a wait queue associated with this synchronization structure.
>
>     5. When the FUSE_WRITE completes in fuse_perform_write(), remove the entry from the tree and wake up any waiting readers.
>
> This should serialize backend reads and writes for overlapping ranges while avoiding the deadlock risk, since the waiting happens at the FUSE layer rather than under a page lock.
>
> Does this approach sound reasonable to you? I’d really appreciate your feedback on whether this design makes sense, or if you see any potential pitfalls I’ve missed.
>
> Thanks,
> guangming.zhao
>
> On Fri, Oct 10, 2025 at 6:11 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Thu, Oct 9, 2025 at 4:09 AM guangming.zhao <giveme.gulu@gmail.com> wrote:
>> >
>>
>> Hi Guangming,
>>
>> > The race occurs as follows:
>> > 1. A write operation locks a page, fills it with new data, marks it
>> >    Uptodate, and then immediately unlocks it within fuse_fill_write_pages().
>> > 2. This opens a window before the new data is sent to the userspace daemon.
>> > 3. A concurrent read operation for the same page may decide to re-validate
>> >    its cache from the daemon. The fuse_wait_on_page_writeback()
>> >    mechanism does not protect this synchronous writethrough path.
>> > 4. The read request can be processed by the multi-threaded daemon *before*
>> >    the write request, causing it to reply with stale data from its backend.
>> > 5. The read syscall returns this stale data to userspace, causing data
>> >    verification to fail.
>>
>> I don't think the issue is that the read returns stale data (the
>> client is responsible for synchronizing reads and writes, so if the
>> read is issued before the write has completed then it should be fine
>> that the read returned back stale data) but that the read will
>> populate the page cache with stale data (overwriting the write data in
>> the page cache), which makes later subsequent reads that are issued
>> after the write has completed return back stale data.
>>
>> >
>> > This can be reliably reproduced on a mainline kernel (e.g., 6.1.x)
>> > using iogen and a standard multi-threaded libfuse passthrough filesystem.
>> >
>> > Steps to Reproduce:
>> > 1. Mount a libfuse passthrough filesystem (must be multi-threaded):
>> >    $ ./passthrough /path/to/mount_point
>> >
>> > 2. Run the iogen/doio test from LTP (Linux Test Project) with mixed
>> >    read/write operations (example):
>> >    $ /path/to/ltp/iogen -N iogen01 -i 120s -s read,write 500k:/path/to/mount_point/file1 | \
>> >      /path/to/ltp/doio -N iogen01 -a -v -n 2 -k
>> >
>> > 3. A data comparison error similar to the following will be reported:
>> >    *** DATA COMPARISON ERROR ***
>> >    check_file(/path/to/mount_point/file1, ...) failed
>> >    expected bytes:  X:3091346:gm-arco:doio*X:3091346
>> >    actual bytes:    91346:gm-arco:doio*C:3091346:gm-
>> >
>> > The fix is to delay unlocking the page until after the data has been
>> > successfully sent to the daemon. This is achieved by moving the unlock
>> > logic from fuse_fill_write_pages() to the completion path of
>> > fuse_send_write_pages(), ensuring the page lock is held for the entire
>> > critical section and serializing the operations correctly.
>> >
>> > [Note for maintainers]
>> > This patch is created and tested against the 5.15 kernel. I have observed
>> > that recent kernels have migrated to using folios, and I am not confident
>> > in porting this fix to the new folio-based code myself.
>> >
>> > I am submitting this patch to clearly document the race condition and a
>> > proven fix on an older kernel, in the hope that a developer more
>> > familiar with the folio conversion can adapt it for the mainline tree.
>> >
>> > Signed-off-by: guangming.zhao <giveme.gulu@gmail.com>
>> > ---
>> > [root@gm-arco example]# uname -a
>> > Linux gm-arco 6.16.8-arch3-1 #1 SMP PREEMPT_DYNAMIC Mon, 22 Sep 2025 22:08:35 +0000 x86_64 GNU/Linux
>> > [root@gm-arco example]# ./passthrough /tmp/test/
>> > [root@gm-arco example]# mkdir /tmp/test/yy
>> > [root@gm-arco example]# /home/gm/code/ltp/testcases/kernel/fs/doio/iogen -N iogen01 -i 120s -s read,write 500b:/tmp/test/yy/kk1 1000b:/tmp/test/yy/kk2 | /home/gm/code/ltp/testcases/kernel/fs/doio/doio -N iogen01 -a -v -n 2 -k
>> >
>> > iogen(iogen01) starting up with the following:
>> >
>> > Out-pipe:              stdout
>> > Iterations:            120 seconds
>> > Seed:                  3091343
>> > Offset-Mode:           sequential
>> > Overlap Flag:          off
>> > Mintrans:              1           (1 blocks)
>> > Maxtrans:              131072      (256 blocks)
>> > O_RAW/O_SSD Multiple:  (Determined by device)
>> > Syscalls:              read write
>> > Aio completion types:  none
>> > Flags:                 buffered sync
>> >
>> > Test Files:
>> >
>> > Path                                          Length    iou   raw iou file
>> >                                               (bytes) (bytes) (bytes) type
>> > -----------------------------------------------------------------------------
>> > /tmp/test/yy/kk1                               256000       1     512 regular
>> > /tmp/test/yy/kk2                               512000       1     512 regular
>> >
>> > doio(iogen01) (3091346) 17:43:50
>> > ---------------------
>> > *** DATA COMPARISON ERROR ***
>> > check_file(/tmp/test/yy/kk2, 116844, 106653, X:3091346:gm-arco:doio*, 23, 0) failed
>> >
>> > Comparison fd is 3, with open flags 0
>> > Corrupt regions follow - unprintable chars are represented as '.'
>> > -----------------------------------------------------------------
>> > corrupt bytes starting at file offset 116844
>> >     1st 32 expected bytes:  X:3091346:gm-arco:doio*X:3091346
>> >     1st 32 actual bytes:    91346:gm-arco:doio*C:3091346:gm-
>> > Request number 13873
>> > syscall:  write(4, 02540107176414100, 106653)
>> >           fd 4 is file /tmp/test/yy/kk2 - open flags are 04010001
>> >           write done at file offset 116844 - pattern is X:3091346:gm-arco:doio*
>> >
>> > doio(iogen01) (3091344) 17:43:50
>> > ---------------------
>> > (parent) pid 3091346 exited because of data compare errors
>> >
>> >  fs/fuse/file.c | 36 ++++++++++--------------------------
>> >  1 file changed, 10 insertions(+), 26 deletions(-)
>> >
>> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> > index 5c5ed58d9..a832c3122 100644
>> > --- a/fs/fuse/file.c
>> > +++ b/fs/fuse/file.c
>> > @@ -1098,7 +1098,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>> >         struct fuse_file *ff = file->private_data;
>> >         struct fuse_mount *fm = ff->fm;
>> >         unsigned int offset, i;
>> > -       bool short_write;
>> >         int err;
>> >
>> >         for (i = 0; i < ap->num_pages; i++)
>> > @@ -1113,26 +1112,21 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>> >         if (!err && ia->write.out.size > count)
>> >                 err = -EIO;
>> >
>> > -       short_write = ia->write.out.size < count;
>> >         offset = ap->descs[0].offset;
>> >         count = ia->write.out.size;
>> >         for (i = 0; i < ap->num_pages; i++) {
>> >                 struct page *page = ap->pages[i];
>> >
>> > -               if (err) {
>> > -                       ClearPageUptodate(page);
>> > -               } else {
>> > -                       if (count >= PAGE_SIZE - offset)
>> > -                               count -= PAGE_SIZE - offset;
>> > -                       else {
>> > -                               if (short_write)
>> > -                                       ClearPageUptodate(page);
>> > -                               count = 0;
>> > -                       }
>> > -                       offset = 0;
>> > -               }
>> > -               if (ia->write.page_locked && (i == ap->num_pages - 1))
>> > -                       unlock_page(page);
>> > +        if (!err && !offset && count >= PAGE_SIZE)
>> > +            SetPageUptodate(page);
>> > +
>> > +        if (count > PAGE_SIZE - offset)
>> > +            count -= PAGE_SIZE - offset;
>> > +        else
>> > +            count = 0;
>> > +        offset = 0;
>> > +
>> > +        unlock_page(page);
>> >                 put_page(page);
>> >         }
>> >
>> > @@ -1195,16 +1189,6 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
>> >                 if (offset == PAGE_SIZE)
>> >                         offset = 0;
>> >
>> > -               /* If we copied full page, mark it uptodate */
>> > -               if (tmp == PAGE_SIZE)
>> > -                       SetPageUptodate(page);
>> > -
>> > -               if (PageUptodate(page)) {
>> > -                       unlock_page(page);
>> > -               } else {
>> > -                       ia->write.page_locked = true;
>> > -                       break;
>> > -               }
>>
>> I think this will run into the deadlock described here
>> https://lore.kernel.org/linux-fsdevel/CAHk-=wh9Eu-gNHzqgfvUAAiO=vJ+pWnzxkv+tX55xhGPFy+cOw@mail.gmail.com/,
>> so I think we would need a different solution. Maybe one idea is doing
>> something similar to what the fi->writectr bias does - that at least
>> seems simpler to me than having to unlock all the pages in the array
>> if we have to fault in the iov iter and then having to relock the
>> pages while making sure everything is all consistent.
>>
>> Thanks,
>> Joanne
>>
>> >                 if (!fc->big_writes)
>> >                         break;
>> >         } while (iov_iter_count(ii) && count < fc->max_write &&
>> > --
>> > 2.51.0
>> >
>> >

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-10  6:25     ` lu gu
@ 2025-10-10  8:46       ` Miklos Szeredi
  2025-10-13 13:39         ` Miklos Szeredi
  0 siblings, 1 reply; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-10  8:46 UTC (permalink / raw)
  To: lu gu; +Cc: Joanne Koong, linux-fsdevel, linux-kernel

On Fri, 10 Oct 2025 at 08:25, lu gu <giveme.gulu@gmail.com> wrote:

> This should serialize backend reads and writes for overlapping ranges
> while avoiding the deadlock risk, since the waiting happens at the
> FUSE layer rather than under a page lock.
>
> Does this approach sound reasonable to you? I’d really appreciate your
> feedback on whether this design makes sense, or if you see any
> potential pitfalls I’ve missed.

Thanks for the great report.

The underlying issue here I think is that auto invalidation happens
(based on mtime change) despite the file not having been changed
externally.   This can happen because fuse_change_attributes_i() will
look at an old mtime value even if it's known to have been invalidated
by a write for example.

My idea is to introduce FUSE_I_MTIME_UNSTABLE (which would work
similarly to FUSE_I_SIZE_UNSTABLE) and when fetching old_mtime, verify
that it hasn't been invalidated.  If old_mtime is invalid or if
FUSE_I_MTIME_UNSTABLE signals that a write is in progress, the page
cache is not invalidated.

Will try this and send a patch.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-10  8:46       ` Miklos Szeredi
@ 2025-10-13 13:39         ` Miklos Szeredi
  2025-10-13 17:44           ` Brian Foster
                             ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-13 13:39 UTC (permalink / raw)
  To: lu gu
  Cc: Joanne Koong, linux-fsdevel, linux-kernel, Brian Foster,
	Bernd Schubert

On Fri, 10 Oct 2025 at 10:46, Miklos Szeredi <miklos@szeredi.hu> wrote:

> My idea is to introduce FUSE_I_MTIME_UNSTABLE (which would work
> similarly to FUSE_I_SIZE_UNSTABLE) and when fetching old_mtime, verify
> that it hasn't been invalidated.  If old_mtime is invalid or if
> FUSE_I_MTIME_UNSTABLE signals that a write is in progress, the page
> cache is not invalidated.

[Adding Brian Foster, the author of FUSE_AUTO_INVAL_DATA patches.
Link to complete thread:
https://lore.kernel.org/all/20251009110623.3115511-1-giveme.gulu@gmail.com/#r]

In summary: auto_inval_data invalidates data cache even if the
modification was done in a cache consistent manner (i.e. write
through). This is not generally a consistency problem, because the
backing file and the cache should be in sync.  The exception is when
the writeback to the backing file hasn't yet finished and a getattr()
call triggers invalidation (mtime change could be from a previous
write), and the not yet written data is invalidated and replaced with
stale data.

The proposed fix was to exclude concurrent reads and writes to the same region.

But the real issue here is that mtime changes triggered by this client
should not cause data to be invalidated.  It's not only racy, but it's
fundamentally wrong.  Unfortunately this is hard to do this correctly.
Best I can come up with is that any request that expects mtime to be
modified returns the mtime after the request has completed.

This would be much easier to implement in the fuse server: perform the
"file changed remotely" check when serving a FUSE_GETATTR request and
return a flag indicating whether the data needs to be invalidated or
not.

Thoughts?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 13:39         ` Miklos Szeredi
@ 2025-10-13 17:44           ` Brian Foster
  2025-10-13 18:23             ` Miklos Szeredi
  2025-10-13 20:16           ` Bernd Schubert
  2025-10-13 23:43           ` Joanne Koong
  2 siblings, 1 reply; 31+ messages in thread
From: Brian Foster @ 2025-10-13 17:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert

On Mon, Oct 13, 2025 at 03:39:48PM +0200, Miklos Szeredi wrote:
> On Fri, 10 Oct 2025 at 10:46, Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > My idea is to introduce FUSE_I_MTIME_UNSTABLE (which would work
> > similarly to FUSE_I_SIZE_UNSTABLE) and when fetching old_mtime, verify
> > that it hasn't been invalidated.  If old_mtime is invalid or if
> > FUSE_I_MTIME_UNSTABLE signals that a write is in progress, the page
> > cache is not invalidated.
> 
> [Adding Brian Foster, the author of FUSE_AUTO_INVAL_DATA patches.
> Link to complete thread:
> https://lore.kernel.org/all/20251009110623.3115511-1-giveme.gulu@gmail.com/#r]
> 
> In summary: auto_inval_data invalidates data cache even if the
> modification was done in a cache consistent manner (i.e. write
> through). This is not generally a consistency problem, because the
> backing file and the cache should be in sync.  The exception is when
> the writeback to the backing file hasn't yet finished and a getattr()
> call triggers invalidation (mtime change could be from a previous
> write), and the not yet written data is invalidated and replaced with
> stale data.
> 

Heh, well that's an old one. ;) I'm probably not going to recall all the
details, but from a quick look at the commits this was to facilitate
support for glusterfs. The original fuse code did an inval across i_size
changes and this patch updated that to try and accommodate overwrites by
doing a similar thing for mtime differences.

If I follow the report correctly, we're basically producing an internal
inconsistency between mtime and cache state that falsely presents as a
remote change, so one of these attr change checks can race with a write
in progress and invalidate cache. Do I have that right?

But still a few questions..

1. Do we know where exactly the mtime update comes from? Is it the write
in progress that updates the file mtime on the backend and creates the
inconsistency?

2. Is it confirmed that auto_inval is the culprit here? It seems logical
to me, but it can also be disabled dynamically so couldn't hurt to
confirm that if there's a reproducer.

3. I don't think we should be able to invalidate "dirty" folios like
this. On a quick look though, it seems we don't mark folios dirty in
this write path. Is that right?

If so, I'm a little curious if that's more of a "no apparent need" thing
since the writeback occurs right in that path vs. that is an actual
wrong thing to do for some reason. Hm?

If the former (and if there is simple confirmation of the auto inval
thing), I'm at least a little curious if marking folios
dirty/writeback/clean here would provide enough serialization against
the inval to prevent this problem.

> The proposed fix was to exclude concurrent reads and writes to the same region.
> 
> But the real issue here is that mtime changes triggered by this client
> should not cause data to be invalidated.  It's not only racy, but it's
> fundamentally wrong.  Unfortunately this is hard to do this correctly.
> Best I can come up with is that any request that expects mtime to be
> modified returns the mtime after the request has completed.
> 

Agreed in general. IIUC, this is ultimately a heuristic that isn't
guaranteed to necessarily get things right for the backing fs. ISTM that
maybe fuse is trying too hard to handle the distributed case correctly
where the backing fs should be the one to implement this sort of thing
through exposed mechanisms. OTOH so long as the heuristic exists we
should probably at least work to make it internally consistent.

> This would be much easier to implement in the fuse server: perform the
> "file changed remotely" check when serving a FUSE_GETATTR request and
> return a flag indicating whether the data needs to be invalidated or
> not.
> 

Indeed something along those lines sounds more elegant long term, IMO.

Brian

> Thoughts?
> 
> Thanks,
> Miklos
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 17:44           ` Brian Foster
@ 2025-10-13 18:23             ` Miklos Szeredi
  2025-10-13 18:53               ` Brian Foster
  0 siblings, 1 reply; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-13 18:23 UTC (permalink / raw)
  To: Brian Foster
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert

On Mon, 13 Oct 2025 at 19:40, Brian Foster <bfoster@redhat.com> wrote:

> If I follow the report correctly, we're basically producing an internal
> inconsistency between mtime and cache state that falsely presents as a
> remote change, so one of these attr change checks can race with a write
> in progress and invalidate cache. Do I have that right?

Yes.

>
> But still a few questions..
>
> 1. Do we know where exactly the mtime update comes from? Is it the write
> in progress that updates the file mtime on the backend and creates the
> inconsistency?

It can be a previous write.  A write will set STATX_MTIME in
fi->inval_mask, indicating that the value cached in i_mtime is
invalid.  But the auto_inval code will ignore that and use  cached
mtime to compare against the new value.

We could skip data invalidation if the cached value of mtime is not
valid, but this could easily result in remote changes being missed.

>
> 2. Is it confirmed that auto_inval is the culprit here? It seems logical
> to me, but it can also be disabled dynamically so couldn't hurt to
> confirm that if there's a reproducer.

Yes, reproducer has auto_inval_data turned on (libfuse turns it on by default).

>
> 3. I don't think we should be able to invalidate "dirty" folios like
> this. On a quick look though, it seems we don't mark folios dirty in
> this write path. Is that right?

Correct.

>
> If so, I'm a little curious if that's more of a "no apparent need" thing
> since the writeback occurs right in that path vs. that is an actual
> wrong thing to do for some reason. Hm?

Good question.  I think it's wrong, since dirtying the pages would
allow the witeback code to pick them up, which would be messy.

> Agreed in general. IIUC, this is ultimately a heuristic that isn't
> guaranteed to necessarily get things right for the backing fs. ISTM that
> maybe fuse is trying too hard to handle the distributed case correctly
> where the backing fs should be the one to implement this sort of thing
> through exposed mechanisms. OTOH so long as the heuristic exists we
> should probably at least work to make it internally consistent.

Yes, that's my problem.  How can we fix this without adding too much
complexity and without breaking existing uses?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 18:23             ` Miklos Szeredi
@ 2025-10-13 18:53               ` Brian Foster
  2025-10-14  7:48                 ` Miklos Szeredi
  0 siblings, 1 reply; 31+ messages in thread
From: Brian Foster @ 2025-10-13 18:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert

On Mon, Oct 13, 2025 at 08:23:39PM +0200, Miklos Szeredi wrote:
> On Mon, 13 Oct 2025 at 19:40, Brian Foster <bfoster@redhat.com> wrote:
> 
> > If I follow the report correctly, we're basically producing an internal
> > inconsistency between mtime and cache state that falsely presents as a
> > remote change, so one of these attr change checks can race with a write
> > in progress and invalidate cache. Do I have that right?
> 
> Yes.
> 
> >
> > But still a few questions..
> >
> > 1. Do we know where exactly the mtime update comes from? Is it the write
> > in progress that updates the file mtime on the backend and creates the
> > inconsistency?
> 
> It can be a previous write.  A write will set STATX_MTIME in
> fi->inval_mask, indicating that the value cached in i_mtime is
> invalid.  But the auto_inval code will ignore that and use  cached
> mtime to compare against the new value.
> 
> We could skip data invalidation if the cached value of mtime is not
> valid, but this could easily result in remote changes being missed.
> 

Hrm Ok. But even if we did miss remote changes, whose to say we can even
resolve that correctly from the kernel anyways..? Like if there happens
to be dirty data in cache and a remote change at the same time, that
kind of sounds like a policy decision for userspace. Maybe the fuse
position should be something like "expose mechanisms to manage this,
otherwise we'll just pick a side." Or "we'll never toss dirty data
unless explicitly asked by userspace."

> >
> > 2. Is it confirmed that auto_inval is the culprit here? It seems logical
> > to me, but it can also be disabled dynamically so couldn't hurt to
> > confirm that if there's a reproducer.
> 
> Yes, reproducer has auto_inval_data turned on (libfuse turns it on by default).
> 

I was more wondering if the problem goes away if it were disabled..

> >
> > 3. I don't think we should be able to invalidate "dirty" folios like
> > this. On a quick look though, it seems we don't mark folios dirty in
> > this write path. Is that right?
> 
> Correct.
> 
> >
> > If so, I'm a little curious if that's more of a "no apparent need" thing
> > since the writeback occurs right in that path vs. that is an actual
> > wrong thing to do for some reason. Hm?
> 
> Good question.  I think it's wrong, since dirtying the pages would
> allow the witeback code to pick them up, which would be messy.
> 

Ah, yeah that makes sense. Though invalidate waits on writeback. Any
reason this path couldn't skip the dirty state but mark the pages as
under writeback across the op?

> > Agreed in general. IIUC, this is ultimately a heuristic that isn't
> > guaranteed to necessarily get things right for the backing fs. ISTM that
> > maybe fuse is trying too hard to handle the distributed case correctly
> > where the backing fs should be the one to implement this sort of thing
> > through exposed mechanisms. OTOH so long as the heuristic exists we
> > should probably at least work to make it internally consistent.
> 
> Yes, that's my problem.  How can we fix this without adding too much
> complexity and without breaking existing uses?
> 

I probably need to stare at the code some more. Sorry, it's been quite a
while since I've looked at this. Curious.. was there something wrong
with your unstable mtime idea, or just that it might not be fully
generic enough?

This might be a good question for any fuse based distributed fs
projects. I'm not sure if/how active glusterfs is these days..

Brian

> Thanks,
> Miklos
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 13:39         ` Miklos Szeredi
  2025-10-13 17:44           ` Brian Foster
@ 2025-10-13 20:16           ` Bernd Schubert
  2025-10-13 20:27             ` Joanne Koong
  2025-10-14  8:06             ` Miklos Szeredi
  2025-10-13 23:43           ` Joanne Koong
  2 siblings, 2 replies; 31+ messages in thread
From: Bernd Schubert @ 2025-10-13 20:16 UTC (permalink / raw)
  To: Miklos Szeredi, lu gu
  Cc: Joanne Koong, linux-fsdevel, linux-kernel, Brian Foster



On 10/13/25 15:39, Miklos Szeredi wrote:
> On Fri, 10 Oct 2025 at 10:46, Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
>> My idea is to introduce FUSE_I_MTIME_UNSTABLE (which would work
>> similarly to FUSE_I_SIZE_UNSTABLE) and when fetching old_mtime, verify
>> that it hasn't been invalidated.  If old_mtime is invalid or if
>> FUSE_I_MTIME_UNSTABLE signals that a write is in progress, the page
>> cache is not invalidated.
> 
> [Adding Brian Foster, the author of FUSE_AUTO_INVAL_DATA patches.
> Link to complete thread:
> https://lore.kernel.org/all/20251009110623.3115511-1-giveme.gulu@gmail.com/#r]
> 
> In summary: auto_inval_data invalidates data cache even if the
> modification was done in a cache consistent manner (i.e. write
> through). This is not generally a consistency problem, because the
> backing file and the cache should be in sync.  The exception is when
> the writeback to the backing file hasn't yet finished and a getattr()
> call triggers invalidation (mtime change could be from a previous
> write), and the not yet written data is invalidated and replaced with
> stale data.
> 
> The proposed fix was to exclude concurrent reads and writes to the same region.
> 
> But the real issue here is that mtime changes triggered by this client
> should not cause data to be invalidated.  It's not only racy, but it's
> fundamentally wrong.  Unfortunately this is hard to do this correctly.
> Best I can come up with is that any request that expects mtime to be
> modified returns the mtime after the request has completed.
> 
> This would be much easier to implement in the fuse server: perform the
> "file changed remotely" check when serving a FUSE_GETATTR request and
> return a flag indicating whether the data needs to be invalidated or
> not.

For an intelligent server maybe, but let's say one uses
<libfuse>/example/passthrough*, in combination with some external writes
to the underlying file system outside of fuse. How would passthrough*
know about external changes?

The part I don't understand yet is why invalidate_inode_pages2() causes
an issue - it has folio_wait_writeback()?


Thanks,
Bernd

> 
> Thoughts?
> 
> Thanks,
> Miklos


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 20:16           ` Bernd Schubert
@ 2025-10-13 20:27             ` Joanne Koong
  2025-10-13 20:40               ` Bernd Schubert
  2025-10-14  8:06             ` Miklos Szeredi
  1 sibling, 1 reply; 31+ messages in thread
From: Joanne Koong @ 2025-10-13 20:27 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, lu gu, linux-fsdevel, linux-kernel, Brian Foster

On Mon, Oct 13, 2025 at 1:16 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>
> On 10/13/25 15:39, Miklos Szeredi wrote:
> > On Fri, 10 Oct 2025 at 10:46, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> >> My idea is to introduce FUSE_I_MTIME_UNSTABLE (which would work
> >> similarly to FUSE_I_SIZE_UNSTABLE) and when fetching old_mtime, verify
> >> that it hasn't been invalidated.  If old_mtime is invalid or if
> >> FUSE_I_MTIME_UNSTABLE signals that a write is in progress, the page
> >> cache is not invalidated.
> >
> > [Adding Brian Foster, the author of FUSE_AUTO_INVAL_DATA patches.
> > Link to complete thread:
> > https://lore.kernel.org/all/20251009110623.3115511-1-giveme.gulu@gmail.com/#r]
> >
> > In summary: auto_inval_data invalidates data cache even if the
> > modification was done in a cache consistent manner (i.e. write
> > through). This is not generally a consistency problem, because the
> > backing file and the cache should be in sync.  The exception is when
> > the writeback to the backing file hasn't yet finished and a getattr()
> > call triggers invalidation (mtime change could be from a previous
> > write), and the not yet written data is invalidated and replaced with
> > stale data.
> >
> > The proposed fix was to exclude concurrent reads and writes to the same region.
> >
> > But the real issue here is that mtime changes triggered by this client
> > should not cause data to be invalidated.  It's not only racy, but it's
> > fundamentally wrong.  Unfortunately this is hard to do this correctly.
> > Best I can come up with is that any request that expects mtime to be
> > modified returns the mtime after the request has completed.
> >
> > This would be much easier to implement in the fuse server: perform the
> > "file changed remotely" check when serving a FUSE_GETATTR request and
> > return a flag indicating whether the data needs to be invalidated or
> > not.
>
> For an intelligent server maybe, but let's say one uses
> <libfuse>/example/passthrough*, in combination with some external writes
> to the underlying file system outside of fuse. How would passthrough*
> know about external changes?
>
> The part I don't understand yet is why invalidate_inode_pages2() causes
> an issue - it has folio_wait_writeback()?
>

This issue is for the writethrough path which doesn't use writeback.

Thanks,
Joanne
>
> Thanks,
> Bernd
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 20:27             ` Joanne Koong
@ 2025-10-13 20:40               ` Bernd Schubert
  2025-10-13 23:32                 ` Joanne Koong
  0 siblings, 1 reply; 31+ messages in thread
From: Bernd Schubert @ 2025-10-13 20:40 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, lu gu, linux-fsdevel, linux-kernel, Brian Foster



On 10/13/25 22:27, Joanne Koong wrote:
> On Mon, Oct 13, 2025 at 1:16 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>>
>> On 10/13/25 15:39, Miklos Szeredi wrote:
>>> On Fri, 10 Oct 2025 at 10:46, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>
>>>> My idea is to introduce FUSE_I_MTIME_UNSTABLE (which would work
>>>> similarly to FUSE_I_SIZE_UNSTABLE) and when fetching old_mtime, verify
>>>> that it hasn't been invalidated.  If old_mtime is invalid or if
>>>> FUSE_I_MTIME_UNSTABLE signals that a write is in progress, the page
>>>> cache is not invalidated.
>>>
>>> [Adding Brian Foster, the author of FUSE_AUTO_INVAL_DATA patches.
>>> Link to complete thread:
>>> https://lore.kernel.org/all/20251009110623.3115511-1-giveme.gulu@gmail.com/#r]
>>>
>>> In summary: auto_inval_data invalidates data cache even if the
>>> modification was done in a cache consistent manner (i.e. write
>>> through). This is not generally a consistency problem, because the
>>> backing file and the cache should be in sync.  The exception is when
>>> the writeback to the backing file hasn't yet finished and a getattr()
>>> call triggers invalidation (mtime change could be from a previous
>>> write), and the not yet written data is invalidated and replaced with
>>> stale data.
>>>
>>> The proposed fix was to exclude concurrent reads and writes to the same region.
>>>
>>> But the real issue here is that mtime changes triggered by this client
>>> should not cause data to be invalidated.  It's not only racy, but it's
>>> fundamentally wrong.  Unfortunately this is hard to do this correctly.
>>> Best I can come up with is that any request that expects mtime to be
>>> modified returns the mtime after the request has completed.
>>>
>>> This would be much easier to implement in the fuse server: perform the
>>> "file changed remotely" check when serving a FUSE_GETATTR request and
>>> return a flag indicating whether the data needs to be invalidated or
>>> not.
>>
>> For an intelligent server maybe, but let's say one uses
>> <libfuse>/example/passthrough*, in combination with some external writes
>> to the underlying file system outside of fuse. How would passthrough*
>> know about external changes?
>>
>> The part I don't understand yet is why invalidate_inode_pages2() causes
>> an issue - it has folio_wait_writeback()?
>>
> 
> This issue is for the writethrough path which doesn't use writeback.


Oh right. So we need some kind of fuse_invalidate_pages(), that would
wait for for all current fuse_send_write_pages() to complete? Is that
what you meant with 'fi->writectr bias'?

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 20:40               ` Bernd Schubert
@ 2025-10-13 23:32                 ` Joanne Koong
  0 siblings, 0 replies; 31+ messages in thread
From: Joanne Koong @ 2025-10-13 23:32 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, lu gu, linux-fsdevel, linux-kernel, Brian Foster

On Mon, Oct 13, 2025 at 1:40 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>
> On 10/13/25 22:27, Joanne Koong wrote:
> > On Mon, Oct 13, 2025 at 1:16 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> >>
> >> On 10/13/25 15:39, Miklos Szeredi wrote:
> >>> On Fri, 10 Oct 2025 at 10:46, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>>
> >>>> My idea is to introduce FUSE_I_MTIME_UNSTABLE (which would work
> >>>> similarly to FUSE_I_SIZE_UNSTABLE) and when fetching old_mtime, verify
> >>>> that it hasn't been invalidated.  If old_mtime is invalid or if
> >>>> FUSE_I_MTIME_UNSTABLE signals that a write is in progress, the page
> >>>> cache is not invalidated.
> >>>
> >>> [Adding Brian Foster, the author of FUSE_AUTO_INVAL_DATA patches.
> >>> Link to complete thread:
> >>> https://lore.kernel.org/all/20251009110623.3115511-1-giveme.gulu@gmail.com/#r]
> >>>
> >>> In summary: auto_inval_data invalidates data cache even if the
> >>> modification was done in a cache consistent manner (i.e. write
> >>> through). This is not generally a consistency problem, because the
> >>> backing file and the cache should be in sync.  The exception is when
> >>> the writeback to the backing file hasn't yet finished and a getattr()
> >>> call triggers invalidation (mtime change could be from a previous
> >>> write), and the not yet written data is invalidated and replaced with
> >>> stale data.
> >>>
> >>> The proposed fix was to exclude concurrent reads and writes to the same region.
> >>>
> >>> But the real issue here is that mtime changes triggered by this client
> >>> should not cause data to be invalidated.  It's not only racy, but it's
> >>> fundamentally wrong.  Unfortunately this is hard to do this correctly.
> >>> Best I can come up with is that any request that expects mtime to be
> >>> modified returns the mtime after the request has completed.
> >>>
> >>> This would be much easier to implement in the fuse server: perform the
> >>> "file changed remotely" check when serving a FUSE_GETATTR request and
> >>> return a flag indicating whether the data needs to be invalidated or
> >>> not.
> >>
> >> For an intelligent server maybe, but let's say one uses
> >> <libfuse>/example/passthrough*, in combination with some external writes
> >> to the underlying file system outside of fuse. How would passthrough*
> >> know about external changes?
> >>
> >> The part I don't understand yet is why invalidate_inode_pages2() causes
> >> an issue - it has folio_wait_writeback()?
> >>
> >
> > This issue is for the writethrough path which doesn't use writeback.
>
>
> Oh right. So we need some kind of fuse_invalidate_pages(), that would
> wait for for all current fuse_send_write_pages() to complete? Is that
> what you meant with 'fi->writectr bias'?

Yes but unfortunately this would block reads on any part of the file,
not just the part that's being written.

Looking at this more, I think we actually could just grab the folio
locks and if we need to fault anything in, unlock all the folios,
fault in the bytes, and then relock all the folios. I don't think we would
need to check anything between dropping the locks and relocking, since
if we're comparing this against the existing code where we drop the
locks on fully-copied folios, this doesn't introduce any new race
conditions as far as I can see.

Thanks,
Joanne

>
> Thanks,
> Bernd

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 13:39         ` Miklos Szeredi
  2025-10-13 17:44           ` Brian Foster
  2025-10-13 20:16           ` Bernd Schubert
@ 2025-10-13 23:43           ` Joanne Koong
  2025-10-14  8:11             ` Miklos Szeredi
  2 siblings, 1 reply; 31+ messages in thread
From: Joanne Koong @ 2025-10-13 23:43 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lu gu, linux-fsdevel, linux-kernel, Brian Foster, Bernd Schubert

On Mon, Oct 13, 2025 at 6:40 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, 10 Oct 2025 at 10:46, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > My idea is to introduce FUSE_I_MTIME_UNSTABLE (which would work
> > similarly to FUSE_I_SIZE_UNSTABLE) and when fetching old_mtime, verify
> > that it hasn't been invalidated.  If old_mtime is invalid or if
> > FUSE_I_MTIME_UNSTABLE signals that a write is in progress, the page
> > cache is not invalidated.
>
> [Adding Brian Foster, the author of FUSE_AUTO_INVAL_DATA patches.
> Link to complete thread:
> https://lore.kernel.org/all/20251009110623.3115511-1-giveme.gulu@gmail.com/#r]
>
> In summary: auto_inval_data invalidates data cache even if the
> modification was done in a cache consistent manner (i.e. write
> through). This is not generally a consistency problem, because the
> backing file and the cache should be in sync.  The exception is when
> the writeback to the backing file hasn't yet finished and a getattr()
> call triggers invalidation (mtime change could be from a previous
> write), and the not yet written data is invalidated and replaced with
> stale data.
>
> The proposed fix was to exclude concurrent reads and writes to the same region.
>
> But the real issue here is that mtime changes triggered by this client
> should not cause data to be invalidated.  It's not only racy, but it's
> fundamentally wrong.  Unfortunately this is hard to do this correctly.
> Best I can come up with is that any request that expects mtime to be
> modified returns the mtime after the request has completed.
>
> This would be much easier to implement in the fuse server: perform the
> "file changed remotely" check when serving a FUSE_GETATTR request and
> return a flag indicating whether the data needs to be invalidated or
> not.

Doesn't this still lead to a problem if the data does need to be
invalidated? If the data changed remotely, then afaict the page cache
would have the new updated data but the newest write data would still
be missing in the page cache.

Thanks,
Joanne

>
> Thoughts?
>
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 18:53               ` Brian Foster
@ 2025-10-14  7:48                 ` Miklos Szeredi
  2025-10-14 12:43                   ` Miklos Szeredi
  2025-10-14 14:01                   ` Brian Foster
  0 siblings, 2 replies; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-14  7:48 UTC (permalink / raw)
  To: Brian Foster
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert

On Mon, 13 Oct 2025 at 20:49, Brian Foster <bfoster@redhat.com> wrote:

> Hrm Ok. But even if we did miss remote changes, whose to say we can even
> resolve that correctly from the kernel anyways..?

No, I'm worrying about the case of

- range1 cached locally,
- range1 changed remotely (mtime changed)
- range2 changed locally (mtime changed, cached mtime invalidated)
- range1 read locally

That last one will update mtime in cache, see that old cached mtime is
stale and happily read the stale data.

What we currently have is more correct in the sense that it will
invalidate data on any mtime change, be it of local or remote origin.

> > Yes, reproducer has auto_inval_data turned on (libfuse turns it on by default).
> >
>
> I was more wondering if the problem goes away if it were disabled..

I haven't tried, @guangming?

> Ah, yeah that makes sense. Though invalidate waits on writeback. Any
> reason this path couldn't skip the dirty state but mark the pages as
> under writeback across the op?

Maybe that'd work.  It *is* under writeback after all.

Maybe the solution is to change the write-through to regular cached
write + fsync range?  That could even be a complexity reduction.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 20:16           ` Bernd Schubert
  2025-10-13 20:27             ` Joanne Koong
@ 2025-10-14  8:06             ` Miklos Szeredi
  1 sibling, 0 replies; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-14  8:06 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel, Brian Foster

On Mon, 13 Oct 2025 at 22:16, Bernd Schubert <bernd@bsbernd.com> wrote:

> For an intelligent server maybe, but let's say one uses
> <libfuse>/example/passthrough*, in combination with some external writes
> to the underlying file system outside of fuse. How would passthrough*
> know about external changes?

It could use fanotify or leases for example to check for external modification

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-13 23:43           ` Joanne Koong
@ 2025-10-14  8:11             ` Miklos Szeredi
  2025-10-14  9:36               ` lu gu
  0 siblings, 1 reply; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-14  8:11 UTC (permalink / raw)
  To: Joanne Koong
  Cc: lu gu, linux-fsdevel, linux-kernel, Brian Foster, Bernd Schubert

On Tue, 14 Oct 2025 at 01:44, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Oct 13, 2025 at 6:40 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Fri, 10 Oct 2025 at 10:46, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > My idea is to introduce FUSE_I_MTIME_UNSTABLE (which would work
> > > similarly to FUSE_I_SIZE_UNSTABLE) and when fetching old_mtime, verify
> > > that it hasn't been invalidated.  If old_mtime is invalid or if
> > > FUSE_I_MTIME_UNSTABLE signals that a write is in progress, the page
> > > cache is not invalidated.
> >
> > [Adding Brian Foster, the author of FUSE_AUTO_INVAL_DATA patches.
> > Link to complete thread:
> > https://lore.kernel.org/all/20251009110623.3115511-1-giveme.gulu@gmail.com/#r]
> >
> > In summary: auto_inval_data invalidates data cache even if the
> > modification was done in a cache consistent manner (i.e. write
> > through). This is not generally a consistency problem, because the
> > backing file and the cache should be in sync.  The exception is when
> > the writeback to the backing file hasn't yet finished and a getattr()
> > call triggers invalidation (mtime change could be from a previous
> > write), and the not yet written data is invalidated and replaced with
> > stale data.
> >
> > The proposed fix was to exclude concurrent reads and writes to the same region.
> >
> > But the real issue here is that mtime changes triggered by this client
> > should not cause data to be invalidated.  It's not only racy, but it's
> > fundamentally wrong.  Unfortunately this is hard to do this correctly.
> > Best I can come up with is that any request that expects mtime to be
> > modified returns the mtime after the request has completed.
> >
> > This would be much easier to implement in the fuse server: perform the
> > "file changed remotely" check when serving a FUSE_GETATTR request and
> > return a flag indicating whether the data needs to be invalidated or
> > not.
>
> Doesn't this still lead to a problem if the data does need to be
> invalidated? If the data changed remotely, then afaict the page cache
> would have the new updated data but the newest write data would still
> be missing in the page cache.

Right, this would need to be done in combination with read/write exclusion.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14  8:11             ` Miklos Szeredi
@ 2025-10-14  9:36               ` lu gu
  0 siblings, 0 replies; 31+ messages in thread
From: lu gu @ 2025-10-14  9:36 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, linux-kernel, Brian Foster,
	Bernd Schubert

> > > Yes, reproducer has auto_inval_data turned on (libfuse turns it on by default).
> > >
> >
> > I was more wondering if the problem goes away if it were disabled..


Yes, I tried commenting out the invalidate_inode_pages2 function and
ran the LTP test cases in a loop more than ten times, and no data
inconsistency issues occurred.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14  7:48                 ` Miklos Szeredi
@ 2025-10-14 12:43                   ` Miklos Szeredi
  2025-10-14 16:15                     ` Miklos Szeredi
  2025-10-14 17:01                     ` Joanne Koong
  2025-10-14 14:01                   ` Brian Foster
  1 sibling, 2 replies; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-14 12:43 UTC (permalink / raw)
  To: Brian Foster
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert

On Tue, 14 Oct 2025 at 09:48, Miklos Szeredi <miklos@szeredi.hu> wrote:

> Maybe the solution is to change the write-through to regular cached
> write + fsync range?  That could even be a complexity reduction.

While this would be nice, it's impossible to guarantee requests being
initiated in the context of the original write(2), which means that
the information about which open file it originated from might be
lost.   This could result in regressions, so I don't think we should
risk it.

Will try the idea of marking folios writeback for the duration of the write.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14  7:48                 ` Miklos Szeredi
  2025-10-14 12:43                   ` Miklos Szeredi
@ 2025-10-14 14:01                   ` Brian Foster
  2025-10-14 16:10                     ` Miklos Szeredi
  1 sibling, 1 reply; 31+ messages in thread
From: Brian Foster @ 2025-10-14 14:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert

On Tue, Oct 14, 2025 at 09:48:34AM +0200, Miklos Szeredi wrote:
> On Mon, 13 Oct 2025 at 20:49, Brian Foster <bfoster@redhat.com> wrote:
> 
> > Hrm Ok. But even if we did miss remote changes, whose to say we can even
> > resolve that correctly from the kernel anyways..?
> 
> No, I'm worrying about the case of
> 
> - range1 cached locally,
> - range1 changed remotely (mtime changed)
> - range2 changed locally (mtime changed, cached mtime invalidated)
> - range1 read locally
> 
> That last one will update mtime in cache, see that old cached mtime is
> stale and happily read the stale data.
> 
> What we currently have is more correct in the sense that it will
> invalidate data on any mtime change, be it of local or remote origin.
> 

Yeah, I guess. IMO if you made it policy that this sort of thing is
userspace responsibility, then something like the above is not
necessarily incorrect in fuse, it's more a failure of userspace.

I'd guess the challenge is more how to manage the change in behavior.
Maybe that would need an opt-in flag or something for userspace to
signify it understands it is responsible for external changes, and then
have a notify call or whatever that can tie into cache truncation
(where'd you'd explicitly punch out cache even if dirty).

But TBH, if the writeback thing or something similarly simple works for
resolving the immediate bug, I wouldnt worry too much about it
until/unless there are userspace fs' explicitly looking for that sort of
behavior. Just my .02.

Brian

> > > Yes, reproducer has auto_inval_data turned on (libfuse turns it on by default).
> > >
> >
> > I was more wondering if the problem goes away if it were disabled..
> 
> I haven't tried, @guangming?
> 
> > Ah, yeah that makes sense. Though invalidate waits on writeback. Any
> > reason this path couldn't skip the dirty state but mark the pages as
> > under writeback across the op?
> 
> Maybe that'd work.  It *is* under writeback after all.
> 
> Maybe the solution is to change the write-through to regular cached
> write + fsync range?  That could even be a complexity reduction.
> 
> Thanks,
> Miklos
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14 14:01                   ` Brian Foster
@ 2025-10-14 16:10                     ` Miklos Szeredi
  2025-10-14 16:15                       ` Bernd Schubert
  2025-10-14 16:21                       ` Brian Foster
  0 siblings, 2 replies; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-14 16:10 UTC (permalink / raw)
  To: Brian Foster
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert

On Tue, 14 Oct 2025 at 15:57, Brian Foster <bfoster@redhat.com> wrote:

> But TBH, if the writeback thing or something similarly simple works for
> resolving the immediate bug, I wouldnt worry too much about it
> until/unless there are userspace fs' explicitly looking for that sort of
> behavior. Just my .02.

Agreed.

I just feel it unfortunate that this is default in libfuse and so many
filesystems will have auto_inval_data enabled which don't even need
it, and some mixed read-write workloads suffering badly as a
consequence.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14 12:43                   ` Miklos Szeredi
@ 2025-10-14 16:15                     ` Miklos Szeredi
  2025-10-14 17:01                     ` Joanne Koong
  1 sibling, 0 replies; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-14 16:15 UTC (permalink / raw)
  To: lu gu
  Cc: Brian Foster, Joanne Koong, linux-fsdevel, linux-kernel,
	Bernd Schubert

[-- Attachment #1: Type: text/plain, Size: 279 bytes --]

On Tue, 14 Oct 2025 at 14:43, Miklos Szeredi <miklos@szeredi.hu> wrote:

> Will try the idea of marking folios writeback for the duration of the write.

Attaching a test patch, minimally tested.

Guangming, can you please test if this fixes the cache corruption?

Thanks,
Miklos

[-- Attachment #2: fuse-write-through-set-writeback.patch --]
[-- Type: text/x-patch, Size: 1023 bytes --]

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 905726ac3a7a..2f12a501df9d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1121,9 +1121,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 	bool short_write;
 	int err;
 
-	for (i = 0; i < ap->num_folios; i++)
-		folio_wait_writeback(ap->folios[i]);
-
 	fuse_write_args_fill(ia, ff, pos, count);
 	ia->write.in.flags = fuse_write_flags(iocb);
 	if (fm->fc->handle_killpriv_v2 && !capable(CAP_FSETID))
@@ -1153,6 +1150,8 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 		}
 		if (ia->write.folio_locked && (i == ap->num_folios - 1))
 			folio_unlock(folio);
+		else
+			folio_end_writeback_no_dropbehind(folio);
 		folio_put(folio);
 	}
 
@@ -1232,6 +1231,8 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
 			folio_mark_uptodate(folio);
 
 		if (folio_test_uptodate(folio)) {
+			folio_wait_writeback(folio);
+			folio_start_writeback(folio);
 			folio_unlock(folio);
 		} else {
 			ia->write.folio_locked = true;

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14 16:10                     ` Miklos Szeredi
@ 2025-10-14 16:15                       ` Bernd Schubert
  2025-10-14 16:21                       ` Brian Foster
  1 sibling, 0 replies; 31+ messages in thread
From: Bernd Schubert @ 2025-10-14 16:15 UTC (permalink / raw)
  To: Miklos Szeredi, Brian Foster
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel



On 10/14/25 18:10, Miklos Szeredi wrote:
> On Tue, 14 Oct 2025 at 15:57, Brian Foster <bfoster@redhat.com> wrote:
> 
>> But TBH, if the writeback thing or something similarly simple works for
>> resolving the immediate bug, I wouldnt worry too much about it
>> until/unless there are userspace fs' explicitly looking for that sort of
>> behavior. Just my .02.
> 
> Agreed.
> 
> I just feel it unfortunate that this is default in libfuse and so many
> filesystems will have auto_inval_data enabled which don't even need
> it, and some mixed read-write workloads suffering badly as a
> consequence.

Maybe we should disable it in libfuse-3.18 and hope for reports, in case
there are regressions?


Thanks,
Bernd



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14 16:10                     ` Miklos Szeredi
  2025-10-14 16:15                       ` Bernd Schubert
@ 2025-10-14 16:21                       ` Brian Foster
  1 sibling, 0 replies; 31+ messages in thread
From: Brian Foster @ 2025-10-14 16:21 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lu gu, Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert

On Tue, Oct 14, 2025 at 06:10:45PM +0200, Miklos Szeredi wrote:
> On Tue, 14 Oct 2025 at 15:57, Brian Foster <bfoster@redhat.com> wrote:
> 
> > But TBH, if the writeback thing or something similarly simple works for
> > resolving the immediate bug, I wouldnt worry too much about it
> > until/unless there are userspace fs' explicitly looking for that sort of
> > behavior. Just my .02.
> 
> Agreed.
> 
> I just feel it unfortunate that this is default in libfuse and so many
> filesystems will have auto_inval_data enabled which don't even need
> it, and some mixed read-write workloads suffering badly as a
> consequence.
> 

Maybe it didnt really need to be on by default? I don't recall caring
much about that (but again, long time ago) as long as the fs that wants
it can enable it.

Brian

> Thanks,
> Miklos
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14 12:43                   ` Miklos Szeredi
  2025-10-14 16:15                     ` Miklos Szeredi
@ 2025-10-14 17:01                     ` Joanne Koong
  2025-10-14 17:56                       ` Brian Foster
  1 sibling, 1 reply; 31+ messages in thread
From: Joanne Koong @ 2025-10-14 17:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Brian Foster, lu gu, linux-fsdevel, linux-kernel, Bernd Schubert

On Tue, Oct 14, 2025 at 5:43 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 14 Oct 2025 at 09:48, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Maybe the solution is to change the write-through to regular cached
> > write + fsync range?  That could even be a complexity reduction.
>
> While this would be nice, it's impossible to guarantee requests being
> initiated in the context of the original write(2), which means that
> the information about which open file it originated from might be
> lost.   This could result in regressions, so I don't think we should
> risk it.
>
> Will try the idea of marking folios writeback for the duration of the write.
>

Is it safe to mark a folio as being under writeback if it doesn't
actually go through mm writeback? for example, my understanding is
that the inode wb mechanisms get initiated when an inode is marked
dirty (__mark_inode_dirty()) but writethrough skips any dirtying.
Afaict, folio_start_writeback()/folio_end_write() needs i_wb.
Additionally, if the server page faults on the folio that is now
marked as under writeback, does that lead to a deadlock since
page_mkwrite() waits on folio writeback?


Thanks,
Joanne

> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14 17:01                     ` Joanne Koong
@ 2025-10-14 17:56                       ` Brian Foster
  2025-10-15  3:59                         ` lu gu
  0 siblings, 1 reply; 31+ messages in thread
From: Brian Foster @ 2025-10-14 17:56 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, lu gu, linux-fsdevel, linux-kernel,
	Bernd Schubert

On Tue, Oct 14, 2025 at 10:01:53AM -0700, Joanne Koong wrote:
> On Tue, Oct 14, 2025 at 5:43 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Tue, 14 Oct 2025 at 09:48, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > Maybe the solution is to change the write-through to regular cached
> > > write + fsync range?  That could even be a complexity reduction.
> >
> > While this would be nice, it's impossible to guarantee requests being
> > initiated in the context of the original write(2), which means that
> > the information about which open file it originated from might be
> > lost.   This could result in regressions, so I don't think we should
> > risk it.
> >
> > Will try the idea of marking folios writeback for the duration of the write.
> >
> 
> Is it safe to mark a folio as being under writeback if it doesn't
> actually go through mm writeback? for example, my understanding is
> that the inode wb mechanisms get initiated when an inode is marked
> dirty (__mark_inode_dirty()) but writethrough skips any dirtying.
> Afaict, folio_start_writeback()/folio_end_write() needs i_wb.
> Additionally, if the server page faults on the folio that is now
> marked as under writeback, does that lead to a deadlock since
> page_mkwrite() waits on folio writeback?
> 

WRT dirtying I think Miklos was primarily concerned with some other
thread being able to pick up the folio for writeback. I'm not certain if
writeback is dependent on being dirty, but if it is, ISTM you could
technically still mark the folio dirty and transition it to writeback
before it's unlocked in the write path. To Miklos earlier point that
would put the folio through the same sequence as proper writeback,
except I think would claim the writeback work for the current task.

That might start to look wonky I suppose, but maybe that can be
addressed after if it can at least be shown to fix the problem. For
example, if it's really just the wb that is an issue, perhaps an
inode_attach_wb() is all that's needed?

I'm not sure about the fault case. I'm assuming fuse still supports
traditional writeback and that this would all work similarly to that
case.

Brian

> 
> Thanks,
> Joanne
> 
> > Thanks,
> > Miklos
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-14 17:56                       ` Brian Foster
@ 2025-10-15  3:59                         ` lu gu
  2025-10-15 14:09                           ` Miklos Szeredi
  0 siblings, 1 reply; 31+ messages in thread
From: lu gu @ 2025-10-15  3:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert,
	Brian Foster

>  Attaching a test patch, minimally tested.
Since I only have a test environment for kernel 5.15, I ported this
patch to the FUSE module in 5.15. I ran the previous LTP test cases
more than ten times, and the data inconsistency issue did not reoccur.
However, a deadlock occur. Below is the specific stack trace.
root@ubuntu:~# ps aux | grep ltp
root       15040  0.0  0.0   2972  1124 pts/2    S+   11:26   0:00
/root/ltp-install/testcases/bin/doio -N iogen01 -a -v -n 2 -k
root       15042  0.2  0.0   3036  1320 pts/2    D+   11:26   0:01
/root/ltp-install/testcases/bin/doio -N iogen01 -a -v -n 2 -k
root@ubuntu:~# cat /proc/15042/stack
[<0>] __inode_wait_for_writeback+0xae/0xf0
[<0>] writeback_single_inode+0x72/0x190
[<0>] sync_inode_metadata+0x41/0x60
[<0>] fuse_fsync+0xbf/0x110 [fuse]
[<0>] vfs_fsync_range+0x49/0x90
[<0>] fuse_file_write_iter+0x34b/0x470 [fuse]
[<0>] new_sync_write+0x114/0x1a0
[<0>] vfs_write+0x1d5/0x270
[<0>] ksys_write+0x67/0xf0
[<0>] __x64_sys_write+0x19/0x20
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb
root@ubuntu:~# cat /proc/15040/stack
[<0>] do_wait+0x171/0x310
[<0>] kernel_wait4+0xaf/0x150
[<0>] __do_sys_wait4+0x89/0xa0
[<0>] __x64_sys_wait4+0x1c/0x30
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

Thanks,
Guangming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-15  3:59                         ` lu gu
@ 2025-10-15 14:09                           ` Miklos Szeredi
  2025-10-15 17:19                             ` Joanne Koong
  0 siblings, 1 reply; 31+ messages in thread
From: Miklos Szeredi @ 2025-10-15 14:09 UTC (permalink / raw)
  To: lu gu
  Cc: Joanne Koong, linux-fsdevel, linux-kernel, Bernd Schubert,
	Brian Foster

[-- Attachment #1: Type: text/plain, Size: 685 bytes --]

On Wed, 15 Oct 2025 at 06:00, lu gu <giveme.gulu@gmail.com> wrote:
>
> >  Attaching a test patch, minimally tested.
> Since I only have a test environment for kernel 5.15, I ported this
> patch to the FUSE module in 5.15. I ran the previous LTP test cases
> more than ten times, and the data inconsistency issue did not reoccur.
> However, a deadlock occur. Below is the specific stack trace.

This is does not reproduce for me on 6.17 even after running the test
for hours.  Without seeing your backport it is difficult to say
anything about the reason for the deadlock.

Attaching an updated patch that takes care of i_wb initialization on
CONFIG_CGROUP_WRITEBACK=y.

Thanks,
Miklos

[-- Attachment #2: fuse-write-through-set-writeback-v2.patch --]
[-- Type: text/x-patch, Size: 1318 bytes --]

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f1ef77a0be05..042c35f0466e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1126,9 +1126,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 	bool short_write;
 	int err;
 
-	for (i = 0; i < ap->num_folios; i++)
-		folio_wait_writeback(ap->folios[i]);
-
 	fuse_write_args_fill(ia, ff, pos, count);
 	ia->write.in.flags = fuse_write_flags(iocb);
 	if (fm->fc->handle_killpriv_v2 && !capable(CAP_FSETID))
@@ -1158,6 +1155,8 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 		}
 		if (ia->write.folio_locked && (i == ap->num_folios - 1))
 			folio_unlock(folio);
+		else
+			folio_end_writeback_no_dropbehind(folio);
 		folio_put(folio);
 	}
 
@@ -1236,7 +1235,9 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
 		if (tmp == folio_size(folio))
 			folio_mark_uptodate(folio);
 
+		folio_wait_writeback(folio);
 		if (folio_test_uptodate(folio)) {
+			folio_start_writeback(folio);
 			folio_unlock(folio);
 		} else {
 			ia->write.folio_locked = true;
@@ -1268,6 +1269,8 @@ static ssize_t fuse_perform_write(struct kiocb *iocb, struct iov_iter *ii)
 	int err = 0;
 	ssize_t res = 0;
 
+	inode_attach_wb(inode, NULL);
+
 	if (inode->i_size < pos + iov_iter_count(ii))
 		set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
 

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-15 14:09                           ` Miklos Szeredi
@ 2025-10-15 17:19                             ` Joanne Koong
  2025-10-15 19:48                               ` Brian Foster
  0 siblings, 1 reply; 31+ messages in thread
From: Joanne Koong @ 2025-10-15 17:19 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lu gu, linux-fsdevel, linux-kernel, Bernd Schubert, Brian Foster

On Wed, Oct 15, 2025 at 7:09 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, 15 Oct 2025 at 06:00, lu gu <giveme.gulu@gmail.com> wrote:
> >
> > >  Attaching a test patch, minimally tested.
> > Since I only have a test environment for kernel 5.15, I ported this
> > patch to the FUSE module in 5.15. I ran the previous LTP test cases
> > more than ten times, and the data inconsistency issue did not reoccur.
> > However, a deadlock occur. Below is the specific stack trace.
>
> This is does not reproduce for me on 6.17 even after running the test
> for hours.  Without seeing your backport it is difficult to say
> anything about the reason for the deadlock.
>
> Attaching an updated patch that takes care of i_wb initialization on
> CONFIG_CGROUP_WRITEBACK=y.

I think now we'll also need to always set
mapping_set_writeback_may_deadlock_on_reclaim(), eg

@@ -3125,8 +3128,7 @@ void fuse_init_file_inode(struct inode *inode,
unsigned int flags)

        inode->i_fop = &fuse_file_operations;
        inode->i_data.a_ops = &fuse_file_aops;
-       if (fc->writeback_cache)
-               mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);
+       mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);


Does this completely get rid of the race? There's a fair chance I'm
wrong here but doesn't the race still happen if the read invalidation
happens before the write grabs the folio lock? This is the scenario
I'm thinking of:

Thread A (read):
read, w/ auto inval and a outdated mtime triggers invalidate_inode_pages2()
generic_file_read_iter() is called, which calls filemap_read() ->
filemap_get_pages() -> triggers read_folio/readahead
read_folio/readahead fetches data (stale) from the server, unlocks folios

Thread B (writethrough write):
fuse_perform_write() -> fuse_fill_write_pages():
grabs the folio lock and copies new write data to page cache, sets
writeback flag and unlocks folio, sends request to server

Thread A (read):
the read data that was fetched from the server gets copied to the page
cache in filemap_read()
overwrites the write data in the page cache with the stale data

Am i misanalyzing something in this sequence?

Thanks,
Joanne
>
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-15 17:19                             ` Joanne Koong
@ 2025-10-15 19:48                               ` Brian Foster
  2025-10-15 20:28                                 ` Joanne Koong
  0 siblings, 1 reply; 31+ messages in thread
From: Brian Foster @ 2025-10-15 19:48 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, lu gu, linux-fsdevel, linux-kernel,
	Bernd Schubert

On Wed, Oct 15, 2025 at 10:19:15AM -0700, Joanne Koong wrote:
> On Wed, Oct 15, 2025 at 7:09 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, 15 Oct 2025 at 06:00, lu gu <giveme.gulu@gmail.com> wrote:
> > >
> > > >  Attaching a test patch, minimally tested.
> > > Since I only have a test environment for kernel 5.15, I ported this
> > > patch to the FUSE module in 5.15. I ran the previous LTP test cases
> > > more than ten times, and the data inconsistency issue did not reoccur.
> > > However, a deadlock occur. Below is the specific stack trace.
> >
> > This is does not reproduce for me on 6.17 even after running the test
> > for hours.  Without seeing your backport it is difficult to say
> > anything about the reason for the deadlock.
> >
> > Attaching an updated patch that takes care of i_wb initialization on
> > CONFIG_CGROUP_WRITEBACK=y.
> 
> I think now we'll also need to always set
> mapping_set_writeback_may_deadlock_on_reclaim(), eg
> 
> @@ -3125,8 +3128,7 @@ void fuse_init_file_inode(struct inode *inode,
> unsigned int flags)
> 
>         inode->i_fop = &fuse_file_operations;
>         inode->i_data.a_ops = &fuse_file_aops;
> -       if (fc->writeback_cache)
> -               mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);
> +       mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);
> 
> 
> Does this completely get rid of the race? There's a fair chance I'm
> wrong here but doesn't the race still happen if the read invalidation
> happens before the write grabs the folio lock? This is the scenario
> I'm thinking of:
> 
> Thread A (read):
> read, w/ auto inval and a outdated mtime triggers invalidate_inode_pages2()
> generic_file_read_iter() is called, which calls filemap_read() ->
> filemap_get_pages() -> triggers read_folio/readahead
> read_folio/readahead fetches data (stale) from the server, unlocks folios
> 
> Thread B (writethrough write):
> fuse_perform_write() -> fuse_fill_write_pages():
> grabs the folio lock and copies new write data to page cache, sets
> writeback flag and unlocks folio, sends request to server
> 
> Thread A (read):
> the read data that was fetched from the server gets copied to the page
> cache in filemap_read()
> overwrites the write data in the page cache with the stale data
> 
> Am i misanalyzing something in this sequence?
> 

Maybe I misread the description, but I think folios are locked across
read I/O, so I don't follow how we could race with readahead in this
way. Hm?

Brian

> Thanks,
> Joanne
> >
> > Thanks,
> > Miklos
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-15 19:48                               ` Brian Foster
@ 2025-10-15 20:28                                 ` Joanne Koong
  2025-10-20 10:10                                   ` lu gu
  0 siblings, 1 reply; 31+ messages in thread
From: Joanne Koong @ 2025-10-15 20:28 UTC (permalink / raw)
  To: Brian Foster
  Cc: Miklos Szeredi, lu gu, linux-fsdevel, linux-kernel,
	Bernd Schubert

On Wed, Oct 15, 2025 at 12:44 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Wed, Oct 15, 2025 at 10:19:15AM -0700, Joanne Koong wrote:
> > On Wed, Oct 15, 2025 at 7:09 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Wed, 15 Oct 2025 at 06:00, lu gu <giveme.gulu@gmail.com> wrote:
> > > >
> > > > >  Attaching a test patch, minimally tested.
> > > > Since I only have a test environment for kernel 5.15, I ported this
> > > > patch to the FUSE module in 5.15. I ran the previous LTP test cases
> > > > more than ten times, and the data inconsistency issue did not reoccur.
> > > > However, a deadlock occur. Below is the specific stack trace.
> > >
> > > This is does not reproduce for me on 6.17 even after running the test
> > > for hours.  Without seeing your backport it is difficult to say
> > > anything about the reason for the deadlock.
> > >
> > > Attaching an updated patch that takes care of i_wb initialization on
> > > CONFIG_CGROUP_WRITEBACK=y.
> >
> > I think now we'll also need to always set
> > mapping_set_writeback_may_deadlock_on_reclaim(), eg
> >
> > @@ -3125,8 +3128,7 @@ void fuse_init_file_inode(struct inode *inode,
> > unsigned int flags)
> >
> >         inode->i_fop = &fuse_file_operations;
> >         inode->i_data.a_ops = &fuse_file_aops;
> > -       if (fc->writeback_cache)
> > -               mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);
> > +       mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);
> >
> >
> > Does this completely get rid of the race? There's a fair chance I'm
> > wrong here but doesn't the race still happen if the read invalidation
> > happens before the write grabs the folio lock? This is the scenario
> > I'm thinking of:
> >
> > Thread A (read):
> > read, w/ auto inval and a outdated mtime triggers invalidate_inode_pages2()
> > generic_file_read_iter() is called, which calls filemap_read() ->
> > filemap_get_pages() -> triggers read_folio/readahead
> > read_folio/readahead fetches data (stale) from the server, unlocks folios
> >
> > Thread B (writethrough write):
> > fuse_perform_write() -> fuse_fill_write_pages():
> > grabs the folio lock and copies new write data to page cache, sets
> > writeback flag and unlocks folio, sends request to server
> >
> > Thread A (read):
> > the read data that was fetched from the server gets copied to the page
> > cache in filemap_read()
> > overwrites the write data in the page cache with the stale data
> >
> > Am i misanalyzing something in this sequence?
> >
>
> Maybe I misread the description, but I think folios are locked across
> read I/O, so I don't follow how we could race with readahead in this
> way. Hm?

Ah I see where my analysis went wrong - the "copy_folio_to_iter()"
call in filemap_read() copies the data into the client's user buffer,
not the data into the page cache. The data gets copied to the page
cache in the fuse code in fuse_copy_out_args() (through
fuse_dev_do_write()), which has to be under the folio lock. Yeah
you're right, there's no race condition here then. Thanks for clearing
this up.

>
> Brian
>
> > Thanks,
> > Joanne
> > >
> > > Thanks,
> > > Miklos
> >
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
  2025-10-15 20:28                                 ` Joanne Koong
@ 2025-10-20 10:10                                   ` lu gu
  0 siblings, 0 replies; 31+ messages in thread
From: lu gu @ 2025-10-20 10:10 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Brian Foster, linux-fsdevel, linux-kernel, Bernd Schubert,
	Joanne Koong

I tried to backport the fix  to my 5.15 environment.
After further investigation and comparing the code across kernel
versions, I now believe I understand why the straightforward backport
failed.

My understanding is that in kernel 5.15, FUSE's writeback detection
(e.g., in fuse_wait_on_page_writeback) relies on its own tracking
mechanism—the fi->writepages red-black tree, which is checked via
fuse_find_writeback(). In contrast, the fix in the mainline kernel
appears to rely on the generic VFS/MM mechanism, where
folio_wait_writeback() directly checks the PG_writeback flag on the
folio itself.

By simply backporting the logic that sets the PG_writeback flag
without also adding a corresponding entry to the fi->writepages
red-black tree, I created an inconsistent state: the page was marked
as under writeback, but FUSE's own checking functions were completely
unaware of it. I believe this inconsistency is what caused the
deadlock.

Therefore, a proper fix for 5.15 will require a more sophisticated approach.

On Thu, Oct 16, 2025 at 4:28 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 12:44 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Wed, Oct 15, 2025 at 10:19:15AM -0700, Joanne Koong wrote:
> > > On Wed, Oct 15, 2025 at 7:09 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > On Wed, 15 Oct 2025 at 06:00, lu gu <giveme.gulu@gmail.com> wrote:
> > > > >
> > > > > >  Attaching a test patch, minimally tested.
> > > > > Since I only have a test environment for kernel 5.15, I ported this
> > > > > patch to the FUSE module in 5.15. I ran the previous LTP test cases
> > > > > more than ten times, and the data inconsistency issue did not reoccur.
> > > > > However, a deadlock occur. Below is the specific stack trace.
> > > >
> > > > This is does not reproduce for me on 6.17 even after running the test
> > > > for hours.  Without seeing your backport it is difficult to say
> > > > anything about the reason for the deadlock.
> > > >
> > > > Attaching an updated patch that takes care of i_wb initialization on
> > > > CONFIG_CGROUP_WRITEBACK=y.
> > >
> > > I think now we'll also need to always set
> > > mapping_set_writeback_may_deadlock_on_reclaim(), eg
> > >
> > > @@ -3125,8 +3128,7 @@ void fuse_init_file_inode(struct inode *inode,
> > > unsigned int flags)
> > >
> > >         inode->i_fop = &fuse_file_operations;
> > >         inode->i_data.a_ops = &fuse_file_aops;
> > > -       if (fc->writeback_cache)
> > > -               mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);
> > > +       mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);
> > >
> > >
> > > Does this completely get rid of the race? There's a fair chance I'm
> > > wrong here but doesn't the race still happen if the read invalidation
> > > happens before the write grabs the folio lock? This is the scenario
> > > I'm thinking of:
> > >
> > > Thread A (read):
> > > read, w/ auto inval and a outdated mtime triggers invalidate_inode_pages2()
> > > generic_file_read_iter() is called, which calls filemap_read() ->
> > > filemap_get_pages() -> triggers read_folio/readahead
> > > read_folio/readahead fetches data (stale) from the server, unlocks folios
> > >
> > > Thread B (writethrough write):
> > > fuse_perform_write() -> fuse_fill_write_pages():
> > > grabs the folio lock and copies new write data to page cache, sets
> > > writeback flag and unlocks folio, sends request to server
> > >
> > > Thread A (read):
> > > the read data that was fetched from the server gets copied to the page
> > > cache in filemap_read()
> > > overwrites the write data in the page cache with the stale data
> > >
> > > Am i misanalyzing something in this sequence?
> > >
> >
> > Maybe I misread the description, but I think folios are locked across
> > read I/O, so I don't follow how we could race with readahead in this
> > way. Hm?
>
> Ah I see where my analysis went wrong - the "copy_folio_to_iter()"
> call in filemap_read() copies the data into the client's user buffer,
> not the data into the page cache. The data gets copied to the page
> cache in the fuse code in fuse_copy_out_args() (through
> fuse_dev_do_write()), which has to be under the folio lock. Yeah
> you're right, there's no race condition here then. Thanks for clearing
> this up.
>
> >
> > Brian
> >
> > > Thanks,
> > > Joanne
> > > >
> > > > Thanks,
> > > > Miklos
> > >
> >

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2025-10-20 10:10 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-09 11:06 [PATCH 5.15] fuse: Fix race condition in writethrough path A race guangming.zhao
2025-10-09 22:11 ` Joanne Koong
     [not found]   ` <CAFS-8+VcZn7WZgjV9pHz4c8DYHRdP0on6-er5fm9TZF9RAO0xQ@mail.gmail.com>
2025-10-10  6:25     ` lu gu
2025-10-10  8:46       ` Miklos Szeredi
2025-10-13 13:39         ` Miklos Szeredi
2025-10-13 17:44           ` Brian Foster
2025-10-13 18:23             ` Miklos Szeredi
2025-10-13 18:53               ` Brian Foster
2025-10-14  7:48                 ` Miklos Szeredi
2025-10-14 12:43                   ` Miklos Szeredi
2025-10-14 16:15                     ` Miklos Szeredi
2025-10-14 17:01                     ` Joanne Koong
2025-10-14 17:56                       ` Brian Foster
2025-10-15  3:59                         ` lu gu
2025-10-15 14:09                           ` Miklos Szeredi
2025-10-15 17:19                             ` Joanne Koong
2025-10-15 19:48                               ` Brian Foster
2025-10-15 20:28                                 ` Joanne Koong
2025-10-20 10:10                                   ` lu gu
2025-10-14 14:01                   ` Brian Foster
2025-10-14 16:10                     ` Miklos Szeredi
2025-10-14 16:15                       ` Bernd Schubert
2025-10-14 16:21                       ` Brian Foster
2025-10-13 20:16           ` Bernd Schubert
2025-10-13 20:27             ` Joanne Koong
2025-10-13 20:40               ` Bernd Schubert
2025-10-13 23:32                 ` Joanne Koong
2025-10-14  8:06             ` Miklos Szeredi
2025-10-13 23:43           ` Joanne Koong
2025-10-14  8:11             ` Miklos Szeredi
2025-10-14  9:36               ` lu gu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).