silent data corruption in fuse in rc1

All of lore.kernel.org
 help / color / mirror / Atom feed

* silent data corruption in fuse in rc1
@ 2024-12-07 21:58 Kent Overstreet
  2024-12-07 23:01 ` Malte Schröder
  0 siblings, 1 reply; 31+ messages in thread
From: Kent Overstreet @ 2024-12-07 21:58 UTC (permalink / raw)
  To: Miklos Szeredi, malte.schroeder
  Cc: Josef Bacik, Joanne Koong, Matthew Wilcox, linux-fsdevel

Hey Miklos, I've got a user who noticed silent data corruption when
building flatpaks on rc1; it appeared when using both bcachefs and btrfs
as the host filesystem, it's now looking like fuse changes might be the
culprit - we just got confirmation that reverting
fb527fc1f36e252cd1f62a26be4906949e7708ff fixes it.

Can you guys take it from here? Also, I think a data corruption in rc1
merits a post mortem and hopefully some test improvements, this caused a
bit of a freak out.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-07 21:58 silent data corruption in fuse in rc1 Kent Overstreet
@ 2024-12-07 23:01 ` Malte Schröder
  2024-12-07 23:39   ` Kent Overstreet
  2024-12-08  1:23   ` Matthew Wilcox
  0 siblings, 2 replies; 31+ messages in thread
From: Malte Schröder @ 2024-12-07 23:01 UTC (permalink / raw)
  To: Kent Overstreet, Miklos Szeredi
  Cc: Josef Bacik, Joanne Koong, Matthew Wilcox, linux-fsdevel

On 07/12/2024 22:58, Kent Overstreet wrote:
> Hey Miklos, I've got a user who noticed silent data corruption when
> building flatpaks on rc1; it appeared when using both bcachefs and btrfs
> as the host filesystem, it's now looking like fuse changes might be the
> culprit - we just got confirmation that reverting
> fb527fc1f36e252cd1f62a26be4906949e7708ff fixes it.
>
> Can you guys take it from here? Also, I think a data corruption in rc1
> merits a post mortem and hopefully some test improvements, this caused a
> bit of a freak out.
Hi
all,                                                                        
                                                                             
 
I did report this. What I encountered is flatpak installs failing on
bcachefs  
because of mismatching hashes. I did not notice any other issues, just
this    
thing with flatpak. Flatpak seems to put downloaded files into
directories     
mounted with "revokefs-fuse" on /var/tmp. So far I could only reproduce
this on
bcachefs, it does not happen when I make /var/tmp a tmpfs or when I
bind-mount
a directory from a btrfs onto
/var/tmp.                                        
                                                                             
 
To me there seems to be some bad interaction between fuse and
bcachefs.        
                                                                             
 
Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
me.     
                                                                             
 
Regards                                                                      
 
                                                                             
 
/Malte                                                                       
 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-07 23:01 ` Malte Schröder
@ 2024-12-07 23:39   ` Kent Overstreet
  2024-12-08  0:11     ` Malte Schröder
  2024-12-08  1:23   ` Matthew Wilcox
  1 sibling, 1 reply; 31+ messages in thread
From: Kent Overstreet @ 2024-12-07 23:39 UTC (permalink / raw)
  To: Malte Schröder
  Cc: Miklos Szeredi, Josef Bacik, Joanne Koong, Matthew Wilcox,
	linux-fsdevel

On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
> On 07/12/2024 22:58, Kent Overstreet wrote:
> > Hey Miklos, I've got a user who noticed silent data corruption when
> > building flatpaks on rc1; it appeared when using both bcachefs and btrfs
> > as the host filesystem, it's now looking like fuse changes might be the
> > culprit - we just got confirmation that reverting
> > fb527fc1f36e252cd1f62a26be4906949e7708ff fixes it.
> >
> > Can you guys take it from here? Also, I think a data corruption in rc1
> > merits a post mortem and hopefully some test improvements, this caused a
> > bit of a freak out.
> Hi
> all,                                                                        
>                                                                              
>  
> I did report this. What I encountered is flatpak installs failing on
> bcachefs  
> because of mismatching hashes. I did not notice any other issues, just
> this    
> thing with flatpak. Flatpak seems to put downloaded files into
> directories     
> mounted with "revokefs-fuse" on /var/tmp. So far I could only reproduce
> this on
> bcachefs, it does not happen when I make /var/tmp a tmpfs or when I
> bind-mount
> a directory from a btrfs onto
> /var/tmp.                                        
>                                                                              
>  
> To me there seems to be some bad interaction between fuse and
> bcachefs.        
>                                                                              
>  
> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
> me.     

You said it happened with btrfs as well?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-07 23:39   ` Kent Overstreet
@ 2024-12-08  0:11     ` Malte Schröder
  0 siblings, 0 replies; 31+ messages in thread
From: Malte Schröder @ 2024-12-08  0:11 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Miklos Szeredi, Josef Bacik, Joanne Koong, Matthew Wilcox,
	linux-fsdevel

On 08/12/2024 00:39, Kent Overstreet wrote:
> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
>> On 07/12/2024 22:58, Kent Overstreet wrote:
>>> Hey Miklos, I've got a user who noticed silent data corruption when
>>> building flatpaks on rc1; it appeared when using both bcachefs and btrfs
>>> as the host filesystem, it's now looking like fuse changes might be the
>>> culprit - we just got confirmation that reverting
>>> fb527fc1f36e252cd1f62a26be4906949e7708ff fixes it.
>>>
>>> Can you guys take it from here? Also, I think a data corruption in rc1
>>> merits a post mortem and hopefully some test improvements, this caused a
>>> bit of a freak out.
>> Hi
>> all,                                                                        
>>                                                                              
>>  
>> I did report this. What I encountered is flatpak installs failing on
>> bcachefs  
>> because of mismatching hashes. I did not notice any other issues, just
>> this    
>> thing with flatpak. Flatpak seems to put downloaded files into
>> directories     
>> mounted with "revokefs-fuse" on /var/tmp. So far I could only reproduce
>> this on
>> bcachefs, it does not happen when I make /var/tmp a tmpfs or when I
>> bind-mount
>> a directory from a btrfs onto
>> /var/tmp.                                        
>>                                                                              
>>  
>> To me there seems to be some bad interaction between fuse and
>> bcachefs.        
>>                                                                              
>>  
>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
>> me.     
> You said it happened with btrfs as well?
That was my first assessment, but my method was flawed. I corrected
later. Sorry for the confusion.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-07 23:01 ` Malte Schröder
  2024-12-07 23:39   ` Kent Overstreet
@ 2024-12-08  1:23   ` Matthew Wilcox
  2024-12-08 20:02     ` Malte Schröder
  1 sibling, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2024-12-08  1:23 UTC (permalink / raw)
  To: Malte Schröder
  Cc: Kent Overstreet, Miklos Szeredi, Josef Bacik, Joanne Koong,
	linux-fsdevel

On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
> me.     

That's a merge commit ... does the problem reproduce if you run
d1dfb5f52ffc?  And if it does, can you bisect the problem any further
back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
between those two.

If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
of an interaction to debug ;-(

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-08  1:23   ` Matthew Wilcox
@ 2024-12-08 20:02     ` Malte Schröder
  2024-12-08 22:32       ` Malte Schröder
  0 siblings, 1 reply; 31+ messages in thread
From: Malte Schröder @ 2024-12-08 20:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, Miklos Szeredi, Josef Bacik, Joanne Koong,
	linux-fsdevel

On 08/12/2024 02:23, Matthew Wilcox wrote:
> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
>> me.     
> That's a merge commit ... does the problem reproduce if you run
> d1dfb5f52ffc?  And if it does, can you bisect the problem any further
> back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
> between those two.
>
> If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
> of an interaction to debug ;-(

I spent half a day compiling kernels, but bisect was non-conclusive.
There are some steps where the failure mode changes slightly, so this is
hard. It ended up at 445d9f05fa149556422f7fdd52dacf487cc8e7be which is
the nfsd-6.13 merge ...

d1dfb5f52ffc also shows the issue. I will try to narrow down from there.

/Malte


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-08 20:02     ` Malte Schröder
@ 2024-12-08 22:32       ` Malte Schröder
  2024-12-09  1:57         ` Jingbo Xu
  2024-12-09 14:49         ` Josef Bacik
  0 siblings, 2 replies; 31+ messages in thread
From: Malte Schröder @ 2024-12-08 22:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, Miklos Szeredi, Josef Bacik, Joanne Koong,
	linux-fsdevel

On 08/12/2024 21:02, Malte Schröder wrote:
> On 08/12/2024 02:23, Matthew Wilcox wrote:
>> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
>>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
>>> me.     
>> That's a merge commit ... does the problem reproduce if you run
>> d1dfb5f52ffc?  And if it does, can you bisect the problem any further
>> back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
>> between those two.
>>
>> If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
>> of an interaction to debug ;-(
> I spent half a day compiling kernels, but bisect was non-conclusive.
> There are some steps where the failure mode changes slightly, so this is
> hard. It ended up at 445d9f05fa149556422f7fdd52dacf487cc8e7be which is
> the nfsd-6.13 merge ...
>
> d1dfb5f52ffc also shows the issue. I will try to narrow down from there.
>
> /Malte
>
Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
with 3b97c3652d91 as the culprit.

/Malte


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-08 22:32       ` Malte Schröder
@ 2024-12-09  1:57         ` Jingbo Xu
  2024-12-09  6:42           ` Malte Schröder
  2024-12-09 14:49         ` Josef Bacik
  1 sibling, 1 reply; 31+ messages in thread
From: Jingbo Xu @ 2024-12-09  1:57 UTC (permalink / raw)
  To: Malte Schröder, Matthew Wilcox
  Cc: Kent Overstreet, Miklos Szeredi, Josef Bacik, Joanne Koong,
	linux-fsdevel, Bernd Schubert

Hi, Malte

On 12/9/24 6:32 AM, Malte Schröder wrote:
> On 08/12/2024 21:02, Malte Schröder wrote:
>> On 08/12/2024 02:23, Matthew Wilcox wrote:
>>> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
>>>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
>>>> me.     
>>> That's a merge commit ... does the problem reproduce if you run
>>> d1dfb5f52ffc?  And if it does, can you bisect the problem any further
>>> back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
>>> between those two.
>>>
>>> If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
>>> of an interaction to debug ;-(
>> I spent half a day compiling kernels, but bisect was non-conclusive.
>> There are some steps where the failure mode changes slightly, so this is
>> hard. It ended up at 445d9f05fa149556422f7fdd52dacf487cc8e7be which is
>> the nfsd-6.13 merge ...
>>
>> d1dfb5f52ffc also shows the issue. I will try to narrow down from there.
>>
>> /Malte
>>
> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> with 3b97c3652d91 as the culprit.

Would you mind checking if [1] fixes the issue?  It is a fix for
3b97c3652d91, though the initial report shows 3b97c3652d91 will cause
null-ptr-deref.


[1]
https://lore.kernel.org/all/20241203-fix-fuse_get_user_pages-v2-1-acce8a29d06b@ddn.com/

-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09  1:57         ` Jingbo Xu
@ 2024-12-09  6:42           ` Malte Schröder
  2024-12-09  8:06             ` Bernd Schubert
  0 siblings, 1 reply; 31+ messages in thread
From: Malte Schröder @ 2024-12-09  6:42 UTC (permalink / raw)
  To: Jingbo Xu, Matthew Wilcox
  Cc: Kent Overstreet, Miklos Szeredi, Josef Bacik, Joanne Koong,
	linux-fsdevel, Bernd Schubert

On 09/12/2024 02:57, Jingbo Xu wrote:
> Hi, Malte
>
> On 12/9/24 6:32 AM, Malte Schröder wrote:
>> On 08/12/2024 21:02, Malte Schröder wrote:
>>> On 08/12/2024 02:23, Matthew Wilcox wrote:
>>>> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
>>>>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
>>>>> me.     
>>>> That's a merge commit ... does the problem reproduce if you run
>>>> d1dfb5f52ffc?  And if it does, can you bisect the problem any further
>>>> back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
>>>> between those two.
>>>>
>>>> If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
>>>> of an interaction to debug ;-(
>>> I spent half a day compiling kernels, but bisect was non-conclusive.
>>> There are some steps where the failure mode changes slightly, so this is
>>> hard. It ended up at 445d9f05fa149556422f7fdd52dacf487cc8e7be which is
>>> the nfsd-6.13 merge ...
>>>
>>> d1dfb5f52ffc also shows the issue. I will try to narrow down from there.
>>>
>>> /Malte
>>>
>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
>> with 3b97c3652d91 as the culprit.
> Would you mind checking if [1] fixes the issue?  It is a fix for
> 3b97c3652d91, though the initial report shows 3b97c3652d91 will cause
> null-ptr-deref.
>
>
> [1]
> https://lore.kernel.org/all/20241203-fix-fuse_get_user_pages-v2-1-acce8a29d06b@ddn.com/
It does not fix the issue, still behaves the same.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09  6:42           ` Malte Schröder
@ 2024-12-09  8:06             ` Bernd Schubert
  2024-12-09  9:07               ` Malte Schröder
  0 siblings, 1 reply; 31+ messages in thread
From: Bernd Schubert @ 2024-12-09  8:06 UTC (permalink / raw)
  To: Malte Schröder, Jingbo Xu, Matthew Wilcox
  Cc: Kent Overstreet, Miklos Szeredi, Josef Bacik, Joanne Koong,
	linux-fsdevel, Bernd Schubert

Hi Malte,

On 12/9/24 07:42, Malte Schröder wrote:
> On 09/12/2024 02:57, Jingbo Xu wrote:
>> Hi, Malte
>>
>> On 12/9/24 6:32 AM, Malte Schröder wrote:
>>> On 08/12/2024 21:02, Malte Schröder wrote:
>>>> On 08/12/2024 02:23, Matthew Wilcox wrote:
>>>>> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
>>>>>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
>>>>>> me.     
>>>>> That's a merge commit ... does the problem reproduce if you run
>>>>> d1dfb5f52ffc?  And if it does, can you bisect the problem any further
>>>>> back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
>>>>> between those two.
>>>>>
>>>>> If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
>>>>> of an interaction to debug ;-(
>>>> I spent half a day compiling kernels, but bisect was non-conclusive.
>>>> There are some steps where the failure mode changes slightly, so this is
>>>> hard. It ended up at 445d9f05fa149556422f7fdd52dacf487cc8e7be which is
>>>> the nfsd-6.13 merge ...
>>>>
>>>> d1dfb5f52ffc also shows the issue. I will try to narrow down from there.
>>>>
>>>> /Malte
>>>>
>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
>>> with 3b97c3652d91 as the culprit.
>> Would you mind checking if [1] fixes the issue?  It is a fix for
>> 3b97c3652d91, though the initial report shows 3b97c3652d91 will cause
>> null-ptr-deref.
>>
>>
>> [1]
>> https://lore.kernel.org/all/20241203-fix-fuse_get_user_pages-v2-1-acce8a29d06b@ddn.com/
> It does not fix the issue, still behaves the same.
> 

could you give instructions how to get the issue? Maybe we can script it and I let 
it run in a loop on one my systems?


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09  8:06             ` Bernd Schubert
@ 2024-12-09  9:07               ` Malte Schröder
  2024-12-09 13:06                 ` Bernd Schubert
  0 siblings, 1 reply; 31+ messages in thread
From: Malte Schröder @ 2024-12-09  9:07 UTC (permalink / raw)
  To: Bernd Schubert, Jingbo Xu, Matthew Wilcox
  Cc: Kent Overstreet, Miklos Szeredi, Josef Bacik, Joanne Koong,
	linux-fsdevel, Bernd Schubert

On 09/12/2024 09:06, Bernd Schubert wrote:
> Hi Malte,
>
> On 12/9/24 07:42, Malte Schröder wrote:
>> On 09/12/2024 02:57, Jingbo Xu wrote:
>>> Hi, Malte
>>>
>>> On 12/9/24 6:32 AM, Malte Schröder wrote:
>>>> On 08/12/2024 21:02, Malte Schröder wrote:
>>>>> On 08/12/2024 02:23, Matthew Wilcox wrote:
>>>>>> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
>>>>>>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
>>>>>>> me.     
>>>>>> That's a merge commit ... does the problem reproduce if you run
>>>>>> d1dfb5f52ffc?  And if it does, can you bisect the problem any further
>>>>>> back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
>>>>>> between those two.
>>>>>>
>>>>>> If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
>>>>>> of an interaction to debug ;-(
>>>>> I spent half a day compiling kernels, but bisect was non-conclusive.
>>>>> There are some steps where the failure mode changes slightly, so this is
>>>>> hard. It ended up at 445d9f05fa149556422f7fdd52dacf487cc8e7be which is
>>>>> the nfsd-6.13 merge ...
>>>>>
>>>>> d1dfb5f52ffc also shows the issue. I will try to narrow down from there.
>>>>>
>>>>> /Malte
>>>>>
>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
>>>> with 3b97c3652d91 as the culprit.
>>> Would you mind checking if [1] fixes the issue?  It is a fix for
>>> 3b97c3652d91, though the initial report shows 3b97c3652d91 will cause
>>> null-ptr-deref.
>>>
>>>
>>> [1]
>>> https://lore.kernel.org/all/20241203-fix-fuse_get_user_pages-v2-1-acce8a29d06b@ddn.com/
>> It does not fix the issue, still behaves the same.
>>
> could you give instructions how to get the issue? Maybe we can script it and I let 
> it run in a loop on one my systems?
>
>
> Thanks,
> Bernd

Sure. To reproduce I set up a VM running Arch and bcachefs as rootfs
(Works out of the box on current Arch). Build -rc kernel using
pacman-pkg build target. Try to install FreeCAD, "flatpak install
flathub org.freecad.FreeCAD". Usually it fails to download some
dependencies. It's a pretty wonky test, but I didn't find a more
specific way to reproduce this.

/Malte


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09  9:07               ` Malte Schröder
@ 2024-12-09 13:06                 ` Bernd Schubert
  2024-12-09 13:45                   ` Malte Schröder
  0 siblings, 1 reply; 31+ messages in thread
From: Bernd Schubert @ 2024-12-09 13:06 UTC (permalink / raw)
  To: Malte Schröder, Bernd Schubert, Jingbo Xu, Matthew Wilcox
  Cc: Kent Overstreet, Miklos Szeredi, Josef Bacik, Joanne Koong,
	linux-fsdevel@vger.kernel.org

On 12/9/24 10:07, Malte Schröder wrote:
> [You don't often get email from malte.schroeder@tnxip.de. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> On 09/12/2024 09:06, Bernd Schubert wrote:
>> Hi Malte,
>>
>> On 12/9/24 07:42, Malte Schröder wrote:
>>> On 09/12/2024 02:57, Jingbo Xu wrote:
>>>> Hi, Malte
>>>>
>>>> On 12/9/24 6:32 AM, Malte Schröder wrote:
>>>>> On 08/12/2024 21:02, Malte Schröder wrote:
>>>>>> On 08/12/2024 02:23, Matthew Wilcox wrote:
>>>>>>> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
>>>>>>>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
>>>>>>>> me.
>>>>>>> That's a merge commit ... does the problem reproduce if you run
>>>>>>> d1dfb5f52ffc?  And if it does, can you bisect the problem any further
>>>>>>> back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
>>>>>>> between those two.
>>>>>>>
>>>>>>> If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
>>>>>>> of an interaction to debug ;-(
>>>>>> I spent half a day compiling kernels, but bisect was non-conclusive.
>>>>>> There are some steps where the failure mode changes slightly, so this is
>>>>>> hard. It ended up at 445d9f05fa149556422f7fdd52dacf487cc8e7be which is
>>>>>> the nfsd-6.13 merge ...
>>>>>>
>>>>>> d1dfb5f52ffc also shows the issue. I will try to narrow down from there.
>>>>>>
>>>>>> /Malte
>>>>>>
>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
>>>>> with 3b97c3652d91 as the culprit.
>>>> Would you mind checking if [1] fixes the issue?  It is a fix for
>>>> 3b97c3652d91, though the initial report shows 3b97c3652d91 will cause
>>>> null-ptr-deref.
>>>>
>>>>
>>>> [1]
>>>> https://lore.kernel.org/all/20241203-fix-fuse_get_user_pages-v2-1-acce8a29d06b@ddn.com/
>>> It does not fix the issue, still behaves the same.
>>>
>> could you give instructions how to get the issue? Maybe we can script it and I let
>> it run in a loop on one my systems?
>>
>>
>> Thanks,
>> Bernd
> 
> Sure. To reproduce I set up a VM running Arch and bcachefs as rootfs
> (Works out of the box on current Arch). Build -rc kernel using
> pacman-pkg build target. Try to install FreeCAD, "flatpak install
> flathub org.freecad.FreeCAD". Usually it fails to download some
> dependencies. It's a pretty wonky test, but I didn't find a more
> specific way to reproduce this.
> 

What is the relation to fuse here? pacmang-pkg or 'flatpak' are using
fuse internally?


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09 13:06                 ` Bernd Schubert
@ 2024-12-09 13:45                   ` Malte Schröder
  0 siblings, 0 replies; 31+ messages in thread
From: Malte Schröder @ 2024-12-09 13:45 UTC (permalink / raw)
  To: Bernd Schubert, Bernd Schubert, Jingbo Xu, Matthew Wilcox
  Cc: Kent Overstreet, Miklos Szeredi, Josef Bacik, Joanne Koong,
	linux-fsdevel@vger.kernel.org

On 09/12/2024 14:06, Bernd Schubert wrote:
> On 12/9/24 10:07, Malte Schröder wrote:
>> [You don't often get email from malte.schroeder@tnxip.de. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>
>> On 09/12/2024 09:06, Bernd Schubert wrote:
>>> Hi Malte,
>>>
>>> On 12/9/24 07:42, Malte Schröder wrote:
>>>> On 09/12/2024 02:57, Jingbo Xu wrote:
>>>>> Hi, Malte
>>>>>
>>>>> On 12/9/24 6:32 AM, Malte Schröder wrote:
>>>>>> On 08/12/2024 21:02, Malte Schröder wrote:
>>>>>>> On 08/12/2024 02:23, Matthew Wilcox wrote:
>>>>>>>> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
>>>>>>>>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
>>>>>>>>> me.
>>>>>>>> That's a merge commit ... does the problem reproduce if you run
>>>>>>>> d1dfb5f52ffc?  And if it does, can you bisect the problem any further
>>>>>>>> back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
>>>>>>>> between those two.
>>>>>>>>
>>>>>>>> If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
>>>>>>>> of an interaction to debug ;-(
>>>>>>> I spent half a day compiling kernels, but bisect was non-conclusive.
>>>>>>> There are some steps where the failure mode changes slightly, so this is
>>>>>>> hard. It ended up at 445d9f05fa149556422f7fdd52dacf487cc8e7be which is
>>>>>>> the nfsd-6.13 merge ...
>>>>>>>
>>>>>>> d1dfb5f52ffc also shows the issue. I will try to narrow down from there.
>>>>>>>
>>>>>>> /Malte
>>>>>>>
>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
>>>>>> with 3b97c3652d91 as the culprit.
>>>>> Would you mind checking if [1] fixes the issue?  It is a fix for
>>>>> 3b97c3652d91, though the initial report shows 3b97c3652d91 will cause
>>>>> null-ptr-deref.
>>>>>
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/all/20241203-fix-fuse_get_user_pages-v2-1-acce8a29d06b@ddn.com/
>>>> It does not fix the issue, still behaves the same.
>>>>
>>> could you give instructions how to get the issue? Maybe we can script it and I let
>>> it run in a loop on one my systems?
>>>
>>>
>>> Thanks,
>>> Bernd
>> Sure. To reproduce I set up a VM running Arch and bcachefs as rootfs
>> (Works out of the box on current Arch). Build -rc kernel using
>> pacman-pkg build target. Try to install FreeCAD, "flatpak install
>> flathub org.freecad.FreeCAD". Usually it fails to download some
>> dependencies. It's a pretty wonky test, but I didn't find a more
>> specific way to reproduce this.
>>
> What is the relation to fuse here? pacmang-pkg or 'flatpak' are using
> fuse internally?
>
>
> Thanks,
> Bernd

Flatpak is using fuse when downloading/installing packages

/Malte


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-08 22:32       ` Malte Schröder
  2024-12-09  1:57         ` Jingbo Xu
@ 2024-12-09 14:49         ` Josef Bacik
  2024-12-09 15:28           ` Matthew Wilcox
  1 sibling, 1 reply; 31+ messages in thread
From: Josef Bacik @ 2024-12-09 14:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Malte Schröder, Kent Overstreet, Miklos Szeredi,
	Joanne Koong, linux-fsdevel

On Sun, Dec 08, 2024 at 11:32:16PM +0100, Malte Schröder wrote:
> On 08/12/2024 21:02, Malte Schröder wrote:
> > On 08/12/2024 02:23, Matthew Wilcox wrote:
> >> On Sun, Dec 08, 2024 at 12:01:11AM +0100, Malte Schröder wrote:
> >>> Reverting fb527fc1f36e252cd1f62a26be4906949e7708ff fixes the issue for
> >>> me.     
> >> That's a merge commit ... does the problem reproduce if you run
> >> d1dfb5f52ffc?  And if it does, can you bisect the problem any further
> >> back?  I'd recommend also testing v6.12-rc1; if that's good, bisect
> >> between those two.
> >>
> >> If the problem doesn't show up with d1dfb5f52ffc? then we have a dilly
> >> of an interaction to debug ;-(
> > I spent half a day compiling kernels, but bisect was non-conclusive.
> > There are some steps where the failure mode changes slightly, so this is
> > hard. It ended up at 445d9f05fa149556422f7fdd52dacf487cc8e7be which is
> > the nfsd-6.13 merge ...
> >
> > d1dfb5f52ffc also shows the issue. I will try to narrow down from there.
> >
> > /Malte
> >
> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> with 3b97c3652d91 as the culprit.
> 

Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
got a hugepage in, we'd get each individual struct page back for the whole range
of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
->offset for each "middle" struct page as 0, since obviously we're consuming
PAGE_SIZE chunks at a time.

But now we're doing this

	for (i = 0; i < nfolios; i++)
		ap->folios[i + ap->num_folios] = page_folio(pages[i]);

So if userspace handed us a 2M hugepage, page_folio() on each of the
intermediary struct page's would return the same folio, correct?  So we'd end up
with the wrong offsets for our fuse request, because they should be based from
the start of the folio, correct?

I'm coming off of vacation so my brain isn't fully engaged yet, feel free to
point and laugh at me if I'm wrong.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09 14:49         ` Josef Bacik
@ 2024-12-09 15:28           ` Matthew Wilcox
  2024-12-09 15:48             ` Josef Bacik
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2024-12-09 15:28 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Malte Schröder, Kent Overstreet, Miklos Szeredi,
	Joanne Koong, linux-fsdevel

On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> > Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> > with 3b97c3652d91 as the culprit.
> 
> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> got a hugepage in, we'd get each individual struct page back for the whole range
> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> ->offset for each "middle" struct page as 0, since obviously we're consuming
> PAGE_SIZE chunks at a time.
> 
> But now we're doing this
> 
> 	for (i = 0; i < nfolios; i++)
> 		ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> 
> So if userspace handed us a 2M hugepage, page_folio() on each of the
> intermediary struct page's would return the same folio, correct?  So we'd end up
> with the wrong offsets for our fuse request, because they should be based from
> the start of the folio, correct?

I think you're 100% right.  We could put in some nice asserts to check
this is what's happening, but it does seem like a rather incautious
conversion.  Yes, all folios _in the page cache_ for fuse are small, but
that's not guaranteed to be the case for folios found in userspace for
directio.  At least the comment is wrong, and I'd suggest the code is too.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09 15:28           ` Matthew Wilcox
@ 2024-12-09 15:48             ` Josef Bacik
  2024-12-09 17:07               ` Malte Schröder
  0 siblings, 1 reply; 31+ messages in thread
From: Josef Bacik @ 2024-12-09 15:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Malte Schröder, Kent Overstreet, Miklos Szeredi,
	Joanne Koong, linux-fsdevel

On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> > > Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> > > with 3b97c3652d91 as the culprit.
> > 
> > Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> > EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> > got a hugepage in, we'd get each individual struct page back for the whole range
> > of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> > ->offset for each "middle" struct page as 0, since obviously we're consuming
> > PAGE_SIZE chunks at a time.
> > 
> > But now we're doing this
> > 
> > 	for (i = 0; i < nfolios; i++)
> > 		ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > 
> > So if userspace handed us a 2M hugepage, page_folio() on each of the
> > intermediary struct page's would return the same folio, correct?  So we'd end up
> > with the wrong offsets for our fuse request, because they should be based from
> > the start of the folio, correct?
> 
> I think you're 100% right.  We could put in some nice asserts to check
> this is what's happening, but it does seem like a rather incautious
> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> that's not guaranteed to be the case for folios found in userspace for
> directio.  At least the comment is wrong, and I'd suggest the code is too.

Ok cool, Malte can you try the attached only compile tested patch and see if the
problem goes away?  Thanks,

Josef

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 88d0946b5bc9..c4b93ead99a5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 		nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
 
 		ap->descs[ap->num_folios].offset = start;
-		fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
-		for (i = 0; i < nfolios; i++)
-			ap->folios[i + ap->num_folios] = page_folio(pages[i]);
+		for (i = 0; i < nfolios; i++) {
+			struct folio *folio = page_folio(pages[i]);
+			unsigned int offset = start +
+				(folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
+			unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
+
+			len = min_t(unsigned int, len, PAGE_SIZE);
+
+			ap->descs[ap->num_folios + i].offset = offset;
+			ap->descs[ap->num_folios + i].length = len;
+			ap->folios[i + ap->num_folios] = folio;
+			start = 0;
+		}
 
 		ap->num_folios += nfolios;
 		ap->descs[ap->num_folios - 1].length -=

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09 15:48             ` Josef Bacik
@ 2024-12-09 17:07               ` Malte Schröder
  2024-12-09 18:47                 ` Joanne Koong
  0 siblings, 1 reply; 31+ messages in thread
From: Malte Schröder @ 2024-12-09 17:07 UTC (permalink / raw)
  To: Josef Bacik, Matthew Wilcox
  Cc: Kent Overstreet, Miklos Szeredi, Joanne Koong, linux-fsdevel

On 09/12/2024 16:48, Josef Bacik wrote:
> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
>>>> with 3b97c3652d91 as the culprit.
>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
>>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
>>> got a hugepage in, we'd get each individual struct page back for the whole range
>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
>>> ->offset for each "middle" struct page as 0, since obviously we're consuming
>>> PAGE_SIZE chunks at a time.
>>>
>>> But now we're doing this
>>>
>>> 	for (i = 0; i < nfolios; i++)
>>> 		ap->folios[i + ap->num_folios] = page_folio(pages[i]);
>>>
>>> So if userspace handed us a 2M hugepage, page_folio() on each of the
>>> intermediary struct page's would return the same folio, correct?  So we'd end up
>>> with the wrong offsets for our fuse request, because they should be based from
>>> the start of the folio, correct?
>> I think you're 100% right.  We could put in some nice asserts to check
>> this is what's happening, but it does seem like a rather incautious
>> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
>> that's not guaranteed to be the case for folios found in userspace for
>> directio.  At least the comment is wrong, and I'd suggest the code is too.
> Ok cool, Malte can you try the attached only compile tested patch and see if the
> problem goes away?  Thanks,
>
> Josef
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 88d0946b5bc9..c4b93ead99a5 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>  		nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
>  
>  		ap->descs[ap->num_folios].offset = start;
> -		fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> -		for (i = 0; i < nfolios; i++)
> -			ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> +		for (i = 0; i < nfolios; i++) {
> +			struct folio *folio = page_folio(pages[i]);
> +			unsigned int offset = start +
> +				(folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> +			unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> +
> +			len = min_t(unsigned int, len, PAGE_SIZE);
> +
> +			ap->descs[ap->num_folios + i].offset = offset;
> +			ap->descs[ap->num_folios + i].length = len;
> +			ap->folios[i + ap->num_folios] = folio;
> +			start = 0;
> +		}
>  
>  		ap->num_folios += nfolios;
>  		ap->descs[ap->num_folios - 1].length -=

The problem persists with this patch.


/Malte


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09 17:07               ` Malte Schröder
@ 2024-12-09 18:47                 ` Joanne Koong
  2024-12-09 19:52                   ` Joanne Koong
  0 siblings, 1 reply; 31+ messages in thread
From: Joanne Koong @ 2024-12-09 18:47 UTC (permalink / raw)
  To: Malte Schröder
  Cc: Josef Bacik, Matthew Wilcox, Kent Overstreet, Miklos Szeredi,
	linux-fsdevel

On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
>
> On 09/12/2024 16:48, Josef Bacik wrote:
> > On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> >> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> >>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> >>>> with 3b97c3652d91 as the culprit.
> >>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> >>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> >>> got a hugepage in, we'd get each individual struct page back for the whole range
> >>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> >>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> >>> PAGE_SIZE chunks at a time.
> >>>
> >>> But now we're doing this
> >>>
> >>>     for (i = 0; i < nfolios; i++)
> >>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> >>>
> >>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> >>> intermediary struct page's would return the same folio, correct?  So we'd end up
> >>> with the wrong offsets for our fuse request, because they should be based from
> >>> the start of the folio, correct?
> >> I think you're 100% right.  We could put in some nice asserts to check
> >> this is what's happening, but it does seem like a rather incautious
> >> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> >> that's not guaranteed to be the case for folios found in userspace for
> >> directio.  At least the comment is wrong, and I'd suggest the code is too.
> > Ok cool, Malte can you try the attached only compile tested patch and see if the
> > problem goes away?  Thanks,
> >
> > Josef
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 88d0946b5bc9..c4b93ead99a5 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> >               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> >
> >               ap->descs[ap->num_folios].offset = start;
> > -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> > -             for (i = 0; i < nfolios; i++)
> > -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > +             for (i = 0; i < nfolios; i++) {
> > +                     struct folio *folio = page_folio(pages[i]);
> > +                     unsigned int offset = start +
> > +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> > +
> > +                     len = min_t(unsigned int, len, PAGE_SIZE);
> > +
> > +                     ap->descs[ap->num_folios + i].offset = offset;
> > +                     ap->descs[ap->num_folios + i].length = len;
> > +                     ap->folios[i + ap->num_folios] = folio;
> > +                     start = 0;
> > +             }
> >
> >               ap->num_folios += nfolios;
> >               ap->descs[ap->num_folios - 1].length -=
>
> The problem persists with this patch.
>

Catching up on this thread now. I'll investigate this today.

>
> /Malte
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09 18:47                 ` Joanne Koong
@ 2024-12-09 19:52                   ` Joanne Koong
  2024-12-10  5:14                     ` Joanne Koong
  0 siblings, 1 reply; 31+ messages in thread
From: Joanne Koong @ 2024-12-09 19:52 UTC (permalink / raw)
  To: Malte Schröder
  Cc: Josef Bacik, Matthew Wilcox, Kent Overstreet, Miklos Szeredi,
	linux-fsdevel

On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> >
> > On 09/12/2024 16:48, Josef Bacik wrote:
> > > On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> > >> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> > >>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> > >>>> with 3b97c3652d91 as the culprit.
> > >>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> > >>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> > >>> got a hugepage in, we'd get each individual struct page back for the whole range
> > >>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> > >>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> > >>> PAGE_SIZE chunks at a time.
> > >>>
> > >>> But now we're doing this
> > >>>
> > >>>     for (i = 0; i < nfolios; i++)
> > >>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > >>>
> > >>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> > >>> intermediary struct page's would return the same folio, correct?  So we'd end up
> > >>> with the wrong offsets for our fuse request, because they should be based from
> > >>> the start of the folio, correct?
> > >> I think you're 100% right.  We could put in some nice asserts to check
> > >> this is what's happening, but it does seem like a rather incautious
> > >> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> > >> that's not guaranteed to be the case for folios found in userspace for
> > >> directio.  At least the comment is wrong, and I'd suggest the code is too.
> > > Ok cool, Malte can you try the attached only compile tested patch and see if the
> > > problem goes away?  Thanks,
> > >
> > > Josef
> > >
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index 88d0946b5bc9..c4b93ead99a5 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> > >               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> > >
> > >               ap->descs[ap->num_folios].offset = start;
> > > -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> > > -             for (i = 0; i < nfolios; i++)
> > > -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > +             for (i = 0; i < nfolios; i++) {
> > > +                     struct folio *folio = page_folio(pages[i]);
> > > +                     unsigned int offset = start +
> > > +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > > +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> > > +
> > > +                     len = min_t(unsigned int, len, PAGE_SIZE);
> > > +
> > > +                     ap->descs[ap->num_folios + i].offset = offset;
> > > +                     ap->descs[ap->num_folios + i].length = len;
> > > +                     ap->folios[i + ap->num_folios] = folio;
> > > +                     start = 0;
> > > +             }
> > >
> > >               ap->num_folios += nfolios;
> > >               ap->descs[ap->num_folios - 1].length -=
> >
> > The problem persists with this patch.
> >

Malte, could you try Josef's patch except with that last line
"ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
(PAGE_SIZE - 1);" also removed? I think we need that line removed as
well since that does a "-=" instead of a "=" and
ap->descs[ap->num_folios - 1].length gets set inside the for loop.

In the meantime, I'll try to get a local repro running on fsx so that
you don't have to keep testing out repos for us.

Thanks,
Joanne
>
> Catching up on this thread now. I'll investigate this today.
>
> >
> > /Malte
> >

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-09 19:52                   ` Joanne Koong
@ 2024-12-10  5:14                     ` Joanne Koong
  2024-12-10 16:54                       ` Kent Overstreet
  2024-12-10 17:53                       ` Malte Schröder
  0 siblings, 2 replies; 31+ messages in thread
From: Joanne Koong @ 2024-12-10  5:14 UTC (permalink / raw)
  To: Malte Schröder
  Cc: Josef Bacik, Matthew Wilcox, Kent Overstreet, Miklos Szeredi,
	linux-fsdevel

On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> > >
> > > On 09/12/2024 16:48, Josef Bacik wrote:
> > > > On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> > > >> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> > > >>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> > > >>>> with 3b97c3652d91 as the culprit.
> > > >>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> > > >>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> > > >>> got a hugepage in, we'd get each individual struct page back for the whole range
> > > >>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> > > >>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> > > >>> PAGE_SIZE chunks at a time.
> > > >>>
> > > >>> But now we're doing this
> > > >>>
> > > >>>     for (i = 0; i < nfolios; i++)
> > > >>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > >>>
> > > >>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> > > >>> intermediary struct page's would return the same folio, correct?  So we'd end up
> > > >>> with the wrong offsets for our fuse request, because they should be based from
> > > >>> the start of the folio, correct?
> > > >> I think you're 100% right.  We could put in some nice asserts to check
> > > >> this is what's happening, but it does seem like a rather incautious
> > > >> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> > > >> that's not guaranteed to be the case for folios found in userspace for
> > > >> directio.  At least the comment is wrong, and I'd suggest the code is too.
> > > > Ok cool, Malte can you try the attached only compile tested patch and see if the
> > > > problem goes away?  Thanks,
> > > >
> > > > Josef
> > > >
> > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > > index 88d0946b5bc9..c4b93ead99a5 100644
> > > > --- a/fs/fuse/file.c
> > > > +++ b/fs/fuse/file.c
> > > > @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> > > >               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> > > >
> > > >               ap->descs[ap->num_folios].offset = start;
> > > > -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> > > > -             for (i = 0; i < nfolios; i++)
> > > > -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > > +             for (i = 0; i < nfolios; i++) {
> > > > +                     struct folio *folio = page_folio(pages[i]);
> > > > +                     unsigned int offset = start +
> > > > +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > > > +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> > > > +
> > > > +                     len = min_t(unsigned int, len, PAGE_SIZE);
> > > > +
> > > > +                     ap->descs[ap->num_folios + i].offset = offset;
> > > > +                     ap->descs[ap->num_folios + i].length = len;
> > > > +                     ap->folios[i + ap->num_folios] = folio;
> > > > +                     start = 0;
> > > > +             }
> > > >
> > > >               ap->num_folios += nfolios;
> > > >               ap->descs[ap->num_folios - 1].length -=
> > >
> > > The problem persists with this patch.
> > >
>
> Malte, could you try Josef's patch except with that last line
> "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
> (PAGE_SIZE - 1);" also removed? I think we need that line removed as
> well since that does a "-=" instead of a "=" and
> ap->descs[ap->num_folios - 1].length gets set inside the for loop.
>
> In the meantime, I'll try to get a local repro running on fsx so that
> you don't have to keep testing out repos for us.

I was able to repro this locally by doing:

-- start libfuse server --
sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount

-- patch + compile this (rough / ugly-for-now) code snippet --
diff --git a/ltp/fsx.c b/ltp/fsx.c
index 777ba0de..9f040bc4 100644
--- a/ltp/fsx.c
+++ b/ltp/fsx.c
@@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
        }
 }

-
+#define TWO_MIB (1 << 21)  // 2 MiB in bytes

 void
 domapwrite(unsigned offset, unsigned size)
 {
@@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
        unsigned map_size;
        off_t    cur_filesize;
        char    *p;
+       int ret;
+       unsigned size_2mib_aligned;

        offset -= offset % writebdy;
        if (size == 0) {
@@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
        pg_offset = offset & PAGE_MASK;
        map_size  = pg_offset + size;

+       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
+       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
+                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+       if (!placeholder_map) {
+               prterr("domapwrite: placeholder map");
+               exit(202);
+       }
+
+       /* align address to nearest 2 MiB */
+       void *aligned_address =
+               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
~(TWO_MIB - 1));
+
+       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
| PROT_WRITE,
+                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
MAP_POPULATE, -1, 0);
+
+       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
+       if (ret) {
+               prterr("domapwrite: madvise collapse");
+               exit(203);
+       }
+
+       memcpy(map, good_buf + offset, size);
+
+       if (lseek(fd, offset, SEEK_SET) == -1) {
+               prterr("domapwrite: lseek");
+               exit(204);
+       }
+
+       ret = write(fd, map, size);
+       if (ret == -1) {
+               prterr("domapwrite: write");
+               exit(205);
+       }
+
+       /*
        if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
                              MAP_FILE | MAP_SHARED, fd,
                              (off_t)(offset - pg_offset))) == (char *)-1) {
@@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
                prterr("domapwrite: munmap");
                report_failure(204);
        }
+       */
+       if (munmap(map, size_2mib_aligned) != 0) {
+               prterr("domapwrite: munmap map");
+               report_failure(206);
+       }
+       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
+               prterr("domapwrite: munmap placeholder_map");
+               report_failure(207);
+       }
 }

-- run fsx test --
sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000

On the offending commit 3b97c3652, I'm seeing:
[user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
Will begin at operation 3
Seed set to 1
...
READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
/home/user/fuse_mount/example.txt
OFFSET      GOOD    BAD     RANGE
0x1e43f     0x4b4a  0x114a  0x0
operation# (mod 256) for the bad data may be 74
0x1e441     0xa64a  0xeb4a  0x1
operation# (mod 256) for the bad data may be 74
0x1e443     0x264a  0xe44a  0x2
operation# (mod 256) for the bad data may be 74
0x1e445     0x254a  0x9e4a  0x3
...
Correct content saved for comparison
(maybe hexdump "/home/user/fuse_mount/example.txt" vs
"/home/user/fuse_mount/example.txt.fsxgood")


I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
-= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
issue:

[user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
Will begin at operation 3
Seed set to 1
...
copying to largest ever: 0x3e19b
copying to largest ever: 0x3e343
fallocating to largest ever: 0x40000
All 5000 operations completed A-OK!


Malte, would you mind double-checking whether this fixes the issue
you're seeing on your end?


Thanks,
Joanne

>
> Thanks,
> Joanne
> >
> > Catching up on this thread now. I'll investigate this today.
> >
> > >
> > > /Malte
> > >

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-10  5:14                     ` Joanne Koong
@ 2024-12-10 16:54                       ` Kent Overstreet
  2024-12-10 17:08                         ` Matthew Wilcox
  2024-12-10 17:31                         ` Josef Bacik
  2024-12-10 17:53                       ` Malte Schröder
  1 sibling, 2 replies; 31+ messages in thread
From: Kent Overstreet @ 2024-12-10 16:54 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Malte Schröder, Josef Bacik, Matthew Wilcox, Miklos Szeredi,
	linux-fsdevel

On Mon, Dec 09, 2024 at 09:14:33PM -0800, Joanne Koong wrote:
> On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> > > >
> > > > On 09/12/2024 16:48, Josef Bacik wrote:
> > > > > On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> > > > >> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> > > > >>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> > > > >>>> with 3b97c3652d91 as the culprit.
> > > > >>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> > > > >>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> > > > >>> got a hugepage in, we'd get each individual struct page back for the whole range
> > > > >>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> > > > >>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> > > > >>> PAGE_SIZE chunks at a time.
> > > > >>>
> > > > >>> But now we're doing this
> > > > >>>
> > > > >>>     for (i = 0; i < nfolios; i++)
> > > > >>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > > >>>
> > > > >>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> > > > >>> intermediary struct page's would return the same folio, correct?  So we'd end up
> > > > >>> with the wrong offsets for our fuse request, because they should be based from
> > > > >>> the start of the folio, correct?
> > > > >> I think you're 100% right.  We could put in some nice asserts to check
> > > > >> this is what's happening, but it does seem like a rather incautious
> > > > >> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> > > > >> that's not guaranteed to be the case for folios found in userspace for
> > > > >> directio.  At least the comment is wrong, and I'd suggest the code is too.
> > > > > Ok cool, Malte can you try the attached only compile tested patch and see if the
> > > > > problem goes away?  Thanks,
> > > > >
> > > > > Josef
> > > > >
> > > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > > > index 88d0946b5bc9..c4b93ead99a5 100644
> > > > > --- a/fs/fuse/file.c
> > > > > +++ b/fs/fuse/file.c
> > > > > @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> > > > >               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> > > > >
> > > > >               ap->descs[ap->num_folios].offset = start;
> > > > > -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> > > > > -             for (i = 0; i < nfolios; i++)
> > > > > -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > > > +             for (i = 0; i < nfolios; i++) {
> > > > > +                     struct folio *folio = page_folio(pages[i]);
> > > > > +                     unsigned int offset = start +
> > > > > +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > > > > +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> > > > > +
> > > > > +                     len = min_t(unsigned int, len, PAGE_SIZE);
> > > > > +
> > > > > +                     ap->descs[ap->num_folios + i].offset = offset;
> > > > > +                     ap->descs[ap->num_folios + i].length = len;
> > > > > +                     ap->folios[i + ap->num_folios] = folio;
> > > > > +                     start = 0;
> > > > > +             }
> > > > >
> > > > >               ap->num_folios += nfolios;
> > > > >               ap->descs[ap->num_folios - 1].length -=
> > > >
> > > > The problem persists with this patch.
> > > >
> >
> > Malte, could you try Josef's patch except with that last line
> > "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
> > (PAGE_SIZE - 1);" also removed? I think we need that line removed as
> > well since that does a "-=" instead of a "=" and
> > ap->descs[ap->num_folios - 1].length gets set inside the for loop.
> >
> > In the meantime, I'll try to get a local repro running on fsx so that
> > you don't have to keep testing out repos for us.
> 
> I was able to repro this locally by doing:
> 
> -- start libfuse server --
> sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
> 
> -- patch + compile this (rough / ugly-for-now) code snippet --
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index 777ba0de..9f040bc4 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
>         }
>  }
> 
> -
> +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
> 
>  void
>  domapwrite(unsigned offset, unsigned size)
>  {
> @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
>         unsigned map_size;
>         off_t    cur_filesize;
>         char    *p;
> +       int ret;
> +       unsigned size_2mib_aligned;
> 
>         offset -= offset % writebdy;
>         if (size == 0) {
> @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
>         pg_offset = offset & PAGE_MASK;
>         map_size  = pg_offset + size;
> 
> +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
> +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
> +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +       if (!placeholder_map) {
> +               prterr("domapwrite: placeholder map");
> +               exit(202);
> +       }
> +
> +       /* align address to nearest 2 MiB */
> +       void *aligned_address =
> +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
> ~(TWO_MIB - 1));
> +
> +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
> | PROT_WRITE,
> +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
> MAP_POPULATE, -1, 0);
> +
> +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
> +       if (ret) {
> +               prterr("domapwrite: madvise collapse");
> +               exit(203);
> +       }
> +
> +       memcpy(map, good_buf + offset, size);
> +
> +       if (lseek(fd, offset, SEEK_SET) == -1) {
> +               prterr("domapwrite: lseek");
> +               exit(204);
> +       }
> +
> +       ret = write(fd, map, size);
> +       if (ret == -1) {
> +               prterr("domapwrite: write");
> +               exit(205);
> +       }
> +
> +       /*
>         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
>                               MAP_FILE | MAP_SHARED, fd,
>                               (off_t)(offset - pg_offset))) == (char *)-1) {
> @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
>                 prterr("domapwrite: munmap");
>                 report_failure(204);
>         }
> +       */
> +       if (munmap(map, size_2mib_aligned) != 0) {
> +               prterr("domapwrite: munmap map");
> +               report_failure(206);
> +       }
> +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
> +               prterr("domapwrite: munmap placeholder_map");
> +               report_failure(207);
> +       }
>  }
> 
> -- run fsx test --
> sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> 
> On the offending commit 3b97c3652, I'm seeing:
> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> Will begin at operation 3
> Seed set to 1
> ...
> READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
> /home/user/fuse_mount/example.txt
> OFFSET      GOOD    BAD     RANGE
> 0x1e43f     0x4b4a  0x114a  0x0
> operation# (mod 256) for the bad data may be 74
> 0x1e441     0xa64a  0xeb4a  0x1
> operation# (mod 256) for the bad data may be 74
> 0x1e443     0x264a  0xe44a  0x2
> operation# (mod 256) for the bad data may be 74
> 0x1e445     0x254a  0x9e4a  0x3
> ...
> Correct content saved for comparison
> (maybe hexdump "/home/user/fuse_mount/example.txt" vs
> "/home/user/fuse_mount/example.txt.fsxgood")
> 
> 
> I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
> -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
> issue:
> 
> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> Will begin at operation 3
> Seed set to 1
> ...
> copying to largest ever: 0x3e19b
> copying to largest ever: 0x3e343
> fallocating to largest ever: 0x40000
> All 5000 operations completed A-OK!
> 
> 
> Malte, would you mind double-checking whether this fixes the issue
> you're seeing on your end?

I get the impression you might be flailing a bit, it seems you're not
exactly sure what's going on, and either Willy or Josef previously
alluded to a lack of assertions - so I'm going to echo that.

I've noticed a lot of (more junior?) kernel engineers are hesitant to
use assertions (because of e.g. checkpatch and "don't crash the
kernel!"), but assertions are one of the best tools we've got until we
get to languages where we can do embedded correctness proofs.

_Especially_ when you're doing tricky data structure work, like all the
folios stuff. As a lot of us have learned from painful experience, bugs
in low level data structures can easily translate into silent data
corruption bugs, so this is a situation where I'd even likely ignore the
"don't crash the kernel" guidance and prefer BUG_ON() to WARN_ON() -
crashing is better than silent data corruption, and it makes any bugs
noisier so they shake out quicker.

Haven't looked at the relevant patches yet, but if you'd like me to look
and offer suggestions I'd be happy to.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-10 16:54                       ` Kent Overstreet
@ 2024-12-10 17:08                         ` Matthew Wilcox
  2024-12-10 17:43                           ` Kent Overstreet
  2024-12-10 17:31                         ` Josef Bacik
  1 sibling, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2024-12-10 17:08 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Joanne Koong, Malte Schröder, Josef Bacik, Miklos Szeredi,
	linux-fsdevel

On Tue, Dec 10, 2024 at 11:54:18AM -0500, Kent Overstreet wrote:
> I get the impression you might be flailing a bit, it seems you're not
> exactly sure what's going on, and either Willy or Josef previously
> alluded to a lack of assertions - so I'm going to echo that.

This comes across very patronising.  You don't know Joanne and I
suggest you refrain from making assumptions.

> Haven't looked at the relevant patches yet, but if you'd like me to look
> and offer suggestions I'd be happy to.

You should probably read more carefully and think a little harder about
what you're writing.  I made some suggestions about what could be done
to narrow down the bug to a specific path; Josef looked at the patch and
found something that looked wrong.  I confirmed it looked wrong.  Josef
sent a patch; Joanne fixed a problem with that patch.

This does not entitle you to say that Joanne is flailing.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-10 16:54                       ` Kent Overstreet
  2024-12-10 17:08                         ` Matthew Wilcox
@ 2024-12-10 17:31                         ` Josef Bacik
  2024-12-10 17:48                           ` Kent Overstreet
  1 sibling, 1 reply; 31+ messages in thread
From: Josef Bacik @ 2024-12-10 17:31 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Joanne Koong, Malte Schröder, Matthew Wilcox, Miklos Szeredi,
	linux-fsdevel

On Tue, Dec 10, 2024 at 11:54:18AM -0500, Kent Overstreet wrote:
> On Mon, Dec 09, 2024 at 09:14:33PM -0800, Joanne Koong wrote:
> > On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> > > > >
> > > > > On 09/12/2024 16:48, Josef Bacik wrote:
> > > > > > On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> > > > > >> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> > > > > >>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> > > > > >>>> with 3b97c3652d91 as the culprit.
> > > > > >>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> > > > > >>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> > > > > >>> got a hugepage in, we'd get each individual struct page back for the whole range
> > > > > >>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> > > > > >>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> > > > > >>> PAGE_SIZE chunks at a time.
> > > > > >>>
> > > > > >>> But now we're doing this
> > > > > >>>
> > > > > >>>     for (i = 0; i < nfolios; i++)
> > > > > >>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > > > >>>
> > > > > >>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> > > > > >>> intermediary struct page's would return the same folio, correct?  So we'd end up
> > > > > >>> with the wrong offsets for our fuse request, because they should be based from
> > > > > >>> the start of the folio, correct?
> > > > > >> I think you're 100% right.  We could put in some nice asserts to check
> > > > > >> this is what's happening, but it does seem like a rather incautious
> > > > > >> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> > > > > >> that's not guaranteed to be the case for folios found in userspace for
> > > > > >> directio.  At least the comment is wrong, and I'd suggest the code is too.
> > > > > > Ok cool, Malte can you try the attached only compile tested patch and see if the
> > > > > > problem goes away?  Thanks,
> > > > > >
> > > > > > Josef
> > > > > >
> > > > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > > > > index 88d0946b5bc9..c4b93ead99a5 100644
> > > > > > --- a/fs/fuse/file.c
> > > > > > +++ b/fs/fuse/file.c
> > > > > > @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> > > > > >               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> > > > > >
> > > > > >               ap->descs[ap->num_folios].offset = start;
> > > > > > -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> > > > > > -             for (i = 0; i < nfolios; i++)
> > > > > > -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > > > > +             for (i = 0; i < nfolios; i++) {
> > > > > > +                     struct folio *folio = page_folio(pages[i]);
> > > > > > +                     unsigned int offset = start +
> > > > > > +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > > > > > +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> > > > > > +
> > > > > > +                     len = min_t(unsigned int, len, PAGE_SIZE);
> > > > > > +
> > > > > > +                     ap->descs[ap->num_folios + i].offset = offset;
> > > > > > +                     ap->descs[ap->num_folios + i].length = len;
> > > > > > +                     ap->folios[i + ap->num_folios] = folio;
> > > > > > +                     start = 0;
> > > > > > +             }
> > > > > >
> > > > > >               ap->num_folios += nfolios;
> > > > > >               ap->descs[ap->num_folios - 1].length -=
> > > > >
> > > > > The problem persists with this patch.
> > > > >
> > >
> > > Malte, could you try Josef's patch except with that last line
> > > "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
> > > (PAGE_SIZE - 1);" also removed? I think we need that line removed as
> > > well since that does a "-=" instead of a "=" and
> > > ap->descs[ap->num_folios - 1].length gets set inside the for loop.
> > >
> > > In the meantime, I'll try to get a local repro running on fsx so that
> > > you don't have to keep testing out repos for us.
> > 
> > I was able to repro this locally by doing:
> > 
> > -- start libfuse server --
> > sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
> > 
> > -- patch + compile this (rough / ugly-for-now) code snippet --
> > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > index 777ba0de..9f040bc4 100644
> > --- a/ltp/fsx.c
> > +++ b/ltp/fsx.c
> > @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
> >         }
> >  }
> > 
> > -
> > +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
> > 
> >  void
> >  domapwrite(unsigned offset, unsigned size)
> >  {
> > @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
> >         unsigned map_size;
> >         off_t    cur_filesize;
> >         char    *p;
> > +       int ret;
> > +       unsigned size_2mib_aligned;
> > 
> >         offset -= offset % writebdy;
> >         if (size == 0) {
> > @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
> >         pg_offset = offset & PAGE_MASK;
> >         map_size  = pg_offset + size;
> > 
> > +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
> > +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
> > +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > +       if (!placeholder_map) {
> > +               prterr("domapwrite: placeholder map");
> > +               exit(202);
> > +       }
> > +
> > +       /* align address to nearest 2 MiB */
> > +       void *aligned_address =
> > +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
> > ~(TWO_MIB - 1));
> > +
> > +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
> > | PROT_WRITE,
> > +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
> > MAP_POPULATE, -1, 0);
> > +
> > +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
> > +       if (ret) {
> > +               prterr("domapwrite: madvise collapse");
> > +               exit(203);
> > +       }
> > +
> > +       memcpy(map, good_buf + offset, size);
> > +
> > +       if (lseek(fd, offset, SEEK_SET) == -1) {
> > +               prterr("domapwrite: lseek");
> > +               exit(204);
> > +       }
> > +
> > +       ret = write(fd, map, size);
> > +       if (ret == -1) {
> > +               prterr("domapwrite: write");
> > +               exit(205);
> > +       }
> > +
> > +       /*
> >         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
> >                               MAP_FILE | MAP_SHARED, fd,
> >                               (off_t)(offset - pg_offset))) == (char *)-1) {
> > @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
> >                 prterr("domapwrite: munmap");
> >                 report_failure(204);
> >         }
> > +       */
> > +       if (munmap(map, size_2mib_aligned) != 0) {
> > +               prterr("domapwrite: munmap map");
> > +               report_failure(206);
> > +       }
> > +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
> > +               prterr("domapwrite: munmap placeholder_map");
> > +               report_failure(207);
> > +       }
> >  }
> > 
> > -- run fsx test --
> > sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > 
> > On the offending commit 3b97c3652, I'm seeing:
> > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > Will begin at operation 3
> > Seed set to 1
> > ...
> > READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
> > /home/user/fuse_mount/example.txt
> > OFFSET      GOOD    BAD     RANGE
> > 0x1e43f     0x4b4a  0x114a  0x0
> > operation# (mod 256) for the bad data may be 74
> > 0x1e441     0xa64a  0xeb4a  0x1
> > operation# (mod 256) for the bad data may be 74
> > 0x1e443     0x264a  0xe44a  0x2
> > operation# (mod 256) for the bad data may be 74
> > 0x1e445     0x254a  0x9e4a  0x3
> > ...
> > Correct content saved for comparison
> > (maybe hexdump "/home/user/fuse_mount/example.txt" vs
> > "/home/user/fuse_mount/example.txt.fsxgood")
> > 
> > 
> > I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
> > -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
> > issue:
> > 
> > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > Will begin at operation 3
> > Seed set to 1
> > ...
> > copying to largest ever: 0x3e19b
> > copying to largest ever: 0x3e343
> > fallocating to largest ever: 0x40000
> > All 5000 operations completed A-OK!
> > 
> > 
> > Malte, would you mind double-checking whether this fixes the issue
> > you're seeing on your end?
> 
> I get the impression you might be flailing a bit, it seems you're not
> exactly sure what's going on, and either Willy or Josef previously
> alluded to a lack of assertions - so I'm going to echo that.
> 
> I've noticed a lot of (more junior?) kernel engineers are hesitant to
> use assertions (because of e.g. checkpatch and "don't crash the
> kernel!"), but assertions are one of the best tools we've got until we
> get to languages where we can do embedded correctness proofs.
> 
> _Especially_ when you're doing tricky data structure work, like all the
> folios stuff. As a lot of us have learned from painful experience, bugs
> in low level data structures can easily translate into silent data
> corruption bugs, so this is a situation where I'd even likely ignore the
> "don't crash the kernel" guidance and prefer BUG_ON() to WARN_ON() -
> crashing is better than silent data corruption, and it makes any bugs
> noisier so they shake out quicker.
> 
> Haven't looked at the relevant patches yet, but if you'd like me to look
> and offer suggestions I'd be happy to.

This is a failure in testing.  And I don't mean "Joanne obviously didn't test
this", I mean that we as a file system community forget that fstests isn't as
exhaustive as we think.  I was honestly surprised to see that we don't actually
have anything that purposely allocates large folios and then does O_DIRECT with
them.  Now we know, Joanne already has followups for extending fsx to purposely
allocate large folios so we can have testing coverage for this case.  Yay
software development, we find gaps, we plug them, we move on.

As an echo to Willy's point, it is a good thing to offer to help and provide
guidance, I applaud the sentiment.  Making comments about the experience level
of the people involved is not helpful and can be seen as condescending.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-10 17:08                         ` Matthew Wilcox
@ 2024-12-10 17:43                           ` Kent Overstreet
  0 siblings, 0 replies; 31+ messages in thread
From: Kent Overstreet @ 2024-12-10 17:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Joanne Koong, Malte Schröder, Josef Bacik, Miklos Szeredi,
	linux-fsdevel

On Tue, Dec 10, 2024 at 05:08:40PM +0000, Matthew Wilcox wrote:
> On Tue, Dec 10, 2024 at 11:54:18AM -0500, Kent Overstreet wrote:
> > I get the impression you might be flailing a bit, it seems you're not
> > exactly sure what's going on, and either Willy or Josef previously
> > alluded to a lack of assertions - so I'm going to echo that.
> 
> This comes across very patronising.  You don't know Joanne and I
> suggest you refrain from making assumptions.
> 
> > Haven't looked at the relevant patches yet, but if you'd like me to look
> > and offer suggestions I'd be happy to.
> 
> You should probably read more carefully and think a little harder about
> what you're writing.  I made some suggestions about what could be done
> to narrow down the bug to a specific path; Josef looked at the patch and
> found something that looked wrong.  I confirmed it looked wrong.  Josef
> sent a patch; Joanne fixed a problem with that patch.
> 
> This does not entitle you to say that Joanne is flailing.

Seriously?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-10 17:31                         ` Josef Bacik
@ 2024-12-10 17:48                           ` Kent Overstreet
  0 siblings, 0 replies; 31+ messages in thread
From: Kent Overstreet @ 2024-12-10 17:48 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Joanne Koong, Malte Schröder, Matthew Wilcox, Miklos Szeredi,
	linux-fsdevel

On Tue, Dec 10, 2024 at 12:31:51PM -0500, Josef Bacik wrote:
> This is a failure in testing.  And I don't mean "Joanne obviously didn't test
> this", I mean that we as a file system community forget that fstests isn't as
> exhaustive as we think.  I was honestly surprised to see that we don't actually
> have anything that purposely allocates large folios and then does O_DIRECT with
> them.  Now we know, Joanne already has followups for extending fsx to purposely
> allocate large folios so we can have testing coverage for this case.  Yay
> software development, we find gaps, we plug them, we move on.

Yeah, agreed.

Code coverage analysis would've also likely pointed this out.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-10  5:14                     ` Joanne Koong
  2024-12-10 16:54                       ` Kent Overstreet
@ 2024-12-10 17:53                       ` Malte Schröder
  2024-12-11  2:57                         ` Joanne Koong
  1 sibling, 1 reply; 31+ messages in thread
From: Malte Schröder @ 2024-12-10 17:53 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Josef Bacik, Matthew Wilcox, Kent Overstreet, Miklos Szeredi,
	linux-fsdevel

On 10/12/2024 06:14, Joanne Koong wrote:
> On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>>> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
>>>> On 09/12/2024 16:48, Josef Bacik wrote:
>>>>> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
>>>>>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
>>>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
>>>>>>>> with 3b97c3652d91 as the culprit.
>>>>>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
>>>>>>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
>>>>>>> got a hugepage in, we'd get each individual struct page back for the whole range
>>>>>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
>>>>>>> ->offset for each "middle" struct page as 0, since obviously we're consuming
>>>>>>> PAGE_SIZE chunks at a time.
>>>>>>>
>>>>>>> But now we're doing this
>>>>>>>
>>>>>>>     for (i = 0; i < nfolios; i++)
>>>>>>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
>>>>>>>
>>>>>>> So if userspace handed us a 2M hugepage, page_folio() on each of the
>>>>>>> intermediary struct page's would return the same folio, correct?  So we'd end up
>>>>>>> with the wrong offsets for our fuse request, because they should be based from
>>>>>>> the start of the folio, correct?
>>>>>> I think you're 100% right.  We could put in some nice asserts to check
>>>>>> this is what's happening, but it does seem like a rather incautious
>>>>>> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
>>>>>> that's not guaranteed to be the case for folios found in userspace for
>>>>>> directio.  At least the comment is wrong, and I'd suggest the code is too.
>>>>> Ok cool, Malte can you try the attached only compile tested patch and see if the
>>>>> problem goes away?  Thanks,
>>>>>
>>>>> Josef
>>>>>
>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>> index 88d0946b5bc9..c4b93ead99a5 100644
>>>>> --- a/fs/fuse/file.c
>>>>> +++ b/fs/fuse/file.c
>>>>> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>>>>>               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
>>>>>
>>>>>               ap->descs[ap->num_folios].offset = start;
>>>>> -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
>>>>> -             for (i = 0; i < nfolios; i++)
>>>>> -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
>>>>> +             for (i = 0; i < nfolios; i++) {
>>>>> +                     struct folio *folio = page_folio(pages[i]);
>>>>> +                     unsigned int offset = start +
>>>>> +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
>>>>> +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
>>>>> +
>>>>> +                     len = min_t(unsigned int, len, PAGE_SIZE);
>>>>> +
>>>>> +                     ap->descs[ap->num_folios + i].offset = offset;
>>>>> +                     ap->descs[ap->num_folios + i].length = len;
>>>>> +                     ap->folios[i + ap->num_folios] = folio;
>>>>> +                     start = 0;
>>>>> +             }
>>>>>
>>>>>               ap->num_folios += nfolios;
>>>>>               ap->descs[ap->num_folios - 1].length -=
>>>> The problem persists with this patch.
>>>>
>> Malte, could you try Josef's patch except with that last line
>> "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
>> (PAGE_SIZE - 1);" also removed? I think we need that line removed as
>> well since that does a "-=" instead of a "=" and
>> ap->descs[ap->num_folios - 1].length gets set inside the for loop.
>>
>> In the meantime, I'll try to get a local repro running on fsx so that
>> you don't have to keep testing out repos for us.
> I was able to repro this locally by doing:
>
> -- start libfuse server --
> sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
>
> -- patch + compile this (rough / ugly-for-now) code snippet --
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index 777ba0de..9f040bc4 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
>         }
>  }
>
> -
> +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
>
>  void
>  domapwrite(unsigned offset, unsigned size)
>  {
> @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
>         unsigned map_size;
>         off_t    cur_filesize;
>         char    *p;
> +       int ret;
> +       unsigned size_2mib_aligned;
>
>         offset -= offset % writebdy;
>         if (size == 0) {
> @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
>         pg_offset = offset & PAGE_MASK;
>         map_size  = pg_offset + size;
>
> +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
> +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
> +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +       if (!placeholder_map) {
> +               prterr("domapwrite: placeholder map");
> +               exit(202);
> +       }
> +
> +       /* align address to nearest 2 MiB */
> +       void *aligned_address =
> +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
> ~(TWO_MIB - 1));
> +
> +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
> | PROT_WRITE,
> +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
> MAP_POPULATE, -1, 0);
> +
> +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
> +       if (ret) {
> +               prterr("domapwrite: madvise collapse");
> +               exit(203);
> +       }
> +
> +       memcpy(map, good_buf + offset, size);
> +
> +       if (lseek(fd, offset, SEEK_SET) == -1) {
> +               prterr("domapwrite: lseek");
> +               exit(204);
> +       }
> +
> +       ret = write(fd, map, size);
> +       if (ret == -1) {
> +               prterr("domapwrite: write");
> +               exit(205);
> +       }
> +
> +       /*
>         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
>                               MAP_FILE | MAP_SHARED, fd,
>                               (off_t)(offset - pg_offset))) == (char *)-1) {
> @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
>                 prterr("domapwrite: munmap");
>                 report_failure(204);
>         }
> +       */
> +       if (munmap(map, size_2mib_aligned) != 0) {
> +               prterr("domapwrite: munmap map");
> +               report_failure(206);
> +       }
> +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
> +               prterr("domapwrite: munmap placeholder_map");
> +               report_failure(207);
> +       }
>  }
>
> -- run fsx test --
> sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
>
> On the offending commit 3b97c3652, I'm seeing:
> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> Will begin at operation 3
> Seed set to 1
> ...
> READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
> /home/user/fuse_mount/example.txt
> OFFSET      GOOD    BAD     RANGE
> 0x1e43f     0x4b4a  0x114a  0x0
> operation# (mod 256) for the bad data may be 74
> 0x1e441     0xa64a  0xeb4a  0x1
> operation# (mod 256) for the bad data may be 74
> 0x1e443     0x264a  0xe44a  0x2
> operation# (mod 256) for the bad data may be 74
> 0x1e445     0x254a  0x9e4a  0x3
> ...
> Correct content saved for comparison
> (maybe hexdump "/home/user/fuse_mount/example.txt" vs
> "/home/user/fuse_mount/example.txt.fsxgood")
>
>
> I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
> -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
> issue:
>
> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> Will begin at operation 3
> Seed set to 1
> ...
> copying to largest ever: 0x3e19b
> copying to largest ever: 0x3e343
> fallocating to largest ever: 0x40000
> All 5000 operations completed A-OK!
>
>
> Malte, would you mind double-checking whether this fixes the issue
> you're seeing on your end?

My test still fails.


>
>
> Thanks,
> Joanne
>
>> Thanks,
>> Joanne
>>> Catching up on this thread now. I'll investigate this today.
>>>
>>>> /Malte
>>>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-10 17:53                       ` Malte Schröder
@ 2024-12-11  2:57                         ` Joanne Koong
  2024-12-11  6:43                           ` Kent Overstreet
  2024-12-11 14:01                           ` Malte Schröder
  0 siblings, 2 replies; 31+ messages in thread
From: Joanne Koong @ 2024-12-11  2:57 UTC (permalink / raw)
  To: Malte Schröder
  Cc: Josef Bacik, Matthew Wilcox, Kent Overstreet, Miklos Szeredi,
	linux-fsdevel

On Tue, Dec 10, 2024 at 9:53 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
>
> On 10/12/2024 06:14, Joanne Koong wrote:
> > On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> >>>> On 09/12/2024 16:48, Josef Bacik wrote:
> >>>>> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> >>>>>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> >>>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> >>>>>>>> with 3b97c3652d91 as the culprit.
> >>>>>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> >>>>>>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> >>>>>>> got a hugepage in, we'd get each individual struct page back for the whole range
> >>>>>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> >>>>>>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> >>>>>>> PAGE_SIZE chunks at a time.
> >>>>>>>
> >>>>>>> But now we're doing this
> >>>>>>>
> >>>>>>>     for (i = 0; i < nfolios; i++)
> >>>>>>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> >>>>>>>
> >>>>>>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> >>>>>>> intermediary struct page's would return the same folio, correct?  So we'd end up
> >>>>>>> with the wrong offsets for our fuse request, because they should be based from
> >>>>>>> the start of the folio, correct?
> >>>>>> I think you're 100% right.  We could put in some nice asserts to check
> >>>>>> this is what's happening, but it does seem like a rather incautious
> >>>>>> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> >>>>>> that's not guaranteed to be the case for folios found in userspace for
> >>>>>> directio.  At least the comment is wrong, and I'd suggest the code is too.
> >>>>> Ok cool, Malte can you try the attached only compile tested patch and see if the
> >>>>> problem goes away?  Thanks,
> >>>>>
> >>>>> Josef
> >>>>>
> >>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> >>>>> index 88d0946b5bc9..c4b93ead99a5 100644
> >>>>> --- a/fs/fuse/file.c
> >>>>> +++ b/fs/fuse/file.c
> >>>>> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> >>>>>               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> >>>>>
> >>>>>               ap->descs[ap->num_folios].offset = start;
> >>>>> -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> >>>>> -             for (i = 0; i < nfolios; i++)
> >>>>> -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> >>>>> +             for (i = 0; i < nfolios; i++) {
> >>>>> +                     struct folio *folio = page_folio(pages[i]);
> >>>>> +                     unsigned int offset = start +
> >>>>> +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> >>>>> +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> >>>>> +
> >>>>> +                     len = min_t(unsigned int, len, PAGE_SIZE);
> >>>>> +
> >>>>> +                     ap->descs[ap->num_folios + i].offset = offset;
> >>>>> +                     ap->descs[ap->num_folios + i].length = len;
> >>>>> +                     ap->folios[i + ap->num_folios] = folio;
> >>>>> +                     start = 0;
> >>>>> +             }
> >>>>>
> >>>>>               ap->num_folios += nfolios;
> >>>>>               ap->descs[ap->num_folios - 1].length -=
> >>>> The problem persists with this patch.
> >>>>
> >> Malte, could you try Josef's patch except with that last line
> >> "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
> >> (PAGE_SIZE - 1);" also removed? I think we need that line removed as
> >> well since that does a "-=" instead of a "=" and
> >> ap->descs[ap->num_folios - 1].length gets set inside the for loop.
> >>
> >> In the meantime, I'll try to get a local repro running on fsx so that
> >> you don't have to keep testing out repos for us.
> > I was able to repro this locally by doing:
> >
> > -- start libfuse server --
> > sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
> >
> > -- patch + compile this (rough / ugly-for-now) code snippet --
> > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > index 777ba0de..9f040bc4 100644
> > --- a/ltp/fsx.c
> > +++ b/ltp/fsx.c
> > @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
> >         }
> >  }
> >
> > -
> > +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
> >
> >  void
> >  domapwrite(unsigned offset, unsigned size)
> >  {
> > @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
> >         unsigned map_size;
> >         off_t    cur_filesize;
> >         char    *p;
> > +       int ret;
> > +       unsigned size_2mib_aligned;
> >
> >         offset -= offset % writebdy;
> >         if (size == 0) {
> > @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
> >         pg_offset = offset & PAGE_MASK;
> >         map_size  = pg_offset + size;
> >
> > +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
> > +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
> > +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > +       if (!placeholder_map) {
> > +               prterr("domapwrite: placeholder map");
> > +               exit(202);
> > +       }
> > +
> > +       /* align address to nearest 2 MiB */
> > +       void *aligned_address =
> > +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
> > ~(TWO_MIB - 1));
> > +
> > +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
> > | PROT_WRITE,
> > +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
> > MAP_POPULATE, -1, 0);
> > +
> > +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
> > +       if (ret) {
> > +               prterr("domapwrite: madvise collapse");
> > +               exit(203);
> > +       }
> > +
> > +       memcpy(map, good_buf + offset, size);
> > +
> > +       if (lseek(fd, offset, SEEK_SET) == -1) {
> > +               prterr("domapwrite: lseek");
> > +               exit(204);
> > +       }
> > +
> > +       ret = write(fd, map, size);
> > +       if (ret == -1) {
> > +               prterr("domapwrite: write");
> > +               exit(205);
> > +       }
> > +
> > +       /*
> >         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
> >                               MAP_FILE | MAP_SHARED, fd,
> >                               (off_t)(offset - pg_offset))) == (char *)-1) {
> > @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
> >                 prterr("domapwrite: munmap");
> >                 report_failure(204);
> >         }
> > +       */
> > +       if (munmap(map, size_2mib_aligned) != 0) {
> > +               prterr("domapwrite: munmap map");
> > +               report_failure(206);
> > +       }
> > +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
> > +               prterr("domapwrite: munmap placeholder_map");
> > +               report_failure(207);
> > +       }
> >  }
> >
> > -- run fsx test --
> > sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> >
> > On the offending commit 3b97c3652, I'm seeing:
> > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > Will begin at operation 3
> > Seed set to 1
> > ...
> > READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
> > /home/user/fuse_mount/example.txt
> > OFFSET      GOOD    BAD     RANGE
> > 0x1e43f     0x4b4a  0x114a  0x0
> > operation# (mod 256) for the bad data may be 74
> > 0x1e441     0xa64a  0xeb4a  0x1
> > operation# (mod 256) for the bad data may be 74
> > 0x1e443     0x264a  0xe44a  0x2
> > operation# (mod 256) for the bad data may be 74
> > 0x1e445     0x254a  0x9e4a  0x3
> > ...
> > Correct content saved for comparison
> > (maybe hexdump "/home/user/fuse_mount/example.txt" vs
> > "/home/user/fuse_mount/example.txt.fsxgood")
> >
> >
> > I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
> > -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
> > issue:
> >
> > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > Will begin at operation 3
> > Seed set to 1
> > ...
> > copying to largest ever: 0x3e19b
> > copying to largest ever: 0x3e343
> > fallocating to largest ever: 0x40000
> > All 5000 operations completed A-OK!
> >
> >
> > Malte, would you mind double-checking whether this fixes the issue
> > you're seeing on your end?
>
> My test still fails.

Hi Malte,

I simulated your repro with installing bcachefs as the root filesystem
on a VM running Arch, then running "sudo pacman -S flatpak" and then
installing FreeCAD with "flatpak install flathub org.freecad.FreeCAD".

On base commit 3b97c3652, I see the same corruption you noted:

error: Failed to install org.kde.Platform: Error pulling from repo:
While pulling runtime/org.kde.Platform/x86_64/6.7 from remote flathub:
fsck content object
886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33:
Corrupted file object; checksum
expected='886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33'
actual='67f5a60d19f7a65e1ee272d455fed138b864be73399816ad18fa71319614a418'


i tried this patch on top of commit 3b97c3652 and it fixed it for me:

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 14af8c41fc83..0d213a22972b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1560,18 +1560,24 @@ static int fuse_get_user_pages(struct
fuse_args_pages *ap, struct iov_iter *ii,

                nbytes += ret;

-               ret += start;
-               /* Currently, all folios in FUSE are one page */
-               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
-
-               ap->folio_descs[ap->num_folios].offset = start;
-               fuse_folio_descs_length_init(ap->folio_descs,
ap->num_folios, nfolios);
-               for (i = 0; i < nfolios; i++)
-                       ap->folios[i + ap->num_folios] = page_folio(pages[i]);
-
-               ap->num_folios += nfolios;
-               ap->folio_descs[ap->num_folios - 1].length -=
-                       (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
+               nfolios = DIV_ROUND_UP(ret + start, PAGE_SIZE);
+
+               for (i = 0; i < nfolios; i++) {
+                       struct folio *folio = page_folio(pages[i]);
+                       unsigned int offset = start +
+                               (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
+                       unsigned int len = min_t(unsigned int, ret,
folio_size(folio) - offset);
+
+                       len = min_t(unsigned int, len, PAGE_SIZE - start);
+
+                       ap->descs[ap->num_folios].offset = offset;
+                       ap->descs[ap->num_folios].length = len;
+                       ap->folios[ap->num_folios] = folio;
+                       start = 0;
+                       ret -= len;
+                       ap->num_folios++;
+               }
+
                nr_pages += nfolios;
        }
        kfree(pages);

I ran this multiple times and don't see the corruption issues.

However, I do see another issue crop up on some of my VMs, which is
flatpak hanging with this corresponding stack trace:

[  368.520976] INFO: task pool-/usr/lib/f:582 blocked for more than 122 seconds.
[  368.521509]       Not tainted 6.12.0-rc1-g86b74eb5a11e #734
[  368.521905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  368.522483] task:pool-/usr/lib/f state:D stack:0     pid:582
tgid:405   ppid:1      flags:0x00000002
[  368.523238] Call Trace:
[  368.523411]  <TASK>
[  368.523548]  __schedule+0xaf0/0x27f0
[  368.523773]  ? __pfx___schedule+0x10/0x10
[  368.524024]  schedule+0x7e/0x300
[  368.524233]  __wait_on_freeing_inode+0xda/0x120
[  368.524527]  ? __pfx___wait_on_freeing_inode+0x10/0x10
[  368.524830]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[  368.525124]  ? __pfx_wake_bit_function+0x10/0x10
[  368.525411]  bch2_inode_hash_find+0x372/0x580
[  368.525700]  ? __pfx_bch2_dirent_read_target+0x10/0x10
[  368.526023]  ? bch2_dirent_hash+0x23d/0x370
[  368.526286]  ? __pfx_bch2_inode_hash_find+0x10/0x10
[  368.526583]  ? __pfx_dirent_cmp_key+0x10/0x10
[  368.526972]  bch2_lookup_trans+0x61a/0x990
[  368.527247]  ? __pfx_bch2_lookup_trans+0x10/0x10
[  368.527559]  ? __do_sys_newfstatat+0x75/0xd0
[  368.527850]  ? do_syscall_64+0x4b/0x110
[  368.528088]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  368.528411]  ? __pfx_bch2_hash_info_init+0x10/0x10
[  368.528720]  ? __mod_memcg_state+0x102/0x390
[  368.529134]  ? obj_cgroup_charge+0x1b4/0x4c0
[  368.529500]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
[  368.529913]  ? kvm_sched_clock_read+0x11/0x20
[  368.530265]  ? _raw_spin_lock+0x85/0xe0
[  368.530553]  ? bch2_lookup_trans+0x323/0x990
[  368.530909]  ? __asan_memset+0x24/0x50
[  368.531204]  ? __bch2_trans_get+0x735/0xdd0
[  368.531525]  ? bch2_lookup+0x18a/0x350
[  368.531859]  bch2_lookup+0x18a/0x350
[  368.532145]  ? __pfx_bch2_lookup+0x10/0x10
[  368.532466]  ? __pfx_lockref_get_not_dead+0x10/0x10
[  368.532860]  __lookup_slow+0x182/0x350
[  368.533159]  ? __pfx___lookup_slow+0x10/0x10
[  368.533513]  walk_component+0x2ab/0x4f0
[  368.533816]  path_lookupat+0x120/0x660
[  368.534110]  ? vfs_fstatat+0x53/0xb0
[  368.534413]  filename_lookup+0x1aa/0x520
[  368.534718]  ? __pfx_filename_lookup+0x10/0x10
[  368.535052]  ? __wake_up_common+0xf2/0x170
[  368.535385]  ? _raw_spin_unlock_irq+0xe/0x30
[  368.535683]  ? __pfx_eventfd_write+0x10/0x10
[  368.536019]  ? kasan_save_stack+0x24/0x50
[  368.536313]  ? __kasan_record_aux_stack+0xad/0xc0
[  368.536651]  ? kmem_cache_free+0x353/0x550
[  368.536919]  ? do_syscall_64+0x4b/0x110
[  368.537179]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  368.537528]  vfs_statx+0xbf/0x140
[  368.537745]  ? kmem_cache_alloc_noprof+0x12d/0x340
[  368.538057]  ? __pfx_vfs_statx+0x10/0x10
[  368.538322]  ? getname_flags.part.0+0xaf/0x4a0
[  368.538627]  vfs_fstatat+0x6c/0xb0
[  368.538840]  __do_sys_newfstatat+0x75/0xd0
[  368.539100]  ? __pfx___do_sys_newfstatat+0x10/0x10
[  368.539415]  ? kasan_unpoison+0x27/0x60
[  368.539675]  ? __pfx_slab_free_after_rcu_debug+0x10/0x10
[  368.540081]  ? fdget_pos+0x366/0x5d0
[  368.540355]  ? fput+0x1b/0x2d0
[  368.540602]  ? ksys_write+0x18c/0x1c0
[  368.540934]  ? __pfx_ksys_write+0x10/0x10
[  368.541252]  ? fpregs_assert_state_consistent+0x20/0xa0
[  368.541657]  ? clear_bhb_loop+0x45/0xa0
[  368.541948]  do_syscall_64+0x4b/0x110
[  368.542270]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  368.542711] RIP: 0033:0x7fef789bb94e
[  368.543001] RSP: 002b:00007fef713fe468 EFLAGS: 00000246 ORIG_RAX:
0000000000000106
[  368.543605] RAX: ffffffffffffffda RBX: 00007fef713fe5d0 RCX: 00007fef789bb94e
[  368.544129] RDX: 00007fef713fe540 RSI: 00007fef713fe5d0 RDI: 000000000000001f
[  368.544636] RBP: 00007fef713ff710 R08: 0000000000000073 R09: 0000000000000000
[  368.545130] R10: 0000000000000100 R11: 0000000000000246 R12: 000000000000001f
[  368.545640] R13: 00007fef713fe540 R14: 00007fef5802dc00 R15: 00007fef713ff780
[  368.546177]  </TASK>

I'm seeing this happen on commit 86b74eb5 as well, which is the commit
that's before any of my or Josef's folio changes, and this persists
across reboots. I've seen this happen for a couple other tasks too,
"task:flatpak" and "task:pool-flatpak" with similar stack traces eg

[  368.505270] task:pool-flatpak in state:D stack:0     pid:568
tgid:538   ppid:537    flags:0x00000002
[  368.505872] Call Trace:
[  368.506042]  <TASK>
[  368.506191]  __schedule+0xaf0/0x27f0
[  368.506430]  ? __pfx___schedule+0x10/0x10
[  368.506715]  schedule+0x7e/0x300
[  368.506940]  __wait_on_freeing_inode+0xda/0x120
[  368.507490]  ? __pfx___wait_on_freeing_inode+0x10/0x10
[  368.507965]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[  368.508312]  ? __pfx_wake_bit_function+0x10/0x10
[  368.508650]  bch2_inode_hash_find+0x372/0x580
[  368.508969]  ? __pfx_bch2_dirent_read_target+0x10/0x10
[  368.509332]  ? bch2_dirent_hash+0x23d/0x370
[  368.509637]  ? __pfx_bch2_inode_hash_find+0x10/0x10
[  368.509991]  ? __pfx_dirent_cmp_key+0x10/0x10
[  368.510309]  bch2_lookup_trans+0x61a/0x990
[  368.510662]  ? __pfx_bch2_lookup_trans+0x10/0x10
[  368.511155]  ? vfs_statx+0xbf/0x140
[  368.511429]  ? vfs_fstatat+0x6c/0xb0
[  368.511731]  ? do_syscall_64+0x4b/0x110
[  368.512014]  ? __d_alloc+0x5cc/0x8f0
[  368.512283]  ? memcg_list_lru_alloc+0x184/0x8a0
[  368.512612]  ? __pfx_bch2_hash_info_init+0x10/0x10
[  368.512943]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  368.513327]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
[  368.513753]  ? kvm_sched_clock_read+0x11/0x20
[  368.514133]  ? _raw_spin_lock+0x85/0xe0
[  368.515095]  ? bch2_lookup_trans+0x323/0x990
[  368.516994]  ? __asan_memset+0x24/0x50
[  368.517294]  ? __bch2_trans_get+0x735/0xdd0
[  368.517675]  ? bch2_lookup+0x18a/0x350
[  368.517987]  bch2_lookup+0x18a/0x350
[  368.518271]  ? __pfx_bch2_lookup+0x10/0x10
[  368.518568]  ? __pfx_lockref_get_not_dead+0x10/0x10
[  368.518933]  __lookup_slow+0x182/0x350
[  368.519264]  ? __pfx___lookup_slow+0x10/0x10
[  368.519633]  walk_component+0x2ab/0x4f0
[  368.519934]  ? fput+0x1b/0x2d0
[  368.520206]  link_path_walk.part.0.constprop.0+0x4ad/0xac0
[  368.520657]  path_lookupat+0x72/0x660
[  368.521022]  ? vfs_fstatat+0x53/0xb0
[  368.521343]  filename_lookup+0x1aa/0x520
[  368.521664]  ? __pfx_filename_lookup+0x10/0x10
[  368.522021]  ? __pfx_xa_load+0x10/0x10
[  368.522319]  ? ___slab_alloc+0x128/0x9d0
[  368.522683]  ? mntput_no_expire+0xcf/0x760
[  368.523018]  ? lockref_put_return+0xc7/0x140
[  368.523357]  ? kmem_cache_free+0x19e/0x550
[  368.523686]  ? __pfx_mutex_unlock+0x10/0x10
[  368.524035]  ? do_renameat2+0x1f4/0xa50
[  368.524404]  ? do_renameat2+0x1f4/0xa50
[  368.524716]  vfs_statx+0xbf/0x140
[  368.524980]  ? kmem_cache_alloc_noprof+0x12d/0x340
[  368.525355]  ? __pfx_vfs_statx+0x10/0x10
[  368.525702]  ? getname_flags.part.0+0xaf/0x4a0
[  368.526080]  vfs_fstatat+0x6c/0xb0
[  368.526391]  __do_sys_newfstatat+0x75/0xd0
[  368.526745]  ? __pfx___do_sys_newfstatat+0x10/0x10
[  368.527206]  ? do_symlinkat+0x15d/0x260
[  368.527509]  ? do_mkdirat+0x19d/0x2c0
[  368.527807]  ? kasan_save_track+0x14/0x30
[  368.528132]  ? getname_flags.part.0+0xaf/0x4a0
[  368.528509]  ? fpregs_assert_state_consistent+0x20/0xa0
[  368.529198]  ? clear_bhb_loop+0x45/0xa0
[  368.529536]  do_syscall_64+0x4b/0x110
[  368.529856]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  368.530279] RIP: 0033:0x7f592256194e
[  368.530610] RSP: 002b:00007f5915bfe548 EFLAGS: 00000217 ORIG_RAX:
0000000000000106
[  368.531350] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f592256194e
[  368.531943] RDX: 00007f5915bfe570 RSI: 00007f5915bfe600 RDI: 0000000000000013
[  368.532491] RBP: 00007f5915bfe740 R08: 0000000000000073 R09: 0000000000000000
[  368.533048] R10: 0000000000000100 R11: 0000000000000217 R12: 00007f5915bfe570
[  368.533597] R13: 00007f5915bfe600 R14: 00007f5915bfe56c R15: 00007f5915bfe780
[  368.534236]  </TASK>
[  368.534419] INFO: task pool-flatpak in:571 blocked for more than 122 seconds.

This seems like something we should investigate as well. I'm happy to
help repro this if needed. I'm able to hit this pretty consistently.

Malte, would you mind applying the patch above and confirming if this
fixes it for you on your VM? And while you've been running flatpak,
have you also been seeing some flatpak tasks hanging on some of your
VMs?


Thanks,
Joanne

>
>
> >
> >
> > Thanks,
> > Joanne
> >
> >> Thanks,
> >> Joanne
> >>> Catching up on this thread now. I'll investigate this today.
> >>>
> >>>> /Malte
> >>>>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-11  2:57                         ` Joanne Koong
@ 2024-12-11  6:43                           ` Kent Overstreet
  2024-12-11 20:32                             ` Joanne Koong
  2024-12-11 14:01                           ` Malte Schröder
  1 sibling, 1 reply; 31+ messages in thread
From: Kent Overstreet @ 2024-12-11  6:43 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Malte Schröder, Josef Bacik, Matthew Wilcox, Miklos Szeredi,
	linux-fsdevel

On Tue, Dec 10, 2024 at 06:57:41PM -0800, Joanne Koong wrote:
> On Tue, Dec 10, 2024 at 9:53 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> >
> > On 10/12/2024 06:14, Joanne Koong wrote:
> > > On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >>> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> > >>>> On 09/12/2024 16:48, Josef Bacik wrote:
> > >>>>> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> > >>>>>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> > >>>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> > >>>>>>>> with 3b97c3652d91 as the culprit.
> > >>>>>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> > >>>>>>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> > >>>>>>> got a hugepage in, we'd get each individual struct page back for the whole range
> > >>>>>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> > >>>>>>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> > >>>>>>> PAGE_SIZE chunks at a time.
> > >>>>>>>
> > >>>>>>> But now we're doing this
> > >>>>>>>
> > >>>>>>>     for (i = 0; i < nfolios; i++)
> > >>>>>>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > >>>>>>>
> > >>>>>>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> > >>>>>>> intermediary struct page's would return the same folio, correct?  So we'd end up
> > >>>>>>> with the wrong offsets for our fuse request, because they should be based from
> > >>>>>>> the start of the folio, correct?
> > >>>>>> I think you're 100% right.  We could put in some nice asserts to check
> > >>>>>> this is what's happening, but it does seem like a rather incautious
> > >>>>>> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> > >>>>>> that's not guaranteed to be the case for folios found in userspace for
> > >>>>>> directio.  At least the comment is wrong, and I'd suggest the code is too.
> > >>>>> Ok cool, Malte can you try the attached only compile tested patch and see if the
> > >>>>> problem goes away?  Thanks,
> > >>>>>
> > >>>>> Josef
> > >>>>>
> > >>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > >>>>> index 88d0946b5bc9..c4b93ead99a5 100644
> > >>>>> --- a/fs/fuse/file.c
> > >>>>> +++ b/fs/fuse/file.c
> > >>>>> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> > >>>>>               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> > >>>>>
> > >>>>>               ap->descs[ap->num_folios].offset = start;
> > >>>>> -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> > >>>>> -             for (i = 0; i < nfolios; i++)
> > >>>>> -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > >>>>> +             for (i = 0; i < nfolios; i++) {
> > >>>>> +                     struct folio *folio = page_folio(pages[i]);
> > >>>>> +                     unsigned int offset = start +
> > >>>>> +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > >>>>> +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> > >>>>> +
> > >>>>> +                     len = min_t(unsigned int, len, PAGE_SIZE);
> > >>>>> +
> > >>>>> +                     ap->descs[ap->num_folios + i].offset = offset;
> > >>>>> +                     ap->descs[ap->num_folios + i].length = len;
> > >>>>> +                     ap->folios[i + ap->num_folios] = folio;
> > >>>>> +                     start = 0;
> > >>>>> +             }
> > >>>>>
> > >>>>>               ap->num_folios += nfolios;
> > >>>>>               ap->descs[ap->num_folios - 1].length -=
> > >>>> The problem persists with this patch.
> > >>>>
> > >> Malte, could you try Josef's patch except with that last line
> > >> "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
> > >> (PAGE_SIZE - 1);" also removed? I think we need that line removed as
> > >> well since that does a "-=" instead of a "=" and
> > >> ap->descs[ap->num_folios - 1].length gets set inside the for loop.
> > >>
> > >> In the meantime, I'll try to get a local repro running on fsx so that
> > >> you don't have to keep testing out repos for us.
> > > I was able to repro this locally by doing:
> > >
> > > -- start libfuse server --
> > > sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
> > >
> > > -- patch + compile this (rough / ugly-for-now) code snippet --
> > > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > > index 777ba0de..9f040bc4 100644
> > > --- a/ltp/fsx.c
> > > +++ b/ltp/fsx.c
> > > @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
> > >         }
> > >  }
> > >
> > > -
> > > +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
> > >
> > >  void
> > >  domapwrite(unsigned offset, unsigned size)
> > >  {
> > > @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
> > >         unsigned map_size;
> > >         off_t    cur_filesize;
> > >         char    *p;
> > > +       int ret;
> > > +       unsigned size_2mib_aligned;
> > >
> > >         offset -= offset % writebdy;
> > >         if (size == 0) {
> > > @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
> > >         pg_offset = offset & PAGE_MASK;
> > >         map_size  = pg_offset + size;
> > >
> > > +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
> > > +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
> > > +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > +       if (!placeholder_map) {
> > > +               prterr("domapwrite: placeholder map");
> > > +               exit(202);
> > > +       }
> > > +
> > > +       /* align address to nearest 2 MiB */
> > > +       void *aligned_address =
> > > +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
> > > ~(TWO_MIB - 1));
> > > +
> > > +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
> > > | PROT_WRITE,
> > > +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
> > > MAP_POPULATE, -1, 0);
> > > +
> > > +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
> > > +       if (ret) {
> > > +               prterr("domapwrite: madvise collapse");
> > > +               exit(203);
> > > +       }
> > > +
> > > +       memcpy(map, good_buf + offset, size);
> > > +
> > > +       if (lseek(fd, offset, SEEK_SET) == -1) {
> > > +               prterr("domapwrite: lseek");
> > > +               exit(204);
> > > +       }
> > > +
> > > +       ret = write(fd, map, size);
> > > +       if (ret == -1) {
> > > +               prterr("domapwrite: write");
> > > +               exit(205);
> > > +       }
> > > +
> > > +       /*
> > >         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
> > >                               MAP_FILE | MAP_SHARED, fd,
> > >                               (off_t)(offset - pg_offset))) == (char *)-1) {
> > > @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
> > >                 prterr("domapwrite: munmap");
> > >                 report_failure(204);
> > >         }
> > > +       */
> > > +       if (munmap(map, size_2mib_aligned) != 0) {
> > > +               prterr("domapwrite: munmap map");
> > > +               report_failure(206);
> > > +       }
> > > +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
> > > +               prterr("domapwrite: munmap placeholder_map");
> > > +               report_failure(207);
> > > +       }
> > >  }
> > >
> > > -- run fsx test --
> > > sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > >
> > > On the offending commit 3b97c3652, I'm seeing:
> > > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > > Will begin at operation 3
> > > Seed set to 1
> > > ...
> > > READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
> > > /home/user/fuse_mount/example.txt
> > > OFFSET      GOOD    BAD     RANGE
> > > 0x1e43f     0x4b4a  0x114a  0x0
> > > operation# (mod 256) for the bad data may be 74
> > > 0x1e441     0xa64a  0xeb4a  0x1
> > > operation# (mod 256) for the bad data may be 74
> > > 0x1e443     0x264a  0xe44a  0x2
> > > operation# (mod 256) for the bad data may be 74
> > > 0x1e445     0x254a  0x9e4a  0x3
> > > ...
> > > Correct content saved for comparison
> > > (maybe hexdump "/home/user/fuse_mount/example.txt" vs
> > > "/home/user/fuse_mount/example.txt.fsxgood")
> > >
> > >
> > > I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
> > > -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
> > > issue:
> > >
> > > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > > Will begin at operation 3
> > > Seed set to 1
> > > ...
> > > copying to largest ever: 0x3e19b
> > > copying to largest ever: 0x3e343
> > > fallocating to largest ever: 0x40000
> > > All 5000 operations completed A-OK!
> > >
> > >
> > > Malte, would you mind double-checking whether this fixes the issue
> > > you're seeing on your end?
> >
> > My test still fails.
> 
> Hi Malte,
> 
> I simulated your repro with installing bcachefs as the root filesystem
> on a VM running Arch, then running "sudo pacman -S flatpak" and then
> installing FreeCAD with "flatpak install flathub org.freecad.FreeCAD".
> 
> On base commit 3b97c3652, I see the same corruption you noted:
> 
> error: Failed to install org.kde.Platform: Error pulling from repo:
> While pulling runtime/org.kde.Platform/x86_64/6.7 from remote flathub:
> fsck content object
> 886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33:
> Corrupted file object; checksum
> expected='886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33'
> actual='67f5a60d19f7a65e1ee272d455fed138b864be73399816ad18fa71319614a418'
> 
> 
> i tried this patch on top of commit 3b97c3652 and it fixed it for me:
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 14af8c41fc83..0d213a22972b 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1560,18 +1560,24 @@ static int fuse_get_user_pages(struct
> fuse_args_pages *ap, struct iov_iter *ii,
> 
>                 nbytes += ret;
> 
> -               ret += start;
> -               /* Currently, all folios in FUSE are one page */
> -               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> -
> -               ap->folio_descs[ap->num_folios].offset = start;
> -               fuse_folio_descs_length_init(ap->folio_descs,
> ap->num_folios, nfolios);
> -               for (i = 0; i < nfolios; i++)
> -                       ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> -
> -               ap->num_folios += nfolios;
> -               ap->folio_descs[ap->num_folios - 1].length -=
> -                       (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
> +               nfolios = DIV_ROUND_UP(ret + start, PAGE_SIZE);
> +
> +               for (i = 0; i < nfolios; i++) {
> +                       struct folio *folio = page_folio(pages[i]);
> +                       unsigned int offset = start +
> +                               (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> +                       unsigned int len = min_t(unsigned int, ret,
> folio_size(folio) - offset);
> +
> +                       len = min_t(unsigned int, len, PAGE_SIZE - start);
> +
> +                       ap->descs[ap->num_folios].offset = offset;
> +                       ap->descs[ap->num_folios].length = len;
> +                       ap->folios[ap->num_folios] = folio;
> +                       start = 0;
> +                       ret -= len;
> +                       ap->num_folios++;
> +               }
> +
>                 nr_pages += nfolios;
>         }
>         kfree(pages);
> 
> I ran this multiple times and don't see the corruption issues.
> 
> However, I do see another issue crop up on some of my VMs, which is
> flatpak hanging with this corresponding stack trace:
> 
> [  368.520976] INFO: task pool-/usr/lib/f:582 blocked for more than 122 seconds.
> [  368.521509]       Not tainted 6.12.0-rc1-g86b74eb5a11e #734
> [  368.521905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  368.522483] task:pool-/usr/lib/f state:D stack:0     pid:582
> tgid:405   ppid:1      flags:0x00000002
> [  368.523238] Call Trace:
> [  368.523411]  <TASK>
> [  368.523548]  __schedule+0xaf0/0x27f0
> [  368.523773]  ? __pfx___schedule+0x10/0x10
> [  368.524024]  schedule+0x7e/0x300
> [  368.524233]  __wait_on_freeing_inode+0xda/0x120
> [  368.524527]  ? __pfx___wait_on_freeing_inode+0x10/0x10
> [  368.524830]  ? _raw_spin_unlock_irqrestore+0xe/0x30
> [  368.525124]  ? __pfx_wake_bit_function+0x10/0x10
> [  368.525411]  bch2_inode_hash_find+0x372/0x580
> [  368.525700]  ? __pfx_bch2_dirent_read_target+0x10/0x10
> [  368.526023]  ? bch2_dirent_hash+0x23d/0x370
> [  368.526286]  ? __pfx_bch2_inode_hash_find+0x10/0x10
> [  368.526583]  ? __pfx_dirent_cmp_key+0x10/0x10
> [  368.526972]  bch2_lookup_trans+0x61a/0x990
> [  368.527247]  ? __pfx_bch2_lookup_trans+0x10/0x10
> [  368.527559]  ? __do_sys_newfstatat+0x75/0xd0
> [  368.527850]  ? do_syscall_64+0x4b/0x110
> [  368.528088]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.528411]  ? __pfx_bch2_hash_info_init+0x10/0x10
> [  368.528720]  ? __mod_memcg_state+0x102/0x390
> [  368.529134]  ? obj_cgroup_charge+0x1b4/0x4c0
> [  368.529500]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
> [  368.529913]  ? kvm_sched_clock_read+0x11/0x20
> [  368.530265]  ? _raw_spin_lock+0x85/0xe0
> [  368.530553]  ? bch2_lookup_trans+0x323/0x990
> [  368.530909]  ? __asan_memset+0x24/0x50
> [  368.531204]  ? __bch2_trans_get+0x735/0xdd0
> [  368.531525]  ? bch2_lookup+0x18a/0x350
> [  368.531859]  bch2_lookup+0x18a/0x350
> [  368.532145]  ? __pfx_bch2_lookup+0x10/0x10
> [  368.532466]  ? __pfx_lockref_get_not_dead+0x10/0x10
> [  368.532860]  __lookup_slow+0x182/0x350
> [  368.533159]  ? __pfx___lookup_slow+0x10/0x10
> [  368.533513]  walk_component+0x2ab/0x4f0
> [  368.533816]  path_lookupat+0x120/0x660
> [  368.534110]  ? vfs_fstatat+0x53/0xb0
> [  368.534413]  filename_lookup+0x1aa/0x520
> [  368.534718]  ? __pfx_filename_lookup+0x10/0x10
> [  368.535052]  ? __wake_up_common+0xf2/0x170
> [  368.535385]  ? _raw_spin_unlock_irq+0xe/0x30
> [  368.535683]  ? __pfx_eventfd_write+0x10/0x10
> [  368.536019]  ? kasan_save_stack+0x24/0x50
> [  368.536313]  ? __kasan_record_aux_stack+0xad/0xc0
> [  368.536651]  ? kmem_cache_free+0x353/0x550
> [  368.536919]  ? do_syscall_64+0x4b/0x110
> [  368.537179]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.537528]  vfs_statx+0xbf/0x140
> [  368.537745]  ? kmem_cache_alloc_noprof+0x12d/0x340
> [  368.538057]  ? __pfx_vfs_statx+0x10/0x10
> [  368.538322]  ? getname_flags.part.0+0xaf/0x4a0
> [  368.538627]  vfs_fstatat+0x6c/0xb0
> [  368.538840]  __do_sys_newfstatat+0x75/0xd0
> [  368.539100]  ? __pfx___do_sys_newfstatat+0x10/0x10
> [  368.539415]  ? kasan_unpoison+0x27/0x60
> [  368.539675]  ? __pfx_slab_free_after_rcu_debug+0x10/0x10
> [  368.540081]  ? fdget_pos+0x366/0x5d0
> [  368.540355]  ? fput+0x1b/0x2d0
> [  368.540602]  ? ksys_write+0x18c/0x1c0
> [  368.540934]  ? __pfx_ksys_write+0x10/0x10
> [  368.541252]  ? fpregs_assert_state_consistent+0x20/0xa0
> [  368.541657]  ? clear_bhb_loop+0x45/0xa0
> [  368.541948]  do_syscall_64+0x4b/0x110
> [  368.542270]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.542711] RIP: 0033:0x7fef789bb94e
> [  368.543001] RSP: 002b:00007fef713fe468 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000106
> [  368.543605] RAX: ffffffffffffffda RBX: 00007fef713fe5d0 RCX: 00007fef789bb94e
> [  368.544129] RDX: 00007fef713fe540 RSI: 00007fef713fe5d0 RDI: 000000000000001f
> [  368.544636] RBP: 00007fef713ff710 R08: 0000000000000073 R09: 0000000000000000
> [  368.545130] R10: 0000000000000100 R11: 0000000000000246 R12: 000000000000001f
> [  368.545640] R13: 00007fef713fe540 R14: 00007fef5802dc00 R15: 00007fef713ff780
> [  368.546177]  </TASK>
> 
> I'm seeing this happen on commit 86b74eb5 as well, which is the commit
> that's before any of my or Josef's folio changes, and this persists
> across reboots. I've seen this happen for a couple other tasks too,
> "task:flatpak" and "task:pool-flatpak" with similar stack traces eg
> 
> [  368.505270] task:pool-flatpak in state:D stack:0     pid:568
> tgid:538   ppid:537    flags:0x00000002
> [  368.505872] Call Trace:
> [  368.506042]  <TASK>
> [  368.506191]  __schedule+0xaf0/0x27f0
> [  368.506430]  ? __pfx___schedule+0x10/0x10
> [  368.506715]  schedule+0x7e/0x300
> [  368.506940]  __wait_on_freeing_inode+0xda/0x120
> [  368.507490]  ? __pfx___wait_on_freeing_inode+0x10/0x10
> [  368.507965]  ? _raw_spin_unlock_irqrestore+0xe/0x30
> [  368.508312]  ? __pfx_wake_bit_function+0x10/0x10
> [  368.508650]  bch2_inode_hash_find+0x372/0x580
> [  368.508969]  ? __pfx_bch2_dirent_read_target+0x10/0x10
> [  368.509332]  ? bch2_dirent_hash+0x23d/0x370
> [  368.509637]  ? __pfx_bch2_inode_hash_find+0x10/0x10
> [  368.509991]  ? __pfx_dirent_cmp_key+0x10/0x10
> [  368.510309]  bch2_lookup_trans+0x61a/0x990
> [  368.510662]  ? __pfx_bch2_lookup_trans+0x10/0x10
> [  368.511155]  ? vfs_statx+0xbf/0x140
> [  368.511429]  ? vfs_fstatat+0x6c/0xb0
> [  368.511731]  ? do_syscall_64+0x4b/0x110
> [  368.512014]  ? __d_alloc+0x5cc/0x8f0
> [  368.512283]  ? memcg_list_lru_alloc+0x184/0x8a0
> [  368.512612]  ? __pfx_bch2_hash_info_init+0x10/0x10
> [  368.512943]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.513327]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
> [  368.513753]  ? kvm_sched_clock_read+0x11/0x20
> [  368.514133]  ? _raw_spin_lock+0x85/0xe0
> [  368.515095]  ? bch2_lookup_trans+0x323/0x990
> [  368.516994]  ? __asan_memset+0x24/0x50
> [  368.517294]  ? __bch2_trans_get+0x735/0xdd0
> [  368.517675]  ? bch2_lookup+0x18a/0x350
> [  368.517987]  bch2_lookup+0x18a/0x350
> [  368.518271]  ? __pfx_bch2_lookup+0x10/0x10
> [  368.518568]  ? __pfx_lockref_get_not_dead+0x10/0x10
> [  368.518933]  __lookup_slow+0x182/0x350
> [  368.519264]  ? __pfx___lookup_slow+0x10/0x10
> [  368.519633]  walk_component+0x2ab/0x4f0
> [  368.519934]  ? fput+0x1b/0x2d0
> [  368.520206]  link_path_walk.part.0.constprop.0+0x4ad/0xac0
> [  368.520657]  path_lookupat+0x72/0x660
> [  368.521022]  ? vfs_fstatat+0x53/0xb0
> [  368.521343]  filename_lookup+0x1aa/0x520
> [  368.521664]  ? __pfx_filename_lookup+0x10/0x10
> [  368.522021]  ? __pfx_xa_load+0x10/0x10
> [  368.522319]  ? ___slab_alloc+0x128/0x9d0
> [  368.522683]  ? mntput_no_expire+0xcf/0x760
> [  368.523018]  ? lockref_put_return+0xc7/0x140
> [  368.523357]  ? kmem_cache_free+0x19e/0x550
> [  368.523686]  ? __pfx_mutex_unlock+0x10/0x10
> [  368.524035]  ? do_renameat2+0x1f4/0xa50
> [  368.524404]  ? do_renameat2+0x1f4/0xa50
> [  368.524716]  vfs_statx+0xbf/0x140
> [  368.524980]  ? kmem_cache_alloc_noprof+0x12d/0x340
> [  368.525355]  ? __pfx_vfs_statx+0x10/0x10
> [  368.525702]  ? getname_flags.part.0+0xaf/0x4a0
> [  368.526080]  vfs_fstatat+0x6c/0xb0
> [  368.526391]  __do_sys_newfstatat+0x75/0xd0
> [  368.526745]  ? __pfx___do_sys_newfstatat+0x10/0x10
> [  368.527206]  ? do_symlinkat+0x15d/0x260
> [  368.527509]  ? do_mkdirat+0x19d/0x2c0
> [  368.527807]  ? kasan_save_track+0x14/0x30
> [  368.528132]  ? getname_flags.part.0+0xaf/0x4a0
> [  368.528509]  ? fpregs_assert_state_consistent+0x20/0xa0
> [  368.529198]  ? clear_bhb_loop+0x45/0xa0
> [  368.529536]  do_syscall_64+0x4b/0x110
> [  368.529856]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.530279] RIP: 0033:0x7f592256194e
> [  368.530610] RSP: 002b:00007f5915bfe548 EFLAGS: 00000217 ORIG_RAX:
> 0000000000000106
> [  368.531350] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f592256194e
> [  368.531943] RDX: 00007f5915bfe570 RSI: 00007f5915bfe600 RDI: 0000000000000013
> [  368.532491] RBP: 00007f5915bfe740 R08: 0000000000000073 R09: 0000000000000000
> [  368.533048] R10: 0000000000000100 R11: 0000000000000217 R12: 00007f5915bfe570
> [  368.533597] R13: 00007f5915bfe600 R14: 00007f5915bfe56c R15: 00007f5915bfe780
> [  368.534236]  </TASK>
> [  368.534419] INFO: task pool-flatpak in:571 blocked for more than 122 seconds.
> 
> This seems like something we should investigate as well. I'm happy to
> help repro this if needed. I'm able to hit this pretty consistently.
> 
> Malte, would you mind applying the patch above and confirming if this
> fixes it for you on your VM? And while you've been running flatpak,
> have you also been seeing some flatpak tasks hanging on some of your
> VMs?

You're testing with snapshots? That one is fixed in my master branch.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-11  2:57                         ` Joanne Koong
  2024-12-11  6:43                           ` Kent Overstreet
@ 2024-12-11 14:01                           ` Malte Schröder
  2024-12-11 19:43                             ` Joanne Koong
  1 sibling, 1 reply; 31+ messages in thread
From: Malte Schröder @ 2024-12-11 14:01 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Josef Bacik, Matthew Wilcox, Kent Overstreet, Miklos Szeredi,
	linux-fsdevel

On 11/12/2024 03:57, Joanne Koong wrote:
> On Tue, Dec 10, 2024 at 9:53 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
>> On 10/12/2024 06:14, Joanne Koong wrote:
>>> On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
>>>>>> On 09/12/2024 16:48, Josef Bacik wrote:
>>>>>>> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
>>>>>>>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
>>>>>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
>>>>>>>>>> with 3b97c3652d91 as the culprit.
>>>>>>>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
>>>>>>>>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
>>>>>>>>> got a hugepage in, we'd get each individual struct page back for the whole range
>>>>>>>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
>>>>>>>>> ->offset for each "middle" struct page as 0, since obviously we're consuming
>>>>>>>>> PAGE_SIZE chunks at a time.
>>>>>>>>>
>>>>>>>>> But now we're doing this
>>>>>>>>>
>>>>>>>>>     for (i = 0; i < nfolios; i++)
>>>>>>>>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
>>>>>>>>>
>>>>>>>>> So if userspace handed us a 2M hugepage, page_folio() on each of the
>>>>>>>>> intermediary struct page's would return the same folio, correct?  So we'd end up
>>>>>>>>> with the wrong offsets for our fuse request, because they should be based from
>>>>>>>>> the start of the folio, correct?
>>>>>>>> I think you're 100% right.  We could put in some nice asserts to check
>>>>>>>> this is what's happening, but it does seem like a rather incautious
>>>>>>>> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
>>>>>>>> that's not guaranteed to be the case for folios found in userspace for
>>>>>>>> directio.  At least the comment is wrong, and I'd suggest the code is too.
>>>>>>> Ok cool, Malte can you try the attached only compile tested patch and see if the
>>>>>>> problem goes away?  Thanks,
>>>>>>>
>>>>>>> Josef
>>>>>>>
>>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>>> index 88d0946b5bc9..c4b93ead99a5 100644
>>>>>>> --- a/fs/fuse/file.c
>>>>>>> +++ b/fs/fuse/file.c
>>>>>>> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>>>>>>>               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
>>>>>>>
>>>>>>>               ap->descs[ap->num_folios].offset = start;
>>>>>>> -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
>>>>>>> -             for (i = 0; i < nfolios; i++)
>>>>>>> -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
>>>>>>> +             for (i = 0; i < nfolios; i++) {
>>>>>>> +                     struct folio *folio = page_folio(pages[i]);
>>>>>>> +                     unsigned int offset = start +
>>>>>>> +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
>>>>>>> +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
>>>>>>> +
>>>>>>> +                     len = min_t(unsigned int, len, PAGE_SIZE);
>>>>>>> +
>>>>>>> +                     ap->descs[ap->num_folios + i].offset = offset;
>>>>>>> +                     ap->descs[ap->num_folios + i].length = len;
>>>>>>> +                     ap->folios[i + ap->num_folios] = folio;
>>>>>>> +                     start = 0;
>>>>>>> +             }
>>>>>>>
>>>>>>>               ap->num_folios += nfolios;
>>>>>>>               ap->descs[ap->num_folios - 1].length -=
>>>>>> The problem persists with this patch.
>>>>>>
>>>> Malte, could you try Josef's patch except with that last line
>>>> "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
>>>> (PAGE_SIZE - 1);" also removed? I think we need that line removed as
>>>> well since that does a "-=" instead of a "=" and
>>>> ap->descs[ap->num_folios - 1].length gets set inside the for loop.
>>>>
>>>> In the meantime, I'll try to get a local repro running on fsx so that
>>>> you don't have to keep testing out repos for us.
>>> I was able to repro this locally by doing:
>>>
>>> -- start libfuse server --
>>> sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
>>>
>>> -- patch + compile this (rough / ugly-for-now) code snippet --
>>> diff --git a/ltp/fsx.c b/ltp/fsx.c
>>> index 777ba0de..9f040bc4 100644
>>> --- a/ltp/fsx.c
>>> +++ b/ltp/fsx.c
>>> @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
>>>         }
>>>  }
>>>
>>> -
>>> +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
>>>
>>>  void
>>>  domapwrite(unsigned offset, unsigned size)
>>>  {
>>> @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
>>>         unsigned map_size;
>>>         off_t    cur_filesize;
>>>         char    *p;
>>> +       int ret;
>>> +       unsigned size_2mib_aligned;
>>>
>>>         offset -= offset % writebdy;
>>>         if (size == 0) {
>>> @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
>>>         pg_offset = offset & PAGE_MASK;
>>>         map_size  = pg_offset + size;
>>>
>>> +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
>>> +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
>>> +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>> +       if (!placeholder_map) {
>>> +               prterr("domapwrite: placeholder map");
>>> +               exit(202);
>>> +       }
>>> +
>>> +       /* align address to nearest 2 MiB */
>>> +       void *aligned_address =
>>> +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
>>> ~(TWO_MIB - 1));
>>> +
>>> +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
>>> | PROT_WRITE,
>>> +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
>>> MAP_POPULATE, -1, 0);
>>> +
>>> +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
>>> +       if (ret) {
>>> +               prterr("domapwrite: madvise collapse");
>>> +               exit(203);
>>> +       }
>>> +
>>> +       memcpy(map, good_buf + offset, size);
>>> +
>>> +       if (lseek(fd, offset, SEEK_SET) == -1) {
>>> +               prterr("domapwrite: lseek");
>>> +               exit(204);
>>> +       }
>>> +
>>> +       ret = write(fd, map, size);
>>> +       if (ret == -1) {
>>> +               prterr("domapwrite: write");
>>> +               exit(205);
>>> +       }
>>> +
>>> +       /*
>>>         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
>>>                               MAP_FILE | MAP_SHARED, fd,
>>>                               (off_t)(offset - pg_offset))) == (char *)-1) {
>>> @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
>>>                 prterr("domapwrite: munmap");
>>>                 report_failure(204);
>>>         }
>>> +       */
>>> +       if (munmap(map, size_2mib_aligned) != 0) {
>>> +               prterr("domapwrite: munmap map");
>>> +               report_failure(206);
>>> +       }
>>> +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
>>> +               prterr("domapwrite: munmap placeholder_map");
>>> +               report_failure(207);
>>> +       }
>>>  }
>>>
>>> -- run fsx test --
>>> sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
>>>
>>> On the offending commit 3b97c3652, I'm seeing:
>>> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
>>> Will begin at operation 3
>>> Seed set to 1
>>> ...
>>> READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
>>> /home/user/fuse_mount/example.txt
>>> OFFSET      GOOD    BAD     RANGE
>>> 0x1e43f     0x4b4a  0x114a  0x0
>>> operation# (mod 256) for the bad data may be 74
>>> 0x1e441     0xa64a  0xeb4a  0x1
>>> operation# (mod 256) for the bad data may be 74
>>> 0x1e443     0x264a  0xe44a  0x2
>>> operation# (mod 256) for the bad data may be 74
>>> 0x1e445     0x254a  0x9e4a  0x3
>>> ...
>>> Correct content saved for comparison
>>> (maybe hexdump "/home/user/fuse_mount/example.txt" vs
>>> "/home/user/fuse_mount/example.txt.fsxgood")
>>>
>>>
>>> I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
>>> -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
>>> issue:
>>>
>>> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
>>> Will begin at operation 3
>>> Seed set to 1
>>> ...
>>> copying to largest ever: 0x3e19b
>>> copying to largest ever: 0x3e343
>>> fallocating to largest ever: 0x40000
>>> All 5000 operations completed A-OK!
>>>
>>>
>>> Malte, would you mind double-checking whether this fixes the issue
>>> you're seeing on your end?
>> My test still fails.
> Hi Malte,
>
> I simulated your repro with installing bcachefs as the root filesystem
> on a VM running Arch, then running "sudo pacman -S flatpak" and then
> installing FreeCAD with "flatpak install flathub org.freecad.FreeCAD".
>
> On base commit 3b97c3652, I see the same corruption you noted:
>
> error: Failed to install org.kde.Platform: Error pulling from repo:
> While pulling runtime/org.kde.Platform/x86_64/6.7 from remote flathub:
> fsck content object
> 886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33:
> Corrupted file object; checksum
> expected='886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33'
> actual='67f5a60d19f7a65e1ee272d455fed138b864be73399816ad18fa71319614a418'
>
>
> i tried this patch on top of commit 3b97c3652 and it fixed it for me:
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 14af8c41fc83..0d213a22972b 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1560,18 +1560,24 @@ static int fuse_get_user_pages(struct
> fuse_args_pages *ap, struct iov_iter *ii,
>
>                 nbytes += ret;
>
> -               ret += start;
> -               /* Currently, all folios in FUSE are one page */
> -               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> -
> -               ap->folio_descs[ap->num_folios].offset = start;
> -               fuse_folio_descs_length_init(ap->folio_descs,
> ap->num_folios, nfolios);
> -               for (i = 0; i < nfolios; i++)
> -                       ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> -
> -               ap->num_folios += nfolios;
> -               ap->folio_descs[ap->num_folios - 1].length -=
> -                       (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
> +               nfolios = DIV_ROUND_UP(ret + start, PAGE_SIZE);
> +
> +               for (i = 0; i < nfolios; i++) {
> +                       struct folio *folio = page_folio(pages[i]);
> +                       unsigned int offset = start +
> +                               (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> +                       unsigned int len = min_t(unsigned int, ret,
> folio_size(folio) - offset);
> +
> +                       len = min_t(unsigned int, len, PAGE_SIZE - start);
> +
> +                       ap->descs[ap->num_folios].offset = offset;
> +                       ap->descs[ap->num_folios].length = len;
> +                       ap->folios[ap->num_folios] = folio;
> +                       start = 0;
> +                       ret -= len;
> +                       ap->num_folios++;
> +               }
> +
>                 nr_pages += nfolios;
>         }
>         kfree(pages);
>
> I ran this multiple times and don't see the corruption issues.
>
> However, I do see another issue crop up on some of my VMs, which is
> flatpak hanging with this corresponding stack trace:
>
> [  368.520976] INFO: task pool-/usr/lib/f:582 blocked for more than 122 seconds.
> [  368.521509]       Not tainted 6.12.0-rc1-g86b74eb5a11e #734
> [  368.521905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  368.522483] task:pool-/usr/lib/f state:D stack:0     pid:582
> tgid:405   ppid:1      flags:0x00000002
> [  368.523238] Call Trace:
> [  368.523411]  <TASK>
> [  368.523548]  __schedule+0xaf0/0x27f0
> [  368.523773]  ? __pfx___schedule+0x10/0x10
> [  368.524024]  schedule+0x7e/0x300
> [  368.524233]  __wait_on_freeing_inode+0xda/0x120
> [  368.524527]  ? __pfx___wait_on_freeing_inode+0x10/0x10
> [  368.524830]  ? _raw_spin_unlock_irqrestore+0xe/0x30
> [  368.525124]  ? __pfx_wake_bit_function+0x10/0x10
> [  368.525411]  bch2_inode_hash_find+0x372/0x580
> [  368.525700]  ? __pfx_bch2_dirent_read_target+0x10/0x10
> [  368.526023]  ? bch2_dirent_hash+0x23d/0x370
> [  368.526286]  ? __pfx_bch2_inode_hash_find+0x10/0x10
> [  368.526583]  ? __pfx_dirent_cmp_key+0x10/0x10
> [  368.526972]  bch2_lookup_trans+0x61a/0x990
> [  368.527247]  ? __pfx_bch2_lookup_trans+0x10/0x10
> [  368.527559]  ? __do_sys_newfstatat+0x75/0xd0
> [  368.527850]  ? do_syscall_64+0x4b/0x110
> [  368.528088]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.528411]  ? __pfx_bch2_hash_info_init+0x10/0x10
> [  368.528720]  ? __mod_memcg_state+0x102/0x390
> [  368.529134]  ? obj_cgroup_charge+0x1b4/0x4c0
> [  368.529500]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
> [  368.529913]  ? kvm_sched_clock_read+0x11/0x20
> [  368.530265]  ? _raw_spin_lock+0x85/0xe0
> [  368.530553]  ? bch2_lookup_trans+0x323/0x990
> [  368.530909]  ? __asan_memset+0x24/0x50
> [  368.531204]  ? __bch2_trans_get+0x735/0xdd0
> [  368.531525]  ? bch2_lookup+0x18a/0x350
> [  368.531859]  bch2_lookup+0x18a/0x350
> [  368.532145]  ? __pfx_bch2_lookup+0x10/0x10
> [  368.532466]  ? __pfx_lockref_get_not_dead+0x10/0x10
> [  368.532860]  __lookup_slow+0x182/0x350
> [  368.533159]  ? __pfx___lookup_slow+0x10/0x10
> [  368.533513]  walk_component+0x2ab/0x4f0
> [  368.533816]  path_lookupat+0x120/0x660
> [  368.534110]  ? vfs_fstatat+0x53/0xb0
> [  368.534413]  filename_lookup+0x1aa/0x520
> [  368.534718]  ? __pfx_filename_lookup+0x10/0x10
> [  368.535052]  ? __wake_up_common+0xf2/0x170
> [  368.535385]  ? _raw_spin_unlock_irq+0xe/0x30
> [  368.535683]  ? __pfx_eventfd_write+0x10/0x10
> [  368.536019]  ? kasan_save_stack+0x24/0x50
> [  368.536313]  ? __kasan_record_aux_stack+0xad/0xc0
> [  368.536651]  ? kmem_cache_free+0x353/0x550
> [  368.536919]  ? do_syscall_64+0x4b/0x110
> [  368.537179]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.537528]  vfs_statx+0xbf/0x140
> [  368.537745]  ? kmem_cache_alloc_noprof+0x12d/0x340
> [  368.538057]  ? __pfx_vfs_statx+0x10/0x10
> [  368.538322]  ? getname_flags.part.0+0xaf/0x4a0
> [  368.538627]  vfs_fstatat+0x6c/0xb0
> [  368.538840]  __do_sys_newfstatat+0x75/0xd0
> [  368.539100]  ? __pfx___do_sys_newfstatat+0x10/0x10
> [  368.539415]  ? kasan_unpoison+0x27/0x60
> [  368.539675]  ? __pfx_slab_free_after_rcu_debug+0x10/0x10
> [  368.540081]  ? fdget_pos+0x366/0x5d0
> [  368.540355]  ? fput+0x1b/0x2d0
> [  368.540602]  ? ksys_write+0x18c/0x1c0
> [  368.540934]  ? __pfx_ksys_write+0x10/0x10
> [  368.541252]  ? fpregs_assert_state_consistent+0x20/0xa0
> [  368.541657]  ? clear_bhb_loop+0x45/0xa0
> [  368.541948]  do_syscall_64+0x4b/0x110
> [  368.542270]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.542711] RIP: 0033:0x7fef789bb94e
> [  368.543001] RSP: 002b:00007fef713fe468 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000106
> [  368.543605] RAX: ffffffffffffffda RBX: 00007fef713fe5d0 RCX: 00007fef789bb94e
> [  368.544129] RDX: 00007fef713fe540 RSI: 00007fef713fe5d0 RDI: 000000000000001f
> [  368.544636] RBP: 00007fef713ff710 R08: 0000000000000073 R09: 0000000000000000
> [  368.545130] R10: 0000000000000100 R11: 0000000000000246 R12: 000000000000001f
> [  368.545640] R13: 00007fef713fe540 R14: 00007fef5802dc00 R15: 00007fef713ff780
> [  368.546177]  </TASK>
>
> I'm seeing this happen on commit 86b74eb5 as well, which is the commit
> that's before any of my or Josef's folio changes, and this persists
> across reboots. I've seen this happen for a couple other tasks too,
> "task:flatpak" and "task:pool-flatpak" with similar stack traces eg
>
> [  368.505270] task:pool-flatpak in state:D stack:0     pid:568
> tgid:538   ppid:537    flags:0x00000002
> [  368.505872] Call Trace:
> [  368.506042]  <TASK>
> [  368.506191]  __schedule+0xaf0/0x27f0
> [  368.506430]  ? __pfx___schedule+0x10/0x10
> [  368.506715]  schedule+0x7e/0x300
> [  368.506940]  __wait_on_freeing_inode+0xda/0x120
> [  368.507490]  ? __pfx___wait_on_freeing_inode+0x10/0x10
> [  368.507965]  ? _raw_spin_unlock_irqrestore+0xe/0x30
> [  368.508312]  ? __pfx_wake_bit_function+0x10/0x10
> [  368.508650]  bch2_inode_hash_find+0x372/0x580
> [  368.508969]  ? __pfx_bch2_dirent_read_target+0x10/0x10
> [  368.509332]  ? bch2_dirent_hash+0x23d/0x370
> [  368.509637]  ? __pfx_bch2_inode_hash_find+0x10/0x10
> [  368.509991]  ? __pfx_dirent_cmp_key+0x10/0x10
> [  368.510309]  bch2_lookup_trans+0x61a/0x990
> [  368.510662]  ? __pfx_bch2_lookup_trans+0x10/0x10
> [  368.511155]  ? vfs_statx+0xbf/0x140
> [  368.511429]  ? vfs_fstatat+0x6c/0xb0
> [  368.511731]  ? do_syscall_64+0x4b/0x110
> [  368.512014]  ? __d_alloc+0x5cc/0x8f0
> [  368.512283]  ? memcg_list_lru_alloc+0x184/0x8a0
> [  368.512612]  ? __pfx_bch2_hash_info_init+0x10/0x10
> [  368.512943]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.513327]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
> [  368.513753]  ? kvm_sched_clock_read+0x11/0x20
> [  368.514133]  ? _raw_spin_lock+0x85/0xe0
> [  368.515095]  ? bch2_lookup_trans+0x323/0x990
> [  368.516994]  ? __asan_memset+0x24/0x50
> [  368.517294]  ? __bch2_trans_get+0x735/0xdd0
> [  368.517675]  ? bch2_lookup+0x18a/0x350
> [  368.517987]  bch2_lookup+0x18a/0x350
> [  368.518271]  ? __pfx_bch2_lookup+0x10/0x10
> [  368.518568]  ? __pfx_lockref_get_not_dead+0x10/0x10
> [  368.518933]  __lookup_slow+0x182/0x350
> [  368.519264]  ? __pfx___lookup_slow+0x10/0x10
> [  368.519633]  walk_component+0x2ab/0x4f0
> [  368.519934]  ? fput+0x1b/0x2d0
> [  368.520206]  link_path_walk.part.0.constprop.0+0x4ad/0xac0
> [  368.520657]  path_lookupat+0x72/0x660
> [  368.521022]  ? vfs_fstatat+0x53/0xb0
> [  368.521343]  filename_lookup+0x1aa/0x520
> [  368.521664]  ? __pfx_filename_lookup+0x10/0x10
> [  368.522021]  ? __pfx_xa_load+0x10/0x10
> [  368.522319]  ? ___slab_alloc+0x128/0x9d0
> [  368.522683]  ? mntput_no_expire+0xcf/0x760
> [  368.523018]  ? lockref_put_return+0xc7/0x140
> [  368.523357]  ? kmem_cache_free+0x19e/0x550
> [  368.523686]  ? __pfx_mutex_unlock+0x10/0x10
> [  368.524035]  ? do_renameat2+0x1f4/0xa50
> [  368.524404]  ? do_renameat2+0x1f4/0xa50
> [  368.524716]  vfs_statx+0xbf/0x140
> [  368.524980]  ? kmem_cache_alloc_noprof+0x12d/0x340
> [  368.525355]  ? __pfx_vfs_statx+0x10/0x10
> [  368.525702]  ? getname_flags.part.0+0xaf/0x4a0
> [  368.526080]  vfs_fstatat+0x6c/0xb0
> [  368.526391]  __do_sys_newfstatat+0x75/0xd0
> [  368.526745]  ? __pfx___do_sys_newfstatat+0x10/0x10
> [  368.527206]  ? do_symlinkat+0x15d/0x260
> [  368.527509]  ? do_mkdirat+0x19d/0x2c0
> [  368.527807]  ? kasan_save_track+0x14/0x30
> [  368.528132]  ? getname_flags.part.0+0xaf/0x4a0
> [  368.528509]  ? fpregs_assert_state_consistent+0x20/0xa0
> [  368.529198]  ? clear_bhb_loop+0x45/0xa0
> [  368.529536]  do_syscall_64+0x4b/0x110
> [  368.529856]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  368.530279] RIP: 0033:0x7f592256194e
> [  368.530610] RSP: 002b:00007f5915bfe548 EFLAGS: 00000217 ORIG_RAX:
> 0000000000000106
> [  368.531350] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f592256194e
> [  368.531943] RDX: 00007f5915bfe570 RSI: 00007f5915bfe600 RDI: 0000000000000013
> [  368.532491] RBP: 00007f5915bfe740 R08: 0000000000000073 R09: 0000000000000000
> [  368.533048] R10: 0000000000000100 R11: 0000000000000217 R12: 00007f5915bfe570
> [  368.533597] R13: 00007f5915bfe600 R14: 00007f5915bfe56c R15: 00007f5915bfe780
> [  368.534236]  </TASK>
> [  368.534419] INFO: task pool-flatpak in:571 blocked for more than 122 seconds.
>
> This seems like something we should investigate as well. I'm happy to
> help repro this if needed. I'm able to hit this pretty consistently.
>
> Malte, would you mind applying the patch above and confirming if this
> fixes it for you on your VM? And while you've been running flatpak,
> have you also been seeing some flatpak tasks hanging on some of your
> VMs?

This patch also passes my test :) I did not notice any hanging installs.

I think I will try it in addition with Kent's current code on my
workstation later. I will come back to you if I encounter further
issues, thanks :)


/Malte

>
> Thanks,
> Joanne
>
>>
>>>
>>> Thanks,
>>> Joanne
>>>
>>>> Thanks,
>>>> Joanne
>>>>> Catching up on this thread now. I'll investigate this today.
>>>>>
>>>>>> /Malte
>>>>>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-11 14:01                           ` Malte Schröder
@ 2024-12-11 19:43                             ` Joanne Koong
  0 siblings, 0 replies; 31+ messages in thread
From: Joanne Koong @ 2024-12-11 19:43 UTC (permalink / raw)
  To: Malte Schröder
  Cc: Josef Bacik, Matthew Wilcox, Kent Overstreet, Miklos Szeredi,
	linux-fsdevel

On Wed, Dec 11, 2024 at 6:01 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
>
> On 11/12/2024 03:57, Joanne Koong wrote:
> > On Tue, Dec 10, 2024 at 9:53 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> >> On 10/12/2024 06:14, Joanne Koong wrote:
> >>> On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>>> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>>>> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> >>>>>> On 09/12/2024 16:48, Josef Bacik wrote:
> >>>>>>> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> >>>>>>>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> >>>>>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> >>>>>>>>>> with 3b97c3652d91 as the culprit.
> >>>>>>>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> >>>>>>>>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> >>>>>>>>> got a hugepage in, we'd get each individual struct page back for the whole range
> >>>>>>>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> >>>>>>>>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> >>>>>>>>> PAGE_SIZE chunks at a time.
> >>>>>>>>>
> >>>>>>>>> But now we're doing this
> >>>>>>>>>
> >>>>>>>>>     for (i = 0; i < nfolios; i++)
> >>>>>>>>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> >>>>>>>>>
> >>>>>>>>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> >>>>>>>>> intermediary struct page's would return the same folio, correct?  So we'd end up
> >>>>>>>>> with the wrong offsets for our fuse request, because they should be based from
> >>>>>>>>> the start of the folio, correct?
> >>>>>>>> I think you're 100% right.  We could put in some nice asserts to check
> >>>>>>>> this is what's happening, but it does seem like a rather incautious
> >>>>>>>> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> >>>>>>>> that's not guaranteed to be the case for folios found in userspace for
> >>>>>>>> directio.  At least the comment is wrong, and I'd suggest the code is too.
> >>>>>>> Ok cool, Malte can you try the attached only compile tested patch and see if the
> >>>>>>> problem goes away?  Thanks,
> >>>>>>>
> >>>>>>> Josef
> >>>>>>>
> >>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> >>>>>>> index 88d0946b5bc9..c4b93ead99a5 100644
> >>>>>>> --- a/fs/fuse/file.c
> >>>>>>> +++ b/fs/fuse/file.c
> >>>>>>> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> >>>>>>>               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> >>>>>>>
> >>>>>>>               ap->descs[ap->num_folios].offset = start;
> >>>>>>> -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> >>>>>>> -             for (i = 0; i < nfolios; i++)
> >>>>>>> -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> >>>>>>> +             for (i = 0; i < nfolios; i++) {
> >>>>>>> +                     struct folio *folio = page_folio(pages[i]);
> >>>>>>> +                     unsigned int offset = start +
> >>>>>>> +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> >>>>>>> +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> >>>>>>> +
> >>>>>>> +                     len = min_t(unsigned int, len, PAGE_SIZE);
> >>>>>>> +
> >>>>>>> +                     ap->descs[ap->num_folios + i].offset = offset;
> >>>>>>> +                     ap->descs[ap->num_folios + i].length = len;
> >>>>>>> +                     ap->folios[i + ap->num_folios] = folio;
> >>>>>>> +                     start = 0;
> >>>>>>> +             }
> >>>>>>>
> >>>>>>>               ap->num_folios += nfolios;
> >>>>>>>               ap->descs[ap->num_folios - 1].length -=
> >>>>>> The problem persists with this patch.
> >>>>>>
> >>>> Malte, could you try Josef's patch except with that last line
> >>>> "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
> >>>> (PAGE_SIZE - 1);" also removed? I think we need that line removed as
> >>>> well since that does a "-=" instead of a "=" and
> >>>> ap->descs[ap->num_folios - 1].length gets set inside the for loop.
> >>>>
> >>>> In the meantime, I'll try to get a local repro running on fsx so that
> >>>> you don't have to keep testing out repos for us.
> >>> I was able to repro this locally by doing:
> >>>
> >>> -- start libfuse server --
> >>> sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
> >>>
> >>> -- patch + compile this (rough / ugly-for-now) code snippet --
> >>> diff --git a/ltp/fsx.c b/ltp/fsx.c
> >>> index 777ba0de..9f040bc4 100644
> >>> --- a/ltp/fsx.c
> >>> +++ b/ltp/fsx.c
> >>> @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
> >>>         }
> >>>  }
> >>>
> >>> -
> >>> +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
> >>>
> >>>  void
> >>>  domapwrite(unsigned offset, unsigned size)
> >>>  {
> >>> @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
> >>>         unsigned map_size;
> >>>         off_t    cur_filesize;
> >>>         char    *p;
> >>> +       int ret;
> >>> +       unsigned size_2mib_aligned;
> >>>
> >>>         offset -= offset % writebdy;
> >>>         if (size == 0) {
> >>> @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
> >>>         pg_offset = offset & PAGE_MASK;
> >>>         map_size  = pg_offset + size;
> >>>
> >>> +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
> >>> +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
> >>> +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >>> +       if (!placeholder_map) {
> >>> +               prterr("domapwrite: placeholder map");
> >>> +               exit(202);
> >>> +       }
> >>> +
> >>> +       /* align address to nearest 2 MiB */
> >>> +       void *aligned_address =
> >>> +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
> >>> ~(TWO_MIB - 1));
> >>> +
> >>> +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
> >>> | PROT_WRITE,
> >>> +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
> >>> MAP_POPULATE, -1, 0);
> >>> +
> >>> +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
> >>> +       if (ret) {
> >>> +               prterr("domapwrite: madvise collapse");
> >>> +               exit(203);
> >>> +       }
> >>> +
> >>> +       memcpy(map, good_buf + offset, size);
> >>> +
> >>> +       if (lseek(fd, offset, SEEK_SET) == -1) {
> >>> +               prterr("domapwrite: lseek");
> >>> +               exit(204);
> >>> +       }
> >>> +
> >>> +       ret = write(fd, map, size);
> >>> +       if (ret == -1) {
> >>> +               prterr("domapwrite: write");
> >>> +               exit(205);
> >>> +       }
> >>> +
> >>> +       /*
> >>>         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
> >>>                               MAP_FILE | MAP_SHARED, fd,
> >>>                               (off_t)(offset - pg_offset))) == (char *)-1) {
> >>> @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
> >>>                 prterr("domapwrite: munmap");
> >>>                 report_failure(204);
> >>>         }
> >>> +       */
> >>> +       if (munmap(map, size_2mib_aligned) != 0) {
> >>> +               prterr("domapwrite: munmap map");
> >>> +               report_failure(206);
> >>> +       }
> >>> +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
> >>> +               prterr("domapwrite: munmap placeholder_map");
> >>> +               report_failure(207);
> >>> +       }
> >>>  }
> >>>
> >>> -- run fsx test --
> >>> sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> >>>
> >>> On the offending commit 3b97c3652, I'm seeing:
> >>> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> >>> Will begin at operation 3
> >>> Seed set to 1
> >>> ...
> >>> READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
> >>> /home/user/fuse_mount/example.txt
> >>> OFFSET      GOOD    BAD     RANGE
> >>> 0x1e43f     0x4b4a  0x114a  0x0
> >>> operation# (mod 256) for the bad data may be 74
> >>> 0x1e441     0xa64a  0xeb4a  0x1
> >>> operation# (mod 256) for the bad data may be 74
> >>> 0x1e443     0x264a  0xe44a  0x2
> >>> operation# (mod 256) for the bad data may be 74
> >>> 0x1e445     0x254a  0x9e4a  0x3
> >>> ...
> >>> Correct content saved for comparison
> >>> (maybe hexdump "/home/user/fuse_mount/example.txt" vs
> >>> "/home/user/fuse_mount/example.txt.fsxgood")
> >>>
> >>>
> >>> I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
> >>> -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
> >>> issue:
> >>>
> >>> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> >>> Will begin at operation 3
> >>> Seed set to 1
> >>> ...
> >>> copying to largest ever: 0x3e19b
> >>> copying to largest ever: 0x3e343
> >>> fallocating to largest ever: 0x40000
> >>> All 5000 operations completed A-OK!
> >>>
> >>>
> >>> Malte, would you mind double-checking whether this fixes the issue
> >>> you're seeing on your end?
> >> My test still fails.
> > Hi Malte,
> >
> > I simulated your repro with installing bcachefs as the root filesystem
> > on a VM running Arch, then running "sudo pacman -S flatpak" and then
> > installing FreeCAD with "flatpak install flathub org.freecad.FreeCAD".
> >
> > On base commit 3b97c3652, I see the same corruption you noted:
> >
> > error: Failed to install org.kde.Platform: Error pulling from repo:
> > While pulling runtime/org.kde.Platform/x86_64/6.7 from remote flathub:
> > fsck content object
> > 886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33:
> > Corrupted file object; checksum
> > expected='886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33'
> > actual='67f5a60d19f7a65e1ee272d455fed138b864be73399816ad18fa71319614a418'
> >
> >
> > i tried this patch on top of commit 3b97c3652 and it fixed it for me:
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 14af8c41fc83..0d213a22972b 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1560,18 +1560,24 @@ static int fuse_get_user_pages(struct
> > fuse_args_pages *ap, struct iov_iter *ii,
> >
> >                 nbytes += ret;
> >
> > -               ret += start;
> > -               /* Currently, all folios in FUSE are one page */
> > -               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> > -
> > -               ap->folio_descs[ap->num_folios].offset = start;
> > -               fuse_folio_descs_length_init(ap->folio_descs,
> > ap->num_folios, nfolios);
> > -               for (i = 0; i < nfolios; i++)
> > -                       ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > -
> > -               ap->num_folios += nfolios;
> > -               ap->folio_descs[ap->num_folios - 1].length -=
> > -                       (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
> > +               nfolios = DIV_ROUND_UP(ret + start, PAGE_SIZE);
> > +
> > +               for (i = 0; i < nfolios; i++) {
> > +                       struct folio *folio = page_folio(pages[i]);
> > +                       unsigned int offset = start +
> > +                               (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > +                       unsigned int len = min_t(unsigned int, ret,
> > folio_size(folio) - offset);
> > +
> > +                       len = min_t(unsigned int, len, PAGE_SIZE - start);
> > +
> > +                       ap->descs[ap->num_folios].offset = offset;
> > +                       ap->descs[ap->num_folios].length = len;
> > +                       ap->folios[ap->num_folios] = folio;
> > +                       start = 0;
> > +                       ret -= len;
> > +                       ap->num_folios++;
> > +               }
> > +
> >                 nr_pages += nfolios;
> >         }
> >         kfree(pages);
> >
> > I ran this multiple times and don't see the corruption issues.
> >
> > However, I do see another issue crop up on some of my VMs, which is
> > flatpak hanging with this corresponding stack trace:
> >
> > [  368.520976] INFO: task pool-/usr/lib/f:582 blocked for more than 122 seconds.
> > [  368.521509]       Not tainted 6.12.0-rc1-g86b74eb5a11e #734
> > [  368.521905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  368.522483] task:pool-/usr/lib/f state:D stack:0     pid:582
> > tgid:405   ppid:1      flags:0x00000002
> > [  368.523238] Call Trace:
> > [  368.523411]  <TASK>
> > [  368.523548]  __schedule+0xaf0/0x27f0
> > [  368.523773]  ? __pfx___schedule+0x10/0x10
> > [  368.524024]  schedule+0x7e/0x300
> > [  368.524233]  __wait_on_freeing_inode+0xda/0x120
> > [  368.524527]  ? __pfx___wait_on_freeing_inode+0x10/0x10
> > [  368.524830]  ? _raw_spin_unlock_irqrestore+0xe/0x30
> > [  368.525124]  ? __pfx_wake_bit_function+0x10/0x10
> > [  368.525411]  bch2_inode_hash_find+0x372/0x580
> > [  368.525700]  ? __pfx_bch2_dirent_read_target+0x10/0x10
> > [  368.526023]  ? bch2_dirent_hash+0x23d/0x370
> > [  368.526286]  ? __pfx_bch2_inode_hash_find+0x10/0x10
> > [  368.526583]  ? __pfx_dirent_cmp_key+0x10/0x10
> > [  368.526972]  bch2_lookup_trans+0x61a/0x990
> > [  368.527247]  ? __pfx_bch2_lookup_trans+0x10/0x10
> > [  368.527559]  ? __do_sys_newfstatat+0x75/0xd0
> > [  368.527850]  ? do_syscall_64+0x4b/0x110
> > [  368.528088]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.528411]  ? __pfx_bch2_hash_info_init+0x10/0x10
> > [  368.528720]  ? __mod_memcg_state+0x102/0x390
> > [  368.529134]  ? obj_cgroup_charge+0x1b4/0x4c0
> > [  368.529500]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
> > [  368.529913]  ? kvm_sched_clock_read+0x11/0x20
> > [  368.530265]  ? _raw_spin_lock+0x85/0xe0
> > [  368.530553]  ? bch2_lookup_trans+0x323/0x990
> > [  368.530909]  ? __asan_memset+0x24/0x50
> > [  368.531204]  ? __bch2_trans_get+0x735/0xdd0
> > [  368.531525]  ? bch2_lookup+0x18a/0x350
> > [  368.531859]  bch2_lookup+0x18a/0x350
> > [  368.532145]  ? __pfx_bch2_lookup+0x10/0x10
> > [  368.532466]  ? __pfx_lockref_get_not_dead+0x10/0x10
> > [  368.532860]  __lookup_slow+0x182/0x350
> > [  368.533159]  ? __pfx___lookup_slow+0x10/0x10
> > [  368.533513]  walk_component+0x2ab/0x4f0
> > [  368.533816]  path_lookupat+0x120/0x660
> > [  368.534110]  ? vfs_fstatat+0x53/0xb0
> > [  368.534413]  filename_lookup+0x1aa/0x520
> > [  368.534718]  ? __pfx_filename_lookup+0x10/0x10
> > [  368.535052]  ? __wake_up_common+0xf2/0x170
> > [  368.535385]  ? _raw_spin_unlock_irq+0xe/0x30
> > [  368.535683]  ? __pfx_eventfd_write+0x10/0x10
> > [  368.536019]  ? kasan_save_stack+0x24/0x50
> > [  368.536313]  ? __kasan_record_aux_stack+0xad/0xc0
> > [  368.536651]  ? kmem_cache_free+0x353/0x550
> > [  368.536919]  ? do_syscall_64+0x4b/0x110
> > [  368.537179]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.537528]  vfs_statx+0xbf/0x140
> > [  368.537745]  ? kmem_cache_alloc_noprof+0x12d/0x340
> > [  368.538057]  ? __pfx_vfs_statx+0x10/0x10
> > [  368.538322]  ? getname_flags.part.0+0xaf/0x4a0
> > [  368.538627]  vfs_fstatat+0x6c/0xb0
> > [  368.538840]  __do_sys_newfstatat+0x75/0xd0
> > [  368.539100]  ? __pfx___do_sys_newfstatat+0x10/0x10
> > [  368.539415]  ? kasan_unpoison+0x27/0x60
> > [  368.539675]  ? __pfx_slab_free_after_rcu_debug+0x10/0x10
> > [  368.540081]  ? fdget_pos+0x366/0x5d0
> > [  368.540355]  ? fput+0x1b/0x2d0
> > [  368.540602]  ? ksys_write+0x18c/0x1c0
> > [  368.540934]  ? __pfx_ksys_write+0x10/0x10
> > [  368.541252]  ? fpregs_assert_state_consistent+0x20/0xa0
> > [  368.541657]  ? clear_bhb_loop+0x45/0xa0
> > [  368.541948]  do_syscall_64+0x4b/0x110
> > [  368.542270]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.542711] RIP: 0033:0x7fef789bb94e
> > [  368.543001] RSP: 002b:00007fef713fe468 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000106
> > [  368.543605] RAX: ffffffffffffffda RBX: 00007fef713fe5d0 RCX: 00007fef789bb94e
> > [  368.544129] RDX: 00007fef713fe540 RSI: 00007fef713fe5d0 RDI: 000000000000001f
> > [  368.544636] RBP: 00007fef713ff710 R08: 0000000000000073 R09: 0000000000000000
> > [  368.545130] R10: 0000000000000100 R11: 0000000000000246 R12: 000000000000001f
> > [  368.545640] R13: 00007fef713fe540 R14: 00007fef5802dc00 R15: 00007fef713ff780
> > [  368.546177]  </TASK>
> >
> > I'm seeing this happen on commit 86b74eb5 as well, which is the commit
> > that's before any of my or Josef's folio changes, and this persists
> > across reboots. I've seen this happen for a couple other tasks too,
> > "task:flatpak" and "task:pool-flatpak" with similar stack traces eg
> >
> > [  368.505270] task:pool-flatpak in state:D stack:0     pid:568
> > tgid:538   ppid:537    flags:0x00000002
> > [  368.505872] Call Trace:
> > [  368.506042]  <TASK>
> > [  368.506191]  __schedule+0xaf0/0x27f0
> > [  368.506430]  ? __pfx___schedule+0x10/0x10
> > [  368.506715]  schedule+0x7e/0x300
> > [  368.506940]  __wait_on_freeing_inode+0xda/0x120
> > [  368.507490]  ? __pfx___wait_on_freeing_inode+0x10/0x10
> > [  368.507965]  ? _raw_spin_unlock_irqrestore+0xe/0x30
> > [  368.508312]  ? __pfx_wake_bit_function+0x10/0x10
> > [  368.508650]  bch2_inode_hash_find+0x372/0x580
> > [  368.508969]  ? __pfx_bch2_dirent_read_target+0x10/0x10
> > [  368.509332]  ? bch2_dirent_hash+0x23d/0x370
> > [  368.509637]  ? __pfx_bch2_inode_hash_find+0x10/0x10
> > [  368.509991]  ? __pfx_dirent_cmp_key+0x10/0x10
> > [  368.510309]  bch2_lookup_trans+0x61a/0x990
> > [  368.510662]  ? __pfx_bch2_lookup_trans+0x10/0x10
> > [  368.511155]  ? vfs_statx+0xbf/0x140
> > [  368.511429]  ? vfs_fstatat+0x6c/0xb0
> > [  368.511731]  ? do_syscall_64+0x4b/0x110
> > [  368.512014]  ? __d_alloc+0x5cc/0x8f0
> > [  368.512283]  ? memcg_list_lru_alloc+0x184/0x8a0
> > [  368.512612]  ? __pfx_bch2_hash_info_init+0x10/0x10
> > [  368.512943]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.513327]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
> > [  368.513753]  ? kvm_sched_clock_read+0x11/0x20
> > [  368.514133]  ? _raw_spin_lock+0x85/0xe0
> > [  368.515095]  ? bch2_lookup_trans+0x323/0x990
> > [  368.516994]  ? __asan_memset+0x24/0x50
> > [  368.517294]  ? __bch2_trans_get+0x735/0xdd0
> > [  368.517675]  ? bch2_lookup+0x18a/0x350
> > [  368.517987]  bch2_lookup+0x18a/0x350
> > [  368.518271]  ? __pfx_bch2_lookup+0x10/0x10
> > [  368.518568]  ? __pfx_lockref_get_not_dead+0x10/0x10
> > [  368.518933]  __lookup_slow+0x182/0x350
> > [  368.519264]  ? __pfx___lookup_slow+0x10/0x10
> > [  368.519633]  walk_component+0x2ab/0x4f0
> > [  368.519934]  ? fput+0x1b/0x2d0
> > [  368.520206]  link_path_walk.part.0.constprop.0+0x4ad/0xac0
> > [  368.520657]  path_lookupat+0x72/0x660
> > [  368.521022]  ? vfs_fstatat+0x53/0xb0
> > [  368.521343]  filename_lookup+0x1aa/0x520
> > [  368.521664]  ? __pfx_filename_lookup+0x10/0x10
> > [  368.522021]  ? __pfx_xa_load+0x10/0x10
> > [  368.522319]  ? ___slab_alloc+0x128/0x9d0
> > [  368.522683]  ? mntput_no_expire+0xcf/0x760
> > [  368.523018]  ? lockref_put_return+0xc7/0x140
> > [  368.523357]  ? kmem_cache_free+0x19e/0x550
> > [  368.523686]  ? __pfx_mutex_unlock+0x10/0x10
> > [  368.524035]  ? do_renameat2+0x1f4/0xa50
> > [  368.524404]  ? do_renameat2+0x1f4/0xa50
> > [  368.524716]  vfs_statx+0xbf/0x140
> > [  368.524980]  ? kmem_cache_alloc_noprof+0x12d/0x340
> > [  368.525355]  ? __pfx_vfs_statx+0x10/0x10
> > [  368.525702]  ? getname_flags.part.0+0xaf/0x4a0
> > [  368.526080]  vfs_fstatat+0x6c/0xb0
> > [  368.526391]  __do_sys_newfstatat+0x75/0xd0
> > [  368.526745]  ? __pfx___do_sys_newfstatat+0x10/0x10
> > [  368.527206]  ? do_symlinkat+0x15d/0x260
> > [  368.527509]  ? do_mkdirat+0x19d/0x2c0
> > [  368.527807]  ? kasan_save_track+0x14/0x30
> > [  368.528132]  ? getname_flags.part.0+0xaf/0x4a0
> > [  368.528509]  ? fpregs_assert_state_consistent+0x20/0xa0
> > [  368.529198]  ? clear_bhb_loop+0x45/0xa0
> > [  368.529536]  do_syscall_64+0x4b/0x110
> > [  368.529856]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.530279] RIP: 0033:0x7f592256194e
> > [  368.530610] RSP: 002b:00007f5915bfe548 EFLAGS: 00000217 ORIG_RAX:
> > 0000000000000106
> > [  368.531350] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f592256194e
> > [  368.531943] RDX: 00007f5915bfe570 RSI: 00007f5915bfe600 RDI: 0000000000000013
> > [  368.532491] RBP: 00007f5915bfe740 R08: 0000000000000073 R09: 0000000000000000
> > [  368.533048] R10: 0000000000000100 R11: 0000000000000217 R12: 00007f5915bfe570
> > [  368.533597] R13: 00007f5915bfe600 R14: 00007f5915bfe56c R15: 00007f5915bfe780
> > [  368.534236]  </TASK>
> > [  368.534419] INFO: task pool-flatpak in:571 blocked for more than 122 seconds.
> >
> > This seems like something we should investigate as well. I'm happy to
> > help repro this if needed. I'm able to hit this pretty consistently.
> >
> > Malte, would you mind applying the patch above and confirming if this
> > fixes it for you on your VM? And while you've been running flatpak,
> > have you also been seeing some flatpak tasks hanging on some of your
> > VMs?
>
> This patch also passes my test :) I did not notice any hanging installs.
>
> I think I will try it in addition with Kent's current code on my
> workstation later. I will come back to you if I encounter further
> issues, thanks :)
>

Great, thanks for checking and for surfacing this bug. I'll submit the
fix upstream today (will cc you and the others on this thread).

I'll add a case for this in the fstests as well so that hopefully this
kind of issue can be caught earlier.

>
> /Malte
>
> >
> > Thanks,
> > Joanne
> >
> >>
> >>>
> >>> Thanks,
> >>> Joanne
> >>>
> >>>> Thanks,
> >>>> Joanne
> >>>>> Catching up on this thread now. I'll investigate this today.
> >>>>>
> >>>>>> /Malte
> >>>>>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: silent data corruption in fuse in rc1
  2024-12-11  6:43                           ` Kent Overstreet
@ 2024-12-11 20:32                             ` Joanne Koong
  0 siblings, 0 replies; 31+ messages in thread
From: Joanne Koong @ 2024-12-11 20:32 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Malte Schröder, Josef Bacik, Matthew Wilcox, Miklos Szeredi,
	linux-fsdevel

On Tue, Dec 10, 2024 at 10:43 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Tue, Dec 10, 2024 at 06:57:41PM -0800, Joanne Koong wrote:
> > On Tue, Dec 10, 2024 at 9:53 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> > >
> > > On 10/12/2024 06:14, Joanne Koong wrote:
> > > > On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >>> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@tnxip.de> wrote:
> > > >>>> On 09/12/2024 16:48, Josef Bacik wrote:
> > > >>>>> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> > > >>>>>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> > > >>>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> > > >>>>>>>> with 3b97c3652d91 as the culprit.
> > > >>>>>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> > > >>>>>>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> > > >>>>>>> got a hugepage in, we'd get each individual struct page back for the whole range
> > > >>>>>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> > > >>>>>>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> > > >>>>>>> PAGE_SIZE chunks at a time.
> > > >>>>>>>
> > > >>>>>>> But now we're doing this
> > > >>>>>>>
> > > >>>>>>>     for (i = 0; i < nfolios; i++)
> > > >>>>>>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > >>>>>>>
> > > >>>>>>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> > > >>>>>>> intermediary struct page's would return the same folio, correct?  So we'd end up
> > > >>>>>>> with the wrong offsets for our fuse request, because they should be based from
> > > >>>>>>> the start of the folio, correct?
> > > >>>>>> I think you're 100% right.  We could put in some nice asserts to check
> > > >>>>>> this is what's happening, but it does seem like a rather incautious
> > > >>>>>> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> > > >>>>>> that's not guaranteed to be the case for folios found in userspace for
> > > >>>>>> directio.  At least the comment is wrong, and I'd suggest the code is too.
> > > >>>>> Ok cool, Malte can you try the attached only compile tested patch and see if the
> > > >>>>> problem goes away?  Thanks,
> > > >>>>>
> > > >>>>> Josef
> > > >>>>>
> > > >>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > >>>>> index 88d0946b5bc9..c4b93ead99a5 100644
> > > >>>>> --- a/fs/fuse/file.c
> > > >>>>> +++ b/fs/fuse/file.c
> > > >>>>> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> > > >>>>>               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> > > >>>>>
> > > >>>>>               ap->descs[ap->num_folios].offset = start;
> > > >>>>> -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> > > >>>>> -             for (i = 0; i < nfolios; i++)
> > > >>>>> -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > >>>>> +             for (i = 0; i < nfolios; i++) {
> > > >>>>> +                     struct folio *folio = page_folio(pages[i]);
> > > >>>>> +                     unsigned int offset = start +
> > > >>>>> +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > > >>>>> +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> > > >>>>> +
> > > >>>>> +                     len = min_t(unsigned int, len, PAGE_SIZE);
> > > >>>>> +
> > > >>>>> +                     ap->descs[ap->num_folios + i].offset = offset;
> > > >>>>> +                     ap->descs[ap->num_folios + i].length = len;
> > > >>>>> +                     ap->folios[i + ap->num_folios] = folio;
> > > >>>>> +                     start = 0;
> > > >>>>> +             }
> > > >>>>>
> > > >>>>>               ap->num_folios += nfolios;
> > > >>>>>               ap->descs[ap->num_folios - 1].length -=
> > > >>>> The problem persists with this patch.
> > > >>>>
> > > >> Malte, could you try Josef's patch except with that last line
> > > >> "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
> > > >> (PAGE_SIZE - 1);" also removed? I think we need that line removed as
> > > >> well since that does a "-=" instead of a "=" and
> > > >> ap->descs[ap->num_folios - 1].length gets set inside the for loop.
> > > >>
> > > >> In the meantime, I'll try to get a local repro running on fsx so that
> > > >> you don't have to keep testing out repos for us.
> > > > I was able to repro this locally by doing:
> > > >
> > > > -- start libfuse server --
> > > > sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
> > > >
> > > > -- patch + compile this (rough / ugly-for-now) code snippet --
> > > > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > > > index 777ba0de..9f040bc4 100644
> > > > --- a/ltp/fsx.c
> > > > +++ b/ltp/fsx.c
> > > > @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
> > > >         }
> > > >  }
> > > >
> > > > -
> > > > +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
> > > >
> > > >  void
> > > >  domapwrite(unsigned offset, unsigned size)
> > > >  {
> > > > @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
> > > >         unsigned map_size;
> > > >         off_t    cur_filesize;
> > > >         char    *p;
> > > > +       int ret;
> > > > +       unsigned size_2mib_aligned;
> > > >
> > > >         offset -= offset % writebdy;
> > > >         if (size == 0) {
> > > > @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
> > > >         pg_offset = offset & PAGE_MASK;
> > > >         map_size  = pg_offset + size;
> > > >
> > > > +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
> > > > +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
> > > > +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > +       if (!placeholder_map) {
> > > > +               prterr("domapwrite: placeholder map");
> > > > +               exit(202);
> > > > +       }
> > > > +
> > > > +       /* align address to nearest 2 MiB */
> > > > +       void *aligned_address =
> > > > +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
> > > > ~(TWO_MIB - 1));
> > > > +
> > > > +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
> > > > | PROT_WRITE,
> > > > +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
> > > > MAP_POPULATE, -1, 0);
> > > > +
> > > > +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
> > > > +       if (ret) {
> > > > +               prterr("domapwrite: madvise collapse");
> > > > +               exit(203);
> > > > +       }
> > > > +
> > > > +       memcpy(map, good_buf + offset, size);
> > > > +
> > > > +       if (lseek(fd, offset, SEEK_SET) == -1) {
> > > > +               prterr("domapwrite: lseek");
> > > > +               exit(204);
> > > > +       }
> > > > +
> > > > +       ret = write(fd, map, size);
> > > > +       if (ret == -1) {
> > > > +               prterr("domapwrite: write");
> > > > +               exit(205);
> > > > +       }
> > > > +
> > > > +       /*
> > > >         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
> > > >                               MAP_FILE | MAP_SHARED, fd,
> > > >                               (off_t)(offset - pg_offset))) == (char *)-1) {
> > > > @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
> > > >                 prterr("domapwrite: munmap");
> > > >                 report_failure(204);
> > > >         }
> > > > +       */
> > > > +       if (munmap(map, size_2mib_aligned) != 0) {
> > > > +               prterr("domapwrite: munmap map");
> > > > +               report_failure(206);
> > > > +       }
> > > > +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
> > > > +               prterr("domapwrite: munmap placeholder_map");
> > > > +               report_failure(207);
> > > > +       }
> > > >  }
> > > >
> > > > -- run fsx test --
> > > > sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > > >
> > > > On the offending commit 3b97c3652, I'm seeing:
> > > > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > > > Will begin at operation 3
> > > > Seed set to 1
> > > > ...
> > > > READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
> > > > /home/user/fuse_mount/example.txt
> > > > OFFSET      GOOD    BAD     RANGE
> > > > 0x1e43f     0x4b4a  0x114a  0x0
> > > > operation# (mod 256) for the bad data may be 74
> > > > 0x1e441     0xa64a  0xeb4a  0x1
> > > > operation# (mod 256) for the bad data may be 74
> > > > 0x1e443     0x264a  0xe44a  0x2
> > > > operation# (mod 256) for the bad data may be 74
> > > > 0x1e445     0x254a  0x9e4a  0x3
> > > > ...
> > > > Correct content saved for comparison
> > > > (maybe hexdump "/home/user/fuse_mount/example.txt" vs
> > > > "/home/user/fuse_mount/example.txt.fsxgood")
> > > >
> > > >
> > > > I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
> > > > -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
> > > > issue:
> > > >
> > > > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> > > > Will begin at operation 3
> > > > Seed set to 1
> > > > ...
> > > > copying to largest ever: 0x3e19b
> > > > copying to largest ever: 0x3e343
> > > > fallocating to largest ever: 0x40000
> > > > All 5000 operations completed A-OK!
> > > >
> > > >
> > > > Malte, would you mind double-checking whether this fixes the issue
> > > > you're seeing on your end?
> > >
> > > My test still fails.
> >
> > Hi Malte,
> >
> > I simulated your repro with installing bcachefs as the root filesystem
> > on a VM running Arch, then running "sudo pacman -S flatpak" and then
> > installing FreeCAD with "flatpak install flathub org.freecad.FreeCAD".
> >
> > On base commit 3b97c3652, I see the same corruption you noted:
> >
> > error: Failed to install org.kde.Platform: Error pulling from repo:
> > While pulling runtime/org.kde.Platform/x86_64/6.7 from remote flathub:
> > fsck content object
> > 886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33:
> > Corrupted file object; checksum
> > expected='886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33'
> > actual='67f5a60d19f7a65e1ee272d455fed138b864be73399816ad18fa71319614a418'
> >
> >
> > i tried this patch on top of commit 3b97c3652 and it fixed it for me:
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 14af8c41fc83..0d213a22972b 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1560,18 +1560,24 @@ static int fuse_get_user_pages(struct
> > fuse_args_pages *ap, struct iov_iter *ii,
> >
> >                 nbytes += ret;
> >
> > -               ret += start;
> > -               /* Currently, all folios in FUSE are one page */
> > -               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> > -
> > -               ap->folio_descs[ap->num_folios].offset = start;
> > -               fuse_folio_descs_length_init(ap->folio_descs,
> > ap->num_folios, nfolios);
> > -               for (i = 0; i < nfolios; i++)
> > -                       ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > -
> > -               ap->num_folios += nfolios;
> > -               ap->folio_descs[ap->num_folios - 1].length -=
> > -                       (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
> > +               nfolios = DIV_ROUND_UP(ret + start, PAGE_SIZE);
> > +
> > +               for (i = 0; i < nfolios; i++) {
> > +                       struct folio *folio = page_folio(pages[i]);
> > +                       unsigned int offset = start +
> > +                               (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > +                       unsigned int len = min_t(unsigned int, ret,
> > folio_size(folio) - offset);
> > +
> > +                       len = min_t(unsigned int, len, PAGE_SIZE - start);
> > +
> > +                       ap->descs[ap->num_folios].offset = offset;
> > +                       ap->descs[ap->num_folios].length = len;
> > +                       ap->folios[ap->num_folios] = folio;
> > +                       start = 0;
> > +                       ret -= len;
> > +                       ap->num_folios++;
> > +               }
> > +
> >                 nr_pages += nfolios;
> >         }
> >         kfree(pages);
> >
> > I ran this multiple times and don't see the corruption issues.
> >
> > However, I do see another issue crop up on some of my VMs, which is
> > flatpak hanging with this corresponding stack trace:
> >
> > [  368.520976] INFO: task pool-/usr/lib/f:582 blocked for more than 122 seconds.
> > [  368.521509]       Not tainted 6.12.0-rc1-g86b74eb5a11e #734
> > [  368.521905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  368.522483] task:pool-/usr/lib/f state:D stack:0     pid:582
> > tgid:405   ppid:1      flags:0x00000002
> > [  368.523238] Call Trace:
> > [  368.523411]  <TASK>
> > [  368.523548]  __schedule+0xaf0/0x27f0
> > [  368.523773]  ? __pfx___schedule+0x10/0x10
> > [  368.524024]  schedule+0x7e/0x300
> > [  368.524233]  __wait_on_freeing_inode+0xda/0x120
> > [  368.524527]  ? __pfx___wait_on_freeing_inode+0x10/0x10
> > [  368.524830]  ? _raw_spin_unlock_irqrestore+0xe/0x30
> > [  368.525124]  ? __pfx_wake_bit_function+0x10/0x10
> > [  368.525411]  bch2_inode_hash_find+0x372/0x580
> > [  368.525700]  ? __pfx_bch2_dirent_read_target+0x10/0x10
> > [  368.526023]  ? bch2_dirent_hash+0x23d/0x370
> > [  368.526286]  ? __pfx_bch2_inode_hash_find+0x10/0x10
> > [  368.526583]  ? __pfx_dirent_cmp_key+0x10/0x10
> > [  368.526972]  bch2_lookup_trans+0x61a/0x990
> > [  368.527247]  ? __pfx_bch2_lookup_trans+0x10/0x10
> > [  368.527559]  ? __do_sys_newfstatat+0x75/0xd0
> > [  368.527850]  ? do_syscall_64+0x4b/0x110
> > [  368.528088]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.528411]  ? __pfx_bch2_hash_info_init+0x10/0x10
> > [  368.528720]  ? __mod_memcg_state+0x102/0x390
> > [  368.529134]  ? obj_cgroup_charge+0x1b4/0x4c0
> > [  368.529500]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
> > [  368.529913]  ? kvm_sched_clock_read+0x11/0x20
> > [  368.530265]  ? _raw_spin_lock+0x85/0xe0
> > [  368.530553]  ? bch2_lookup_trans+0x323/0x990
> > [  368.530909]  ? __asan_memset+0x24/0x50
> > [  368.531204]  ? __bch2_trans_get+0x735/0xdd0
> > [  368.531525]  ? bch2_lookup+0x18a/0x350
> > [  368.531859]  bch2_lookup+0x18a/0x350
> > [  368.532145]  ? __pfx_bch2_lookup+0x10/0x10
> > [  368.532466]  ? __pfx_lockref_get_not_dead+0x10/0x10
> > [  368.532860]  __lookup_slow+0x182/0x350
> > [  368.533159]  ? __pfx___lookup_slow+0x10/0x10
> > [  368.533513]  walk_component+0x2ab/0x4f0
> > [  368.533816]  path_lookupat+0x120/0x660
> > [  368.534110]  ? vfs_fstatat+0x53/0xb0
> > [  368.534413]  filename_lookup+0x1aa/0x520
> > [  368.534718]  ? __pfx_filename_lookup+0x10/0x10
> > [  368.535052]  ? __wake_up_common+0xf2/0x170
> > [  368.535385]  ? _raw_spin_unlock_irq+0xe/0x30
> > [  368.535683]  ? __pfx_eventfd_write+0x10/0x10
> > [  368.536019]  ? kasan_save_stack+0x24/0x50
> > [  368.536313]  ? __kasan_record_aux_stack+0xad/0xc0
> > [  368.536651]  ? kmem_cache_free+0x353/0x550
> > [  368.536919]  ? do_syscall_64+0x4b/0x110
> > [  368.537179]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.537528]  vfs_statx+0xbf/0x140
> > [  368.537745]  ? kmem_cache_alloc_noprof+0x12d/0x340
> > [  368.538057]  ? __pfx_vfs_statx+0x10/0x10
> > [  368.538322]  ? getname_flags.part.0+0xaf/0x4a0
> > [  368.538627]  vfs_fstatat+0x6c/0xb0
> > [  368.538840]  __do_sys_newfstatat+0x75/0xd0
> > [  368.539100]  ? __pfx___do_sys_newfstatat+0x10/0x10
> > [  368.539415]  ? kasan_unpoison+0x27/0x60
> > [  368.539675]  ? __pfx_slab_free_after_rcu_debug+0x10/0x10
> > [  368.540081]  ? fdget_pos+0x366/0x5d0
> > [  368.540355]  ? fput+0x1b/0x2d0
> > [  368.540602]  ? ksys_write+0x18c/0x1c0
> > [  368.540934]  ? __pfx_ksys_write+0x10/0x10
> > [  368.541252]  ? fpregs_assert_state_consistent+0x20/0xa0
> > [  368.541657]  ? clear_bhb_loop+0x45/0xa0
> > [  368.541948]  do_syscall_64+0x4b/0x110
> > [  368.542270]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.542711] RIP: 0033:0x7fef789bb94e
> > [  368.543001] RSP: 002b:00007fef713fe468 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000106
> > [  368.543605] RAX: ffffffffffffffda RBX: 00007fef713fe5d0 RCX: 00007fef789bb94e
> > [  368.544129] RDX: 00007fef713fe540 RSI: 00007fef713fe5d0 RDI: 000000000000001f
> > [  368.544636] RBP: 00007fef713ff710 R08: 0000000000000073 R09: 0000000000000000
> > [  368.545130] R10: 0000000000000100 R11: 0000000000000246 R12: 000000000000001f
> > [  368.545640] R13: 00007fef713fe540 R14: 00007fef5802dc00 R15: 00007fef713ff780
> > [  368.546177]  </TASK>
> >
> > I'm seeing this happen on commit 86b74eb5 as well, which is the commit
> > that's before any of my or Josef's folio changes, and this persists
> > across reboots. I've seen this happen for a couple other tasks too,
> > "task:flatpak" and "task:pool-flatpak" with similar stack traces eg
> >
> > [  368.505270] task:pool-flatpak in state:D stack:0     pid:568
> > tgid:538   ppid:537    flags:0x00000002
> > [  368.505872] Call Trace:
> > [  368.506042]  <TASK>
> > [  368.506191]  __schedule+0xaf0/0x27f0
> > [  368.506430]  ? __pfx___schedule+0x10/0x10
> > [  368.506715]  schedule+0x7e/0x300
> > [  368.506940]  __wait_on_freeing_inode+0xda/0x120
> > [  368.507490]  ? __pfx___wait_on_freeing_inode+0x10/0x10
> > [  368.507965]  ? _raw_spin_unlock_irqrestore+0xe/0x30
> > [  368.508312]  ? __pfx_wake_bit_function+0x10/0x10
> > [  368.508650]  bch2_inode_hash_find+0x372/0x580
> > [  368.508969]  ? __pfx_bch2_dirent_read_target+0x10/0x10
> > [  368.509332]  ? bch2_dirent_hash+0x23d/0x370
> > [  368.509637]  ? __pfx_bch2_inode_hash_find+0x10/0x10
> > [  368.509991]  ? __pfx_dirent_cmp_key+0x10/0x10
> > [  368.510309]  bch2_lookup_trans+0x61a/0x990
> > [  368.510662]  ? __pfx_bch2_lookup_trans+0x10/0x10
> > [  368.511155]  ? vfs_statx+0xbf/0x140
> > [  368.511429]  ? vfs_fstatat+0x6c/0xb0
> > [  368.511731]  ? do_syscall_64+0x4b/0x110
> > [  368.512014]  ? __d_alloc+0x5cc/0x8f0
> > [  368.512283]  ? memcg_list_lru_alloc+0x184/0x8a0
> > [  368.512612]  ? __pfx_bch2_hash_info_init+0x10/0x10
> > [  368.512943]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.513327]  ? __memcg_slab_post_alloc_hook+0x536/0xba0
> > [  368.513753]  ? kvm_sched_clock_read+0x11/0x20
> > [  368.514133]  ? _raw_spin_lock+0x85/0xe0
> > [  368.515095]  ? bch2_lookup_trans+0x323/0x990
> > [  368.516994]  ? __asan_memset+0x24/0x50
> > [  368.517294]  ? __bch2_trans_get+0x735/0xdd0
> > [  368.517675]  ? bch2_lookup+0x18a/0x350
> > [  368.517987]  bch2_lookup+0x18a/0x350
> > [  368.518271]  ? __pfx_bch2_lookup+0x10/0x10
> > [  368.518568]  ? __pfx_lockref_get_not_dead+0x10/0x10
> > [  368.518933]  __lookup_slow+0x182/0x350
> > [  368.519264]  ? __pfx___lookup_slow+0x10/0x10
> > [  368.519633]  walk_component+0x2ab/0x4f0
> > [  368.519934]  ? fput+0x1b/0x2d0
> > [  368.520206]  link_path_walk.part.0.constprop.0+0x4ad/0xac0
> > [  368.520657]  path_lookupat+0x72/0x660
> > [  368.521022]  ? vfs_fstatat+0x53/0xb0
> > [  368.521343]  filename_lookup+0x1aa/0x520
> > [  368.521664]  ? __pfx_filename_lookup+0x10/0x10
> > [  368.522021]  ? __pfx_xa_load+0x10/0x10
> > [  368.522319]  ? ___slab_alloc+0x128/0x9d0
> > [  368.522683]  ? mntput_no_expire+0xcf/0x760
> > [  368.523018]  ? lockref_put_return+0xc7/0x140
> > [  368.523357]  ? kmem_cache_free+0x19e/0x550
> > [  368.523686]  ? __pfx_mutex_unlock+0x10/0x10
> > [  368.524035]  ? do_renameat2+0x1f4/0xa50
> > [  368.524404]  ? do_renameat2+0x1f4/0xa50
> > [  368.524716]  vfs_statx+0xbf/0x140
> > [  368.524980]  ? kmem_cache_alloc_noprof+0x12d/0x340
> > [  368.525355]  ? __pfx_vfs_statx+0x10/0x10
> > [  368.525702]  ? getname_flags.part.0+0xaf/0x4a0
> > [  368.526080]  vfs_fstatat+0x6c/0xb0
> > [  368.526391]  __do_sys_newfstatat+0x75/0xd0
> > [  368.526745]  ? __pfx___do_sys_newfstatat+0x10/0x10
> > [  368.527206]  ? do_symlinkat+0x15d/0x260
> > [  368.527509]  ? do_mkdirat+0x19d/0x2c0
> > [  368.527807]  ? kasan_save_track+0x14/0x30
> > [  368.528132]  ? getname_flags.part.0+0xaf/0x4a0
> > [  368.528509]  ? fpregs_assert_state_consistent+0x20/0xa0
> > [  368.529198]  ? clear_bhb_loop+0x45/0xa0
> > [  368.529536]  do_syscall_64+0x4b/0x110
> > [  368.529856]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  368.530279] RIP: 0033:0x7f592256194e
> > [  368.530610] RSP: 002b:00007f5915bfe548 EFLAGS: 00000217 ORIG_RAX:
> > 0000000000000106
> > [  368.531350] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f592256194e
> > [  368.531943] RDX: 00007f5915bfe570 RSI: 00007f5915bfe600 RDI: 0000000000000013
> > [  368.532491] RBP: 00007f5915bfe740 R08: 0000000000000073 R09: 0000000000000000
> > [  368.533048] R10: 0000000000000100 R11: 0000000000000217 R12: 00007f5915bfe570
> > [  368.533597] R13: 00007f5915bfe600 R14: 00007f5915bfe56c R15: 00007f5915bfe780
> > [  368.534236]  </TASK>
> > [  368.534419] INFO: task pool-flatpak in:571 blocked for more than 122 seconds.
> >
> > This seems like something we should investigate as well. I'm happy to
> > help repro this if needed. I'm able to hit this pretty consistently.
> >
> > Malte, would you mind applying the patch above and confirming if this
> > fixes it for you on your VM? And while you've been running flatpak,
> > have you also been seeing some flatpak tasks hanging on some of your
> > VMs?
>
> You're testing with snapshots? That one is fixed in my master branch.

Great. I retested this on your master branch and on top of commit
40384c84 and don't see this issue anymore, thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-12-11 20:32 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-07 21:58 silent data corruption in fuse in rc1 Kent Overstreet
2024-12-07 23:01 ` Malte Schröder
2024-12-07 23:39   ` Kent Overstreet
2024-12-08  0:11     ` Malte Schröder
2024-12-08  1:23   ` Matthew Wilcox
2024-12-08 20:02     ` Malte Schröder
2024-12-08 22:32       ` Malte Schröder
2024-12-09  1:57         ` Jingbo Xu
2024-12-09  6:42           ` Malte Schröder
2024-12-09  8:06             ` Bernd Schubert
2024-12-09  9:07               ` Malte Schröder
2024-12-09 13:06                 ` Bernd Schubert
2024-12-09 13:45                   ` Malte Schröder
2024-12-09 14:49         ` Josef Bacik
2024-12-09 15:28           ` Matthew Wilcox
2024-12-09 15:48             ` Josef Bacik
2024-12-09 17:07               ` Malte Schröder
2024-12-09 18:47                 ` Joanne Koong
2024-12-09 19:52                   ` Joanne Koong
2024-12-10  5:14                     ` Joanne Koong
2024-12-10 16:54                       ` Kent Overstreet
2024-12-10 17:08                         ` Matthew Wilcox
2024-12-10 17:43                           ` Kent Overstreet
2024-12-10 17:31                         ` Josef Bacik
2024-12-10 17:48                           ` Kent Overstreet
2024-12-10 17:53                       ` Malte Schröder
2024-12-11  2:57                         ` Joanne Koong
2024-12-11  6:43                           ` Kent Overstreet
2024-12-11 20:32                             ` Joanne Koong
2024-12-11 14:01                           ` Malte Schröder
2024-12-11 19:43                             ` Joanne Koong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.