Re: [PATCH] ext4: use private version of page_zero_new

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
       [not found] ` <1444363269-25956-1-git-send-email-tytso@mit.edu>
@ 2025-01-26 17:01   ` Mateusz Guzik
  2025-01-26 18:48     ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Mateusz Guzik @ 2025-01-26 17:01 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ext4 Developers List, Linux Kernel Developers List, dave.hansen,
	torvalds, akpm, linux-fsdevel

On Fri, Oct 09, 2015 at 12:01:09AM -0400, Theodore Ts'o wrote:
> If there is a error while copying data from userspace into the page
> cache during a write(2) system call, in data=journal mode, in
> ext4_journalled_write_end() were using page_zero_new_buffers() from
> fs/buffer.c.  Unfortunately, this sets the buffer dirty flag, which is
> no good if journalling is enabled.  This is a long-standing bug that
> goes back for years and years in ext3, but a combination of (a)
> data=journal not being very common, (b) in many case it only results
> in a warning message. and (c) only very rarely causes the kernel hang,
> means that we only really noticed this as a problem when commit
> 998ef75ddb caused this failure to happen frequently enough to cause
> generic/208 to fail when run in data=journal mode.
> 
> The fix is to have our own version of this function that doesn't call
> mark_dirty_buffer(), since we will end up calling
> ext4_handle_dirty_metadata() on the buffer head(s) in questions very
> shortly afterwards in ext4_journalled_write_end().
> 
> Thanks to Dave Hansen and Linus Torvalds for helping to identify the
> root cause of the problem.
> 

Hello there, a blast from the past.

I see this has landed in b90197b655185a11640cce3a0a0bc5d8291b8ad2

I came here from looking at a pwrite vs will-it-scale and noticing that
pre-faulting eats CPU (over 5% on my Sapphire Rapids) due to SMAP trips.

It used to be that pre-faulting was avoided specifically for that
reason, but it got temporarily reverted due to bugs in ext4, to quote
Linus (see 00a3d660cbac05af34cca149cb80fb611e916935):

>    The commit itself does not appear to be buggy per se, but it is exposing
>    a bug in ext4 (and Ted thinks ext3 too, but we solved that by getting
>    rid of it).  It's too late in the release cycle to really worry about
>    this, even if Dave Hansen has a patch that may actually fix the
>    underlying ext4 problem.  We can (and should) revisit this for the next
>    release.

Given your patch landing I take it this is expected to be fixed now?

Sounds like nobody bothered to revert the revert. Not the end of the
world, but it is few % left on the table for (hopefully) no reason. ofc
testing will be needed, but that's what -next is for

thanks,

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
  2025-01-26 17:01   ` [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode Mateusz Guzik
@ 2025-01-26 18:48     ` Linus Torvalds
  2025-01-26 19:49       ` Mateusz Guzik
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2025-01-26 18:48 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Theodore Ts'o, Ext4 Developers List,
	Linux Kernel Developers List, dave.hansen, akpm, linux-fsdevel

On Sun, 26 Jan 2025 at 09:02, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Hello there, a blast from the past.
>
> I see this has landed in b90197b655185a11640cce3a0a0bc5d8291b8ad2

Whee. What archeology are you doing to notice this decade-old issue?

> I came here from looking at a pwrite vs will-it-scale and noticing that
> pre-faulting eats CPU (over 5% on my Sapphire Rapids) due to SMAP trips.

Ugh. Yeah, turning SMAP on/off is expensive on most cores (apparently
fixed in AMD Zen 5).

> It used to be that pre-faulting was avoided specifically for that
> reason, but it got temporarily reverted due to bugs in ext4, to quote
> Linus (see 00a3d660cbac05af34cca149cb80fb611e916935):

Yeah, I think we should revert the revert (except we've done other
changes in the last decade - surprise surprise - so it would be a
completely manual revert).

If you send me a tested revert of the revert (aka re-do) of the "don't
pre-fault" patch, I'll apply it.

Note that the ext4 problem could exist in other filesystems, so we
might have to revert (again).  It's not necessarily that ext4 was
_particularly_ buggy, it's quite possible that the problem was
originally found on ext4 just because it was more widely used than
others.

               Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
  2025-01-26 18:48     ` Linus Torvalds
@ 2025-01-26 19:49       ` Mateusz Guzik
  2025-01-26 22:03         ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Mateusz Guzik @ 2025-01-26 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Ts'o, Ext4 Developers List,
	Linux Kernel Developers List, dave.hansen, akpm, linux-fsdevel

On Sun, Jan 26, 2025 at 7:49 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 26 Jan 2025 at 09:02, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > Hello there, a blast from the past.
> >
> > I see this has landed in b90197b655185a11640cce3a0a0bc5d8291b8ad2
>
> Whee. What archeology are you doing to notice this decade-old issue?
>

Me? Archeology? Not even once!

I was curious what's up with this very much *fresh* sucker:
https://lore.kernel.org/oe-lkp/202501261527.c3bf4764-lkp@intel.com/

> > I came here from looking at a pwrite vs will-it-scale and noticing that
> > pre-faulting eats CPU (over 5% on my Sapphire Rapids) due to SMAP trips.
>
> Ugh. Yeah, turning SMAP on/off is expensive on most cores (apparently
> fixed in AMD Zen 5).
>

Interesting, I thought it sucks everywhere.

Anyhow it definitely still sucks on Sapphire Rapids which is pretty
high up there as far as Intel goes, so...

> > It used to be that pre-faulting was avoided specifically for that
> > reason, but it got temporarily reverted due to bugs in ext4, to quote
> > Linus (see 00a3d660cbac05af34cca149cb80fb611e916935):
>
> Yeah, I think we should revert the revert (except we've done other
> changes in the last decade - surprise surprise - so it would be a
> completely manual revert).
>
> If you send me a tested revert of the revert (aka re-do) of the "don't
> pre-fault" patch, I'll apply it.
>

:( ok

This being your revert I was lowkey hoping you would do the honors.

I'll sort it out if Ted confirms that as far as he knows this is fixed in ext4.

> Note that the ext4 problem could exist in other filesystems, so we
> might have to revert (again).  It's not necessarily that ext4 was
> _particularly_ buggy, it's quite possible that the problem was
> originally found on ext4 just because it was more widely used than
> others.

Indeed. Or they might have regressed since, which is why I mentioned
-next for testing.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
  2025-01-26 19:49       ` Mateusz Guzik
@ 2025-01-26 22:03         ` Linus Torvalds
  2025-01-26 22:45           ` Mateusz Guzik
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2025-01-26 22:03 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Theodore Ts'o, Ext4 Developers List,
	Linux Kernel Developers List, dave.hansen, akpm, linux-fsdevel

On Sun, 26 Jan 2025 at 11:49, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> This being your revert I was lowkey hoping you would do the honors.

If it had been a plain revert, I would have - but while the general
layout of the code is similar, the code around this area has changed
enough that it really needs to be pretty much entirely "fix up by
hand" and wants some care and testing.

Which I am unlikely to have time for during this merge window.

So if you don't get around to it, and _if_ I remember this when the
merge window is open, I might do it in my local tree, but then it will
end up being too late for this merge window.

             Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
  2025-01-26 22:03         ` Linus Torvalds
@ 2025-01-26 22:45           ` Mateusz Guzik
  2025-01-27 20:52             ` Dave Hansen
  0 siblings, 1 reply; 7+ messages in thread
From: Mateusz Guzik @ 2025-01-26 22:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Ts'o, Ext4 Developers List,
	Linux Kernel Developers List, dave.hansen, akpm, linux-fsdevel

On Sun, Jan 26, 2025 at 11:03 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 26 Jan 2025 at 11:49, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > This being your revert I was lowkey hoping you would do the honors.
>
> If it had been a plain revert, I would have - but while the general
> layout of the code is similar, the code around this area has changed
> enough that it really needs to be pretty much entirely "fix up by
> hand" and wants some care and testing.
>
> Which I am unlikely to have time for during this merge window.
>
> So if you don't get around to it, and _if_ I remember this when the
> merge window is open, I might do it in my local tree, but then it will
> end up being too late for this merge window.
>

The to-be-unreverted change was written by Dave (cc'ed).

I had a brief chat with him on irc, he said he is going to submit an
updated patch.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
  2025-01-26 22:45           ` Mateusz Guzik
@ 2025-01-27 20:52             ` Dave Hansen
  2025-01-27 21:46               ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2025-01-27 20:52 UTC (permalink / raw)
  To: Mateusz Guzik, Linus Torvalds
  Cc: Theodore Ts'o, Ext4 Developers List,
	Linux Kernel Developers List, akpm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1557 bytes --]

On 1/26/25 14:45, Mateusz Guzik wrote:
>>
>> So if you don't get around to it, and _if_ I remember this when the
>> merge window is open, I might do it in my local tree, but then it will
>> end up being too late for this merge window.
>>
> The to-be-unreverted change was written by Dave (cc'ed).
> 
> I had a brief chat with him on irc, he said he is going to submit an
> updated patch.

I poked at it a bit today. There's obviously been the page=>folio churn
and also iov_iter_fault_in_readable() got renamed and got some slightly
new semantics.

Additionally, I'm doubting the explicit pagefault_disable(). The
original patch did this:

+      pagefault_disable();
       copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+      pagefault_enable();

and the modern generic_perform_write() is using:

       copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);

But the "atomic" copy is using kmap_atomic() internally which has a
built-in pagefault_disable(). It wouldn't be super atomic if it were
handling page faults of course.

So I don't think generic_perform_write() needs to do its own
pagefault_disable().

I actually still had the original decade-old test case sitting around
compiled on my test box. It still triggers the issue and _will_ livelock
if fault_in_iov_iter_readable() isn't called somewhere.

Anyway, here's a patch that compiles, boots and doesn't immediately fall
over on ext4 in case anyone else wants to poke at it. I'll do a real
changelog, SoB, etc.... and send it out for real tomorrow if it holds up.

[-- Attachment #2: generic_perform_write-1.patch --]
[-- Type: text/x-patch, Size: 1692 bytes --]


index 4f476411a9a2d..98b37e4c6d43c 100644

---

 b/mm/filemap.c |   25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff -puN mm/filemap.c~generic_perform_write-1 mm/filemap.c
--- a/mm/filemap.c~generic_perform_write-1	2025-01-27 09:53:13.219120969 -0800
+++ b/mm/filemap.c	2025-01-27 12:28:40.333920434 -0800
@@ -4027,17 +4027,6 @@ retry:
 		bytes = min(chunk - offset, bytes);
 		balance_dirty_pages_ratelimited(mapping);
 
-		/*
-		 * Bring in the user page that we will copy from _first_.
-		 * Otherwise there's a nasty deadlock on copying from the
-		 * same page as we're writing to, without it being marked
-		 * up-to-date.
-		 */
-		if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
-			status = -EFAULT;
-			break;
-		}
-
 		if (fatal_signal_pending(current)) {
 			status = -EINTR;
 			break;
@@ -4055,6 +4044,11 @@ retry:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_folio(folio);
 
+		/*
+		 * This needs to be atomic because actually handling page
+		 * faults on 'i' can deadlock if the copy targets a
+		 * userspace mapping of 'folio'.
+		 */
 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
 		flush_dcache_folio(folio);
 
@@ -4080,6 +4074,15 @@ retry:
 				bytes = copied;
 				goto retry;
 			}
+			/*
+			 * 'folio' is now unlocked and faults on it can be
+			 * handled. Ensure forward progress by trying to
+			 * fault it in now.
+			 */
+                        if (fault_in_iov_iter_readable(i, bytes) == bytes) {
+                                status = -EFAULT;
+                                break;
+                        }
 		} else {
 			pos += status;
 			written += status;
_

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode
  2025-01-27 20:52             ` Dave Hansen
@ 2025-01-27 21:46               ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2025-01-27 21:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mateusz Guzik, Linus Torvalds, Theodore Ts'o,
	Ext4 Developers List, Linux Kernel Developers List, akpm,
	linux-fsdevel

On Mon, Jan 27, 2025 at 12:52:51PM -0800, Dave Hansen wrote:
> On 1/26/25 14:45, Mateusz Guzik wrote:
> >>
> >> So if you don't get around to it, and _if_ I remember this when the
> >> merge window is open, I might do it in my local tree, but then it will
> >> end up being too late for this merge window.
> >>
> > The to-be-unreverted change was written by Dave (cc'ed).
> > 
> > I had a brief chat with him on irc, he said he is going to submit an
> > updated patch.
> 
> I poked at it a bit today. There's obviously been the page=>folio churn
> and also iov_iter_fault_in_readable() got renamed and got some slightly
> new semantics.
....

> Anyway, here's a patch that compiles, boots and doesn't immediately fall
> over on ext4 in case anyone else wants to poke at it. I'll do a real
> changelog, SoB, etc.... and send it out for real tomorrow if it holds up.

> 
> index 4f476411a9a2d..98b37e4c6d43c 100644
> 
> ---
> 
>  b/mm/filemap.c |   25 ++++++++++++++-----------
>  1 file changed, 14 insertions(+), 11 deletions(-)
> 
> diff -puN mm/filemap.c~generic_perform_write-1 mm/filemap.c
> --- a/mm/filemap.c~generic_perform_write-1	2025-01-27 09:53:13.219120969 -0800
> +++ b/mm/filemap.c	2025-01-27 12:28:40.333920434 -0800
> @@ -4027,17 +4027,6 @@ retry:
>  		bytes = min(chunk - offset, bytes);
>  		balance_dirty_pages_ratelimited(mapping);
>  
> -		/*
> -		 * Bring in the user page that we will copy from _first_.
> -		 * Otherwise there's a nasty deadlock on copying from the
> -		 * same page as we're writing to, without it being marked
> -		 * up-to-date.
> -		 */
> -		if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> -			status = -EFAULT;
> -			break;
> -		}
> -
>  		if (fatal_signal_pending(current)) {
>  			status = -EINTR;
>  			break;
> @@ -4055,6 +4044,11 @@ retry:
>  		if (mapping_writably_mapped(mapping))
>  			flush_dcache_folio(folio);
>  
> +		/*
> +		 * This needs to be atomic because actually handling page
> +		 * faults on 'i' can deadlock if the copy targets a
> +		 * userspace mapping of 'folio'.
> +		 */
>  		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
>  		flush_dcache_folio(folio);
>  
> @@ -4080,6 +4074,15 @@ retry:
>  				bytes = copied;
>  				goto retry;
>  			}
> +			/*
> +			 * 'folio' is now unlocked and faults on it can be
> +			 * handled. Ensure forward progress by trying to
> +			 * fault it in now.
> +			 */
> +                        if (fault_in_iov_iter_readable(i, bytes) == bytes) {
> +                                status = -EFAULT;
> +                                break;
> +                        }
>  		} else {
>  			pos += status;
>  			written += status;

Shouldn't all the other places that have exactly the same
fault_in_iov_iter_readable()/copy_folio_from_iter_atomic() logic
and comments (e.g.  iomap_write_iter()) be changed to do this the
same way as this new code in generic_perform_write()?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-01-27 21:46 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20151007154303.GC24678@thunk.org>
     [not found] ` <1444363269-25956-1-git-send-email-tytso@mit.edu>
2025-01-26 17:01   ` [PATCH] ext4: use private version of page_zero_new_buffers() for data=journal mode Mateusz Guzik
2025-01-26 18:48     ` Linus Torvalds
2025-01-26 19:49       ` Mateusz Guzik
2025-01-26 22:03         ` Linus Torvalds
2025-01-26 22:45           ` Mateusz Guzik
2025-01-27 20:52             ` Dave Hansen
2025-01-27 21:46               ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox