From: "Theodore Ts'o" <tytso@mit.edu>
To: John Hubbard <jhubbard@nvidia.com>
Cc: Eric Biggers <ebiggers@kernel.org>,
Lee Jones <lee.jones@linaro.org>,
linux-ext4@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
Dave Chinner <dchinner@redhat.com>,
Goldwyn Rodrigues <rgoldwyn@suse.com>,
"Darrick J . Wong" <darrick.wong@oracle.com>,
Bob Peterson <rpeterso@redhat.com>,
Damien Le Moal <damien.lemoal@wdc.com>,
Andreas Gruenbacher <agruenba@redhat.com>,
Ritesh Harjani <riteshh@linux.ibm.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Johannes Thumshirn <jth@kernel.org>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
cluster-devel@redhat.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
Date: Fri, 25 Feb 2022 18:21:21 -0500 [thread overview]
Message-ID: <YhlkcYjozFmt3Kl4@mit.edu> (raw)
In-Reply-To: <2f9933b3-a574-23e1-e632-72fc29e582cf@nvidia.com>
On Fri, Feb 25, 2022 at 01:33:33PM -0800, John Hubbard wrote:
> On 2/25/22 13:23, Theodore Ts'o wrote:
> > [un]pin_user_pages_remote is dirtying pages without properly warning
> > the file system in advance. This was noted by Jan Kara in 2018[1] and
>
> In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported
> was actually that dio_bio_complete() was calling set_page_dirty_lock()
> on pages that were not (any longer) set up for that.
Fair enough, there are two problems that are getting conflated here,
and that's my bad. The problem which Jan pointed out is one where the
Direct I/O read path triggered a page fault, so page_mkwrite() was
actually called. So in this case, the file system was actually
notified, and the page was marked dirty after the file system was
notified. But then the DIO read was racing with the page cleaner,
which would call writepage(), and then clear the page, and then remove
the buffer_heads. Then dio_bio_complete() would call set_page_dirty()
a second time, and that's what would trigger the BUG.
But in the syzbot reproducer, it's a different problem. In this case,
process_vm_writev() calling [un]pin_user_pages_remote(), and
page_mkwrite() is never getting called. So there is no need to race
with the page cleaner, and so the BUG triggers much more reliably.
> > more recently has resulted in bug reports by Syzbot in various Android
> > kernels[2].
> >
> > This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
>
> Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock()
> call into mm/gup.c, but that merely refactored things. The callers are
> all over the kernel, and those callers are what need changing in order
> to fix this.
From my perspective, the bug is calling set_page_dirty() without first
calling the file system's page_mkwrite(). This is necessary since the
file system needs to allocate file system data blocks in preparation
for a future writeback.
Now, calling page_mkwrite() by itself is not enough, since the moment
you make the page dirty, the page cleaner could go ahead and call
writepage() behind your back and clean it. In actual practice, with a
Direct I/O read request racing with writeback, this is race was quite
hard to hit, because the that would imply that the background
writepage() call would have to complete ahead of the synchronous read
request, and the block layer generally prioritizes synchronous reads
ahead of background write requests. So in practice, this race was
***very*** hard to hit. Jan may have reported it in 2018, but I don't
think I've ever seen it happen myself.
For process_vm_writev() this is a case where user pages are pinned and
then released in short order, so I suspect that race with the page
cleaner would also be very hard to hit. But we could completely
remove the potential for the race, and also make things kinder for
f2fs and btrfs's compressed file write support, by making things work
much like the write(2) system call. Imagine if we had a
"pin_user_pages_local()" which calls write_begin(), and a
"unpin_user_pages_local()" which calls write_end(), and the
presumption with the "[un]pin_user_pages_local" API is that you don't
hold the pinned pages for very long --- say, not across a system call
boundary, and then it would work the same way the write(2) system call
works does except that in the case of process_vm_writev(2) the pages
are identified by another process's address space where they happen to
be mapped.
This obviously doesn't work when pinning pages for remote DMA, because
in that case the time between pin_user_pages_remote() and
unpin_user_pages_remote() could be a long, long time, so that means we
can't use using write_begin/write_end; we'd need to call page_mkwrite()
when the pages are first pinned and then somehow prevent the page
cleaner from touching a dirty page which is pinned for use by the
remote DMA.
Does that make sense?
- Ted
next prev parent reply other threads:[~2022-02-25 23:23 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-16 16:31 [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers() Lee Jones
2022-02-18 1:06 ` John Hubbard
2022-02-18 4:08 ` Theodore Ts'o
2022-02-18 6:33 ` John Hubbard
2022-02-23 23:31 ` Theodore Ts'o
2022-02-24 0:44 ` John Hubbard
2022-02-24 4:04 ` Theodore Ts'o
2022-02-18 7:51 ` Greg Kroah-Hartman
2022-02-23 23:35 ` Theodore Ts'o
2022-02-24 1:48 ` Dave Chinner
2022-02-24 3:50 ` Theodore Ts'o
2022-02-24 10:29 ` Dave Chinner
2022-02-18 2:54 ` Theodore Ts'o
2022-02-18 4:24 ` Matthew Wilcox
2022-02-18 6:03 ` Theodore Ts'o
2022-02-25 19:24 ` [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first Theodore Ts'o
2022-02-25 20:51 ` Eric Biggers
2022-02-25 21:08 ` Theodore Ts'o
2022-02-25 21:23 ` [PATCH -v3] " Theodore Ts'o
2022-02-25 21:33 ` John Hubbard
2022-02-25 23:21 ` Theodore Ts'o [this message]
2022-02-26 0:41 ` John Hubbard
2022-02-26 1:40 ` Theodore Ts'o
2022-02-26 2:00 ` Theodore Ts'o
2022-02-26 2:55 ` John Hubbard
2022-03-03 4:26 ` [PATCH -v4] " Theodore Ts'o
2022-03-03 8:21 ` Christoph Hellwig
2022-03-03 9:21 ` Lee Jones
2022-03-03 14:38 ` [PATCH -v5] ext4: don't BUG if someone " Theodore Ts'o
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YhlkcYjozFmt3Kl4@mit.edu \
--to=tytso@mit.edu \
--cc=agruenba@redhat.com \
--cc=cluster-devel@redhat.com \
--cc=damien.lemoal@wdc.com \
--cc=darrick.wong@oracle.com \
--cc=dchinner@redhat.com \
--cc=ebiggers@kernel.org \
--cc=gregkh@linuxfoundation.org \
--cc=hch@lst.de \
--cc=jhubbard@nvidia.com \
--cc=jth@kernel.org \
--cc=lee.jones@linaro.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=rgoldwyn@suse.com \
--cc=riteshh@linux.ibm.com \
--cc=rpeterso@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox