* [PATCH RFC 0/5] Writeback handling of pinned pages
@ 2023-02-09 12:31 Jan Kara
2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara
` (4 more replies)
0 siblings, 5 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-block, linux-mm, John Hubbard, David Howells,
David Hildenbrand, Jan Kara
Hello,
since we are slowly getting into a state where folios used as buffers for
[R]DMA are detectable by folio_maybe_dma_pinned(), I figured it is time we also
address the original problems filesystems had with these pages [1] - namely
that page/folio private data can get reclaimed from the page while it is being
written to by the DMA and also that page contents can be modified while the
page is under writeback.
This patch series is an outline of how the solution could look (so far it is
only compile-tested). The first two patches deal with the reclaim of page
private data for pinned pages. They are IMO no-brainers and actually deal with
99% of the observed issues so we might just separate them and merge them
earlier. The remainder of the series deals with the concern that page contents
can be modified while the page is being written back. What it implements
instead is that we skip page-cleaning writeback for pinned pages and, if we
cannot
avoid writing the page (data integrity writeback), we bite the bullet and
bounce the page.
Note that the conversion of clear_page_dirty_for_io() (and its folio variant)
is rough and provides wbc to the function only in the obvious cases - that
will need a bit more work, but OTOH functionally passing NULL just retains the
old behavior + WARNs if we actually see a pinned page in the writeback path.
Opinions?
Honza
[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
@ 2023-02-09 12:31 ` Jan Kara
  2023-02-09 16:17   ` Matthew Wilcox
  2023-02-13  9:01   ` David Hildenbrand
  2023-02-09 12:31 ` [PATCH 2/5] ext4: Drop workaround for mm reclaiming fs private page data Jan Kara
  ` (3 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand, Jan Kara

If the page is pinned, there's no point in trying to reclaim it.
Furthermore if the page is from the page cache we don't want to reclaim
fs-private data from the page because the pinning process may be writing
to the page at any time and reclaiming fs private info on a dirty page
can upset the filesystem (see link below).

Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/vmscan.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf3eedf0209c..ab3911a8b116 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1901,6 +1901,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		}
 	}
 
+	/*
+	 * Folio is unmapped now so it cannot be newly pinned anymore.
+	 * No point in trying to reclaim folio if it is pinned.
+	 * Furthermore we don't want to reclaim underlying fs metadata
+	 * if the folio is pinned and thus potentially modified by the
+	 * pinning process as that may upset the filesystem.
+	 */
+	if (folio_maybe_dma_pinned(folio))
+		goto activate_locked;
+
 	mapping = folio_mapping(folio);
 	if (folio_test_dirty(folio)) {
 		/*
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page 2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara @ 2023-02-09 16:17 ` Matthew Wilcox 2023-02-10 11:29 ` Jan Kara 2023-02-13 9:01 ` David Hildenbrand 1 sibling, 1 reply; 26+ messages in thread From: Matthew Wilcox @ 2023-02-09 16:17 UTC (permalink / raw) To: Jan Kara Cc: linux-fsdevel, linux-block, linux-mm, John Hubbard, David Howells, David Hildenbrand On Thu, Feb 09, 2023 at 01:31:53PM +0100, Jan Kara wrote: > If the page is pinned, there's no point in trying to reclaim it. > Furthermore if the page is from the page cache we don't want to reclaim > fs-private data from the page because the pinning process may be writing > to the page at any time and reclaiming fs private info on a dirty page > can upset the filesystem (see link below). > > Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz OK, but now I'm confused. I've been told before that the reason we can't take pinned pages off the LRU list is that they need to be written back periodically for ... reasons. But now the pages are going to be skipped if they're found on the LRU list, so why is this better than taking them off the LRU list? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-09 16:17 ` Matthew Wilcox
@ 2023-02-10 11:29 ` Jan Kara
  2023-02-13  9:55   ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-10 11:29 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, John Hubbard,
	David Howells, David Hildenbrand

On Thu 09-02-23 16:17:47, Matthew Wilcox wrote:
> On Thu, Feb 09, 2023 at 01:31:53PM +0100, Jan Kara wrote:
> > If the page is pinned, there's no point in trying to reclaim it.
> > Furthermore if the page is from the page cache we don't want to reclaim
> > fs-private data from the page because the pinning process may be writing
> > to the page at any time and reclaiming fs private info on a dirty page
> > can upset the filesystem (see link below).
> >
> > Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
>
> OK, but now I'm confused. I've been told before that the reason we
> can't take pinned pages off the LRU list is that they need to be written
> back periodically for ... reasons. But now the pages are going to be
> skipped if they're found on the LRU list, so why is this better than
> taking them off the LRU list?

You are mixing things together a bit :). Yes, we do need to write back
pinned pages from time to time - for data integrity purposes like fsync(2).
This has nothing to do with taking the pinned page out of the LRU. It would
actually be nice to be able to take pinned pages out of the LRU and
functionally that would make sense, but as I've mentioned in my reply to
you [1], the problem here is the performance. I've now dug out the
discussion from 2018 where John actually tried to take pinned pages out of
the LRU [2] and the result was a 20% IOPS degradation on his NVMe drive
because of the cost of taking the LRU lock. I'm not even speaking of how
costly that would get on any heavily parallel direct IO workload on some
high-iops device...

								Honza

[1] https://lore.kernel.org/all/20230124102931.g7e33syuhfo7s36h@quack3
[2] https://lore.kernel.org/all/f5ad7210-05e0-3dc4-02df-01ce5346e198@nvidia.com
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page
  2023-02-10 11:29 ` Jan Kara
@ 2023-02-13  9:55 ` Christoph Hellwig
  2023-02-14 13:06   ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-13 9:55 UTC (permalink / raw)
To: Jan Kara
Cc: Matthew Wilcox, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Fri, Feb 10, 2023 at 12:29:54PM +0100, Jan Kara wrote:
> functionally that would make sense but as I've mentioned in my reply to you
> [1], the problem here is the performance. I've now dug out the discussion
> from 2018 where John actually tried to take pinned pages out of the LRU [2]
> and the result was 20% IOPS degradation on his NVME drive because of the
> cost of taking the LRU lock. I'm not even speaking how costly that would
> get on any heavily parallel direct IO workload on some high-iops device...

I think we need to distinguish between short- and long-term pins.
For short term pins like direct I/O it doesn't make sense to take them
off the lru, or to do any other special action. Writeback will simply
have to wait for the short term pin.

Long-term pins absolutely would make sense to be taken off the LRU list.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page 2023-02-13 9:55 ` Christoph Hellwig @ 2023-02-14 13:06 ` Jan Kara 2023-02-14 21:40 ` John Hubbard 0 siblings, 1 reply; 26+ messages in thread From: Jan Kara @ 2023-02-14 13:06 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Matthew Wilcox, linux-fsdevel, linux-block, linux-mm, John Hubbard, David Howells, David Hildenbrand On Mon 13-02-23 01:55:04, Christoph Hellwig wrote: > On Fri, Feb 10, 2023 at 12:29:54PM +0100, Jan Kara wrote: > > functionally that would make sense but as I've mentioned in my reply to you > > [1], the problem here is the performance. I've now dug out the discussion > > from 2018 where John actually tried to take pinned pages out of the LRU [2] > > and the result was 20% IOPS degradation on his NVME drive because of the > > cost of taking the LRU lock. I'm not even speaking how costly that would > > get on any heavily parallel direct IO workload on some high-iops device... > > I think we need to distinguish between short- and long terms pins. > For short term pins like direct I/O it doesn't make sense to take them > off the lru, or to do any other special action. Writeback will simplify > have to wait for the short term pin. > > Long-term pins absolutely would make sense to be taken off the LRU list. Yeah, I agree distinguishing these two would be nice as we could treat them differently then. The trouble is a bit with always-crowded struct page. But now it occurred to me that if we are going to take these long-term pinned pages out from the LRU, we could overload the space for LRU pointers with the counter (which is what I think John originally did). So yes, possibly we could track separately long-term and short-term pins. John, what do you think? Maybe time to revive your patches from 2018 in a bit different form? ;) Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page 2023-02-14 13:06 ` Jan Kara @ 2023-02-14 21:40 ` John Hubbard 2023-02-16 11:56 ` Jan Kara 0 siblings, 1 reply; 26+ messages in thread From: John Hubbard @ 2023-02-14 21:40 UTC (permalink / raw) To: Jan Kara, Christoph Hellwig Cc: Matthew Wilcox, linux-fsdevel, linux-block, linux-mm, David Howells, David Hildenbrand On 2/14/23 05:06, Jan Kara wrote: > On Mon 13-02-23 01:55:04, Christoph Hellwig wrote: >> I think we need to distinguish between short- and long terms pins. >> For short term pins like direct I/O it doesn't make sense to take them >> off the lru, or to do any other special action. Writeback will simplify >> have to wait for the short term pin. >> >> Long-term pins absolutely would make sense to be taken off the LRU list. > > Yeah, I agree distinguishing these two would be nice as we could treat them > differently then. The trouble is a bit with always-crowded struct page. But > now it occurred to me that if we are going to take these long-term pinned > pages out from the LRU, we could overload the space for LRU pointers with > the counter (which is what I think John originally did). So yes, possibly > we could track separately long-term and short-term pins. John, what do you > think? Maybe time to revive your patches from 2018 in a bit different form? > ;) > Oh wow, I really love this idea. We kept running into problems because long- and short-term pins were mixed up together (except during creation), and this, at long last, separates them. Very nice. I'd almost forgotten about the 2018 page.lru adventures, too. ha :) One objection might be that pinning is now going to be taking a lot of space in struct page / folio, but I think it's warranted, based on the long-standing, difficult problems that it would solve. 
We could even leave most of these patches, and David Howells' patches, intact, by using an approach similar to the mm_users and mm_count technique: maintain a long-term pin count in one of the folio->lru fields, and any non-zero count there creates a single count in folio->_pincount. I could put together something to do that. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page 2023-02-14 21:40 ` John Hubbard @ 2023-02-16 11:56 ` Jan Kara 0 siblings, 0 replies; 26+ messages in thread From: Jan Kara @ 2023-02-16 11:56 UTC (permalink / raw) To: John Hubbard Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, linux-fsdevel, linux-block, linux-mm, David Howells, David Hildenbrand On Tue 14-02-23 13:40:17, John Hubbard wrote: > On 2/14/23 05:06, Jan Kara wrote: > > On Mon 13-02-23 01:55:04, Christoph Hellwig wrote: > >> I think we need to distinguish between short- and long terms pins. > >> For short term pins like direct I/O it doesn't make sense to take them > >> off the lru, or to do any other special action. Writeback will simplify > >> have to wait for the short term pin. > >> > >> Long-term pins absolutely would make sense to be taken off the LRU list. > > > > Yeah, I agree distinguishing these two would be nice as we could treat them > > differently then. The trouble is a bit with always-crowded struct page. But > > now it occurred to me that if we are going to take these long-term pinned > > pages out from the LRU, we could overload the space for LRU pointers with > > the counter (which is what I think John originally did). So yes, possibly > > we could track separately long-term and short-term pins. John, what do you > > think? Maybe time to revive your patches from 2018 in a bit different form? > > ;) > > > > Oh wow, I really love this idea. We kept running into problems because > long- and short-term pins were mixed up together (except during > creation), and this, at long last, separates them. Very nice. I'd almost > forgotten about the 2018 page.lru adventures, too. ha :) > > One objection might be that pinning is now going to be taking a lot of > space in struct page / folio, but I think it's warranted, based on the > long-standing, difficult problems that it would solve. 
Well, it doesn't need to consume more space in the struct page than it
already does currently AFAICS. We could just mark the folio as unevictable
and make sure folio_evictable() returns false for such pages. Then we
should be safe to use the space of the lru pointers for whatever we need.

> We could even leave most of these patches, and David Howells' patches,
> intact, by using an approach similar to the mm_users and mm_count
> technique: maintain a long-term pin count in one of the folio->lru
> fields, and any non-zero count there creates a single count in
> folio->_pincount.

Oh, you mean that the first longterm pin would take one short-term pin?
Yes, that should be possible but I'm not sure that would be a huge win. I
can imagine users may care about distinguishing these states:

1) unpinned
2) has any pin
3) has only short term pins

Now distinguishing between 1 and 2+3 would still be done by
folio_maybe_dma_pinned(). Your change will allow us to not look at lru
pointers in folio_maybe_dma_pinned() so that's some simplification and
perhaps a performance optimization (potentially it can save us the need to
pull in another cacheline, but mostly _refcount and lru will be in the same
cacheline anyway) so maybe it's worth it in the end.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page 2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara 2023-02-09 16:17 ` Matthew Wilcox @ 2023-02-13 9:01 ` David Hildenbrand 2023-02-14 13:00 ` Jan Kara 1 sibling, 1 reply; 26+ messages in thread From: David Hildenbrand @ 2023-02-13 9:01 UTC (permalink / raw) To: Jan Kara, linux-fsdevel Cc: linux-block, linux-mm, John Hubbard, David Howells On 09.02.23 13:31, Jan Kara wrote: > If the page is pinned, there's no point in trying to reclaim it. > Furthermore if the page is from the page cache we don't want to reclaim > fs-private data from the page because the pinning process may be writing > to the page at any time and reclaiming fs private info on a dirty page > can upset the filesystem (see link below). > > Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz > Signed-off-by: Jan Kara <jack@suse.cz> > --- > mm/vmscan.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index bf3eedf0209c..ab3911a8b116 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1901,6 +1901,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > } > } > > + /* > + * Folio is unmapped now so it cannot be newly pinned anymore. > + * No point in trying to reclaim folio if it is pinned. > + * Furthermore we don't want to reclaim underlying fs metadata > + * if the folio is pinned and thus potentially modified by the > + * pinning process is that may upset the filesystem. > + */ > + if (folio_maybe_dma_pinned(folio)) > + goto activate_locked; > + > mapping = folio_mapping(folio); > if (folio_test_dirty(folio)) { > /* At this point, we made sure that the folio is completely unmapped. However, we specify "TTU_BATCH_FLUSH", so rmap code might defer a TLB flush and consequently defer an IPI sync. 
I remember that this check here is fine regarding GUP-fast: even if
concurrent GUP-fast pins the page after our check here, it should observe
the changed PTE and unpin it again.

Checking after unmapping makes sense: we reduce the likelihood of false
positives when a file-backed page is mapped many times (>= 1024). OTOH, we
might unmap pinned pages because we cannot really detect them early.

For anon pages, we have an early (racy) check, which turned out "ok" in
practice, because we don't frequently have that many anon pages that are
shared by that many processes. I assume we don't want something similar for
pagecache pages, because having a single page mapped by many processes can
happen easily and would prevent reclaim.

I once had a patch lying around that documented exactly that (racy + false
positives with many mappings) for the existing folio_maybe_dma_pinned() on
anon pages.

Long story short, I assume this change is fine.

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/5] mm: Do not reclaim private data from pinned page 2023-02-13 9:01 ` David Hildenbrand @ 2023-02-14 13:00 ` Jan Kara 0 siblings, 0 replies; 26+ messages in thread From: Jan Kara @ 2023-02-14 13:00 UTC (permalink / raw) To: David Hildenbrand Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, John Hubbard, David Howells On Mon 13-02-23 10:01:35, David Hildenbrand wrote: > On 09.02.23 13:31, Jan Kara wrote: > > If the page is pinned, there's no point in trying to reclaim it. > > Furthermore if the page is from the page cache we don't want to reclaim > > fs-private data from the page because the pinning process may be writing > > to the page at any time and reclaiming fs private info on a dirty page > > can upset the filesystem (see link below). > > > > Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > mm/vmscan.c | 10 ++++++++++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index bf3eedf0209c..ab3911a8b116 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1901,6 +1901,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > } > > } > > + /* > > + * Folio is unmapped now so it cannot be newly pinned anymore. > > + * No point in trying to reclaim folio if it is pinned. > > + * Furthermore we don't want to reclaim underlying fs metadata > > + * if the folio is pinned and thus potentially modified by the > > + * pinning process is that may upset the filesystem. > > + */ > > + if (folio_maybe_dma_pinned(folio)) > > + goto activate_locked; > > + > > mapping = folio_mapping(folio); > > if (folio_test_dirty(folio)) { > > /* > > At this point, we made sure that the folio is completely unmapped. However, > we specify "TTU_BATCH_FLUSH", so rmap code might defer a TLB flush and > consequently defer an IPI sync. 
> 
> I remember that this check here is fine regarding GUP-fast: even if
> concurrent GUP-fast pins the page after our check here, it should observe
> the changed PTE and unpin it again.
> 
> Checking after unmapping makes sense: we reduce the likelihood of false
> positives when a file-backed page is mapped many times (>= 1024). OTOH, we
> might unmap pinned pages because we cannot really detect it early.
> 
> For anon pages, we have an early (racy) check, which turned out "ok" in
> practice, because we don't frequently have that many anon pages that are
> shared by that many processes. I assume we don't want something similar for
> pagecache pages, because having a single page mapped by many processes can
> happen easily and would prevent reclaim.

Yeah, I think pagecache pages shared by many processes are more likely.
Furthermore I think pinned pagecache pages are rather rare so unmapping
them before checking seems fine to me. Obviously we can reconsider if
reality would prove me wrong ;).

> I once had a patch lying around that documented for the existing
> folio_maybe_dma_pinned() for anon pages exactly that (racy+false positives
> with many mappings).
> 
> Long story short, I assume this change is fine.

Thanks for the thorough verification :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 2/5] ext4: Drop workaround for mm reclaiming fs private page data 2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara 2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara @ 2023-02-09 12:31 ` Jan Kara 2023-02-09 12:31 ` [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback Jan Kara ` (2 subsequent siblings) 4 siblings, 0 replies; 26+ messages in thread From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw) To: linux-fsdevel Cc: linux-block, linux-mm, John Hubbard, David Howells, David Hildenbrand, Jan Kara Drop workaround in ext4 writeback code to handle a situation when MM reclaims fs-private page data from a page that is (or becomes) dirty. After the previous commit this should not happen anymore. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/inode.c | 16 ---------------- 1 file changed, 16 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9d9f414f99fe..46078651ce32 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2657,22 +2657,6 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd) wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); - /* - * Should never happen but for buggy code in - * other subsystems that call - * set_page_dirty() without properly warning - * the file system first. See [1] for more - * information. - * - * [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz - */ - if (!page_has_buffers(page)) { - ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index); - ClearPageDirty(page); - unlock_page(page); - continue; - } - if (mpd->map.m_len == 0) mpd->first_page = page->index; mpd->next_page = page->index + 1; -- 2.35.3 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback 2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara 2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara 2023-02-09 12:31 ` [PATCH 2/5] ext4: Drop workaround for mm reclaiming fs private page data Jan Kara @ 2023-02-09 12:31 ` Jan Kara 2023-02-10 1:54 ` John Hubbard 2023-02-09 12:31 ` [PATCH 4/5] block: Add support for bouncing pinned pages Jan Kara 2023-02-09 12:31 ` [PATCH 5/5] iomap: Bounce pinned pages during writeback Jan Kara 4 siblings, 1 reply; 26+ messages in thread From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw) To: linux-fsdevel Cc: linux-block, linux-mm, John Hubbard, David Howells, David Hildenbrand, Jan Kara When a folio is pinned, there is no point in trying to write it during memory cleaning writeback. We cannot reclaim the folio until it is unpinned anyway and we cannot even be sure the folio is really clean. On top of that writeback of such folio may be problematic as the data can change while the writeback is running thus causing checksum or DIF/DIX failures. So just don't bother doing memory cleaning writeback for pinned folios. 
Signed-off-by: Jan Kara <jack@suse.cz> --- fs/9p/vfs_addr.c | 2 +- fs/afs/file.c | 2 +- fs/afs/write.c | 6 +++--- fs/btrfs/extent_io.c | 14 +++++++------- fs/btrfs/free-space-cache.c | 2 +- fs/btrfs/inode.c | 2 +- fs/btrfs/subpage.c | 2 +- fs/ceph/addr.c | 6 +++--- fs/cifs/file.c | 6 +++--- fs/ext4/inode.c | 4 ++-- fs/f2fs/checkpoint.c | 4 ++-- fs/f2fs/compress.c | 2 +- fs/f2fs/data.c | 2 +- fs/f2fs/dir.c | 2 +- fs/f2fs/gc.c | 4 ++-- fs/f2fs/inline.c | 2 +- fs/f2fs/node.c | 10 +++++----- fs/fuse/file.c | 2 +- fs/gfs2/aops.c | 2 +- fs/nfs/write.c | 2 +- fs/nilfs2/page.c | 2 +- fs/nilfs2/segment.c | 8 ++++---- fs/orangefs/inode.c | 2 +- fs/ubifs/file.c | 2 +- include/linux/pagemap.h | 5 +++-- mm/folio-compat.c | 4 ++-- mm/migrate.c | 2 +- mm/page-writeback.c | 24 ++++++++++++++++++++---- mm/vmscan.c | 2 +- 29 files changed, 73 insertions(+), 56 deletions(-) diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c index 97599edbc300..a14ff3c02eb1 100644 --- a/fs/9p/vfs_addr.c +++ b/fs/9p/vfs_addr.c @@ -221,7 +221,7 @@ static int v9fs_launder_folio(struct folio *folio) { int retval; - if (folio_clear_dirty_for_io(folio)) { + if (folio_clear_dirty_for_io(NULL, folio)) { retval = v9fs_vfs_write_folio_locked(folio); if (retval) return retval; diff --git a/fs/afs/file.c b/fs/afs/file.c index 68d6d5dc608d..8a81ac9c12fa 100644 --- a/fs/afs/file.c +++ b/fs/afs/file.c @@ -453,7 +453,7 @@ static void afs_invalidate_dirty(struct folio *folio, size_t offset, undirty: trace_afs_folio_dirty(vnode, tracepoint_string("undirty"), folio); - folio_clear_dirty_for_io(folio); + folio_clear_dirty_for_io(NULL, folio); full_invalidate: trace_afs_folio_dirty(vnode, tracepoint_string("inval"), folio); folio_detach_private(folio); diff --git a/fs/afs/write.c b/fs/afs/write.c index 19df10d63323..9a5e6d59040c 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -555,7 +555,7 @@ static void afs_extend_writeback(struct address_space *mapping, folio = page_folio(pvec.pages[i]); trace_afs_folio_dirty(vnode, 
tracepoint_string("store+"), folio); - if (!folio_clear_dirty_for_io(folio)) + if (!folio_clear_dirty_for_io(NULL, folio)) BUG(); if (folio_start_writeback(folio)) BUG(); @@ -769,7 +769,7 @@ static int afs_writepages_region(struct address_space *mapping, continue; } - if (!folio_clear_dirty_for_io(folio)) + if (!folio_clear_dirty_for_io(NULL, folio)) BUG(); ret = afs_write_back_from_locked_folio(mapping, wbc, folio, start, end); folio_put(folio); @@ -1000,7 +1000,7 @@ int afs_launder_folio(struct folio *folio) _enter("{%lx}", folio->index); priv = (unsigned long)folio_get_private(folio); - if (folio_clear_dirty_for_io(folio)) { + if (folio_clear_dirty_for_io(NULL, folio)) { f = 0; t = folio_size(folio); if (folio_test_private(folio)) { diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 9bd32daa9b9a..2026f567cbdd 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -215,7 +215,7 @@ void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end) while (index <= end_index) { page = find_get_page(inode->i_mapping, index); BUG_ON(!page); /* Pages should be in the extent_io_tree */ - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); put_page(page); index++; } @@ -2590,7 +2590,7 @@ static int write_one_subpage_eb(struct extent_buffer *eb, no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page, eb->start, eb->len); if (no_dirty_ebs) - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); bio_ctrl->end_io_func = end_bio_subpage_eb_writepage; @@ -2633,7 +2633,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb, for (i = 0; i < num_pages; i++) { struct page *p = eb->pages[i]; - clear_page_dirty_for_io(p); + clear_page_dirty_for_io(NULL, p); set_page_writeback(p); ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, bio_ctrl, disk_bytenr, p, @@ -2655,7 +2655,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb, if (unlikely(ret)) { for (; i < 
num_pages; i++) { struct page *p = eb->pages[i]; - clear_page_dirty_for_io(p); + clear_page_dirty_for_io(NULL, p); unlock_page(p); } } @@ -3083,7 +3083,7 @@ static int extent_write_cache_pages(struct address_space *mapping, } if (PageWriteback(page) || - !clear_page_dirty_for_io(page)) { + !clear_page_dirty_for_io(wbc, page)) { unlock_page(page); continue; } @@ -3174,7 +3174,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end) */ ASSERT(PageLocked(page)); ASSERT(PageDirty(page)); - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); ret = __extent_writepage(page, &wbc_writepages, &bio_ctrl); ASSERT(ret <= 0); if (ret < 0) { @@ -4698,7 +4698,7 @@ static void btree_clear_page_dirty(struct page *page) { ASSERT(PageDirty(page)); ASSERT(PageLocked(page)); - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); xa_lock_irq(&page->mapping->i_pages); if (!PageDirty(page)) __xa_clear_mark(&page->mapping->i_pages, diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 0d250d052487..02ec9987fc17 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -507,7 +507,7 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate) } for (i = 0; i < io_ctl->num_pages; i++) - clear_page_dirty_for_io(io_ctl->pages[i]); + clear_page_dirty_for_io(NULL, io_ctl->pages[i]); return 0; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 98a800b8bd43..26820c697c5b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3002,7 +3002,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work) */ mapping_set_error(page->mapping, ret); end_extent_writepage(page, ret, page_start, page_end); - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); SetPageError(page); } btrfs_page_clear_checked(inode->root->fs_info, page, page_start, PAGE_SIZE); diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c index dd46b978ac2c..f85b5a2ccdab 100644 --- 
a/fs/btrfs/subpage.c +++ b/fs/btrfs/subpage.c @@ -515,7 +515,7 @@ void btrfs_subpage_clear_dirty(const struct btrfs_fs_info *fs_info, last = btrfs_subpage_clear_and_test_dirty(fs_info, page, start, len); if (last) - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); } void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info, diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index cac4083e387a..ff940c9cd1cf 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -926,7 +926,7 @@ static int ceph_writepages_start(struct address_space *mapping, folio->index, ceph_wbc.i_size); if ((ceph_wbc.size_stable || folio_pos(folio) >= i_size_read(inode)) && - folio_clear_dirty_for_io(folio)) + folio_clear_dirty_for_io(wbc, folio)) folio_invalidate(folio, 0, folio_size(folio)); folio_unlock(folio); @@ -948,7 +948,7 @@ static int ceph_writepages_start(struct address_space *mapping, wait_on_page_fscache(page); } - if (!clear_page_dirty_for_io(page)) { + if (!clear_page_dirty_for_io(wbc, page)) { dout("%p !clear_page_dirty_for_io\n", page); unlock_page(page); continue; @@ -1282,7 +1282,7 @@ ceph_find_incompatible(struct page *page) /* yay, writeable, do it now (without dropping page lock) */ dout(" page %p snapc %p not current, but oldest\n", page, snapc); - if (clear_page_dirty_for_io(page)) { + if (clear_page_dirty_for_io(NULL, page)) { int r = writepage_nounlock(page, NULL); if (r < 0) return ERR_PTR(r); diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 22dfc1f8b4f1..93e36829896e 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -2342,7 +2342,7 @@ cifs_writev_requeue(struct cifs_writedata *wdata) for (j = 0; j < nr_pages; j++) { wdata2->pages[j] = wdata->pages[i + j]; lock_page(wdata2->pages[j]); - clear_page_dirty_for_io(wdata2->pages[j]); + clear_page_dirty_for_io(NULL, wdata2->pages[j]); } wdata2->sync_mode = wdata->sync_mode; @@ -2582,7 +2582,7 @@ wdata_prepare_pages(struct cifs_writedata *wdata, unsigned int found_pages, wait_on_page_writeback(page); 
if (PageWriteback(page) || - !clear_page_dirty_for_io(page)) { + !clear_page_dirty_for_io(wbc, page)) { unlock_page(page); break; } @@ -5076,7 +5076,7 @@ static int cifs_launder_folio(struct folio *folio) cifs_dbg(FYI, "Launder page: %lu\n", folio->index); - if (folio_clear_dirty_for_io(folio)) + if (folio_clear_dirty_for_io(&wbc, folio)) rc = cifs_writepage_locked(&folio->page, &wbc); folio_wait_fscache(folio); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 46078651ce32..7082b6ba8e12 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1616,7 +1616,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd, BUG_ON(folio_test_writeback(folio)); if (invalidate) { if (folio_mapped(folio)) - folio_clear_dirty_for_io(folio); + folio_clear_dirty_for_io(NULL, folio); block_invalidate_folio(folio, 0, folio_size(folio)); folio_clear_uptodate(folio); @@ -2106,7 +2106,7 @@ static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page) int err; BUG_ON(page->index != mpd->first_page); - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); /* * We have to be very careful here! 
Nothing protects writeback path * against i_size changes and the page can be writeably mapped into diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c index 56f7d0d6a8b2..37f951b11153 100644 --- a/fs/f2fs/checkpoint.c +++ b/fs/f2fs/checkpoint.c @@ -435,7 +435,7 @@ long f2fs_sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, f2fs_wait_on_page_writeback(page, META, true, true); - if (!clear_page_dirty_for_io(page)) + if (!clear_page_dirty_for_io(NULL, page)) goto continue_unlock; if (__f2fs_write_meta_page(page, &wbc, io_type)) { @@ -1415,7 +1415,7 @@ static void commit_checkpoint(struct f2fs_sb_info *sbi, memcpy(page_address(page), src, PAGE_SIZE); set_page_dirty(page); - if (unlikely(!clear_page_dirty_for_io(page))) + if (unlikely(!clear_page_dirty_for_io(NULL, page))) f2fs_bug_on(sbi, 1); /* writeout cp pack 2 page */ diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c index 2532f369cb10..efd09e280b2c 100644 --- a/fs/f2fs/compress.c +++ b/fs/f2fs/compress.c @@ -1459,7 +1459,7 @@ static int f2fs_write_raw_pages(struct compress_ctx *cc, if (!PageDirty(cc->rpages[i])) goto continue_unlock; - if (!clear_page_dirty_for_io(cc->rpages[i])) + if (!clear_page_dirty_for_io(NULL, cc->rpages[i])) goto continue_unlock; ret = f2fs_write_single_data_page(cc->rpages[i], &_submitted, diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index 97e816590cd9..f1d622b64690 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -3102,7 +3102,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping, goto continue_unlock; } - if (!clear_page_dirty_for_io(page)) + if (!clear_page_dirty_for_io(NULL, page)) goto continue_unlock; #ifdef CONFIG_F2FS_FS_COMPRESSION diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c index 8e025157f35c..73005e711b83 100644 --- a/fs/f2fs/dir.c +++ b/fs/f2fs/dir.c @@ -938,7 +938,7 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page, if (bit_pos == NR_DENTRY_IN_BLOCK && !f2fs_truncate_hole(dir, page->index, page->index + 1)) { 
f2fs_clear_page_cache_dirty_tag(page); - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); ClearPageUptodate(page); clear_page_private_gcing(page); diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c index 6e2cae3d2e71..1f647287e3eb 100644 --- a/fs/f2fs/gc.c +++ b/fs/f2fs/gc.c @@ -1361,7 +1361,7 @@ static int move_data_block(struct inode *inode, block_t bidx, f2fs_invalidate_compress_page(fio.sbi, fio.old_blkaddr); set_page_dirty(fio.encrypted_page); - if (clear_page_dirty_for_io(fio.encrypted_page)) + if (clear_page_dirty_for_io(NULL, fio.encrypted_page)) dec_page_count(fio.sbi, F2FS_DIRTY_META); set_page_writeback(fio.encrypted_page); @@ -1446,7 +1446,7 @@ static int move_data_page(struct inode *inode, block_t bidx, int gc_type, f2fs_wait_on_page_writeback(page, DATA, true, true); set_page_dirty(page); - if (clear_page_dirty_for_io(page)) { + if (clear_page_dirty_for_io(NULL, page)) { inode_dec_dirty_pages(inode); f2fs_remove_dirty_inode(inode); } diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c index 21a495234ffd..2bfade0ead67 100644 --- a/fs/f2fs/inline.c +++ b/fs/f2fs/inline.c @@ -170,7 +170,7 @@ int f2fs_convert_inline_page(struct dnode_of_data *dn, struct page *page) set_page_dirty(page); /* clear dirty state */ - dirty = clear_page_dirty_for_io(page); + dirty = clear_page_dirty_for_io(NULL, page); /* write data page to try to make data consistent */ set_page_writeback(page); diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index dde4c0458704..6f5571cac2b3 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -124,7 +124,7 @@ static void clear_node_page_dirty(struct page *page) { if (PageDirty(page)) { f2fs_clear_page_cache_dirty_tag(page); - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); dec_page_count(F2FS_P_SB(page), F2FS_DIRTY_NODES); } ClearPageUptodate(page); @@ -1501,7 +1501,7 @@ static void flush_inline_data(struct f2fs_sb_info *sbi, nid_t ino) if (!PageDirty(page)) goto page_out; - if (!clear_page_dirty_for_io(page)) + if 
(!clear_page_dirty_for_io(NULL, page)) goto page_out; ret = f2fs_write_inline_data(inode, page); @@ -1696,7 +1696,7 @@ int f2fs_move_node_page(struct page *node_page, int gc_type) set_page_dirty(node_page); - if (!clear_page_dirty_for_io(node_page)) { + if (!clear_page_dirty_for_io(NULL, node_page)) { err = -EAGAIN; goto out_page; } @@ -1803,7 +1803,7 @@ int f2fs_fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode, set_page_dirty(page); } - if (!clear_page_dirty_for_io(page)) + if (!clear_page_dirty_for_io(NULL, page)) goto continue_unlock; ret = __write_node_page(page, atomic && @@ -2011,7 +2011,7 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi, write_node: f2fs_wait_on_page_writeback(page, NODE, true, true); - if (!clear_page_dirty_for_io(page)) + if (!clear_page_dirty_for_io(NULL, page)) goto continue_unlock; set_fsync_mark(page, 0); diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 875314ee6f59..cb9561128b4b 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -2399,7 +2399,7 @@ static int fuse_write_end(struct file *file, struct address_space *mapping, static int fuse_launder_folio(struct folio *folio) { int err = 0; - if (folio_clear_dirty_for_io(folio)) { + if (folio_clear_dirty_for_io(NULL, folio)) { struct inode *inode = folio->mapping->host; /* Serialize with pending writeback for the same page */ diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c index e782b4f1d104..cf784dd5fd3b 100644 --- a/fs/gfs2/aops.c +++ b/fs/gfs2/aops.c @@ -247,7 +247,7 @@ static int gfs2_write_jdata_pagevec(struct address_space *mapping, } BUG_ON(PageWriteback(page)); - if (!clear_page_dirty_for_io(page)) + if (!clear_page_dirty_for_io(wbc, page)) goto continue_unlock; trace_wbc_writepage(wbc, inode_to_bdi(inode)); diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 80c240e50952..c07b686c84ce 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -2089,7 +2089,7 @@ int nfs_wb_page(struct inode *inode, struct page *page) for (;;) { wait_on_page_writeback(page); - if 
(clear_page_dirty_for_io(page)) { + if (clear_page_dirty_for_io(&wbc, page)) { ret = nfs_writepage_locked(page, &wbc); if (ret < 0) goto out_error; diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c index 39b7eea2642a..9c3cc20b446e 100644 --- a/fs/nilfs2/page.c +++ b/fs/nilfs2/page.c @@ -456,7 +456,7 @@ int __nilfs_clear_page_dirty(struct page *page) __xa_clear_mark(&mapping->i_pages, page_index(page), PAGECACHE_TAG_DIRTY); xa_unlock_irq(&mapping->i_pages); - return clear_page_dirty_for_io(page); + return clear_page_dirty_for_io(NULL, page); } xa_unlock_irq(&mapping->i_pages); return 0; diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index 76c3bd88b858..123494739030 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1644,7 +1644,7 @@ static void nilfs_begin_page_io(struct page *page) return; lock_page(page); - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); set_page_writeback(page); unlock_page(page); } @@ -1662,7 +1662,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci) if (bh->b_page != bd_page) { if (bd_page) { lock_page(bd_page); - clear_page_dirty_for_io(bd_page); + clear_page_dirty_for_io(NULL, bd_page); set_page_writeback(bd_page); unlock_page(bd_page); } @@ -1676,7 +1676,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci) if (bh == segbuf->sb_super_root) { if (bh->b_page != bd_page) { lock_page(bd_page); - clear_page_dirty_for_io(bd_page); + clear_page_dirty_for_io(NULL, bd_page); set_page_writeback(bd_page); unlock_page(bd_page); bd_page = bh->b_page; @@ -1691,7 +1691,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci) } if (bd_page) { lock_page(bd_page); - clear_page_dirty_for_io(bd_page); + clear_page_dirty_for_io(NULL, bd_page); set_page_writeback(bd_page); unlock_page(bd_page); } diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c index 4df560894386..829de5553a77 100644 --- a/fs/orangefs/inode.c +++ b/fs/orangefs/inode.c @@ -501,7 +501,7 @@ static int 
orangefs_launder_folio(struct folio *folio) .nr_to_write = 0, }; folio_wait_writeback(folio); - if (folio_clear_dirty_for_io(folio)) { + if (folio_clear_dirty_for_io(&wbc, folio)) { r = orangefs_writepage_locked(&folio->page, &wbc); folio_end_writeback(folio); } diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c index f2353dd676ef..41d764fd5511 100644 --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1159,7 +1159,7 @@ static int do_truncation(struct ubifs_info *c, struct inode *inode, */ ubifs_assert(c, PagePrivate(page)); - clear_page_dirty_for_io(page); + clear_page_dirty_for_io(NULL, page); if (UBIFS_BLOCKS_PER_PAGE_SHIFT) offset = new_size & (PAGE_SIZE - 1); diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 29e1f9e76eb6..81c42a95cf8d 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1059,8 +1059,9 @@ static inline void folio_cancel_dirty(struct folio *folio) if (folio_test_dirty(folio)) __folio_cancel_dirty(folio); } -bool folio_clear_dirty_for_io(struct folio *folio); -bool clear_page_dirty_for_io(struct page *page); +bool folio_clear_dirty_for_io(struct writeback_control *wbc, + struct folio *folio); +bool clear_page_dirty_for_io(struct writeback_control *wbc, struct page *page); void folio_invalidate(struct folio *folio, size_t offset, size_t length); int __must_check folio_write_one(struct folio *folio); static inline int __must_check write_one_page(struct page *page) diff --git a/mm/folio-compat.c b/mm/folio-compat.c index 69ed25790c68..748f82def674 100644 --- a/mm/folio-compat.c +++ b/mm/folio-compat.c @@ -63,9 +63,9 @@ int __set_page_dirty_nobuffers(struct page *page) } EXPORT_SYMBOL(__set_page_dirty_nobuffers); -bool clear_page_dirty_for_io(struct page *page) +bool clear_page_dirty_for_io(struct writeback_control *wbc, struct page *page) { - return folio_clear_dirty_for_io(page_folio(page)); + return folio_clear_dirty_for_io(wbc, page_folio(page)); } EXPORT_SYMBOL(clear_page_dirty_for_io); diff --git a/mm/migrate.c 
b/mm/migrate.c index a4d3fc65085f..0bda652153b9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -870,7 +870,7 @@ static int writeout(struct address_space *mapping, struct folio *folio) /* No write method for the address space */ return -EINVAL; - if (!folio_clear_dirty_for_io(folio)) + if (!folio_clear_dirty_for_io(&wbc, folio)) /* Someone else already triggered a write */ return -EAGAIN; diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ad608ef2a243..2d70070e533c 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2465,7 +2465,7 @@ int write_cache_pages(struct address_space *mapping, } BUG_ON(PageWriteback(page)); - if (!clear_page_dirty_for_io(page)) + if (!clear_page_dirty_for_io(wbc, page)) goto continue_unlock; trace_wbc_writepage(wbc, inode_to_bdi(mapping->host)); @@ -2628,7 +2628,7 @@ int folio_write_one(struct folio *folio) folio_wait_writeback(folio); - if (folio_clear_dirty_for_io(folio)) { + if (folio_clear_dirty_for_io(&wbc, folio)) { folio_get(folio); ret = mapping->a_ops->writepage(&folio->page, &wbc); if (ret == 0) @@ -2924,7 +2924,7 @@ EXPORT_SYMBOL(__folio_cancel_dirty); /* * Clear a folio's dirty flag, while caring for dirty memory accounting. - * Returns true if the folio was previously dirty. + * Returns true if the folio was previously dirty and should be written back. * * This is for preparing to put the folio under writeout. We leave * the folio tagged as dirty in the xarray so that a concurrent @@ -2935,8 +2935,14 @@ EXPORT_SYMBOL(__folio_cancel_dirty); * * This incoherency between the folio's dirty flag and xarray tag is * unfortunate, but it only exists while the folio is locked. + * + * If the folio is pinned, its writeback is problematic so we just don't bother + * for memory cleaning writeback - this is why writeback control is passed in. + * If it is NULL, we assume pinned pages are not expected (e.g. this can be + * a metadata page) and warn if the page is actually pinned. 
*/ -bool folio_clear_dirty_for_io(struct folio *folio) +bool folio_clear_dirty_for_io(struct writeback_control *wbc, + struct folio *folio) { struct address_space *mapping = folio_mapping(folio); bool ret = false; @@ -2975,6 +2981,16 @@ bool folio_clear_dirty_for_io(struct folio *folio) */ if (folio_mkclean(folio)) folio_mark_dirty(folio); + /* + * For pinned folios we have no chance to reclaim them anyway + * and we cannot be sure folio is ever clean. So memory + * cleaning writeback is pointless. Just skip it. + */ + if (wbc && wbc->sync_mode == WB_SYNC_NONE && + folio_maybe_dma_pinned(folio)) + return false; + /* Catch callers not expecting pinned pages */ + WARN_ON_ONCE(!wbc && folio_maybe_dma_pinned(folio)); /* * We carefully synchronise fault handlers against * installing a dirty pte and marking the folio dirty diff --git a/mm/vmscan.c b/mm/vmscan.c index ab3911a8b116..71a226b47ac6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1283,7 +1283,7 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping, if (mapping->a_ops->writepage == NULL) return PAGE_ACTIVATE; - if (folio_clear_dirty_for_io(folio)) { + if (folio_clear_dirty_for_io(NULL, folio)) { int res; struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, -- 2.35.3 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback
  2023-02-09 12:31 ` [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback Jan Kara
@ 2023-02-10  1:54   ` John Hubbard
  2023-02-10  2:10     ` John Hubbard
  2023-02-10 10:54     ` Jan Kara
  0 siblings, 2 replies; 26+ messages in thread
From: John Hubbard @ 2023-02-10  1:54 UTC (permalink / raw)
To: Jan Kara, linux-fsdevel
Cc: linux-block, linux-mm, David Howells, David Hildenbrand

On 2/9/23 04:31, Jan Kara wrote:
> When a folio is pinned, there is no point in trying to write it during
> memory cleaning writeback. We cannot reclaim the folio until it is
> unpinned anyway and we cannot even be sure the folio is really clean.
> On top of that writeback of such folio may be problematic as the data
> can change while the writeback is running thus causing checksum or
> DIF/DIX failures. So just don't bother doing memory cleaning writeback
> for pinned folios.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/9p/vfs_addr.c            |  2 +-
>  fs/afs/file.c               |  2 +-
>  fs/afs/write.c              |  6 +++---
>  fs/btrfs/extent_io.c        | 14 +++++++-------
>  fs/btrfs/free-space-cache.c |  2 +-
>  fs/btrfs/inode.c            |  2 +-
>  fs/btrfs/subpage.c          |  2 +-

Hi Jan!

Just a quick note that this breaks the btrfs build in subpage.c. Because,
unfortunately, btrfs creates 6 sets of functions via calls to a macro:
IMPLEMENT_BTRFS_PAGE_OPS(). And that expects only one argument to the
clear_page_func, and thus to clear_page_dirty_for_io().

It seems infeasible (right?) to add another argument to the other
clear_page_func functions, which include:

    ClearPageUptodate
    ClearPageError
    end_page_writeback
    ClearPageOrdered
    ClearPageChecked

, so I expect IMPLEMENT_BTRFS_PAGE_OPS() may need to be partially
unrolled, in order to pass in the new writeback control arg to
clear_page_dirty_for_io().
thanks, -- John Hubbard NVIDIA > fs/ceph/addr.c | 6 +++--- > fs/cifs/file.c | 6 +++--- > fs/ext4/inode.c | 4 ++-- > fs/f2fs/checkpoint.c | 4 ++-- > fs/f2fs/compress.c | 2 +- > fs/f2fs/data.c | 2 +- > fs/f2fs/dir.c | 2 +- > fs/f2fs/gc.c | 4 ++-- > fs/f2fs/inline.c | 2 +- > fs/f2fs/node.c | 10 +++++----- > fs/fuse/file.c | 2 +- > fs/gfs2/aops.c | 2 +- > fs/nfs/write.c | 2 +- > fs/nilfs2/page.c | 2 +- > fs/nilfs2/segment.c | 8 ++++---- > fs/orangefs/inode.c | 2 +- > fs/ubifs/file.c | 2 +- > include/linux/pagemap.h | 5 +++-- > mm/folio-compat.c | 4 ++-- > mm/migrate.c | 2 +- > mm/page-writeback.c | 24 ++++++++++++++++++++---- > mm/vmscan.c | 2 +- > 29 files changed, 73 insertions(+), 56 deletions(-) > > diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c > index 97599edbc300..a14ff3c02eb1 100644 > --- a/fs/9p/vfs_addr.c > +++ b/fs/9p/vfs_addr.c > @@ -221,7 +221,7 @@ static int v9fs_launder_folio(struct folio *folio) > { > int retval; > > - if (folio_clear_dirty_for_io(folio)) { > + if (folio_clear_dirty_for_io(NULL, folio)) { > retval = v9fs_vfs_write_folio_locked(folio); > if (retval) > return retval; > diff --git a/fs/afs/file.c b/fs/afs/file.c > index 68d6d5dc608d..8a81ac9c12fa 100644 > --- a/fs/afs/file.c > +++ b/fs/afs/file.c > @@ -453,7 +453,7 @@ static void afs_invalidate_dirty(struct folio *folio, size_t offset, > > undirty: > trace_afs_folio_dirty(vnode, tracepoint_string("undirty"), folio); > - folio_clear_dirty_for_io(folio); > + folio_clear_dirty_for_io(NULL, folio); > full_invalidate: > trace_afs_folio_dirty(vnode, tracepoint_string("inval"), folio); > folio_detach_private(folio); > diff --git a/fs/afs/write.c b/fs/afs/write.c > index 19df10d63323..9a5e6d59040c 100644 > --- a/fs/afs/write.c > +++ b/fs/afs/write.c > @@ -555,7 +555,7 @@ static void afs_extend_writeback(struct address_space *mapping, > folio = page_folio(pvec.pages[i]); > trace_afs_folio_dirty(vnode, tracepoint_string("store+"), folio); > > - if (!folio_clear_dirty_for_io(folio)) > + if 
(!folio_clear_dirty_for_io(NULL, folio)) > BUG(); > if (folio_start_writeback(folio)) > BUG(); > @@ -769,7 +769,7 @@ static int afs_writepages_region(struct address_space *mapping, > continue; > } > > - if (!folio_clear_dirty_for_io(folio)) > + if (!folio_clear_dirty_for_io(NULL, folio)) > BUG(); > ret = afs_write_back_from_locked_folio(mapping, wbc, folio, start, end); > folio_put(folio); > @@ -1000,7 +1000,7 @@ int afs_launder_folio(struct folio *folio) > _enter("{%lx}", folio->index); > > priv = (unsigned long)folio_get_private(folio); > - if (folio_clear_dirty_for_io(folio)) { > + if (folio_clear_dirty_for_io(NULL, folio)) { > f = 0; > t = folio_size(folio); > if (folio_test_private(folio)) { > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 9bd32daa9b9a..2026f567cbdd 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -215,7 +215,7 @@ void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end) > while (index <= end_index) { > page = find_get_page(inode->i_mapping, index); > BUG_ON(!page); /* Pages should be in the extent_io_tree */ > - clear_page_dirty_for_io(page); > + clear_page_dirty_for_io(NULL, page); > put_page(page); > index++; > } > @@ -2590,7 +2590,7 @@ static int write_one_subpage_eb(struct extent_buffer *eb, > no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page, > eb->start, eb->len); > if (no_dirty_ebs) > - clear_page_dirty_for_io(page); > + clear_page_dirty_for_io(NULL, page); > > bio_ctrl->end_io_func = end_bio_subpage_eb_writepage; > > @@ -2633,7 +2633,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb, > for (i = 0; i < num_pages; i++) { > struct page *p = eb->pages[i]; > > - clear_page_dirty_for_io(p); > + clear_page_dirty_for_io(NULL, p); > set_page_writeback(p); > ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, > bio_ctrl, disk_bytenr, p, > @@ -2655,7 +2655,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb, > if 
(unlikely(ret)) { > for (; i < num_pages; i++) { > struct page *p = eb->pages[i]; > - clear_page_dirty_for_io(p); > + clear_page_dirty_for_io(NULL, p); > unlock_page(p); > } > } > @@ -3083,7 +3083,7 @@ static int extent_write_cache_pages(struct address_space *mapping, > } > > if (PageWriteback(page) || > - !clear_page_dirty_for_io(page)) { > + !clear_page_dirty_for_io(wbc, page)) { > unlock_page(page); > continue; > } > @@ -3174,7 +3174,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end) > */ > ASSERT(PageLocked(page)); > ASSERT(PageDirty(page)); > - clear_page_dirty_for_io(page); > + clear_page_dirty_for_io(NULL, page); > ret = __extent_writepage(page, &wbc_writepages, &bio_ctrl); > ASSERT(ret <= 0); > if (ret < 0) { > @@ -4698,7 +4698,7 @@ static void btree_clear_page_dirty(struct page *page) > { > ASSERT(PageDirty(page)); > ASSERT(PageLocked(page)); > - clear_page_dirty_for_io(page); > + clear_page_dirty_for_io(NULL, page); > xa_lock_irq(&page->mapping->i_pages); > if (!PageDirty(page)) > __xa_clear_mark(&page->mapping->i_pages, > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c > index 0d250d052487..02ec9987fc17 100644 > --- a/fs/btrfs/free-space-cache.c > +++ b/fs/btrfs/free-space-cache.c > @@ -507,7 +507,7 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate) > } > > for (i = 0; i < io_ctl->num_pages; i++) > - clear_page_dirty_for_io(io_ctl->pages[i]); > + clear_page_dirty_for_io(NULL, io_ctl->pages[i]); > > return 0; > } > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 98a800b8bd43..26820c697c5b 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -3002,7 +3002,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work) > */ > mapping_set_error(page->mapping, ret); > end_extent_writepage(page, ret, page_start, page_end); > - clear_page_dirty_for_io(page); > + clear_page_dirty_for_io(NULL, page); > SetPageError(page); > } > 
btrfs_page_clear_checked(inode->root->fs_info, page, page_start, PAGE_SIZE); > diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c > index dd46b978ac2c..f85b5a2ccdab 100644 > --- a/fs/btrfs/subpage.c > +++ b/fs/btrfs/subpage.c > @@ -515,7 +515,7 @@ void btrfs_subpage_clear_dirty(const struct btrfs_fs_info *fs_info, > > last = btrfs_subpage_clear_and_test_dirty(fs_info, page, start, len); > if (last) > - clear_page_dirty_for_io(page); > + clear_page_dirty_for_io(NULL, page); > } > > void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info, > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c > index cac4083e387a..ff940c9cd1cf 100644 > --- a/fs/ceph/addr.c > +++ b/fs/ceph/addr.c > @@ -926,7 +926,7 @@ static int ceph_writepages_start(struct address_space *mapping, > folio->index, ceph_wbc.i_size); > if ((ceph_wbc.size_stable || > folio_pos(folio) >= i_size_read(inode)) && > - folio_clear_dirty_for_io(folio)) > + folio_clear_dirty_for_io(wbc, folio)) > folio_invalidate(folio, 0, > folio_size(folio)); > folio_unlock(folio); > @@ -948,7 +948,7 @@ static int ceph_writepages_start(struct address_space *mapping, > wait_on_page_fscache(page); > } > > - if (!clear_page_dirty_for_io(page)) { > + if (!clear_page_dirty_for_io(wbc, page)) { > dout("%p !clear_page_dirty_for_io\n", page); > unlock_page(page); > continue; > @@ -1282,7 +1282,7 @@ ceph_find_incompatible(struct page *page) > > /* yay, writeable, do it now (without dropping page lock) */ > dout(" page %p snapc %p not current, but oldest\n", page, snapc); > - if (clear_page_dirty_for_io(page)) { > + if (clear_page_dirty_for_io(NULL, page)) { > int r = writepage_nounlock(page, NULL); > if (r < 0) > return ERR_PTR(r); > diff --git a/fs/cifs/file.c b/fs/cifs/file.c > index 22dfc1f8b4f1..93e36829896e 100644 > --- a/fs/cifs/file.c > +++ b/fs/cifs/file.c > @@ -2342,7 +2342,7 @@ cifs_writev_requeue(struct cifs_writedata *wdata) > for (j = 0; j < nr_pages; j++) { > wdata2->pages[j] = wdata->pages[i + j]; > 
lock_page(wdata2->pages[j]); > - clear_page_dirty_for_io(wdata2->pages[j]); > + clear_page_dirty_for_io(NULL, wdata2->pages[j]); > } > > wdata2->sync_mode = wdata->sync_mode; > @@ -2582,7 +2582,7 @@ wdata_prepare_pages(struct cifs_writedata *wdata, unsigned int found_pages, > wait_on_page_writeback(page); > > if (PageWriteback(page) || > - !clear_page_dirty_for_io(page)) { > + !clear_page_dirty_for_io(wbc, page)) { > unlock_page(page); > break; > } > @@ -5076,7 +5076,7 @@ static int cifs_launder_folio(struct folio *folio) > > cifs_dbg(FYI, "Launder page: %lu\n", folio->index); > > - if (folio_clear_dirty_for_io(folio)) > + if (folio_clear_dirty_for_io(&wbc, folio)) > rc = cifs_writepage_locked(&folio->page, &wbc); > > folio_wait_fscache(folio); > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 46078651ce32..7082b6ba8e12 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1616,7 +1616,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd, > BUG_ON(folio_test_writeback(folio)); > if (invalidate) { > if (folio_mapped(folio)) > - folio_clear_dirty_for_io(folio); > + folio_clear_dirty_for_io(NULL, folio); > block_invalidate_folio(folio, 0, > folio_size(folio)); > folio_clear_uptodate(folio); > @@ -2106,7 +2106,7 @@ static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page) > int err; > > BUG_ON(page->index != mpd->first_page); > - clear_page_dirty_for_io(page); > + clear_page_dirty_for_io(NULL, page); > /* > * We have to be very careful here! 
Nothing protects writeback path > * against i_size changes and the page can be writeably mapped into > diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c > index 56f7d0d6a8b2..37f951b11153 100644 > --- a/fs/f2fs/checkpoint.c > +++ b/fs/f2fs/checkpoint.c > @@ -435,7 +435,7 @@ long f2fs_sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, > > f2fs_wait_on_page_writeback(page, META, true, true); > > - if (!clear_page_dirty_for_io(page)) > + if (!clear_page_dirty_for_io(NULL, page)) > goto continue_unlock; > > if (__f2fs_write_meta_page(page, &wbc, io_type)) { > @@ -1415,7 +1415,7 @@ static void commit_checkpoint(struct f2fs_sb_info *sbi, > memcpy(page_address(page), src, PAGE_SIZE); > > set_page_dirty(page); > - if (unlikely(!clear_page_dirty_for_io(page))) > + if (unlikely(!clear_page_dirty_for_io(NULL, page))) > f2fs_bug_on(sbi, 1); > > /* writeout cp pack 2 page */ > diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c > index 2532f369cb10..efd09e280b2c 100644 > --- a/fs/f2fs/compress.c > +++ b/fs/f2fs/compress.c > @@ -1459,7 +1459,7 @@ static int f2fs_write_raw_pages(struct compress_ctx *cc, > if (!PageDirty(cc->rpages[i])) > goto continue_unlock; > > - if (!clear_page_dirty_for_io(cc->rpages[i])) > + if (!clear_page_dirty_for_io(NULL, cc->rpages[i])) > goto continue_unlock; > > ret = f2fs_write_single_data_page(cc->rpages[i], &_submitted, > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c > index 97e816590cd9..f1d622b64690 100644 > --- a/fs/f2fs/data.c > +++ b/fs/f2fs/data.c > @@ -3102,7 +3102,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping, > goto continue_unlock; > } > > - if (!clear_page_dirty_for_io(page)) > + if (!clear_page_dirty_for_io(NULL, page)) > goto continue_unlock; > > #ifdef CONFIG_F2FS_FS_COMPRESSION > diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c > index 8e025157f35c..73005e711b83 100644 > --- a/fs/f2fs/dir.c > +++ b/fs/f2fs/dir.c > @@ -938,7 +938,7 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page 
*page, > if (bit_pos == NR_DENTRY_IN_BLOCK && > !f2fs_truncate_hole(dir, page->index, page->index + 1)) { > f2fs_clear_page_cache_dirty_tag(page); > - clear_page_dirty_for_io(page); > + clear_page_dirty_for_io(NULL, page); > ClearPageUptodate(page); > > clear_page_private_gcing(page); > diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c > index 6e2cae3d2e71..1f647287e3eb 100644 > --- a/fs/f2fs/gc.c > +++ b/fs/f2fs/gc.c > @@ -1361,7 +1361,7 @@ static int move_data_block(struct inode *inode, block_t bidx, > f2fs_invalidate_compress_page(fio.sbi, fio.old_blkaddr); > > set_page_dirty(fio.encrypted_page); > - if (clear_page_dirty_for_io(fio.encrypted_page)) > + if (clear_page_dirty_for_io(NULL, fio.encrypted_page)) > dec_page_count(fio.sbi, F2FS_DIRTY_META); > > set_page_writeback(fio.encrypted_page); > @@ -1446,7 +1446,7 @@ static int move_data_page(struct inode *inode, block_t bidx, int gc_type, > f2fs_wait_on_page_writeback(page, DATA, true, true); > > set_page_dirty(page); > - if (clear_page_dirty_for_io(page)) { > + if (clear_page_dirty_for_io(NULL, page)) { > inode_dec_dirty_pages(inode); > f2fs_remove_dirty_inode(inode); > } > diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c > index 21a495234ffd..2bfade0ead67 100644 > --- a/fs/f2fs/inline.c > +++ b/fs/f2fs/inline.c > @@ -170,7 +170,7 @@ int f2fs_convert_inline_page(struct dnode_of_data *dn, struct page *page) > set_page_dirty(page); > > /* clear dirty state */ > - dirty = clear_page_dirty_for_io(page); > + dirty = clear_page_dirty_for_io(NULL, page); > > /* write data page to try to make data consistent */ > set_page_writeback(page); > diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c > index dde4c0458704..6f5571cac2b3 100644 > --- a/fs/f2fs/node.c > +++ b/fs/f2fs/node.c > @@ -124,7 +124,7 @@ static void clear_node_page_dirty(struct page *page) > { > if (PageDirty(page)) { > f2fs_clear_page_cache_dirty_tag(page); > - clear_page_dirty_for_io(page); > + clear_page_dirty_for_io(NULL, page); > dec_page_count(F2FS_P_SB(page), 
 					F2FS_DIRTY_NODES);
 	}
 	ClearPageUptodate(page);
@@ -1501,7 +1501,7 @@ static void flush_inline_data(struct f2fs_sb_info *sbi, nid_t ino)
 		if (!PageDirty(page))
 			goto page_out;
 
-		if (!clear_page_dirty_for_io(page))
+		if (!clear_page_dirty_for_io(NULL, page))
 			goto page_out;
 
 		ret = f2fs_write_inline_data(inode, page);
@@ -1696,7 +1696,7 @@ int f2fs_move_node_page(struct page *node_page, int gc_type)
 
 		set_page_dirty(node_page);
 
-		if (!clear_page_dirty_for_io(node_page)) {
+		if (!clear_page_dirty_for_io(NULL, node_page)) {
 			err = -EAGAIN;
 			goto out_page;
 		}
@@ -1803,7 +1803,7 @@ int f2fs_fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
 				set_page_dirty(page);
 			}
 
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(NULL, page))
 				goto continue_unlock;
 
 			ret = __write_node_page(page, atomic &&
@@ -2011,7 +2011,7 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
 write_node:
 			f2fs_wait_on_page_writeback(page, NODE, true, true);
 
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(NULL, page))
 				goto continue_unlock;
 
 			set_fsync_mark(page, 0);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 875314ee6f59..cb9561128b4b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2399,7 +2399,7 @@ static int fuse_write_end(struct file *file, struct address_space *mapping,
 static int fuse_launder_folio(struct folio *folio)
 {
 	int err = 0;
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(NULL, folio)) {
 		struct inode *inode = folio->mapping->host;
 
 		/* Serialize with pending writeback for the same page */
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index e782b4f1d104..cf784dd5fd3b 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -247,7 +247,7 @@ static int gfs2_write_jdata_pagevec(struct address_space *mapping,
 		}
 
 		BUG_ON(PageWriteback(page));
-		if (!clear_page_dirty_for_io(page))
+		if (!clear_page_dirty_for_io(wbc, page))
 			goto continue_unlock;
 
 		trace_wbc_writepage(wbc, inode_to_bdi(inode));
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 80c240e50952..c07b686c84ce 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2089,7 +2089,7 @@ int nfs_wb_page(struct inode *inode, struct page *page)
 
 	for (;;) {
 		wait_on_page_writeback(page);
-		if (clear_page_dirty_for_io(page)) {
+		if (clear_page_dirty_for_io(&wbc, page)) {
 			ret = nfs_writepage_locked(page, &wbc);
 			if (ret < 0)
 				goto out_error;
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 39b7eea2642a..9c3cc20b446e 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -456,7 +456,7 @@ int __nilfs_clear_page_dirty(struct page *page)
 			__xa_clear_mark(&mapping->i_pages, page_index(page),
 					PAGECACHE_TAG_DIRTY);
 			xa_unlock_irq(&mapping->i_pages);
-			return clear_page_dirty_for_io(page);
+			return clear_page_dirty_for_io(NULL, page);
 		}
 		xa_unlock_irq(&mapping->i_pages);
 		return 0;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 76c3bd88b858..123494739030 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1644,7 +1644,7 @@ static void nilfs_begin_page_io(struct page *page)
 		return;
 
 	lock_page(page);
-	clear_page_dirty_for_io(page);
+	clear_page_dirty_for_io(NULL, page);
 	set_page_writeback(page);
 	unlock_page(page);
 }
@@ -1662,7 +1662,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 			if (bh->b_page != bd_page) {
 				if (bd_page) {
 					lock_page(bd_page);
-					clear_page_dirty_for_io(bd_page);
+					clear_page_dirty_for_io(NULL, bd_page);
 					set_page_writeback(bd_page);
 					unlock_page(bd_page);
 				}
@@ -1676,7 +1676,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 			if (bh == segbuf->sb_super_root) {
 				if (bh->b_page != bd_page) {
 					lock_page(bd_page);
-					clear_page_dirty_for_io(bd_page);
+					clear_page_dirty_for_io(NULL, bd_page);
 					set_page_writeback(bd_page);
 					unlock_page(bd_page);
 					bd_page = bh->b_page;
@@ -1691,7 +1691,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 	}
 	if (bd_page) {
 		lock_page(bd_page);
-		clear_page_dirty_for_io(bd_page);
+		clear_page_dirty_for_io(NULL, bd_page);
 		set_page_writeback(bd_page);
 		unlock_page(bd_page);
 	}
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 4df560894386..829de5553a77 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -501,7 +501,7 @@ static int orangefs_launder_folio(struct folio *folio)
 		.nr_to_write = 0,
 	};
 	folio_wait_writeback(folio);
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(&wbc, folio)) {
 		r = orangefs_writepage_locked(&folio->page, &wbc);
 		folio_end_writeback(folio);
 	}
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index f2353dd676ef..41d764fd5511 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1159,7 +1159,7 @@ static int do_truncation(struct ubifs_info *c, struct inode *inode,
 			 */
 			ubifs_assert(c, PagePrivate(page));
 
-			clear_page_dirty_for_io(page);
+			clear_page_dirty_for_io(NULL, page);
 			if (UBIFS_BLOCKS_PER_PAGE_SHIFT)
 				offset = new_size &
					 (PAGE_SIZE - 1);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 29e1f9e76eb6..81c42a95cf8d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1059,8 +1059,9 @@ static inline void folio_cancel_dirty(struct folio *folio)
 	if (folio_test_dirty(folio))
 		__folio_cancel_dirty(folio);
 }
-bool folio_clear_dirty_for_io(struct folio *folio);
-bool clear_page_dirty_for_io(struct page *page);
+bool folio_clear_dirty_for_io(struct writeback_control *wbc,
+			      struct folio *folio);
+bool clear_page_dirty_for_io(struct writeback_control *wbc, struct page *page);
 void folio_invalidate(struct folio *folio, size_t offset, size_t length);
 int __must_check folio_write_one(struct folio *folio);
 static inline int __must_check write_one_page(struct page *page)
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 69ed25790c68..748f82def674 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -63,9 +63,9 @@ int __set_page_dirty_nobuffers(struct page *page)
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
-bool clear_page_dirty_for_io(struct page *page)
+bool clear_page_dirty_for_io(struct writeback_control *wbc, struct page *page)
 {
-	return folio_clear_dirty_for_io(page_folio(page));
+	return folio_clear_dirty_for_io(wbc, page_folio(page));
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index a4d3fc65085f..0bda652153b9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -870,7 +870,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
 		/* No write method for the address space */
 		return -EINVAL;
 
-	if (!folio_clear_dirty_for_io(folio))
+	if (!folio_clear_dirty_for_io(&wbc, folio))
 		/* Someone else already triggered a write */
 		return -EAGAIN;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ad608ef2a243..2d70070e533c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2465,7 +2465,7 @@ int write_cache_pages(struct address_space *mapping,
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(wbc, page))
 				goto continue_unlock;
 
 			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
@@ -2628,7 +2628,7 @@ int folio_write_one(struct folio *folio)
 
 	folio_wait_writeback(folio);
 
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(&wbc, folio)) {
 		folio_get(folio);
 		ret = mapping->a_ops->writepage(&folio->page, &wbc);
 		if (ret == 0)
@@ -2924,7 +2924,7 @@ EXPORT_SYMBOL(__folio_cancel_dirty);
 
 /*
  * Clear a folio's dirty flag, while caring for dirty memory accounting.
- * Returns true if the folio was previously dirty.
+ * Returns true if the folio was previously dirty and should be written back.
  *
  * This is for preparing to put the folio under writeout.  We leave
  * the folio tagged as dirty in the xarray so that a concurrent
@@ -2935,8 +2935,14 @@ EXPORT_SYMBOL(__folio_cancel_dirty);
 *
 * This incoherency between the folio's dirty flag and xarray tag is
 * unfortunate, but it only exists while the folio is locked.
+ *
+ * If the folio is pinned, its writeback is problematic so we just don't bother
+ * for memory cleaning writeback - this is why writeback control is passed in.
+ * If it is NULL, we assume pinned pages are not expected (e.g. this can be
+ * a metadata page) and warn if the page is actually pinned.
 */
-bool folio_clear_dirty_for_io(struct folio *folio)
+bool folio_clear_dirty_for_io(struct writeback_control *wbc,
+			      struct folio *folio)
 {
 	struct address_space *mapping = folio_mapping(folio);
 	bool ret = false;
@@ -2975,6 +2981,16 @@ bool folio_clear_dirty_for_io(struct folio *folio)
 		 */
 		if (folio_mkclean(folio))
 			folio_mark_dirty(folio);
+		/*
+		 * For pinned folios we have no chance to reclaim them anyway
+		 * and we cannot be sure folio is ever clean. So memory
+		 * cleaning writeback is pointless. Just skip it.
+		 */
+		if (wbc && wbc->sync_mode == WB_SYNC_NONE &&
+		    folio_maybe_dma_pinned(folio))
+			return false;
+		/* Catch callers not expecting pinned pages */
+		WARN_ON_ONCE(!wbc && folio_maybe_dma_pinned(folio));
 		/*
 		 * We carefully synchronise fault handlers against
 		 * installing a dirty pte and marking the folio dirty
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ab3911a8b116..71a226b47ac6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1283,7 +1283,7 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
 
-	if (folio_clear_dirty_for_io(folio)) {
+	if (folio_clear_dirty_for_io(NULL, folio)) {
 		int res;
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_NONE,

^ permalink raw reply related	[flat|nested] 26+ messages in thread
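[Editor's note: the skip logic this patch adds to folio_clear_dirty_for_io() can be modeled in isolation by a small userspace sketch. The struct and function names below are simplified local stand-ins for the kernel definitions, not the real ones; the point is only the decision table: cleaning writeback (WB_SYNC_NONE) skips pinned folios, while integrity writeback (WB_SYNC_ALL) still writes them.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the kernel types used by the patch. */
enum writeback_sync_modes { WB_SYNC_NONE, WB_SYNC_ALL };

struct writeback_control { enum writeback_sync_modes sync_mode; };
struct folio { int pincount; };

/* Stand-in for the kernel's folio_maybe_dma_pinned() heuristic. */
static bool folio_maybe_dma_pinned(const struct folio *folio)
{
	return folio->pincount > 0;
}

/*
 * Models the decision added to folio_clear_dirty_for_io():
 * returns true if the caller should go on and write the folio back,
 * false if memory cleaning writeback should be skipped.
 */
static bool should_write_folio(const struct writeback_control *wbc,
			       const struct folio *folio)
{
	/* Cleaning writeback of a pinned folio is pointless: skip it. */
	if (wbc && wbc->sync_mode == WB_SYNC_NONE &&
	    folio_maybe_dma_pinned(folio))
		return false;
	/* Integrity writeback (or a NULL wbc) still writes the folio. */
	return true;
}
```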
* Re: [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback
  2023-02-10  1:54   ` John Hubbard
@ 2023-02-10  2:10     ` John Hubbard
  2023-02-10 10:42       ` Jan Kara
  2023-02-10 10:54     ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: John Hubbard @ 2023-02-10  2:10 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: linux-block, linux-mm, David Howells, David Hildenbrand

On 2/9/23 17:54, John Hubbard wrote:
> On 2/9/23 04:31, Jan Kara wrote:
>> When a folio is pinned, there is no point in trying to write it during
>> memory cleaning writeback. We cannot reclaim the folio until it is
>> unpinned anyway and we cannot even be sure the folio is really clean.
>> On top of that writeback of such folio may be problematic as the data
>> can change while the writeback is running thus causing checksum or
>> DIF/DIX failures. So just don't bother doing memory cleaning writeback
>> for pinned folios.
>>
>> Signed-off-by: Jan Kara <jack@suse.cz>
>> ---
>>   fs/9p/vfs_addr.c            |  2 +-
>>   fs/afs/file.c               |  2 +-
>>   fs/afs/write.c              |  6 +++---
>>   fs/btrfs/extent_io.c        | 14 +++++++-------
>>   fs/btrfs/free-space-cache.c |  2 +-
>>   fs/btrfs/inode.c            |  2 +-
>>   fs/btrfs/subpage.c          |  2 +-

Oh, and one more fix, below, is required in order to build with my local
test config. Assuming that it is reasonable to deal with pinned pages
here, which I think it is:

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 9c759df700ca..c3279fb0edc8 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -313,7 +313,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
 		if (!page)
 			continue;
 
-		if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
+		if (!page_mapped(page) && clear_page_dirty_for_io(&wbc, page)) {
 			int ret;
 
 			SetPageReclaim(page);

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* Re: [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback
  2023-02-10  2:10     ` John Hubbard
@ 2023-02-10 10:42       ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-10 10:42 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, David Howells,
	David Hildenbrand

On Thu 09-02-23 18:10:23, John Hubbard wrote:
> On 2/9/23 17:54, John Hubbard wrote:
> > On 2/9/23 04:31, Jan Kara wrote:
> > > When a folio is pinned, there is no point in trying to write it during
> > > memory cleaning writeback. We cannot reclaim the folio until it is
> > > unpinned anyway and we cannot even be sure the folio is really clean.
> > > On top of that writeback of such folio may be problematic as the data
> > > can change while the writeback is running thus causing checksum or
> > > DIF/DIX failures. So just don't bother doing memory cleaning writeback
> > > for pinned folios.
> > > 
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >   fs/9p/vfs_addr.c            |  2 +-
> > >   fs/afs/file.c               |  2 +-
> > >   fs/afs/write.c              |  6 +++---
> > >   fs/btrfs/extent_io.c        | 14 +++++++-------
> > >   fs/btrfs/free-space-cache.c |  2 +-
> > >   fs/btrfs/inode.c            |  2 +-
> > >   fs/btrfs/subpage.c          |  2 +-
> 
> Oh, and one more fix, below, is required in order to build with my local
> test config. Assuming that it is reasonable to deal with pinned pages
> here, which I think it is:
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> index 9c759df700ca..c3279fb0edc8 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> @@ -313,7 +313,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
>  		if (!page)
>  			continue;
>  
> -		if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
> +		if (!page_mapped(page) && clear_page_dirty_for_io(&wbc, page)) {
>  			int ret;
>  
>  			SetPageReclaim(page);

Thanks, fixed up. It didn't occur to me to grep drivers/ for these
functions :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback
  2023-02-10  1:54   ` John Hubbard
  2023-02-10  2:10     ` John Hubbard
@ 2023-02-10 10:54     ` Jan Kara
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-10 10:54 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, David Howells,
	David Hildenbrand

On Thu 09-02-23 17:54:14, John Hubbard wrote:
> On 2/9/23 04:31, Jan Kara wrote:
> > When a folio is pinned, there is no point in trying to write it during
> > memory cleaning writeback. We cannot reclaim the folio until it is
> > unpinned anyway and we cannot even be sure the folio is really clean.
> > On top of that writeback of such folio may be problematic as the data
> > can change while the writeback is running thus causing checksum or
> > DIF/DIX failures. So just don't bother doing memory cleaning writeback
> > for pinned folios.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >   fs/9p/vfs_addr.c            |  2 +-
> >   fs/afs/file.c               |  2 +-
> >   fs/afs/write.c              |  6 +++---
> >   fs/btrfs/extent_io.c        | 14 +++++++-------
> >   fs/btrfs/free-space-cache.c |  2 +-
> >   fs/btrfs/inode.c            |  2 +-
> >   fs/btrfs/subpage.c          |  2 +-
> 
> Hi Jan!
> 
> Just a quick note that this breaks the btrfs build in subpage.c.
> Because, unfortunately, btrfs creates 6 sets of functions via calls to a
> macro: IMPLEMENT_BTRFS_PAGE_OPS(). And that expects only one argument to
> the clear_page_func, and thus to clear_page_dirty_for_io().
> 
> It seems infeasible (right?) to add another argument to the other
> clear_page_func functions, which include:
> 
> ClearPageUptodate
> ClearPageError
> end_page_writeback
> ClearPageOrdered
> ClearPageChecked
> 
> , so I expect IMPLEMENT_BTRFS_PAGE_OPS() may need to be partially
> unrolled, in order to pass in the new writeback control arg to
> clear_page_dirty_for_io().

Aha, thanks for catching this. It is easy to fix this so that things
compile (just a wrapper around clear_page_dirty_for_io() - done now). It
will be a bit more challenging to propagate wbc in there for a proper
decision - that will probably require these functions not to be defined
by the macros.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
                   ` (2 preceding siblings ...)
  2023-02-09 12:31 ` [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback Jan Kara
@ 2023-02-09 12:31 ` Jan Kara
  2023-02-13  9:59   ` Christoph Hellwig
  2023-02-09 12:31 ` [PATCH 5/5] iomap: Bounce pinned pages during writeback Jan Kara
  4 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand, Jan Kara

When there is direct IO (or other DMA write) running into a page, it is
not generally safe to submit this page for another IO (such as writeback)
because this can cause checksum failures or similar issues. However
sometimes we cannot avoid writing the contents of these pages as they can
stay pinned for extensive amounts of time (e.g. for RDMA). For these cases
we need to just bounce the pages if we really need to write them out. Add
support for this type of bouncing into the block layer infrastructure.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk.h               | 10 +++++++++-
 block/bounce.c            |  9 +++++++--
 include/linux/blk_types.h |  1 +
 mm/Kconfig                |  8 ++++----
 4 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..def7ab8379bc 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -384,10 +384,18 @@ static inline bool blk_queue_may_bounce(struct request_queue *q)
 		max_low_pfn >= max_pfn;
 }
 
+static inline bool bio_need_pin_bounce(struct bio *bio,
+				       struct request_queue *q)
+{
+	return IS_ENABLED(CONFIG_BOUNCE) &&
+		bio->bi_flags & (1 << BIO_NEED_PIN_BOUNCE);
+}
+
 static inline struct bio *blk_queue_bounce(struct bio *bio,
 		struct request_queue *q)
 {
-	if (unlikely(blk_queue_may_bounce(q) && bio_has_data(bio)))
+	if (unlikely((blk_queue_may_bounce(q) || bio_need_pin_bounce(bio, q)) &&
+		     bio_has_data(bio)))
 		return __blk_queue_bounce(bio, q);
 	return bio;
 }
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9a1..ebda95953d58 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -207,12 +207,16 @@ struct bio *__blk_queue_bounce(struct bio *bio_orig, struct request_queue *q)
 	struct bvec_iter iter;
 	unsigned i = 0, bytes = 0;
 	bool bounce = false;
+	bool pinned_bounce = bio_orig->bi_flags & (1 << BIO_NEED_PIN_BOUNCE);
+	bool highmem_bounce = blk_queue_may_bounce(q);
 	int sectors;
 
 	bio_for_each_segment(from, bio_orig, iter) {
 		if (i++ < BIO_MAX_VECS)
 			bytes += from.bv_len;
-		if (PageHighMem(from.bv_page))
+		if (highmem_bounce && PageHighMem(from.bv_page))
+			bounce = true;
+		if (pinned_bounce && page_maybe_dma_pinned(from.bv_page))
 			bounce = true;
 	}
 	if (!bounce)
@@ -241,7 +245,8 @@ struct bio *__blk_queue_bounce(struct bio *bio_orig, struct request_queue *q)
 	for (i = 0, to = bio->bi_io_vec; i < bio->bi_vcnt; to++, i++) {
 		struct page *bounce_page;
 
-		if (!PageHighMem(to->bv_page))
+		if (!((highmem_bounce && PageHighMem(to->bv_page)) ||
+		      (pinned_bounce && page_maybe_dma_pinned(to->bv_page))))
 			continue;
 
 		bounce_page = mempool_alloc(&page_pool, GFP_NOIO);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..3aa1dc5d8dc6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -321,6 +321,7 @@ enum {
 	BIO_NO_PAGE_REF,	/* don't put release vec pages */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
+	BIO_NEED_PIN_BOUNCE,	/* bio needs to bounce pinned pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..eba075e959e8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -659,11 +659,11 @@ config PHYS_ADDR_T_64BIT
 config BOUNCE
 	bool "Enable bounce buffers"
 	default y
-	depends on BLOCK && MMU && HIGHMEM
+	depends on BLOCK && MMU
 	help
-	  Enable bounce buffers for devices that cannot access the full range of
-	  memory available to the CPU. Enabled by default when HIGHMEM is
-	  selected, but you may say n to override this.
+	  Enable bounce buffers. This is used for devices that cannot access
+	  the full range of memory available to the CPU or when DMA can be
+	  modifying pages while they are submitted for writeback.
 
config MMU_NOTIFIER
	bool
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread
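[Editor's note: the per-segment bounce decision and copy-out that this patch adds to __blk_queue_bounce() can be sketched in userspace as below. The struct and helper names are invented for illustration; real bios use bio_vec arrays and mempool-backed bounce pages, but the control flow - flag each segment that is highmem or DMA-pinned, then snapshot only the flagged ones into bounce buffers - is the same.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Minimal stand-in for one bio segment's backing page. */
struct page_model {
	bool highmem;		/* models PageHighMem() */
	bool dma_pinned;	/* models page_maybe_dma_pinned() */
	char data[16];
};

/* Mirrors the patch's per-segment test in __blk_queue_bounce(). */
static bool segment_needs_bounce(const struct page_model *page,
				 bool highmem_bounce, bool pinned_bounce)
{
	if (highmem_bounce && page->highmem)
		return true;
	if (pinned_bounce && page->dma_pinned)
		return true;
	return false;
}

/*
 * Copy every segment that needs bouncing into a bounce buffer and
 * return how many were bounced.  Copying gives the lower layers a
 * stable snapshot even if DMA keeps writing into the original pages.
 */
static int bounce_segments(struct page_model *pages, int nr,
			   bool highmem_bounce, bool pinned_bounce,
			   char bounce[][16])
{
	int bounced = 0;

	for (int i = 0; i < nr; i++) {
		if (!segment_needs_bounce(&pages[i],
					  highmem_bounce, pinned_bounce))
			continue;
		memcpy(bounce[bounced++], pages[i].data,
		       sizeof(pages[i].data));
	}
	return bounced;
}
```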
* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-09 12:31 ` [PATCH 4/5] block: Add support for bouncing pinned pages Jan Kara
@ 2023-02-13  9:59   ` Christoph Hellwig
  2023-02-14 13:56     ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-13  9:59 UTC (permalink / raw)
To: Jan Kara
Cc: linux-fsdevel, linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand

Eww.  The block bounce code really needs to go away, so a new user
makes me very unhappy.

But independent of that I don't think this is enough anyway.  Just
copying the data out into a new page in the block layer doesn't solve
the problem that this page needs to be tracked as dirtied for fs
accounting, e.g. every time we write this copy it needs space allocated
for COW file systems.

Which brings me back to if and when we do writeback for pinned pages.
I don't think doing any I/O for short term pins like direct I/O makes
sense.  These pins are defined to be unpinned after I/O completes, so we
might as well just wait for the unpin instead of doing anything
complicated.

Long term pins are more troublesome, but I really wonder what the
defined semantics for data integrity writeback like fsync on them is to
start with, as the content is very much undefined.  Should an fsync on a
(partially) long term pinned file simply fail?  It's not like we can win
in that scenario.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-13  9:59   ` Christoph Hellwig
@ 2023-02-14 13:56     ` Jan Kara
  2023-02-15  4:59       ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-14 13:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jan Kara, linux-fsdevel, linux-block, linux-mm, John Hubbard,
	David Howells, David Hildenbrand

On Mon 13-02-23 01:59:28, Christoph Hellwig wrote:
> Eww.  The block bounce code really needs to go away, so a new user
> makes me very unhappy.
> 
> But independent of that I don't think this is enough anyway.  Just
> copying the data out into a new page in the block layer doesn't solve
> the problem that this page needs to be tracked as dirtied for fs
> accounting, e.g. every time we write this copy it needs space allocated
> for COW file systems.

Right, I forgot about this in my RFC. My original plan was to not clear
the dirty bit in clear_page_dirty_for_io() even for WB_SYNC_ALL writeback
when we do write back the page, and perhaps indicate this in the return
value of clear_page_dirty_for_io() so that the COW filesystem can keep
tracking this page as dirty.

> Which brings me back to if and when we do writeback for pinned pages.
> I don't think doing any I/O for short term pins like direct I/O makes
> sense.  These pins are defined to be unpinned after I/O completes, so we
> might as well just wait for the unpin instead of doing anything
> complicated.

Agreed. For short term pins we could just wait, which should be quite
simple (although there's some DoS potential in this behavior if somebody
runs multiple processes that keep pinning some page with short term
pins).

> Long term pins are more troublesome, but I really wonder what the
> defined semantics for data integrity writeback like fsync on them is to
> start with, as the content is very much undefined.  Should an fsync on a
> (partially) long term pinned file simply fail?  It's not like we can win
> in that scenario.

Well, we also have cases like sync(2) so one would have to be careful
with error propagation, and I'm afraid there are enough programs out
there that treat any error return from fsync(2) as catastrophic, so I
suspect this could lead to some surprises. The case I'm most worried
about is if some application sets up RDMA to an mmaped file, runs the
transfer and waits for it to complete, doesn't bother to unpin the pages
(keeps them for future transfers), and calls fsync(2) to make data stable
on local storage. That does seem like quite a sensible use and so far it
works just fine. And not writing pages with fsync(2) would break such
uses.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-14 13:56     ` Jan Kara
@ 2023-02-15  4:59       ` Dave Chinner
  2023-02-15  6:24         ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2023-02-15  4:59 UTC (permalink / raw)
To: Jan Kara
Cc: Christoph Hellwig, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Tue, Feb 14, 2023 at 02:56:04PM +0100, Jan Kara wrote:
> On Mon 13-02-23 01:59:28, Christoph Hellwig wrote:
> > Eww.  The block bounce code really needs to go away, so a new user
> > makes me very unhappy.
> > 
> > But independent of that I don't think this is enough anyway.  Just
> > copying the data out into a new page in the block layer doesn't solve
> > the problem that this page needs to be tracked as dirtied for fs
> > accounting, e.g. every time we write this copy it needs space allocated
> > for COW file systems.
> 
> Right, I forgot about this in my RFC. My original plan was to not clear
> the dirty bit in clear_page_dirty_for_io() even for WB_SYNC_ALL writeback
> when we do write back the page, and perhaps indicate this in the return
> value of clear_page_dirty_for_io() so that the COW filesystem can keep
> tracking this page as dirty.

I don't think this works, especially if the COW mechanism relies on
delayed allocation to prevent ENOSPC during writeback. That is, we
need a write() or page fault (to run ->page_mkwrite()) after every
call to folio_clear_dirty_for_io() in the writeback path to ensure
that new space is reserved for the allocation that will occur
during a future writeback of that page.

Hence we can't just leave the page dirty on COW filesystems - it has
to go through a clean state so that the clean->dirty event can be
gated on gaining the space reservation that allows it to be written
back again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-15  4:59       ` Dave Chinner
@ 2023-02-15  6:24         ` Christoph Hellwig
  2023-02-16 12:33           ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-15  6:24 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Wed, Feb 15, 2023 at 03:59:52PM +1100, Dave Chinner wrote:
> I don't think this works, especially if the COW mechanism relies on
> delayed allocation to prevent ENOSPC during writeback. That is, we
> need a write() or page fault (to run ->page_mkwrite()) after every
> call to folio_clear_dirty_for_io() in the writeback path to ensure
> that new space is reserved for the allocation that will occur
> during a future writeback of that page.
> 
> Hence we can't just leave the page dirty on COW filesystems - it has
> to go through a clean state so that the clean->dirty event can be
> gated on gaining the space reservation that allows it to be written
> back again.

Exactly.  Although if we really want we could do the redirtying without
formally moving to a clean state, it certainly would require special new
code to do the same steps as if we were redirtying.

Which is another reason why I'd prefer to avoid all that if we can.
For that we probably need an inventory of what long term pins we have
in the kernel tree that can and do operate on shared file mappings,
and what kind of I/O semantics they expect.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-15  6:24         ` Christoph Hellwig
@ 2023-02-16 12:33           ` Jan Kara
  2023-02-20  6:22             ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-16 12:33 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Jan Kara, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Tue 14-02-23 22:24:33, Christoph Hellwig wrote:
> On Wed, Feb 15, 2023 at 03:59:52PM +1100, Dave Chinner wrote:
> > I don't think this works, especially if the COW mechanism relies on
> > delayed allocation to prevent ENOSPC during writeback. That is, we
> > need a write() or page fault (to run ->page_mkwrite()) after every
> > call to folio_clear_dirty_for_io() in the writeback path to ensure
> > that new space is reserved for the allocation that will occur
> > during a future writeback of that page.
> > 
> > Hence we can't just leave the page dirty on COW filesystems - it has
> > to go through a clean state so that the clean->dirty event can be
> > gated on gaining the space reservation that allows it to be written
> > back again.
> 
> Exactly.  Although if we really want we could do the redirtying without
> formally moving to a clean state, it certainly would require special new
> code to do the same steps as if we were redirtying.

Yes.

> Which is another reason why I'd prefer to avoid all that if we can.
> For that we probably need an inventory of what long term pins we have
> in the kernel tree that can and do operate on shared file mappings,
> and what kind of I/O semantics they expect.

I'm a bit skeptical we can reasonably assess that (as much as I would
love to just not write these pages and be done with it) because a lot of
FOLL_LONGTERM users just pin a passed userspace address range, then allow
userspace to manipulate it with other operations, and finally unpin it
with another call. Who knows whether shared pagecache pages are passed in
and what userspace is doing with them while they are pinned? We have
stuff like io_uring using FOLL_LONGTERM for IO buffers passed from
userspace (e.g. the IORING_REGISTER_BUFFERS operation), we have V4L2
which similarly pins buffers for video processing (and I vaguely remember
one bugreport due to some phone passing shared file pages there to
acquire screenshots from a webcam), and we have various infiniband
drivers doing this (not all of them are using FOLL_LONGTERM but they
should AFAICS). We even have vmsplice(2) that should arguably be using
pinning with FOLL_LONGTERM (at least that's the plan AFAIK) and not
writing such pages would IMO provide an interesting attack vector...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-16 12:33           ` Jan Kara
@ 2023-02-20  6:22             ` Christoph Hellwig
  2023-02-27 11:39               ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-20  6:22 UTC (permalink / raw)
To: Jan Kara
Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-block,
	linux-mm, John Hubbard, David Howells, David Hildenbrand

On Thu, Feb 16, 2023 at 01:33:16PM +0100, Jan Kara wrote:
> I'm a bit skeptical we can reasonably assess that (as much as I would love
> to just not write these pages and be done with it) because a lot of
> FOLL_LONGTERM users just pin passed userspace address range, then allow
> userspace to manipulate it with other operations, and finally unpin it with
> another call. Who knows whether shared pagecache pages are passed in and
> what userspace is doing with them while they are pinned?

True.

So what other sensible thing could we do at a higher level?

Treat MAP_SHARED buffers that are long term registered as MAP_PRIVATE
while registered, and just do writeback using in-kernel O_DIRECT on
fsync?

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-20  6:22             ` Christoph Hellwig
@ 2023-02-27 11:39               ` Jan Kara
  2023-02-27 13:36                 ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2023-02-27 11:39 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jan Kara, Dave Chinner, linux-fsdevel, linux-block, linux-mm,
	John Hubbard, David Howells, David Hildenbrand

On Sun 19-02-23 22:22:32, Christoph Hellwig wrote:
> On Thu, Feb 16, 2023 at 01:33:16PM +0100, Jan Kara wrote:
> > I'm a bit skeptical we can reasonably assess that (as much as I would love
> > to just not write these pages and be done with it) because a lot of
> > FOLL_LONGTERM users just pin passed userspace address range, then allow
> > userspace to manipulate it with other operations, and finally unpin it with
> > another call. Who knows whether shared pagecache pages are passed in and
> > what userspace is doing with them while they are pinned?
> 
> True.
> 
> So what other sensible thing could we do at a higher level?
> 
> Treat MAP_SHARED buffers that are long term registered as MAP_PRIVATE
> while registered, and just do writeback using in-kernel O_DIRECT on
> fsync?

Do you mean to copy these pages on fsync(2) to a newly allocated buffer
and then submit it via direct IO? That looks sensible to me. We could
then make the writeback path just completely ignore these long term
pinned pages and add this copying logic into filemap_fdatawrite() or
something like that.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 4/5] block: Add support for bouncing pinned pages
  2023-02-27 11:39               ` Jan Kara
@ 2023-02-27 13:36                 ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-27 13:36 UTC (permalink / raw)
To: Jan Kara
Cc: Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-block,
	linux-mm, John Hubbard, David Howells, David Hildenbrand

On Mon, Feb 27, 2023 at 12:39:26PM +0100, Jan Kara wrote:
> Do you mean to copy these pages on fsync(2) to newly allocated buffer and
> then submit it via direct IO? That looks sensible to me. We could then make
> writeback path just completely ignore these long term pinned pages and just
> add this copying logic into filemap_fdatawrite() or something like that.

I don't think we'd even have to copy them on fsync.  Just do an
in-kernel ITER_BVEC direct I/O on them.  The only hard part would be to
come up with a way to skip the pagecache invalidation and writeout for
these I/Os.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 5/5] iomap: Bounce pinned pages during writeback
  2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
                   ` (3 preceding siblings ...)
  2023-02-09 12:31 ` [PATCH 4/5] block: Add support for bouncing pinned pages Jan Kara
@ 2023-02-09 12:31 ` Jan Kara
  4 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2023-02-09 12:31 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-block, linux-mm, John Hubbard, David Howells,
	David Hildenbrand, Jan Kara

When there is direct IO (or other DMA write) running into a page, it is
not generally safe to submit this page for writeback because this can
cause DIF/DIX failures or similar issues. Ask the block layer to bounce
the page in this case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/iomap/buffered-io.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 356193e44cf0..e6469e7715cc 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1563,6 +1563,14 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
 		bio_add_folio(wpc->ioend->io_bio, folio, len, poff);
 	}
 
+	/*
+	 * Folio may be modified by the owner of the pin and we require stable
+	 * page contents during writeback? Ask block layer to bounce the bio.
+	 */
+	if (inode->i_sb->s_iflags & SB_I_STABLE_WRITES &&
+	    folio_maybe_dma_pinned(folio))
+		wpc->ioend->io_bio->bi_flags |= 1 << BIO_NEED_PIN_BOUNCE;
+
 	if (iop)
 		atomic_add(len, &iop->write_bytes_pending);
 	wpc->ioend->io_size += len;
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2023-02-27 13:37 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
2023-02-09 12:31 [PATCH RFC 0/5] Writeback handling of pinned pages Jan Kara
2023-02-09 12:31 ` [PATCH 1/5] mm: Do not reclaim private data from pinned page Jan Kara
2023-02-09 16:17   ` Matthew Wilcox
2023-02-10 11:29     ` Jan Kara
2023-02-13  9:55       ` Christoph Hellwig
2023-02-14 13:06         ` Jan Kara
2023-02-14 21:40           ` John Hubbard
2023-02-16 11:56             ` Jan Kara
2023-02-13  9:01   ` David Hildenbrand
2023-02-14 13:00     ` Jan Kara
2023-02-09 12:31 ` [PATCH 2/5] ext4: Drop workaround for mm reclaiming fs private page data Jan Kara
2023-02-09 12:31 ` [PATCH 3/5] mm: Do not try to write pinned folio during memory cleaning writeback Jan Kara
2023-02-10  1:54   ` John Hubbard
2023-02-10  2:10     ` John Hubbard
2023-02-10 10:42       ` Jan Kara
2023-02-10 10:54     ` Jan Kara
2023-02-09 12:31 ` [PATCH 4/5] block: Add support for bouncing pinned pages Jan Kara
2023-02-13  9:59   ` Christoph Hellwig
2023-02-14 13:56     ` Jan Kara
2023-02-15  4:59       ` Dave Chinner
2023-02-15  6:24         ` Christoph Hellwig
2023-02-16 12:33           ` Jan Kara
2023-02-20  6:22             ` Christoph Hellwig
2023-02-27 11:39               ` Jan Kara
2023-02-27 13:36                 ` Christoph Hellwig
2023-02-09 12:31 ` [PATCH 5/5] iomap: Bounce pinned pages during writeback Jan Kara
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).