linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Luis Chamberlain <mcgrof@kernel.org>
To: Zi Yan <ziy@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	Sean Christopherson <seanjc@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	"Pankaj Raghav (Samsung)" <kernel@pankajraghav.com>,
	djwong@kernel.org, brauner@kernel.org, david@fromorbit.com,
	chandan.babu@oracle.com, akpm@linux-foundation.org,
	linux-fsdevel@vger.kernel.org, hare@suse.de,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-xfs@vger.kernel.org, gost.dev@samsung.com,
	p.raghav@samsung.com
Subject: Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
Date: Mon, 29 Apr 2024 17:31:04 -0700	[thread overview]
Message-ID: <ZjA7yBQjkh52TM_T@bombadil.infradead.org> (raw)
In-Reply-To: <6799F341-9E37-4F3E-B0D0-B5B2138A5F5F@nvidia.com>

On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
> 
> > On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
> >> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> >>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> >>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> >>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> >>>>>> From: Pankaj Raghav <p.raghav@samsung.com>
> >>>>>>
> >>>>>> using that API for LBS is resulting in an NULL ptr dereference
> >>>>>> error in the writeback path [1].
> >>>>>>
> >>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> >>>>>
> >>>>>  How would I go about reproducing this?
> >>
> >> Well so the below fixes this but I am not sure if this is correct.
> >> folio_mark_dirty() at least says that a folio should not be truncated
> >> while its running. I am not sure if we should try to split folios then
> >> even though we check for writeback once. truncate_inode_partial_folio()
> >> will folio_wait_writeback() but it will split_folio() before checking
> >> for claiming to fail to truncate with folio_test_dirty(). But since the
> >> folio is locked its not clear why this should be possible.
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 83955362d41c..90195506211a 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >>  	if (new_order >= folio_order(folio))
> >>  		return -EINVAL;
> >>
> >> -	if (folio_test_writeback(folio))
> >> +	if (folio_test_dirty(folio) || folio_test_writeback(folio))
> >>  		return -EBUSY;
> >>
> >>  	if (!folio_test_anon(folio)) {
> >
> > I wondered what code path is causing this and triggering this null
> > pointer, so I just sprinkled a check here:
> >
> > 	VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
> >
> > The answer was:
> >
> > kcompactd() --> migrate_pages_batch()
> >                   --> try_split_folio --> split_folio_to_list() -->
> > 		       split_huge_page_to_list_to_order()
> >
> 
> There are 3 try_split_folio() in migrate_pages_batch().

This is only true for linux-next, for v6.9-rc5 off of which this testing
is based on there are only two.

> First one is to split anonymous large folios that are on deferred
> split list, so not related;

This is in linux-next and not v6.9-rc5.

> second one is to split THPs when thp migration is not supported, but
> this is compaction, so not related; third one is to split large folios
> when there is no same size free page in the system, and this should be
> the one.

Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This
also helps us enhance the reproducer further, which I'll do next.

> > And I verified that moving the check only to the migrate_pages_batch()
> > path also fixes the crash:
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 73a052a382f1..83b528eb7100 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> >  	int rc;
> >
> >  	folio_lock(folio);
> > +	if (folio_test_dirty(folio)) {
> > +		rc = -EBUSY;
> > +		goto out;
> > +	}
> >  	rc = split_folio_to_list(folio, split_folios);
> > +out:
> >  	folio_unlock(folio);
> >  	if (!rc)
> >  		list_move_tail(&folio->lru, split_folios);
> >
> > However I'd like compaction folks to review this. I see some indications
> > in the code that migration can race with truncation but we feel fine by
> > it by taking the folio lock. However here we have a case where we see
> > the folio clearly locked and the folio is dirty. Other migraiton code
> > seems to write back the code and can wait, here we just move on. Further
> > reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> > as completely unmovable") seems to hint that migration is safe if the
> > mapping either does not exist or the mapping does exist but has
> > mapping->a_ops->migrate_folio so I'd like further feedback on this.
> 
> During migration, all page table entries pointing to this dirty folio
> are invalid, and accesses to this folio will cause page fault and
> wait on the migration entry. I am not sure we need to skip dirty folios.

I see.. thanks!

> > Another thing which requires review is if we we split a folio but not
> > down to order 0 but to the new min order, does the accounting on
> > migrate_pages_batch() require changing?  And most puzzling, why do we
> 
> What accounting are you referring to? split code should take care of it.

The folio order can change after split, and so I was concerned about the
nr_pages used in migrate_pages_batch(). But I see now that when
migrate_folio_unmap() first failed we try to split the folio, and if
successful I see now we the caller will again call migrate_pages_batch()
with a retry attempt of 1 only to the split folios. I also see the
nr_pages is just local to each list for each loop, first on the from
list to unmap and afte on the unmap list so we move the folios.

> > not see this with regular large folios, but we do see it with minorder ?
> 
> I wonder if the split code handles folio->mapping->i_pages properly.
> Does the i_pages store just folio pointers or also need all tail page
> pointers? I am no expert in fs, thus need help.

mapping->i_pages stores folio pointers in the page cache or
swap/dax/shadow entries (xa_is_value(folio)). The folios however can be
special and we special-case them with shmem_mapping(mapping) checks.
split_huge_page_to_list_to_order() doens't get called with swap/dax/shadow 
entries, and we also bail out on shmem_mapping(mapping) already.

  Luis


  reply	other threads:[~2024-04-30  0:31 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 01/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-04-25 18:07   ` Hannes Reinecke
2024-04-26 15:09   ` Darrick J. Wong
2024-04-25 11:37 ` [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
2024-04-25 19:04   ` Hannes Reinecke
2024-04-26 15:12   ` Darrick J. Wong
2024-04-28 20:59     ` Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 04/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
2024-04-25 18:53   ` Matthew Wilcox
2024-04-25 11:37 ` [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
2024-04-25 20:10   ` Matthew Wilcox
2024-04-26  0:47     ` Luis Chamberlain
2024-04-26 23:46       ` Luis Chamberlain
2024-04-28  0:57         ` Luis Chamberlain
2024-04-29  3:56           ` Luis Chamberlain
2024-04-29 14:29             ` Zi Yan
2024-04-30  0:31               ` Luis Chamberlain [this message]
2024-04-30  0:49                 ` Luis Chamberlain
2024-04-30  2:43                 ` Zi Yan
2024-04-30 19:27                   ` Luis Chamberlain
2024-05-01  4:13                     ` Matthew Wilcox
2024-05-01 14:28                       ` Matthew Wilcox
2024-04-26 15:49     ` Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 06/11] filemap: cap PTE range to be created to i_size in folio_map_range() Pankaj Raghav (Samsung)
2024-04-25 20:24   ` Matthew Wilcox
2024-04-26 12:54     ` Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-04-26  6:22   ` Christoph Hellwig
2024-04-26 11:43     ` Pankaj Raghav (Samsung)
2024-04-27  5:12       ` Christoph Hellwig
2024-04-29 21:02         ` Pankaj Raghav (Samsung)
2024-04-27  3:26     ` Matthew Wilcox
2024-04-27  4:52       ` Christoph Hellwig
2024-04-25 11:37 ` [PATCH v4 08/11] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
2024-04-26 15:18   ` Darrick J. Wong
2024-04-28 21:06     ` Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-04-26 15:15   ` Darrick J. Wong
2024-04-25 11:37 ` [PATCH v4 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-04-26 15:16   ` Darrick J. Wong
2024-04-25 11:37 ` [PATCH v4 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-04-26 15:18   ` Darrick J. Wong
     [not found] ` <87y18zxvpd.fsf@gmail.com>
2024-04-27  5:05   ` [PATCH v4 00/11] enable bs > ps in XFS Darrick J. Wong
2024-04-29 20:39   ` Pankaj Raghav (Samsung)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZjA7yBQjkh52TM_T@bombadil.infradead.org \
    --to=mcgrof@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=chandan.babu@oracle.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=gost.dev@samsung.com \
    --cc=hare@suse.de \
    --cc=kernel@pankajraghav.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=p.raghav@samsung.com \
    --cc=seanjc@google.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).