From: Jeff Layton <jlayton@kernel.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Luís Henriques" <lhenriques@suse.de>,
"Xiubo Li" <xiubli@redhat.com>,
"Josef Bacik" <josef@toxicpanda.com>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Freeing page flags
Date: Fri, 13 May 2022 09:57:19 -0400 [thread overview]
Message-ID: <063f8a744f08f3bbc33aa840b4e91c5a443d41c3.camel@kernel.org> (raw)
In-Reply-To: <Yn5fXm83IfUWkv8w@casper.infradead.org>
On Fri, 2022-05-13 at 14:38 +0100, Matthew Wilcox wrote:
> On Fri, May 13, 2022 at 09:21:11AM -0400, Jeff Layton wrote:
> > On Fri, 2022-05-13 at 13:53 +0100, Matthew Wilcox wrote:
> > > On Fri, May 13, 2022 at 10:40:05AM +0100, Luís Henriques wrote:
> > > > Matthew Wilcox <willy@infradead.org> writes:
> > > >
> > > > > On Thu, May 12, 2022 at 10:41:41PM -0400, Josef Bacik wrote:
> > > > > > On Thu, May 12, 2022 at 09:54:59PM +0100, Matthew Wilcox wrote:
> > > > > > > The LWN writeup [1] on merging the MGLRU reminded me that I need to send
> > > > > > > out a plan for removing page flags that we can do without.
> > > > > > >
> > > > > > > 1. PG_error. It's basically useless. If the page was read successfully,
> > > > > > > PG_uptodate is set. If not, PG_uptodate is clear. The page cache
> > > > > > > doesn't use PG_error. Some filesystems do, and we need to transition
> > > > > > > them away from using it.
> > > > > > >
> > > > > >
> > > > > > What about writes? A cursory look shows we don't clear Uptodate if we fail to
> > > > > > write, which is correct I think. The only way to indicate we had a write error
> > > > > > to check later is the page error.
> > > > >
> > > > > On encountering a write error, we're supposed to call mapping_set_error(),
> > > > > not SetPageError().
> > > > >
> > > > > > > 2. PG_private. This tells us whether we have anything stored at
> > > > > > > page->private. We can just check if page->private is NULL or not.
> > > > > > > No need to have this extra bit. Again, there may be some filesystems
> > > > > > > that are a bit wonky here, but I'm sure they're fixable.
> > > > > > >
> > > > > >
> > > > > > At least for Btrfs we serialize the page->private with the private_lock, so we
> > > > > > could probably just drop PG_private, but it's kind of nice to check first before
> > > > > > we have to take the spin lock. I suppose we can just do
> > > > > >
> > > > > > if (page->private)
> > > > > > // do lock and check thingy
> > > > >
> > > > > That's my hope! I think btrfs is already using folio_attach_private() /
> > > > > attach_page_private(), which makes everything easier. Some filesystems
> > > > > still manipulate page->private and PagePrivate by hand.
> > > >
> > > > In ceph we've recently [1] spent a bit of time debugging a bug related
> > > > with ->private not being NULL even though we expected it to be. The
> > > > solution found was to replace the check for NULL and use
> > > > folio_test_private() instead, but we _may_ have not figured the whole
> > > > thing out.
> > > >
> > > > We assumed that folios were being recycled and not cleaned-up. The values
> > > > we were seeing in ->private looked like they were some sort of flags as
> > > > only a few bits were set (e.g. 0x0200000):
> > > >
> > > > [ 1672.578313] page:00000000e23868c1 refcount:2 mapcount:0 mapping:0000000022e0d3b4 index:0xd8 pfn:0x74e83
> > > > [ 1672.581934] aops:ceph_aops [ceph] ino:10000016c9e dentry name:"faed"
> > > > [ 1672.584457] flags: 0x4000000000000015(locked|uptodate|lru|zone=1)
> > > > [ 1672.586878] raw: 4000000000000015 ffffea0001d3a108 ffffea0001d3a088 ffff888003491948
> > > > [ 1672.589894] raw: 00000000000000d8 0000000000200000 00000002ffffffff 0000000000000000
> > > > [ 1672.592935] page dumped because: VM_BUG_ON_FOLIO(1)
> > > >
> > > > [1] https://lore.kernel.org/all/20220508061543.318394-1-xiubli@redhat.com/
> > >
> > > I remember Jeff asking me about this problem a few days ago. A folio
> > > passed to you in ->dirty_folio() or ->invalidate_folio() belongs to
> > > your filesystem. Nobody else should be storing to the ->private field;
> > > there's no race that could lead to it being freed while you see it.
> > > There may, of course, be bugs that are overwriting folio->private, but
> > > it's definitely not supposed to happen. I agree with you that it looks
> > > like a bit has been set (is it possibly bad RAM?)
> > >
> > > We do use page->private in the buddy allocator, but that stores the order
> > > of the page; it wouldn't be storing 1<<21. PG flag 21 is PG_mlocked,
> > > which seems like a weird one to be setting in the wrong field, so probably
> > > not that.
> > >
> > > Is it always bit 21 that gets set?
> >
> > No, it varies, but it was always just a few bits in the field that end
> > up being set. I was never able to reproduce it locally, but saw it in a
> > run in ceph's teuthology lab a few times. Xiubo did the most digging
> > here, so he may be able to add more info.
> >
> > Basically though, we call __filemap_get_folio in netfs_write_begin and
> > it will sometimes give us a folio that has PG_private clear, but the
> > ->private field has just a few bits that aren't zeroed out. I'm pretty
> > sure we zero out that field in ceph, so the theory was that the page was
> > traveling through some other subsystem before coming to us.
>
> It _shouldn't_ be. __filemap_get_folio() may return a page that was
> already in the page cache (and so may have had page->private set by
> the filesystem originally), or it may allocate a fresh page in
> filemap_alloc_folio() which _should_ come with page->private clear.
> Adding an assert that is true might be a good debugging tactic:
>
> +++ b/mm/filemap.c
> @@ -2008,6 +2008,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> goto repeat;
> }
>
> +VM_BUG_ON_FOLIO(folio->private, folio);
> /*
> * filemap_add_folio locks the page, and for mmap
> * we expect an unlocked page.
>
> > He wasn't able to ascertain the cause, and just decided to check for
> > PG_private instead since you (presumably) shouldn't trust ->private
> > unless that's set anyway.
>
> They are usually in sync ... which means we can reclaim the flag ;-)
Agreed. I'm all for freeing up a page bit and PG_private seems like
belt-and-suspenders stuff. There are some details in this tracker
ticket.
This note in particular seems to indicate that ->private is not always
coming back as zero:
https://tracker.ceph.com/issues/55421#note-20
--
Jeff Layton <jlayton@kernel.org>
next prev parent reply other threads:[~2022-05-13 13:57 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-05-12 20:54 Freeing page flags Matthew Wilcox
2022-05-13 2:41 ` Josef Bacik
2022-05-13 3:39 ` Matthew Wilcox
2022-05-13 9:40 ` Luís Henriques
2022-05-13 12:53 ` Matthew Wilcox
2022-05-13 13:18 ` Luís Henriques
2022-05-13 13:21 ` Jeff Layton
2022-05-13 13:38 ` Matthew Wilcox
2022-05-13 13:57 ` Jeff Layton [this message]
2022-05-17 0:34 ` Xiubo Li
2022-05-13 13:17 ` Josef Bacik
2022-05-13 3:46 ` Yu Zhao
2022-05-14 6:10 ` Eric Biggers
2022-05-23 6:38 ` Mike Rapoport
2022-06-10 17:39 ` David Hildenbrand
2022-05-30 16:54 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=063f8a744f08f3bbc33aa840b4e91c5a443d41c3.camel@kernel.org \
--to=jlayton@kernel.org \
--cc=josef@toxicpanda.com \
--cc=lhenriques@suse.de \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=willy@infradead.org \
--cc=xiubli@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).