* ZONE_DEVICE refcounting
@ 2024-03-08 4:24 Alistair Popple
2024-03-08 13:44 ` Jason Gunthorpe
2024-03-13 6:32 ` Dan Williams
0 siblings, 2 replies; 10+ messages in thread
From: Alistair Popple @ 2024-03-08 4:24 UTC (permalink / raw)
To: linux-mm
Cc: jhubbard, rcampbell, willy, jgg, dan.j.williams, david,
linux-fsdevel, jack, djwong, hch, david
Hi,
I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
have been looking at fixing the 1-based refcounts that are currently used for
FS DAX pages (and p2pdma pages, but that's trivial).
This started with the simple idea of "just subtract one from the
refcounts everywhere and that will fix the off by one". Unfortunately
it's not that simple. For starters, doing a simple conversion like that
requires allowing pages to be mapped with zero refcounts, which seems
wrong. It also leads to problems distinguishing idle pages from pages
that are mapped or undergoing IO.
So instead I'm thinking of doing something along the lines of the following:
1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
increment the refcount inline with mapcount and decrement it when pages are
unmapped.
2. As per normal pages the pages are considered free when the refcount drops
to zero.
3. Because these are treated as normal pages for refcounting we no longer map
them as pte_devmap() (possibly freeing up a PTE bit).
4. PMD sized FS DAX pages get treated the same as normal compound pages.
5. This means we need to allow compound ZONE_DEVICE pages. Tail pages share
the page->pgmap field with page->compound_head, but this isn't a problem
because the LSB of page->pgmap is free and we can still get pgmap from
compound_head(page)->pgmap.
6. When FS DAX pages are freed they notify filesystem drivers. This can be done
from the pgmap->ops->page_free() callback.
7. We could probably get rid of the pgmap refcounting because we can just scan
pages and look for any pages with non-zero references and wait for them to be
freed whilst ensuring no new mappings can be created (some drivers do a
similar thing for private pages today). This might be a follow-up change.
I have made good progress implementing the above, and am reasonably confident I
can make it work (I have some tests that exercise these code paths working).
However my knowledge of the filesystem layer is a bit thin, so before going too
much further down this path I was hoping to get some feedback on the overall
direction to see if there are any corner cases or other potential problems I
have missed that may prevent the above being practical.
If not I will clean my series up and post it as an RFC. Thanks.
- Alistair
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: ZONE_DEVICE refcounting
2024-03-08 4:24 ZONE_DEVICE refcounting Alistair Popple
@ 2024-03-08 13:44 ` Jason Gunthorpe
2024-03-13 6:32 ` Dan Williams
1 sibling, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2024-03-08 13:44 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, jhubbard, rcampbell, willy, dan.j.williams, david,
linux-fsdevel, jack, djwong, hch, david
On Fri, Mar 08, 2024 at 03:24:35PM +1100, Alistair Popple wrote:
> Hi,
>
> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
> have been looking at fixing the 1-based refcounts that are currently used for
> FS DAX pages (and p2pdma pages, but that's trivial).
>
> This started with the simple idea of "just subtract one from the
> refcounts everywhere and that will fix the off by one". Unfortunately
> it's not that simple. For starters, doing a simple conversion like that
> requires allowing pages to be mapped with zero refcounts, which seems
> wrong. It also leads to problems distinguishing idle pages from pages
> that are mapped or undergoing IO.
>
> So instead I'm thinking of doing something along the lines of the following:
>
> 1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
> increment the refcount inline with mapcount and decrement it when pages are
> unmapped.
This is the right thing to do
> 2. As per normal pages the pages are considered free when the refcount drops
> to zero.
>
> 3. Because these are treated as normal pages for refcounting we no longer map
> them as pte_devmap() (possibly freeing up a PTE bit).
Yes, the pmd/pte_devmap() should ideally go away.
> 4. PMD sized FS DAX pages get treated the same as normal compound pages.
>
> 5. This means we need to allow compound ZONE_DEVICE pages. Tail pages share
> the page->pgmap field with page->compound_head, but this isn't a problem
> because the LSB of page->pgmap is free and we can still get pgmap from
> compound_head(page)->pgmap.
Right, this is the actual work - the mm is obviously already happy
with its part, fsdax just needs to create a properly sized folio and
map it properly.
> 6. When FS DAX pages are freed they notify filesystem drivers. This can be done
> from the pgmap->ops->page_free() callback.
>
> 7. We could probably get rid of the pgmap refcounting because we can just scan
> pages and look for any pages with non-zero references and wait for them to be
> freed whilst ensuring no new mappings can be created (some drivers do a
> similar thing for private pages today). This might be a follow-up
> change.
Yeah, the pgmap refcounting needs some cleanup for sure.
Jason
* Re: ZONE_DEVICE refcounting
2024-03-08 4:24 ZONE_DEVICE refcounting Alistair Popple
2024-03-08 13:44 ` Jason Gunthorpe
@ 2024-03-13 6:32 ` Dan Williams
2024-03-20 5:20 ` Alistair Popple
1 sibling, 1 reply; 10+ messages in thread
From: Dan Williams @ 2024-03-13 6:32 UTC (permalink / raw)
To: Alistair Popple, linux-mm
Cc: jhubbard, rcampbell, willy, jgg, dan.j.williams, david,
linux-fsdevel, jack, djwong, hch, david
Alistair Popple wrote:
> Hi,
>
> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
> have been looking at fixing the 1-based refcounts that are currently used for
> FS DAX pages (and p2pdma pages, but that's trivial).
>
> This started with the simple idea of "just subtract one from the
> refcounts everywhere and that will fix the off by one". Unfortunately
> it's not that simple. For starters, doing a simple conversion like that
> requires allowing pages to be mapped with zero refcounts, which seems
> wrong. It also leads to problems distinguishing idle pages from pages
> that are mapped or undergoing IO.
>
> So instead I'm thinking of doing something along the lines of the following:
>
> 1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
> increment the refcount inline with mapcount and decrement it when pages are
> unmapped.
It has been a while but the sticking point last time was how to plumb
the "allocation" mechanism that elevated the page from 0 to 1. However,
that seems solvable.
> 2. As per normal pages the pages are considered free when the refcount drops
> to zero.
That is the dream, yes.
> 3. Because these are treated as normal pages for refcounting we no longer map
> them as pte_devmap() (possibly freeing up a PTE bit).
Yeah, pte_devmap() dies once mapcount behaves normally.
> 4. PMD sized FS DAX pages get treated the same as normal compound pages.
Here potentially be dragons. There are pud_devmap() checks in places
where mm code needs to be careful not to treat a dax page as a typical
transhuge page that can be split.
> 5. This means we need to allow compound ZONE_DEVICE pages. Tail pages share
> the page->pgmap field with page->compound_head, but this isn't a problem
> because the LSB of page->pgmap is free and we can still get pgmap from
> compound_head(page)->pgmap.
Sounds plausible.
> 6. When FS DAX pages are freed they notify filesystem drivers. This can be done
> from the pgmap->ops->page_free() callback.
Yes, necessary for DAX-GUP interactions.
> 7. We could probably get rid of the pgmap refcounting because we can just scan
> pages and look for any pages with non-zero references and wait for them to be
> freed whilst ensuring no new mappings can be created (some drivers do a
> similar thing for private pages today). This might be a follow-up change.
This sounds reasonable.
> I have made good progress implementing the above, and am reasonably confident I
> can make it work (I have some tests that exercise these code paths working).
Wow, that's great! Really appreciated, and I will be paying you back
with review cycles.
> However my knowledge of the filesystem layer is a bit thin, so before going too
> much further down this path I was hoping to get some feedback on the overall
> direction to see if there are any corner cases or other potential problems I
> have missed that may prevent the above being practical.
If you want to send me draft patches for that on or offlist feel free.
> If not I will clean my series up and post it as an RFC. Thanks.
Thanks, Alistair!
* Re: ZONE_DEVICE refcounting
2024-03-13 6:32 ` Dan Williams
@ 2024-03-20 5:20 ` Alistair Popple
2024-03-21 5:26 ` Alistair Popple
0 siblings, 1 reply; 10+ messages in thread
From: Alistair Popple @ 2024-03-20 5:20 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, jhubbard, rcampbell, willy, jgg, david, linux-fsdevel,
jack, djwong, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
>> Hi,
>>
>> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
>> have been looking at fixing the 1-based refcounts that are currently used for
>> FS DAX pages (and p2pdma pages, but that's trivial).
>>
>> This started with the simple idea of "just subtract one from the
>> refcounts everywhere and that will fix the off by one". Unfortunately
>> it's not that simple. For starters, doing a simple conversion like that
>> requires allowing pages to be mapped with zero refcounts, which seems
>> wrong. It also leads to problems distinguishing idle pages from pages
>> that are mapped or undergoing IO.
>>
>> So instead I'm thinking of doing something along the lines of the following:
>>
>> 1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
>> increment the refcount inline with mapcount and decrement it when pages are
>> unmapped.
>
> It has been a while but the sticking point last time was how to plumb
> the "allocation" mechanism that elevated the page from 0 to 1. However,
> that seems solvable.
So as a proof-of-concept I was just doing it as part of the actual page
mapping (ie. by replacing vm_insert_mixed() with vm_insert_page()) done
in dax_fault_iter().
That did have the weirdness of passing a zero-refcount page in to be
mapped, but I think that's solvable by taking a reference when
converting the pfn to a page in eg. dax_iomap_direct_access().
The issue I have just run into is that the initialisation of these
struct pages is tricky. It would be ok if we didn't have compound pages.
However because we do it means we could get an access request to a
subpage that has already been mapped as a compound page and we obviously
can't just switch the struct page back and forth on every
dax_iomap_direct_access() call.
But I've been reading the DAX pagecache implementation and it looks like
doing the page initialisation there is actually the correct spot as it
already deals with some of this. I've also just discovered the
page->share overloading which wasn't a highlight of my day :-)
Thankfully I think that can also just get rolled into the refcounting if
we do that properly.
[...]
>> 4. PMD sized FS DAX pages get treated the same as normal compound pages.
>
> Here potentially be dragons. There are pud_devmap() checks in places
> where mm code needs to be careful not to treat a dax page as a typical
> transhuge page that can be split.
Interesting. Thanks for pointing that out - I'd overlooked the fact
pXd_trans_huge() implies !pXd_devmap(). Most callers end up checking
both though, and I don't quite understand the issue with splitting.
Specifically things like __split_huge_pud() do pretty much the same
thing for pud_trans_huge() and pud_devmap() which is to clear the
pud/pmd (although treating FS DAX pages as normal compound pages removes
some special cases).
And folio splitting already seems like it would have dragons given these
are not LRU pages and hence can't be split to lists, etc. Most of the
code I looked at there checked that though, and if we have the folio we
can easily look up the zone anyway.
I also noticed folio_anon() is not safe to call on a FS DAX page due to
sharing PAGE_MAPPING_DAX_SHARED.
[...]
>> I have made good progress implementing the above, and am reasonably confident I
>> can make it work (I have some tests that exercise these code paths working).
>
> Wow, that's great! Really appreciate and will be paying you back with
> review cycles.
Thanks! Do you have a preferred test suite that you use to stress FS
DAX? I wrote a couple of directed "tests" to get an understanding of,
and to exercise, some of the trickier code paths (eg. GUP). I have also
tried the DAX xfstests, but they were passing, which made me suspicious.
>> However my knowledge of the filesystem layer is a bit thin, so before going too
>> much further down this path I was hoping to get some feedback on the overall
>> direction to see if there are any corner cases or other potential problems I
>> have missed that may prevent the above being practical.
>
> If you want to send me draft patches for that on or offlist feel free.
Great, once I figure out the compound page initialisation I should have
something reasonable to send.
>> If not I will clean my series up and post it as an RFC. Thanks.
>
> Thanks, Alistair!
* Re: ZONE_DEVICE refcounting
2024-03-20 5:20 ` Alistair Popple
@ 2024-03-21 5:26 ` Alistair Popple
2024-03-21 6:03 ` Dan Williams
0 siblings, 1 reply; 10+ messages in thread
From: Alistair Popple @ 2024-03-21 5:26 UTC (permalink / raw)
To: Alistair Popple
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg, david,
linux-fsdevel, jack, djwong, hch, david, ruansy.fnst
Alistair Popple <apopple@nvidia.com> writes:
> Dan Williams <dan.j.williams@intel.com> writes:
>
>> Alistair Popple wrote:
>
> I also noticed folio_anon() is not safe to call on a FS DAX page due to
> sharing PAGE_MAPPING_DAX_SHARED.
Also it feels like I could be missing something here. AFAICT the
page->mapping and page->index fields can't actually be used outside of
fs/dax because they are overloaded for the shared case. Therefore
setting/clearing them could be skipped and the only reason for doing so
is so dax_associate_entry()/dax_disassociate_entry() can generate
warnings which should never occur anyway. So all that code is
functionally unnecessary.
I guess that makes sense because DAX pages aren't in the pagecache or
LRU and therefore nothing tries to look them up via the normal
rmap. Does that sound about right, or did I miss something?
- Alistair
* Re: ZONE_DEVICE refcounting
2024-03-21 5:26 ` Alistair Popple
@ 2024-03-21 6:03 ` Dan Williams
2024-03-22 0:01 ` Alistair Popple
0 siblings, 1 reply; 10+ messages in thread
From: Dan Williams @ 2024-03-21 6:03 UTC (permalink / raw)
To: Alistair Popple
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg, david,
linux-fsdevel, jack, djwong, hch, david, ruansy.fnst
Alistair Popple wrote:
>
> Alistair Popple <apopple@nvidia.com> writes:
>
> > Dan Williams <dan.j.williams@intel.com> writes:
> >
> >> Alistair Popple wrote:
> >
> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
> > sharing PAGE_MAPPING_DAX_SHARED.
>
> Also it feels like I could be missing something here. AFAICT the
> page->mapping and page->index fields can't actually be used outside of
> fs/dax because they are overloaded for the shared case. Therefore
> setting/clearing them could be skipped and the only reason for doing so
> is so dax_associate_entry()/dax_disassociate_entry() can generate
> warnings which should never occur anyway. So all that code is
> functionally unnecessary.
What do you mean outside of fs/dax, do you literally mean outside of
fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)? Memory
failure needs ->mapping and ->index to rmap dax pages. See
mm/memory-failure.c::__add_to_kill() and
mm/memory-failure.c::__add_to_kill_fsdax(), where the latter is for
cases where the fs has signed up to react to dax page failure.
* Re: ZONE_DEVICE refcounting
2024-03-21 6:03 ` Dan Williams
@ 2024-03-22 0:01 ` Alistair Popple
2024-03-22 3:18 ` Dave Chinner
0 siblings, 1 reply; 10+ messages in thread
From: Alistair Popple @ 2024-03-22 0:01 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, jhubbard, rcampbell, willy, jgg, david, linux-fsdevel,
jack, djwong, hch, david, ruansy.fnst
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
>>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>> > Dan Williams <dan.j.williams@intel.com> writes:
>> >
>> >> Alistair Popple wrote:
>> >
>> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
>> > sharing PAGE_MAPPING_DAX_SHARED.
>>
>> Also it feels like I could be missing something here. AFAICT the
>> page->mapping and page->index fields can't actually be used outside of
>> fs/dax because they are overloaded for the shared case. Therefore
>> setting/clearing them could be skipped and the only reason for doing so
>> is so dax_associate_entry()/dax_disassociate_entry() can generate
>> warnings which should never occur anyway. So all that code is
>> functionally unnecessary.
>
> What do you mean outside of fs/dax, do you literally mean outside of
> fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?
Only the cases fs dax pages might need it. ie. Not devdax which I
haven't looked at closely yet.
> Memory
> failure needs ->mapping and ->index to rmap dax pages. See
> mm/memory-failure.c::__add_to_kill() and
> mm/memory-failure.c::__add_to_kill_fsdax(), where the latter is for
> cases where the fs has signed up to react to dax page failure.
How does that work for reflink/shared pages which overwrite
page->mapping and page->index? Eg. in __add_to_kill() if *p is a shared fs
dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
page_address_in_vma(vma, p) will probably crash.
Thanks.
* Re: ZONE_DEVICE refcounting
2024-03-22 0:01 ` Alistair Popple
@ 2024-03-22 3:18 ` Dave Chinner
2024-03-22 5:34 ` Alistair Popple
0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2024-03-22 3:18 UTC (permalink / raw)
To: Alistair Popple
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg,
linux-fsdevel, jack, djwong, hch, david, ruansy.fnst
On Fri, Mar 22, 2024 at 11:01:25AM +1100, Alistair Popple wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > Alistair Popple wrote:
> >>
> >> Alistair Popple <apopple@nvidia.com> writes:
> >>
> >> > Dan Williams <dan.j.williams@intel.com> writes:
> >> >
> >> >> Alistair Popple wrote:
> >> >
> >> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
> >> > sharing PAGE_MAPPING_DAX_SHARED.
> >>
> >> Also it feels like I could be missing something here. AFAICT the
> >> page->mapping and page->index fields can't actually be used outside of
> >> fs/dax because they are overloaded for the shared case. Therefore
> >> setting/clearing them could be skipped and the only reason for doing so
> >> is so dax_associate_entry()/dax_disassociate_entry() can generate
> >> warnings which should never occur anyway. So all that code is
> >> functionally unnecessary.
> >
> > What do you mean outside of fs/dax, do you literally mean outside of
> > fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?
>
> Only the cases fs dax pages might need it. ie. Not devdax which I
> haven't looked at closely yet.
>
> > Memory
> > failure needs ->mapping and ->index to rmap dax pages. See
> > mm/memory-failure.c::__add_to_kill() and
> > mm/memory-failure.c::__add_to_kill_fsdax(), where the latter is for
> > cases where the fs has signed up to react to dax page failure.
>
> How does that work for reflink/shared pages which overwrite
> page->mapping and page->index?
Via reverse mapping in the *filesystem*, not the mm rmap stuff.
pmem_pagemap_memory_failure()
dax_holder_notify_failure()
.notify_failure()
xfs_dax_notify_failure()
xfs_dax_notify_ddev_failure()
xfs_rmap_query_range(xfs_dax_failure_fn)
xfs_dax_failure_fn(rmap record)
<grabs inode from cache>
<converts range to file offset>
mf_dax_kill_procs(inode->mapping, pgoff)
collect_procs_fsdax(mapping, page)
add_to_kill_fsdax(task)
__add_to_kill(task)
unmap_and_kill_tasks()
Remember: in FSDAX, the pages are the storage media physically owned
by the filesystem, not the mm subsystem. Hence answering questions
like "who owns this page" can only be answered correctly by asking
the filesystem.
We shortcut that for pages that only have one owner - we just store
the owner information in the page as a {mapping, offset} tuple. But
when we have multiple owners, the only way to find all the {mapping,
offset} tuples is to ask the filesystem to find all the owners of
that page.
Hence the special case values for page->mapping/page->index for
pages over shared filesystem extents. These shared extents are
communicated to the fsdax layer via the IOMAP_F_SHARED flag
in the iomaps returned by the filesystem. This flag is the trigger
for the special mapping share count behaviour to be used. e.g. see
dax_insert_entry(iomap_iter) -> dax_associate_entry(shared) ->
dax_page_share_get()....
> Eg. in __add_to_kill() if *p is a shared fs
> dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
> page_address_in_vma(vma, p) will probably crash.
As per above, we don't get the mapping from the page in those cases.
If you haven't got access to the page through a filesystem method and
guaranteed that truncate() can't free it from under you, then you're
probably doing the wrong thing with fsdax...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: ZONE_DEVICE refcounting
2024-03-22 3:18 ` Dave Chinner
@ 2024-03-22 5:34 ` Alistair Popple
2024-03-22 6:58 ` Dan Williams
0 siblings, 1 reply; 10+ messages in thread
From: Alistair Popple @ 2024-03-22 5:34 UTC (permalink / raw)
To: Dave Chinner
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg,
linux-fsdevel, jack, djwong, hch, david, ruansy.fnst
Dave Chinner <david@fromorbit.com> writes:
> On Fri, Mar 22, 2024 at 11:01:25AM +1100, Alistair Popple wrote:
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>> > Alistair Popple wrote:
>> >>
>> >> Alistair Popple <apopple@nvidia.com> writes:
>> >>
>> >> > Dan Williams <dan.j.williams@intel.com> writes:
>> >> >
>> >> >> Alistair Popple wrote:
>> >> >
>> >> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
>> >> > sharing PAGE_MAPPING_DAX_SHARED.
>> >>
>> >> Also it feels like I could be missing something here. AFAICT the
>> >> page->mapping and page->index fields can't actually be used outside of
>> >> fs/dax because they are overloaded for the shared case. Therefore
>> >> setting/clearing them could be skipped and the only reason for doing so
>> >> is so dax_associate_entry()/dax_disassociate_entry() can generate
>> >> warnings which should never occur anyway. So all that code is
>> >> functionally unnecessary.
>> >
>> > What do you mean outside of fs/dax, do you literally mean outside of
>> > fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?
>>
>> Only the cases fs dax pages might need it. ie. Not devdax which I
>> haven't looked at closely yet.
>>
>> > Memory
>> > failure needs ->mapping and ->index to rmap dax pages. See
>> > mm/memory-failure.c::__add_to_kill() and
>> > mm/memory-failure.c::__add_to_kill_fsdax(), where the latter is for
>> > cases where the fs has signed up to react to dax page failure.
>>
>> How does that work for reflink/shared pages which overwrite
>> page->mapping and page->index?
>
> Via reverse mapping in the *filesystem*, not the mm rmap stuff.
>
> pmem_pagemap_memory_failure()
> dax_holder_notify_failure()
> .notify_failure()
> xfs_dax_notify_failure()
> xfs_dax_notify_ddev_failure()
> xfs_rmap_query_range(xfs_dax_failure_fn)
> xfs_dax_failure_fn(rmap record)
> <grabs inode from cache>
> <converts range to file offset>
> mf_dax_kill_procs(inode->mapping, pgoff)
> collect_procs_fsdax(mapping, page)
> add_to_kill_fsdax(task)
> __add_to_kill(task)
> unmap_and_kill_tasks()
>
> Remember: in FSDAX, the pages are the storage media physically owned
> by the filesystem, not the mm subsystem. Hence answering questions
> like "who owns this page" can only be answered correctly by asking
> the filesystem.
Thanks Dave for writing that up, it really helped solidify my
understanding of how this is all supposed to work.
> We shortcut that for pages that only have one owner - we just store
> the owner information in the page as a {mapping, offset} tuple. But
> when we have multiple owners, the only way to find all the {mapping,
> offset} tuples is to ask the filesystem to find all the owners of
> that page.
>
> Hence the special case values for page->mapping/page->index for
> pages over shared filesystem extents. These shared extents are
> communicated to the fsdax layer via the IOMAP_F_SHARED flag
> in the iomaps returned by the filesystem. This flag is the trigger
> for the special mapping share count behaviour to be used. e.g. see
> dax_insert_entry(iomap_iter) -> dax_associate_entry(shared) ->
> dax_page_share_get()....
>
>> Eg. in __add_to_kill() if *p is a shared fs
>> dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
>> page_address_in_vma(vma, p) will probably crash.
>
> As per above, we don't get the mapping from the page in those cases.
Yep, that all makes sense and I see where I was getting confused. It was
because in __add_to_kill() we do actually use page->mapping when
page_address_in_vma(vma, p) is called. And because
folio_test_anon(page_folio(p)) is also true for shared FS DAX pages
(p->mapping == PAGE_MAPPING_DAX_SHARED) I thought things would go bad
there.
However after re-reading the page_address_in_vma() implementation I
noticed that isn't the case, because folio_anon_vma(page_folio(p)) will
still return NULL which was the detail I had missed.
So to go back to my original point, it seems page->mapping and
page->index are largely unused on fs dax pages, even for memory
failure, because for memory failure the mapping and fsdax_pgoff come
from the filesystem, not the page.
> If you haven't got access to the page through a filesystem method and
> guaranteed that truncate() can't free it from under you, then you're
> probably doing the wrong thing with fsdax...
>
> -Dave.
* Re: ZONE_DEVICE refcounting
2024-03-22 5:34 ` Alistair Popple
@ 2024-03-22 6:58 ` Dan Williams
0 siblings, 0 replies; 10+ messages in thread
From: Dan Williams @ 2024-03-22 6:58 UTC (permalink / raw)
To: Alistair Popple, Dave Chinner
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg,
linux-fsdevel, jack, djwong, hch, david, ruansy.fnst
Alistair Popple wrote:
>
> Dave Chinner <david@fromorbit.com> writes:
>
> > On Fri, Mar 22, 2024 at 11:01:25AM +1100, Alistair Popple wrote:
> >>
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >> > Alistair Popple wrote:
> >> >>
> >> >> Alistair Popple <apopple@nvidia.com> writes:
> >> >>
> >> >> > Dan Williams <dan.j.williams@intel.com> writes:
> >> >> >
> >> >> >> Alistair Popple wrote:
> >> >> >
> >> >> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
> >> >> > sharing PAGE_MAPPING_DAX_SHARED.
> >> >>
> >> >> Also it feels like I could be missing something here. AFAICT the
> >> >> page->mapping and page->index fields can't actually be used outside of
> >> >> fs/dax because they are overloaded for the shared case. Therefore
> >> >> setting/clearing them could be skipped and the only reason for doing so
> >> >> is so dax_associate_entry()/dax_disassociate_entry() can generate
> >> >> warnings which should never occur anyway. So all that code is
> >> >> functionally unnecessary.
> >> >
> >> > What do you mean outside of fs/dax, do you literally mean outside of
> >> > fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?
> >>
> >> Only the cases fs dax pages might need it. ie. Not devdax which I
> >> haven't looked at closely yet.
> >>
> >> > Memory
> >> > failure needs ->mapping and ->index to rmap dax pages. See
> >> > mm/memory-failure.c::__add_to_kill() and
> >> > mm/memory-failure.c::__add_to_kill_fsdax(), where the latter is for
> >> > cases where the fs has signed up to react to dax page failure.
> >>
> >> How does that work for reflink/shared pages which overwrite
> >> page->mapping and page->index?
> >
> > Via reverse mapping in the *filesystem*, not the mm rmap stuff.
> >
> > pmem_pagemap_memory_failure()
> > dax_holder_notify_failure()
> > .notify_failure()
> > xfs_dax_notify_failure()
> > xfs_dax_notify_ddev_failure()
> > xfs_rmap_query_range(xfs_dax_failure_fn)
> > xfs_dax_failure_fn(rmap record)
> > <grabs inode from cache>
> > <converts range to file offset>
> > mf_dax_kill_procs(inode->mapping, pgoff)
> > collect_procs_fsdax(mapping, page)
> > add_to_kill_fsdax(task)
> > __add_to_kill(task)
> > unmap_and_kill_tasks()
> >
> > Remember: in FSDAX, the pages are the storage media physically owned
> > by the filesystem, not the mm subsystem. Hence answering questions
> > like "who owns this page" can only be answered correctly by asking
> > the filesystem.
>
> Thanks Dave for writing that up, it really helped solidify my
> understanding of how this is all supposed to work.
Yes, thanks Dave!
> > We shortcut that for pages that only have one owner - we just store
> > the owner information in the page as a {mapping, offset} tuple. But
> > when we have multiple owners, the only way to find all the {mapping,
> > offset} tuples is to ask the filesystem to find all the owners of
> > that page.
> >
> > Hence the special case values for page->mapping/page->index for
> > pages over shared filesystem extents. These shared extents are
> > communicated to the fsdax layer via the IOMAP_F_SHARED flag
> > in the iomaps returned by the filesystem. This flag is the trigger
> > for the special mapping share count behaviour to be used. e.g. see
> > dax_insert_entry(iomap_iter) -> dax_associate_entry(shared) ->
> > dax_page_share_get()....
> >
> >> Eg. in __add_to_kill() if *p is a shared fs
> >> dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
> >> page_address_in_vma(vma, p) will probably crash.
> >
> > As per above, we don't get the mapping from the page in those cases.
>
> Yep, that all makes sense and I see where I was getting confused. It was
> because in __add_to_kill() we do actually use page->mapping when
> page_address_in_vma(vma, p) is called. And because
> folio_test_anon(page_folio(p)) is also true for shared FS DAX pages
> (p->mapping == PAGE_MAPPING_DAX_SHARED) I thought things would go bad
> there.
>
> However after re-reading the page_address_in_vma() implementation I
> noticed that isn't the case, because folio_anon_vma(page_folio(p)) will
> still return NULL which was the detail I had missed.
>
> So to go back to my original point, it seems page->mapping and
> page->index are largely unused on fs dax pages, even for memory
> failure, because for memory failure the mapping and fsdax_pgoff come
> from the filesystem, not the page.
Only for XFS; all other filesystems depend on typical rmap, and I
believe XFS only uses that path if the fs is mounted with reflink
support.
You had asked about tests before, and I do have a regression test for
memory-failure in the ndctl regression suite that should be validating
this for xfs and ext4:
https://github.com/pmem/ndctl/blob/main/test/dax-poison.c#L47
meson test -C build --suite ndctl:ndctl
There are setup instructions here:
https://github.com/pmem/ndctl/blob/main/README.md
I typically run it in its own VM. This run_qemu script can build and
boot an environment to run it:
https://github.com/pmem/run_qemu
There is also kdevops that includes it, but I have not tried that:
https://github.com/linux-kdevops/kdevops