* ZONE_DEVICE refcounting
@ 2024-03-08  4:24 Alistair Popple
  2024-03-08 13:44 ` Jason Gunthorpe
  2024-03-13  6:32 ` Dan Williams
  0 siblings, 2 replies; 10+ messages in thread

From: Alistair Popple @ 2024-03-08 4:24 UTC (permalink / raw)
To: linux-mm
Cc: jhubbard, rcampbell, willy, jgg, dan.j.williams, david, linux-fsdevel, jack, djwong, hch, david

Hi,

I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
have been looking at fixing the 1-based refcounts that are currently used for
FS DAX pages (and p2pdma pages, but that's trivial).

This started with the simple idea of "just subtract one from the refcounts
everywhere and that will fix the off by one". Unfortunately it's not that
simple. For starters, doing a simple conversion like that requires allowing
pages to be mapped with zero refcounts, which seems wrong. It also leads to
problems distinguishing idle pages from pages that are mapped or have IO in
flight.

So instead I'm thinking of doing something along the lines of the following:

1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
   increment the refcount in line with mapcount, decrementing it when pages
   are unmapped.

2. As per normal pages, the pages are considered free when the refcount drops
   to zero.

3. Because these are treated as normal pages for refcounting we no longer map
   them as pte_devmap() (possibly freeing up a PTE bit).

4. PMD-sized FS DAX pages get treated the same as normal compound pages.

5. This means we need to allow compound ZONE_DEVICE pages. Tail pages share
   the page->pgmap field with page->compound_head, but this isn't a problem
   because the LSB of page->pgmap is free and we can still get the pgmap from
   compound_head(page)->pgmap.

6. When FS DAX pages are freed they notify filesystem drivers. This can be
   done from the pgmap->ops->page_free() callback.

7. We could probably get rid of the pgmap refcounting, because we can just
   scan pages looking for any with non-zero references and wait for them to
   be freed, whilst ensuring no new mappings can be created (some drivers do
   a similar thing for private pages today). This might be a follow-up
   change.

I have made good progress implementing the above, and am reasonably confident
I can make it work (I have some tests that exercise these code paths
working). However my knowledge of the filesystem layer is a bit thin, so
before going too much further down this path I was hoping to get some
feedback on the overall direction to see if there are any corner cases or
other potential problems I have missed that may prevent the above being
practical. If not I will clean my series up and post it as an RFC. Thanks.

 - Alistair

^ permalink raw reply	[flat|nested] 10+ messages in thread
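The lifecycle in points 1, 2 and 6 above can be sketched as a toy userspace model. All names here are illustrative stand-ins for the idea, not actual kernel APIs:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of points 1, 2 and 6: mappings hold normal references,
 * the page is free at refcount zero, and the final put notifies the
 * driver. Illustrative only; these are not kernel structures. */
struct toy_pgmap_ops {
	void (*page_free)(void *page);	/* models pgmap->ops->page_free() */
};

struct toy_page {
	int refcount;			/* 0 means the page is free */
	int mapcount;
	const struct toy_pgmap_ops *ops;
};

/* Mapping takes a reference, as mapping via vm_insert_page() would. */
static void toy_map(struct toy_page *p)
{
	p->refcount++;
	p->mapcount++;
}

/* Unmapping drops the reference; the last unmap frees the page and
 * fires the filesystem-driver callback. */
static void toy_unmap(struct toy_page *p)
{
	p->mapcount--;
	if (--p->refcount == 0 && p->ops && p->ops->page_free)
		p->ops->page_free(p);
}
```

The point of the model is only the lifecycle: the final reference drop is what triggers the filesystem notification, with no extra off-by-one reference baked in.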
* Re: ZONE_DEVICE refcounting

From: Jason Gunthorpe @ 2024-03-08 13:44 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, jhubbard, rcampbell, willy, dan.j.williams, david, linux-fsdevel, jack, djwong, hch, david

On Fri, Mar 08, 2024 at 03:24:35PM +1100, Alistair Popple wrote:
> Hi,
>
> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
> have been looking at fixing the 1-based refcounts that are currently used for
> FS DAX pages (and p2pdma pages, but that's trivial).
>
> This started with the simple idea of "just subtract one from the
> refcounts everywhere and that will fix the off by one". Unfortunately
> it's not that simple. For starters doing a simple conversion like that
> requires allowing pages to be mapped with zero refcounts. That seems
> wrong. It also leads to problems distinguishing idle pages from pages
> that are mapped or have IO in flight.
>
> So instead I'm thinking of doing something along the lines of the following:
>
> 1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
>    increment the refcount in line with mapcount and decrement it when pages
>    are unmapped.

This is the right thing to do.

> 2. As per normal pages the pages are considered free when the refcount drops
>    to zero.
>
> 3. Because these are treated as normal pages for refcounting we no longer map
>    them as pte_devmap() (possibly freeing up a PTE bit).

Yes, the pmd/pte_devmap() should ideally go away.

> 4. PMD sized FS DAX pages get treated the same as normal compound pages.
>
> 5. This means we need to allow compound ZONE_DEVICE pages. Tail pages share
>    the page->pgmap field with page->compound_head, but this isn't a problem
>    because the LSB of page->pgmap is free and we can still get pgmap from
>    compound_head(page)->pgmap.

Right, this is the actual work - the mm is obviously already happy with
its part, fsdax just needs to create a properly sized folio and map it
properly.

> 6. When FS DAX pages are freed they notify filesystem drivers. This can be done
>    from the pgmap->ops->page_free() callback.
>
> 7. We could probably get rid of the pgmap refcounting because we can just scan
>    pages and look for any pages with non-zero references and wait for them to be
>    freed whilst ensuring no new mappings can be created (some drivers do a
>    similar thing for private pages today). This might be a follow-up
>    change.

Yeah, the pgmap refcounting needs some cleanup for sure.

Jason
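The tail-page encoding from point 5 that Jason refers to can be illustrated with a small userspace sketch. The field layout and names below only mirror the idea (one word shared between a pgmap pointer and a compound-head pointer, distinguished by the free LSB); this is not the actual struct page definition:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of point 5: a tail page reuses the pgmap slot to store
 * its compound head with the LSB set. Illustrative only. */
struct toy_page {
	uintptr_t pgmap_or_head;	/* head: pgmap pointer (LSB clear)
					 * tail: head pointer | 1 */
};

static struct toy_page *toy_compound_head(struct toy_page *p)
{
	if (p->pgmap_or_head & 1)
		return (struct toy_page *)(p->pgmap_or_head & ~(uintptr_t)1);
	return p;
}

/* pgmap lookup always goes via the head, so the shared field stays
 * unambiguous: the LSB tells the two encodings apart. */
static void *toy_page_pgmap(struct toy_page *p)
{
	return (void *)toy_compound_head(p)->pgmap_or_head;
}
```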
* Re: ZONE_DEVICE refcounting

From: Dan Williams @ 2024-03-13 6:32 UTC (permalink / raw)
To: Alistair Popple, linux-mm
Cc: jhubbard, rcampbell, willy, jgg, dan.j.williams, david, linux-fsdevel, jack, djwong, hch, david

Alistair Popple wrote:
> Hi,
>
> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
> have been looking at fixing the 1-based refcounts that are currently used for
> FS DAX pages (and p2pdma pages, but that's trivial).
>
> This started with the simple idea of "just subtract one from the
> refcounts everywhere and that will fix the off by one". Unfortunately
> it's not that simple. For starters doing a simple conversion like that
> requires allowing pages to be mapped with zero refcounts. That seems
> wrong. It also leads to problems distinguishing idle pages from pages
> that are mapped or have IO in flight.
>
> So instead I'm thinking of doing something along the lines of the following:
>
> 1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
>    increment the refcount in line with mapcount and decrement it when pages
>    are unmapped.

It has been a while but the sticking point last time was how to plumb
the "allocation" mechanism that elevated the page from 0 to 1. However,
that seems solvable.

> 2. As per normal pages the pages are considered free when the refcount drops
>    to zero.

That is the dream, yes.

> 3. Because these are treated as normal pages for refcounting we no longer map
>    them as pte_devmap() (possibly freeing up a PTE bit).

Yeah, pte_devmap() dies once mapcount behaves normally.

> 4. PMD sized FS DAX pages get treated the same as normal compound pages.

Here potentially be dragons. There are pud_devmap() checks in places
where mm code needs to be careful not to treat a dax page as a typical
transhuge page that can be split.

> 5. This means we need to allow compound ZONE_DEVICE pages. Tail pages share
>    the page->pgmap field with page->compound_head, but this isn't a problem
>    because the LSB of page->pgmap is free and we can still get pgmap from
>    compound_head(page)->pgmap.

Sounds plausible.

> 6. When FS DAX pages are freed they notify filesystem drivers. This can be done
>    from the pgmap->ops->page_free() callback.

Yes, necessary for DAX-GUP interactions.

> 7. We could probably get rid of the pgmap refcounting because we can just scan
>    pages and look for any pages with non-zero references and wait for them to be
>    freed whilst ensuring no new mappings can be created (some drivers do a
>    similar thing for private pages today). This might be a follow-up change.

This sounds reasonable.

> I have made good progress implementing the above, and am reasonably confident I
> can make it work (I have some tests that exercise these code paths working).

Wow, that's great! Really appreciate it, and will be paying you back
with review cycles.

> However my knowledge of the filesystem layer is a bit thin, so before going too
> much further down this path I was hoping to get some feedback on the overall
> direction to see if there are any corner cases or other potential problems I
> have missed that may prevent the above being practical.

If you want to send me draft patches for that on- or off-list, feel free.

> If not I will clean my series up and post it as an RFC. Thanks.

Thanks, Alistair!
* Re: ZONE_DEVICE refcounting

From: Alistair Popple @ 2024-03-20 5:20 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, jhubbard, rcampbell, willy, jgg, david, linux-fsdevel, jack, djwong, hch, david

Dan Williams <dan.j.williams@intel.com> writes:

> Alistair Popple wrote:
>> Hi,
>>
>> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
>> have been looking at fixing the 1-based refcounts that are currently used for
>> FS DAX pages (and p2pdma pages, but that's trivial).
>>
>> This started with the simple idea of "just subtract one from the
>> refcounts everywhere and that will fix the off by one". Unfortunately
>> it's not that simple. For starters doing a simple conversion like that
>> requires allowing pages to be mapped with zero refcounts. That seems
>> wrong. It also leads to problems distinguishing idle pages from pages
>> that are mapped or have IO in flight.
>>
>> So instead I'm thinking of doing something along the lines of the following:
>>
>> 1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
>>    increment the refcount in line with mapcount and decrement it when pages
>>    are unmapped.
>
> It has been a while but the sticking point last time was how to plumb
> the "allocation" mechanism that elevated the page from 0 to 1. However,
> that seems solvable.

So as a proof-of-concept I was just doing it as part of the actual page
mapping (ie. by replacing vm_insert_mixed() with vm_insert_page()) done
in dax_fault_iter(). That did have the weirdness of passing a
zero-refcount page in to be mapped, but I think that's solvable by
taking a reference when converting the pfn to a page in eg.
dax_iomap_direct_access().

The issue I have just run into is that the initialisation of these
struct pages is tricky. It would be ok if we didn't have compound pages.
However, because we do, we could get an access request for a subpage
that has already been mapped as part of a compound page, and we
obviously can't just switch the struct page back and forth on every
dax_iomap_direct_access() call. But I've been reading the DAX pagecache
implementation and it looks like doing the page initialisation there is
actually the correct spot, as it already deals with some of this.

I've also just discovered the page->share overloading, which wasn't a
highlight of my day :-) Thankfully I think that can also just get rolled
into the refcounting if we do that properly.

[...]

>> 4. PMD sized FS DAX pages get treated the same as normal compound pages.
>
> Here potentially be dragons. There are pud_devmap() checks in places
> where mm code needs to be careful not to treat a dax page as a typical
> transhuge page that can be split.

Interesting. Thanks for pointing that out - I'd overlooked the fact that
pXd_trans_huge() implies !pXd_devmap(). Most callers end up checking
both though, and I don't quite understand the issue with splitting.
Specifically, things like __split_huge_pud() do pretty much the same
thing for pud_trans_huge() and pud_devmap(), which is to clear the
pud/pmd (although treating FS DAX pages as normal compound pages removes
some special cases). And folio splitting already seems like it would
have dragons given these are not LRU pages and hence can't be split to
lists, etc. Most of the code I looked at there checked that though, and
if we have the folio we can easily look up the zone anyway.

I also noticed folio_anon() is not safe to call on a FS DAX page due to
sharing PAGE_MAPPING_DAX_SHARED.

[...]

>> I have made good progress implementing the above, and am reasonably confident I
>> can make it work (I have some tests that exercise these code paths working).
>
> Wow, that's great! Really appreciate and will be paying you back with
> review cycles.

Thanks! Do you have a preferred test suite that you use to stress FS
DAX? I wrote a couple of directed "tests" to get an understanding of,
and to exercise, some of the trickier code paths (eg. GUP). Have also
tried the DAX xfstests but they were passing, which made me suspicious.

>> However my knowledge of the filesystem layer is a bit thin, so before going too
>> much further down this path I was hoping to get some feedback on the overall
>> direction to see if there are any corner cases or other potential problems I
>> have missed that may prevent the above being practical.
>
> If you want to send me draft patches for that on or offlist feel free.

Great, once I figure out the compound page initialisation I should have
something reasonable to send.

>> If not I will clean my series up and post it as an RFC. Thanks.
>
> Thanks, Alistair!
* Re: ZONE_DEVICE refcounting

From: Alistair Popple @ 2024-03-21 5:26 UTC (permalink / raw)
To: Alistair Popple
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg, david, linux-fsdevel, jack, djwong, hch, david, ruansy.fnst

Alistair Popple <apopple@nvidia.com> writes:

> Dan Williams <dan.j.williams@intel.com> writes:
>
>> Alistair Popple wrote:
>
> I also noticed folio_anon() is not safe to call on a FS DAX page due to
> sharing PAGE_MAPPING_DAX_SHARED.

Also it feels like I could be missing something here. AFAICT the
page->mapping and page->index fields can't actually be used outside of
fs/dax because they are overloaded for the shared case. Therefore
setting/clearing them could be skipped, and the only reason for doing so
is so that dax_associate_entry()/dax_disassociate_entry() can generate
warnings, which should never occur anyway. So all that code is
functionally unnecessary.

I guess that makes sense because DAX pages aren't in the pagecache or on
the LRU and therefore nothing tries to look them up via the normal rmap.
Does that sound about right or did I miss something?

 - Alistair
* Re: ZONE_DEVICE refcounting

From: Dan Williams @ 2024-03-21 6:03 UTC (permalink / raw)
To: Alistair Popple
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg, david, linux-fsdevel, jack, djwong, hch, david, ruansy.fnst

Alistair Popple wrote:
>
> Alistair Popple <apopple@nvidia.com> writes:
>
> > Dan Williams <dan.j.williams@intel.com> writes:
> >
> >> Alistair Popple wrote:
> >
> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
> > sharing PAGE_MAPPING_DAX_SHARED.
>
> Also it feels like I could be missing something here. AFAICT the
> page->mapping and page->index fields can't actually be used outside of
> fs/dax because they are overloaded for the shared case. Therefore
> setting/clearing them could be skipped and the only reason for doing so
> is so dax_associate_entry()/dax_disassociate_entry() can generate
> warnings which should never occur anyway. So all that code is
> functionally unnecessary.

What do you mean by outside of fs/dax, do you literally mean outside of
fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?

Memory failure needs ->mapping and ->index to rmap dax pages. See
mm/memory-failure.c::__add_to_kill() and
mm/memory-failure.c::__add_to_kill_fsdax(), where the latter one is for
cases where the fs has signed up to react to dax page failure.
* Re: ZONE_DEVICE refcounting

From: Alistair Popple @ 2024-03-22 0:01 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, jhubbard, rcampbell, willy, jgg, david, linux-fsdevel, jack, djwong, hch, david, ruansy.fnst

Dan Williams <dan.j.williams@intel.com> writes:

> Alistair Popple wrote:
>>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>> > Dan Williams <dan.j.williams@intel.com> writes:
>> >
>> >> Alistair Popple wrote:
>> >
>> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
>> > sharing PAGE_MAPPING_DAX_SHARED.
>>
>> Also it feels like I could be missing something here. AFAICT the
>> page->mapping and page->index fields can't actually be used outside of
>> fs/dax because they are overloaded for the shared case. Therefore
>> setting/clearing them could be skipped and the only reason for doing so
>> is so dax_associate_entry()/dax_disassociate_entry() can generate
>> warnings which should never occur anyway. So all that code is
>> functionally unnecessary.
>
> What do you mean outside of fs/dax, do you literally mean outside of
> fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?

Only the cases where fs dax pages might need it. Ie. not devdax, which I
haven't looked at closely yet.

> Memory failure needs ->mapping and ->index to rmap dax pages. See
> mm/memory-failure.c::__add_to_kill() and
> mm/memory-failure.c::__add_to_kill_fsdax(), where the latter one is for
> cases where the fs has signed up to react to dax page failure.

How does that work for reflink/shared pages which overwrite
page->mapping and page->index? Eg. in __add_to_kill(), if *p is a shared
fs dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
page_address_in_vma(vma, p) will probably crash.

Thanks.
* Re: ZONE_DEVICE refcounting

From: Dave Chinner @ 2024-03-22 3:18 UTC (permalink / raw)
To: Alistair Popple
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg, linux-fsdevel, jack, djwong, hch, david, ruansy.fnst

On Fri, Mar 22, 2024 at 11:01:25AM +1100, Alistair Popple wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > Alistair Popple wrote:
> >>
> >> Alistair Popple <apopple@nvidia.com> writes:
> >>
> >> > Dan Williams <dan.j.williams@intel.com> writes:
> >> >
> >> >> Alistair Popple wrote:
> >> >
> >> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
> >> > sharing PAGE_MAPPING_DAX_SHARED.
> >>
> >> Also it feels like I could be missing something here. AFAICT the
> >> page->mapping and page->index fields can't actually be used outside of
> >> fs/dax because they are overloaded for the shared case. Therefore
> >> setting/clearing them could be skipped and the only reason for doing so
> >> is so dax_associate_entry()/dax_disassociate_entry() can generate
> >> warnings which should never occur anyway. So all that code is
> >> functionally unnecessary.
> >
> > What do you mean outside of fs/dax, do you literally mean outside of
> > fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?
>
> Only the cases where fs dax pages might need it. Ie. not devdax, which I
> haven't looked at closely yet.
>
> > Memory failure needs ->mapping and ->index to rmap dax pages. See
> > mm/memory-failure.c::__add_to_kill() and
> > mm/memory-failure.c::__add_to_kill_fsdax(), where the latter one is for
> > cases where the fs has signed up to react to dax page failure.
>
> How does that work for reflink/shared pages which overwrite
> page->mapping and page->index?

Via reverse mapping in the *filesystem*, not the mm rmap stuff.

pmem_pagemap_memory_failure()
  dax_holder_notify_failure()
    .notify_failure()
      xfs_dax_notify_failure()
        xfs_dax_notify_ddev_failure()
          xfs_rmap_query_range(xfs_dax_failure_fn)
            xfs_dax_failure_fn(rmap record)
              <grabs inode from cache>
              <converts range to file offset>
              mf_dax_kill_procs(inode->mapping, pgoff)
                collect_procs_fsdax(mapping, page)
                  add_to_kill_fsdax(task)
                    __add_to_kill(task)
              unmap_and_kill_tasks()

Remember: in FSDAX, the pages are the storage media physically owned
by the filesystem, not the mm subsystem. Hence answering questions
like "who owns this page" can only be answered correctly by asking
the filesystem.

We shortcut that for pages that only have one owner - we just store
the owner information in the page as a {mapping, offset} tuple. But
when we have multiple owners, the only way to find all the {mapping,
offset} tuples is to ask the filesystem to find all the owners of
that page.

Hence the special case values for page->mapping/page->index for
pages over shared filesystem extents. These shared extents are
communicated to the fsdax layer via the IOMAP_F_SHARED flag in the
iomaps returned by the filesystem. This flag is the trigger for the
special mapping share count behaviour to be used. e.g. see
dax_insert_entry(iomap_iter) -> dax_associate_entry(shared) ->
dax_page_share_get()....

> Eg. in __add_to_kill(), if *p is a shared fs
> dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
> page_address_in_vma(vma, p) will probably crash.

As per above, we don't get the mapping from the page in those cases.
If you haven't got access to the page through a filesystem method and
guaranteed that truncate() can't free it from under you, then you're
probably doing the wrong thing with fsdax...

-Dave.
-- 
Dave Chinner
david@fromorbit.com
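Dave's single-owner shortcut versus the filesystem rmap query can be modelled roughly as below. This is a userspace sketch with invented names and an invented sentinel; the real code lives in fs/dax.c, mm/memory-failure.c and the filesystem's own reverse-mapping implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a page over a non-shared extent carries its single
 * {mapping, offset} owner inline; a shared page forces a query of the
 * filesystem's reverse map. All names here are illustrative only. */
#define TOY_DAX_SHARED ((void *)0x1)	/* stands in for PAGE_MAPPING_DAX_SHARED */

struct toy_owner { void *mapping; long offset; };

struct toy_page {
	void *mapping;			/* real mapping, or TOY_DAX_SHARED */
	long index;
};

/* Stand-in for an fs rmap query such as xfs_rmap_query_range(). */
typedef int (*toy_fs_rmap_fn)(const struct toy_page *p,
			      struct toy_owner *out, int max);

static int toy_owners_of(const struct toy_page *p, toy_fs_rmap_fn fs_rmap,
			 struct toy_owner *out, int max)
{
	if (p->mapping != TOY_DAX_SHARED) {
		/* Fast path: the single owner is stored in the page. */
		if (max < 1)
			return 0;
		out[0].mapping = p->mapping;
		out[0].offset = p->index;
		return 1;
	}
	/* Shared extent: only the filesystem knows every owner. */
	return fs_rmap(p, out, max);
}
```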
* Re: ZONE_DEVICE refcounting

From: Alistair Popple @ 2024-03-22 5:34 UTC (permalink / raw)
To: Dave Chinner
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg, linux-fsdevel, jack, djwong, hch, david, ruansy.fnst

Dave Chinner <david@fromorbit.com> writes:

> On Fri, Mar 22, 2024 at 11:01:25AM +1100, Alistair Popple wrote:
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>> > Alistair Popple wrote:
>> >>
>> >> Alistair Popple <apopple@nvidia.com> writes:
>> >>
>> >> > Dan Williams <dan.j.williams@intel.com> writes:
>> >> >
>> >> >> Alistair Popple wrote:
>> >> >
>> >> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
>> >> > sharing PAGE_MAPPING_DAX_SHARED.
>> >>
>> >> Also it feels like I could be missing something here. AFAICT the
>> >> page->mapping and page->index fields can't actually be used outside of
>> >> fs/dax because they are overloaded for the shared case. Therefore
>> >> setting/clearing them could be skipped and the only reason for doing so
>> >> is so dax_associate_entry()/dax_disassociate_entry() can generate
>> >> warnings which should never occur anyway. So all that code is
>> >> functionally unnecessary.
>> >
>> > What do you mean outside of fs/dax, do you literally mean outside of
>> > fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?
>>
>> Only the cases where fs dax pages might need it. Ie. not devdax, which I
>> haven't looked at closely yet.
>>
>> > Memory failure needs ->mapping and ->index to rmap dax pages. See
>> > mm/memory-failure.c::__add_to_kill() and
>> > mm/memory-failure.c::__add_to_kill_fsdax(), where the latter one is for
>> > cases where the fs has signed up to react to dax page failure.
>>
>> How does that work for reflink/shared pages which overwrite
>> page->mapping and page->index?
>
> Via reverse mapping in the *filesystem*, not the mm rmap stuff.
>
> pmem_pagemap_memory_failure()
>   dax_holder_notify_failure()
>     .notify_failure()
>       xfs_dax_notify_failure()
>         xfs_dax_notify_ddev_failure()
>           xfs_rmap_query_range(xfs_dax_failure_fn)
>             xfs_dax_failure_fn(rmap record)
>               <grabs inode from cache>
>               <converts range to file offset>
>               mf_dax_kill_procs(inode->mapping, pgoff)
>                 collect_procs_fsdax(mapping, page)
>                   add_to_kill_fsdax(task)
>                     __add_to_kill(task)
>               unmap_and_kill_tasks()
>
> Remember: in FSDAX, the pages are the storage media physically owned
> by the filesystem, not the mm subsystem. Hence answering questions
> like "who owns this page" can only be answered correctly by asking
> the filesystem.

Thanks Dave for writing that up, it really helped solidify my
understanding of how this is all supposed to work.

> We shortcut that for pages that only have one owner - we just store
> the owner information in the page as a {mapping, offset} tuple. But
> when we have multiple owners, the only way to find all the {mapping,
> offset} tuples is to ask the filesystem to find all the owners of
> that page.
>
> Hence the special case values for page->mapping/page->index for
> pages over shared filesystem extents. These shared extents are
> communicated to the fsdax layer via the IOMAP_F_SHARED flag in the
> iomaps returned by the filesystem. This flag is the trigger for the
> special mapping share count behaviour to be used. e.g. see
> dax_insert_entry(iomap_iter) -> dax_associate_entry(shared) ->
> dax_page_share_get()....
>
>> Eg. in __add_to_kill(), if *p is a shared fs
>> dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
>> page_address_in_vma(vma, p) will probably crash.
>
> As per above, we don't get the mapping from the page in those cases.

Yep, that all makes sense and I see where I was getting confused. It was
because in __add_to_kill() we do actually use page->mapping when
page_address_in_vma(vma, p) is called. And because
folio_test_anon(page_folio(p)) is also true for shared FS DAX pages
(p->mapping == PAGE_MAPPING_DAX_SHARED) I thought things would go bad
there.

However, after re-reading the page_address_in_vma() implementation I
noticed that isn't the case, because folio_anon_vma(page_folio(p)) will
still return NULL, which was the detail I had missed.

So to go back to my original point, it seems page->mapping and
page->index are largely unused on fs dax pages, even for memory failure,
because for memory failure the mapping and fsdax_pgoff come from the
filesystem, not the page.

> If you haven't got access to the page through a filesystem method and
> guaranteed that truncate() can't free it from under you, then you're
> probably doing the wrong thing with fsdax...
>
> -Dave.
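The detail described above - an "anon" test passing on a shared DAX page, yet the recovered anon_vma coming back NULL - comes down to the flag encoding, which a small sketch can illustrate. The constants below only mimic the idea behind the kernel's PAGE_MAPPING_* scheme; this is a userspace sketch, not kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy model: the shared-DAX marker aliases the anon flag bit, so a raw
 * anon test on the mapping word is true, but masking off the flag bits
 * leaves a NULL pointer. Illustrative constants only. */
#define TOY_MAPPING_ANON	0x1UL
#define TOY_MAPPING_FLAGS	0x3UL
#define TOY_MAPPING_DAX_SHARED	0x1UL	/* aliases the anon bit */

static int toy_test_anon(uintptr_t mapping)
{
	return (mapping & TOY_MAPPING_ANON) != 0;
}

/* Mirrors what an anon_vma lookup ends up doing: strip the flag bits
 * and return whatever pointer remains. */
static void *toy_anon_vma(uintptr_t mapping)
{
	if (!toy_test_anon(mapping))
		return NULL;
	return (void *)(mapping & ~TOY_MAPPING_FLAGS);
}
```

With this encoding the shared-DAX value passes the anon test but yields a NULL anon_vma, so callers that check the result bail out safely rather than crash.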
* Re: ZONE_DEVICE refcounting

From: Dan Williams @ 2024-03-22 6:58 UTC (permalink / raw)
To: Alistair Popple, Dave Chinner
Cc: Dan Williams, linux-mm, jhubbard, rcampbell, willy, jgg, linux-fsdevel, jack, djwong, hch, david, ruansy.fnst

Alistair Popple wrote:
>
> Dave Chinner <david@fromorbit.com> writes:
>
> > On Fri, Mar 22, 2024 at 11:01:25AM +1100, Alistair Popple wrote:
> >>
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >> > Alistair Popple wrote:
> >> >>
> >> >> Alistair Popple <apopple@nvidia.com> writes:
> >> >>
> >> >> > Dan Williams <dan.j.williams@intel.com> writes:
> >> >> >
> >> >> >> Alistair Popple wrote:
> >> >> >
> >> >> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
> >> >> > sharing PAGE_MAPPING_DAX_SHARED.
> >> >>
> >> >> Also it feels like I could be missing something here. AFAICT the
> >> >> page->mapping and page->index fields can't actually be used outside of
> >> >> fs/dax because they are overloaded for the shared case. Therefore
> >> >> setting/clearing them could be skipped and the only reason for doing so
> >> >> is so dax_associate_entry()/dax_disassociate_entry() can generate
> >> >> warnings which should never occur anyway. So all that code is
> >> >> functionally unnecessary.
> >> >
> >> > What do you mean outside of fs/dax, do you literally mean outside of
> >> > fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?
> >>
> >> Only the cases where fs dax pages might need it. Ie. not devdax, which I
> >> haven't looked at closely yet.
> >>
> >> > Memory failure needs ->mapping and ->index to rmap dax pages. See
> >> > mm/memory-failure.c::__add_to_kill() and
> >> > mm/memory-failure.c::__add_to_kill_fsdax(), where the latter one is for
> >> > cases where the fs has signed up to react to dax page failure.
> >>
> >> How does that work for reflink/shared pages which overwrite
> >> page->mapping and page->index?
> >
> > Via reverse mapping in the *filesystem*, not the mm rmap stuff.
> >
> > pmem_pagemap_memory_failure()
> >   dax_holder_notify_failure()
> >     .notify_failure()
> >       xfs_dax_notify_failure()
> >         xfs_dax_notify_ddev_failure()
> >           xfs_rmap_query_range(xfs_dax_failure_fn)
> >             xfs_dax_failure_fn(rmap record)
> >               <grabs inode from cache>
> >               <converts range to file offset>
> >               mf_dax_kill_procs(inode->mapping, pgoff)
> >                 collect_procs_fsdax(mapping, page)
> >                   add_to_kill_fsdax(task)
> >                     __add_to_kill(task)
> >               unmap_and_kill_tasks()
> >
> > Remember: in FSDAX, the pages are the storage media physically owned
> > by the filesystem, not the mm subsystem. Hence answering questions
> > like "who owns this page" can only be answered correctly by asking
> > the filesystem.
>
> Thanks Dave for writing that up, it really helped solidify my
> understanding of how this is all supposed to work.

Yes, thanks Dave!

> > We shortcut that for pages that only have one owner - we just store
> > the owner information in the page as a {mapping, offset} tuple. But
> > when we have multiple owners, the only way to find all the {mapping,
> > offset} tuples is to ask the filesystem to find all the owners of
> > that page.
> >
> > Hence the special case values for page->mapping/page->index for
> > pages over shared filesystem extents. These shared extents are
> > communicated to the fsdax layer via the IOMAP_F_SHARED flag in the
> > iomaps returned by the filesystem. This flag is the trigger for the
> > special mapping share count behaviour to be used. e.g. see
> > dax_insert_entry(iomap_iter) -> dax_associate_entry(shared) ->
> > dax_page_share_get()....
> >
> >> Eg. in __add_to_kill(), if *p is a shared fs
> >> dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
> >> page_address_in_vma(vma, p) will probably crash.
> >
> > As per above, we don't get the mapping from the page in those cases.
>
> Yep, that all makes sense and I see where I was getting confused. It was
> because in __add_to_kill() we do actually use page->mapping when
> page_address_in_vma(vma, p) is called. And because
> folio_test_anon(page_folio(p)) is also true for shared FS DAX pages
> (p->mapping == PAGE_MAPPING_DAX_SHARED) I thought things would go bad
> there.
>
> However, after re-reading the page_address_in_vma() implementation I
> noticed that isn't the case, because folio_anon_vma(page_folio(p)) will
> still return NULL, which was the detail I had missed.
>
> So to go back to my original point, it seems page->mapping and
> page->index are largely unused on fs dax pages, even for memory failure,
> because for memory failure the mapping and fsdax_pgoff come from the
> filesystem, not the page.

Only for XFS; all other filesystems depend on typical rmap, and I
believe XFS only uses that path if the fs is mounted with reflink
support.

You had asked about tests before, and I do have a regression test for
memory-failure in the ndctl regression suite that should be validating
this for xfs and ext4:

https://github.com/pmem/ndctl/blob/main/test/dax-poison.c#L47

  meson test -C build --suite ndctl:ndctl

There are setup instructions here:

https://github.com/pmem/ndctl/blob/main/README.md

I typically run it in its own VM. This run_qemu script can build and
boot an environment to run it:

https://github.com/pmem/run_qemu

There is also kdevops that includes it, but I have not tried that:

https://github.com/linux-kdevops/kdevops