From: Peter Xu <peterx@redhat.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
rcampbell@nvidia.com, linux-doc@vger.kernel.org,
nouveau@lists.freedesktop.org, hughd@google.com,
linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
hch@infradead.org, bskeggs@redhat.com, jgg@nvidia.com,
shakeelb@google.com, jhubbard@nvidia.com, willy@infradead.org,
Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v10 07/10] mm: Device exclusive memory access
Date: Wed, 9 Jun 2021 12:05:06 -0400 [thread overview]
Message-ID: <YMDmsha6GDtUf3Vs@t490s> (raw)
In-Reply-To: <270551728.uXnuCZxQlr@nvdebian>
On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> >
> > [...]
> >
> > > +static bool page_make_device_exclusive_one(struct page *page,
> > > + struct vm_area_struct *vma, unsigned long address, void *priv)
> > > +{
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + struct page_vma_mapped_walk pvmw = {
> > > + .page = page,
> > > + .vma = vma,
> > > + .address = address,
> > > + };
> > > + struct make_exclusive_args *args = priv;
> > > + pte_t pteval;
> > > + struct page *subpage;
> > > + bool ret = true;
> > > + struct mmu_notifier_range range;
> > > + swp_entry_t entry;
> > > + pte_t swp_pte;
> > > +
> > > + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > > + vma->vm_mm, address, min(vma->vm_end,
> > > + address + page_size(page)), args->owner);
> > > + mmu_notifier_invalidate_range_start(&range);
> > > +
> > > + while (page_vma_mapped_walk(&pvmw)) {
> > > + /* Unexpected PMD-mapped THP? */
> > > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> >
> > [1]
> >
> > > +
> > > + if (!pte_present(*pvmw.pte)) {
> > > + ret = false;
> > > + page_vma_mapped_walk_done(&pvmw);
> > > + break;
> > > + }
> > > +
> > > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > > + address = pvmw.address;
> >
> > I raised a question here previously and didn't get an answer...
> >
> > https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/
>
> Sorry, I had overlooked that. Will continue the discussion here.
No problem. I also didn't really express clearly last time, I'm happy we can
discuss this more thoroughly, even if it may be a corner case only.
>
> > I think I get your point now and it does look possible that the split page can
> > still be mapped somewhere else as thp, then having some subpage maintainance
> > looks necessary. The confusing part is above [1] you've also got that
> > VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..
>
> Going back I thought your original question was whether subpage != page is
> possible. My main point was it's possible if we get a thp head. In that case we
> need to replace all pte's with exclusive entries because I haven't (yet)
> defined a pmd version of device exclusive entries and also rmap_walk won't deal
> with tail pages (see below).
>
> > Then I remembered these code majorly come from the try_to_unmap() so I looked
> > there. I _think_ what's missing here is something like:
> >
> > if (flags & TTU_SPLIT_HUGE_PMD)
> > split_huge_pmd_address(vma, address, false, page);
> >
> > at the entry of page_make_device_exclusive_one()?
> >
> > That !pte assertion in try_to_unmap() makes sense to me as long as it has split
> > the thp page first always. However seems not the case for FOLL_SPLIT_PMD as
> > you previously mentioned.
>
> At present this is limited to PageAnon pages which have had CoW broken, which I
> think means there shouldn't be other mappings so I expect the PMD will always
> have been split into small PTEs mapping subpages by GUP which is what that
> assertion [1] is checking. I could call split_huge_pmd_address() unconditionally
> as suggested but see the discussion below.
Yes, I think calling that unconditionally should be enough.
>
> > Meanwhile, I also started to wonder whether it's even right to call rmap_walk()
> > with tail pages... Please see below.
> >
> > > +
> > > + /* Nuke the page table entry. */
> > > + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > > + pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > +
> > > + /* Move the dirty bit to the page. Now the pte is gone. */
> > > + if (pte_dirty(pteval))
> > > + set_page_dirty(page);
> > > +
> > > + /*
> > > + * Check that our target page is still mapped at the expected
> > > + * address.
> > > + */
> > > + if (args->mm == mm && args->address == address &&
> > > + pte_write(pteval))
> > > + args->valid = true;
> > > +
> > > + /*
> > > + * Store the pfn of the page in a special migration
> > > + * pte. do_swap_page() will wait until the migration
> > > + * pte is removed and then restart fault handling.
> > > + */
> > > + if (pte_write(pteval))
> > > + entry = make_writable_device_exclusive_entry(
> > > + page_to_pfn(subpage));
> > > + else
> > > + entry = make_readable_device_exclusive_entry(
> > > + page_to_pfn(subpage));
> > > + swp_pte = swp_entry_to_pte(entry);
> > > + if (pte_soft_dirty(pteval))
> > > + swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > + if (pte_uffd_wp(pteval))
> > > + swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > > +
> > > + set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > +
> > > + /*
> > > + * There is a reference on the page for the swap entry which has
> > > + * been removed, so shouldn't take another.
> > > + */
> > > + page_remove_rmap(subpage, false);
> > > + }
> > > +
> > > + mmu_notifier_invalidate_range_end(&range);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +/**
> > > + * page_make_device_exclusive - mark the page exclusively owned by a device
> > > + * @page: the page to replace page table entries for
> > > + * @mm: the mm_struct where the page is expected to be mapped
> > > + * @address: address where the page is expected to be mapped
> > > + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> > > + *
> > > + * Tries to remove all the page table entries which are mapping this page and
> > > + * replace them with special device exclusive swap entries to grant a device
> > > + * exclusive access to the page. Caller must hold the page lock.
> > > + *
> > > + * Returns false if the page is still mapped, or if it could not be unmapped
> > > + * from the expected address. Otherwise returns true (success).
> > > + */
> > > +static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm,
> > > + unsigned long address, void *owner)
> > > +{
> > > + struct make_exclusive_args args = {
> > > + .mm = mm,
> > > + .address = address,
> > > + .owner = owner,
> > > + .valid = false,
> > > + };
> > > + struct rmap_walk_control rwc = {
> > > + .rmap_one = page_make_device_exclusive_one,
> > > + .done = page_not_mapped,
> > > + .anon_lock = page_lock_anon_vma_read,
> > > + .arg = &args,
> > > + };
> > > +
> > > + /*
> > > + * Restrict to anonymous pages for now to avoid potential writeback
> > > + * issues.
> > > + */
> > > + if (!PageAnon(page))
> > > + return false;
> > > +
> > > + rmap_walk(page, &rwc);
> >
> > Here we call rmap_walk() on each page we've got. If it was thp then IIUC it'll
> > become the tail pages to walk as the outcome of FOLL_SPLIT_PMD gup (please
> > refer to the last reply of mine). However now I'm uncertain whether we can do
> > rmap_walk on tail page at all... As rmap_walk_anon() has thp_nr_pages() which
> > has:
> >
> > VM_BUG_ON_PGFLAGS(PageTail(page), page);
>
> In either case (FOLL_SPLIT_PMD or not) my understanding is GUP will return a
> sub/tail page (perhaps I mixed up some terminology in the last thread but I
> think we're in agreement here).
Aha, I totally missed this when I read last time (of follow_trans_huge_pmd)..
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
Now I agree it'll always return subpage, even if thp mapped. And do
FOLL_SPLIT_PMD makes sense too to do early break on cow pages as you said
before.
> For thp this means we could end up passing
> tail pages to rmap_walk(), however it doesn't actually walk them.
>
> Based on the results of previous testing I had done I assumed rmap_walk()
> filtered out tail pages. It does, and I didn't hit the BUG_ON above, but the
> filtering was not as deliberate as assumed.
>
> I've gone back and looked at what was happening in my earlier tests and the
> tail pages get filtered because the VMA is not getting locked in
> page_lock_anon_vma_read() due to failing this check:
>
> anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> goto out;
>
> And now I'm not sure it makes sense to read page->mapping of a tail page. So
> it might be best if we explicitly ignore any tail pages returned from GUP, at
> least for now (a future series will improve thp support such as adding a pmd
> version for exclusive entries).
I feel like it's illegal to access page->mapping of tail pages; I looked at
what happens if we call page_anon_vma() on a tail page:
struct anon_vma *page_anon_vma(struct page *page)
{
unsigned long mapping;
page = compound_head(page);
mapping = (unsigned long)page->mapping;
if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
return NULL;
return __page_rmapping(page);
}
It'll just take the head's mapping instead. It makes sense since the tail page
shouldn't have a different value against the head page, afaiu.
It would be great if thp experts could chim in. Before that happens, I agree
with you that a safer approach is to explicitly not walk a tail page for its
rmap (and I think the rmap of a tail page will be the same of the head
anyways.. since they seem to share the anon_vma as quoted).
>
> > So... for thp mappings, wondering whether we should do normal GUP (without
> > SPLIT), pass in always normal or head pages into rmap_walk(), but then
> > unconditionally split_huge_pmd_address() in page_make_device_exclusive_one()?
>
> That could work (although I think GUP will still return tail pages - see
> follow_trans_huge_pmd() which is called from follow_pmd_mask() in gup).
Agreed.
> The main problem is split_huge_pmd_address() unconditionally calls a mmu
> notifier so I would need to plumb in passing an owner everywhere which could
> get messy.
Could I ask why? split_huge_pmd_address() will notify with CLEAR, so I'm a bit
confused why we need to pass over the owner.
I thought plumb it right before your EXCLUSIVE notifier init would work?
---8<---
diff --git a/mm/rmap.c b/mm/rmap.c
index a94d9aed9d95..360ce86f3822 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2042,6 +2042,12 @@ static bool page_make_device_exclusive_one(struct page *page,
swp_entry_t entry;
pte_t swp_pte;
+ /*
+ * Make sure thps split as device exclusive entries only support pte
+ * level for now.
+ */
+ split_huge_pmd_address(vma, address, false, page);
+
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
vma->vm_mm, address, min(vma->vm_end,
address + page_size(page)), args->owner);
---8<---
Thanks,
--
Peter Xu
WARNING: multiple messages have this Message-ID (diff)
From: Peter Xu <peterx@redhat.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: rcampbell@nvidia.com, willy@infradead.org,
linux-doc@vger.kernel.org, nouveau@lists.freedesktop.org,
hughd@google.com, linux-kernel@vger.kernel.org,
dri-devel@lists.freedesktop.org, hch@infradead.org,
linux-mm@kvack.org, shakeelb@google.com, bskeggs@redhat.com,
jgg@nvidia.com, akpm@linux-foundation.org,
Christoph Hellwig <hch@lst.de>
Subject: Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access
Date: Wed, 9 Jun 2021 12:05:06 -0400 [thread overview]
Message-ID: <YMDmsha6GDtUf3Vs@t490s> (raw)
In-Reply-To: <270551728.uXnuCZxQlr@nvdebian>
On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> >
> > [...]
> >
> > > +static bool page_make_device_exclusive_one(struct page *page,
> > > + struct vm_area_struct *vma, unsigned long address, void *priv)
> > > +{
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + struct page_vma_mapped_walk pvmw = {
> > > + .page = page,
> > > + .vma = vma,
> > > + .address = address,
> > > + };
> > > + struct make_exclusive_args *args = priv;
> > > + pte_t pteval;
> > > + struct page *subpage;
> > > + bool ret = true;
> > > + struct mmu_notifier_range range;
> > > + swp_entry_t entry;
> > > + pte_t swp_pte;
> > > +
> > > + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > > + vma->vm_mm, address, min(vma->vm_end,
> > > + address + page_size(page)), args->owner);
> > > + mmu_notifier_invalidate_range_start(&range);
> > > +
> > > + while (page_vma_mapped_walk(&pvmw)) {
> > > + /* Unexpected PMD-mapped THP? */
> > > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> >
> > [1]
> >
> > > +
> > > + if (!pte_present(*pvmw.pte)) {
> > > + ret = false;
> > > + page_vma_mapped_walk_done(&pvmw);
> > > + break;
> > > + }
> > > +
> > > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > > + address = pvmw.address;
> >
> > I raised a question here previously and didn't get an answer...
> >
> > https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/
>
> Sorry, I had overlooked that. Will continue the discussion here.
No problem. I also didn't really express clearly last time, I'm happy we can
discuss this more thoroughly, even if it may be a corner case only.
>
> > I think I get your point now and it does look possible that the split page can
> > still be mapped somewhere else as thp, then having some subpage maintainance
> > looks necessary. The confusing part is above [1] you've also got that
> > VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..
>
> Going back I thought your original question was whether subpage != page is
> possible. My main point was it's possible if we get a thp head. In that case we
> need to replace all pte's with exclusive entries because I haven't (yet)
> defined a pmd version of device exclusive entries and also rmap_walk won't deal
> with tail pages (see below).
>
> > Then I remembered these code majorly come from the try_to_unmap() so I looked
> > there. I _think_ what's missing here is something like:
> >
> > if (flags & TTU_SPLIT_HUGE_PMD)
> > split_huge_pmd_address(vma, address, false, page);
> >
> > at the entry of page_make_device_exclusive_one()?
> >
> > That !pte assertion in try_to_unmap() makes sense to me as long as it has split
> > the thp page first always. However seems not the case for FOLL_SPLIT_PMD as
> > you previously mentioned.
>
> At present this is limited to PageAnon pages which have had CoW broken, which I
> think means there shouldn't be other mappings so I expect the PMD will always
> have been split into small PTEs mapping subpages by GUP which is what that
> assertion [1] is checking. I could call split_huge_pmd_address() unconditionally
> as suggested but see the discussion below.
Yes, I think calling that unconditionally should be enough.
>
> > Meanwhile, I also started to wonder whether it's even right to call rmap_walk()
> > with tail pages... Please see below.
> >
> > > +
> > > + /* Nuke the page table entry. */
> > > + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > > + pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > +
> > > + /* Move the dirty bit to the page. Now the pte is gone. */
> > > + if (pte_dirty(pteval))
> > > + set_page_dirty(page);
> > > +
> > > + /*
> > > + * Check that our target page is still mapped at the expected
> > > + * address.
> > > + */
> > > + if (args->mm == mm && args->address == address &&
> > > + pte_write(pteval))
> > > + args->valid = true;
> > > +
> > > + /*
> > > + * Store the pfn of the page in a special migration
> > > + * pte. do_swap_page() will wait until the migration
> > > + * pte is removed and then restart fault handling.
> > > + */
> > > + if (pte_write(pteval))
> > > + entry = make_writable_device_exclusive_entry(
> > > + page_to_pfn(subpage));
> > > + else
> > > + entry = make_readable_device_exclusive_entry(
> > > + page_to_pfn(subpage));
> > > + swp_pte = swp_entry_to_pte(entry);
> > > + if (pte_soft_dirty(pteval))
> > > + swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > + if (pte_uffd_wp(pteval))
> > > + swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > > +
> > > + set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > +
> > > + /*
> > > + * There is a reference on the page for the swap entry which has
> > > + * been removed, so shouldn't take another.
> > > + */
> > > + page_remove_rmap(subpage, false);
> > > + }
> > > +
> > > + mmu_notifier_invalidate_range_end(&range);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +/**
> > > + * page_make_device_exclusive - mark the page exclusively owned by a device
> > > + * @page: the page to replace page table entries for
> > > + * @mm: the mm_struct where the page is expected to be mapped
> > > + * @address: address where the page is expected to be mapped
> > > + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> > > + *
> > > + * Tries to remove all the page table entries which are mapping this page and
> > > + * replace them with special device exclusive swap entries to grant a device
> > > + * exclusive access to the page. Caller must hold the page lock.
> > > + *
> > > + * Returns false if the page is still mapped, or if it could not be unmapped
> > > + * from the expected address. Otherwise returns true (success).
> > > + */
> > > +static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm,
> > > + unsigned long address, void *owner)
> > > +{
> > > + struct make_exclusive_args args = {
> > > + .mm = mm,
> > > + .address = address,
> > > + .owner = owner,
> > > + .valid = false,
> > > + };
> > > + struct rmap_walk_control rwc = {
> > > + .rmap_one = page_make_device_exclusive_one,
> > > + .done = page_not_mapped,
> > > + .anon_lock = page_lock_anon_vma_read,
> > > + .arg = &args,
> > > + };
> > > +
> > > + /*
> > > + * Restrict to anonymous pages for now to avoid potential writeback
> > > + * issues.
> > > + */
> > > + if (!PageAnon(page))
> > > + return false;
> > > +
> > > + rmap_walk(page, &rwc);
> >
> > Here we call rmap_walk() on each page we've got. If it was thp then IIUC it'll
> > become the tail pages to walk as the outcome of FOLL_SPLIT_PMD gup (please
> > refer to the last reply of mine). However now I'm uncertain whether we can do
> > rmap_walk on tail page at all... As rmap_walk_anon() has thp_nr_pages() which
> > has:
> >
> > VM_BUG_ON_PGFLAGS(PageTail(page), page);
>
> In either case (FOLL_SPLIT_PMD or not) my understanding is GUP will return a
> sub/tail page (perhaps I mixed up some terminology in the last thread but I
> think we're in agreement here).
Aha, I totally missed this when I read last time (of follow_trans_huge_pmd)..
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
Now I agree it'll always return subpage, even if thp mapped. And do
FOLL_SPLIT_PMD makes sense too to do early break on cow pages as you said
before.
> For thp this means we could end up passing
> tail pages to rmap_walk(), however it doesn't actually walk them.
>
> Based on the results of previous testing I had done I assumed rmap_walk()
> filtered out tail pages. It does, and I didn't hit the BUG_ON above, but the
> filtering was not as deliberate as assumed.
>
> I've gone back and looked at what was happening in my earlier tests and the
> tail pages get filtered because the VMA is not getting locked in
> page_lock_anon_vma_read() due to failing this check:
>
> anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> goto out;
>
> And now I'm not sure it makes sense to read page->mapping of a tail page. So
> it might be best if we explicitly ignore any tail pages returned from GUP, at
> least for now (a future series will improve thp support such as adding a pmd
> version for exclusive entries).
I feel like it's illegal to access page->mapping of tail pages; I looked at
what happens if we call page_anon_vma() on a tail page:
struct anon_vma *page_anon_vma(struct page *page)
{
unsigned long mapping;
page = compound_head(page);
mapping = (unsigned long)page->mapping;
if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
return NULL;
return __page_rmapping(page);
}
It'll just take the head's mapping instead. It makes sense since the tail page
shouldn't have a different value against the head page, afaiu.
It would be great if thp experts could chim in. Before that happens, I agree
with you that a safer approach is to explicitly not walk a tail page for its
rmap (and I think the rmap of a tail page will be the same of the head
anyways.. since they seem to share the anon_vma as quoted).
>
> > So... for thp mappings, wondering whether we should do normal GUP (without
> > SPLIT), pass in always normal or head pages into rmap_walk(), but then
> > unconditionally split_huge_pmd_address() in page_make_device_exclusive_one()?
>
> That could work (although I think GUP will still return tail pages - see
> follow_trans_huge_pmd() which is called from follow_pmd_mask() in gup).
Agreed.
> The main problem is split_huge_pmd_address() unconditionally calls a mmu
> notifier so I would need to plumb in passing an owner everywhere which could
> get messy.
Could I ask why? split_huge_pmd_address() will notify with CLEAR, so I'm a bit
confused why we need to pass over the owner.
I thought plumb it right before your EXCLUSIVE notifier init would work?
---8<---
diff --git a/mm/rmap.c b/mm/rmap.c
index a94d9aed9d95..360ce86f3822 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2042,6 +2042,12 @@ static bool page_make_device_exclusive_one(struct page *page,
swp_entry_t entry;
pte_t swp_pte;
+ /*
+ * Make sure thps split as device exclusive entries only support pte
+ * level for now.
+ */
+ split_huge_pmd_address(vma, address, false, page);
+
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
vma->vm_mm, address, min(vma->vm_end,
address + page_size(page)), args->owner);
---8<---
Thanks,
--
Peter Xu
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
WARNING: multiple messages have this Message-ID (diff)
From: Peter Xu <peterx@redhat.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: rcampbell@nvidia.com, willy@infradead.org,
linux-doc@vger.kernel.org, nouveau@lists.freedesktop.org,
hughd@google.com, linux-kernel@vger.kernel.org,
dri-devel@lists.freedesktop.org, hch@infradead.org,
linux-mm@kvack.org, shakeelb@google.com, bskeggs@redhat.com,
jgg@nvidia.com, jhubbard@nvidia.com, akpm@linux-foundation.org,
Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v10 07/10] mm: Device exclusive memory access
Date: Wed, 9 Jun 2021 12:05:06 -0400 [thread overview]
Message-ID: <YMDmsha6GDtUf3Vs@t490s> (raw)
In-Reply-To: <270551728.uXnuCZxQlr@nvdebian>
On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> >
> > [...]
> >
> > > +static bool page_make_device_exclusive_one(struct page *page,
> > > + struct vm_area_struct *vma, unsigned long address, void *priv)
> > > +{
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + struct page_vma_mapped_walk pvmw = {
> > > + .page = page,
> > > + .vma = vma,
> > > + .address = address,
> > > + };
> > > + struct make_exclusive_args *args = priv;
> > > + pte_t pteval;
> > > + struct page *subpage;
> > > + bool ret = true;
> > > + struct mmu_notifier_range range;
> > > + swp_entry_t entry;
> > > + pte_t swp_pte;
> > > +
> > > + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > > + vma->vm_mm, address, min(vma->vm_end,
> > > + address + page_size(page)), args->owner);
> > > + mmu_notifier_invalidate_range_start(&range);
> > > +
> > > + while (page_vma_mapped_walk(&pvmw)) {
> > > + /* Unexpected PMD-mapped THP? */
> > > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> >
> > [1]
> >
> > > +
> > > + if (!pte_present(*pvmw.pte)) {
> > > + ret = false;
> > > + page_vma_mapped_walk_done(&pvmw);
> > > + break;
> > > + }
> > > +
> > > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > > + address = pvmw.address;
> >
> > I raised a question here previously and didn't get an answer...
> >
> > https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/
>
> Sorry, I had overlooked that. Will continue the discussion here.
No problem. I also didn't really express clearly last time, I'm happy we can
discuss this more thoroughly, even if it may be a corner case only.
>
> > I think I get your point now and it does look possible that the split page can
> > still be mapped somewhere else as thp, then having some subpage maintainance
> > looks necessary. The confusing part is above [1] you've also got that
> > VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..
>
> Going back I thought your original question was whether subpage != page is
> possible. My main point was it's possible if we get a thp head. In that case we
> need to replace all pte's with exclusive entries because I haven't (yet)
> defined a pmd version of device exclusive entries and also rmap_walk won't deal
> with tail pages (see below).
>
> > Then I remembered these code majorly come from the try_to_unmap() so I looked
> > there. I _think_ what's missing here is something like:
> >
> > if (flags & TTU_SPLIT_HUGE_PMD)
> > split_huge_pmd_address(vma, address, false, page);
> >
> > at the entry of page_make_device_exclusive_one()?
> >
> > That !pte assertion in try_to_unmap() makes sense to me as long as it has split
> > the thp page first always. However seems not the case for FOLL_SPLIT_PMD as
> > you previously mentioned.
>
> At present this is limited to PageAnon pages which have had CoW broken, which I
> think means there shouldn't be other mappings so I expect the PMD will always
> have been split into small PTEs mapping subpages by GUP which is what that
> assertion [1] is checking. I could call split_huge_pmd_address() unconditionally
> as suggested but see the discussion below.
Yes, I think calling that unconditionally should be enough.
>
> > Meanwhile, I also started to wonder whether it's even right to call rmap_walk()
> > with tail pages... Please see below.
> >
> > > +
> > > + /* Nuke the page table entry. */
> > > + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > > + pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > +
> > > + /* Move the dirty bit to the page. Now the pte is gone. */
> > > + if (pte_dirty(pteval))
> > > + set_page_dirty(page);
> > > +
> > > + /*
> > > + * Check that our target page is still mapped at the expected
> > > + * address.
> > > + */
> > > + if (args->mm == mm && args->address == address &&
> > > + pte_write(pteval))
> > > + args->valid = true;
> > > +
> > > + /*
> > > + * Store the pfn of the page in a special migration
> > > + * pte. do_swap_page() will wait until the migration
> > > + * pte is removed and then restart fault handling.
> > > + */
> > > + if (pte_write(pteval))
> > > + entry = make_writable_device_exclusive_entry(
> > > + page_to_pfn(subpage));
> > > + else
> > > + entry = make_readable_device_exclusive_entry(
> > > + page_to_pfn(subpage));
> > > + swp_pte = swp_entry_to_pte(entry);
> > > + if (pte_soft_dirty(pteval))
> > > + swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > + if (pte_uffd_wp(pteval))
> > > + swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > > +
> > > + set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > +
> > > + /*
> > > + * There is a reference on the page for the swap entry which has
> > > + * been removed, so shouldn't take another.
> > > + */
> > > + page_remove_rmap(subpage, false);
> > > + }
> > > +
> > > + mmu_notifier_invalidate_range_end(&range);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +/**
> > > + * page_make_device_exclusive - mark the page exclusively owned by a device
> > > + * @page: the page to replace page table entries for
> > > + * @mm: the mm_struct where the page is expected to be mapped
> > > + * @address: address where the page is expected to be mapped
> > > + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> > > + *
> > > + * Tries to remove all the page table entries which are mapping this page and
> > > + * replace them with special device exclusive swap entries to grant a device
> > > + * exclusive access to the page. Caller must hold the page lock.
> > > + *
> > > + * Returns false if the page is still mapped, or if it could not be unmapped
> > > + * from the expected address. Otherwise returns true (success).
> > > + */
> > > +static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm,
> > > + unsigned long address, void *owner)
> > > +{
> > > + struct make_exclusive_args args = {
> > > + .mm = mm,
> > > + .address = address,
> > > + .owner = owner,
> > > + .valid = false,
> > > + };
> > > + struct rmap_walk_control rwc = {
> > > + .rmap_one = page_make_device_exclusive_one,
> > > + .done = page_not_mapped,
> > > + .anon_lock = page_lock_anon_vma_read,
> > > + .arg = &args,
> > > + };
> > > +
> > > + /*
> > > + * Restrict to anonymous pages for now to avoid potential writeback
> > > + * issues.
> > > + */
> > > + if (!PageAnon(page))
> > > + return false;
> > > +
> > > + rmap_walk(page, &rwc);
> >
> > Here we call rmap_walk() on each page we've got. If it was thp then IIUC it'll
> > become the tail pages to walk as the outcome of FOLL_SPLIT_PMD gup (please
> > refer to the last reply of mine). However now I'm uncertain whether we can do
> > rmap_walk on tail page at all... As rmap_walk_anon() has thp_nr_pages() which
> > has:
> >
> > VM_BUG_ON_PGFLAGS(PageTail(page), page);
>
> In either case (FOLL_SPLIT_PMD or not) my understanding is GUP will return a
> sub/tail page (perhaps I mixed up some terminology in the last thread but I
> think we're in agreement here).
Aha, I totally missed this when I read last time (of follow_trans_huge_pmd)..
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
Now I agree it'll always return subpage, even if thp mapped. And do
FOLL_SPLIT_PMD makes sense too to do early break on cow pages as you said
before.
> For thp this means we could end up passing
> tail pages to rmap_walk(), however it doesn't actually walk them.
>
> Based on the results of previous testing I had done I assumed rmap_walk()
> filtered out tail pages. It does, and I didn't hit the BUG_ON above, but the
> filtering was not as deliberate as assumed.
>
> I've gone back and looked at what was happening in my earlier tests and the
> tail pages get filtered because the VMA is not getting locked in
> page_lock_anon_vma_read() due to failing this check:
>
> anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> goto out;
>
> And now I'm not sure it makes sense to read page->mapping of a tail page. So
> it might be best if we explicitly ignore any tail pages returned from GUP, at
> least for now (a future series will improve thp support such as adding a pmd
> version for exclusive entries).
I feel like it's illegal to access page->mapping of tail pages; I looked at
what happens if we call page_anon_vma() on a tail page:
struct anon_vma *page_anon_vma(struct page *page)
{
unsigned long mapping;
page = compound_head(page);
mapping = (unsigned long)page->mapping;
if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
return NULL;
return __page_rmapping(page);
}
It'll just take the head's mapping instead. It makes sense since the tail page
shouldn't have a different value against the head page, afaiu.
It would be great if thp experts could chim in. Before that happens, I agree
with you that a safer approach is to explicitly not walk a tail page for its
rmap (and I think the rmap of a tail page will be the same of the head
anyways.. since they seem to share the anon_vma as quoted).
>
> > So... for thp mappings, wondering whether we should do normal GUP (without
> > SPLIT), pass in always normal or head pages into rmap_walk(), but then
> > unconditionally split_huge_pmd_address() in page_make_device_exclusive_one()?
>
> That could work (although I think GUP will still return tail pages - see
> follow_trans_huge_pmd() which is called from follow_pmd_mask() in gup).
Agreed.
> The main problem is split_huge_pmd_address() unconditionally calls a mmu
> notifier so I would need to plumb in passing an owner everywhere which could
> get messy.
Could I ask why? split_huge_pmd_address() will notify with CLEAR, so I'm a bit
confused why we need to pass over the owner.
I thought plumb it right before your EXCLUSIVE notifier init would work?
---8<---
diff --git a/mm/rmap.c b/mm/rmap.c
index a94d9aed9d95..360ce86f3822 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2042,6 +2042,12 @@ static bool page_make_device_exclusive_one(struct page *page,
swp_entry_t entry;
pte_t swp_pte;
+ /*
+ * Make sure thps split as device exclusive entries only support pte
+ * level for now.
+ */
+ split_huge_pmd_address(vma, address, false, page);
+
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
vma->vm_mm, address, min(vma->vm_end,
address + page_size(page)), args->owner);
---8<---
Thanks,
--
Peter Xu
next prev parent reply other threads:[~2021-06-09 16:05 UTC|newest]
Thread overview: 82+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-07 7:58 [PATCH v10 00/10] Add support for SVM atomics in Nouveau Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-07 7:58 ` [PATCH v10 01/10] mm: Remove special swap entry functions Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-07 7:58 ` [PATCH v10 02/10] mm/swapops: Rework swap entry manipulation code Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-07 7:58 ` [PATCH v10 03/10] mm/rmap: Split try_to_munlock from try_to_unmap Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-07 7:58 ` [PATCH v10 04/10] mm/rmap: Split migration into its own function Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-07 7:58 ` [PATCH v10 05/10] mm: Rename migrate_pgmap_owner Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-08 15:16 ` Peter Xu
2021-06-08 15:16 ` Peter Xu
2021-06-08 15:16 ` [Nouveau] " Peter Xu
2021-06-07 7:58 ` [PATCH v10 06/10] mm/memory.c: Allow different return codes for copy_nonpresent_pte() Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-08 15:19 ` Peter Xu
2021-06-08 15:19 ` Peter Xu
2021-06-08 15:19 ` [Nouveau] " Peter Xu
2021-06-07 7:58 ` [PATCH v10 07/10] mm: Device exclusive memory access Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-08 18:33 ` Peter Xu
2021-06-08 18:33 ` Peter Xu
2021-06-08 18:33 ` [Nouveau] " Peter Xu
2021-06-09 9:38 ` Alistair Popple
2021-06-09 9:38 ` Alistair Popple
2021-06-09 9:38 ` [Nouveau] " Alistair Popple
2021-06-09 16:05 ` Peter Xu [this message]
2021-06-09 16:05 ` Peter Xu
2021-06-09 16:05 ` [Nouveau] " Peter Xu
2021-06-10 0:18 ` Alistair Popple
2021-06-10 0:18 ` Alistair Popple
2021-06-10 0:18 ` [Nouveau] " Alistair Popple
2021-06-10 18:04 ` Peter Xu
2021-06-10 18:04 ` Peter Xu
2021-06-10 18:04 ` [Nouveau] " Peter Xu
2021-06-10 14:21 ` Alistair Popple
2021-06-10 14:21 ` Alistair Popple
2021-06-10 14:21 ` [Nouveau] " Alistair Popple
2021-06-10 23:04 ` Peter Xu
2021-06-10 23:04 ` Peter Xu
2021-06-10 23:04 ` [Nouveau] " Peter Xu
2021-06-10 23:17 ` Alistair Popple
2021-06-10 23:17 ` Alistair Popple
2021-06-10 23:17 ` [Nouveau] " Alistair Popple
2021-06-11 1:00 ` Peter Xu
2021-06-11 1:00 ` Peter Xu
2021-06-11 1:00 ` [Nouveau] " Peter Xu
2021-06-11 3:43 ` Alistair Popple
2021-06-11 3:43 ` Alistair Popple
2021-06-11 3:43 ` [Nouveau] " Alistair Popple
2021-06-11 15:01 ` Peter Xu
2021-06-11 15:01 ` Peter Xu
2021-06-11 15:01 ` [Nouveau] " Peter Xu
2021-06-15 3:08 ` Alistair Popple
2021-06-15 3:08 ` Alistair Popple
2021-06-15 3:08 ` [Nouveau] " Alistair Popple
2021-06-15 16:25 ` Peter Xu
2021-06-15 16:25 ` Peter Xu
2021-06-15 16:25 ` [Nouveau] " Peter Xu
2021-06-16 2:47 ` Alistair Popple
2021-06-16 2:47 ` Alistair Popple
2021-06-16 2:47 ` [Nouveau] " Alistair Popple
2021-06-07 7:58 ` [PATCH v10 08/10] mm: Selftests for exclusive device memory Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-07 7:58 ` [PATCH v10 09/10] nouveau/svm: Refactor nouveau_range_fault Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
2021-06-07 7:58 ` [PATCH v10 10/10] nouveau/svm: Implement atomic SVM access Alistair Popple
2021-06-07 7:58 ` Alistair Popple
2021-06-07 7:58 ` [Nouveau] " Alistair Popple
-- strict thread matches above, loose matches on Subject: below --
2021-06-09 23:02 [PATCH v10 07/10] mm: Device exclusive memory access kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YMDmsha6GDtUf3Vs@t490s \
--to=peterx@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=apopple@nvidia.com \
--cc=bskeggs@redhat.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=hch@infradead.org \
--cc=hch@lst.de \
--cc=hughd@google.com \
--cc=jgg@nvidia.com \
--cc=jhubbard@nvidia.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nouveau@lists.freedesktop.org \
--cc=rcampbell@nvidia.com \
--cc=shakeelb@google.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.