Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Alistair Popple <apopple@nvidia.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Sierra Guiza, Alejandro \(Alex\)" <alex.sierra@amd.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Lyude Paul <lyude@redhat.com>, Karol Herbst <kherbst@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	David Hildenbrand <david@redhat.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	Felix Kuehling <Felix.Kuehling@amd.com>,
	linuxppc-dev@lists.ozlabs.org,
	LKML <linux-kernel@vger.kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Linux MM <linux-mm@kvack.org>,
	Logan Gunthorpe <logang@deltatee.com>,
	Ben Skeggs <bskeggs@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	stable@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>,
	huang ying <huang.ying.caritas@gmail.com>
Subject: Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page
Date: Thu, 25 Aug 2022 10:42:41 +1000	[thread overview]
Message-ID: <87o7w9f7dp.fsf@nvdebian.thelocal> (raw)
In-Reply-To: <YwaOpj54/qUb5fXa@xz-m1.local>


Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 24, 2022 at 04:25:44PM -0400, Peter Xu wrote:
>> On Wed, Aug 24, 2022 at 11:56:25AM +1000, Alistair Popple wrote:
>> > >> Still I don't know whether there'll be any side effect of having stall tlbs
>> > >> in !present ptes because I'm not familiar enough with the private dev swap
>> > >> migration code.  But I think having them will be safe, even if redundant.
>> >
>> > What side-effect were you thinking of? I don't see any issue with not
>> > TLB flushing stale device-private TLBs prior to the migration because
>> > they're not accessible anyway and shouldn't be in any TLB.
>>
>> Sorry to be misleading, I never meant we must add them.  As I said it's
>> just that I don't know the code well so I don't know whether it's safe to
>> not have it.
>>
>> IIUC it's about whether having stall system-ram stall tlb in other
>> processor would matter or not here.  E.g. some none pte that this code
>> collected (boosted both "cpages" and "npages" for a none pte) could have
>> stall tlb in other cores that makes the page writable there.
>
> For this one, let me give a more detailed example.

Thanks, I would have been completely lost about what you were talking
about without this :-)

> It's about whether below could happen:
>
>        thread 1                thread 2                 thread 3
>        --------                --------                 --------
>                           write to page P (data=P1)
>                             (cached TLB writable)
>   zap_pte_range()
>     pgtable lock
>     clear pte for page P
>     pgtable unlock
>     ...
>                                                      migrate_vma_collect
>                                                        pte none, npages++, cpages++
>                                                        allocate device page
>                                                        copy data (with P1)
>                                                        map pte as device swap
>                           write to page P again
>                           (data updated from P1->P2)
>   flush tlb
>
> Then at last from processor side P should have data P2 but actually from
> device memory it's P1. Data corrupt.

In the above scenario migrate_vma_collect_pmd() will observe pte_none.
This will mark the src_pfn[] array as needing a new zero page which will
be installed by migrate_vma_pages()->migrate_vma_insert_page().

So there is no data to be copied hence there can't be any data
corruption. Remember these are private anonymous pages, so any
zap_pte_range() indicates the data is no longer needed (eg.
MADV_DONTNEED).

>>
>> When I said I'm not familiar with the code, it's majorly about one thing I
>> never figured out myself, in that migrate_vma_collect_pmd() has this
>> optimization to trylock on the page, collect if it succeeded:
>>
>>   /*
>>    * Optimize for the common case where page is only mapped once
>>    * in one process. If we can lock the page, then we can safely
>>    * set up a special migration page table entry now.
>>    */
>>    if (trylock_page(page)) {
>>           ...
>>    } else {
>>           put_page(page);
>>           mpfn = 0;
>>    }
>>
>> But it's kind of against a pure "optimization" in that if trylock failed,
>> we'll clear the mpfn so the src[i] will be zero at last.  Then will we
>> directly give up on this page, or will we try to lock_page() again
>> somewhere?

That comment is out dated. We used to try locking the page again but
that was removed by ab09243aa95a ("mm/migrate.c: remove
MIGRATE_PFN_LOCKED"). See
https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com

Will post a clean-up for it.

>> The future unmap op is also based on this "cpages", not "npages":
>>
>> 	if (args->cpages)
>> 		migrate_vma_unmap(args);
>>
>> So I never figured out how this code really works.  It'll be great if you
>> could shed some light to it.
>>
>> Thanks,
>>
>> --
>> Peter Xu

WARNING: multiple messages have this Message-ID (diff)

From: Alistair Popple <apopple@nvidia.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	huang ying <huang.ying.caritas@gmail.com>,
	Linux MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>,
	Felix Kuehling <Felix.Kuehling@amd.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>,
	David Hildenbrand <david@redhat.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Matthew Wilcox <willy@infradead.org>,
	Karol Herbst <kherbst@redhat.com>, Lyude Paul <lyude@redhat.com>,
	Ben Skeggs <bskeggs@redhat.com>,
	Logan Gunthorpe <logang@deltatee.com>,
	paulus@ozlabs.org, linuxppc-dev@lists.ozlabs.org,
	stable@vger.kernel.org
Subject: Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page
Date: Thu, 25 Aug 2022 10:42:41 +1000	[thread overview]
Message-ID: <87o7w9f7dp.fsf@nvdebian.thelocal> (raw)
In-Reply-To: <YwaOpj54/qUb5fXa@xz-m1.local>


Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 24, 2022 at 04:25:44PM -0400, Peter Xu wrote:
>> On Wed, Aug 24, 2022 at 11:56:25AM +1000, Alistair Popple wrote:
>> > >> Still I don't know whether there'll be any side effect of having stall tlbs
>> > >> in !present ptes because I'm not familiar enough with the private dev swap
>> > >> migration code.  But I think having them will be safe, even if redundant.
>> >
>> > What side-effect were you thinking of? I don't see any issue with not
>> > TLB flushing stale device-private TLBs prior to the migration because
>> > they're not accessible anyway and shouldn't be in any TLB.
>>
>> Sorry to be misleading, I never meant we must add them.  As I said it's
>> just that I don't know the code well so I don't know whether it's safe to
>> not have it.
>>
>> IIUC it's about whether having stall system-ram stall tlb in other
>> processor would matter or not here.  E.g. some none pte that this code
>> collected (boosted both "cpages" and "npages" for a none pte) could have
>> stall tlb in other cores that makes the page writable there.
>
> For this one, let me give a more detailed example.

Thanks, I would have been completely lost about what you were talking
about without this :-)

> It's about whether below could happen:
>
>        thread 1                thread 2                 thread 3
>        --------                --------                 --------
>                           write to page P (data=P1)
>                             (cached TLB writable)
>   zap_pte_range()
>     pgtable lock
>     clear pte for page P
>     pgtable unlock
>     ...
>                                                      migrate_vma_collect
>                                                        pte none, npages++, cpages++
>                                                        allocate device page
>                                                        copy data (with P1)
>                                                        map pte as device swap
>                           write to page P again
>                           (data updated from P1->P2)
>   flush tlb
>
> Then at last from processor side P should have data P2 but actually from
> device memory it's P1. Data corrupt.

In the above scenario migrate_vma_collect_pmd() will observe pte_none.
This will mark the src_pfn[] array as needing a new zero page which will
be installed by migrate_vma_pages()->migrate_vma_insert_page().

So there is no data to be copied hence there can't be any data
corruption. Remember these are private anonymous pages, so any
zap_pte_range() indicates the data is no longer needed (eg.
MADV_DONTNEED).

>>
>> When I said I'm not familiar with the code, it's majorly about one thing I
>> never figured out myself, in that migrate_vma_collect_pmd() has this
>> optimization to trylock on the page, collect if it succeeded:
>>
>>   /*
>>    * Optimize for the common case where page is only mapped once
>>    * in one process. If we can lock the page, then we can safely
>>    * set up a special migration page table entry now.
>>    */
>>    if (trylock_page(page)) {
>>           ...
>>    } else {
>>           put_page(page);
>>           mpfn = 0;
>>    }
>>
>> But it's kind of against a pure "optimization" in that if trylock failed,
>> we'll clear the mpfn so the src[i] will be zero at last.  Then will we
>> directly give up on this page, or will we try to lock_page() again
>> somewhere?

That comment is out dated. We used to try locking the page again but
that was removed by ab09243aa95a ("mm/migrate.c: remove
MIGRATE_PFN_LOCKED"). See
https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com

Will post a clean-up for it.

>> The future unmap op is also based on this "cpages", not "npages":
>>
>> 	if (args->cpages)
>> 		migrate_vma_unmap(args);
>>
>> So I never figured out how this code really works.  It'll be great if you
>> could shed some light to it.
>>
>> Thanks,
>>
>> --
>> Peter Xu

next prev parent reply	other threads:[~2022-08-25  1:19 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-16  7:39 [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page Alistair Popple
2022-08-16  7:39 ` Alistair Popple
2022-08-16  7:39 ` [PATCH v2 2/2] selftests/hmm-tests: Add test for dirty bits Alistair Popple
2022-08-16  7:39   ` Alistair Popple
2022-08-16  8:10 ` [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page huang ying
2022-08-16  8:10   ` huang ying
2022-08-16 20:35   ` Peter Xu
2022-08-16 20:35     ` Peter Xu
2022-08-17  1:49     ` Alistair Popple
2022-08-17  1:49       ` Alistair Popple
2022-08-17  2:45       ` Peter Xu
2022-08-17  2:45         ` Peter Xu
2022-08-17  5:41         ` Alistair Popple
2022-08-17  5:41           ` Alistair Popple
2022-08-17  7:17           ` Huang, Ying
2022-08-17  7:17             ` Huang, Ying
2022-08-17  9:41             ` Nadav Amit
2022-08-17  9:41               ` Nadav Amit
2022-08-17 19:27               ` Peter Xu
2022-08-17 19:27                 ` Peter Xu
2022-08-18  6:34                 ` Huang, Ying
2022-08-18  6:34                   ` Huang, Ying
2022-08-18 14:44                   ` Peter Xu
2022-08-18 14:44                     ` Peter Xu
2022-08-19  2:51                     ` Huang, Ying
2022-08-19  2:51                       ` Huang, Ying
2022-08-24  1:56                       ` Alistair Popple
2022-08-24  1:56                         ` Alistair Popple
2022-08-24 20:25                         ` Peter Xu
2022-08-24 20:25                           ` Peter Xu
2022-08-24 20:48                           ` Peter Xu
2022-08-24 20:48                             ` Peter Xu
2022-08-25  0:42                             ` Alistair Popple [this message]
2022-08-25  0:42                               ` Alistair Popple
2022-08-25  1:24                               ` Alistair Popple
2022-08-25  1:24                                 ` Alistair Popple
2022-08-25 15:04                                 ` Peter Xu
2022-08-25 15:04                                   ` Peter Xu
2022-08-25 22:09                                   ` Alistair Popple
2022-08-25 22:09                                     ` Alistair Popple
2022-08-25 23:36                                     ` Peter Xu
2022-08-25 23:36                                       ` Peter Xu
2022-08-25 14:40                               ` Peter Xu
2022-08-25 14:40                                 ` Peter Xu
2022-08-18  5:59               ` Huang, Ying
2022-08-18  5:59                 ` Huang, Ying
2022-08-17 19:07           ` Peter Xu
2022-08-17 19:07             ` Peter Xu
2022-08-17  1:38   ` Alistair Popple
2022-08-17  1:38     ` Alistair Popple

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87o7w9f7dp.fsf@nvdebian.thelocal \
    --to=apopple@nvidia.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.sierra@amd.com \
    --cc=bskeggs@redhat.com \
    --cc=david@redhat.com \
    --cc=huang.ying.caritas@gmail.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kherbst@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=logang@deltatee.com \
    --cc=lyude@redhat.com \
    --cc=nadav.amit@gmail.com \
    --cc=peterx@redhat.com \
    --cc=rcampbell@nvidia.com \
    --cc=stable@vger.kernel.org \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.