From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Jerome Glisse <j.glisse@gmail.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>,
akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org,
"Linus Torvalds" <torvalds@linux-foundation.org>,
joro@8bytes.org, "Mel Gorman" <mgorman@suse.de>,
"H. Peter Anvin" <hpa@zytor.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Andrea Arcangeli" <aarcange@redhat.com>,
"Johannes Weiner" <jweiner@redhat.com>,
"Larry Woodman" <lwoodman@redhat.com>,
"Rik van Riel" <riel@redhat.com>,
"Dave Airlie" <airlied@redhat.com>,
"Brendan Conoboy" <blc@redhat.com>,
"Joe Donohue" <jdonohue@redhat.com>,
"Christophe Harle" <charle@nvidia.com>,
"Duncan Poole" <dpoole@nvidia.com>,
"Sherry Cheung" <SCheung@nvidia.com>,
"Subhash Gutti" <sgutti@nvidia.com>,
"John Hubbard" <jhubbard@nvidia.com>,
"Mark Hairgrove" <mhairgrove@nvidia.com>,
"Lucien Dunning" <ldunning@nvidia.com>,
"Cameron Buschardt" <cabuschardt@nvidia.com>,
"Arvind Gopalakrishnan" <arvindg@nvidia.com>,
"Haggai Eran" <haggaie@mellanox.com>,
"Shachar Raindel" <raindel@mellanox.com>,
"Liran Liss" <liranl@mellanox.com>,
"Roland Dreier" <roland@purestorage.com>,
"Ben Sander" <ben.sander@amd.com>,
"Greg Stoner" <Greg.Stoner@amd.com>,
"John Bridgman" <John.Bridgman@amd.com>,
"Michael Mantor" <Michael.Mantor@amd.com>,
"Paul Blinzer" <Paul.Blinzer@amd.com>,
"Leonid Shamis" <Leonid.Shamis@amd.com>,
"Laurent Morichetti" <Laurent.Morichetti@amd.com>,
"Alexander Deucher" <Alexander.Deucher@amd.com>
Subject: Re: [PATCH v12 21/29] HMM: mm add helper to update page table when migrating memory back v2.
Date: Mon, 21 Mar 2016 19:18:41 +0530
Message-ID: <871t74uekm.fsf@linux.vnet.ibm.com>
In-Reply-To: <20160321120251.GA4518@gmail.com>
Jerome Glisse <j.glisse@gmail.com> writes:
> On Mon, Mar 21, 2016 at 04:57:32PM +0530, Aneesh Kumar K.V wrote:
>> Jérôme Glisse <jglisse@redhat.com> writes:
>
> [...]
>
>> > +
>> > +#ifdef CONFIG_HMM
>> > +/* mm_hmm_migrate_back() - lock HMM CPU page table entry and allocate new page.
>> > + *
>> > + * @mm: The mm struct.
>> > + * @vma: The vm area struct the range is in.
>> > + * @new_pte: Array of new CPU page table entry value.
>> > + * @start: Start address of the range (inclusive).
>> > + * @end: End address of the range (exclusive).
>> > + *
>> > + * This function will lock HMM page table entry and allocate new page for entry
>> > + * it successfully locked.
>> > + */
>>
>>
>> Can you add more comments around this ?
>
> I should describe the process a bit more I guess. It is multi-step: first we update
> the CPU page table with the special HMM "lock" entry, this is to exclude concurrent
> migration happening on the same pages. Once we have "locked" the CPU page table
> entries we allocate the proper number of pages. Then we schedule the DMA from the
> GPU to these pages and once it is done we update the CPU page table to point to
> these pages. This is why we are going over the page table so many times. This should
> answer most of your questions below but I still provide an answer for each of them.
>
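Thanks, this explanation helps a lot. Could it (or most of it) go into the
function comment so the multi-step flow is obvious to the next reader? A rough
sketch of the wording, purely based on your description above:

	/*
	 * Migrating memory back is a multi-step process:
	 *
	 * 1. Walk the CPU page table over the range and replace every
	 *    HMM swap entry with the special "locked" HMM entry, so that
	 *    no concurrent migration can operate on the same pages.
	 * 2. With the entries locked, allocate one system page per
	 *    locked entry (outside of the page table lock).
	 * 3. Schedule the DMA from the device to these pages; once it
	 *    completes, update the CPU page table to point to them.
	 *
	 * This is why the range is walked several times.
	 */
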
>>
>> > +int mm_hmm_migrate_back(struct mm_struct *mm,
>> > +			struct vm_area_struct *vma,
>> > +			pte_t *new_pte,
>> > +			unsigned long start,
>> > +			unsigned long end)
>> > +{
>> > +	pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
>> > +	unsigned long addr, i;
>> > +	int ret = 0;
>> > +
>> > +	VM_BUG_ON(vma->vm_ops || (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
>> > +
>> > +	if (unlikely(anon_vma_prepare(vma)))
>> > +		return -ENOMEM;
>> > +
>> > +	start &= PAGE_MASK;
>> > +	end = PAGE_ALIGN(end);
>> > +	memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
>> > +
>> > +	for (addr = start; addr < end;) {
>> > +		unsigned long cstart, next;
>> > +		spinlock_t *ptl;
>> > +		pgd_t *pgdp;
>> > +		pud_t *pudp;
>> > +		pmd_t *pmdp;
>> > +		pte_t *ptep;
>> > +
>> > +		pgdp = pgd_offset(mm, addr);
>> > +		pudp = pud_offset(pgdp, addr);
>> > +		/*
>> > +		 * Some other thread might already have migrated back the entry
>> > +		 * and freed the page table. Unlikely thought.
>> > +		 */
>> > +		if (unlikely(!pudp)) {
>> > +			addr = min((addr + PUD_SIZE) & PUD_MASK, end);
>> > +			continue;
>> > +		}
>> > +		pmdp = pmd_offset(pudp, addr);
>> > +		if (unlikely(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
>> > +			     pmd_trans_huge(*pmdp))) {
>> > +			addr = min((addr + PMD_SIZE) & PMD_MASK, end);
>> > +			continue;
>> > +		}
>> > +		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> > +		for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
>> > +		     next = min((addr + PMD_SIZE) & PMD_MASK, end);
>> > +		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
>> > +			swp_entry_t entry;
>> > +
>> > +			entry = pte_to_swp_entry(*ptep);
>> > +			if (pte_none(*ptep) || pte_present(*ptep) ||
>> > +			    !is_hmm_entry(entry) ||
>> > +			    is_hmm_entry_locked(entry))
>> > +				continue;
>> > +
>> > +			set_pte_at(mm, addr, ptep, hmm_entry);
>> > +			new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
>> > +							   vma->vm_page_prot));
>> > +		}
>> > +		pte_unmap_unlock(ptep - 1, ptl);
>>
>>
>> I guess this is fixing all the ptes in the cpu page table mapping a pmd
>> entry. But then what is below ?
>
> Because we are dealing with special swap entries we know we can not have huge pages.
> So we only care about the HMM special swap entries. We record the entries we want to
> migrate in the new_pte array. The loop above is under the pmd spinlock; the loop below
> does memory allocation and we do not want to hold any spinlock while allocating.
>
Can this go in as a code comment?
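Something like the below right above the two loops, perhaps (a sketch only,
wording up to you):

	/*
	 * Only special HMM swap entries can be in this range, so there
	 * are no huge pages to deal with.  The loop above runs under the
	 * pte lock and only marks the entries to migrate (recording them
	 * in new_pte[]); the loop below does the page allocations, which
	 * must not happen while holding a spinlock.
	 */
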
>>
>> > +
>> > +		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
>> > +		     addr < next; addr += PAGE_SIZE, i++) {
>>
>> Your use of the variable addr, with multiple loops updating it, is also
>> making this complex. We should definitely add more comments here. I guess
>> we are going through the same range we iterated over above?
>
> Correct, we are going over the exact same range; I am keeping addr only for
> the benefit of alloc_zeroed_user_highpage_movable().
>
Can we use a different variable name there ?
>>
>> > +			struct mem_cgroup *memcg;
>> > +			struct page *page;
>> > +
>> > +			if (!pte_present(new_pte[i]))
>> > +				continue;
>>
>> What is that checking for? We set that using pte_mkspecial() above?
>
> Not all entries in the range might match the criteria (i.e. special unlocked HMM
> swap entries). We want to allocate pages only for entries that match the criteria.
>
Since we did, at the beginning,
	memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
we should not find the present bit set there? Using pte_present() is confusing;
maybe pte_none() would be clearer. Also, could we have comments explaining the
details?
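I.e. something like this (just a sketch, assuming a zeroed pte_t reads back as
pte_none() on the architectures we care about):

			/*
			 * new_pte[] was zeroed above and only the entries
			 * selected in the first loop were filled in, so
			 * skip anything that is still empty.
			 */
			if (pte_none(new_pte[i]))
				continue;
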
>>
>> > +
>> > +			page = alloc_zeroed_user_highpage_movable(vma, addr);
>> > +			if (!page) {
>> > +				ret = -ENOMEM;
>> > +				break;
>> > +			}
>> > +			__SetPageUptodate(page);
>> > +			if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
>> > +						  &memcg)) {
>> > +				page_cache_release(page);
>> > +				ret = -ENOMEM;
>> > +				break;
>> > +			}
>> > +			/*
>> > +			 * We can safely reuse the s_mem/mapping field of page
>> > +			 * struct to store the memcg as the page is only seen
>> > +			 * by HMM at this point and we can clear it before it
>> > +			 * is public see mm_hmm_migrate_back_cleanup().
>> > +			 */
>> > +			page->s_mem = memcg;
>> > +			new_pte[i] = mk_pte(page, vma->vm_page_prot);
>> > +			if (vma->vm_flags & VM_WRITE) {
>> > +				new_pte[i] = pte_mkdirty(new_pte[i]);
>> > +				new_pte[i] = pte_mkwrite(new_pte[i]);
>> > +			}
>>
>> Why mark it dirty if vm_flags is VM_WRITE ?
>
> It is a leftover from some debugging I was doing; I missed it.
>
>>
>> > +		}
>> > +
>> > +		if (!ret)
>> > +			continue;
>> > +
>> > +		hmm_entry = swp_entry_to_pte(make_hmm_entry());
>> > +		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>
>>
>> Again we loop through the same range ?
>
> Yes, but this is the out-of-memory code path, i.e. we have to split the migration
> into several passes. So what happens here is that we clear the new_pte entries for
> which we failed to allocate a page.
>
>>
>> > +		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
>> > +		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
>> > +			unsigned long pfn = pte_pfn(new_pte[i]);
>> > +
>> > +			if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
>> > +				continue;
>>
So here we are using the fact that we set new_pte using the zero pfn in the
first loop, and hence if we find a present new_pte with the zero pfn it implies
we failed to allocate a page for that entry?
>>
>> What is that checking for ?
>
> If the new_pte entry is not present then it is not something we want to migrate. If
> it is present but does not point to the zero pfn then it is an entry for which we
> allocated a page, so we want to keep it.
>
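That is subtle enough that it deserves a comment next to the check, something
like (sketch):

			/*
			 * Error path: an entry still pointing at the zero
			 * pfn was locked in the first loop but never got a
			 * page allocated.  Non-present entries were never
			 * candidates, and entries with a real page must be
			 * kept for the migration to proceed.
			 */
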
>> > +
>> > +			set_pte_at(mm, addr, ptep, hmm_entry);
>> > +			pte_clear(mm, addr, &new_pte[i]);
>>
>> What is that pte_clear() for? The handling of new_pte needs more code comments.
>>
>
> For entries for which we failed to allocate memory we clear the special HMM swap
> entry as well as the new_pte entry, so that the migration code knows it does not
> have to do anything for them.
>
So that pte_clear() is not expected to do any sort of TLB flush etc.? The idea
is just to set new_pte[i] back to 0?

Can we do all those conditionals without using pte bits? Checks like
pte_present(), is_zero_pfn() etc. confuse the reader. Instead can we do
something like

	if (pte_state[i] == SKIP_LOOP_FIRST)
	if (pte_state[i] == SKIP_LOOP_SECOND)

I understand that we want to return the new_pte array with valid pages, so
maybe the above will make the code more complex, but at least the code should
have more comments explaining each step.
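A rough sketch of what I have in mind (the names and states are made up, and it
needs an extra per-entry state array alongside new_pte):

	enum hmm_migrate_state {
		PTE_STATE_NONE,		/* not a candidate, skip in all loops */
		PTE_STATE_LOCKED,	/* marked in the first loop, needs a page */
		PTE_STATE_ALLOCATED,	/* page allocated, keep on the error path */
	};

	/* first loop, under the pte lock */
	pte_state[i] = PTE_STATE_LOCKED;

	/* second loop, allocations */
	if (pte_state[i] != PTE_STATE_LOCKED)
		continue;
	...
	pte_state[i] = PTE_STATE_ALLOCATED;

	/* error path: only undo entries we locked but could not populate */
	if (pte_state[i] != PTE_STATE_LOCKED)
		continue;

That way the reader does not have to decode pte bits to follow the flow.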
-aneesh