From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: "Jérôme Glisse" <jglisse@redhat.com>,
akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
joro@8bytes.org, Mel Gorman <mgorman@suse.de>,
"H. Peter Anvin" <hpa@zytor.com>,
Peter Zijlstra <peterz@infradead.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Johannes Weiner <jweiner@redhat.com>,
Larry Woodman <lwoodman@redhat.com>,
Rik van Riel <riel@redhat.com>, Dave Airlie <airlied@redhat.com>,
Brendan Conoboy <blc@redhat.com>,
Joe Donohue <jdonohue@redhat.com>,
Christophe Harle <charle@nvidia.com>,
Duncan Poole <dpoole@nvidia.com>,
Sherry Cheung <SCheung@nvidia.com>,
Subhash Gutti <sgutti@nvidia.com>,
John Hubbard <jhubbard@nvidia.com>,
Mark Hairgrove <mhairgrove@nvidia.com>,
Lucien Dunning <ldunning@nvidia.com>,
Cameron Buschardt <cabuschardt@nvidia.com>,
Arvind Gopalakrishnan <arvindg@nvidia.com>,
Haggai Eran <haggaie@mellanox.com>,
Shachar Raindel <raindel@mellanox.com>,
Liran Liss <liranl@mellanox.com>,
Roland Dreier <roland@purestorage.com>,
Ben Sander <ben.sander@amd.com>,
Greg Stoner <Greg.Stoner@amd.com>,
John Bridgman <John.Bridgman@amd.com>,
Michael Mantor <Michael.Mantor@amd.com>,
Paul Blinzer <Paul.Blinzer@amd.com>,
Leonid Shamis <Leonid.Shamis@amd.com>,
Laurent Morichetti <Laurent.Morichetti@amd.com>,
Alexander Deucher <Alexander.Deucher@amd.com>
Subject: Re: [PATCH v12 22/29] HMM: mm add helper to update page table when migrating memory v3.
Date: Mon, 21 Mar 2016 19:54:39 +0530 [thread overview]
Message-ID: <87y49bucwo.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <1457469802-11850-23-git-send-email-jglisse@redhat.com>
Jérôme Glisse <jglisse@redhat.com> writes:
> +
> + /* Try to fail early on. */
> + if (unlikely(anon_vma_prepare(vma)))
> + return -ENOMEM;
> +
What is this anon_vma_prepare() for? The comment says we try to fail early, but does not explain what we expect to fail here.
> +retry:
> + lru_add_drain();
> + tlb_gather_mmu(&tlb, mm, range.start, range.end);
> + update_hiwater_rss(mm);
> + mmu_notifier_invalidate_range_start_excluding(mm, &range,
> + mmu_notifier_exclude);
> + tlb_start_vma(&tlb, vma);
> + for (addr = range.start, i = 0; addr < end && !ret;) {
> + unsigned long cstart, next, npages = 0;
> + spinlock_t *ptl;
> + pgd_t *pgdp;
> + pud_t *pudp;
> + pmd_t *pmdp;
> + pte_t *ptep;
> +
> + /*
> + * Pretty much the exact same logic as __handle_mm_fault(),
> + * exception being the handling of huge pmd.
> + */
> + pgdp = pgd_offset(mm, addr);
> + pudp = pud_alloc(mm, pgdp, addr);
> + if (!pudp) {
> + ret = -ENOMEM;
> + break;
> + }
> + pmdp = pmd_alloc(mm, pudp, addr);
> + if (!pmdp) {
> + ret = -ENOMEM;
> + break;
> + }
> + if (unlikely(pte_alloc(mm, pmdp, addr))) {
> + ret = -ENOMEM;
> + break;
> + }
> +
> + /*
> + * If a huge pmd materialized under us just retry later. Use
> + * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
> + * didn't become pmd_trans_huge under us and then back to pmd_none, as
> + * a result of MADV_DONTNEED running immediately after a huge pmd fault
> + * in a different thread of this mm, in turn leading to a misleading
> + * pmd_trans_huge() retval. All we have to ensure is that it is a
> + * regular pmd that we can walk with pte_offset_map() and we can do that
> + * through an atomic read in C, which is what pmd_trans_unstable()
> + * provides.
> + */
> + if (unlikely(pmd_trans_unstable(pmdp) || pmd_devmap(*pmdp))) {
> + ret = -EAGAIN;
> + break;
> + }
> +
> + /*
> + * If an huge pmd materialized from under us split it and break
> + * out of the loop to retry.
> + */
> + if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp)) {
> + split_huge_pmd(vma, addr, pmdp);
> + ret = -EAGAIN;
> + break;
> + }
> +
> + /*
> + * A regular pmd is established and it can't morph into a huge pmd
> + * from under us anymore at this point because we hold the mmap_sem
> + * read mode and khugepaged takes it in write mode. So now it's
> + * safe to run pte_offset_map().
> + */
> + ptep = pte_offset_map(pmdp, addr);
> +
> + /*
> + * A regular pmd is established and it can't morph into a huge
> + * pmd from under us anymore at this point because we hold the
> + * mmap_sem read mode and khugepaged takes it in write mode. So
> + * now it's safe to run pte_offset_map().
> + */
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
Why pte_offset_map() immediately followed by pte_offset_map_lock()? The first ptep is simply overwritten, and the comment block above it is duplicated verbatim.
> + for (i = (addr - start) >> PAGE_SHIFT, cstart = addr,
> + next = min((addr + PMD_SIZE) & PMD_MASK, end);
> + addr < next; addr += PAGE_SIZE, ptep++, i++) {
> + save_pte[i] = ptep_get_and_clear(mm, addr, ptep);
> + tlb_remove_tlb_entry(&tlb, ptep, addr);
> + set_pte_at(mm, addr, ptep, hmm_entry);
> +
> + if (pte_present(save_pte[i]))
> + continue;
> +
> + if (!pte_none(save_pte[i])) {
> + set_pte_at(mm, addr, ptep, save_pte[i]);
> + ret = -ENOENT;
> + ptep++;
> + break;
> + }
What is special about pte_none() here, and why break out of the loop? I guess we are checking for a swap pte, so why not use is_swap_pte()? Is it because we already checked pte_present() above?
> + /*
> + * TODO: This mm_forbids_zeropage() really does not
> + * apply to us. First it seems only S390 have it set,
> + * second we are not even using the zero page entry
> + * to populate the CPU page table, thought on error
> + * we might use the save_pte entry to set the CPU
> + * page table entry.
> + *
> + * Live with that oddity for now.
> + */
> + if (mm_forbids_zeropage(mm)) {
> + pte_clear(mm, addr, &save_pte[i]);
> + npages++;
> + continue;
> + }
> + save_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
> + vma->vm_page_prot));
> + }
> + pte_unmap_unlock(ptep - 1, ptl);
> +
> + /*
> + * So we must allocate pages before checking for error, which
> + * here indicate that one entry is a swap entry. We need to
> + * allocate first because otherwise there is no easy way to
> + * know on retry or in error code path wether the CPU page
> + * table locked HMM entry is ours or from some other thread.
> + */
> +
> + if (!npages)
> + continue;
> +
> + for (next = addr, addr = cstart,
> + i = (addr - start) >> PAGE_SHIFT;
> + addr < next; addr += PAGE_SIZE, i++) {
> + struct mem_cgroup *memcg;
> + struct page *page;
> +
> + if (pte_present(save_pte[i]) || !pte_none(save_pte[i]))
> + continue;
> +
> + page = alloc_zeroed_user_highpage_movable(vma, addr);
> + if (!page) {
> + ret = -ENOMEM;
> + break;
> + }
> + __SetPageUptodate(page);
> + if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
> + &memcg, false)) {
> + page_cache_release(page);
> + ret = -ENOMEM;
> + break;
> + }
> + save_pte[i] = mk_pte(page, vma->vm_page_prot);
> + if (vma->vm_flags & VM_WRITE)
> + save_pte[i] = pte_mkwrite(save_pte[i]);
I guess this pte_mkwrite() also needs to go?
> + inc_mm_counter_fast(mm, MM_ANONPAGES);
> + /*
> + * Because we set the page table entry to the special
> + * HMM locked entry we know no other process might do
> + * anything with it and thus we can safely account the
> + * page without holding any lock at this point.
> + */
> + page_add_new_anon_rmap(page, vma, addr, false);
> + mem_cgroup_commit_charge(page, memcg, false, false);
> + /*
> + * Add to active list so we know vmscan will not waste
> + * its time with that page while we are still using it.
> + */
> + lru_cache_add_active_or_unevictable(page, vma);
> + }
> + }
> + tlb_end_vma(&tlb, vma);
> + mmu_notifier_invalidate_range_end_excluding(mm, &range,
> + mmu_notifier_exclude);
> + tlb_finish_mmu(&tlb, range.start, range.end);
> +
> + if (backoff && *backoff) {
> + /* Stick to the range we updated. */
> + ret = -EAGAIN;
> + end = addr;
> + goto out;
> + }
> +
> + /* Check if something is missing or something went wrong. */
> + if (ret == -ENOENT) {
> + int flags = FAULT_FLAG_ALLOW_RETRY;
> +
> + do {
> + /*
> + * Using __handle_mm_fault() as current->mm != mm ie we
> + * might have been call from a kernel thread on behalf
> + * of a driver and all accounting handle_mm_fault() is
> + * pointless in our case.
> + */
> + ret = __handle_mm_fault(mm, vma, addr, flags);
> + flags |= FAULT_FLAG_TRIED;
> + } while ((ret & VM_FAULT_RETRY));
> + if ((ret & VM_FAULT_ERROR)) {
> + /* Stick to the range we updated. */
> + end = addr;
> + ret = -EFAULT;
> + goto out;
> + }
> + range.start = addr;
> + goto retry;
> + }
> + if (ret == -EAGAIN) {
> + range.start = addr;
> + goto retry;
> + }
> + if (ret)
> + /* Stick to the range we updated. */
> + end = addr;
> +
> + /*
> + * At this point no one else can take a reference on the page from this
> + * process CPU page table. So we can safely check wether we can migrate
> + * or not the page.
> + */
> +
> +out:
> + for (addr = start, i = 0; addr < end;) {
> + unsigned long next;
> + spinlock_t *ptl;
> + pgd_t *pgdp;
> + pud_t *pudp;
> + pmd_t *pmdp;
> + pte_t *ptep;
> +
> + /*
> + * We know for certain that we did set special swap entry for
> + * the range and HMM entry are mark as locked so it means that
> + * no one beside us can modify them which apply that all level
> + * of the CPU page table are valid.
> + */
> + pgdp = pgd_offset(mm, addr);
> + pudp = pud_offset(pgdp, addr);
> + VM_BUG_ON(!pudp);
> + pmdp = pmd_offset(pudp, addr);
> + VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
> + pmd_trans_huge(*pmdp));
> +
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> + for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
> + i = (addr - start) >> PAGE_SHIFT; addr < next;
> + addr += PAGE_SIZE, ptep++, i++) {
> + struct page *page;
> + swp_entry_t entry;
> + int swapped;
> +
> + entry = pte_to_swp_entry(save_pte[i]);
> + if (is_hmm_entry(entry)) {
> + /*
> + * Logic here is pretty involve. If save_pte is
> + * an HMM special swap entry then it means that
> + * we failed to swap in that page so error must
> + * be set.
> + *
> + * If that's not the case than it means we are
> + * seriously screw.
> + */
> + VM_BUG_ON(!ret);
> + continue;
> + }
> +
> + /*
> + * This can not happen, no one else can replace our
> + * special entry and as range end is re-ajusted on
> + * error.
> + */
> + entry = pte_to_swp_entry(*ptep);
> + VM_BUG_ON(!is_hmm_entry_locked(entry));
> +
> + /* On error or backoff restore all the saved pte. */
> + if (ret)
> + goto restore;
> +
> + page = vm_normal_page(vma, addr, save_pte[i]);
> + /* The zero page is fine to migrate. */
> + if (!page)
> + continue;
> +
> + /*
> + * Check that only CPU mapping hold a reference on the
> + * page. To make thing simpler we just refuse bail out
> + * if page_mapcount() != page_count() (also accounting
> + * for swap cache).
> + *
> + * There is a small window here where wp_page_copy()
> + * might have decremented mapcount but have not yet
> + * decremented the page count. This is not an issue as
> + * we backoff in that case.
> + */
> + swapped = PageSwapCache(page);
> + if (page_mapcount(page) + swapped == page_count(page))
> + continue;
> +
> +restore:
> + /* Ok we have to restore that page. */
> + set_pte_at(mm, addr, ptep, save_pte[i]);
> + /*
> + * No need to invalidate - it was non-present
> + * before.
> + */
> + update_mmu_cache(vma, addr, ptep);
> + pte_clear(mm, addr, &save_pte[i]);
> + }
> + pte_unmap_unlock(ptep - 1, ptl);
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL(mm_hmm_migrate);
-aneesh