From: Jerome Glisse <j.glisse@gmail.com>
To: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"joro@8bytes.org" <joro@8bytes.org>,
	"Mel Gorman" <mgorman@suse.de>, "H. Peter Anvin" <hpa@zytor.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Johannes Weiner" <jweiner@redhat.com>,
	"Larry Woodman" <lwoodman@redhat.com>,
	"Rik van Riel" <riel@redhat.com>,
	"Dave Airlie" <airlied@redhat.com>,
	"Brendan Conoboy" <blc@redhat.com>,
	"Joe Donohue" <jdonohue@redhat.com>,
	"Duncan Poole" <dpoole@nvidia.com>,
	"Sherry Cheung" <SCheung@nvidia.com>,
	"Subhash Gutti" <sgutti@nvidia.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Lucien Dunning" <ldunning@nvidia.com>,
	"Cameron Buschardt" <cabuschardt@nvidia.com>,
	"Arvind Gopalakrishnan" <arvindg@nvidia.com>,
	"Haggai Eran" <haggaie@mellanox.com>,
	"Shachar Raindel" <raindel@mellanox.com>,
	"Liran Liss" <liranl@mellanox.com>,
	"Roland Dreier" <roland@purestorage.com>,
	"Ben Sander" <ben.sander@amd.com>,
	"Greg Stoner" <Greg.Stoner@amd.com>,
	"John Bridgman" <John.Bridgman@amd.com>,
	"Michael Mantor" <Michael.Mantor@amd.com>,
	"Paul Blinzer" <Paul.Blinzer@amd.com>,
	"Laurent Morichetti" <Laurent.Morichetti@amd.com>,
	"Alexander Deucher" <Alexander.Deucher@amd.com>,
	"Oded Gabbay" <Oded.Gabbay@amd.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Jatin Kumar" <jakumar@nvidia.com>
Subject: Re: [PATCH 06/36] HMM: add HMM page table v2.
Date: Fri, 19 Jun 2015 14:07:14 -0400	[thread overview]
Message-ID: <20150619180713.GA17308@gmail.com> (raw)
In-Reply-To: <alpine.DEB.2.00.1506171724110.32592@mdh-linux64-2.nvidia.com>

On Thu, Jun 18, 2015 at 07:06:08PM -0700, Mark Hairgrove wrote:
> On Thu, 21 May 2015, j.glisse@gmail.com wrote:

[...]
> > +
> > +static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
> > +{
> > +	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
> > +}
> > +
> > +static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
> > +{
> > +	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
> > +}
> > +
> 
> Does hmm_pde_pfn return a dma_addr_t pfn or a system memory pfn?
> 
> The types between these two functions don't match. According to 
> hmm_pde_from_pfn, both the pde and the pfn are supposed to be dma_addr_t. 
> But hmm_pde_pfn returns an unsigned long as a pfn instead of a dma_addr_t. 
> If hmm_pde_pfn sometimes used to get a dma_addr_t pfn then shouldn't it 
> also return a dma_addr_t, since as you pointed out in the commit message, 
> dma_addr_t might be bigger than an unsigned long?
> 

Yes, internally it uses dma_addr_t, but for device drivers that want to use
the physical system page address (aka pfn) I want them to use the specialized
helpers hmm_pte_from_pfn() and hmm_pte_pfn(), so that the type casting happens
inside HMM. It also makes device drivers easier to review, as each driver
stays consistent: it either uses pfn or dma_addr_t, but never mixes the two.

A later patch adds the hmm_pte_from_dma() and hmm_pte_dma_addr() helpers for
the DMA case, so this patch only introduces the pfn version.
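
To make the intended split concrete, here is a minimal user-space sketch of
the pfn round trip. It assumes PAGE_SHIFT == 12 and puts the valid bit at
bit 0 purely for illustration; the real bit layout and valid-bit test are the
ones defined in the patch:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT            12
    #define HMM_PTE_VALID_PFN_BIT 0    /* assumed position, sketch only */

    typedef uint64_t dma_addr_t;       /* may be wider than unsigned long */

    static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
    {
            return (pfn << PAGE_SHIFT) | (1ULL << HMM_PTE_VALID_PFN_BIT);
    }

    static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
    {
            /* simplified valid test standing in for hmm_pte_test_valid_pfn() */
            return (pte & (1ULL << HMM_PTE_VALID_PFN_BIT)) ?
                    (unsigned long)(pte >> PAGE_SHIFT) : 0;
    }

    int main(void)
    {
            dma_addr_t pte = hmm_pte_from_pfn(0x12345);

            /* The driver only ever handles pfns; dma_addr_t stays inside HMM. */
            printf("pfn = 0x%lx\n", hmm_pte_pfn(pte));
            return 0;
    }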

> > [...]
> > +
> > +static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
> > +{
> > +	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
> > +}
> > +
> > +static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
> > +{
> > +	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
> > +}
> 
> Same question as hmm_pde_pfn above.

See above.

> 
> > [...]
> > +/* struct hmm_pt_iter - page table iterator states.
> > + *
> > + * @ptd: Array of directory struct page pointer for each levels.
> > + * @ptdp: Array of pointer to mapped directory levels.
> > + * @dead_directories: List of directories that died while walking page table.
> > + * @cur: Current address.
> > + */
> > +struct hmm_pt_iter {
> > +	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
> > +	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
> 
> These are sized to be HMM_PT_MAX_LEVEL - 1 rather than HMM_PT_MAX_LEVEL 
> because the iterator doesn't store the top level, correct? This results in 
> a lot of "level - 1" and "level - 2" logic when dealing with the iterator. 
> Have you considered keeping the levels consistent to get rid of all the 
> extra offset-by-1 logic?

All of this should be optimized away by the compiler, though I have not
checked the assembly.

[...]
> > + *
> > + * @iter: Iterator states that currently protect the directory.
> > + * @level: Level of the directory to reference.
> > + *
> > + * This function will reference a directory but it is illegal for refcount to
> > + * be 0 as this helper should only be call when iterator is protecting the
> > + * directory (ie iterator hold a reference for the directory).
> > + *
> > + * HMM user will call this with level = pt.llevel any other value is supicious
> > + * outside of hmm_pt code.
> > + */
> > +static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter,
> > +					     char level)
> > +{
> > +	/* Nothing to do for root level. */
> > +	if (!level)
> > +		return;
> > +
> > +	if (!atomic_inc_not_zero(&iter->ptd[level - 1]->_mapcount))
> > +		/* Illegal this should not happen. */
> > +		BUG();
> > +}
> > +
> > [...]
> > +
> > +int hmm_pt_init(struct hmm_pt *pt)
> > +{
> > +	unsigned directory_shift, i = 0, npgd;
> > +
> > +	pt->last &= PAGE_MASK;
> > +	spin_lock_init(&pt->lock);
> > +	/* Directory shift is the number of bits that a single directory level
> > +	 * represent. For instance if PAGE_SIZE is 4096 and each entry takes 8
> > +	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
> > +	 */
> > +	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
> > +	/* Level 0 is the root level of the page table. It might use less
> > +	 * bits than directory_shift but all sub-directory level will use all
> > +	 * directory_shift bits.
> > +	 *
> > +	 * For instance if hmm_pt.last == (1 << 48), PAGE_SHIFT == 12 and
> 
> This example should say that hmm_pt.last == (1 << 48) - 1, since last is 
> inclusive. Otherwise llevel will be 4.

Yes, correct. I switched from exclusive to inclusive and forgot to update
the comment.
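
If it helps to double-check the arithmetic, here is a stand-alone sketch of
the shift[]/llevel computation quoted below, for the inclusive
last == (1ULL << 48) - 1 case with PAGE_SHIFT == 12 and
sizeof(dma_addr_t) == 8 (__fls64() stands in for the kernel's __fls(), and
the array bound is arbitrary for the sketch):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12

    static unsigned __fls64(uint64_t x)
    {
            return 63 - __builtin_clzll(x);    /* index of highest set bit */
    }

    int main(void)
    {
            uint64_t last = (1ULL << 48) - 1;          /* inclusive last address */
            unsigned directory_shift = PAGE_SHIFT - 3; /* ilog2(sizeof(dma_addr_t)) */
            unsigned shift[8], i = 0, llevel;

            shift[0] = ((__fls64(last >> PAGE_SHIFT) / directory_shift) *
                        directory_shift) + PAGE_SHIFT;
            while (shift[i++] > PAGE_SHIFT)
                    shift[i] = shift[i - 1] - directory_shift;
            llevel = i - 1;

            for (i = 0; i <= llevel; ++i)
                    printf("shift[%u] = %u\n", i, shift[i]);  /* 39 30 21 12 */
            printf("llevel = %u\n", llevel);                  /* 3 */
            return 0;
    }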

> 
> > +	 * sizeof(dma_addr_t) == 8 then :
> > +	 *   directory_shift = 9
> > +	 *   shift[0] = 39
> > +	 *   shift[1] = 30
> > +	 *   shift[2] = 21
> > +	 *   shift[3] = 12
> > +	 *   llevel = 3
> > +	 *
> > +	 * Note that shift[llevel] == PAGE_SHIFT because the last level
> > +	 * correspond to the page table entry level (ignoring the case of huge
> > +	 * page).
> > +	 */
> > +	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
> > +			directory_shift) + PAGE_SHIFT;
> > +	while (pt->shift[i++] > PAGE_SHIFT)
> > +		pt->shift[i] = pt->shift[i - 1] - directory_shift;
> > +	pt->llevel = i - 1;
> > +	pt->directory_mask = (1 << directory_shift) - 1;
> > +
> > +	for (i = 0; i <= pt->llevel; ++i)
> > +		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
> > +
> > +	npgd = (pt->last >> pt->shift[0]) + 1;
> > +	pt->pgd = kzalloc(npgd * sizeof(dma_addr_t), GFP_KERNEL);
> > +	if (!pt->pgd)
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(hmm_pt_init);
> 
> Does this need to be EXPORT_SYMBOL? It seems like a driver would never 
> need to call this, only core hmm. Same question for hmm_pt_fini.
> 

Well, I wanted to use it in the hmm_dummy driver, but I have mixed feelings:
using it in the dummy driver lets it be exercised in more use cases, but at
the same time a bug might stay hidden because the same code is shared with
the dummy driver. I will probably just use it in the dummy driver.

[...]
> > + *
> > + * @iter: Iterator states.
> > + *
> > + * This function will initialize iterator states. It must always be pair with a
> > + * call to hmm_pt_iter_fini().
> > + */
> > +void hmm_pt_iter_init(struct hmm_pt_iter *iter)
> > +{
> > +	memset(iter->ptd, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
> > +	memset(iter->ptdp, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
> 
> The memset sizes can simply be sizeof(iter->ptd) and sizeof(iter->ptdp).

Yes, I should have.
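
I.e. a minimal sketch of what hmm_pt_iter_init() would look like with that
change (the sizes then stay correct if HMM_PT_MAX_LEVEL ever changes):

    void hmm_pt_iter_init(struct hmm_pt_iter *iter)
    {
            memset(iter->ptd, 0, sizeof(iter->ptd));
            memset(iter->ptdp, 0, sizeof(iter->ptdp));
            INIT_LIST_HEAD(&iter->dead_directories);
    }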

> > +	INIT_LIST_HEAD(&iter->dead_directories);
> > +}
> > +EXPORT_SYMBOL(hmm_pt_iter_init);
> > +
> > +/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
> > + *
> > + * @iter: Iterator states.
> > + * @pt: HMM page table.
> > + * @level: Level of the directory to unref.
> > + *
> > + * This function will unreference a directory and add it to dead list if
> > + * directory no longer have any reference. It will also clear the entry to
> > + * that directory into the upper level directory as well as dropping ref
> > + * on the upper directory.
> > + */
> > +static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
> > +					     struct hmm_pt *pt,
> > +					     unsigned level)
> > +{
> > +	struct page *upper_ptd;
> > +	dma_addr_t *upper_ptdp;
> > +
> > +	/* Nothing to do for root level. */
> > +	if (!level)
> > +		return;
> > +
> > +	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
> > +		return;
> > +
> > +	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
> > +	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : pt->pgd;
> > +	upper_ptdp = &upper_ptdp[hmm_pt_index(pt, iter->cur, level - 1)];
> > +	hmm_pt_directory_lock(pt, upper_ptd, level - 1);
> > +	/*
> > +	 * There might be race btw decrementing reference count on a directory
> > +	 * and another thread trying to fault in a new directory. To avoid
> > +	 * erasing the new directory entry we need to check that the entry
> > +	 * still correspond to the directory we are removing.
> > +	 */
> > +	if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
> > +		*upper_ptdp = 0;
> > +	hmm_pt_directory_unlock(pt, upper_ptd, level - 1);
> > +
> > +	/* Add it to delayed free list. */
> > +	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
> > +
> > +	/*
> > +	 * The upper directory is not safe to unref as we have an extra ref and
> 
> This should be "IS safe to unref", correct?

Yes, s/not/now -> "is NOW safe to unref". /me blames my fingers.

[...]
> > +	/*
> > +	 * Some iterator may have dereferenced a dead directory entry and looked
> > +	 * up the struct page but haven't check yet the reference count. As all
> > +	 * the above happen in rcu read critical section we know that we need
> > +	 * to wait for grace period before being able to free any of the dead
> > +	 * directory page.
> > +	 */
> > +	synchronize_rcu();
> > +	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
> > +		list_del(&ptd->lru);
> > +		atomic_set(&ptd->_mapcount, -1);
> > +		__free_page(ptd);
> > +	}
> > +}
> 
> If I'm following this correctly, a migrate to the device will allocate HMM 
> page tables and the subsequent migrate from the device will free them. 
> Assuming that's the case, might thrashing of page allocations be a 
> problem? What about keeping the HMM page tables around until the actual 
> munmap() of the corresponding VA range?

The HMM page table is allocated any time a device mirrors a range, i.e.
migration to the device is not a special case. When migrating to and from the
device, the HMM page table is allocated prior to the migration and outlives
the migration back.

That said, the rationale here is that I want to free HMM resources as early
as possible, mostly to support using the GPU on a dataset one time only (i.e.
the dataset is used once and only once by the GPU). I think that will be a
common and important use case, and making sure we free resources early does
not prevent other use cases, where datasets are used for a longer time, from
working properly and efficiently.

A later patch adds a helper so that device drivers can discard a range, i.e.
tell HMM that they are no longer using a range of addresses, allowing HMM to
free the associated resources.

However, you are correct that currently some MM events will lead to the HMM
page table being freed and then reallocated again right afterwards by the
device, which is obviously bad. But because I did not want to make this patch
or this series any more complex than it already is, I did not include any
mechanism to delay HMM page table directory reclaim. Such a delayed reclaim
mechanism is on my roadmap, and I think I shared that roadmap with you. It is
something we can optimize later on. The important part here is that the device
driver knows that the HMM page table needs to be accessed carefully, so that
aggressive pruning of the HMM page table does not disrupt the device driver.

Cheers,
Jerome
