linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jerome Glisse <jglisse@redhat.com>
To: Balbir Singh <bsingharora@gmail.com>
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, John Hubbard <jhubbard@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>
Subject: Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
Date: Tue, 22 Nov 2016 08:59:33 -0500	[thread overview]
Message-ID: <20161122135933.GA5684@redhat.com> (raw)
In-Reply-To: <0c55d239-b2f6-c515-6268-94a97fde8c81@gmail.com>

On Tue, Nov 22, 2016 at 01:19:42PM +1100, Balbir Singh wrote:
> 
> 
> On 21/11/16 16:05, Jerome Glisse wrote:
> > On Mon, Nov 21, 2016 at 01:06:45PM +1100, Balbir Singh wrote:
> >>
> >>
> >> On 19/11/16 05:18, Jerome Glisse wrote:
> >>> To allow use of device un-addressable memory inside a process add a
> >>> special swap type. Also add a new callback to handle page fault on
> >>> such entry.
> >>>
> >>> Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> >>> Cc: Dan Williams <dan.j.williams@intel.com>
> >>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >>> ---
> >>>  fs/proc/task_mmu.c       | 10 +++++++-
> >>>  include/linux/memremap.h |  5 ++++
> >>>  include/linux/swap.h     | 18 ++++++++++---
> >>>  include/linux/swapops.h  | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  kernel/memremap.c        | 14 ++++++++++
> >>>  mm/Kconfig               | 12 +++++++++
> >>>  mm/memory.c              | 24 +++++++++++++++++
> >>>  mm/mprotect.c            | 12 +++++++++
> >>>  8 files changed, 158 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >>> index 6909582..0726d39 100644
> >>> --- a/fs/proc/task_mmu.c
> >>> +++ b/fs/proc/task_mmu.c
> >>> @@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> >>>  			} else {
> >>>  				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
> >>>  			}
> >>> -		} else if (is_migration_entry(swpent))
> >>> +		} else if (is_migration_entry(swpent)) {
> >>>  			page = migration_entry_to_page(swpent);
> >>> +		} else if (is_device_entry(swpent)) {
> >>> +			page = device_entry_to_page(swpent);
> >>> +		}
> >>
> >>
> >> So the reason there is a device swap entry for a page belonging to a user process is
> >> that it is in the middle of migration or is it always that a swap entry represents
> >> unaddressable memory belonging to a GPU device, but its tracked in the page table
> >> entries of the process.
> > 
> > For page being migrated i use the existing special migration pte entry. This new device
> > special swap entry is only for unaddressable memory belonging to a device (GPU or any
> > else). We need to keep track of those inside the CPU page table. Using a new special
> > swap entry is the easiest way with the minimum amount of change to core mm.
> > 
> 
> Thanks, makes sense
> 
> > [...]
> > 
> >>> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
> >>> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
> >>> +{
> >>> +	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
> >>
> >> Code style checks
> > 
> > I was trying to balance against 79 columns break rule :)
> > 
> > [...]
> > 
> >>> +		} else if (is_device_entry(entry)) {
> >>> +			page = device_entry_to_page(entry);
> >>> +
> >>> +			get_page(page);
> >>> +			rss[mm_counter(page)]++;
> >>
> >> Why does rss count go up?
> > 
> > I wanted the device page to be treated like any other page. There is an argument
> > to be made against and for doing that. Do you have strong argument for not doing
> > this ?
> > 
> 
> Yes, It will end up confusing rss accounting IMHO. If a task is using a lot of
> pages on the GPU, should be it a candidate for OOM based on it's RSS for example?
> 
> > [...]
> > 
> >>> @@ -2536,6 +2557,9 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte)
> >>>  	if (unlikely(non_swap_entry(entry))) {
> >>>  		if (is_migration_entry(entry)) {
> >>>  			migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
> >>> +		} else if (is_device_entry(entry)) {
> >>> +			ret = device_entry_fault(vma, fe->address, entry,
> >>> +						 fe->flags, fe->pmd);
> >>
> >> What does device_entry_fault() actually do here?
> > 
> > Well it is a special fault handler, it must migrate the memory back to some place
> > where the CPU can access it. It only matter for unaddressable memory.
> 
> So effectively swap the page back in, chances are it can ping pong ...but I was wondering if we can
> tell the GPU that the CPU is accessing these pages as well. I presume any operation that causes
> memory access - core dump will swap back in things from the HMM side onto the CPU side.

Well it is up to device driver to gather statistic on what can/should be inside device memory.
My expectation is that they will detect ping pong and stop asking to migrate a given address/
range to device memory.

> 
> > 
> >>>  		} else if (is_hwpoison_entry(entry)) {
> >>>  			ret = VM_FAULT_HWPOISON;
> >>>  		} else {
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index 1bc1eb3..70aff3a 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -139,6 +139,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >>>  
> >>>  				pages++;
> >>>  			}
> >>> +
> >>> +			if (is_write_device_entry(entry)) {
> >>> +				pte_t newpte;
> >>> +
> >>> +				make_device_entry_read(&entry);
> >>> +				newpte = swp_entry_to_pte(entry);
> >>> +				if (pte_swp_soft_dirty(oldpte))
> >>> +					newpte = pte_swp_mksoft_dirty(newpte);
> >>> +				set_pte_at(mm, addr, pte, newpte);
> >>> +
> >>> +				pages++;
> >>> +			}
> >>
> >> Does it make sense to call mprotect() on device memory ranges?
> > 
> > There is nothing special about vma that containt device memory. They can be
> > private anonymous, share, file back ... So any existing memory syscall must
> > behave as expected. This is really just like any other page except that CPU
> > can not access it.
> 
> I understand that, but what would marking it as R/O when the GPU is in the middle
> of write mean? I would also worry about passing "executable" pages over to the
> other side.
> 

Any memory protection change will trigger an mmu_notifier calls which in turn will
update the device page table accordingly. So R/O status will also happen on the GPU.

We assume here that the device driver is not doing evil thing and that device driver
obey memory protection for all range it mirrors. Upstream driver are easy to check.
Close driver might be more problematic, in NVidia case this part if open source and
is easily checkable.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2016-11-22 13:59 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags Jérôme Glisse
2016-11-21  0:44   ` Balbir Singh
2016-11-21  4:53     ` Jerome Glisse
2016-11-21  6:57       ` Anshuman Khandual
2016-11-21 12:19         ` Jerome Glisse
2016-11-21  6:41   ` Anshuman Khandual
2016-11-21 12:27     ` Jerome Glisse
2016-11-22  5:35       ` Anshuman Khandual
2016-11-22 14:08         ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 02/18] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory Jérôme Glisse
2016-11-21  8:06   ` Anshuman Khandual
2016-11-21 12:33     ` Jerome Glisse
2016-11-22  5:15       ` Anshuman Khandual
2016-11-18 18:18 ` [HMM v13 03/18] mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
2016-11-21  8:18   ` Anshuman Khandual
2016-11-21 12:50     ` Jerome Glisse
2016-11-22  4:30       ` Anshuman Khandual
2016-11-18 18:18 ` [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
2016-11-21  1:49   ` Balbir Singh
2016-11-21  4:57     ` Jerome Glisse
2016-11-21  8:26   ` Anshuman Khandual
2016-11-21 12:34     ` Jerome Glisse
2016-11-22  5:02       ` Anshuman Khandual
2016-11-18 18:18 ` [HMM v13 05/18] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
2016-11-21 10:37   ` Anshuman Khandual
2016-11-21 12:39     ` Jerome Glisse
2016-11-22  4:54       ` Anshuman Khandual
2016-11-18 18:18 ` [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable Jérôme Glisse
2016-11-21  2:06   ` Balbir Singh
2016-11-21  5:05     ` Jerome Glisse
2016-11-22  2:19       ` Balbir Singh
2016-11-22 13:59         ` Jerome Glisse [this message]
2016-11-21 11:10     ` Anshuman Khandual
2016-11-21 10:58   ` Anshuman Khandual
2016-11-21 12:42     ` Jerome Glisse
2016-11-22  4:48       ` Anshuman Khandual
2016-11-24 13:56         ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 07/18] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
2016-11-21  2:08   ` Balbir Singh
2016-11-21  5:08     ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
2016-11-21  2:29   ` Balbir Singh
2016-11-21  5:14     ` Jerome Glisse
2016-11-23  4:03   ` Anshuman Khandual
2016-11-27 13:10     ` Jerome Glisse
2016-11-28  2:58       ` Anshuman Khandual
2016-11-28  9:41         ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 09/18] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
2016-11-21  2:42   ` Balbir Singh
2016-11-21  5:18     ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 10/18] mm/hmm/mirror: add range lock helper, prevent CPU page table update for the range Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 11/18] mm/hmm/mirror: add range monitor helper, to monitor CPU page table update Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 12/18] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 13/18] mm/hmm/mirror: device page fault handler Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 14/18] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 15/18] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory Jérôme Glisse
2016-11-18 19:57   ` Aneesh Kumar K.V
2016-11-18 20:15     ` Jerome Glisse
2016-11-19 14:32   ` Aneesh Kumar K.V
2016-11-19 17:17     ` Jerome Glisse
2016-11-20 18:21       ` Aneesh Kumar K.V
2016-11-20 20:06         ` Jerome Glisse
2016-11-21  3:30   ` Balbir Singh
2016-11-21  5:31     ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 17/18] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 18/18] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
2016-11-19  0:41 ` [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 John Hubbard
2016-11-19 14:50   ` Aneesh Kumar K.V
2016-11-23  9:16 ` Haggai Eran
2016-11-25 16:16   ` Jerome Glisse
2016-11-27 13:27     ` Haggai Eran

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161122135933.GA5684@redhat.com \
    --to=jglisse@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=bsingharora@gmail.com \
    --cc=dan.j.williams@intel.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ross.zwisler@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).