From: Jerome Glisse <jglisse@redhat.com>
To: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
mhocko@suse.com, vbabka@suse.cz, mgorman@suse.de,
minchan@kernel.org, aneesh.kumar@linux.vnet.ibm.com,
bsingharora@gmail.com, srikar@linux.vnet.ibm.com,
haren@linux.vnet.ibm.com, dave.hansen@intel.com,
dan.j.williams@intel.com
Subject: Re: [RFC V2 08/12] mm: Add new VMA flag VM_CDM
Date: Tue, 31 Jan 2017 01:05:12 -0500
Message-ID: <20170131060509.GA2017@redhat.com>
In-Reply-To: <28bd4abc-3cbd-514e-1535-15ce67131772@linux.vnet.ibm.com>
On Tue, Jan 31, 2017 at 09:52:20AM +0530, Anshuman Khandual wrote:
> On 01/31/2017 12:22 AM, Jerome Glisse wrote:
> > On Mon, Jan 30, 2017 at 09:05:49AM +0530, Anshuman Khandual wrote:
> >> VMAs which contain CDM memory pages should be marked with the new VM_CDM
> >> flag. These VMAs need to be identified in various core kernel paths for
> >> special handling, and this flag will help in their identification.
> >>
> >> Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> >
> >
> > Why do this on a per-vma basis? Why not special-case all those paths on a
> > per-page basis?
>
> The primary motivation is the cost. Won't it be too expensive to account
> for and act on individual pages rather than on the VMA as a whole? For
> example, page_to_nid() seemed pretty expensive when we tried to tag the VMA
> on an individual page fault basis.
No, I don't think it would be too expensive. What is confusing in this patchset
is that you are conflating 3 different problems. The first is how to create
struct pages for coherent device memory and exclude those pages from regular
allocations.
The second is how to allow userspace to set an allocation policy that directs
allocations for a given vma to a specific device memory.
Finally, the last is how to block kernel features such as NUMA balancing or
KSM, as you expect (and I share that belief) that they will be hurtful.
I do believe that this last requirement is better handled on a per-page basis,
as page_to_nid() is only a memory lookup, and I would be stunned if that memory
lookup registered as more than a blip on any profiler's radar.
The vma flag as an all-or-nothing choice is bad in my view, and its stickiness,
lifetime, and inheritance are troubling and hard to handle. Checking through
the node whether a page should undergo KSM or NUMA balancing is a better
solution in my view.
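To make the cost argument concrete: on most configurations page_to_nid() is
just a shift and mask on page->flags, roughly as sketched below (the exact bit
layout depends on config options, so treat this as illustrative):

    /* Sketch of page_to_nid() when the node id is encoded in
     * page->flags: a plain field extraction, no pointer chasing. */
    static inline int page_to_nid(const struct page *page)
    {
            return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
    }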
>
> >
> > After all, you can have a big vma with some pages in it being CDM and
> > others being regular pages. The CPU process might migrate to a different
> > CPU in a different node, and you might still want the regular pages to
> > migrate to this new node while keeping the CDM pages in place while the
> > device is still working on them.
>
> Right, that is the ideal thing to do. But won't it be better to split the
> big VMA into smaller chunks and tag them appropriately, so that the tagged
> VMAs contain as many CDM pages as possible, making them likely candidates
> to be restricted from auto NUMA, KSM, etc.?
Think of a vma in which every odd 4K address points to a device page and every
even 4K address points to a regular page; would you want to create one vma per
page for this? Splitting at each alternation gives a single-page vma per 4K,
so a 1GB range would need 262,144 vmas, far beyond the default
vm.max_map_count. Setting a policy for allocation makes sense (see the
userspace sketch below), but setting a flag that enables/disables kernel
features for a range, overriding other policies, is bad in my view.
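For reference, userspace can already express such placement through the
existing policy API; a rough sketch only, where node 1 standing in for the CDM
node is entirely made up:

    /* Userspace sketch: bind an anonymous range to a hypothetical CDM
     * node (id 1 is invented for illustration). Link with -lnuma. */
    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 2UL << 20;                 /* 2MB range */
            void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            unsigned long nodemask = 1UL << 1;      /* hypothetical CDM node 1 */

            if (addr == MAP_FAILED)
                    return 1;
            if (mbind(addr, len, MPOL_BIND, &nodemask,
                      8 * sizeof(nodemask), 0))
                    perror("mbind");
            return 0;
    }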
>
> >
> > This is just an example; the same can apply to KSM or any other kernel
> > feature you want to special-case. Maybe we can store a set of flags in the
> > node that tells what is allowed for pages in that node (ksm, hugetlb,
> > migrate, numa, ...).
> >
> > This would be more flexible, and the policy choice can be left to each
> > device driver.
>
> Hmm, that's another way of doing the special casing. The other way, as Dave
> mentioned before, is to classify coherent memory properties into various
> kinds, store them for each node, and implement a predefined set of
> restrictions for each kind of coherent memory, which might include features
> like auto NUMA, HugeTLB, KSM, etc. Won't maintaining two different property
> sets (one for the kind of coherent memory and the other for each special
> case) be too complicated?
I am not sure I follow. You have a single mask, provided by the driver that
registers the memory, something like:
CDM_ALLOW_NUMA (1 << 0)
CDM_ALLOW_KSM (1 << 1)
...
Then you have bool page_node_allow_numa(page), bool page_node_allow_ksm(page),
... and that is it. Both NUMA balancing and KSM perform heavy operations, and
having to check a mask inside the node struct isn't going to slow them down.
I am not talking about matching kinds to sets of restrictions, just a simple
mask of things that are allowed on that memory. You can add things like GUP or
any other mechanism that I can't think of right now.
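A minimal sketch of what I have in mind, assuming a hypothetical cdm_flags
field added to struct pglist_data (none of these names exist today; the driver
would fill the mask when it registers the memory):

    /* Hypothetical per-node feature mask for coherent device memory. */
    #define CDM_ALLOW_NUMA	(1UL << 0)
    #define CDM_ALLOW_KSM	(1UL << 1)

    /* Assumed: unsigned long cdm_flags in struct pglist_data, set once
     * by the registering driver; regular nodes keep all bits set. */
    static inline bool page_node_allow_numa(struct page *page)
    {
            return NODE_DATA(page_to_nid(page))->cdm_flags & CDM_ALLOW_NUMA;
    }

    static inline bool page_node_allow_ksm(struct page *page)
    {
            return NODE_DATA(page_to_nid(page))->cdm_flags & CDM_ALLOW_KSM;
    }

KSM and the NUMA balancing paths would simply call these helpers before acting
on a page, so nothing changes for pages on regular nodes.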
I really think that the vma flag is a bad idea. My expectation is that we will
see more vmas with a mix of device and regular memory. I don't think the only
workloads will be some big device-only vma (ie, only accessed by the device)
or a CPU-only one. I believe we will see everything on the spectrum, from
highly fragmented to completely regular.
Cheers,
Jerome