Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status

From: David Hildenbrand <david@redhat.com>
To: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Weixi Zhu <weixi.zhu@huawei.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, weixi.zhu@openeuler.sh,
	mgorman@suse.de, jglisse@redhat.com, rcampbell@nvidia.com,
	jhubbard@nvidia.com, apopple@nvidia.com, mhairgrove@nvidia.com,
	ziy@nvidia.com, alexander.deucher@amd.com,
	christian.koenig@amd.com, Xinhui.Pan@amd.com,
	amd-gfx@lists.freedesktop.org, Felix.Kuehling@amd.com,
	ogabbay@kernel.org, dri-devel@lists.freedesktop.org,
	jgg@nvidia.com, leonro@nvidia.com, zhenyuw@linux.intel.com,
	zhi.a.wang@intel.com, intel-gvt-dev@lists.freedesktop.org,
	intel-gfx@lists.freedesktop.org, jani.nikula@linux.intel.com,
	joonas.lahtinen@linux.intel.com, rodrigo.vivi@intel.com,
	tvrtko.ursulin@linux.intel.com
Subject: Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status
Date: Mon, 4 Dec 2023 11:21:03 +0100
Message-ID: <1c68ee91-1b6a-41e8-b96f-bcaf9faffa08@redhat.com>
In-Reply-To: <CAKbZUD25mwVXowDcN1Cj5Op9wRAopYhYZcesR0tk2r_Wn-d95g@mail.gmail.com>

On 02.12.23 15:50, Pedro Falcato wrote:
> On Fri, Dec 1, 2023 at 9:23 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 28.11.23 13:50, Weixi Zhu wrote:
>>> This patch adds an abstraction layer, struct vm_object, that maintains
>>> per-process virtual-to-physical mapping status stored in struct gm_mapping.
>>> For example, a virtual page may be mapped to a CPU physical page or to a
>>> device physical page. Struct vm_object effectively maintains an
>>> arch-independent page table, which is defined as a "logical page table",
>>> while the arch-dependent page table used by a real MMU is named a "physical
>>> page table". The logical page table is useful if Linux core MM is extended
>>> to handle a unified virtual address space with external accelerators using
>>> customized MMUs.
>>
>> Which raises the question why we are dealing with anonymous memory at
>> all? Why not go for shmem if you are already only special-casing VMAs
>> with a MMAP flag right now?
>>
>> That would maybe avoid having to introduce controversial BSD design
>> concepts into Linux, that feel like going a step backwards in time to me
>> and adding *more* MM complexity.
>>
>>>
>>> In this patch, struct vm_object utilizes a radix
>>> tree (xarray) to track where a virtual page is mapped to. This adds extra
>>> memory consumption from xarray, but provides a nice abstraction to isolate
>>> mapping status from the machine-dependent layer (PTEs). Besides supporting
>>> accelerators with external MMUs, struct vm_object is planned to be further
>>> unified with i_pages in struct address_space for file-backed memory.
>>
>> A file already has a tree structure (pagecache) to manage the pages that
>> are theoretically mapped. It's easy to translate from a VMA to a page
>> inside that tree structure that is currently not present in page tables.
>>
>> Why the need for that tree structure if you can just remove anon memory
>> from the picture?
>>
>>>
>>> The idea of struct vm_object originates from the FreeBSD VM design, which
>>> provides a unified abstraction for anonymous memory, file-backed memory,
>>> page cache, etc. [1].
>>
>> :/
>>
>>> Currently, Linux utilizes a set of hierarchical page walk functions to
>>> abstract page table manipulations of different CPU architectures. The
>>> problem happens when a device wants to reuse Linux MM code to manage its
>>> page table -- the device page table may not be accessible to the CPU.
>>> Existing solutions like Linux HMM utilize the MMU notifier mechanism to
>>> invoke device-specific MMU functions, but rely on encoding the mapping
>>> status on the CPU page table entries. This entangles machine-independent
>>> code with machine-dependent code, and also brings unnecessary restrictions.
>>
>> Why? We have primitives to walk arch page tables in a non-arch specific
>> fashion and are using them all over the place.
>>
>> We even have various mechanisms to map something into the page tables
>> and get the CPU to fault on it, as if it is inaccessible (PROT_NONE as
>> used for NUMA balancing, fake swap entries).
>>
>>> The PTE size and format vary from arch to arch, which harms extensibility.
>>
>> Not really.
>>
>> We might have some features limited to some architectures because of the
>> lack of PTE bits. And usually the problem is that people don't care
>> enough about enabling these features on older architectures.
>>
>> If we ever *really* need more space for sw-defined data, it would be
>> possible to allocate auxiliary data for page tables only where required
>> (where the features apply), instead of crafting a completely new,
>> auxiliary data structure with its own locking.
>>
>> So far it was not required to enable the feature we need on the
>> architectures we care about.
>>
>>>
>>> [1] https://docs.freebsd.org/en/articles/vm-design/
>>
>> In the cover letter you have:
>>
>> "The future plan of logical page table is to provide a generic
>> abstraction layer that support common anonymous memory (I am looking at
>> you, transparent huge pages) and file-backed memory."
>>
>> Which I doubt will happen; there is little interest in making anonymous
>> memory management slower, more serialized, and wasting more memory on
>> metadata.
> 
> Also worth noting that:
> 
> 1) Mach VM (which FreeBSD inherited, from the old BSD) vm_objects
> aren't quite what's being stated here, rather they are somewhat
> replacements for both anon_vma and address_space[1]. Very similarly to
> Linux, they take pages from vm_objects and map them in page tables
> using pmap (the big difference is anon memory, which has its
> bookkeeping in page tables, on Linux)
> 
> 2) These vm_objects were a horrendous mistake (see CoW chaining) and
> FreeBSD has to go to horrendous lengths to make them tolerable. The
> UVM paper/dissertation (by Charles Cranor) talks about these issues at
> length, and 20 years later it's still true.
> 
> 3) Despite Linux MM having its warts, it's probably correct to
> consider it a solid improvement over FreeBSD MM or NetBSD UVM
> 
> And, finally, randomly tacking on core MM concepts from other systems
> is at best a *really weird* idea. Particularly when they aren't even
> what was stated!

Can you read my mind? :) Thanks for noting all that, with which I 100%
agree.
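
To make the quoted point about the pagecache concrete ("It's easy to
translate from a VMA to a page inside that tree structure"), here is a
rough userspace sketch of the index arithmetic. It is not kernel code:
struct sketch_vma, sketch_page_index() and SKETCH_PAGE_SHIFT are names
made up for illustration, and 4 KiB pages are an assumption. The
arithmetic is meant to mirror what linear_page_index() does for
ordinary (non-hugetlb) mappings.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical, stripped-down stand-in for struct vm_area_struct. */
struct sketch_vma {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_end;	/* first address past the end of the mapping */
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/*
 * Index of the page that backs addr in the file's page tree:
 * the page offset of addr within the VMA, plus the page offset of the
 * VMA within the file. No page table lookup is involved.
 */
static uint64_t sketch_page_index(const struct sketch_vma *vma, uint64_t addr)
{
	assert(addr >= vma->vm_start && addr < vma->vm_end);
	return ((addr - vma->vm_start) >> SKETCH_PAGE_SHIFT) + vma->vm_pgoff;
}
```

With a VMA that starts at file page 16, an address 0x3abc bytes into
the mapping lands in file page 19. The resulting index is the key one
would use to look the page up in the file's tree, whether or not the
page is currently present in the page tables.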

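As a loose userspace analogy for the quoted remark that something can be
mapped and still made to fault "as if it is inaccessible", the sketch
below populates an anonymous page, changes its protection with
mprotect(), and reports whether a read faults. This only illustrates the
idea; NUMA balancing applies PROT_NONE inside the kernel and handles the
resulting fault itself. The sketch_* names are hypothetical, and the
code assumes Linux, where the fault is delivered as SIGSEGV.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on glibc */
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sketch_env;

/* SIGSEGV handler: jump back to the point saved before the access. */
static void sketch_on_segv(int sig)
{
	(void)sig;
	siglongjmp(sketch_env, 1);
}

/*
 * Map one anonymous page, write to it so it is populated, switch it to
 * the protection prot, then read from it.
 * Returns 1 if the read faulted, 0 if it succeeded, -1 on setup error.
 */
static int sketch_touch_faults(int prot)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	struct sigaction sa, old;
	volatile int faulted = -1;
	volatile char *p;
	void *mem;

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;
	p = mem;
	p[0] = 42;			/* page is now backed by memory */

	sa.sa_handler = sketch_on_segv;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) != 0) {
		munmap(mem, len);
		return -1;
	}

	if (mprotect(mem, len, prot) == 0) {
		if (sigsetjmp(sketch_env, 1) == 0) {
			(void)p[0];	/* faults if prot forbids reads */
			faulted = 0;
		} else {
			faulted = 1;	/* reached via the signal handler */
		}
	}

	sigaction(SIGSEGV, &old, NULL);	/* restore the previous handler */
	munmap(mem, len);
	return faulted;
}
```

Calling sketch_touch_faults(PROT_NONE) is expected to report a fault,
while sketch_touch_faults(PROT_READ) is not: the page stays mapped in
both cases, only the access permission differs.
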
-- 
Cheers,

David / dhildenb
