* Separating xe_vma- and page-table state
@ 2024-03-12 7:43 Thomas Hellström
2024-03-12 23:02 ` Zeng, Oak
0 siblings, 1 reply; 11+ messages in thread
From: Thomas Hellström @ 2024-03-12 7:43 UTC (permalink / raw)
To: intel-xe; +Cc: Matthew Brost, Zeng, Oak
Hi,
It's IMO become apparent both in the system allocator discussion and in
the patch that enables the presence of invalid vmas that we need to be
better at separating xe_vma and page-table state, so that xe_vma state
would contain things that are mostly immutable and that the user
requested: PAT index, memory attributes, requested tile presence etc,
whereas the page-table state would contain mutable state like actual
tile presence, invalidation state and MMU notifier.
So far we have had no reason to separate the two, but with hmmptr we
would likely end up with multiple page-table regions per xe-vma, and
with the patch discussed earlier we could've easily reused
xe_vm_unbind_vma() that only touches the mutable page-table state and
does the correct locking.
The page-table code would then typically take a const xe_vma * and
an xe_pt_state *, or whatever we choose to call it. All xe_vmas except
hmmptr ones would have a 1:1 xe_vma <-> xe_pt_state relationship.
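To make this concrete, a rough sketch of the split I have in mind (all
struct and field names below are placeholders, not a final layout):

/* Mostly immutable, user-requested state */
struct xe_vma {
        struct drm_gpuva gpuva;         /* VA range and backing */
        u16 pat_index;                  /* requested PAT index */
        u8 tile_mask;                   /* requested tile presence */
        /* memory attributes, flags, ... */
};

/* Mutable state owned by the page-table code */
struct xe_pt_state {
        struct xe_vma *vma;             /* 1:1 except for hmmptr */
        u8 tile_present;                /* actual tile presence */
        bool invalidated;               /* invalidation state */
        struct mmu_interval_notifier notifier; /* userptr / hmmptr */
};

int xe_pt_bind(const struct xe_vma *vma, struct xe_pt_state *pt_state);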
Thoughts, comments?
Thanks,
Thomas
* RE: Separating xe_vma- and page-table state
2024-03-12 7:43 Separating xe_vma- and page-table state Thomas Hellström
@ 2024-03-12 23:02 ` Zeng, Oak
2024-03-13 1:27 ` Matthew Brost
0 siblings, 1 reply; 11+ messages in thread
From: Zeng, Oak @ 2024-03-12 23:02 UTC (permalink / raw)
To: Thomas Hellström, intel-xe@lists.freedesktop.org; +Cc: Brost, Matthew
Hi Thomas,
> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Sent: Tuesday, March 12, 2024 3:43 AM
> To: intel-xe@lists.freedesktop.org
> Cc: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> <oak.zeng@intel.com>
> Subject: Separating xe_vma- and page-table state
>
> Hi,
>
> It's IMO become apparent both in the system allocator discussion and in
> the patch that enables the presence of invalid vmas
> that we need to be
> better at separating xe_vma and page-table state, so that xe_vma state
> would contain things that are mostly immutable and that the user
> requested: PAT index, memory attributes, requested tile presence etc,
> whereas the page-table state would contain mutable state like actual
> tile presence, invalidation state and MMU notifier.
It is valid reasoning to me... if we want to do what the community wants us to do with the system allocator,
and if we want to meet our UMD's requirement of "free w/o vm_unbind", yes, we need this "invalid" vma concept.
The strange thing is, it seems Matt can still achieve the goal without introducing the invalid vma concept... it doesn't look like he has
this concept in his patch....
>
> So far we have had no reason to separate the two, but with hmmptr we
> would likely end up with multiple page-table regions per xe-vma,
Can we still maintain a 1 xe-vma : 1 page-table-region relationship for simplicity?
This requires vma splitting, i.e., if you have an hmmptr vma covering range [0~100]
and a fault happens at range [40~60], then we will end up with 3 vmas:
[0~40], a dummy hmmptr vma, not mapped to the GPU
[40~60], an hmmptr vma mapped to the GPU
[60~100], dummy, not mapped to the GPU.
Does this work for you? Or do you see a benefit of not splitting the vma?
> and
> with the patch discussed earlier we could've easily reused
> xe_vm_unbind_vma() that only touches the mutable page-table state and
> does the correct locking.
>
> The page-table code would then typically take a const xe_vma * and
> an xe_pt_state *, or whatever we choose to call it. All xe_vmas except
> hmmptr ones would have a 1:1 xe_vma <-> xe_pt_state relationship.
>
Matt has POC code which works for me for the system allocator w/o this vma/pt_state splitting: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commits/system_allocator_poc?ref_type=heads
Can you take a look also...
When I worked on the system allocator HLD, I pictured the invalid vma concept also. But somehow Matt was able to make it work without such a concept....
Oak
> Thoughts, comments?
>
> Thanks,
> Thomas
* Re: Separating xe_vma- and page-table state
2024-03-12 23:02 ` Zeng, Oak
@ 2024-03-13 1:27 ` Matthew Brost
2024-03-13 2:16 ` Zeng, Oak
2024-03-13 10:56 ` Thomas Hellström
0 siblings, 2 replies; 11+ messages in thread
From: Matthew Brost @ 2024-03-13 1:27 UTC (permalink / raw)
To: Zeng, Oak; +Cc: Thomas Hellström, intel-xe@lists.freedesktop.org
On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> Hi Thomas,
Going to reply to both Thomas and Oak to not split this thread.
>
> > -----Original Message-----
> > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Sent: Tuesday, March 12, 2024 3:43 AM
> > To: intel-xe@lists.freedesktop.org
> > Cc: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > <oak.zeng@intel.com>
> > Subject: Separating xe_vma- and page-table state
> >
> > Hi,
> >
> > It's IMO become apparent both in the system allocator discussion and in
> > the patch that enables the presence of invalid vmas
> > that we need to be
> > better at separating xe_vma and page-table state, so that xe_vma state
> > would contain things that are mostly immutable and that the user
> > requested: PAT index, memory attributes, requested tile presence etc,
> > whereas the page-table state would contain mutable state like actual
> > tile presence, invalidation state and MMU notifier.
Thomas:
100% agree with the idea. There are 2 distinct parts of xe_vma - what I
call GPUVM state and PT state. In addition, there is what I call IOCTL
state (sync objs, etc...) that is updated in various places in xe_vm.c.
Functions to update these respective states should be clearly separated.
The current implementation is a complete mess in this regard; you
mention xe_vm_unbind_vma below, and in its current state there is no way
this function could easily be reused.
My series [1] vastly improves this separation. It still could be split a
bit better though. In my next rebase I can look at this series through
the lens of clearly maintaining separation + likely update [2] to do an
unbind rather than an invalidation.
[1] https://patchwork.freedesktop.org/series/125608/
[2] https://patchwork.freedesktop.org/series/130935/
>
> It is valid reasoning to me... if we want to do what the community wants us to do with the system allocator,
> and if we want to meet our UMD's requirement of "free w/o vm_unbind", yes, we need this "invalid" vma concept.
>
> The strange thing is, it seems Matt can still achieve the goal without introducing the invalid vma concept... it doesn't look like he has
> this concept in his patch....
>
Oak:
I'm a little confused by what you mean by the "invalid" vma concept. Does
that mean no GPU backing or page tables? That is essentially what I call
a system allocator VMA in [3], right?
Also, based on what Thomas says below related to xe_vm_unbind_vma and [2], we
should be able to have a VMA that points to a backing store (either
userptr or BO) but doesn't have GPU mappings. We do have this for
faulting VMs before the initial bind of a VMA, but we cannot currently
dynamically change this (we can invalidate page tables but cannot remove
them without destroying the VMA). We should fix that.
[3] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
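Something like this hypothetical helper is what I mean (the name is made
up and the xe_vm_unbind_vma() arguments are elided, since it would first
need the rework Thomas suggests so it only touches mutable PT state):

/*
 * Drop a VMA's GPU page tables but keep the xe_vma and its backing
 * store (BO or userptr) alive so a later fault or bind can remap it.
 */
static int xe_vma_unmap_keep(struct xe_vma *vma)
{
        struct xe_vm *vm = xe_vma_vm(vma);
        struct dma_fence *fence;

        lockdep_assert_held_write(&vm->lock);

        /* Reuse the unbind path that only touches page-table state */
        fence = xe_vm_unbind_vma(vma /* , sync args elided */);
        if (IS_ERR(fence))
                return PTR_ERR(fence);
        dma_fence_put(fence);

        vma->tile_present = 0;  /* mutable state, future xe_pt_state */
        return 0;
}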
>
> >
> > So far we have had no reason to separate the two, but with hmmptr we
> > would likely end up with multiple page-table regions per xe-vma,
>
> Can we still maintain a 1 xe-vma : 1 page-table-region relationship for simplicity?
>
> This requires vma splitting, i.e., if you have an hmmptr vma covering range [0~100]
> and a fault happens at range [40~60], then we will end up with 3 vmas:
> [0~40], a dummy hmmptr vma, not mapped to the GPU
> [40~60], an hmmptr vma mapped to the GPU
> [60~100], dummy, not mapped to the GPU.
>
> Does this work for you? Or do you see a benefit of not splitting the vma?
>
Thomas / Oak:
Agree that at least for the initial implementation a 1:1 relationship will
be easier. I'm also unsure what the benefit of not splitting VMAs is.
That being said, I don't think this is something we need to decide now.
>
>
> and
> > with the patch discussed earlier we could've easily reused
> > xe_vm_unbind_vma() that only touches the mutable page-table state and
> > does the correct locking.
> >
> > The page-table code would then typically take a const xe_vma * and
> > an xe_pt_state *, or whatever we choose to call it. All xe_vmas except
> > hmmptr ones would have a 1:1 xe_vma <-> xe_pt_state relationship.
Thomas:
I like the idea of VMAs in the PT code functions being marked as const
and having the xe_pt_state as non-const. It makes ownership very clear.
Not sure how that will fit into [1], as that series passes around
a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does this
to make "struct xe_vm_ops" a single atomic operation. The VMAs are
extracted from either the GPUVM base operation or "struct xe_vma_op". Maybe
these can be const? I'll look into that, but this might not work out in
practice.
I'm also unsure how a 1:N xe_vma <-> xe_pt_state relationship fits in with
hmmptrs. Could you explain your thinking here?
> >
>
> Matt has POC code which works for me for the system allocator w/o this vma/pt_state splitting: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commits/system_allocator_poc?ref_type=heads
>
> Can you take a look also...
>
Thomas:
It might be helpful to look at that branch too; the last 3 patches [4]
[5] [6] implement a PoC system allocator design built on top of [1],
using existing userptrs for now, and are based on our discussions in the
system allocator design doc. Coding is sometimes the easiest way to
convey ideas, and this is roughly what I had in mind. It's not complete by
any means and there is still tons of work to do to make this a working solution.
[4] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
[5] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commit/d2371e7c007406fb5e84e6605da238d12caa5c62
[6] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commit/91cf8b38428ae9180c92435d3519193d074e837a
> When I worked on the system allocator HLD, I pictured the invalid vma concept also. But somehow Matt was able to make it work without such a concept....
>
Oak:
Still unclear on this. Maybe we can chat offline to clear this up.
Matt
> Oak
>
> > Thoughts, comments?
> >
> > Thanks,
> > Thomas
* RE: Separating xe_vma- and page-table state
2024-03-13 1:27 ` Matthew Brost
@ 2024-03-13 2:16 ` Zeng, Oak
2024-03-13 3:16 ` Matthew Brost
2024-03-13 10:56 ` Thomas Hellström
1 sibling, 1 reply; 11+ messages in thread
From: Zeng, Oak @ 2024-03-13 2:16 UTC (permalink / raw)
To: Brost, Matthew; +Cc: Thomas Hellström, intel-xe@lists.freedesktop.org
Hi Matt,
> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: Tuesday, March 12, 2024 9:27 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; intel-
> xe@lists.freedesktop.org
> Subject: Re: Separating xe_vma- and page-table state
>
> On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > Hi Thomas,
>
> Going to reply to both Thomas and Oak to not split this thread.
>
> >
> > > -----Original Message-----
> > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Sent: Tuesday, March 12, 2024 3:43 AM
> > > To: intel-xe@lists.freedesktop.org
> > > Cc: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > > <oak.zeng@intel.com>
> > > Subject: Separating xe_vma- and page-table state
> > >
> > > Hi,
> > >
> > > It's IMO become apparent both in the system allocator discussion and in
> > > the patch that enables the presence of invalid vmas
> > > that we need to be
> > > better at separating xe_vma and page-table state, so that xe_vma state
> > > would contain things that are mostly immutable and that the user
> > > requested: PAT index, memory attributes, requested tile presence etc,
> > > whereas the page-table state would contain mutable state like actual
> > > tile presence, invalidation state and MMU notifier.
>
> Thomas:
>
> 100% agree with the idea. There are 2 distinct parts of xe_vma - what I
> call GPUVM state and PT state. In addition, there is what I call IOCTL
> state (sync objs, etc...) that is updated in various places in xe_vm.c.
>
> Functions to update these respective states should be clearly separated.
> The current implementation is a complete mess in this regard; you
> mention xe_vm_unbind_vma below, and in its current state there is no way
> this function could easily be reused.
>
> My series [1] vastly improves this separation. It still could be split a
> bit better though. In my next rebase I can look at this series through
> the lens of clearly maintaining separation + likely update [2] to do an
> unbind rather than an invalidation.
>
> [1] https://patchwork.freedesktop.org/series/125608/
> [2] https://patchwork.freedesktop.org/series/130935/
>
> >
> > It is valid reasoning to me... if we want to do what the community wants us to do
> > with the system allocator,
> > and if we want to meet our UMD's requirement of "free w/o vm_unbind", yes,
> > we need this "invalid" vma concept.
> >
> > The strange thing is, it seems Matt can still achieve the goal without introducing
> > the invalid vma concept... it doesn't look like he has
> > this concept in his patch....
> >
>
> Oak:
>
> I'm a little confused by what you mean by the "invalid" vma concept. Does
> that mean no GPU backing or page tables? That is essentially what I call
> a system allocator VMA in [3], right?
Thomas also mentioned this "invalid vmas" concept. Here is my understanding:
1) From the system allocator discussion: we introduced a gigantic vma which is invalid because addresses in this gigantic vma are not valid gpu addresses.
When we create a vma in the gpu page fault handler, we carve a vma out of the gigantic vma; this vma is a valid vma.
Note there is a subtle difference b/t an invalid vma and a vma without backing or page tables. A valid vma can be migrated to system memory, and at that time it loses its gpu backing and gpu page tables, but it is still a valid vma.
In the POC code you gave, you don't have the migration logic, so you don't need the notion of an "invalid vma".
2) From the userptr discussion of "free userptr without vm_unbind": in your patch, when the user frees a userptr without vm_unbind, the code behavior is that you invalidate that userptr in the gpu page table and put it on a repin list. The userptr vma will be repinned on the next exec submission. I guess Thomas calls such a vma (freed, but not vm_unbound) an "invalid vma".
For me this behavior causes unnecessary complication. I would destroy the userptr vma when the user frees it. We can create a new userptr vma on the next umd vm_bind. Is this simpler?
Oak
>
> Also, based on what Thomas says below related to xe_vm_unbind_vma and [2], we
> should be able to have a VMA that points to a backing store (either
> userptr or BO) but doesn't have GPU mappings. We do have this for
> faulting VMs before the initial bind of a VMA, but we cannot currently
> dynamically change this (we can invalidate page tables but cannot remove
> them without destroying the VMA). We should fix that.
>
> [3] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> /commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
>
> >
> > >
> > > So far we have had no reason to separate the two, but with hmmptr we
> > > would likely end up with multiple page-table regions per xe-vma,
> >
> > Can we still maintain a 1 xe-vma : 1 page-table-region relationship for simplicity?
> >
> > This requires vma splitting, i.e., if you have an hmmptr vma covering range [0~100]
> > and a fault happens at range [40~60], then we will end up with 3 vmas:
> > [0~40], a dummy hmmptr vma, not mapped to the GPU
> > [40~60], an hmmptr vma mapped to the GPU
> > [60~100], dummy, not mapped to the GPU.
> >
> > Does this work for you? Or do you see a benefit of not splitting the vma?
> >
>
> Thomas / Oak:
>
> Agree that at least for the initial implementation a 1:1 relationship will
> be easier. I'm also unsure what the benefit of not splitting VMAs is.
>
> That being said, I don't think this is something we need to decide now.
>
> >
> >
> > and
> > > with the patch discussed earlier we could've easily reused
> > > xe_vm_unbind_vma() that only touches the mutable page-table state and
> > > does the correct locking.
> > >
> > > The page-table code would then typically take a const xe_vma * and
> > > an xe_pt_state *, or whatever we choose to call it. All xe_vmas except
> > > hmmptr ones would have a 1:1 xe_vma <-> xe_pt_state relationship.
>
> Thomas:
>
> I like the idea of VMAs in the PT code functions being marked as const
> and having the xe_pt_state as non-const. It makes ownership very clear.
>
> Not sure how that will fit into [1], as that series passes around
> a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does this
> to make "struct xe_vm_ops" a single atomic operation. The VMAs are
> extracted from either the GPUVM base operation or "struct xe_vma_op". Maybe
> these can be const? I'll look into that, but this might not work out in
> practice.
>
> I'm also unsure how a 1:N xe_vma <-> xe_pt_state relationship fits in with
> hmmptrs. Could you explain your thinking here?
>
> > >
> >
> > Matt has POC code which works for me for the system allocator w/o this
> > vma/pt_state splitting: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-
> > system-allocator/-/commits/system_allocator_poc?ref_type=heads
> >
> > Can you take a look also...
> >
>
> Thomas:
>
> It might be helpful to look at that branch too; the last 3 patches [4]
> [5] [6] implement a PoC system allocator design built on top of [1],
> using existing userptrs for now, and are based on our discussions in the
> system allocator design doc. Coding is sometimes the easiest way to
> convey ideas, and this is roughly what I had in mind. It's not complete by
> any means and there is still tons of work to do to make this a working solution.
>
> [4] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> /commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
> [5] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> /commit/d2371e7c007406fb5e84e6605da238d12caa5c62
> [6] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> /commit/91cf8b38428ae9180c92435d3519193d074e837a
>
> > When I worked on the system allocator HLD, I pictured the invalid vma concept also.
> > But somehow Matt was able to make it work without such a concept....
> >
>
> Oak:
>
> Still unclear on this. Maybe we can chat offline to clear this up.
>
> Matt
>
> > Oak
> >
> > > Thoughts, comments?
> > >
> > > Thanks,
> > > Thomas
* Re: Separating xe_vma- and page-table state
2024-03-13 2:16 ` Zeng, Oak
@ 2024-03-13 3:16 ` Matthew Brost
0 siblings, 0 replies; 11+ messages in thread
From: Matthew Brost @ 2024-03-13 3:16 UTC (permalink / raw)
To: Zeng, Oak; +Cc: Thomas Hellström, intel-xe@lists.freedesktop.org
On Tue, Mar 12, 2024 at 08:16:24PM -0600, Zeng, Oak wrote:
> Hi Matt,
>
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: Tuesday, March 12, 2024 9:27 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; intel-
> > xe@lists.freedesktop.org
> > Subject: Re: Separating xe_vma- and page-table state
> >
> > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > Hi Thomas,
> >
> > Going to reply to both Thomas and Oak to not split this thread.
> >
> > >
> > > > -----Original Message-----
> > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > Sent: Tuesday, March 12, 2024 3:43 AM
> > > > To: intel-xe@lists.freedesktop.org
> > > > Cc: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > > > <oak.zeng@intel.com>
> > > > Subject: Separating xe_vma- and page-table state
> > > >
> > > > Hi,
> > > >
> > > > It's IMO become apparent both in the system allocator discussion and in
> > > > the patch that enables the presence of invalid vmas
> > > > that we need to be
> > > > better at separating xe_vma and page-table state, so that xe_vma state
> > > > would contain things that are mostly immutable and that the user
> > > > requested: PAT index, memory attributes, requested tile presence etc,
> > > > whereas the page-table state would contain mutable state like actual
> > > > tile presence, invalidation state and MMU notifier.
> >
> > Thomas:
> >
> > 100% agree with the idea. There are 2 distinct parts of xe_vma - what I
> > call GPUVM state and PT state. In addition, there is what I call IOCTL
> > state (sync objs, etc...) that is updated in various places in xe_vm.c.
> >
> > Functions to update these respective states should be clearly separated.
> > The current implementation is a complete mess in this regard; you
> > mention xe_vm_unbind_vma below, and in its current state there is no way
> > this function could easily be reused.
> >
> > My series [1] vastly improves this separation. It still could be split a
> > bit better though. In my next rebase I can look at this series through
> > the lens of clearly maintaining separation + likely update [2] to do an
> > unbind rather than an invalidation.
> >
> > [1] https://patchwork.freedesktop.org/series/125608/
> > [2] https://patchwork.freedesktop.org/series/130935/
> >
> > >
> > > It is valid reasoning to me... if we want to do what the community wants us to do
> > > with the system allocator,
> > > and if we want to meet our UMD's requirement of "free w/o vm_unbind", yes,
> > > we need this "invalid" vma concept.
> > >
> > > The strange thing is, it seems Matt can still achieve the goal without introducing
> > > the invalid vma concept... it doesn't look like he has
> > > this concept in his patch....
> > >
> >
> > Oak:
> >
> > I'm a little confused by what you mean by the "invalid" vma concept. Does
> > that mean no GPU backing or page tables? That is essentially what I call
> > a system allocator VMA in [3], right?
>
> Thomas also mentioned this "invalid vmas" concept. Here is my understanding:
>
> 1) From the system allocator discussion: we introduced a gigantic vma which is invalid because addresses in this gigantic vma are not valid gpu addresses.
> When we create a vma in the gpu page fault handler, we carve a vma out of the gigantic vma; this vma is a valid vma.
>
> Note there is a subtle difference b/t an invalid vma and a vma without backing or page tables. A valid vma can be migrated to system memory, and at that time it loses its gpu backing and gpu page tables, but it is still a valid vma.
>
> In the POC code you gave, you don't have the migration logic, so you don't need the notion of an "invalid vma".
>
We have gotten a little off topic from the original point of this thread...
Anyway, we already have this concept on tip, see xe_vm_invalidate_vma.
When a BO is moved or a userptr is invalidated we call
xe_vm_invalidate_vma, which invalidates the page tables. The next GPU
access will result in a fault, and the page fault handler will rebind the
VMA. Any new migration code would call xe_vm_invalidate_vma too.
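As a minimal sketch of how I'd expect new migration code to use it (the
wrapper below is hypothetical; xe_vm_invalidate_vma is the existing
function):

/* Called after the backing pages of @vma have moved */
static int xe_migrate_zap_vma(struct xe_vma *vma)
{
        /* Zap the GPU PTEs; the VMA itself stays valid */
        int err = xe_vm_invalidate_vma(vma);

        if (err)
                return err;

        /*
         * Nothing more to do: the next GPU access faults, and the
         * page fault handler revalidates and rebinds the VMA.
         */
        return 0;
}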
> 2) From the userptr discussion of "free userptr without vm_unbind": in your patch, when the user frees a userptr without vm_unbind, the code behavior is that you invalidate that userptr in the gpu page table and put it on a repin list. The userptr vma will be repinned on the next exec submission. I guess Thomas calls such a vma (freed, but not vm_unbound) an "invalid vma".
>
> For me this behavior causes unnecessary complication. I would destroy the userptr vma when the user frees it. We can create a new userptr vma on the next umd vm_bind. Is this simpler?
Still not following, but let's table this as, again, it is off topic.
Matt
>
> Oak
>
>
> >
> > Also, based on what Thomas says below related to xe_vm_unbind_vma and [2], we
> > should be able to have a VMA that points to a backing store (either
> > userptr or BO) but doesn't have GPU mappings. We do have this for
> > faulting VMs before the initial bind of a VMA, but we cannot currently
> > dynamically change this (we can invalidate page tables but cannot remove
> > them without destroying the VMA). We should fix that.
> >
> > [3] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> > /commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
> >
> > >
> > > >
> > > > So far we have had no reason to separate the two, but with hmmptr we
> > > > would likely end up with multiple page-table regions per xe-vma,
> > >
> > > Can we still maintain a 1 xe-vma : 1 page-table-region relationship for simplicity?
> > >
> > > This requires vma splitting, i.e., if you have an hmmptr vma covering range [0~100]
> > > and a fault happens at range [40~60], then we will end up with 3 vmas:
> > > [0~40], a dummy hmmptr vma, not mapped to the GPU
> > > [40~60], an hmmptr vma mapped to the GPU
> > > [60~100], dummy, not mapped to the GPU.
> > >
> > > Does this work for you? Or do you see a benefit of not splitting the vma?
> > >
> >
> > Thomas / Oak:
> >
> > Agree that at least for the initial implementation a 1:1 relationship will
> > be easier. I'm also unsure what the benefit of not splitting VMAs is.
> >
> > That being said, I don't think this is something we need to decide now.
> >
> > >
> > >
> > > and
> > > > with the patch discussed earlier we could've easily reused
> > > > xe_vm_unbind_vma() that only touches the mutable page-table state and
> > > > does the correct locking.
> > > >
> > > > The page-table code would then typically take a const xe_vma * and
> > > > an xe_pt_state *, or whatever we choose to call it. All xe_vmas except
> > > > hmmptr ones would have a 1:1 xe_vma <-> xe_pt_state relationship.
> >
> > Thomas:
> >
> > I like the idea of VMAs in the PT code functions being marked as const
> > and having the xe_pt_state as non-const. It makes ownership very clear.
> >
> > Not sure how that will fit into [1], as that series passes around
> > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does this
> > to make "struct xe_vm_ops" a single atomic operation. The VMAs are
> > extracted from either the GPUVM base operation or "struct xe_vma_op". Maybe
> > these can be const? I'll look into that, but this might not work out in
> > practice.
> >
> > I'm also unsure how a 1:N xe_vma <-> xe_pt_state relationship fits in with
> > hmmptrs. Could you explain your thinking here?
> >
> > > >
> > >
> > > Matt has POC code which works for me for the system allocator w/o this
> > > vma/pt_state splitting: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-
> > > system-allocator/-/commits/system_allocator_poc?ref_type=heads
> > >
> > > Can you take a look also...
> > >
> >
> > Thomas:
> >
> > It might be helpful to look at that branch too; the last 3 patches [4]
> > [5] [6] implement a PoC system allocator design built on top of [1],
> > using existing userptrs for now, and are based on our discussions in the
> > system allocator design doc. Coding is sometimes the easiest way to
> > convey ideas, and this is roughly what I had in mind. It's not complete by
> > any means and there is still tons of work to do to make this a working solution.
> >
> > [4] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> > /commit/3fcc83c9b075364a9d83415ca73a9f9625543d7c
> > [5] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> > /commit/d2371e7c007406fb5e84e6605da238d12caa5c62
> > [6] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-
> > /commit/91cf8b38428ae9180c92435d3519193d074e837a
> >
> > > When I worked on the system allocator HLD, I pictured the invalid vma concept also.
> > > But somehow Matt was able to make it work without such a concept....
> > >
> >
> > Oak:
> >
> > Still unclear on this. Maybe we can chat offline to clear this up.
> >
> > Matt
> >
> > > Oak
> > >
> > > > Thoughts, comments?
> > > >
> > > > Thanks,
> > > > Thomas
* Re: Separating xe_vma- and page-table state
2024-03-13 1:27 ` Matthew Brost
2024-03-13 2:16 ` Zeng, Oak
@ 2024-03-13 10:56 ` Thomas Hellström
2024-03-13 17:06 ` Zeng, Oak
2024-03-13 19:43 ` Matthew Brost
1 sibling, 2 replies; 11+ messages in thread
From: Thomas Hellström @ 2024-03-13 10:56 UTC (permalink / raw)
To: Matthew Brost, Zeng, Oak; +Cc: intel-xe@lists.freedesktop.org
On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > Hi Thomas,
>
>
....
> Thomas:
>
> I like the idea of VMAs in the PT code function being marked as const
> and having the xe_pt_state as non const. It makes ownership very
> clear.
>
> Not sure how that will fit into [1] as that series passes around
> a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does
> this
> to make "struct xe_vm_ops" a single atomic operation. The VMAs are
> extracted either the GPUVM base operation or "struct xe_vma_op".
> Maybe
> these can be const? I'll look into that but this might not work out
> in
> practice.
>
> Agree also unsure how 1:N xe_vma <-> xe_pt_state relationship fits in
> hmmptrs. Could you explain your thinking here?
There is a need for hmmptrs to be sparse. When we fault we create a
chunk of PTEs that we populate. This chunk could potentially be large
and cover the whole CPU vma, or it could be limited to, say, 2MiB and
aligned to allow for large page-table entries. In Oak's POC these
chunks are called "svm ranges".
So the question arises, how do we map that to the current vma
management and page-table code? There are basically two ways:
1) Split VMAs so they are either fully populated or unpopulated, each
svm_range becomes an xe_vma.
2) Create xe_pt_range / xe_pt_state or whatever, with a 1:1 mapping with
the svm_range and a 1:N mapping with xe_vmas.
Initially my thinking was that 1) would be the simplest approach with
the code we have today. I raised that briefly with Sima and he answered
"And why would we want to do that?", and the answer at hand was ofc
that the page-table code worked with vmas. Or rather that we mix vma
state (the hmmptr range / attributes) and page-table state (the regions
of the hmmptr that are actually populated), so it would be a
consequence of our current implementation (limitations).
With the suggestion to separate vma state and pt state, the xe_svm
ranges map to pt state and are managed per hmmptr vma. The vmas would
then be split mainly as a result of UMD mapping something else (bo) on
top, or UMD giving new memory attributes for a range (madvise type of
operations).
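As a rough sketch of 2), using the kernel's generic interval tree for
the lookup (all names here are hypothetical):

struct xe_pt_state {
        struct interval_tree_node itree; /* [start, last] in the vma */
        struct xe_vma *vma;              /* 1:N for hmmptr vmas */
        u8 tile_present;
        struct mmu_interval_notifier notifier;
};

/* GPU fault handler: look up or create the populated chunk */
static struct xe_pt_state *
xe_hmmptr_range_get(struct xe_vma *vma, u64 fault_addr)
{
        u64 start = ALIGN_DOWN(fault_addr, SZ_2M);
        struct interval_tree_node *node;

        /* vma->pt_states would be a struct rb_root_cached */
        node = interval_tree_iter_first(&vma->pt_states, start,
                                        start + SZ_2M - 1);
        if (node)
                return container_of(node, struct xe_pt_state, itree);

        return xe_pt_state_create(vma, start, SZ_2M); /* hypothetical */
}

Splitting a vma would then mostly mean handing its xe_pt_states over to
the new vmas rather than tearing down and rebuilding page tables.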
/Thomas
* RE: Separating xe_vma- and page-table state
2024-03-13 10:56 ` Thomas Hellström
@ 2024-03-13 17:06 ` Zeng, Oak
2024-03-14 8:52 ` Thomas Hellström
2024-03-13 19:43 ` Matthew Brost
1 sibling, 1 reply; 11+ messages in thread
From: Zeng, Oak @ 2024-03-13 17:06 UTC (permalink / raw)
To: Thomas Hellström, Brost, Matthew; +Cc: intel-xe@lists.freedesktop.org
Hi Thomas,
For simplicity of the discussion, let's forget about BO vm_bind, and forget about memory attributes for a moment... Only consider the system allocator. So with the scheme below, we have a gigantic xe_vma in the background holding some immutable state, never split. And we have mutable page-table state which is created during GPU access and destroyed dynamically during CPU munmap/invalidation.
For the mutable page-table state, you would maintain another RB-tree so you can search it, as I did in the POC; the tree is in xe_svm. For the BO driver you don't need this extra tree, you just need the xe_vma tree, as xe_vma has a 1:1 mapping with page-table state for the BO driver...
I can see this scheme aligning with my POC....
Mapping this scheme to the userptr "free without vm_unbind" thing, I can see that when the user frees, we can destroy the page-table state during the mmu notifier callback, while keeping the xe_vma. Is this also how you look at it?
Need to say, the "free without vm_unbind" thing should only affect our decision temporarily: once the system allocator is ready, UMDs won't need the userptr vm_bind anymore, so the problem will be solved more completely with the system allocator - the umd just removes the vm_bind and things would magically work with the system allocator. I guess what the user really needs is a system allocator, but we didn't have it at that time, so userptr technology was used. Long term, the system allocator should eventually replace userptr.
One thing I can't picture clearly is how hard it is to change the current xekmd to separate xe_vma into mutable and immutable parts.
Is the split scheme, with xe_vma maintaining both mutable and immutable state, simpler? It doesn't have the xe_svm concept. No xe_svm_range/page-table state, a single RB tree per gpuvm, no need to re-construct xe_vma.... Depending on how we want to solve the multiple-device problem, the xe_svm concept can come back though...
Oak
> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Sent: Wednesday, March 13, 2024 6:56 AM
> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> <oak.zeng@intel.com>
> Cc: intel-xe@lists.freedesktop.org
> Subject: Re: Separating xe_vma- and page-table state
>
> On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > Hi Thomas,
> >
> >
>
> ....
>
> > Thomas:
> >
> > I like the idea of VMAs in the PT code function being marked as const
> > and having the xe_pt_state as non const. It makes ownership very
> > clear.
> >
> > Not sure how that will fit into [1] as that series passes around
> > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does
> > this
> > to make "struct xe_vm_ops" a single atomic operation. The VMAs are
> > extracted either the GPUVM base operation or "struct xe_vma_op".
> > Maybe
> > these can be const? I'll look into that but this might not work out
> > in
> > practice.
> >
> > Agree also unsure how 1:N xe_vma <-> xe_pt_state relationship fits in
> > hmmptrs. Could you explain your thinking here?
>
> There is a need for hmmptrs to be sparse. When we fault we create a
> chunk of PTEs that we populate. This chunk could potentially be large
> and covering the whole CPU vma or it could be limited to, say 2MiB and
> aligned to allow for large page-table entries. In Oak's POC these
> chunks are called "svm ranges"
>
> So the question arises, how do we map that to the current vma
> management and page-table code? There are basically two ways:
>
> 1) Split VMAs so they are either fully populated or unpopulated, each
> svm_range becomes an xe_vma.
> 2) Create xe_pt_range / xe_pt_state or whatever, with a 1:1 mapping with
> the svm_range and a 1:N mapping with xe_vmas.
>
> Initially my thinking was that 1) would be the simplest approach with
> the code we have today. I raised that briefly with Sima and he answered
> "And why would we want to do that?", and the answer at hand was ofc
> that the page-table code worked with vmas. Or rather that we mix vma
> state (the hmmptr range / attributes) and page-table state (the regions
> of the hmmptr that are actually populated), so it would be a
> consequence of our current implementation (limitations).
>
> With the suggestion to separate vma state and pt state, the xe_svm
> ranges map to pt state and are managed per hmmptr vma. The vmas would
> then be split mainly as a result of UMD mapping something else (bo) on
> top, or UMD giving new memory attributes for a range (madvise type of
> operations).
>
> /Thomas
* Re: Separating xe_vma- and page-table state
2024-03-13 10:56 ` Thomas Hellström
2024-03-13 17:06 ` Zeng, Oak
@ 2024-03-13 19:43 ` Matthew Brost
2024-03-14 8:57 ` Thomas Hellström
1 sibling, 1 reply; 11+ messages in thread
From: Matthew Brost @ 2024-03-13 19:43 UTC (permalink / raw)
To: Thomas Hellström; +Cc: Zeng, Oak, intel-xe@lists.freedesktop.org
On Wed, Mar 13, 2024 at 11:56:12AM +0100, Thomas Hellström wrote:
> On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > Hi Thomas,
> >
> >
>
> ....
>
> > Thomas:
> >
> > I like the idea of VMAs in the PT code function being marked as const
> > and having the xe_pt_state as non const. It makes ownership very
> > clear.
> >
> > Not sure how that will fit into [1] as that series passes around
> > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It does
> > this
> > to make "struct xe_vm_ops" a single atomic operation. The VMAs are
> > extracted either the GPUVM base operation or "struct xe_vma_op".
> > Maybe
> > these can be const? I'll look into that but this might not work out
> > in
> > practice.
> >
> > Agree also unsure how 1:N xe_vma <-> xe_pt_state relationship fits in
> > hmmptrs. Could you explain your thinking here?
>
> There is a need for hmmptrs to be sparse. When we fault we create a
> chunk of PTEs that we populate. This chunk could potentially be large
> and covering the whole CPU vma or it could be limited to, say 2MiB and
> aligned to allow for large page-table entries. In Oak's POC these
> chunks are called "svm ranges"
>
> So the question arises, how do we map that to the current vma
> management and page-table code? There are basically two ways:
>
> 1) Split VMAs so they are either fully populated or unpopulated, each
> svm_range becomes an xe_vma.
> 2) Create xe_pt_range / xe_pt_state or whatever, with a 1:1 mapping with
> the svm_range and a 1:N mapping with xe_vmas.
>
> Initially my thinking was that 1) would be the simplest approach with
> the code we have today. I raised that briefly with Sima and he answered
> "And why would we want to do that?", and the answer at hand was ofc
> that the page-table code worked with vmas. Or rather that we mix vma
> state (the hmmptr range / attributes) and page-table state (the regions
> of the hmmptr that are actually populated), so it would be a
> consequence of our current implementation (limitations).
>
> With the suggestion to separate vma state and pt state, the xe_svm
> ranges map to pt state and are managed per hmmptr vma. The vmas would
> then be split mainly as a result of UMD mapping something else (bo) on
> top, or UMD giving new memory attributes for a range (madvise type of
> operations).
>
Thanks for the explanation.
My 2 cents.
Do an initial implementation based on 1) - split VMAs so they are
either fully populated or unpopulated; each svm_range becomes an xe_vma.
This is what I did in a quick PoC [A] and got a basic system allocator
working in ~500 loc. Is it fully baked? No, but it's also not all that far
off.
We get the uAPI in place, agree on all the semantics, get IGTs in place,
and basic UMD support. We walk before we run...
Then transform the code to 2). The real benefits of 2) are going to be the
memory footprint of all the state and avoiding the fragmentation of the
GPUVM RB tree that comes with splitting VMAs. I'll state one obvious
benefit: xe_pt_state < xe_vma in size.
I'll give a quick example of the fragmentation issue too.
User binds 0-2^46 - 1 xe_vma
Fault - 3 xe_vma (split into 2 unpopulated xe_vma, 1 populated xe_vma)
Free - 3 xe_vma (replace 1 populated xe_vma with 1 unpopulated xe_vma)
Do this a couple of million times and we will have tons of fragmentation and
unnecessary xe_vmas floating around. So pursuing 2) is probably
worthwhile.
Is 2) going to be quite a bit more complicated to implement? Let me give a
fairly simple example.
User binds 0-2^46 - 1 xe_vma
Tons of faults occur - 1 xe_vma, tons of xe_pt_states
User binds a BO - 3 xe_vma; need to split xe_pt_states between the 2 new xe_vmas, possibly unmap some xe_pt_states, yikes...
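A hypothetical sketch of that split step, reusing the interval-tree
layout Thomas sketched (simplified in that xe_pt_states overlapping the
new BO range are unmapped rather than trimmed):

static void xe_pt_states_split(struct xe_vma *old, struct xe_vma *left,
                               struct xe_vma *right, u64 bo_start,
                               u64 bo_last)
{
        struct interval_tree_node *node =
                interval_tree_iter_first(&old->pt_states, 0, U64_MAX);

        while (node) {
                struct interval_tree_node *next =
                        interval_tree_iter_next(node, 0, U64_MAX);
                struct xe_pt_state *pts =
                        container_of(node, struct xe_pt_state, itree);

                interval_tree_remove(node, &old->pt_states);
                if (node->last < bo_start)
                        interval_tree_insert(node, &left->pt_states);
                else if (node->start > bo_last)
                        interval_tree_insert(node, &right->pt_states);
                else
                        xe_pt_state_unmap(pts); /* hypothetical */

                node = next;
        }
}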
While 2) is more complicated, I don't think it is a complete rewrite from
1) either. [A] is built on top of [B], which introduces the concept of PT
OPs (in addition to the existing concepts of GPUVM op / xe_vma_op). With a
bit more separation I think getting 2) to work won't be all that
difficult, and the upper layers (IOCTLs, migrate code, fault handler,
etc.) should be largely unchanged from 1) -> 2).
With a staging plan of 1) -> 2) I think this is doable. Thoughts?
[A] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commits/system_allocator_poc/?ref_type=heads
[B] https://patchwork.freedesktop.org/series/125608/
Matt
> /Thomas
* Re: Separating xe_vma- and page-table state
2024-03-13 17:06 ` Zeng, Oak
@ 2024-03-14 8:52 ` Thomas Hellström
2024-03-14 16:00 ` Zeng, Oak
0 siblings, 1 reply; 11+ messages in thread
From: Thomas Hellström @ 2024-03-14 8:52 UTC (permalink / raw)
To: Zeng, Oak, Brost, Matthew; +Cc: intel-xe@lists.freedesktop.org
On Wed, 2024-03-13 at 17:06 +0000, Zeng, Oak wrote:
> Hi Thomas,
>
> For simplicity of the discussion, let's forget about BO vm_bind,
> forget about memory attributes for a moment... Only consider system
> allocator. So with the scheme below, we have a gigantic xe_vma in the
> background holding some immutable state, never split. And we have
> mutable page-table state which is created during GPU access and
> destroyed dynamically during CPU munmap/invalidation.
>
> For the mutable page-table state, you would maintain another RB-tree
> so you can search it, as I did in the POC; the tree is in xe_svm. For
> BO driver, you don’t need this extra tree, you just need the xe_vma
> tree as xe_vma has 1:1 mapping with page-table-state for BO driver...
>
> I saw this scheme can be aligned with my POC....
>
> Mapping this scheme to the userptr "free without vm_unbind" thing, I
> can see that when the user frees, we can destroy the page-table state
> during the mmu notifier callback, while keeping the xe_vma. Is this also how you look
> at it?
>
> Need to say, the "free without vm_unbind" thing should only affect
> our decision temporarily: once system allocator is ready, UMD
> wouldn't need the userptr vm_bind anymore, so the problem will be
> more perfectly solved with system allocator - umd just remove the
> vm_bind, things would magically work with system allocator. I guess
> what user need is really a system allocator but we don't have it at
> that time, so userptr technology is used. For long term, system
> allocator should eventually replace userptr.
I mostly agree on the above, I think.
>
> One thing I can't picture clearly is how hard it is to change the
> current xekmd to separate xe_vma into mutable and immutable parts.
It's not that hard at all, it's mostly changing the xe_pt.c interfaces.
An obstacle, though, is that we don't want to do this before Matt's big
vm_bind refactoring is reviewed and in place.
>
>
> Is the split scheme with xe_vma maintaining both mutable and
> immutable simpler? It doesn't have the xe_svm concept. No
> xe_svm_range/page_table-state, single RB tree per gpuvm, no need to
> re-construct xe_vma....depending on how we want to solve the multiple
> device problem, the xe_svm concept can come back though...
For the ordinary VMA types we have today, userptr / BO / NULL, it's
neither simpler nor more complex IMO, but it makes the code clearer and
hopefully easier to maintain.
For hmmptr/SVM it's too early to answer. Here it really depends
on whether 1) we do a 1:1 mapping between xe_vma and svm_range, or
whether 2) we do a 1:N mapping of xe_vma and svm_range. Probably both
approaches have their benefits, so I'd tend to favour Matt's suggestion
that we start off with 1), make it work, and then do a POC with 2)
to see what it looks like.
Comments, suggestions?
/Thomas
>
> Oak
>
> > -----Original Message-----
> > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Sent: Wednesday, March 13, 2024 6:56 AM
> > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > <oak.zeng@intel.com>
> > Cc: intel-xe@lists.freedesktop.org
> > Subject: Re: Separating xe_vma- and page-table state
> >
> > On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > > Hi Thomas,
> > >
> > >
> >
> > ....
> >
> > > Thomas:
> > >
> > > I like the idea of VMAs in the PT code function being marked as
> > > const
> > > and having the xe_pt_state as non const. It makes ownership very
> > > clear.
> > >
> > > Not sure how that will fit into [1] as that series passes around
> > > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It
> > > does
> > > this
> > > to make "struct xe_vm_ops" a single atomic operation. The VMAs
> > > are
> > > extracted either the GPUVM base operation or "struct xe_vma_op".
> > > Maybe
> > > these can be const? I'll look into that but this might not work
> > > out
> > > in
> > > practice.
> > >
> > > Agree also unsure how 1:N xe_vma <-> xe_pt_state relationship
> > > fits in
> > > hmmptrs. Could you explain your thinking here?
> >
> > There is a need for hmmptrs to be sparse. When we fault we create a
> > chunk of PTEs that we populate. This chunk could potentially be
> > large
> > and covering the whole CPU vma or it could be limited to, say 2MiB
> > and
> > aligned to allow for large page-table entries. In Oak's POC these
> > chunks are called "svm ranges"
> >
> > So the question arises, how do we map that to the current vma
> > management and page-table code? There are basically two ways:
> >
> > 1) Split VMAs so they are either fully populated or unpopulated,
> > each
> > svm_range becomes an xe_vma.
> > 2) Create xe_pt_range / xe_pt_state or whatever, with a 1:1 mapping
> > with
> > the svm_range and a 1:N mapping with xe_vmas.
> >
> > Initially my thinking was that 1) would be the simplest approach
> > with
> > the code we have today. I raised that briefly with Sima and he
> > answered
> > "And why would we want to do that?", and the answer at hand was ofc
> > that the page-table code worked with vmas. Or rather that we mix
> > vma
> > state (the hmmptr range / attributes) and page-table state (the
> > regions
> > of the hmmptr that are actually populated), so it would be a
> > consequence of our current implementation (limitations).
> >
> > With the suggestion to separate vma state and pt state, the xe_svm
> > ranges map to pt state and are managed per hmmptr vma. The vmas
> > would
> > then be split mainly as a result of UMD mapping something else (bo)
> > on
> > top, or UMD giving new memory attributes for a range (madvise type
> > of
> > operations).
> >
> > /Thomas
* Re: Separating xe_vma- and page-table state
2024-03-13 19:43 ` Matthew Brost
@ 2024-03-14 8:57 ` Thomas Hellström
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Hellström @ 2024-03-14 8:57 UTC (permalink / raw)
To: Matthew Brost; +Cc: Zeng, Oak, intel-xe@lists.freedesktop.org
Hi, Matt,
On Wed, 2024-03-13 at 19:43 +0000, Matthew Brost wrote:
> On Wed, Mar 13, 2024 at 11:56:12AM +0100, Thomas Hellström wrote:
> > On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > > Hi Thomas,
> > >
> > >
> >
> > ....
> >
> > > Thomas:
> > >
> > > I like the idea of VMAs in the PT code function being marked as
> > > const
> > > and having the xe_pt_state as non const. It makes ownership very
> > > clear.
> > >
> > > Not sure how that will fit into [1] as that series passes around
> > > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It
> > > does
> > > this
> > > to make "struct xe_vm_ops" a single atomic operation. The VMAs
> > > are
> > > extracted either the GPUVM base operation or "struct xe_vma_op".
> > > Maybe
> > > these can be const? I'll look into that but this might not work
> > > out
> > > in
> > > practice.
> > >
> > > Agree also unsure how 1:N xe_vma <-> xe_pt_state relationship
> > > fits in
> > > hmmptrs. Could you explain your thinking here?
> >
> > There is a need for hmmptrs to be sparse. When we fault we create a
> > chunk of PTEs that we populate. This chunk could potentially be
> > large
> > and covering the whole CPU vma or it could be limited to, say 2MiB
> > and
> > aligned to allow for large page-table entries. In Oak's POC these
> > chunks are called "svm ranges"
> >
> > So the question arises, how do we map that to the current vma
> > management and page-table code? There are basically two ways:
> >
> > 1) Split VMAs so they are either fully populated or unpopulated,
> > each
> > svm_range becomes an xe_vma.
> > 2) Create xe_pt_range / xe_pt_state or whatever, with a 1:1 mapping
> > with
> > the svm_range and a 1:N mapping with xe_vmas.
> >
> > Initially my thinking was that 1) would be the simplest approach
> > with
> > the code we have today. I raised that briefly with Sima and he
> > answered
> > "And why would we want to do that?", and the answer at hand was ofc
> > that the page-table code worked with vmas. Or rather that we mix
> > vma
> > state (the hmmptr range / attributes) and page-table state (the
> > regions
> > of the hmmptr that are actually populated), so it would be a
> > consequence of our current implementation (limitations).
> >
> > With the suggestion to separate vma state and pt state, the xe_svm
> > ranges map to pt state and are managed per hmmptr vma. The vmas
> > would
> > then be split mainly as a result of UMD mapping something else (bo)
> > on
> > top, or UMD giving new memory attributes for a range (madvise type
> > of
> > operations).
> >
>
> Thanks for the explanation.
>
> My 2 cents.
>
> Do an initial implementation based on 1) - split VMAs so they are
> either fully populated or unpopulated; each svm_range becomes an
> xe_vma.
>
> This is what I did in a quick PoC [A] and got a basic system allocator
> working in ~500 loc. Is it fully baked? No, but it's also not all that
> far off.
>
> We get the uAPI in place, agree on all the semantics, get IGTs in
> place, and basic UMD support. We walk before we run...
>
> Then transform the code to 2). The real benefits of 2) are going to be
> the memory footprint of all the state and avoiding the fragmentation
> of the GPUVM RB tree that comes with splitting VMAs. I'll state one
> obvious benefit: xe_pt_state < xe_vma in size.
>
> I'll give a quick example of the fragmentation issue too.
>
> User binds 0-2^46 - 1 xe_vma
> Fault - 3 xe_vma (split into 2 unpopulated xe_vma, 1 populated
> xe_vma)
> Free - 3 xe_vma (replace 1 populated xe_vma with 1 unpopulated
> xe_vma)
>
> Do this a couple of million times and we will have tons of
> fragmentation and unnecessary xe_vmas floating around. So pursuing 2)
> is probably worthwhile.
>
> Is 2) going to be quite a bit more complicated to implement? Let me
> give a fairly simple example.
>
> User binds 0-2^46 - 1 xe_vma
> Tons of faults occur - 1 xe_vma, tons of xe_pt_states
> User binds a BO - 3 xe_vma; need to split xe_pt_states between the 2
> new xe_vmas, possibly unmap some xe_pt_states, yikes...
>
> While 2) is more complicated, I don't think it is a complete rewrite
> from 1) either. [A] is built on top of [B], which introduces the
> concept of PT OPs (in addition to the existing concepts of GPUVM op /
> xe_vma_op). With a bit more separation I think getting 2) to work
> won't be all that difficult, and the upper layers (IOCTLs, migrate
> code, fault handler, etc.) should be largely unchanged from 1) -> 2).
>
> With a staging plan of 1) -> 2) I think this is doable. Thoughts?
>
> [A]
> https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-system-allocator/-/commits/system_allocator_poc/?ref_type=heads
> [B] https://patchwork.freedesktop.org/series/125608/
>
> Matt
Without looking at the code above, yes, I think this sounds like a good
plan. Oak, it would be good to hear your comments on this as well. I
think, given 1), implementing 2) as a POC wouldn't be too hard. Then we
also need to figure out how both approaches tie into the possibility of
making GPUVM helpers out of this...
/Thomas
* RE: Separating xe_vma- and page-table state
2024-03-14 8:52 ` Thomas Hellström
@ 2024-03-14 16:00 ` Zeng, Oak
0 siblings, 0 replies; 11+ messages in thread
From: Zeng, Oak @ 2024-03-14 16:00 UTC (permalink / raw)
To: Thomas Hellström, Brost, Matthew, Ghimiray, Himal Prasad
Cc: intel-xe@lists.freedesktop.org
Hi Thomas, Matt,
So separating xe_vma and page-table state is a good thing to have, as it cleans up the current code and makes it easier to maintain. We want to do it regardless of whether there is a requirement from the system allocator/hmmptr design. Since I wasn't able to see through how hard it is, and how much it can clean up the current code, I also asked Himal to take a closer look at it.
Here is my thoughts on the two approaches for system allocator:
1) Split the gigantic vma during gpu page fault (so maintain a 1:1 mapping b/t xe_vma and page-table-state/svm_range):
Since Matt already has some POC code, I should be able to quickly put a series together for review. From a design perspective, I think this is cleaner/simpler than 2): you have a single RB-tree for one GPUVM, which I think we should have.
2) Non-split (so a 1:N mapping):
This depends on the work separating xe-vma and page-table state, which depends on Matt's big vm_bind refactoring series. So time-wise this is going to take a long time. This is the main reason I want to go with 1) first.
Once we have cleaned up xe-vma/page-table state, this won't be hard based on my POC series.
As Matt mentioned, a benefit would be less vma fragmentation.
From a design perspective, I think this matches the gigantic vm_bind and sparse page-table population design. We will have to maintain another RB-tree for page-table-state/svm-range. Two RB-trees for one GPUVM is a little awkward to me.
So here is my plan in summary: I will continue the work with approach 1); at the same time Himal will give an evaluation of separating xe-vma and pt-state.
Thanks,
Oak
> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Sent: Thursday, March 14, 2024 4:52 AM
> To: Zeng, Oak <oak.zeng@intel.com>; Brost, Matthew
> <matthew.brost@intel.com>
> Cc: intel-xe@lists.freedesktop.org
> Subject: Re: Separating xe_vma- and page-table state
>
> On Wed, 2024-03-13 at 17:06 +0000, Zeng, Oak wrote:
> > Hi Thomas,
> >
> > For simplicity of the discussion, let's forget about BO vm_bind,
> > forget about memory attributes for a moment... Only consider system
> > allocator. So with the scheme below, we have a gigantic xe_vma in the
> > background holding some immutable state, never split. And we have
> > mutable page-table state which is created during GPU access and
> > destroyed dynamically during CPU munmap/invalidation.
> >
> > For the mutable page-table state, you would maintain another RB-tree
> > so you can search it, as I did in the POC; the tree is in xe_svm. For
> > BO driver, you don’t need this extra tree, you just need the xe_vma
> > tree as xe_vma has 1:1 mapping with page-table-state for BO driver...
> >
> > I saw this scheme can be aligned with my POC....
> >
> > Mapping this scheme to the userptr "free without vm_unbind" thing, I
> > can see that when the user frees, we can destroy the page-table state
> > during the mmu notifier callback, while keeping the xe_vma. Is this also how you look
> > at it?
> >
> > Need to say, the "free without vm_unbind" thing should only affect
> > our decision temporarily: once system allocator is ready, UMD
> > wouldn't need the userptr vm_bind anymore, so the problem will be
> > more perfectly solved with system allocator - umd just remove the
> > vm_bind, things would magically work with system allocator. I guess
> > what user need is really a system allocator but we don't have it at
> > that time, so userptr technology is used. For long term, system
> > allocator should eventually replace userptr.
>
> I mostly agree on the above, I think.
>
> >
> > One thing I can't picture clearly is how hard it is to change the
> > current xekmd to separate xe_vma into mutable and immutable parts.
>
> It's not that hard at all, it's mostly changing the xe_pt.c interfaces.
> An obstacle, though, is that we don't want to do this before Matt's big
> vm_bind refactoring is reviewed and in place.
>
> >
> >
> > Is the split scheme with xe_vma maintaining both mutable and
> > immutable simpler? It doesn't have the xe_svm concept. No
> > xe_svm_range/page_table-state, single RB tree per gpuvm, no need to
> > re-construct xe_vma....depending on how we want to solve the multiple
> > device problem, the xe_svm concept can come back though...
>
> For the ordinary VMA types we have today, userptr / BO / NULL, it's
> neither simpler nor more complex IMO, but it makes the code clearer and
> hopefully easier to maintain.
>
> For hmmptr/SVM it's too early to answer. Here it really depends
> on whether 1) we do a 1:1 mapping between xe_vma and svm_range, or
> whether 2) we do a 1:N mapping of xe_vma and svm_range. Probably both
> approaches have their benefits, so I'd tend to favour Matt's suggestion
> that we start off with 1), make it work, and then do a POC with 2)
> to see what it looks like.
>
> Comments, suggestions?
>
> /Thomas
>
>
> >
> > Oak
> >
> > > -----Original Message-----
> > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Sent: Wednesday, March 13, 2024 6:56 AM
> > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > > <oak.zeng@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org
> > > Subject: Re: Separating xe_vma- and page-table state
> > >
> > > On Wed, 2024-03-13 at 01:27 +0000, Matthew Brost wrote:
> > > > On Tue, Mar 12, 2024 at 05:02:20PM -0600, Zeng, Oak wrote:
> > > > > Hi Thomas,
> > > >
> > > >
> > >
> > > ....
> > >
> > > > Thomas:
> > > >
> > > > I like the idea of VMAs in the PT code function being marked as
> > > > const
> > > > and having the xe_pt_state as non const. It makes ownership very
> > > > clear.
> > > >
> > > > Not sure how that will fit into [1] as that series passes around
> > > > a "struct xe_vm_ops" which is a list of "struct xe_vma_op". It
> > > > does
> > > > this
> > > > to make "struct xe_vm_ops" a single atomic operation. The VMAs
> > > > are
> > > > extracted either the GPUVM base operation or "struct xe_vma_op".
> > > > Maybe
> > > > these can be const? I'll look into that but this might not work
> > > > out
> > > > in
> > > > practice.
> > > >
> > > > Agree also unsure how 1:N xe_vma <-> xe_pt_state relationship
> > > > fits in
> > > > hmmptrs. Could you explain your thinking here?
> > >
> > > There is a need for hmmptrs to be sparse. When we fault we create a
> > > chunk of PTEs that we populate. This chunk could potentially be
> > > large
> > > and covering the whole CPU vma or it could be limited to, say 2MiB
> > > and
> > > aligned to allow for large page-table entries. In Oak's POC these
> > > chunks are called "svm ranges"
> > >
> > > So the question arises, how do we map that to the current vma
> > > management and page-table code? There are basically two ways:
> > >
> > > 1) Split VMAs so they are either fully populated or unpopulated,
> > > each
> > > svm_range becomes an xe_vma.
> > > 2) Create xe_pt_range / xe_pt_state or whatever, with a 1:1 mapping
> > > with
> > > the svm_range and a 1:N mapping with xe_vmas.
> > >
> > > Initially my thinking was that 1) would be the simplest approach
> > > with
> > > the code we have today. I raised that briefly with Sima and he
> > > answered
> > > "And why would we want to do that?", and the answer at hand was ofc
> > > that the page-table code worked with vmas. Or rather that we mix
> > > vma
> > > state (the hmmptr range / attributes) and page-table state (the
> > > regions
> > > of the hmmptr that are actually populated), so it would be a
> > > consequence of our current implementation (limitations).
> > >
> > > With the suggestion to separate vma state and pt state, the xe_svm
> > > ranges map to pt state and are managed per hmmptr vma. The vmas
> > > would
> > > then be split mainly as a result of UMD mapping something else (bo)
> > > on
> > > top, or UMD giving new memory attributes for a range (madvise type
> > > of
> > > operations).
> > >
> > > /Thomas
End of thread (newest: 2024-03-14 16:04 UTC)
Thread overview: 11+ messages
2024-03-12 7:43 Separating xe_vma- and page-table state Thomas Hellström
2024-03-12 23:02 ` Zeng, Oak
2024-03-13 1:27 ` Matthew Brost
2024-03-13 2:16 ` Zeng, Oak
2024-03-13 3:16 ` Matthew Brost
2024-03-13 10:56 ` Thomas Hellström
2024-03-13 17:06 ` Zeng, Oak
2024-03-14 8:52 ` Thomas Hellström
2024-03-14 16:00 ` Zeng, Oak
2024-03-13 19:43 ` Matthew Brost
2024-03-14 8:57 ` Thomas Hellström