* CUDA fixed VA allocations and sparse mappings
@ 2015-07-07  0:42 Andrew Chew
  2015-07-07 15:29 ` [Nouveau] " Ilia Mirkin
       [not found] ` <20150707004249.GC27924-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
  0 siblings, 2 replies; 24+ messages in thread
From: Andrew Chew @ 2015-07-07  0:42 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hello,

I am currently looking into ways to support fixed virtual address allocations
and sparse mappings in nouveau, as a step towards supporting CUDA.

CUDA requires that the GPU virtual address for a given buffer match the
CPU virtual address.  Therefore, when mapping a CUDA buffer, we have to have
a way of specifying a particular virtual address to map to (we would ask that
the CPU virtual address be used).  Currently, as I understand it, the allocator
implemented in nvkm/core/mm.c, used to provision virtual addresses, doesn't
allow for this (but it's very easy to modify the allocator slightly to allow
for this, which I have done locally in my experiments).

In addition, the CUDA use case typically involves allocating a big chunk of
address space ahead of time as a way to reserve that chunk for future CUDA
use.  It then maps individual buffers into that address space as needed.
Currently, the virtual address allocation is done during buffer mapping, so
in order to support these sparse mappings, it seems to me that the virtual
address allocation and buffer mapping need to be decoupled into separate
operations.

My current strawman proposal for supporting this is to introduce two new ioctls
DRM_IOCTL_NOUVEAU_AS_ALLOC and DRM_IOCTL_NOUVEAU_AS_FREE, that look roughly
like this:

#define NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET 0x1
struct drm_nouveau_as_alloc {
        uint64_t pages;     /* in, pages */
        uint32_t page_size; /* in, bytes */
        uint32_t flags;     /* in */
        uint64_t offset;    /* in/out, byte address */
};

struct drm_nouveau_as_free {
        uint64_t offset;    /* in, byte address */
};

These ioctls just call into the allocator to allocate a range of addresses,
resulting in a struct nvkm_vma that tracks that allocation (or releases the
struct nvkm_vma back into the virtual address pool in the case of the free
ioctl).  If NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET is set, offset specifies the
requested virtual address.  Otherwise, an arbitrary address will be
allocated.
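
A minimal sketch of the allocation behaviour described above, assuming a toy
address space that tracks reservations in a small fixed table.  Only
struct drm_nouveau_as_alloc and the flag come from the proposal; struct toy_as,
toy_as_alloc() and toy_as_free() are hypothetical names for illustration and
are not the nvkm/core/mm.c allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET 0x1
struct drm_nouveau_as_alloc {
        uint64_t pages;     /* in, pages */
        uint32_t page_size; /* in, bytes */
        uint32_t flags;     /* in */
        uint64_t offset;    /* in/out, byte address */
};

/* Hypothetical toy address space: window [base, limit) plus a range table. */
#define TOY_MAX_RANGES 16
struct toy_range { uint64_t start, length; bool used; };
struct toy_as {
        uint64_t base, limit;
        struct toy_range ranges[TOY_MAX_RANGES];
};

static uint64_t toy_align_up(uint64_t v, uint64_t a)
{
        return (v + a - 1) / a * a;
}

/* True if [start, start + length) lies in the window and overlaps nothing. */
static bool toy_range_is_free(const struct toy_as *as, uint64_t start,
                              uint64_t length)
{
        if (start < as->base || start > as->limit || length > as->limit - start)
                return false;
        for (int i = 0; i < TOY_MAX_RANGES; i++) {
                const struct toy_range *r = &as->ranges[i];
                if (r->used && start < r->start + r->length &&
                    r->start < start + length)
                        return false;
        }
        return true;
}

/* Returns 0 and fills req->offset on success, -1 on failure. */
static int toy_as_alloc(struct toy_as *as, struct drm_nouveau_as_alloc *req)
{
        uint64_t length, cand, start = 0;
        bool found = false;
        int slot = -1;

        assert(as && req);
        if (req->pages == 0 || req->page_size == 0)
                return -1;
        length = req->pages * req->page_size; /* overflow ignored in this toy */

        for (int i = 0; i < TOY_MAX_RANGES && slot < 0; i++)
                if (!as->ranges[i].used)
                        slot = i;
        if (slot < 0)
                return -1;

        if (req->flags & NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET) {
                /* Fixed: honour the requested address or fail. */
                start = req->offset;
                found = start % req->page_size == 0 &&
                        toy_range_is_free(as, start, length);
        } else {
                /* Arbitrary: lowest fit among the base and each range end. */
                cand = toy_align_up(as->base, req->page_size);
                if (toy_range_is_free(as, cand, length)) {
                        start = cand;
                        found = true;
                }
                for (int i = 0; i < TOY_MAX_RANGES; i++) {
                        const struct toy_range *r = &as->ranges[i];
                        if (!r->used)
                                continue;
                        cand = toy_align_up(r->start + r->length, req->page_size);
                        if (toy_range_is_free(as, cand, length) &&
                            (!found || cand < start)) {
                                start = cand;
                                found = true;
                        }
                }
        }
        if (!found)
                return -1;
        as->ranges[slot] = (struct toy_range){ start, length, true };
        req->offset = start;
        return 0;
}

/* Releases the reservation that starts at offset; -1 if there is none. */
static int toy_as_free(struct toy_as *as, uint64_t offset)
{
        for (int i = 0; i < TOY_MAX_RANGES; i++) {
                if (as->ranges[i].used && as->ranges[i].start == offset) {
                        as->ranges[i].used = false;
                        return 0;
                }
        }
        return -1;
}
```

In this sketch a fixed request that collides with an existing reservation
simply fails; whether the real ioctl should fail or fall back to another
address is a design choice the proposal leaves open.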

In addition to this, a way to map/unmap buffers is needed.  Ordinarily, one
would just use DRM_IOCTL_PRIME_FD_TO_HANDLE to import and map a dmabuf into
gem.  However, this ioctl will try to grab the virtual address range for this
buffer, which will fail in the CUDA case since the virtual address range
has been reserved ahead of time.  So we perhaps introduce a set of ioctls
to map/unmap buffers on top of an already existing virtual address allocation.

Please, feedback and questions are very much appreciated.
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


* Re: [Nouveau] CUDA fixed VA allocations and sparse mappings
  2015-07-07  0:42 CUDA fixed VA allocations and sparse mappings Andrew Chew
@ 2015-07-07 15:29 ` Ilia Mirkin
       [not found]   ` <CAKb7UviePF2XcmyeKHQ2cv=hy=NZyYcMrWiTpajJxTFE+10LwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found] ` <20150707004249.GC27924-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
  1 sibling, 1 reply; 24+ messages in thread
From: Ilia Mirkin @ 2015-07-07 15:29 UTC (permalink / raw)
  To: Andrew Chew
  Cc: nouveau@lists.freedesktop.org, dri-devel@lists.freedesktop.org

On Mon, Jul 6, 2015 at 8:42 PM, Andrew Chew <achew@nvidia.com> wrote:
> Hello,
>
> I am currently looking into ways to support fixed virtual address allocations
> and sparse mappings in nouveau, as a step towards supporting CUDA.
>
> CUDA requires that the GPU virtual address for a given buffer match the
> CPU virtual address.  Therefore, when mapping a CUDA buffer, we have to have
> a way of specifying a particular virtual address to map to (we would ask that
> the CPU virtual address be used).  Currently, as I understand it, the allocator
> implemented in nvkm/core/mm.c, used to provision virtual addresses, doesn't
> allow for this (but it's very easy to modify the allocator slightly to allow
> for this, which I have done locally in my experiments).
>
> In addition, the CUDA use case typically involves allocating a big chunk of
> address space ahead of time as a way to reserve that chunk for future CUDA
> use.  It then maps individual buffers into that address space as needed.
> Currently, the virtual address allocation is done during buffer mapping, so
> in order to support these sparse mappings, it seems to me that the virtual
> address allocation and buffer mapping need to be decoupled into separate
> operations.
>
> My current strawman proposal for supporting this is to introduce two new ioctls
> DRM_IOCTL_NOUVEAU_AS_ALLOC and DRM_IOCTL_NOUVEAU_AS_FREE, that look roughly
> like this:
>
> #define NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET 0x1
> struct drm_nouveau_as_alloc {
>         uint64_t pages;     /* in, pages */
>         uint32_t page_size; /* in, bytes */
>         uint32_t flags;     /* in */
>         uint64_t offset;    /* in/out, byte address */
> };
>
> struct drm_nouveau_as_free {
>         uint64_t offset;    /* in, byte address */
> };
>
> These ioctls just call into the allocator to allocate a range of addresses,
> resulting in a struct nvkm_vma that tracks that allocation (or releases the
> struct nvkm_vma back into the virtual address pool in the case of the free
> ioctl).  If NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET is set, offset specifies the
> requested virtual address.  Otherwise, an arbitrary address will be
> allocated.

Well, this can't just be an address space. You still need bo's, if
this is to work with nouveau -- it has to know when to swap things in
and out, when they're used, etc. (and/or move between VRAM and GART
and system/swap). I suspect that your targets here are the GK20A and
GM20B chips, which don't have dedicated VRAM, but the ioctls need to
work for everything.

Would it be sufficient to extend NOUVEAU_GEM_NEW or create a
NOUVEAU_GEM_NEW_FIXED or something? IOW, why do we have to separate the
concept of a GEM object and a VM allocation?

>
> In addition to this, a way to map/unmap buffers is needed.  Ordinarily, one
> would just use DRM_IOCTL_PRIME_FD_TO_HANDLE to import and map a dmabuf into
> gem.  However, this ioctl will try to grab the virtual address range for this
> buffer, which will fail in the CUDA case since the virtual address range
> has been reserved ahead of time.  So we perhaps introduce a set of ioctls
> to map/unmap buffers on top of an already existing virtual address allocation.

My suggestion above is an alternative to this, right? I think dmabufs
tend to be used for sharing between devices. I suspect there's more
going on here that I don't understand though -- I assume the CUDA
use-case is similar to the HSA use-case -- being able to build up data
structures that point to one another on the CPU and then process them
on the GPU? Can you detail a specific use-case perhaps, including the
interactions with the GPU and its address space?

Jérôme, I believe you were doing the HSA kernel implementation.
Perhaps you'd have some feedback on this proposal?

Cheers,

  -ilia
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: CUDA fixed VA allocations and sparse mappings
       [not found]   ` <CAKb7UviePF2XcmyeKHQ2cv=hy=NZyYcMrWiTpajJxTFE+10LwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-07 17:27     ` Jerome Glisse
  2015-07-09  9:26       ` [Nouveau] " Oded Gabbay
  2015-07-07 18:47     ` Andrew Chew
  1 sibling, 1 reply; 24+ messages in thread
From: Jerome Glisse @ 2015-07-07 17:27 UTC (permalink / raw)
  To: Ilia Mirkin
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On Tue, Jul 07, 2015 at 11:29:38AM -0400, Ilia Mirkin wrote:
> On Mon, Jul 6, 2015 at 8:42 PM, Andrew Chew <achew@nvidia.com> wrote:
> > Hello,
> >
> > I am currently looking into ways to support fixed virtual address allocations
> > and sparse mappings in nouveau, as a step towards supporting CUDA.
> >
> > CUDA requires that the GPU virtual address for a given buffer match the
> > CPU virtual address.  Therefore, when mapping a CUDA buffer, we have to have
> > a way of specifying a particular virtual address to map to (we would ask that
> > the CPU virtual address be used).  Currently, as I understand it, the allocator
> > implemented in nvkm/core/mm.c, used to provision virtual addresses, doesn't
> > allow for this (but it's very easy to modify the allocator slightly to allow
> > for this, which I have done locally in my experiments).
> >
> > In addition, the CUDA use case typically involves allocating a big chunk of
> > address space ahead of time as a way to reserve that chunk for future CUDA
> > use.  It then maps individual buffers into that address space as needed.
> > Currently, the virtual address allocation is done during buffer mapping, so
> > in order to support these sparse mappings, it seems to me that the virtual
> > address allocation and buffer mapping need to be decoupled into separate
> > operations.
> >
> > My current strawman proposal for supporting this is to introduce two new ioctls
> > DRM_IOCTL_NOUVEAU_AS_ALLOC and DRM_IOCTL_NOUVEAU_AS_FREE, that look roughly
> > like this:
> >
> > #define NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET 0x1
> > struct drm_nouveau_as_alloc {
> >         uint64_t pages;     /* in, pages */
> >         uint32_t page_size; /* in, bytes */
> >         uint32_t flags;     /* in */
> >         uint64_t offset;    /* in/out, byte address */
> > };
> >
> > struct drm_nouveau_as_free {
> >         uint64_t offset;    /* in, byte address */
> > };
> >
> > These ioctls just call into the allocator to allocate a range of addresses,
> > resulting in a struct nvkm_vma that tracks that allocation (or releases the
> > struct nvkm_vma back into the virtual address pool in the case of the free
> > ioctl).  If NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET is set, offset specifies the
> > requested virtual address.  Otherwise, an arbitrary address will be
> > allocated.
> 
> Well, this can't just be an address space. You still need bo's, if
> this is to work with nouveau -- it has to know when to swap things in
> and out, when they're used, etc. (and/or move between VRAM and GART
> and system/swap). I suspect that your targets here are the GK20A and
> GM20B chips, which don't have dedicated VRAM, but the ioctls need to
> work for everything.
> 
> Would it be sufficient to extend NOUVEAU_GEM_NEW or create a
> NOUVEAU_GEM_NEW_FIXED or something? IOW, why do we have to separate the
> concept of a GEM object and a VM allocation?

Well, maybe something like what I did for radeon. With radeon you have two
sets of ioctls: one to create/delete a bo (GEM stuff) and one to associate a
virtual address with a bo. I wanted to let userspace decide on the virtual
address of a buffer for precisely the same reason CUDA does it, i.e. to allow
mapping a buffer at the same address in the GPU address space as in the CPU
address space. So far we never really took advantage of that on the radeon
side.

Also, on radeon you can map the same bo at different virtual addresses in the
same process (you will need a different file descriptor for each mapping, and
you can only submit a command stream using the mappings valid for that file
descriptor). Though this is mostly useful when sharing the same bo across
different processes.

I think my radeon virtual address ioctls are a nice design, but others might
disagree. If you want to look at the code:

  drivers/gpu/drm/radeon/radeon_vm.c
  drivers/gpu/drm/radeon/radeon_gem.c

Grep for _va (virtual address per bo) or _vm (virtual address manager per
file descriptor) in function and structure names.

On the command stream and bo eviction side everything is as usual on radeon.
So a bo can be evicted between two command streams to make room for another
one. Either its mapping is invalidated or it is updated to point to system
memory. So most of the logic for everything else remains the same (you just
need to update the multiple virtual address spaces).
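
A rough sketch of the split described above, assuming a toy model in which a
buffer object (bo) exists independently of any virtual address, each file
descriptor owns one VM, and moving a bo updates every mapping that points at
it.  All names here (toy_bo, toy_vm, toy_vm_map, toy_bo_move,
toy_vm_translate) are hypothetical and are not the radeon code; eviction is
modelled only as "update the mapping", not as invalidation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_placement { TOY_VRAM, TOY_SYSTEM };

/* A buffer object: size plus where its backing memory currently lives. */
struct toy_bo {
        uint64_t size;
        enum toy_placement placement;
        uint64_t backing; /* address of the backing memory in that placement */
};

/* One bo-to-virtual-address association ("_va" in the text above). */
struct toy_va {
        struct toy_bo *bo;    /* NULL when the slot is unused */
        uint64_t gpu_addr;    /* virtual address chosen by userspace */
        uint64_t pte_target;  /* what the page tables currently point at */
};

/* One address space per file descriptor ("_vm" in the text above). */
#define TOY_MAX_VA 8
struct toy_vm {
        struct toy_va va[TOY_MAX_VA];
};

/* Associates bo with gpu_addr in vm; at most one mapping per bo per VM. */
static int toy_vm_map(struct toy_vm *vm, struct toy_bo *bo, uint64_t gpu_addr)
{
        int slot = -1;

        assert(vm && bo);
        for (int i = 0; i < TOY_MAX_VA; i++) {
                const struct toy_va *va = &vm->va[i];
                if (!va->bo) {
                        if (slot < 0)
                                slot = i;
                        continue;
                }
                if (va->bo == bo)
                        return -1;
                if (gpu_addr < va->gpu_addr + va->bo->size &&
                    va->gpu_addr < gpu_addr + bo->size)
                        return -1; /* overlaps an existing mapping */
        }
        if (slot < 0)
                return -1;
        vm->va[slot] = (struct toy_va){ bo, gpu_addr, bo->backing };
        return 0;
}

/* Moves a bo (e.g. eviction to system memory) and fixes up every VM. */
static void toy_bo_move(struct toy_bo *bo, enum toy_placement placement,
                        uint64_t backing, struct toy_vm *const *vms,
                        size_t nvms)
{
        bo->placement = placement;
        bo->backing = backing;
        for (size_t v = 0; v < nvms; v++)
                for (int i = 0; i < TOY_MAX_VA; i++)
                        if (vms[v]->va[i].bo == bo)
                                vms[v]->va[i].pte_target = backing;
}

/* Resolves a GPU virtual address to its current backing address. */
static int toy_vm_translate(const struct toy_vm *vm, uint64_t gpu_addr,
                            uint64_t *out)
{
        for (int i = 0; i < TOY_MAX_VA; i++) {
                const struct toy_va *va = &vm->va[i];
                if (va->bo && gpu_addr >= va->gpu_addr &&
                    gpu_addr - va->gpu_addr < va->bo->size) {
                        *out = va->pte_target + (gpu_addr - va->gpu_addr);
                        return 0;
                }
        }
        return -1;
}
```

The point of the sketch is that the virtual address chosen by userspace stays
stable across a move; only the target of the mapping changes.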


> 
> >
> > In addition to this, a way to map/unmap buffers is needed.  Ordinarily, one
> > would just use DRM_IOCTL_PRIME_FD_TO_HANDLE to import and map a dmabuf into
> > gem.  However, this ioctl will try to grab the virtual address range for this
> > buffer, which will fail in the CUDA case since the virtual address range
> > has been reserved ahead of time.  So we perhaps introduce a set of ioctls
> > to map/unmap buffers on top of an already existing virtual address allocation.
> 
> My suggestion above is an alternative to this, right? I think dmabufs
> tend to be used for sharing between devices. I suspect there's more
> going on here that I don't understand though -- I assume the CUDA
> use-case is similar to the HSA use-case -- being able to build up data
> structures that point to one another on the CPU and then process them
> on the GPU? Can you detail a specific use-case perhaps, including the
> interactions with the GPU and its address space?

I think you nailed it: it is really about having the same address pointing to
the same thing on both the GPU and the CPU. But this is also valid and useful
for VRAM. OpenCL 2.0 has various levels of transparent address space (probably
not the term used in the spec), and the lowest level would need something like
what radeon has to work. The most advanced level needs more plumbing inside
the core kernel mm or inside the CPU and GPU hardware.


> Jérôme, I believe you were doing the HSA kernel implementation.
> Perhaps you'd have some feedback on this proposal?

No, I did not do the HSA stuff; the AMD team led by Oded did :)

Cheers,
Jérôme

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]   ` <CAKb7UviePF2XcmyeKHQ2cv=hy=NZyYcMrWiTpajJxTFE+10LwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-07-07 17:27     ` Jerome Glisse
@ 2015-07-07 18:47     ` Andrew Chew
  1 sibling, 0 replies; 24+ messages in thread
From: Andrew Chew @ 2015-07-07 18:47 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On Tue, Jul 07, 2015 at 11:29:38AM -0400, Ilia Mirkin wrote:
> On Mon, Jul 6, 2015 at 8:42 PM, Andrew Chew <achew@nvidia.com> wrote:
> > These ioctls just call into the allocator to allocate a range of addresses,
> > resulting in a struct nvkm_vma that tracks that allocation (or releases the
> > struct nvkm_vma back into the virtual address pool in the case of the free
> > ioctl).  If NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET is set, offset specifies the
> > requested virtual address.  Otherwise, an arbitrary address will be
> > allocated.
> 
> Well, this can't just be an address space. You still need bo's, if
> this is to work with nouveau -- it has to know when to swap things in
> and out, when they're used, etc. (and/or move between VRAM and GART
> and system/swap). I suspect that your targets here are the GK20A and
> GM20B chips, which don't have dedicated VRAM, but the ioctls need to
> work for everything.
> 
> Would it be sufficient to extend NOUVEAU_GEM_NEW or create a
> NOUVEAU_GEM_NEW_FIXED or something? IOW, why do we have to separate the
> concept of a GEM object and a VM allocation?

You're correct.  This is for gk20a and gm20b.

The thing these proposed ioctls are supposed to accomplish is to reserve,
ahead of time, a portion of the address space.  So at this time, there
really aren't any buffer objects yet, and there's nothing to be mapped to
the GMMU.  That part would come later.

> > In addition to this, a way to map/unmap buffers is needed.  Ordinarily, one
> > would just use DRM_IOCTL_PRIME_FD_TO_HANDLE to import and map a dmabuf into
> > gem.  However, this ioctl will try to grab the virtual address range for this
> > buffer, which will fail in the CUDA case since the virtual address range
> > has been reserved ahead of time.  So we perhaps introduce a set of ioctls
> > to map/unmap buffers on top of an already existing virtual address allocation.
> 
> My suggestion above is an alternative to this, right? I think dmabufs
> tend to be used for sharing between devices. I suspect there's more
> going on here that I don't understand though -- I assume the CUDA
> use-case is similar to the HSA use-case -- being able to build up data
> structures that point to one another on the CPU and then process them
> on the GPU? Can you detail a specific use-case perhaps, including the
> interactions with the GPU and its address space?

The whole dmabufs thing is kind of a side issue.  I'll take a look at
NOUVEAU_GEM_NEW, but that could be an alternative to this, maybe, if
extended (or we make a new NOUVEAU_GEM_NEW_FIXED, as you suggested).
Crucially, the NOUVEAU_GEM_NEW_FIXED operation shouldn't result in trying
to get a virtual address region and then failing because a previous
operation (see above) has reserved it already.

The use case is exactly as you describe.  There are data structures built
up that contain CPU pointers, and those pointers need to make sense to
the GPU as well.
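
To illustrate the use case, here is a small sketch, assuming a toy "GPU" that
resolves embedded pointers through a single mapping of a buffer at gpu_base.
The list nodes store CPU virtual addresses; the walk only succeeds without
fixups when the buffer is mapped at the same address on both sides.  The names
toy_node, toy_build_list and toy_gpu_sum are hypothetical, and the model
ignores page tables entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A list node as the CPU builds it: next holds a CPU virtual address. */
struct toy_node {
        uint64_t next;  /* CPU virtual address of the next node, 0 ends */
        int64_t value;
};

/* Builds a list 1, 2, ..., count inside the caller's node array. */
static void toy_build_list(struct toy_node *nodes, size_t count)
{
        assert(nodes && count > 0);
        for (size_t i = 0; i < count; i++) {
                nodes[i].value = (int64_t)(i + 1);
                nodes[i].next = (i + 1 < count)
                        ? (uint64_t)(uintptr_t)&nodes[i + 1] : 0;
        }
}

/*
 * Walks the list as the toy GPU would: every address is interpreted as a
 * GPU virtual address inside the mapping [gpu_base, gpu_base + size).
 * Returns 0 and the sum on success, -1 if an address falls outside the
 * mapping (a "fault").
 */
static int toy_gpu_sum(const void *backing, size_t size, uint64_t gpu_base,
                       uint64_t head, int64_t *sum)
{
        uint64_t addr = head;
        size_t steps = 0;

        *sum = 0;
        while (addr != 0) {
                uint64_t off = addr - gpu_base; /* wraps if addr < gpu_base */
                struct toy_node n;

                if (off > size || size - off < sizeof(n))
                        return -1;
                if (++steps > size / sizeof(n))
                        return -1; /* guard against cycles */
                memcpy(&n, (const unsigned char *)backing + off, sizeof(n));
                *sum += n.value;
                addr = n.next;
        }
        return 0;
}
```

With gpu_base equal to the CPU address of the buffer, the walk needs no
pointer rewriting.  With any other gpu_base, the first embedded pointer
resolves outside the mapping in this model, which is why the pointers would
otherwise have to be rebased before the GPU could use them.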

* Re: CUDA fixed VA allocations and sparse mappings
       [not found] ` <20150707004249.GC27924-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
@ 2015-07-07 21:09   ` Ben Skeggs
       [not found]     ` <CACAvsv6=OwXnabpY5c_HHaMkumV-QqCvPd+zia15S_G+Oq29UA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Ben Skeggs @ 2015-07-07 21:09 UTC (permalink / raw)
  To: Andrew Chew; +Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On 7 July 2015 at 10:42, Andrew Chew <achew@nvidia.com> wrote:
> Hello,
>
> I am currently looking into ways to support fixed virtual address allocations
> and sparse mappings in nouveau, as a step towards supporting CUDA.
Hey Andrew,

Sparse mappings were something I'd actually planned on doing too in
the near future, though I haven't yet settled on exactly how they'd be
exposed.

Fixed address allocations weren't going to be part of that, but I see
that it makes sense for a variety of use cases.  One question I have
here is how this is intended to work where the RM needs to make some
of these allocations itself (for graphics context mapping, etc), how
should potential conflicts with user mappings be handled?

Thanks,
Ben.

>
> CUDA requires that the GPU virtual address for a given buffer match the
> CPU virtual address.  Therefore, when mapping a CUDA buffer, we have to have
> a way of specifying a particular virtual address to map to (we would ask that
> the CPU virtual address be used).  Currently, as I understand it, the allocator
> implemented in nvkm/core/mm.c, used to provision virtual addresses, doesn't
> allow for this (but it's very easy to modify the allocator slightly to allow
> for this, which I have done locally in my experiments).
>
> In addition, the CUDA use case typically involves allocating a big chunk of
> address space ahead of time as a way to reserve that chunk for future CUDA
> use.  It then maps individual buffers into that address space as needed.
> Currently, the virtual address allocation is done during buffer mapping, so
> in order to support these sparse mappings, it seems to me that the virtual
> address allocation and buffer mapping need to be decoupled into separate
> operations.
>
> My current strawman proposal for supporting this is to introduce two new ioctls
> DRM_IOCTL_NOUVEAU_AS_ALLOC and DRM_IOCTL_NOUVEAU_AS_FREE, that look roughly
> like this:
>
> #define NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET 0x1
> struct drm_nouveau_as_alloc {
>         uint64_t pages;     /* in, pages */
>         uint32_t page_size; /* in, bytes */
>         uint32_t flags;     /* in */
>         uint64_t offset;    /* in/out, byte address */
> };
>
> struct drm_nouveau_as_free {
>         uint64_t offset;    /* in, byte address */
> };
>
> These ioctls just call into the allocator to allocate a range of addresses,
> resulting in a struct nvkm_vma that tracks that allocation (or releases the
> struct nvkm_vma back into the virtual address pool in the case of the free
> ioctl).  If NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET is set, offset specifies the
> requested virtual address.  Otherwise, an arbitrary address will be
> allocated.
>
> In addition to this, a way to map/unmap buffers is needed.  Ordinarily, one
> would just use DRM_IOCTL_PRIME_FD_TO_HANDLE to import and map a dmabuf into
> gem.  However, this ioctl will try to grab the virtual address range for this
> buffer, which will fail in the CUDA case since the virtual address range
> has been reserved ahead of time.  So we perhaps introduce a set of ioctls
> to map/unmap buffers on top of an already existing virtual address allocation.
>
> Please, feedback and questions are very much appreciated.

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]     ` <CACAvsv6=OwXnabpY5c_HHaMkumV-QqCvPd+zia15S_G+Oq29UA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-07 23:53       ` C Bergström
       [not found]         ` <CAOnawYpbqZ04-h2q4JpWjWfygPk5UQX9JWC4oj0RWNn7rzhcBA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: C Bergström @ 2015-07-07 23:53 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

regarding
--------
Fixed address allocations weren't going to be part of that, but I see
that it makes sense for a variety of use cases.  One question I have
here is how this is intended to work where the RM needs to make some
of these allocations itself (for graphics context mapping, etc), how
should potential conflicts with user mappings be handled?
--------
As an initial implementation you can probably assume that the GPU
offloading is in "exclusive" mode, meaning that the CUDA or OpenACC
code has full ownership of the card. The Tesla cards don't even have a
video out on them. To complicate this even more, some offloading code
has very long-running kernels and, even worse, may critically depend
on using the full available GPU RAM (large matrix sizes and soon big
Fortran arrays or complex data types).

Long term, direct PCIe copies between cards (aka zero-copy) will be
important. It may seem crazy, but when you have 16+ GPUs in a single
workstation (Cirrascale), stuff like this is key.

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]         ` <CAOnawYpbqZ04-h2q4JpWjWfygPk5UQX9JWC4oj0RWNn7rzhcBA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-07 23:58           ` Ben Skeggs
       [not found]             ` <CACAvsv5ZrSLzb=N5kLpZP5fwbF+=S414O_QDgsNbi9FvvqxxLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Ben Skeggs @ 2015-07-07 23:58 UTC (permalink / raw)
  To: C Bergström
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On 8 July 2015 at 09:53, C Bergström <cbergstrom@pathscale.com> wrote:
> regarding
> --------
> Fixed address allocations weren't going to be part of that, but I see
> that it makes sense for a variety of use cases.  One question I have
> here is how this is intended to work where the RM needs to make some
> of these allocations itself (for graphics context mapping, etc), how
> should potential conflicts with user mappings be handled?
> --------
> As an initial implementation you can probably assume that the GPU
> offloading is in "exclusive" mode. Basically that the CUDA or OpenACC
> code has full ownership of the card. The Tesla cards don't even have a
> video out on them. To complicate this even more - some offloading code
> has very long running kernels and even worse - may critically depend
> on using the full available GPU ram. (Large matrix sizes and soon big
> Fortran arrays or complex data types)
This doesn't change the fact that, to set up the graphics engine, the
driver needs to map various system-use data structures into the channel's
address space *somewhere* :)

>
> Long term - direct PCIe copies between cards will be important.. aka
> zero-copy. It may seem crazy, but when you have 16+ GPUs in a single
> workstation (Cirrascale) stuff like this is key.

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]             ` <CACAvsv5ZrSLzb=N5kLpZP5fwbF+=S414O_QDgsNbi9FvvqxxLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08  0:07               ` C Bergström
       [not found]                 ` <CAOnawYphTmUDxkKrEhUsVR6YRyLQj0P4hwgOkw2Jf4b0BZOSnw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: C Bergström @ 2015-07-08  0:07 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On Wed, Jul 8, 2015 at 6:58 AM, Ben Skeggs <skeggsb@gmail.com> wrote:
> On 8 July 2015 at 09:53, C Bergström <cbergstrom@pathscale.com> wrote:
>> regarding
>> --------
>> Fixed address allocations weren't going to be part of that, but I see
>> that it makes sense for a variety of use cases.  One question I have
>> here is how this is intended to work where the RM needs to make some
>> of these allocations itself (for graphics context mapping, etc), how
>> should potential conflicts with user mappings be handled?
>> --------
>> As an initial implementation you can probably assume that the GPU
>> offloading is in "exclusive" mode. Basically that the CUDA or OpenACC
>> code has full ownership of the card. The Tesla cards don't even have a
>> video out on them. To complicate this even more - some offloading code
>> has very long running kernels and even worse - may critically depend
>> on using the full available GPU ram. (Large matrix sizes and soon big
>> Fortran arrays or complex data types)
> This doesn't change that, to set up the graphics engine, the driver
> needs to map various system-use data structures into the channel's
> address space *somewhere* :)

I'm not sure I follow exactly what you mean, but I think the answer is:
don't set up the graphics engine if you're in "compute" mode. Doing
that, iiuc, will at least provide a start to support for compute.
Anyone who argues that graphics+compute is critical to have working at
the same time is probably in the 1%.

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                 ` <CAOnawYphTmUDxkKrEhUsVR6YRyLQj0P4hwgOkw2Jf4b0BZOSnw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08  0:08                   ` Ilia Mirkin
       [not found]                     ` <CAKb7UviOx-rNJUkwYB4h8XyQ4x8qp3xAbeHOAeW++O+bHFuyKQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Ilia Mirkin @ 2015-07-08  0:08 UTC (permalink / raw)
  To: C Bergström
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On Tue, Jul 7, 2015 at 8:07 PM, C Bergström <cbergstrom@pathscale.com> wrote:
> On Wed, Jul 8, 2015 at 6:58 AM, Ben Skeggs <skeggsb@gmail.com> wrote:
>> On 8 July 2015 at 09:53, C Bergström <cbergstrom@pathscale.com> wrote:
>>> regarding
>>> --------
>>> Fixed address allocations weren't going to be part of that, but I see
>>> that it makes sense for a variety of use cases.  One question I have
>>> here is how this is intended to work where the RM needs to make some
>>> of these allocations itself (for graphics context mapping, etc), how
>>> should potential conflicts with user mappings be handled?
>>> --------
>>> As an initial implementation you can probably assume that the GPU
>>> offloading is in "exclusive" mode. Basically that the CUDA or OpenACC
>>> code has full ownership of the card. The Tesla cards don't even have a
>>> video out on them. To complicate this even more - some offloading code
>>> has very long running kernels and even worse - may critically depend
>>> on using the full available GPU ram. (Large matrix sizes and soon big
>>> Fortran arrays or complex data types)
>> This doesn't change that, to set up the graphics engine, the driver
>> needs to map various system-use data structures into the channel's
>> address space *somewhere* :)
>
> I'm not sure I follow exactly what you mean, but I think the answer is:
> don't set up the graphics engine if you're in "compute" mode. Doing
> that, iiuc, will at least provide a start to support for compute.
> Anyone who argues that graphics+compute is critical to have working at
> the same time is probably in the 1%.

On NVIDIA GPUs, compute _is_ part of the graphics engine... aka PGRAPH.

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                     ` <CAKb7UviOx-rNJUkwYB4h8XyQ4x8qp3xAbeHOAeW++O+bHFuyKQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08  0:11                       ` C Bergström
       [not found]                         ` <CAOnawYo=EFk6KhmudKWi3r-z_J4AHjswTrZSfyp_qZfdmQc=tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: C Bergström @ 2015-07-08  0:11 UTC (permalink / raw)
  To: Ilia Mirkin; +Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On Wed, Jul 8, 2015 at 7:08 AM, Ilia Mirkin <imirkin@alum.mit.edu> wrote:
> On Tue, Jul 7, 2015 at 8:07 PM, C Bergström <cbergstrom@pathscale.com> wrote:
>> On Wed, Jul 8, 2015 at 6:58 AM, Ben Skeggs <skeggsb@gmail.com> wrote:
>>> On 8 July 2015 at 09:53, C Bergström <cbergstrom@pathscale.com> wrote:
>>>> regarding
>>>> --------
>>>> Fixed address allocations weren't going to be part of that, but I see
>>>> that it makes sense for a variety of use cases.  One question I have
>>>> here is how this is intended to work where the RM needs to make some
>>>> of these allocations itself (for graphics context mapping, etc), how
>>>> should potential conflicts with user mappings be handled?
>>>> --------
>>>> As an initial implementation you can probably assume that the GPU
>>>> offloading is in "exclusive" mode. Basically that the CUDA or OpenACC
>>>> code has full ownership of the card. The Tesla cards don't even have a
>>>> video out on them. To complicate this even more - some offloading code
>>>> has very long running kernels and even worse - may critically depend
>>>> on using the full available GPU ram. (Large matrix sizes and soon big
>>>> Fortran arrays or complex data types)
>>> This doesn't change that, to setup the graphics engine, the driver
>>> needs to map various system-use data structures into the channel's
>>> address space *somewhere* :)
>>
>> I'm not sure I follow exactly what you mean, but I think the answer is
>> - don't setup the graphics engine if you're in "compute" mode. Doing
>> that, iiuc, will at least provide a start to support for compute.
>> Anyone who argues that graphics+compute is critical to have working at
>> the same time is probably a 1%.
>
> On NVIDIA GPUs, compute _is_ part of the graphics engine... aka PGRAPH.

You can afaik setup PGRAPH without mapping memory for graphics. You
just init the engine and get out of the way.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                         ` <CAOnawYo=EFk6KhmudKWi3r-z_J4AHjswTrZSfyp_qZfdmQc=tw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08  0:13                           ` Ilia Mirkin
       [not found]                             ` <CAKb7UvhOM+65x80HPAcdTsQB4KsPA780cKg8_30vOy5qWFZt4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Ilia Mirkin @ 2015-07-08  0:13 UTC (permalink / raw)
  To: C Bergström
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On Tue, Jul 7, 2015 at 8:11 PM, C Bergström <cbergstrom@pathscale.com> wrote:
> On Wed, Jul 8, 2015 at 7:08 AM, Ilia Mirkin <imirkin@alum.mit.edu> wrote:
>> On Tue, Jul 7, 2015 at 8:07 PM, C Bergström <cbergstrom@pathscale.com> wrote:
>>> On Wed, Jul 8, 2015 at 6:58 AM, Ben Skeggs <skeggsb@gmail.com> wrote:
>>>> On 8 July 2015 at 09:53, C Bergström <cbergstrom@pathscale.com> wrote:
>>>>> regarding
>>>>> --------
>>>>> Fixed address allocations weren't going to be part of that, but I see
>>>>> that it makes sense for a variety of use cases.  One question I have
>>>>> here is how this is intended to work where the RM needs to make some
>>>>> of these allocations itself (for graphics context mapping, etc), how
>>>>> should potential conflicts with user mappings be handled?
>>>>> --------
>>>>> As an initial implementation you can probably assume that the GPU
>>>>> offloading is in "exclusive" mode. Basically that the CUDA or OpenACC
>>>>> code has full ownership of the card. The Tesla cards don't even have a
>>>>> video out on them. To complicate this even more - some offloading code
>>>>> has very long running kernels and even worse - may critically depend
>>>>> on using the full available GPU ram. (Large matrix sizes and soon big
>>>>> Fortran arrays or complex data types)
>>>> This doesn't change that, to setup the graphics engine, the driver
>>>> needs to map various system-use data structures into the channel's
>>>> address space *somewhere* :)
>>>
>>> I'm not sure I follow exactly what you mean, but I think the answer is
>>> - don't setup the graphics engine if you're in "compute" mode. Doing
>>> that, iiuc, will at least provide a start to support for compute.
>>> Anyone who argues that graphics+compute is critical to have working at
>>> the same time is probably a 1%.
>>
>> On NVIDIA GPUs, compute _is_ part of the graphics engine... aka PGRAPH.
>
> You can afaik setup PGRAPH without mapping memory for graphics. You
> just init the engine and get out of the way.

But... you need to map memory to set up the engine. Not a lot, but
it's gotta go *somewhere*.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                             ` <CAKb7UvhOM+65x80HPAcdTsQB4KsPA780cKg8_30vOy5qWFZt4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08  0:15                               ` Andrew Chew
       [not found]                                 ` <20150708001559.GA30347-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Chew @ 2015-07-08  0:15 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On Tue, Jul 07, 2015 at 08:13:28PM -0400, Ilia Mirkin wrote:
> On Tue, Jul 7, 2015 at 8:11 PM, C Bergström <cbergstrom@pathscale.com> wrote:
> > On Wed, Jul 8, 2015 at 7:08 AM, Ilia Mirkin <imirkin@alum.mit.edu> wrote:
> >> On Tue, Jul 7, 2015 at 8:07 PM, C Bergström <cbergstrom@pathscale.com> wrote:
> >>> On Wed, Jul 8, 2015 at 6:58 AM, Ben Skeggs <skeggsb@gmail.com> wrote:
> >>>> On 8 July 2015 at 09:53, C Bergström <cbergstrom@pathscale.com> wrote:
> >>>>> regarding
> >>>>> --------
> >>>>> Fixed address allocations weren't going to be part of that, but I see
> >>>>> that it makes sense for a variety of use cases.  One question I have
> >>>>> here is how this is intended to work where the RM needs to make some
> >>>>> of these allocations itself (for graphics context mapping, etc), how
> >>>>> should potential conflicts with user mappings be handled?
> >>>>> --------
> >>>>> As an initial implementation you can probably assume that the GPU
> >>>>> offloading is in "exclusive" mode. Basically that the CUDA or OpenACC
> >>>>> code has full ownership of the card. The Tesla cards don't even have a
> >>>>> video out on them. To complicate this even more - some offloading code
> >>>>> has very long running kernels and even worse - may critically depend
> >>>>> on using the full available GPU ram. (Large matrix sizes and soon big
> >>>>> Fortran arrays or complex data types)
> >>>> This doesn't change that, to setup the graphics engine, the driver
> >>>> needs to map various system-use data structures into the channel's
> >>>> address space *somewhere* :)
> >>>
> >>> I'm not sure I follow exactly what you mean, but I think the answer is
> >>> - don't setup the graphics engine if you're in "compute" mode. Doing
> >>> that, iiuc, will at least provide a start to support for compute.
> >>> Anyone who argues that graphics+compute is critical to have working at
> >>> the same time is probably a 1%.
> >>
> >> On NVIDIA GPUs, compute _is_ part of the graphics engine... aka PGRAPH.
> >
> > You can afaik setup PGRAPH without mapping memory for graphics. You
> > just init the engine and get out of the way.
> 
> But... you need to map memory to set up the engine. Not a lot, but
> it's gotta go *somewhere*.

There's some minimal state that needs to be mapped into GPU address space.
One thing that comes to mind are pushbuffers, which are needed to submit
stuff to any engine.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                                 ` <20150708001559.GA30347-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
@ 2015-07-08  0:18                                   ` Ben Skeggs
       [not found]                                     ` <CACAvsv5q5yJUmjPgJtxnv1dU--UzD1veePkJzvqjRyNtx=EEbw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Ben Skeggs @ 2015-07-08  0:18 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On 8 July 2015 at 10:15, Andrew Chew <achew@nvidia.com> wrote:
> On Tue, Jul 07, 2015 at 08:13:28PM -0400, Ilia Mirkin wrote:
>> On Tue, Jul 7, 2015 at 8:11 PM, C Bergström <cbergstrom@pathscale.com> wrote:
>> > On Wed, Jul 8, 2015 at 7:08 AM, Ilia Mirkin <imirkin@alum.mit.edu> wrote:
>> >> On Tue, Jul 7, 2015 at 8:07 PM, C Bergström <cbergstrom@pathscale.com> wrote:
>> >>> On Wed, Jul 8, 2015 at 6:58 AM, Ben Skeggs <skeggsb@gmail.com> wrote:
>> >>>> On 8 July 2015 at 09:53, C Bergström <cbergstrom@pathscale.com> wrote:
>> >>>>> regarding
>> >>>>> --------
>> >>>>> Fixed address allocations weren't going to be part of that, but I see
>> >>>>> that it makes sense for a variety of use cases.  One question I have
>> >>>>> here is how this is intended to work where the RM needs to make some
>> >>>>> of these allocations itself (for graphics context mapping, etc), how
>> >>>>> should potential conflicts with user mappings be handled?
>> >>>>> --------
>> >>>>> As an initial implementation you can probably assume that the GPU
>> >>>>> offloading is in "exclusive" mode. Basically that the CUDA or OpenACC
>> >>>>> code has full ownership of the card. The Tesla cards don't even have a
>> >>>>> video out on them. To complicate this even more - some offloading code
>> >>>>> has very long running kernels and even worse - may critically depend
>> >>>>> on using the full available GPU ram. (Large matrix sizes and soon big
>> >>>>> Fortran arrays or complex data types)
>> >>>> This doesn't change that, to setup the graphics engine, the driver
>> >>>> needs to map various system-use data structures into the channel's
>> >>>> address space *somewhere* :)
>> >>>
>> >>> I'm not sure I follow exactly what you mean, but I think the answer is
>> >>> - don't setup the graphics engine if you're in "compute" mode. Doing
>> >>> that, iiuc, will at least provide a start to support for compute.
>> >>> Anyone who argues that graphics+compute is critical to have working at
>> >>> the same time is probably a 1%.
>> >>
>> >> On NVIDIA GPUs, compute _is_ part of the graphics engine... aka PGRAPH.
>> >
>> > You can afaik setup PGRAPH without mapping memory for graphics. You
>> > just init the engine and get out of the way.
>>
>> But... you need to map memory to set up the engine. Not a lot, but
>> it's gotta go *somewhere*.
>
> There's some minimal state that needs to be mapped into GPU address space.
> One thing that comes to mind are pushbuffers, which are needed to submit
> stuff to any engine.
I guess you can probably use the start of the kernel's address space
carveout for these kind of mappings actually?  It's not like userspace
can ever have virtual addresses there?

Ben.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                                     ` <CACAvsv5q5yJUmjPgJtxnv1dU--UzD1veePkJzvqjRyNtx=EEbw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08  0:31                                       ` Andrew Chew
       [not found]                                         ` <20150708003153.GA30426-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
  2015-07-08 18:27                                       ` Ken Adams
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Chew @ 2015-07-08  0:31 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On Wed, Jul 08, 2015 at 10:18:36AM +1000, Ben Skeggs wrote:
> > There's some minimal state that needs to be mapped into GPU address space.
> > One thing that comes to mind are pushbuffers, which are needed to submit
> > stuff to any engine.
> I guess you can probably use the start of the kernel's address space
> carveout for these kind of mappings actually?  It's not like userspace
> can ever have virtual addresses there?

Yeah.  I'm looking into it further, but to answer your original question,
I believe there is essentially an address range that nouveau would know
about, which it uses for fixed address allocations (I'm referring to how
the nvgpu driver does things...we may or may not come up with something
different for nouveau).

Although it's dangerous, AFAIK the allocator in nouveau starts allocating
addresses at page 1, and as you suggested, one wouldn't ever get a CPU
address that low.  But having a set of addresses reserved would be much
better of course.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                                         ` <20150708003153.GA30426-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
@ 2015-07-08  0:37                                           ` Ben Skeggs
       [not found]                                             ` <CACAvsv5DtA2WsBQkNWnxZMsonbHsvJ-oKA+frVd-btZXfgiAyw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Ben Skeggs @ 2015-07-08  0:37 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On 8 July 2015 at 10:31, Andrew Chew <achew@nvidia.com> wrote:
> On Wed, Jul 08, 2015 at 10:18:36AM +1000, Ben Skeggs wrote:
>> > There's some minimal state that needs to be mapped into GPU address space.
>> > One thing that comes to mind are pushbuffers, which are needed to submit
>> > stuff to any engine.
>> I guess you can probably use the start of the kernel's address space
>> carveout for these kind of mappings actually?  It's not like userspace
>> can ever have virtual addresses there?
>
> Yeah.  I'm looking into it further, but to answer your original question,
> I believe there is essentially an address range that nouveau would know
> about, which it uses for fixed address allocations (I'm referring to how
> the nvgpu driver does things...we may or may not come up with something
> different for nouveau).
>
> Although it's dangerous, AFAIK the allocator in nouveau starts allocating
> addresses at page 1, and as you suggested, one wouldn't ever get a CPU
> address that low.  But having a set of addresses reserved would be much
> better of course.
I'm thinking more about the top of the address space.  As I understand
it, the kernel already splits the CPU virtual address space into
user/system areas (3GiB/1GiB for 32-bit IIUC), or something very
similar to that.

Perhaps, if we can get at that information, we can use those same
definitions for GPU address space?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                                             ` <CACAvsv5DtA2WsBQkNWnxZMsonbHsvJ-oKA+frVd-btZXfgiAyw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08  0:47                                               ` Andrew Chew
       [not found]                                                 ` <20150708004735.GA30570-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Chew @ 2015-07-08  0:47 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On Wed, Jul 08, 2015 at 10:37:34AM +1000, Ben Skeggs wrote:
> On 8 July 2015 at 10:31, Andrew Chew <achew@nvidia.com> wrote:
> > On Wed, Jul 08, 2015 at 10:18:36AM +1000, Ben Skeggs wrote:
> >> > There's some minimal state that needs to be mapped into GPU address space.
> >> > One thing that comes to mind are pushbuffers, which are needed to submit
> >> > stuff to any engine.
> >> I guess you can probably use the start of the kernel's address space
> >> carveout for these kind of mappings actually?  It's not like userspace
> >> can ever have virtual addresses there?
> >
> > Yeah.  I'm looking into it further, but to answer your original question,
> > I believe there is essentially an address range that nouveau would know
> > about, which it uses for fixed address allocations (I'm referring to how
> > the nvgpu driver does things...we may or may not come up with something
> > different for nouveau).
> >
> > Although it's dangerous, AFAIK the allocator in nouveau starts allocating
> > addresses at page 1, and as you suggested, one wouldn't ever get a CPU
> > address that low.  But having a set of addresses reserved would be much
> > better of course.
> I'm thinking more about the top of the address space.  As I understand
> it, the kernel already splits the CPU virtual address space into
> user/system areas (3GiB/1GiB for 32-bit IIUC), or something very
> similar to that.
> 
> Perhaps, if we can get at that information, we can use those same
> definitions for GPU address space?

Ah, I get what you're saying.  Sure, I think that might be okay.  Not sure
how we would get at that information, though, and it would be horrible to
just bake it in somewhere.  I'm looking into how nvgpu driver does it...
maybe they have good reasons to do it the way they do.  Sorry if I go
quiet for a little bit...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                                                 ` <20150708004735.GA30570-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
@ 2015-07-08  0:51                                                   ` Ben Skeggs
       [not found]                                                     ` <CACAvsv56doVLnMJKCfQyrPj-ijsW7yuAMv53kR0OKxJ0LKM5iQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Ben Skeggs @ 2015-07-08  0:51 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On 8 July 2015 at 10:47, Andrew Chew <achew@nvidia.com> wrote:
> On Wed, Jul 08, 2015 at 10:37:34AM +1000, Ben Skeggs wrote:
>> On 8 July 2015 at 10:31, Andrew Chew <achew@nvidia.com> wrote:
>> > On Wed, Jul 08, 2015 at 10:18:36AM +1000, Ben Skeggs wrote:
>> >> > There's some minimal state that needs to be mapped into GPU address space.
>> >> > One thing that comes to mind are pushbuffers, which are needed to submit
>> >> > stuff to any engine.
>> >> I guess you can probably use the start of the kernel's address space
>> >> carveout for these kind of mappings actually?  It's not like userspace
>> >> can ever have virtual addresses there?
>> >
>> > Yeah.  I'm looking into it further, but to answer your original question,
>> > I believe there is essentially an address range that nouveau would know
>> > about, which it uses for fixed address allocations (I'm referring to how
>> > the nvgpu driver does things...we may or may not come up with something
>> > different for nouveau).
>> >
>> > Although it's dangerous, AFAIK the allocator in nouveau starts allocating
>> > addresses at page 1, and as you suggested, one wouldn't ever get a CPU
>> > address that low.  But having a set of addresses reserved would be much
>> > better of course.
>> I'm thinking more about the top of the address space.  As I understand
>> it, the kernel already splits the CPU virtual address space into
>> user/system areas (3GiB/1GiB for 32-bit IIUC), or something very
>> similar to that.
>>
>> Perhaps, if we can get at that information, we can use those same
>> definitions for GPU address space?
>
> Ah, I get what you're saying.  Sure, I think that might be okay.  Not sure
> how we would get at that information, though, and it would be horrible to
> just bake it in somewhere.  I'm looking into how nvgpu driver does it...
> maybe they have good reasons to do it the way they do.  Sorry if I go
> quiet for a little bit...
After a very quick look, it looks like the kernel defines a
PAGE_OFFSET macro which is the start of kernel virtual address space.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                                     ` <CACAvsv5q5yJUmjPgJtxnv1dU--UzD1veePkJzvqjRyNtx=EEbw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-07-08  0:31                                       ` Andrew Chew
@ 2015-07-08 18:27                                       ` Ken Adams
  1 sibling, 0 replies; 24+ messages in thread
From: Ken Adams @ 2015-07-08 18:27 UTC (permalink / raw)
  To: Ben Skeggs,
	nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

responding to this bit of text from ben:
> "I guess you can probably use the start of the kernel's address space
> carveout for these kind of mappings actually?  It's not like userspace
> can ever have virtual addresses there?"

one of the salient points of how we implement gr and compute setup is that
these buffer regions (shared, global, any but for a hole 0-128MB) are
allocated dynamically.  an address space can be set up well in advance and
as long as the gr/compute engine setup buffer allocator is playing along
(i.e. honoring the previously allocated regions) things work out just fine.
the term we use internally is "anonymous" address spaces: unbound and
unused as yet.

now, as for how the gpu and cpu address ranges work/or don't: that's up to
the user space code to work through.  the cuda guys have various
techniques to make it unified (some work in 64b only, some both, and
almost all require specific API conditions).  but, as long as we can have
them tell the kernel what gpu ranges to avoid (by allocating them in
advance) it's up to that code to fulfill the cpu portion.
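
a minimal sketch of the "allocator plays along" idea, for illustration
only.  every name here (va_space, va_reserve_fixed, va_alloc_any) is
hypothetical and is not taken from nouveau or nvgpu; the fixed-size
table and the first-fit policy are assumptions made to keep it short.
user space reserves ranges up front, and the driver-side allocation
skips anything already reserved:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RESERVED 16            /* arbitrary cap for this sketch */

struct va_range { uint64_t start, end; };      /* half-open: [start, end) */

struct va_space {
    uint64_t limit;                            /* one past the last valid address */
    struct va_range reserved[MAX_RESERVED];
    size_t count;
};

/* Reserve a caller-chosen range; fails if it collides or does not fit. */
static bool va_reserve_fixed(struct va_space *as, uint64_t start, uint64_t size)
{
    assert(as != NULL);
    if (size == 0 || start > as->limit || size > as->limit - start)
        return false;
    if (as->count == MAX_RESERVED)
        return false;
    for (size_t i = 0; i < as->count; i++) {
        const struct va_range *r = &as->reserved[i];
        if (start < r->end && r->start < start + size)
            return false;                      /* overlaps an earlier reservation */
    }
    as->reserved[as->count++] = (struct va_range){ start, start + size };
    return true;
}

/*
 * Driver-side allocation: first fit at or above `floor`, stepping past
 * every range that was reserved earlier.  On success the chosen address
 * is stored in *out and recorded as reserved.
 */
static bool va_alloc_any(struct va_space *as, uint64_t size, uint64_t floor,
                         uint64_t *out)
{
    uint64_t start = floor;
    bool moved = true;

    if (size == 0 || size > as->limit)
        return false;
    while (moved) {
        if (start > as->limit - size)
            return false;                      /* ran off the end of the space */
        moved = false;
        for (size_t i = 0; i < as->count; i++) {
            const struct va_range *r = &as->reserved[i];
            if (start < r->end && r->start < start + size) {
                start = r->end;                /* jump past the blocking range */
                moved = true;
                break;
            }
        }
    }
    if (!va_reserve_fixed(as, start, size))
        return false;
    *out = start;
    return true;
}
```

the `floor` argument stands in for the low hole mentioned above (for
example 128MB); whether a real driver would express it that way is an
open question.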

---
ken 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                                                     ` <CACAvsv56doVLnMJKCfQyrPj-ijsW7yuAMv53kR0OKxJ0LKM5iQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-08 19:40                                                       ` Jerome Glisse
       [not found]                                                         ` <20150708194004.GA8122-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Jerome Glisse @ 2015-07-08 19:40 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

On Wed, Jul 08, 2015 at 10:51:55AM +1000, Ben Skeggs wrote:
> On 8 July 2015 at 10:47, Andrew Chew <achew@nvidia.com> wrote:
> > On Wed, Jul 08, 2015 at 10:37:34AM +1000, Ben Skeggs wrote:
> >> On 8 July 2015 at 10:31, Andrew Chew <achew@nvidia.com> wrote:
> >> > On Wed, Jul 08, 2015 at 10:18:36AM +1000, Ben Skeggs wrote:
> >> >> > There's some minimal state that needs to be mapped into GPU address space.
> >> >> > One thing that comes to mind are pushbuffers, which are needed to submit
> >> >> > stuff to any engine.
> >> >> I guess you can probably use the start of the kernel's address space
> >> >> carveout for these kind of mappings actually?  It's not like userspace
> >> >> can ever have virtual addresses there?
> >> >
> >> > Yeah.  I'm looking into it further, but to answer your original question,
> >> > I believe there is essentially an address range that nouveau would know
> >> > about, which it uses for fixed address allocations (I'm referring to how
> >> > the nvgpu driver does things...we may or may not come up with something
> >> > different for nouveau).
> >> >
> >> > Although it's dangerous, AFAIK the allocator in nouveau starts allocating
> >> > addresses at page 1, and as you suggested, one wouldn't ever get a CPU
> >> > address that low.  But having a set of addresses reserved would be much
> >> > better of course.
> >> I'm thinking more about the top of the address space.  As I understand
> >> it, the kernel already splits the CPU virtual address space into
> >> user/system areas (3GiB/1GiB for 32-bit IIUC), or something very
> >> similar to that.
> >>
> >> Perhaps, if we can get at that information, we can use those same
> >> definitions for GPU address space?
> >
> > Ah, I get what you're saying.  Sure, I think that might be okay.  Not sure
> > how we would get at that information, though, and it would be horrible to
> > just bake it in somewhere.  I'm looking into how nvgpu driver does it...
> > maybe they have good reasons to do it the way they do.  Sorry if I go
> > quiet for a little bit...
> After a very quick look, it looks like the kernel defines a
> PAGE_OFFSET macro which is the start of kernel virtual address space.

You need to be careful here. First, the hardware might not have as many
address bits as the CPU. For instance, x86-64 has 48 bits of virtual
address, i.e. only 48 bits of the address are meaningful, while older
radeon (<CI iirc) only has 40 bits for the address bus. With such a
configuration you could not move all private kernel allocations inside
the kernel zone.
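
A minimal sketch of that width check, for illustration only. The helper
name range_fits_va_bits is hypothetical, and the 48-bit and 40-bit
figures are just the examples quoted above, not values read from any
driver:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if every byte of [addr, addr + size)
 * is addressable with `va_bits` bits of virtual address.
 */
static bool range_fits_va_bits(uint64_t addr, uint64_t size, unsigned va_bits)
{
    uint64_t last;

    assert(va_bits >= 1 && va_bits <= 64);
    if (size == 0)
        return false;
    last = addr + size - 1;
    if (last < addr)                 /* the range wrapped around 2^64 */
        return false;
    if (va_bits == 64)
        return true;                 /* shifting by 64 would be undefined */
    return (last >> va_bits) == 0;
}
```

A typical x86-64 userspace address such as 0x7f0000000000 passes this
check for a 48-bit GPU but fails it for a 40-bit one, which is the
mismatch described above.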

The second issue is things like a 32bit process on a 64bit kernel, in
which case you have the usual 3GB userspace, 1GB kernel space split. So
instead of using PAGE_OFFSET you might want to use TASK_SIZE, which is a
macro that will look up the limit using current (the process struct
pointer).

I think the issue for nouveau is that kernel space already handles some
allocation of virtual addresses, while for radeon the whole virtual
address space is fully under userspace control.

Given this, you might want to use tricks on both sides (kernel and user
space). For instance, you could mmap a region with PROT_NONE to reserve
a range of virtual addresses from userspace, then tell the driver about
that range and have the driver initialize the GPU and use that chunk
for kernel private structure allocation.
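
A minimal sketch of the userspace half of that trick, assuming Linux and
glibc. mmap, mprotect, munmap and sysconf are the standard calls; the
wrapper names (reserve_va_range, commit_pages, release_va_range) are
hypothetical, and the step of telling the driver about the range is
left out because no such ioctl exists yet:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve `size` bytes of CPU virtual address space with no access rights
 * and no backing memory.  Returns NULL on failure.
 */
static void *reserve_va_range(size_t size)
{
    void *base = mmap(NULL, size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/*
 * Make `npages` pages usable, starting `first_page` pages into the
 * reservation.  Returns 0 on success, -1 on failure (as mprotect does).
 */
static int commit_pages(void *base, size_t first_page, size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    return mprotect((char *)base + first_page * page, npages * page,
                    PROT_READ | PROT_WRITE);
}

/* Give the whole reservation back to the kernel. */
static int release_va_range(void *base, size_t size)
{
    return munmap(base, size);
}
```

The reserved range stays inaccessible until part of it is committed, so
other mmap calls in the process cannot land inside it in the meantime.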

The issue is that it is kind of an API violation for the nouveau kernel
driver. Though I am not familiar enough; maybe you can do an ioctl to
nouveau before nouveau initializes and allocates the kernel private
buffers (gr and other stuff). If so, then problem solved, I guess. A
process that wants to use CUDA will need to do the mmap dance and the
early ioctl.


Hope this helps, cheers
Jérôme

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                                                         ` <20150708194004.GA8122-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2015-07-08 21:18                                                           ` Andrew Chew
       [not found]                                                             ` <20150708211801.GA27080-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Chew @ 2015-07-08 21:18 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> > > Ah, I get what you're saying.  Sure, I think that might be okay.  Not sure
> > > how we would get at that information, though, and it would be horrible to
> > > just bake it in somewhere.  I'm looking into how nvgpu driver does it...
> > > maybe they have good reasons to do it the way they do.  Sorry if I go
> > > quiet for a little bit...
> > After a very quick look, it looks like the kernel defines a
> > PAGE_OFFSET macro which is the start of kernel virtual address space.
> 
> You need to be careful here. First, the hardware might not have as many
> address bits as the CPU. For instance, x86-64 has 48 bits of virtual
> address, i.e. only 48 bits of the address are meaningful, while older
> radeon (<CI iirc) only has 40 bits for the address bus. With such a
> configuration you could not move all private kernel allocations inside
> the kernel zone.
>
> The second issue is things like a 32bit process on a 64bit kernel, in
> which case you have the usual 3GB userspace, 1GB kernel space split. So
> instead of using PAGE_OFFSET you might want to use TASK_SIZE, which is a
> macro that will look up the limit using current (the process struct
> pointer).
>
> I think the issue for nouveau is that kernel space already handles some
> allocation of virtual addresses, while for radeon the whole virtual
> address space is fully under userspace control.
>
> Given this, you might want to use tricks on both sides (kernel and user
> space). For instance, you could mmap a region with PROT_NONE to reserve
> a range of virtual addresses from userspace, then tell the driver about
> that range and have the driver initialize the GPU and use that chunk
> for kernel private structure allocation.
>
> The issue is that it is kind of an API violation for the nouveau kernel
> driver. Though I am not familiar enough; maybe you can do an ioctl to
> nouveau before nouveau initializes and allocates the kernel private
> buffers (gr and other stuff). If so, then problem solved, I guess. A
> process that wants to use CUDA will need to do the mmap dance and the
> early ioctl.

I think we can have a nouveau ioctl to report the full address range that
the GPU supports.  Userspace can use this information to know what range
it can reserve.  The reservation part we can do with the AS_ALLOC
and AS_FREE nouveau ioctls that I originally proposed, and in the CUDA case,
this reservation should happen before any channel for a particular context
gets created.

* Re: [Nouveau] CUDA fixed VA allocations and sparse mappings
  2015-07-07 17:27     ` Jerome Glisse
@ 2015-07-09  9:26       ` Oded Gabbay
       [not found]         ` <CAFCwf11pEc7vq0aPxdCRypzcbuaJRgWb_55q4ZUQPAvw5zXzHg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Oded Gabbay @ 2015-07-09  9:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: nouveau@lists.freedesktop.org, Andrew Chew,
	dri-devel@lists.freedesktop.org

On Tue, Jul 7, 2015 at 8:27 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Tue, Jul 07, 2015 at 11:29:38AM -0400, Ilia Mirkin wrote:
>> On Mon, Jul 6, 2015 at 8:42 PM, Andrew Chew <achew@nvidia.com> wrote:
>> > Hello,
>> >
>> > I am currently looking into ways to support fixed virtual address allocations
>> > and sparse mappings in nouveau, as a step towards supporting CUDA.
>> >
>> > CUDA requires that the GPU virtual address for a given buffer match the
>> > CPU virtual address.  Therefore, when mapping a CUDA buffer, we have to have
>> > a way of specifying a particular virtual address to map to (we would ask that
>> > the CPU virtual address be used).  Currently, as I understand it, the allocator
>> > implemented in nvkm/core/mm.c, used to provision virtual addresses, doesn't
>> > allow for this (but it's very easy to modify the allocator slightly to allow
>> > for this, which I have done locally in my experiments).
>> >
>> > In addition, the CUDA use case typically involves allocating a big chunk of
>> > address space ahead of time as a way to reserve that chunk for future CUDA
>> > use.  It then maps individual buffers into that address space as needed.
>> > Currently, the virtual address allocation is done during buffer mapping, so
>> > in order to support these sparse mappings, it seems to me that the virtual
>> > address allocation and buffer mapping need to be decoupled into separate
>> > operations.
>> >
>> > My current strawman proposal for supporting this is to introduce two new ioctls
>> > DRM_IOCTL_NOUVEAU_AS_ALLOC and DRM_IOCTL_NOUVEAU_AS_FREE, that look roughly
>> > like this:
>> >
>> > #define NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET 0x1
>> > struct drm_nouveau_as_alloc {
>> >         uint64_t pages;     /* in, pages */
>> >         uint32_t page_size; /* in, bytes */
>> >         uint32_t flags;     /* in */
>> >         uint64_t offset;    /* in/out, byte address */
>> > };
>> >
>> > struct drm_nouveau_as_free {
>> >         uint64_t offset;    /* in, byte address */
>> > };
>> >
>> > These ioctls just call into the allocator to allocate a range of addresses,
>> > resulting in a struct nvkm_vma that tracks that allocation (or releases the
>> > struct nvkm_vma back into the virtual address pool in the case of the free
>> > ioctl).  If NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET is set, offset specifies the
>> > requested virtual address.  Otherwise, an arbitrary address will be
>> > allocated.
>>
>> Well, this can't just be an address space. You still need bo's, if
>> this is to work with nouveau -- it has to know when to swap things in
>> and out, when they're used, etc. (and/or move between VRAM and GART
>> and system/swap). I suspect that your target here are the GK20A and
>> GM20B chips which don't have dedicated VRAM, but the ioctl's need to
>> work for everything.
>>
>> Would it be sufficient to extend NOUVEAU_GEM_NEW or create a
>> NOUVEAU_GEM_NEW_FIXED or something? IOW, why do we have to separate the
>> concept of a GEM object and a VM allocation?
>
> Well, maybe something like I did for radeon. With radeon you have 2 sets of
> ioctls: one to create/delete a bo (GEM stuff) and one to associate a virtual
> address with a bo. I wanted to let userspace decide on the virtual address
> of a buffer for precisely the same reason CUDA does it, i.e. to allow mapping
> a buffer at the same address in the GPU address space as in the CPU address
> space. So far we never really took advantage of that on the radeon side.
>
> Also, on radeon you can map the same bo at different virtual addresses in the
> same process (you will need a different file descriptor for each mapping, and
> you can only submit command streams using the mapping valid for that file
> descriptor). Though this is mostly useful when sharing the same bo across
> different processes.
>
> I think my radeon virtual address ioctls are a nice design, but others might
> disagree. If you want to look at the code:
>
>   drivers/gpu/drm/radeon/radeon_vm.c
>   drivers/gpu/drm/radeon/radeon_gem.c
>
> Grep for _va (virtual address per bo) or _vm (virtual address manager per
> file descriptor) in function and structure names.
>
> On the command stream and bo eviction side, everything is as usual on radeon.
> So a bo can be evicted between 2 command streams to make room for another one.
> Either its mapping is invalidated or it is updated to point to system memory.
> So most of the logic for everything else remains the same (you just need to
> update the multiple virtual address spaces).
>
>
>>
>> >
>> > In addition to this, a way to map/unmap buffers is needed.  Ordinarily, one
>> > would just use DRM_IOCTL_PRIME_FD_TO_HANDLE to import and map a dmabuf into
>> > gem.  However, this ioctl will try to grab the virtual address range for this
>> > buffer, which will fail in the CUDA case since the virtual address range
>> > has been reserved ahead of time.  So we perhaps introduce a set of ioctls
>> > to map/unmap buffers on top of an already existing virtual address allocation.
>>
>> My suggestion above is an alternative to this, right? I think dmabufs
>> tend to be used for sharing between devices. I suspect there's more
>> going on here that I don't understand though -- I assume the CUDA
>> use-case is similar to the HSA use-case -- being able to build up data
>> structures that point to one another on the CPU and then process them
>> on the GPU? Can you detail a specific use-case perhaps, including the
>> interactions with the GPU and its address space?
>
> I think you nailed it, it is really about having the same address pointing to
> the same thing on both the GPU and CPU. But this is also valid and useful for
> VRAM. OpenCL 2.0 has various levels of transparent address space (probably
> not the term used in the spec) and the lowest level would need something like
> what radeon has to work. The most advanced level needs more plumbing inside
> the core kernel mm or inside the CPU and GPU hardware.
>
>
>> Jérôme, I believe you were doing the HSA kernel implementation.
>> Perhaps you'd have some feedback on this proposal?
>
> No, I did not do the HSA stuff; the AMD team led by Oded did :)
>
> Cheers,
> Jérôme

Hi,
So this is very similar to HSA and OpenCL 2.0 requirements.
We have had an easy life so far for HSA, because amdkfd currently only
supports APUs, where the system memory is shared between the CPU and
GPU cores. In this scenario, we have a dedicated H/W module called
IOMMUv2, which uses the CPU page tables to provide access from the GPU
core to the system memory.

To support discrete GPUs, amdkfd needs to implement one of the following models:

1. Have different virtual address spaces for CPU and GPU buffers, aka
OpenCL 1.2. In this model, we prepare the data in system memory,
copy the data to local memory, the GPU works on it, then we copy the
results back to system memory. We actually have an implementation of
this but it is not upstreamed yet (see below for a reference to the
implementation).

2. Use the GPUVM inside the GPU to access system memory. The
limitation here is that the GPUVM uses 40-bit addresses, so the
virtual address range must be in the lower 40-bit address range of the
CPU. Access speed is limited by PCI-e bandwidth. Another limitation is
that the system memory pages must be pinned as the GPU doesn't support
page faults.

I think your model is the same as the latter. The latest plan
before I left AMD was:
1. Reserve a large chunk of address space in the lower 40-bit address
space of the process, when it is created.
2. When a buffer is required by the application, reserve a chunk out
of that address space, then create BOs and map them to that address
space. I advise using a fixed-size BO (2 or 4 MB), and if the application
requires a larger allocation, allocating a list of BOs.
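
As a rough sketch of the arithmetic behind these two steps (not code from
amdkfd), the helpers below check that a range stays inside the lower 40-bit
address space and count how many fixed-size BOs back an allocation. The names
range_fits_gpuvm() and bo_count_for() are made up for this example, and the
2 MB chunk size is just one of the two sizes suggested above.

```c
#include <stdbool.h>
#include <stdint.h>

#define GPUVM_ADDR_BITS 40u          /* GPUVM limit described above */
#define BO_CHUNK_SIZE   (2ull << 20) /* 2 MB per BO (assumed for this sketch) */

/* True if [addr, addr + size) lies entirely below 2^40. */
static bool range_fits_gpuvm(uint64_t addr, uint64_t size)
{
    const uint64_t limit = 1ull << GPUVM_ADDR_BITS;

    /* Written to avoid overflow in addr + size. */
    return size != 0 && addr < limit && size <= limit - addr;
}

/* Number of fixed-size BOs needed to back `size` bytes, rounded up. */
static uint64_t bo_count_for(uint64_t size)
{
    return size / BO_CHUNK_SIZE + (size % BO_CHUNK_SIZE != 0);
}
```

With these definitions a 5 MB request needs three 2 MB BOs, with the last
one only partly used.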

The operation of reserving address space and BO creation is one IOCTL,
while the mapping of the BO to the address space is a second IOCTL.
There are of course unmap and free IOCTLs. The separation is done for
a couple of reasons:

1. If the application knows that it wants to use only part of the
memory area it allocated, then there is no point in pinning all the
BOs. So, the application can map/unmap just part of the allocation.

2. If the application knows that it has finished using the BOs, and it
also knows that it will use them later on, it can unmap the BOs (to
make them unpinned) but not free them so the memory is still reserved
(with its contents intact).
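
To make the lifecycle concrete, here is a toy state model of one allocation
going through the four operations. It is only an illustration of the
semantics described above, not amdkfd code; the toy_* names are invented, and
refusing to free while still mapped is an assumption of this sketch.

```c
#include <stdbool.h>

/* Toy model of one allocation; not a real driver structure. */
struct toy_alloc {
    bool reserved; /* address range and BOs exist (alloc ioctl done) */
    bool mapped;   /* BOs are pinned and mapped for the GPU (map ioctl done) */
};

/* First IOCTL: reserve address space and create the BOs. */
static int toy_alloc_memory(struct toy_alloc *a)
{
    if (a->reserved)
        return -1;
    a->reserved = true;
    a->mapped = false;
    return 0;
}

/* Second IOCTL: map (and pin) the BOs into the reserved range. */
static int toy_map(struct toy_alloc *a)
{
    if (!a->reserved || a->mapped)
        return -1;
    a->mapped = true;
    return 0;
}

/* Unmap: BOs become unpinned, but the reservation and contents remain. */
static int toy_unmap(struct toy_alloc *a)
{
    if (!a->mapped)
        return -1;
    a->mapped = false;
    return 0;
}

/* Free: drop the reservation; assumed to fail while still mapped. */
static int toy_free(struct toy_alloc *a)
{
    if (!a->reserved || a->mapped)
        return -1;
    a->reserved = false;
    return 0;
}
```

The point of the split is visible in the model: after toy_unmap() the
allocation is still reserved, so it can be mapped again later without a new
allocation.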

For a reference implementation of the first model, look at
http://cgit.freedesktop.org/~gabbayo/linux/tree/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c?h=kfd-1.4.x

Specifically, at functions:

kfd_ioctl_alloc_memory_of_gpu
kfd_ioctl_free_memory_of_gpu
kfd_ioctl_map_memory_to_gpu
kfd_ioctl_unmap_memory_from_gpu

You can also look at the matching userspace code at:

http://cgit.freedesktop.org/~gabbayo/libhsakmt/tree/src/memory.c?h=libhsakmt-1.4.x

Hope this helps.

Oded
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: CUDA fixed VA allocations and sparse mappings
       [not found]         ` <CAFCwf11pEc7vq0aPxdCRypzcbuaJRgWb_55q4ZUQPAvw5zXzHg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-07-10  0:31           ` Andrew Chew
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Chew @ 2015-07-10  0:31 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

> 2. Use the GPUVM inside the GPU to access system memory. The
> limitation here is that the GPUVM uses 40-bit addresses, so the
> virtual address range must be in the lower 40-bit address range of the
> CPU. Access speed is limited by PCI-e bandwidth. Another limitation is
> that the system memory pages must be pinned as the GPU doesn't support
> page faults.
> 
> The operation of reserving address space and BO creation is one IOCTL,
> while the mapping of the BO to the address space is a second IOCTL.
> There are of course unmap and free IOCTLs. The separation is done for
> a couple of reasons:
> 
> 1. If the application knows that it wants to use only part of the
> memory area it allocated, then there is no point in pinning all the
> BOs. So, the application can map/unmap just part of the allocation.
> 
> 2. If the application knows that it has finished using the BOs, and it
> also knows that it will use them later on, it can unmap the BOs (to
> make them unpinned) but not free them so the memory is still reserved
> (with its contents intact).

Yes, thanks Oded.  I think this is pretty much exactly how I imagine things
to work.

I'll post my code soon and see what you guys think.  There was some
misunderstanding on my part on how bo's work, so I need to rework some
stuff.

* Re: CUDA fixed VA allocations and sparse mappings
       [not found]                                                             ` <20150708211801.GA27080-hKyou4+EtHjTuHN6Nbh//0n48jw8i0AO@public.gmane.org>
@ 2015-07-13 18:45                                                               ` Andrew Chew
  2015-07-16  6:34                                                                 ` [Nouveau] " Alexandre Courbot
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Chew @ 2015-07-13 18:45 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

I apologize for my ignorance.  In digging through nouveau, I've become
a bit confused regarding the relationship between virtual address
allocations and nouveau bo's.

From my reading of the code, it seems that a nouveau_bo really
encapsulates a buffer (whether imported, or allocated within nouveau like,
say, pushbuffers).  So I'm confused about an earlier statement that to
allocate a chunk of address space, I have to create a nouveau_bo for it.

What I really want to do is reserve some space in the address allocator
(the stuff in nvkm/subdev/mmu/base.c).  Note that there are no buffers
at this time.  This is just blocking out some chunk of the address space
so that normal address space allocations (for, say, bo's) avoid this region.

At some point after that, I'd like to import a buffer, and map it to
certain regions of my pre-allocated address space.  This is why I can't
go through the normal path of importing a buffer...that path assumes there
is no address for this buffer, and tries to allocate one.  In our case,
we already have an address in mind.  Naively, at this point, I'd like to
create a nouveau_bo for this imported buffer, but not have it go through
the address allocator and instead just take a fixed address.

Can someone clear up some of my confusion?

* Re: [Nouveau] CUDA fixed VA allocations and sparse mappings
  2015-07-13 18:45                                                               ` Andrew Chew
@ 2015-07-16  6:34                                                                 ` Alexandre Courbot
  0 siblings, 0 replies; 24+ messages in thread
From: Alexandre Courbot @ 2015-07-16  6:34 UTC (permalink / raw)
  To: nouveau@lists.freedesktop.org, Andrew Chew,
	dri-devel@lists.freedesktop.org

On Tue, Jul 14, 2015 at 3:45 AM, Andrew Chew <achew@nvidia.com> wrote:
> I apologize for my ignorance.  In digging through nouveau, I've become
> a bit confused regarding the relationship between virtual address
> allocations and nouveau bo's.
>
> From my reading of the code, it seems that a nouveau_bo really
> encapsulates a buffer (whether imported, or allocated within nouveau like,
> say, pushbuffers).  So I'm confused about an earlier statement that to
> allocate a chunk of address space, I have to create a nouveau_bo for it.

It is the case right now because there is no means for user-space to
manipulate the GPU address space without having a nouveau_bo. So both
are closely related. But if you implement the address space
reservation ioctl, a nouveau_bo will not be required until you want to
back that space with actual memory.

> What I really want to do is reserve some space in the address allocator
> (the stuff in nvkm/subdev/mmu/base.c).  Note that there are no buffers
> at this time.  This is just blocking out some chunk of the address space
> so that normal address space allocations (for, say, bo's) avoid this region.
>
> At some point after that, I'd like to import a buffer, and map it to
> certain regions of my pre-allocated address space.  This is why I can't
> go through the normal path of importing a buffer...that path assumes there
> is no address for this buffer, and tries to allocate one.  In our case,
> we already have an address in mind.  Naively, at this point, I'd like to
> create a nouveau_bo for this imported buffer, but not have it go through
> the address allocator and instead just take a fixed address.

I think our main issue is that (someone correct me if I am wrong)
Nouveau will automatically create a GPU mapping when a buffer is
imported through PRIME. If we can (1) prevent this from happening (or,
less ideally, re-map the imported buffer afterwards), and (2) perform
the mapping by ourselves, we should be good. For the sake of
completeness, we should also solve that same issue for buffers created
using the NOUVEAU_GEM_NEW ioctl.

I am not sure how we can make (1) happen. Surely we cannot change the
semantics of DRM_IOCTL_PRIME_FD_TO_HANDLE without breaking user-space.
But maybe we can delay that automatic mapping, and only make it happen
if no manual mapping has been performed in-between? This would leave
us a window right after the object is imported to decide its GPU
address, which is precisely what we need. For objects created using
NOUVEAU_GEM_NEW, things might be as simple as adding a "do not map
yet" flag.

Regarding (2), I kind of feel like this is related to another issue we
were having with imported buffers: that we have no way to specify
their tiling options, contrary to buffers created with
NOUVEAU_GEM_NEW. I had a pretty lame attempt at fixing this last point
(http://lists.freedesktop.org/archives/dri-devel/2015-May/083052.html),
but it was rejected, and probably for the best now that I
think of it.

Tiling and offset inside the GPU VM are both properties of a buffer,
so why not handle them both using the same ioctl? We currently have
DRM_NOUVEAU_GEM_INFO, which returns all these properties to user-space
(see struct drm_nouveau_gem_info). How about introducing
DRM_NOUVEAU_GEM_SET_INFO, which would allow changing these properties,
i.e. changing the tiling flags (which the tiling ioctl attempted to
do), but also the mapping address if it is specified and valid?

So in order to import a buffer at a fixed GPU address, after reserving
a portion of the GPU VM for that purpose, one would:

1) use DRM_IOCTL_PRIME_FD_TO_HANDLE to import the buffer
2) invoke DRM_NOUVEAU_GEM_SET_INFO to map the buffer to the right address
(or re-map it, if step 1 already created a mapping)

I suspect this proposal is full of flaws though, so feel free to shoot
it down. :)
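
To give an idea of what "specified and valid" could mean for the mapping
address, here is a small sketch. Everything in it is hypothetical: the
struct layout is not a proposed uAPI, and the toy_* names exist only for this
example. It checks that the requested address is page-aligned and that the
buffer fits inside a previously reserved range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical argument for the suggested DRM_NOUVEAU_GEM_SET_INFO. */
struct toy_gem_set_info {
    uint32_t handle;     /* GEM handle from DRM_IOCTL_PRIME_FD_TO_HANDLE */
    uint32_t tile_flags; /* new tiling flags */
    uint64_t offset;     /* requested GPU virtual address */
};

/* A range of the GPU VM reserved beforehand (hypothetical bookkeeping). */
struct toy_reservation {
    uint64_t start;
    uint64_t size;
};

/*
 * True if a buffer of bo_size bytes placed at req->offset is page-aligned
 * and lies entirely inside the reservation.
 */
static bool toy_set_info_valid(const struct toy_gem_set_info *req,
                               uint64_t bo_size,
                               const struct toy_reservation *r,
                               uint64_t page_size)
{
    if (page_size == 0 || bo_size == 0)
        return false;
    if (req->offset % page_size != 0)
        return false;
    if (req->offset < r->start || bo_size > r->size)
        return false;
    /* Overflow-safe form of: offset + bo_size <= start + size. */
    return req->offset - r->start <= r->size - bo_size;
}
```

A real implementation would also have to deal with an existing automatic
mapping from step 1, which this sketch ignores.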

