* [PATCH v2 00/25] AMDKFD kernel driver
From: Oded Gabbay @ 2014-07-17 13:57 UTC
To: David Airlie, Jerome Glisse, Alex Deucher, Andrew Morton
Cc: John Bridgman, Joerg Roedel, Andrew Lewycky, Christian König,
Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
linux-mm
Forgot to cc mailing list on cover letter. Sorry.
As a continuation of the existing discussion, here is a v2 patch series,
restructured with a cleaner history and without the totally different early
versions of the code.
Instead of 83 patches, there are now a total of 25 patches: 5 of them are
modifications to the radeon driver and 18 of them contain only amdkfd code.
No code is removed or even modified between patches; code is only added.
The driver was renamed from radeon_kfd to amdkfd and moved to reside under
drm/radeon/amdkfd. This move was done to emphasize the fact that this driver is
an AMD-only driver at this point. Having said that, we do foresee a generic HSA
framework being implemented in the future and, in that case, we will adjust
amdkfd to work within that framework.
As the amdkfd driver should support multiple AMD gfx drivers, we want to keep it
as a separate driver from radeon. Therefore, the amdkfd code is contained in its
own folder. The amdkfd folder was put under the radeon folder because the only
AMD gfx driver in the Linux kernel at this point is the radeon driver. Having
said that, we will probably need to move it (maybe to be directly under drm)
after we integrate with additional AMD gfx drivers.
For people who like to review using git, the v2 patch set is located at:
http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
Written by Oded Gabbay <oded.gabbay@amd.com>
Original Cover Letter:
This patch set implements a Heterogeneous System Architecture (HSA) driver for
radeon-family GPUs.
HSA allows different processor types (CPUs, DSPs, GPUs, etc.) to share system
resources more effectively via HW features including shared pageable memory,
userspace-accessible work queues, and platform-level atomics. In addition to the
memory protection mechanisms in GPUVM and IOMMUv2, the Sea Islands family of
GPUs also performs HW-level validation of commands passed in through the queues
(aka rings).
The code in this patch set is intended to serve both as a sample driver for
other HSA-compatible hardware devices and as a production driver for
radeon-family processors. The code is architected to support multiple CPUs each
with connected GPUs, although the current implementation focuses on a single
Kaveri/Berlin APU, and works alongside the existing radeon kernel graphics
driver (kgd).
AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
functionality between HSA compute and regular gfx/compute (memory, interrupts,
registers), while other functionality has been added specifically for HSA
compute (hw scheduler for virtualized compute rings). All shared hardware is
owned by the radeon graphics driver, and an interface between kfd and kgd allows
the kfd to make use of those shared resources, while HSA-specific functionality
is managed directly by kfd by submitting packets into an HSA-specific command
queue (the "HIQ").
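As a rough illustration of that split, here is the kind of function table such a
kfd<->kgd interface typically exposes; the structure and member names below are
invented for illustration and are not the interface actually defined in
radeon_kfd.h:

/* Illustration only: invented names, not the real radeon_kfd.h interface.
 * The idea is that amdkfd calls back into the graphics driver (kgd) for
 * anything that touches hardware shared with regular gfx/compute. */
#include <stddef.h>
#include <stdint.h>

struct kgd_dev;                         /* opaque device handle owned by kgd */

struct kfd2kgd_ops_sketch {
        /* buffer management services provided by the graphics driver */
        int  (*alloc_gtt_mem)(struct kgd_dev *kgd, size_t size,
                              void **cpu_ptr, uint64_t *gpu_addr);
        void (*free_gtt_mem)(struct kgd_dev *kgd, void *mem);

        /* doorbell aperture location, reported by kgd to amdkfd */
        void (*get_doorbell_range)(struct kgd_dev *kgd, uint64_t *base,
                                   size_t *size);

        /* routing of HSA interrupts from the shared interrupt block */
        void (*set_interrupt_handler)(struct kgd_dev *kgd,
                                      void (*handler)(void *priv), void *priv);
};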
During kfd module initialization a char device node (/dev/kfd) is created
(surviving until module exit), with ioctls for queue creation & management, and
data structures are initialized for managing HSA device topology.
The rest of the initialization is driven by calls from the radeon kgd at the
following points:
- radeon_init (kfd_init)
- radeon_exit (kfd_fini)
- radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
- radeon_driver_unload_kms (kfd_device_fini)
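For reference, a minimal sketch of what such a module init/exit pair looks like,
assuming plain char device registration (illustrative boilerplate only, not the
patch set's kfd_module.c or kfd_chardev.c; error handling is trimmed):

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/device.h>

static long kfd_sketch_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
        return -EINVAL;         /* queue create/destroy/update would go here */
}

static const struct file_operations kfd_sketch_fops = {
        .owner          = THIS_MODULE,
        .unlocked_ioctl = kfd_sketch_ioctl,
};

static int kfd_major;
static struct class *kfd_class;

static int __init kfd_sketch_init(void)
{
        kfd_major = register_chrdev(0, "kfd", &kfd_sketch_fops);
        if (kfd_major < 0)
                return kfd_major;
        kfd_class = class_create(THIS_MODULE, "kfd");   /* 3.x-era signature */
        device_create(kfd_class, NULL, MKDEV(kfd_major, 0), NULL, "kfd");
        return 0;
}

static void __exit kfd_sketch_exit(void)
{
        device_destroy(kfd_class, MKDEV(kfd_major, 0));
        class_destroy(kfd_class);
        unregister_chrdev(kfd_major, "kfd");
}

module_init(kfd_sketch_init);
module_exit(kfd_sketch_exit);
MODULE_LICENSE("GPL");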
During the probe and init processing, per-device data structures are established
which connect to the associated graphics kernel driver. This information is
exposed to userspace via sysfs, along with a version number allowing userspace
to determine if a topology change has occurred while it was reading from sysfs.
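A small userspace sketch of how that version number can be used; the sysfs path
below is a hypothetical placeholder, since the real layout is whatever the
topology module exposes. The point is the retry loop around the generation
number:

#include <stdio.h>

static long read_long(const char *path)
{
        long v = -1;
        FILE *f = fopen(path, "r");
        if (f) {
                if (fscanf(f, "%ld", &v) != 1)
                        v = -1;
                fclose(f);
        }
        return v;
}

int main(void)
{
        /* hypothetical path, for illustration only */
        const char *gen = "/sys/class/kfd/topology/generation_id";
        long before, after;

        do {
                before = read_long(gen);
                /* ... walk the per-node directories and parse their properties ... */
                after = read_long(gen);
        } while (before != after);      /* topology changed mid-read: start over */

        return 0;
}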
The interface between kfd and kgd also allows the kfd to request buffer
management services from kgd, and allows kgd to route interrupt requests to kfd
code since the interrupt block is shared between regular graphics/compute and
HSA compute subsystems in the GPU.
The kfd code works with an open source usermode library ("libhsakmt") which is
in the final stages of IP review and should be published in a separate repo over
the next few days.
The code operates in one of three modes, selectable via the sched_policy module
parameter:
- sched_policy=0 uses a hardware scheduler running in the MEC block within CP,
and allows oversubscription (more queues than HW slots)
- sched_policy=1 also uses HW scheduling but does not allow oversubscription, so
create_queue requests fail when we run out of HW slots
- sched_policy=2 does not use HW scheduling, so the driver manually assigns
queues to HW slots by programming registers
The "no HW scheduling" option is for debug & new hardware bringup only, so has
less test coverage than the other options. Default in the current code is "HW
scheduling without oversubscription" since that is where we have the most test
coverage but we expect to change the default to "HW scheduling with
oversubscription" after further testing. This effectively removes the HW limit
on the number of work queues available to applications.
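A sketch of how such a module parameter is typically declared on the kernel side
(the exact declaration in amdkfd may differ):

#include <linux/module.h>

/* values mirror the list above; 1 is the current default per this cover letter */
static int sched_policy = 1;
module_param(sched_policy, int, 0444);
MODULE_PARM_DESC(sched_policy,
        "0 = HW scheduling with oversubscription, "
        "1 = HW scheduling without oversubscription (default), "
        "2 = no HW scheduling (debug/bringup only)");

Selecting the debug path at load time would then be something like
"modprobe amdkfd sched_policy=2".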
Programs running on the GPU are associated with an address space through the
VMID field, which is translated to a unique PASID at access time via a set of 16
VMID-to-PASID mapping registers. The available VMIDs (currently 16) are
partitioned (under control of the radeon kgd) between current gfx/compute and
HSA compute, with each getting 8 in the current code. The VMID-to-PASID mapping
registers are updated by the HW scheduler when used, and by driver code if HW
scheduling is not being used.
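To make the partitioning concrete, here is an illustration with an invented
register layout; the register offset and the choice of which half of the VMID
space goes to HSA compute are assumptions, not taken from the CIK headers:

#include <linux/io.h>
#include <linux/types.h>

#define FIRST_COMPUTE_VMID      8                    /* assumption: VMIDs 8..15 for HSA */
#define VMID_PASID_MAPPING(v)   (0x1000 + (v) * 4)   /* hypothetical register offset */

/* the kind of update the driver does itself when HW scheduling is not used */
static void set_pasid_for_compute_vmid(void __iomem *mmio,
                                       unsigned int vmid, u16 pasid)
{
        if (vmid < FIRST_COMPUTE_VMID)
                return;         /* the gfx half stays under radeon kgd control */

        writel(pasid, mmio + VMID_PASID_MAPPING(vmid));
}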
The Sea Islands compute queues use a new "doorbell" mechanism instead of the
earlier kernel-managed write pointer registers. Doorbells use a separate BAR
dedicated for this purpose, and pages within the doorbell aperture are mapped to
userspace (each page mapped to only one user address space). Writes to the
doorbell aperture are intercepted by GPU hardware, allowing userspace code to
safely manage work queues (rings) without requiring a kernel call for every ring
update.
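From userspace the resulting flow looks roughly like the sketch below; the mmap
offset and the meaning of the 32-bit write are assumptions for illustration,
since the real details sit behind the usermode library. What matters is that no
kernel call is needed per ring update:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        volatile uint32_t *doorbells;
        int fd = open("/dev/kfd", O_RDWR);

        if (fd < 0)
                return 1;

        /* map one page of the doorbell aperture; offset 0 is an assumption */
        doorbells = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (doorbells == MAP_FAILED)
                return 1;

        /* tell the GPU that a ring's write pointer moved (value illustrative) */
        doorbells[0] = 16;

        munmap((void *)doorbells, 4096);
        close(fd);
        return 0;
}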
The first step for an application process is to open the kfd device. Calls to open
create a kfd "process" structure only for the first thread of the process.
Subsequent open calls are checked to see if they are from processes using the
same mm_struct and, if so, don't do anything. The kfd per-process data lives as
long as the mm_struct exists. Each mm_struct is associated with a unique PASID,
allowing the IOMMUv2 to make userspace process memory accessible to the GPU.
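A kernel-side sketch of that open behaviour, using invented helper names rather
than the patch set's actual functions: one kfd "process" per mm_struct, created
on the first open and reused for subsequent opens from the same process.

#include <linux/err.h>
#include <linux/fs.h>
#include <linux/sched.h>

struct kfd_process;                                                  /* opaque here */
struct kfd_process *kfd_lookup_process_by_mm(struct mm_struct *mm);  /* hypothetical */
struct kfd_process *kfd_create_process(struct task_struct *task);    /* hypothetical */

static int kfd_open_sketch(struct inode *inode, struct file *filep)
{
        struct kfd_process *p;

        p = kfd_lookup_process_by_mm(current->mm);
        if (!p)
                /* first open for this mm: allocate a PASID and bind the
                 * address space to the GPU through IOMMUv2 */
                p = kfd_create_process(current);
        if (IS_ERR(p))
                return PTR_ERR(p);

        filep->private_data = p;
        return 0;
}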
The next step is for the application to collect topology information via sysfs. This
gives userspace enough information to be able to identify specific nodes
(processors) in subsequent queue management calls. Application processes can
create queues on multiple processors, and processors support queues from
multiple processes.
At this point the application can create work queues in userspace memory and
pass them through the usermode library to kfd to have them mapped onto HW queue
slots so that commands written to the queues can be executed by the GPU. Queue
operations specify a processor node, and so the bulk of this code is
device-specific.
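A userspace sketch of that flow; the ioctl number and argument structure below
are invented placeholders standing in for whatever include/uapi/linux/kfd_ioctl.h
really defines, and only the overall flow (allocate a ring in user memory, ask
kfd to map it onto a HW queue slot on a chosen node) comes from the description
above:

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct fake_create_queue_args {          /* placeholder, not the real uapi */
        uint64_t ring_base_address;      /* userspace ring buffer */
        uint32_t ring_size;
        uint32_t gpu_id;                 /* node picked from the sysfs topology */
        uint32_t queue_id;               /* filled in by the driver */
};
#define FAKE_IOC_CREATE_QUEUE _IOWR('K', 1, struct fake_create_queue_args)

int main(void)
{
        struct fake_create_queue_args args = {0};
        void *ring;
        int fd = open("/dev/kfd", O_RDWR);

        if (fd < 0 || posix_memalign(&ring, 4096, 64 * 1024))
                return 1;

        args.ring_base_address = (uint64_t)(uintptr_t)ring;
        args.ring_size = 64 * 1024;
        args.gpu_id = 1;                          /* illustrative node id */

        if (ioctl(fd, FAKE_IOC_CREATE_QUEUE, &args))
                return 1;

        /* ... write command packets into "ring", then ring its doorbell ... */
        close(fd);
        return 0;
}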
Written by John Bridgman <John.Bridgman@amd.com>
Alexey Skidanov (1):
amdkfd: Implement the Get Process Aperture IOCTL
Andrew Lewycky (3):
amdkfd: Add basic modules to amdkfd
amdkfd: Add interrupt handling module
amdkfd: Implement the Set Memory Policy IOCTL
Ben Goz (8):
amdkfd: Add queue module
amdkfd: Add mqd_manager module
amdkfd: Add kernel queue module
amdkfd: Add module parameter of scheduling policy
amdkfd: Add packet manager module
amdkfd: Add process queue manager module
amdkfd: Add device queue manager module
amdkfd: Implement the create/destroy/update queue IOCTLs
Evgeny Pinchuk (3):
amdkfd: Add topology module to amdkfd
amdkfd: Implement the Get Clock Counters IOCTL
amdkfd: Implement the PMC Acquire/Release IOCTLs
Oded Gabbay (10):
mm: Add kfd_process pointer to mm_struct
drm/radeon: reduce number of free VMIDs and pipes in KV
drm/radeon/cik: Don't touch int of pipes 1-7
drm/radeon: Report doorbell configuration to amdkfd
drm/radeon: adding synchronization for GRBM GFX
drm/radeon: Add radeon <--> amdkfd interface
Update MAINTAINERS and CREDITS files with amdkfd info
amdkfd: Add IOCTL set definitions of amdkfd
amdkfd: Add amdkfd skeleton driver
amdkfd: Add binding/unbinding calls to amd_iommu driver
CREDITS | 7 +
MAINTAINERS | 10 +
drivers/gpu/drm/radeon/Kconfig | 2 +
drivers/gpu/drm/radeon/Makefile | 3 +
drivers/gpu/drm/radeon/amdkfd/Kconfig | 10 +
drivers/gpu/drm/radeon/amdkfd/Makefile | 14 +
drivers/gpu/drm/radeon/amdkfd/cik_mqds.h | 185 +++
drivers/gpu/drm/radeon/amdkfd/cik_regs.h | 220 ++++
drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c | 123 ++
drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c | 518 +++++++++
drivers/gpu/drm/radeon/amdkfd/kfd_crat.h | 294 +++++
drivers/gpu/drm/radeon/amdkfd/kfd_device.c | 254 ++++
.../drm/radeon/amdkfd/kfd_device_queue_manager.c | 985 ++++++++++++++++
.../drm/radeon/amdkfd/kfd_device_queue_manager.h | 101 ++
drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c | 264 +++++
drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c | 161 +++
drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c | 305 +++++
drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h | 66 ++
drivers/gpu/drm/radeon/amdkfd/kfd_module.c | 131 +++
drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c | 291 +++++
drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h | 54 +
drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c | 488 ++++++++
drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c | 97 ++
drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h | 682 +++++++++++
drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h | 107 ++
drivers/gpu/drm/radeon/amdkfd/kfd_priv.h | 466 ++++++++
drivers/gpu/drm/radeon/amdkfd/kfd_process.c | 405 +++++++
.../drm/radeon/amdkfd/kfd_process_queue_manager.c | 343 ++++++
drivers/gpu/drm/radeon/amdkfd/kfd_queue.c | 109 ++
drivers/gpu/drm/radeon/amdkfd/kfd_topology.c | 1207 ++++++++++++++++++++
drivers/gpu/drm/radeon/amdkfd/kfd_topology.h | 168 +++
drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c | 96 ++
drivers/gpu/drm/radeon/cik.c | 154 +--
drivers/gpu/drm/radeon/cik_reg.h | 65 ++
drivers/gpu/drm/radeon/cikd.h | 51 +-
drivers/gpu/drm/radeon/radeon.h | 9 +
drivers/gpu/drm/radeon/radeon_device.c | 32 +
drivers/gpu/drm/radeon/radeon_drv.c | 5 +
drivers/gpu/drm/radeon/radeon_kfd.c | 566 +++++++++
drivers/gpu/drm/radeon/radeon_kfd.h | 119 ++
drivers/gpu/drm/radeon/radeon_kms.c | 7 +
include/linux/mm_types.h | 14 +
include/uapi/linux/kfd_ioctl.h | 133 +++
43 files changed, 9226 insertions(+), 95 deletions(-)
create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
create mode 100644 include/uapi/linux/kfd_ioctl.h
--
1.9.1
* Re: [PATCH v2 00/25] AMDKFD kernel driver
From: Jerome Glisse @ 2014-07-20 17:46 UTC
To: Oded Gabbay
Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel,
Andrew Lewycky, Christian König, Michel Dänzer, Ben Goz, Alexey Skidanov,
Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
linux-mm

On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> Forgot to cc mailing list on cover letter. Sorry.
> [...]
> For people who like to review using git, the v2 patch set is located at:
> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>
> Written by Oded Gabbay <oded.gabbay@amd.com>

So, quick comments before I finish going over all the patches. There are many
things that need more documentation, especially as right now there is no
userspace I can go look at.

There are a few show stoppers. The biggest one is GPU memory pinning: this is
a big no, and it would need serious arguments for any hope of convincing me on
that side.

It might be better to add a drivers/gpu/drm/amd directory and add the common
stuff there.

Given that this is not intended to be the final HSA API, AFAICT, I would say it
is far better to avoid the whole kfd module and add ioctls to radeon. This
would avoid the crazy communication between radeon and kfd.

The whole aperture business needs some serious explanation. Especially as you
want to use userspace addresses, there is nothing to prevent a userspace
program from allocating things at the addresses you reserve for lds, scratch,
...; the only sane way would be to move those lds and scratch apertures inside
the virtual address range reserved for the kernel (see the kernel memory map).

The whole business of locking performance counters for exclusive per-process
access is a big NO, which leads me to the questionable usefulness of the
userspace command ring. I only see issues with that. First and foremost, I
would need to see solid figures showing that a kernel ioctl or syscall has an
overhead that is measurably higher, in any meaningful way, than a simple
function call. I know the userspace command ring is a big marketing feature
that pleases ignorant userspace programmers, but really it only brings issues
and absolutely no upside AFAICT.

So I would rather see a very simple ioctl that writes the doorbell, and that
might do more than that in the ring/queue overcommit case, where it would
first have to wait for a free ring/queue before scheduling stuff. This would
also allow a sane implementation of things like performance counters, which
could be acquired by the kernel for the duration of a job submitted by
userspace. While still not optimal, this would be better than userspace
locking.

I might have more thoughts once I am done with all the patches.

Cheers,
Jerome
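A minimal sketch of the submit-ioctl approach proposed above, with invented
structure and field names (an illustration of the idea, not code from the patch
set or from radeon):

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/types.h>

struct hsa_queue_sketch {
        u32 ring_size_dw;               /* ring size in dwords */
        void __iomem *doorbell;         /* kernel mapping of this queue's doorbell */
};

static long hsa_submit_sketch(struct hsa_queue_sketch *q, u32 new_wptr)
{
        if (new_wptr >= q->ring_size_dw)
                return -EINVAL;         /* kernel-side bounds check */

        /*
         * With ring/queue overcommit this is also where the kernel could wait
         * for a free HW slot, or hold performance counters for the duration
         * of the job, before letting the hardware see the update.
         */
        writel(new_wptr, q->doorbell);
        return 0;
}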
* Re: [PATCH v2 00/25] AMDKFD kernel driver
From: Jerome Glisse @ 2014-07-21 3:03 UTC
To: Oded Gabbay
Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel,
Andrew Lewycky, Christian König, Michel Dänzer, Ben Goz, Alexey Skidanov,
Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
linux-mm

On Sun, Jul 20, 2014 at 01:46:53PM -0400, Jerome Glisse wrote:
> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> > [...]
> [...]
> The whole aperture business needs some serious explanation. Especially as you
> want to use userspace addresses, there is nothing to prevent a userspace
> program from allocating things at the addresses you reserve for lds, scratch,
> ...; the only sane way would be to move those lds and scratch apertures inside
> the virtual address range reserved for the kernel (see the kernel memory map).

So I skimmed over the IOMMUv2 specification, and while IOMMUv2 claims to obey
the user/supervisor flags of the CPU page table, it does not seem that this is
a property set in the IOMMU against a PASID (i.e. whether a given PASID is
allowed supervisor access or not). It seems that the supervisor bit is part of
the PCIe TLP request, which I assume is controlled by the GPU.

So how is this bit set? How can we make sure that there is no way to abuse it?

> [...]
* Re: [PATCH v2 00/25] AMDKFD kernel driver
From: Daniel Vetter @ 2014-07-21 7:01 UTC
To: Jerome Glisse
Cc: Oded Gabbay, David Airlie, Alex Deucher, Andrew Morton, John Bridgman,
Joerg Roedel, Andrew Lewycky, Christian König, Michel Dänzer, Ben Goz,
Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org,
dri-devel@lists.freedesktop.org, linux-mm

On Sun, Jul 20, 2014 at 01:46:53PM -0400, Jerome Glisse wrote:
> [...]
> So I would rather see a very simple ioctl that writes the doorbell, and that
> might do more than that in the ring/queue overcommit case, where it would
> first have to wait for a free ring/queue before scheduling stuff. This would
> also allow a sane implementation of things like performance counters, which
> could be acquired by the kernel for the duration of a job submitted by
> userspace. While still not optimal, this would be better than userspace
> locking.

Quick aside and mostly off the record: in i915 we plan to have the first
implementation exactly as Jerome suggests here:
- A new flag at context creation for svm/seamless-gpgpu contexts.
- A new ioctl in i915 for submitting stuff to the hw (through the doorbell or
  whatever else we want to do). The ring in the ctx would be under the
  kernel's control.

Of course there's lots of GEM stuff we don't need at all for such contexts,
but there's still lots of shared code. IMO creating a second driver has too
much interface surface and so is a maintenance hell.

And the ioctl submission gives us flexibility in case the hw doesn't quite
live up to its promise (e.g. scheduling, cmd parsing, ...).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
* Re: [PATCH v2 00/25] AMDKFD kernel driver
From: Christian König @ 2014-07-21 9:34 UTC
To: Jerome Glisse, Oded Gabbay, David Airlie, Alex Deucher, Andrew Morton,
John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz,
Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org,
dri-devel@lists.freedesktop.org, linux-mm

On 21.07.2014 09:01, Daniel Vetter wrote:
> On Sun, Jul 20, 2014 at 01:46:53PM -0400, Jerome Glisse wrote:
> > [...]
> Quick aside and mostly off the record: in i915 we plan to have the first
> implementation exactly as Jerome suggests here:
> - A new flag at context creation for svm/seamless-gpgpu contexts.
> - A new ioctl in i915 for submitting stuff to the hw (through the doorbell or
>   whatever else we want to do). The ring in the ctx would be under the
>   kernel's control.

And looking at the existing Radeon code, that's exactly what we are already
doing with the compute queues on CIK as well. We just use the existing command
submission interface, because when you use an IOCTL anyway it's no longer
beneficial to do all the complex scheduling and other stuff directly on the
hardware.

What's mostly missing in the existing module is proper support for accessing
the CPU address space through IOMMUv2.

Christian.

> [...]
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-20 17:46 ` Jerome Glisse 2014-07-21 3:03 ` Jerome Glisse 2014-07-21 7:01 ` Daniel Vetter @ 2014-07-21 12:36 ` Oded Gabbay 2014-07-21 13:39 ` Christian König 2 siblings, 1 reply; 49+ messages in thread From: Oded Gabbay @ 2014-07-21 12:36 UTC (permalink / raw) To: Jerome Glisse Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Christian König, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On 20/07/14 20:46, Jerome Glisse wrote: > On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >> Forgot to cc mailing list on cover letter. Sorry. >> >> As a continuation to the existing discussion, here is a v2 patch series >> restructured with a cleaner history and no totally-different-early-versions >> of the code. >> >> Instead of 83 patches, there are now a total of 25 patches, where 5 of them >> are modifications to radeon driver and 18 of them include only amdkfd code. >> There is no code going away or even modified between patches, only added. >> >> The driver was renamed from radeon_kfd to amdkfd and moved to reside under >> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver >> is an AMD-only driver at this point. Having said that, we do foresee a >> generic hsa framework being implemented in the future and in that case, we >> will adjust amdkfd to work within that framework. >> >> As the amdkfd driver should support multiple AMD gfx drivers, we want to >> keep it as a seperate driver from radeon. Therefore, the amdkfd code is >> contained in its own folder. The amdkfd folder was put under the radeon >> folder because the only AMD gfx driver in the Linux kernel at this point >> is the radeon driver. Having said that, we will probably need to move it >> (maybe to be directly under drm) after we integrate with additional AMD gfx >> drivers. >> >> For people who like to review using git, the v2 patch set is located at: >> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 >> >> Written by Oded Gabbayh <oded.gabbay@amd.com> > > So quick comments before i finish going over all patches. There is many > things that need more documentation espacialy as of right now there is > no userspace i can go look at. So quick comments on some of your questions but first of all, thanks for the time you dedicated to review the code. > > There few show stopper, biggest one is gpu memory pinning this is a big > no, that would need serious arguments for any hope of convincing me on > that side. We only do gpu memory pinning for kernel objects. There are no userspace objects that are pinned on the gpu memory in our driver. If that is the case, is it still a show stopper ? The kernel objects are: - pipelines (4 per device) - mqd per hiq (only 1 per device) - mqd per userspace queue. On KV, we support up to 1K queues per process, for a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in 256 alignment. So total *possible* memory is 128MB - kernel queue (only 1 per device) - fence address for kernel queue - runlists for the CP (1 or 2 per device) > > It might be better to add a drivers/gpu/drm/amd directory and add common > stuff there. > > Given that this is not intended to be final HSA api AFAICT then i would > say this far better to avoid the whole kfd module and add ioctl to radeon. > This would avoid crazy communication btw radeon and kfd. 
> > The whole aperture business needs some serious explanation. Especialy as > you want to use userspace address there is nothing to prevent userspace > program from allocating things at address you reserve for lds, scratch, > ... only sane way would be to move those lds, scratch inside the virtual > address reserved for kernel (see kernel memory map). > > The whole business of locking performance counter for exclusive per process > access is a big NO. Which leads me to the questionable usefullness of user > space command ring. That's like saying: "Which leads me to the questionable usefulness of HSA". I find it analogous to a situation where a network maintainer nacking a driver for a network card, which is slower than a different network card. Doesn't seem reasonable this situation is would happen. He would still put both the drivers in the kernel because people want to use the H/W and its features. So, I don't think this is a valid reason to NACK the driver. > I only see issues with that. First and foremost i would > need to see solid figures that kernel ioctl or syscall has a higher an > overhead that is measurable in any meaning full way against a simple > function call. I know the userspace command ring is a big marketing features > that please ignorant userspace programmer. But really this only brings issues > and for absolutely not upside afaict. Really ? You think that doing a context switch to kernel space, with all its overhead, is _not_ more expansive than just calling a function in userspace which only puts a buffer on a ring and writes a doorbell ? > > So i would rather see a very simple ioctl that write the doorbell and might > do more than that in case of ring/queue overcommit where it would first have > to wait for a free ring/queue to schedule stuff. This would also allow sane > implementation of things like performance counter that could be acquire by > kernel for duration of a job submitted by userspace. While still not optimal > this would be better that userspace locking. > > > I might have more thoughts once i am done with all the patches. > > Cheers, > Jérôme > >> >> Original Cover Letter: >> >> This patch set implements a Heterogeneous System Architecture (HSA) driver >> for radeon-family GPUs. >> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share >> system resources more effectively via HW features including shared pageable >> memory, userspace-accessible work queues, and platform-level atomics. In >> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea >> Islands family of GPUs also performs HW-level validation of commands passed >> in through the queues (aka rings). >> >> The code in this patch set is intended to serve both as a sample driver for >> other HSA-compatible hardware devices and as a production driver for >> radeon-family processors. The code is architected to support multiple CPUs >> each with connected GPUs, although the current implementation focuses on a >> single Kaveri/Berlin APU, and works alongside the existing radeon kernel >> graphics driver (kgd). >> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware >> functionality between HSA compute and regular gfx/compute (memory, >> interrupts, registers), while other functionality has been added >> specifically for HSA compute (hw scheduler for virtualized compute rings). 
>> All shared hardware is owned by the radeon graphics driver, and an interface >> between kfd and kgd allows the kfd to make use of those shared resources, >> while HSA-specific functionality is managed directly by kfd by submitting >> packets into an HSA-specific command queue (the "HIQ"). >> >> During kfd module initialization a char device node (/dev/kfd) is created >> (surviving until module exit), with ioctls for queue creation & management, >> and data structures are initialized for managing HSA device topology. >> The rest of the initialization is driven by calls from the radeon kgd at the >> following points : >> >> - radeon_init (kfd_init) >> - radeon_exit (kfd_fini) >> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init) >> - radeon_driver_unload_kms (kfd_device_fini) >> >> During the probe and init processing per-device data structures are >> established which connect to the associated graphics kernel driver. This >> information is exposed to userspace via sysfs, along with a version number >> allowing userspace to determine if a topology change has occurred while it >> was reading from sysfs. >> The interface between kfd and kgd also allows the kfd to request buffer >> management services from kgd, and allows kgd to route interrupt requests to >> kfd code since the interrupt block is shared between regular >> graphics/compute and HSA compute subsystems in the GPU. >> >> The kfd code works with an open source usermode library ("libhsakmt") which >> is in the final stages of IP review and should be published in a separate >> repo over the next few days. >> The code operates in one of three modes, selectable via the sched_policy >> module parameter : >> >> - sched_policy=0 uses a hardware scheduler running in the MEC block within >> CP, and allows oversubscription (more queues than HW slots) >> - sched_policy=1 also uses HW scheduling but does not allow >> oversubscription, so create_queue requests fail when we run out of HW slots >> - sched_policy=2 does not use HW scheduling, so the driver manually assigns >> queues to HW slots by programming registers >> >> The "no HW scheduling" option is for debug & new hardware bringup only, so >> has less test coverage than the other options. Default in the current code >> is "HW scheduling without oversubscription" since that is where we have the >> most test coverage but we expect to change the default to "HW scheduling >> with oversubscription" after further testing. This effectively removes the >> HW limit on the number of work queues available to applications. >> >> Programs running on the GPU are associated with an address space through the >> VMID field, which is translated to a unique PASID at access time via a set >> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) >> are partitioned (under control of the radeon kgd) between current >> gfx/compute and HSA compute, with each getting 8 in the current code. The >> VMID-to-PASID mapping registers are updated by the HW scheduler when used, >> and by driver code if HW scheduling is not being used. >> The Sea Islands compute queues use a new "doorbell" mechanism instead of the >> earlier kernel-managed write pointer registers. Doorbells use a separate BAR >> dedicated for this purpose, and pages within the doorbell aperture are >> mapped to userspace (each page mapped to only one user address space). 
>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing >> userspace code to safely manage work queues (rings) without requiring a >> kernel call for every ring update. >> First step for an application process is to open the kfd device. Calls to >> open create a kfd "process" structure only for the first thread of the >> process. Subsequent open calls are checked to see if they are from processes >> using the same mm_struct and, if so, don't do anything. The kfd per-process >> data lives as long as the mm_struct exists. Each mm_struct is associated >> with a unique PASID, allowing the IOMMUv2 to make userspace process memory >> accessible to the GPU. >> Next step is for the application to collect topology information via sysfs. >> This gives userspace enough information to be able to identify specific >> nodes (processors) in subsequent queue management calls. Application >> processes can create queues on multiple processors, and processors support >> queues from multiple processes. >> At this point the application can create work queues in userspace memory and >> pass them through the usermode library to kfd to have them mapped onto HW >> queue slots so that commands written to the queues can be executed by the >> GPU. Queue operations specify a processor node, and so the bulk of this code >> is device-specific. >> Written by John Bridgman <John.Bridgman@amd.com> >> >> >> Alexey Skidanov (1): >> amdkfd: Implement the Get Process Aperture IOCTL >> >> Andrew Lewycky (3): >> amdkfd: Add basic modules to amdkfd >> amdkfd: Add interrupt handling module >> amdkfd: Implement the Set Memory Policy IOCTL >> >> Ben Goz (8): >> amdkfd: Add queue module >> amdkfd: Add mqd_manager module >> amdkfd: Add kernel queue module >> amdkfd: Add module parameter of scheduling policy >> amdkfd: Add packet manager module >> amdkfd: Add process queue manager module >> amdkfd: Add device queue manager module >> amdkfd: Implement the create/destroy/update queue IOCTLs >> >> Evgeny Pinchuk (3): >> amdkfd: Add topology module to amdkfd >> amdkfd: Implement the Get Clock Counters IOCTL >> amdkfd: Implement the PMC Acquire/Release IOCTLs >> >> Oded Gabbay (10): >> mm: Add kfd_process pointer to mm_struct >> drm/radeon: reduce number of free VMIDs and pipes in KV >> drm/radeon/cik: Don't touch int of pipes 1-7 >> drm/radeon: Report doorbell configuration to amdkfd >> drm/radeon: adding synchronization for GRBM GFX >> drm/radeon: Add radeon <--> amdkfd interface >> Update MAINTAINERS and CREDITS files with amdkfd info >> amdkfd: Add IOCTL set definitions of amdkfd >> amdkfd: Add amdkfd skeleton driver >> amdkfd: Add binding/unbinding calls to amd_iommu driver >> >> CREDITS | 7 + >> MAINTAINERS | 10 + >> drivers/gpu/drm/radeon/Kconfig | 2 + >> drivers/gpu/drm/radeon/Makefile | 3 + >> drivers/gpu/drm/radeon/amdkfd/Kconfig | 10 + >> drivers/gpu/drm/radeon/amdkfd/Makefile | 14 + >> drivers/gpu/drm/radeon/amdkfd/cik_mqds.h | 185 +++ >> drivers/gpu/drm/radeon/amdkfd/cik_regs.h | 220 ++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c | 123 ++ >> drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c | 518 +++++++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_crat.h | 294 +++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_device.c | 254 ++++ >> .../drm/radeon/amdkfd/kfd_device_queue_manager.c | 985 ++++++++++++++++ >> .../drm/radeon/amdkfd/kfd_device_queue_manager.h | 101 ++ >> drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c | 264 +++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c | 161 +++ >> 
drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c | 305 +++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h | 66 ++ >> drivers/gpu/drm/radeon/amdkfd/kfd_module.c | 131 +++ >> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c | 291 +++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h | 54 + >> drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c | 488 ++++++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c | 97 ++ >> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h | 682 +++++++++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h | 107 ++ >> drivers/gpu/drm/radeon/amdkfd/kfd_priv.h | 466 ++++++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_process.c | 405 +++++++ >> .../drm/radeon/amdkfd/kfd_process_queue_manager.c | 343 ++++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_queue.c | 109 ++ >> drivers/gpu/drm/radeon/amdkfd/kfd_topology.c | 1207 ++++++++++++++++++++ >> drivers/gpu/drm/radeon/amdkfd/kfd_topology.h | 168 +++ >> drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c | 96 ++ >> drivers/gpu/drm/radeon/cik.c | 154 +-- >> drivers/gpu/drm/radeon/cik_reg.h | 65 ++ >> drivers/gpu/drm/radeon/cikd.h | 51 +- >> drivers/gpu/drm/radeon/radeon.h | 9 + >> drivers/gpu/drm/radeon/radeon_device.c | 32 + >> drivers/gpu/drm/radeon/radeon_drv.c | 5 + >> drivers/gpu/drm/radeon/radeon_kfd.c | 566 +++++++++ >> drivers/gpu/drm/radeon/radeon_kfd.h | 119 ++ >> drivers/gpu/drm/radeon/radeon_kms.c | 7 + >> include/linux/mm_types.h | 14 + >> include/uapi/linux/kfd_ioctl.h | 133 +++ >> 43 files changed, 9226 insertions(+), 95 deletions(-) >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h >> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c >> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c >> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h >> create mode 
100644 include/uapi/linux/kfd_ioctl.h >> >> -- >> 1.9.1 >> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
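To make the 128MB worst case Oded quotes above concrete: each MQD is 151 bytes but allocated at 256-byte granularity, KV allows up to 1K queues per process, and the driver caps the total at 512K queues. A few lines of standalone arithmetic (not driver code) reproduce the figure and the implied process count:

#include <stdio.h>

int main(void)
{
	const unsigned long mqd_alloc_size   = 256;        /* 151-byte MQD rounded up to 256 */
	const unsigned long queues_per_proc  = 1024;       /* "up to 1K queues per process" on KV */
	const unsigned long max_total_queues = 512 * 1024; /* total queue limit quoted above */

	printf("implied max processes: %lu\n", max_total_queues / queues_per_proc);
	printf("worst-case pinned MQD memory: %lu MiB\n",
	       (max_total_queues * mqd_alloc_size) >> 20);
	return 0;
}

This prints 512 processes and 128 MiB; the bound is only reached if every one of those processes creates its full complement of queues. The objections that follow are less about the size of the bound than about the memory being pinned on userspace's behalf at all.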
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 12:36 ` Oded Gabbay @ 2014-07-21 13:39 ` Christian König 2014-07-21 14:12 ` Oded Gabbay 2014-07-21 15:25 ` Daniel Vetter 0 siblings, 2 replies; 49+ messages in thread From: Christian König @ 2014-07-21 13:39 UTC (permalink / raw) To: Oded Gabbay, Jerome Glisse Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm Am 21.07.2014 14:36, schrieb Oded Gabbay: > On 20/07/14 20:46, Jerome Glisse wrote: >> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >>> Forgot to cc mailing list on cover letter. Sorry. >>> >>> As a continuation to the existing discussion, here is a v2 patch series >>> restructured with a cleaner history and no >>> totally-different-early-versions >>> of the code. >>> >>> Instead of 83 patches, there are now a total of 25 patches, where 5 >>> of them >>> are modifications to radeon driver and 18 of them include only >>> amdkfd code. >>> There is no code going away or even modified between patches, only >>> added. >>> >>> The driver was renamed from radeon_kfd to amdkfd and moved to reside >>> under >>> drm/radeon/amdkfd. This move was done to emphasize the fact that >>> this driver >>> is an AMD-only driver at this point. Having said that, we do foresee a >>> generic hsa framework being implemented in the future and in that >>> case, we >>> will adjust amdkfd to work within that framework. >>> >>> As the amdkfd driver should support multiple AMD gfx drivers, we >>> want to >>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is >>> contained in its own folder. The amdkfd folder was put under the radeon >>> folder because the only AMD gfx driver in the Linux kernel at this >>> point >>> is the radeon driver. Having said that, we will probably need to >>> move it >>> (maybe to be directly under drm) after we integrate with additional >>> AMD gfx >>> drivers. >>> >>> For people who like to review using git, the v2 patch set is located >>> at: >>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 >>> >>> Written by Oded Gabbayh <oded.gabbay@amd.com> >> >> So quick comments before i finish going over all patches. There is many >> things that need more documentation espacialy as of right now there is >> no userspace i can go look at. > So quick comments on some of your questions but first of all, thanks > for the time you dedicated to review the code. >> >> There few show stopper, biggest one is gpu memory pinning this is a big >> no, that would need serious arguments for any hope of convincing me on >> that side. > We only do gpu memory pinning for kernel objects. There are no > userspace objects that are pinned on the gpu memory in our driver. If > that is the case, is it still a show stopper ? > > The kernel objects are: > - pipelines (4 per device) > - mqd per hiq (only 1 per device) > - mqd per userspace queue. On KV, we support up to 1K queues per > process, for a total of 512K queues. Each mqd is 151 bytes, but the > allocation is done in 256 alignment. So total *possible* memory is 128MB > - kernel queue (only 1 per device) > - fence address for kernel queue > - runlists for the CP (1 or 2 per device) The main questions here are if it's avoid able to pin down the memory and if the memory is pinned down at driver load, by request from userspace or by anything else. 
As far as I can see only the "mqd per userspace queue" might be a bit questionable, everything else sounds reasonable. Christian. >> >> It might be better to add a drivers/gpu/drm/amd directory and add common >> stuff there. >> >> Given that this is not intended to be final HSA api AFAICT then i would >> say this far better to avoid the whole kfd module and add ioctl to >> radeon. >> This would avoid crazy communication btw radeon and kfd. >> >> The whole aperture business needs some serious explanation. Especialy as >> you want to use userspace address there is nothing to prevent userspace >> program from allocating things at address you reserve for lds, scratch, >> ... only sane way would be to move those lds, scratch inside the virtual >> address reserved for kernel (see kernel memory map). >> >> The whole business of locking performance counter for exclusive per >> process >> access is a big NO. Which leads me to the questionable usefullness of >> user >> space command ring. > That's like saying: "Which leads me to the questionable usefulness of > HSA". I find it analogous to a situation where a network maintainer > nacking a driver for a network card, which is slower than a different > network card. Doesn't seem reasonable this situation is would happen. > He would still put both the drivers in the kernel because people want > to use the H/W and its features. So, I don't think this is a valid > reason to NACK the driver. > >> I only see issues with that. First and foremost i would >> need to see solid figures that kernel ioctl or syscall has a higher an >> overhead that is measurable in any meaning full way against a simple >> function call. I know the userspace command ring is a big marketing >> features >> that please ignorant userspace programmer. But really this only >> brings issues >> and for absolutely not upside afaict. > Really ? You think that doing a context switch to kernel space, with > all its overhead, is _not_ more expansive than just calling a function > in userspace which only puts a buffer on a ring and writes a doorbell ? >> >> So i would rather see a very simple ioctl that write the doorbell and >> might >> do more than that in case of ring/queue overcommit where it would >> first have >> to wait for a free ring/queue to schedule stuff. This would also >> allow sane >> implementation of things like performance counter that could be >> acquire by >> kernel for duration of a job submitted by userspace. While still not >> optimal >> this would be better that userspace locking. >> >> >> I might have more thoughts once i am done with all the patches. >> >> Cheers, >> Jerome >> >>> >>> Original Cover Letter: >>> >>> This patch set implements a Heterogeneous System Architecture (HSA) >>> driver >>> for radeon-family GPUs. >>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share >>> system resources more effectively via HW features including shared >>> pageable >>> memory, userspace-accessible work queues, and platform-level >>> atomics. In >>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, >>> the Sea >>> Islands family of GPUs also performs HW-level validation of commands >>> passed >>> in through the queues (aka rings). >>> >>> The code in this patch set is intended to serve both as a sample >>> driver for >>> other HSA-compatible hardware devices and as a production driver for >>> radeon-family processors. 
The code is architected to support >>> multiple CPUs >>> each with connected GPUs, although the current implementation >>> focuses on a >>> single Kaveri/Berlin APU, and works alongside the existing radeon >>> kernel >>> graphics driver (kgd). >>> AMD GPUs designed for use with HSA (Sea Islands and up) share some >>> hardware >>> functionality between HSA compute and regular gfx/compute (memory, >>> interrupts, registers), while other functionality has been added >>> specifically for HSA compute (hw scheduler for virtualized compute >>> rings). >>> All shared hardware is owned by the radeon graphics driver, and an >>> interface >>> between kfd and kgd allows the kfd to make use of those shared >>> resources, >>> while HSA-specific functionality is managed directly by kfd by >>> submitting >>> packets into an HSA-specific command queue (the "HIQ"). >>> >>> During kfd module initialization a char device node (/dev/kfd) is >>> created >>> (surviving until module exit), with ioctls for queue creation & >>> management, >>> and data structures are initialized for managing HSA device topology. >>> The rest of the initialization is driven by calls from the radeon >>> kgd at the >>> following points : >>> >>> - radeon_init (kfd_init) >>> - radeon_exit (kfd_fini) >>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init) >>> - radeon_driver_unload_kms (kfd_device_fini) >>> >>> During the probe and init processing per-device data structures are >>> established which connect to the associated graphics kernel driver. >>> This >>> information is exposed to userspace via sysfs, along with a version >>> number >>> allowing userspace to determine if a topology change has occurred >>> while it >>> was reading from sysfs. >>> The interface between kfd and kgd also allows the kfd to request buffer >>> management services from kgd, and allows kgd to route interrupt >>> requests to >>> kfd code since the interrupt block is shared between regular >>> graphics/compute and HSA compute subsystems in the GPU. >>> >>> The kfd code works with an open source usermode library >>> ("libhsakmt") which >>> is in the final stages of IP review and should be published in a >>> separate >>> repo over the next few days. >>> The code operates in one of three modes, selectable via the >>> sched_policy >>> module parameter : >>> >>> - sched_policy=0 uses a hardware scheduler running in the MEC block >>> within >>> CP, and allows oversubscription (more queues than HW slots) >>> - sched_policy=1 also uses HW scheduling but does not allow >>> oversubscription, so create_queue requests fail when we run out of >>> HW slots >>> - sched_policy=2 does not use HW scheduling, so the driver manually >>> assigns >>> queues to HW slots by programming registers >>> >>> The "no HW scheduling" option is for debug & new hardware bringup >>> only, so >>> has less test coverage than the other options. Default in the >>> current code >>> is "HW scheduling without oversubscription" since that is where we >>> have the >>> most test coverage but we expect to change the default to "HW >>> scheduling >>> with oversubscription" after further testing. This effectively >>> removes the >>> HW limit on the number of work queues available to applications. >>> >>> Programs running on the GPU are associated with an address space >>> through the >>> VMID field, which is translated to a unique PASID at access time via >>> a set >>> of 16 VMID-to-PASID mapping registers. 
The available VMIDs >>> (currently 16) >>> are partitioned (under control of the radeon kgd) between current >>> gfx/compute and HSA compute, with each getting 8 in the current >>> code. The >>> VMID-to-PASID mapping registers are updated by the HW scheduler when >>> used, >>> and by driver code if HW scheduling is not being used. >>> The Sea Islands compute queues use a new "doorbell" mechanism >>> instead of the >>> earlier kernel-managed write pointer registers. Doorbells use a >>> separate BAR >>> dedicated for this purpose, and pages within the doorbell aperture are >>> mapped to userspace (each page mapped to only one user address space). >>> Writes to the doorbell aperture are intercepted by GPU hardware, >>> allowing >>> userspace code to safely manage work queues (rings) without requiring a >>> kernel call for every ring update. >>> First step for an application process is to open the kfd device. >>> Calls to >>> open create a kfd "process" structure only for the first thread of the >>> process. Subsequent open calls are checked to see if they are from >>> processes >>> using the same mm_struct and, if so, don't do anything. The kfd >>> per-process >>> data lives as long as the mm_struct exists. Each mm_struct is >>> associated >>> with a unique PASID, allowing the IOMMUv2 to make userspace process >>> memory >>> accessible to the GPU. >>> Next step is for the application to collect topology information via >>> sysfs. >>> This gives userspace enough information to be able to identify specific >>> nodes (processors) in subsequent queue management calls. Application >>> processes can create queues on multiple processors, and processors >>> support >>> queues from multiple processes. >>> At this point the application can create work queues in userspace >>> memory and >>> pass them through the usermode library to kfd to have them mapped >>> onto HW >>> queue slots so that commands written to the queues can be executed >>> by the >>> GPU. Queue operations specify a processor node, and so the bulk of >>> this code >>> is device-specific. 
>>> Written by John Bridgman <John.Bridgman@amd.com> >>> >>> >>> Alexey Skidanov (1): >>> amdkfd: Implement the Get Process Aperture IOCTL >>> >>> Andrew Lewycky (3): >>> amdkfd: Add basic modules to amdkfd >>> amdkfd: Add interrupt handling module >>> amdkfd: Implement the Set Memory Policy IOCTL >>> >>> Ben Goz (8): >>> amdkfd: Add queue module >>> amdkfd: Add mqd_manager module >>> amdkfd: Add kernel queue module >>> amdkfd: Add module parameter of scheduling policy >>> amdkfd: Add packet manager module >>> amdkfd: Add process queue manager module >>> amdkfd: Add device queue manager module >>> amdkfd: Implement the create/destroy/update queue IOCTLs >>> >>> Evgeny Pinchuk (3): >>> amdkfd: Add topology module to amdkfd >>> amdkfd: Implement the Get Clock Counters IOCTL >>> amdkfd: Implement the PMC Acquire/Release IOCTLs >>> >>> Oded Gabbay (10): >>> mm: Add kfd_process pointer to mm_struct >>> drm/radeon: reduce number of free VMIDs and pipes in KV >>> drm/radeon/cik: Don't touch int of pipes 1-7 >>> drm/radeon: Report doorbell configuration to amdkfd >>> drm/radeon: adding synchronization for GRBM GFX >>> drm/radeon: Add radeon <--> amdkfd interface >>> Update MAINTAINERS and CREDITS files with amdkfd info >>> amdkfd: Add IOCTL set definitions of amdkfd >>> amdkfd: Add amdkfd skeleton driver >>> amdkfd: Add binding/unbinding calls to amd_iommu driver >>> >>> CREDITS | 7 + >>> MAINTAINERS | 10 + >>> drivers/gpu/drm/radeon/Kconfig | 2 + >>> drivers/gpu/drm/radeon/Makefile | 3 + >>> drivers/gpu/drm/radeon/amdkfd/Kconfig | 10 + >>> drivers/gpu/drm/radeon/amdkfd/Makefile | 14 + >>> drivers/gpu/drm/radeon/amdkfd/cik_mqds.h | 185 +++ >>> drivers/gpu/drm/radeon/amdkfd/cik_regs.h | 220 ++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c | 123 ++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c | 518 +++++++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_crat.h | 294 +++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_device.c | 254 ++++ >>> .../drm/radeon/amdkfd/kfd_device_queue_manager.c | 985 >>> ++++++++++++++++ >>> .../drm/radeon/amdkfd/kfd_device_queue_manager.h | 101 ++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c | 264 +++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c | 161 +++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c | 305 +++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h | 66 ++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_module.c | 131 +++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c | 291 +++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h | 54 + >>> drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c | 488 ++++++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c | 97 ++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h | 682 +++++++++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h | 107 ++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_priv.h | 466 ++++++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_process.c | 405 +++++++ >>> .../drm/radeon/amdkfd/kfd_process_queue_manager.c | 343 ++++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_queue.c | 109 ++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.c | 1207 >>> ++++++++++++++++++++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.h | 168 +++ >>> drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c | 96 ++ >>> drivers/gpu/drm/radeon/cik.c | 154 +-- >>> drivers/gpu/drm/radeon/cik_reg.h | 65 ++ >>> drivers/gpu/drm/radeon/cikd.h | 51 +- >>> drivers/gpu/drm/radeon/radeon.h | 9 + >>> drivers/gpu/drm/radeon/radeon_device.c | 32 + >>> drivers/gpu/drm/radeon/radeon_drv.c | 5 + >>> drivers/gpu/drm/radeon/radeon_kfd.c | 566 +++++++++ >>> 
drivers/gpu/drm/radeon/radeon_kfd.h | 119 ++ >>> drivers/gpu/drm/radeon/radeon_kms.c | 7 + >>> include/linux/mm_types.h | 14 + >>> include/uapi/linux/kfd_ioctl.h | 133 +++ >>> 43 files changed, 9226 insertions(+), 95 deletions(-) >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c >>> create mode 100644 >>> drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c >>> create mode 100644 >>> drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c >>> create mode 100644 >>> drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h >>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c >>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c >>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h >>> create mode 100644 include/uapi/linux/kfd_ioctl.h >>> >>> -- >>> 1.9.1 >>> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
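The distinction Christian draws here matters because the two cases age very differently: the per-device objects are pinned once at init and bounded by design, while a per-queue MQD is pinned in direct response to a userspace ioctl and stays pinned for the queue's lifetime, so userspace behaviour decides how much unevictable memory accumulates. A sketch of that second, questioned pattern (all names are made up for illustration; this is not the code under review):

/* 151-byte MQD rounded up to the 256-byte allocation granularity */
#define MQD_ALLOC_SIZE 256

struct hsa_device;

struct hsa_queue {
	void *mqd_cpu_ptr;		/* kernel mapping of the pinned MQD */
	unsigned long mqd_gpu_addr;	/* GPU address the CP will fetch from */
};

/* hypothetical helpers standing in for the kfd <-> kgd buffer interface */
int hsa_alloc_pinned(struct hsa_device *dev, unsigned long size,
		     void **cpu_ptr, unsigned long *gpu_addr);
int hsa_register_queue_with_cp(struct hsa_device *dev, unsigned long mqd_gpu_addr);

static int hsa_create_user_queue(struct hsa_device *dev, struct hsa_queue *q)
{
	int ret;

	/* allocate and pin GPU-accessible memory for this queue's MQD */
	ret = hsa_alloc_pinned(dev, MQD_ALLOC_SIZE,
			       &q->mqd_cpu_ptr, &q->mqd_gpu_addr);
	if (ret)
		return ret;

	/* from here on the MQD cannot be moved or evicted until queue destroy */
	return hsa_register_queue_with_cp(dev, q->mqd_gpu_addr);
}

Oded's answer in the next message is essentially that this growth is per queue creation but capped (128MB on KV), which is the trade-off Jerome then pushes back on.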
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 13:39 ` Christian König @ 2014-07-21 14:12 ` Oded Gabbay 2014-07-21 15:54 ` Jerome Glisse 2014-07-21 15:25 ` Daniel Vetter 1 sibling, 1 reply; 49+ messages in thread From: Oded Gabbay @ 2014-07-21 14:12 UTC (permalink / raw) To: Christian König, Jerome Glisse Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On 21/07/14 16:39, Christian König wrote: > Am 21.07.2014 14:36, schrieb Oded Gabbay: >> On 20/07/14 20:46, Jerome Glisse wrote: >>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >>>> Forgot to cc mailing list on cover letter. Sorry. >>>> >>>> As a continuation to the existing discussion, here is a v2 patch series >>>> restructured with a cleaner history and no totally-different-early-versions >>>> of the code. >>>> >>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them >>>> are modifications to radeon driver and 18 of them include only amdkfd code. >>>> There is no code going away or even modified between patches, only added. >>>> >>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under >>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver >>>> is an AMD-only driver at this point. Having said that, we do foresee a >>>> generic hsa framework being implemented in the future and in that case, we >>>> will adjust amdkfd to work within that framework. >>>> >>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to >>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is >>>> contained in its own folder. The amdkfd folder was put under the radeon >>>> folder because the only AMD gfx driver in the Linux kernel at this point >>>> is the radeon driver. Having said that, we will probably need to move it >>>> (maybe to be directly under drm) after we integrate with additional AMD gfx >>>> drivers. >>>> >>>> For people who like to review using git, the v2 patch set is located at: >>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 >>>> >>>> Written by Oded Gabbayh <oded.gabbay@amd.com> >>> >>> So quick comments before i finish going over all patches. There is many >>> things that need more documentation espacialy as of right now there is >>> no userspace i can go look at. >> So quick comments on some of your questions but first of all, thanks for the >> time you dedicated to review the code. >>> >>> There few show stopper, biggest one is gpu memory pinning this is a big >>> no, that would need serious arguments for any hope of convincing me on >>> that side. >> We only do gpu memory pinning for kernel objects. There are no userspace >> objects that are pinned on the gpu memory in our driver. If that is the case, >> is it still a show stopper ? >> >> The kernel objects are: >> - pipelines (4 per device) >> - mqd per hiq (only 1 per device) >> - mqd per userspace queue. On KV, we support up to 1K queues per process, for >> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in >> 256 alignment. 
So total *possible* memory is 128MB >> - kernel queue (only 1 per device) >> - fence address for kernel queue >> - runlists for the CP (1 or 2 per device) > > The main questions here are if it's avoid able to pin down the memory and if the > memory is pinned down at driver load, by request from userspace or by anything > else. > > As far as I can see only the "mqd per userspace queue" might be a bit > questionable, everything else sounds reasonable. > > Christian. Most of the pin downs are done on device initialization. The "mqd per userspace" is done per userspace queue creation. However, as I said, it has an upper limit of 128MB on KV, and considering the 2G local memory, I think it is OK. The runlists are also done on userspace queue creation/deletion, but we only have 1 or 2 runlists per device, so it is not that bad. Oded > >>> >>> It might be better to add a drivers/gpu/drm/amd directory and add common >>> stuff there. >>> >>> Given that this is not intended to be final HSA api AFAICT then i would >>> say this far better to avoid the whole kfd module and add ioctl to radeon. >>> This would avoid crazy communication btw radeon and kfd. >>> >>> The whole aperture business needs some serious explanation. Especialy as >>> you want to use userspace address there is nothing to prevent userspace >>> program from allocating things at address you reserve for lds, scratch, >>> ... only sane way would be to move those lds, scratch inside the virtual >>> address reserved for kernel (see kernel memory map). >>> >>> The whole business of locking performance counter for exclusive per process >>> access is a big NO. Which leads me to the questionable usefullness of user >>> space command ring. >> That's like saying: "Which leads me to the questionable usefulness of HSA". I >> find it analogous to a situation where a network maintainer nacking a driver >> for a network card, which is slower than a different network card. Doesn't >> seem reasonable this situation is would happen. He would still put both the >> drivers in the kernel because people want to use the H/W and its features. So, >> I don't think this is a valid reason to NACK the driver. >> >>> I only see issues with that. First and foremost i would >>> need to see solid figures that kernel ioctl or syscall has a higher an >>> overhead that is measurable in any meaning full way against a simple >>> function call. I know the userspace command ring is a big marketing features >>> that please ignorant userspace programmer. But really this only brings issues >>> and for absolutely not upside afaict. >> Really ? You think that doing a context switch to kernel space, with all its >> overhead, is _not_ more expansive than just calling a function in userspace >> which only puts a buffer on a ring and writes a doorbell ? >>> >>> So i would rather see a very simple ioctl that write the doorbell and might >>> do more than that in case of ring/queue overcommit where it would first have >>> to wait for a free ring/queue to schedule stuff. This would also allow sane >>> implementation of things like performance counter that could be acquire by >>> kernel for duration of a job submitted by userspace. While still not optimal >>> this would be better that userspace locking. >>> >>> >>> I might have more thoughts once i am done with all the patches. >>> >>> Cheers, >>> Jérôme >>> >>>> >>>> Original Cover Letter: >>>> >>>> This patch set implements a Heterogeneous System Architecture (HSA) driver >>>> for radeon-family GPUs. 
>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share >>>> system resources more effectively via HW features including shared pageable >>>> memory, userspace-accessible work queues, and platform-level atomics. In >>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea >>>> Islands family of GPUs also performs HW-level validation of commands passed >>>> in through the queues (aka rings). >>>> >>>> The code in this patch set is intended to serve both as a sample driver for >>>> other HSA-compatible hardware devices and as a production driver for >>>> radeon-family processors. The code is architected to support multiple CPUs >>>> each with connected GPUs, although the current implementation focuses on a >>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel >>>> graphics driver (kgd). >>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware >>>> functionality between HSA compute and regular gfx/compute (memory, >>>> interrupts, registers), while other functionality has been added >>>> specifically for HSA compute (hw scheduler for virtualized compute rings). >>>> All shared hardware is owned by the radeon graphics driver, and an interface >>>> between kfd and kgd allows the kfd to make use of those shared resources, >>>> while HSA-specific functionality is managed directly by kfd by submitting >>>> packets into an HSA-specific command queue (the "HIQ"). >>>> >>>> During kfd module initialization a char device node (/dev/kfd) is created >>>> (surviving until module exit), with ioctls for queue creation & management, >>>> and data structures are initialized for managing HSA device topology. >>>> The rest of the initialization is driven by calls from the radeon kgd at the >>>> following points : >>>> >>>> - radeon_init (kfd_init) >>>> - radeon_exit (kfd_fini) >>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init) >>>> - radeon_driver_unload_kms (kfd_device_fini) >>>> >>>> During the probe and init processing per-device data structures are >>>> established which connect to the associated graphics kernel driver. This >>>> information is exposed to userspace via sysfs, along with a version number >>>> allowing userspace to determine if a topology change has occurred while it >>>> was reading from sysfs. >>>> The interface between kfd and kgd also allows the kfd to request buffer >>>> management services from kgd, and allows kgd to route interrupt requests to >>>> kfd code since the interrupt block is shared between regular >>>> graphics/compute and HSA compute subsystems in the GPU. >>>> >>>> The kfd code works with an open source usermode library ("libhsakmt") which >>>> is in the final stages of IP review and should be published in a separate >>>> repo over the next few days. >>>> The code operates in one of three modes, selectable via the sched_policy >>>> module parameter : >>>> >>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within >>>> CP, and allows oversubscription (more queues than HW slots) >>>> - sched_policy=1 also uses HW scheduling but does not allow >>>> oversubscription, so create_queue requests fail when we run out of HW slots >>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns >>>> queues to HW slots by programming registers >>>> >>>> The "no HW scheduling" option is for debug & new hardware bringup only, so >>>> has less test coverage than the other options. 
Default in the current code >>>> is "HW scheduling without oversubscription" since that is where we have the >>>> most test coverage but we expect to change the default to "HW scheduling >>>> with oversubscription" after further testing. This effectively removes the >>>> HW limit on the number of work queues available to applications. >>>> >>>> Programs running on the GPU are associated with an address space through the >>>> VMID field, which is translated to a unique PASID at access time via a set >>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) >>>> are partitioned (under control of the radeon kgd) between current >>>> gfx/compute and HSA compute, with each getting 8 in the current code. The >>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used, >>>> and by driver code if HW scheduling is not being used. >>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the >>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR >>>> dedicated for this purpose, and pages within the doorbell aperture are >>>> mapped to userspace (each page mapped to only one user address space). >>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing >>>> userspace code to safely manage work queues (rings) without requiring a >>>> kernel call for every ring update. >>>> First step for an application process is to open the kfd device. Calls to >>>> open create a kfd "process" structure only for the first thread of the >>>> process. Subsequent open calls are checked to see if they are from processes >>>> using the same mm_struct and, if so, don't do anything. The kfd per-process >>>> data lives as long as the mm_struct exists. Each mm_struct is associated >>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory >>>> accessible to the GPU. >>>> Next step is for the application to collect topology information via sysfs. >>>> This gives userspace enough information to be able to identify specific >>>> nodes (processors) in subsequent queue management calls. Application >>>> processes can create queues on multiple processors, and processors support >>>> queues from multiple processes. >>>> At this point the application can create work queues in userspace memory and >>>> pass them through the usermode library to kfd to have them mapped onto HW >>>> queue slots so that commands written to the queues can be executed by the >>>> GPU. Queue operations specify a processor node, and so the bulk of this code >>>> is device-specific. 
>>>> Written by John Bridgman <John.Bridgman@amd.com> >>>> >>>> >>>> Alexey Skidanov (1): >>>> amdkfd: Implement the Get Process Aperture IOCTL >>>> >>>> Andrew Lewycky (3): >>>> amdkfd: Add basic modules to amdkfd >>>> amdkfd: Add interrupt handling module >>>> amdkfd: Implement the Set Memory Policy IOCTL >>>> >>>> Ben Goz (8): >>>> amdkfd: Add queue module >>>> amdkfd: Add mqd_manager module >>>> amdkfd: Add kernel queue module >>>> amdkfd: Add module parameter of scheduling policy >>>> amdkfd: Add packet manager module >>>> amdkfd: Add process queue manager module >>>> amdkfd: Add device queue manager module >>>> amdkfd: Implement the create/destroy/update queue IOCTLs >>>> >>>> Evgeny Pinchuk (3): >>>> amdkfd: Add topology module to amdkfd >>>> amdkfd: Implement the Get Clock Counters IOCTL >>>> amdkfd: Implement the PMC Acquire/Release IOCTLs >>>> >>>> Oded Gabbay (10): >>>> mm: Add kfd_process pointer to mm_struct >>>> drm/radeon: reduce number of free VMIDs and pipes in KV >>>> drm/radeon/cik: Don't touch int of pipes 1-7 >>>> drm/radeon: Report doorbell configuration to amdkfd >>>> drm/radeon: adding synchronization for GRBM GFX >>>> drm/radeon: Add radeon <--> amdkfd interface >>>> Update MAINTAINERS and CREDITS files with amdkfd info >>>> amdkfd: Add IOCTL set definitions of amdkfd >>>> amdkfd: Add amdkfd skeleton driver >>>> amdkfd: Add binding/unbinding calls to amd_iommu driver >>>> >>>> CREDITS | 7 + >>>> MAINTAINERS | 10 + >>>> drivers/gpu/drm/radeon/Kconfig | 2 + >>>> drivers/gpu/drm/radeon/Makefile | 3 + >>>> drivers/gpu/drm/radeon/amdkfd/Kconfig | 10 + >>>> drivers/gpu/drm/radeon/amdkfd/Makefile | 14 + >>>> drivers/gpu/drm/radeon/amdkfd/cik_mqds.h | 185 +++ >>>> drivers/gpu/drm/radeon/amdkfd/cik_regs.h | 220 ++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c | 123 ++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c | 518 +++++++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_crat.h | 294 +++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_device.c | 254 ++++ >>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.c | 985 ++++++++++++++++ >>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.h | 101 ++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c | 264 +++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c | 161 +++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c | 305 +++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h | 66 ++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_module.c | 131 +++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c | 291 +++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h | 54 + >>>> drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c | 488 ++++++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c | 97 ++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h | 682 +++++++++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h | 107 ++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_priv.h | 466 ++++++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_process.c | 405 +++++++ >>>> .../drm/radeon/amdkfd/kfd_process_queue_manager.c | 343 ++++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_queue.c | 109 ++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.c | 1207 >>>> ++++++++++++++++++++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.h | 168 +++ >>>> drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c | 96 ++ >>>> drivers/gpu/drm/radeon/cik.c | 154 +-- >>>> drivers/gpu/drm/radeon/cik_reg.h | 65 ++ >>>> drivers/gpu/drm/radeon/cikd.h | 51 +- >>>> drivers/gpu/drm/radeon/radeon.h | 9 + >>>> drivers/gpu/drm/radeon/radeon_device.c | 32 + >>>> drivers/gpu/drm/radeon/radeon_drv.c | 5 
+ >>>> drivers/gpu/drm/radeon/radeon_kfd.c | 566 +++++++++ >>>> drivers/gpu/drm/radeon/radeon_kfd.h | 119 ++ >>>> drivers/gpu/drm/radeon/radeon_kms.c | 7 + >>>> include/linux/mm_types.h | 14 + >>>> include/uapi/linux/kfd_ioctl.h | 133 +++ >>>> 43 files changed, 9226 insertions(+), 95 deletions(-) >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c >>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c >>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h >>>> create mode 100644 include/uapi/linux/kfd_ioctl.h >>>> >>>> -- >>>> 1.9.1 >>>> >> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
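Whether one kernel round trip per submission is prohibitively expensive is ultimately an empirical question, and Jerome answers it in the next message with a purpose-built "adder" module. A rougher but self-contained way to get the same order of magnitude is to time a trivial real ioctl (FIONREAD on a pipe here) against a plain function call standing in for a userspace doorbell write. This harness only approximates that methodology; it is not the test Jerome ran:

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define ITERS 1000000L

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* stand-in for "append a packet and ring a doorbell" done purely in userspace */
static void __attribute__((noinline)) userspace_submit(volatile uint32_t *wptr)
{
	(*wptr)++;
}

int main(void)
{
	volatile uint32_t wptr = 0;
	int fds[2], avail;
	uint64_t t0, t1, t2;
	long i;

	if (pipe(fds))
		return 1;

	t0 = now_ns();
	for (i = 0; i < ITERS; i++)
		ioctl(fds[0], FIONREAD, &avail);	/* one full kernel round trip */
	t1 = now_ns();
	for (i = 0; i < ITERS; i++)
		userspace_submit(&wptr);		/* plain function call */
	t2 = now_ns();

	printf("ioctl:         %.3f us/call\n", (t1 - t0) / (double)ITERS / 1000.0);
	printf("function call: %.3f us/call\n", (t2 - t1) / (double)ITERS / 1000.0);
	return 0;
}

The absolute numbers will differ from the ~0.35us vs ~0.025us figures Jerome quotes in his reply, but the argument hinges on the same ratio: the overhead only matters if a significant share of dispatches are short enough for the submission path to dominate.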
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 14:12 ` Oded Gabbay @ 2014-07-21 15:54 ` Jerome Glisse 2014-07-21 17:42 ` Oded Gabbay 0 siblings, 1 reply; 49+ messages in thread From: Jerome Glisse @ 2014-07-21 15:54 UTC (permalink / raw) To: Oded Gabbay Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: > On 21/07/14 16:39, Christian Konig wrote: > >Am 21.07.2014 14:36, schrieb Oded Gabbay: > >>On 20/07/14 20:46, Jerome Glisse wrote: > >>>On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: > >>>>Forgot to cc mailing list on cover letter. Sorry. > >>>> > >>>>As a continuation to the existing discussion, here is a v2 patch series > >>>>restructured with a cleaner history and no totally-different-early-versions > >>>>of the code. > >>>> > >>>>Instead of 83 patches, there are now a total of 25 patches, where 5 of them > >>>>are modifications to radeon driver and 18 of them include only amdkfd code. > >>>>There is no code going away or even modified between patches, only added. > >>>> > >>>>The driver was renamed from radeon_kfd to amdkfd and moved to reside under > >>>>drm/radeon/amdkfd. This move was done to emphasize the fact that this driver > >>>>is an AMD-only driver at this point. Having said that, we do foresee a > >>>>generic hsa framework being implemented in the future and in that case, we > >>>>will adjust amdkfd to work within that framework. > >>>> > >>>>As the amdkfd driver should support multiple AMD gfx drivers, we want to > >>>>keep it as a seperate driver from radeon. Therefore, the amdkfd code is > >>>>contained in its own folder. The amdkfd folder was put under the radeon > >>>>folder because the only AMD gfx driver in the Linux kernel at this point > >>>>is the radeon driver. Having said that, we will probably need to move it > >>>>(maybe to be directly under drm) after we integrate with additional AMD gfx > >>>>drivers. > >>>> > >>>>For people who like to review using git, the v2 patch set is located at: > >>>>http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 > >>>> > >>>>Written by Oded Gabbayh <oded.gabbay@amd.com> > >>> > >>>So quick comments before i finish going over all patches. There is many > >>>things that need more documentation espacialy as of right now there is > >>>no userspace i can go look at. > >>So quick comments on some of your questions but first of all, thanks for the > >>time you dedicated to review the code. > >>> > >>>There few show stopper, biggest one is gpu memory pinning this is a big > >>>no, that would need serious arguments for any hope of convincing me on > >>>that side. > >>We only do gpu memory pinning for kernel objects. There are no userspace > >>objects that are pinned on the gpu memory in our driver. If that is the case, > >>is it still a show stopper ? > >> > >>The kernel objects are: > >>- pipelines (4 per device) > >>- mqd per hiq (only 1 per device) > >>- mqd per userspace queue. On KV, we support up to 1K queues per process, for > >>a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in > >>256 alignment. 
So total *possible* memory is 128MB > >>- kernel queue (only 1 per device) > >>- fence address for kernel queue > >>- runlists for the CP (1 or 2 per device) > > > >The main questions here are if it's avoid able to pin down the memory and if the > >memory is pinned down at driver load, by request from userspace or by anything > >else. > > > >As far as I can see only the "mqd per userspace queue" might be a bit > >questionable, everything else sounds reasonable. > > > >Christian. > > Most of the pin downs are done on device initialization. > The "mqd per userspace" is done per userspace queue creation. However, as I > said, it has an upper limit of 128MB on KV, and considering the 2G local > memory, I think it is OK. > The runlists are also done on userspace queue creation/deletion, but we only > have 1 or 2 runlists per device, so it is not that bad. 2G local memory ? You can not assume anything on userside configuration some one might build an hsa computer with 512M and still expect a functioning desktop. I need to go look into what all this mqd is for, what it does and what it is about. But pinning is really bad and this is an issue with userspace command scheduling an issue that obviously AMD fails to take into account in design phase. > Oded > > > >>> > >>>It might be better to add a drivers/gpu/drm/amd directory and add common > >>>stuff there. > >>> > >>>Given that this is not intended to be final HSA api AFAICT then i would > >>>say this far better to avoid the whole kfd module and add ioctl to radeon. > >>>This would avoid crazy communication btw radeon and kfd. > >>> > >>>The whole aperture business needs some serious explanation. Especialy as > >>>you want to use userspace address there is nothing to prevent userspace > >>>program from allocating things at address you reserve for lds, scratch, > >>>... only sane way would be to move those lds, scratch inside the virtual > >>>address reserved for kernel (see kernel memory map). > >>> > >>>The whole business of locking performance counter for exclusive per process > >>>access is a big NO. Which leads me to the questionable usefullness of user > >>>space command ring. > >>That's like saying: "Which leads me to the questionable usefulness of HSA". I > >>find it analogous to a situation where a network maintainer nacking a driver > >>for a network card, which is slower than a different network card. Doesn't > >>seem reasonable this situation is would happen. He would still put both the > >>drivers in the kernel because people want to use the H/W and its features. So, > >>I don't think this is a valid reason to NACK the driver. Let me rephrase, drop the the performance counter ioctl and modulo memory pinning i see no objection. In other word, i am not NACKING whole patchset i am NACKING the performance ioctl. Again this is another argument for round trip to the kernel. As inside kernel you could properly do exclusive gpu counter access accross single user cmd buffer execution. > >> > >>>I only see issues with that. First and foremost i would > >>>need to see solid figures that kernel ioctl or syscall has a higher an > >>>overhead that is measurable in any meaning full way against a simple > >>>function call. I know the userspace command ring is a big marketing features > >>>that please ignorant userspace programmer. But really this only brings issues > >>>and for absolutely not upside afaict. > >>Really ? 
You think that doing a context switch to kernel space, with all its > >>overhead, is _not_ more expansive than just calling a function in userspace > >>which only puts a buffer on a ring and writes a doorbell ? I am saying the overhead is not that big and it probably will not matter in most usecase. For instance i did wrote the most useless kernel module that add two number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so ioctl is 13 times slower. Now if there is enough data that shows that a significant percentage of jobs submited to the GPU will take less that 0.35microsecond then yes userspace scheduling does make sense. But so far all we have is handwaving with no data to support any facts. Now if we want to schedule from userspace than you will need to do something about the pinning, something that gives control to kernel so that kernel can unpin when it wants and move object when it wants no matter what userspace is doing. > >>> > >>>So i would rather see a very simple ioctl that write the doorbell and might > >>>do more than that in case of ring/queue overcommit where it would first have > >>>to wait for a free ring/queue to schedule stuff. This would also allow sane > >>>implementation of things like performance counter that could be acquire by > >>>kernel for duration of a job submitted by userspace. While still not optimal > >>>this would be better that userspace locking. > >>> > >>> > >>>I might have more thoughts once i am done with all the patches. > >>> > >>>Cheers, > >>>Jerome > >>> > >>>> > >>>>Original Cover Letter: > >>>> > >>>>This patch set implements a Heterogeneous System Architecture (HSA) driver > >>>>for radeon-family GPUs. > >>>>HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share > >>>>system resources more effectively via HW features including shared pageable > >>>>memory, userspace-accessible work queues, and platform-level atomics. In > >>>>addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea > >>>>Islands family of GPUs also performs HW-level validation of commands passed > >>>>in through the queues (aka rings). > >>>> > >>>>The code in this patch set is intended to serve both as a sample driver for > >>>>other HSA-compatible hardware devices and as a production driver for > >>>>radeon-family processors. The code is architected to support multiple CPUs > >>>>each with connected GPUs, although the current implementation focuses on a > >>>>single Kaveri/Berlin APU, and works alongside the existing radeon kernel > >>>>graphics driver (kgd). > >>>>AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware > >>>>functionality between HSA compute and regular gfx/compute (memory, > >>>>interrupts, registers), while other functionality has been added > >>>>specifically for HSA compute (hw scheduler for virtualized compute rings). > >>>>All shared hardware is owned by the radeon graphics driver, and an interface > >>>>between kfd and kgd allows the kfd to make use of those shared resources, > >>>>while HSA-specific functionality is managed directly by kfd by submitting > >>>>packets into an HSA-specific command queue (the "HIQ"). > >>>> > >>>>During kfd module initialization a char device node (/dev/kfd) is created > >>>>(surviving until module exit), with ioctls for queue creation & management, > >>>>and data structures are initialized for managing HSA device topology. 
> >>>>The rest of the initialization is driven by calls from the radeon kgd at the > >>>>following points : > >>>> > >>>>- radeon_init (kfd_init) > >>>>- radeon_exit (kfd_fini) > >>>>- radeon_driver_load_kms (kfd_device_probe, kfd_device_init) > >>>>- radeon_driver_unload_kms (kfd_device_fini) > >>>> > >>>>During the probe and init processing per-device data structures are > >>>>established which connect to the associated graphics kernel driver. This > >>>>information is exposed to userspace via sysfs, along with a version number > >>>>allowing userspace to determine if a topology change has occurred while it > >>>>was reading from sysfs. > >>>>The interface between kfd and kgd also allows the kfd to request buffer > >>>>management services from kgd, and allows kgd to route interrupt requests to > >>>>kfd code since the interrupt block is shared between regular > >>>>graphics/compute and HSA compute subsystems in the GPU. > >>>> > >>>>The kfd code works with an open source usermode library ("libhsakmt") which > >>>>is in the final stages of IP review and should be published in a separate > >>>>repo over the next few days. > >>>>The code operates in one of three modes, selectable via the sched_policy > >>>>module parameter : > >>>> > >>>>- sched_policy=0 uses a hardware scheduler running in the MEC block within > >>>>CP, and allows oversubscription (more queues than HW slots) > >>>>- sched_policy=1 also uses HW scheduling but does not allow > >>>>oversubscription, so create_queue requests fail when we run out of HW slots > >>>>- sched_policy=2 does not use HW scheduling, so the driver manually assigns > >>>>queues to HW slots by programming registers > >>>> > >>>>The "no HW scheduling" option is for debug & new hardware bringup only, so > >>>>has less test coverage than the other options. Default in the current code > >>>>is "HW scheduling without oversubscription" since that is where we have the > >>>>most test coverage but we expect to change the default to "HW scheduling > >>>>with oversubscription" after further testing. This effectively removes the > >>>>HW limit on the number of work queues available to applications. > >>>> > >>>>Programs running on the GPU are associated with an address space through the > >>>>VMID field, which is translated to a unique PASID at access time via a set > >>>>of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) > >>>>are partitioned (under control of the radeon kgd) between current > >>>>gfx/compute and HSA compute, with each getting 8 in the current code. The > >>>>VMID-to-PASID mapping registers are updated by the HW scheduler when used, > >>>>and by driver code if HW scheduling is not being used. > >>>>The Sea Islands compute queues use a new "doorbell" mechanism instead of the > >>>>earlier kernel-managed write pointer registers. Doorbells use a separate BAR > >>>>dedicated for this purpose, and pages within the doorbell aperture are > >>>>mapped to userspace (each page mapped to only one user address space). > >>>>Writes to the doorbell aperture are intercepted by GPU hardware, allowing > >>>>userspace code to safely manage work queues (rings) without requiring a > >>>>kernel call for every ring update. > >>>>First step for an application process is to open the kfd device. Calls to > >>>>open create a kfd "process" structure only for the first thread of the > >>>>process. Subsequent open calls are checked to see if they are from processes > >>>>using the same mm_struct and, if so, don't do anything. 
The kfd per-process > >>>>data lives as long as the mm_struct exists. Each mm_struct is associated > >>>>with a unique PASID, allowing the IOMMUv2 to make userspace process memory > >>>>accessible to the GPU. > >>>>Next step is for the application to collect topology information via sysfs. > >>>>This gives userspace enough information to be able to identify specific > >>>>nodes (processors) in subsequent queue management calls. Application > >>>>processes can create queues on multiple processors, and processors support > >>>>queues from multiple processes. > >>>>At this point the application can create work queues in userspace memory and > >>>>pass them through the usermode library to kfd to have them mapped onto HW > >>>>queue slots so that commands written to the queues can be executed by the > >>>>GPU. Queue operations specify a processor node, and so the bulk of this code > >>>>is device-specific. > >>>>Written by John Bridgman <John.Bridgman@amd.com> > >>>> > >>>> > >>>>Alexey Skidanov (1): > >>>> amdkfd: Implement the Get Process Aperture IOCTL > >>>> > >>>>Andrew Lewycky (3): > >>>> amdkfd: Add basic modules to amdkfd > >>>> amdkfd: Add interrupt handling module > >>>> amdkfd: Implement the Set Memory Policy IOCTL > >>>> > >>>>Ben Goz (8): > >>>> amdkfd: Add queue module > >>>> amdkfd: Add mqd_manager module > >>>> amdkfd: Add kernel queue module > >>>> amdkfd: Add module parameter of scheduling policy > >>>> amdkfd: Add packet manager module > >>>> amdkfd: Add process queue manager module > >>>> amdkfd: Add device queue manager module > >>>> amdkfd: Implement the create/destroy/update queue IOCTLs > >>>> > >>>>Evgeny Pinchuk (3): > >>>> amdkfd: Add topology module to amdkfd > >>>> amdkfd: Implement the Get Clock Counters IOCTL > >>>> amdkfd: Implement the PMC Acquire/Release IOCTLs > >>>> > >>>>Oded Gabbay (10): > >>>> mm: Add kfd_process pointer to mm_struct > >>>> drm/radeon: reduce number of free VMIDs and pipes in KV > >>>> drm/radeon/cik: Don't touch int of pipes 1-7 > >>>> drm/radeon: Report doorbell configuration to amdkfd > >>>> drm/radeon: adding synchronization for GRBM GFX > >>>> drm/radeon: Add radeon <--> amdkfd interface > >>>> Update MAINTAINERS and CREDITS files with amdkfd info > >>>> amdkfd: Add IOCTL set definitions of amdkfd > >>>> amdkfd: Add amdkfd skeleton driver > >>>> amdkfd: Add binding/unbinding calls to amd_iommu driver > >>>> > >>>> CREDITS | 7 + > >>>> MAINTAINERS | 10 + > >>>> drivers/gpu/drm/radeon/Kconfig | 2 + > >>>> drivers/gpu/drm/radeon/Makefile | 3 + > >>>> drivers/gpu/drm/radeon/amdkfd/Kconfig | 10 + > >>>> drivers/gpu/drm/radeon/amdkfd/Makefile | 14 + > >>>> drivers/gpu/drm/radeon/amdkfd/cik_mqds.h | 185 +++ > >>>> drivers/gpu/drm/radeon/amdkfd/cik_regs.h | 220 ++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c | 123 ++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c | 518 +++++++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_crat.h | 294 +++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_device.c | 254 ++++ > >>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.c | 985 ++++++++++++++++ > >>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.h | 101 ++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c | 264 +++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c | 161 +++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c | 305 +++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h | 66 ++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_module.c | 131 +++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c | 291 +++++ > >>>> 
drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h | 54 + > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c | 488 ++++++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c | 97 ++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h | 682 +++++++++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h | 107 ++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_priv.h | 466 ++++++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_process.c | 405 +++++++ > >>>> .../drm/radeon/amdkfd/kfd_process_queue_manager.c | 343 ++++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_queue.c | 109 ++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.c | 1207 > >>>>++++++++++++++++++++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.h | 168 +++ > >>>> drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c | 96 ++ > >>>> drivers/gpu/drm/radeon/cik.c | 154 +-- > >>>> drivers/gpu/drm/radeon/cik_reg.h | 65 ++ > >>>> drivers/gpu/drm/radeon/cikd.h | 51 +- > >>>> drivers/gpu/drm/radeon/radeon.h | 9 + > >>>> drivers/gpu/drm/radeon/radeon_device.c | 32 + > >>>> drivers/gpu/drm/radeon/radeon_drv.c | 5 + > >>>> drivers/gpu/drm/radeon/radeon_kfd.c | 566 +++++++++ > >>>> drivers/gpu/drm/radeon/radeon_kfd.h | 119 ++ > >>>> drivers/gpu/drm/radeon/radeon_kms.c | 7 + > >>>> include/linux/mm_types.h | 14 + > >>>> include/uapi/linux/kfd_ioctl.h | 133 +++ > >>>> 43 files changed, 9226 insertions(+), 95 deletions(-) > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h > >>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c > >>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c > >>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h > >>>> create mode 100644 
include/uapi/linux/kfd_ioctl.h > >>>> > >>>>-- > >>>>1.9.1 > >>>> > >> > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 49+ messages in thread
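The ioctl-versus-function-call comparison Jerome quotes above (roughly 0.35 microseconds against 0.025 microseconds on his machine) can be reproduced from userspace with a timing loop along the following lines. The device node name, ioctl number and argument struct here are assumptions standing in for whatever the adder test module actually exposes, so treat this as a sketch of the methodology rather than the exact harness; the interesting output is the ratio between the two averages, not the absolute values.

/*
 * Rough userspace harness in the spirit of Jerome's adder measurement.
 * The device node, ioctl number and argument layout are hypothetical;
 * adjust them to whatever the test module really provides.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

struct adder_args {			/* hypothetical ioctl argument layout */
	uint64_t a, b, result;
};

#define ADDER_IOC_ADD _IOWR('A', 0, struct adder_args)	/* hypothetical */

static uint64_t add_plain(uint64_t a, uint64_t b)
{
	return a + b;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	const int loops = 1000000;
	struct adder_args args = { .a = 1, .b = 2, .result = 0 };
	volatile uint64_t sink = 0;
	uint64_t t0, t1;
	int fd, i;

	fd = open("/dev/adder", O_RDWR);	/* hypothetical device node */
	if (fd < 0) {
		perror("open /dev/adder");
		return 1;
	}

	t0 = now_ns();
	for (i = 0; i < loops; i++)
		ioctl(fd, ADDER_IOC_ADD, &args);	/* kernel round trip */
	t1 = now_ns();
	printf("ioctl:    %.3f us/call\n", (t1 - t0) / 1000.0 / loops);

	t0 = now_ns();
	for (i = 0; i < loops; i++)
		sink += add_plain(args.a, args.b);	/* plain function call */
	t1 = now_ns();
	printf("function: %.3f us/call\n", (t1 - t0) / 1000.0 / loops);

	close(fd);
	return 0;
}

With numbers like these in hand, the open question in the thread is simply whether a significant share of HSA job submissions are short enough for that extra fraction of a microsecond per submission to matter.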
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 15:54 ` Jerome Glisse @ 2014-07-21 17:42 ` Oded Gabbay 2014-07-21 18:14 ` Jerome Glisse 0 siblings, 1 reply; 49+ messages in thread From: Oded Gabbay @ 2014-07-21 17:42 UTC (permalink / raw) To: Jerome Glisse Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On 21/07/14 18:54, Jerome Glisse wrote: > On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: >> On 21/07/14 16:39, Christian König wrote: >>> Am 21.07.2014 14:36, schrieb Oded Gabbay: >>>> On 20/07/14 20:46, Jerome Glisse wrote: >>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >>>>>> Forgot to cc mailing list on cover letter. Sorry. >>>>>> >>>>>> As a continuation to the existing discussion, here is a v2 patch series >>>>>> restructured with a cleaner history and no totally-different-early-versions >>>>>> of the code. >>>>>> >>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them >>>>>> are modifications to radeon driver and 18 of them include only amdkfd code. >>>>>> There is no code going away or even modified between patches, only added. >>>>>> >>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under >>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver >>>>>> is an AMD-only driver at this point. Having said that, we do foresee a >>>>>> generic hsa framework being implemented in the future and in that case, we >>>>>> will adjust amdkfd to work within that framework. >>>>>> >>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to >>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is >>>>>> contained in its own folder. The amdkfd folder was put under the radeon >>>>>> folder because the only AMD gfx driver in the Linux kernel at this point >>>>>> is the radeon driver. Having said that, we will probably need to move it >>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx >>>>>> drivers. >>>>>> >>>>>> For people who like to review using git, the v2 patch set is located at: >>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 >>>>>> >>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> >>>>> >>>>> So quick comments before i finish going over all patches. There is many >>>>> things that need more documentation espacialy as of right now there is >>>>> no userspace i can go look at. >>>> So quick comments on some of your questions but first of all, thanks for the >>>> time you dedicated to review the code. >>>>> >>>>> There few show stopper, biggest one is gpu memory pinning this is a big >>>>> no, that would need serious arguments for any hope of convincing me on >>>>> that side. >>>> We only do gpu memory pinning for kernel objects. There are no userspace >>>> objects that are pinned on the gpu memory in our driver. If that is the case, >>>> is it still a show stopper ? >>>> >>>> The kernel objects are: >>>> - pipelines (4 per device) >>>> - mqd per hiq (only 1 per device) >>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for >>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in >>>> 256 alignment. 
So total *possible* memory is 128MB >>>> - kernel queue (only 1 per device) >>>> - fence address for kernel queue >>>> - runlists for the CP (1 or 2 per device) >>> >>> The main questions here are if it's avoid able to pin down the memory and if the >>> memory is pinned down at driver load, by request from userspace or by anything >>> else. >>> >>> As far as I can see only the "mqd per userspace queue" might be a bit >>> questionable, everything else sounds reasonable. >>> >>> Christian. >> >> Most of the pin downs are done on device initialization. >> The "mqd per userspace" is done per userspace queue creation. However, as I >> said, it has an upper limit of 128MB on KV, and considering the 2G local >> memory, I think it is OK. >> The runlists are also done on userspace queue creation/deletion, but we only >> have 1 or 2 runlists per device, so it is not that bad. > > 2G local memory ? You can not assume anything on userside configuration some > one might build an hsa computer with 512M and still expect a functioning > desktop. First of all, I'm only considering Kaveri computer, not "hsa" computer. Second, I would imagine we can build some protection around it, like checking total local memory and limit number of queues based on some percentage of that total local memory. So, if someone will have only 512M, he will be able to open less queues. > > I need to go look into what all this mqd is for, what it does and what it is > about. But pinning is really bad and this is an issue with userspace command > scheduling an issue that obviously AMD fails to take into account in design > phase. Maybe, but that is the H/W design non-the-less. We can't very well change the H/W. Oded > >> Oded >>> >>>>> >>>>> It might be better to add a drivers/gpu/drm/amd directory and add common >>>>> stuff there. >>>>> >>>>> Given that this is not intended to be final HSA api AFAICT then i would >>>>> say this far better to avoid the whole kfd module and add ioctl to radeon. >>>>> This would avoid crazy communication btw radeon and kfd. >>>>> >>>>> The whole aperture business needs some serious explanation. Especialy as >>>>> you want to use userspace address there is nothing to prevent userspace >>>>> program from allocating things at address you reserve for lds, scratch, >>>>> ... only sane way would be to move those lds, scratch inside the virtual >>>>> address reserved for kernel (see kernel memory map). >>>>> >>>>> The whole business of locking performance counter for exclusive per process >>>>> access is a big NO. Which leads me to the questionable usefullness of user >>>>> space command ring. >>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I >>>> find it analogous to a situation where a network maintainer nacking a driver >>>> for a network card, which is slower than a different network card. Doesn't >>>> seem reasonable this situation is would happen. He would still put both the >>>> drivers in the kernel because people want to use the H/W and its features. So, >>>> I don't think this is a valid reason to NACK the driver. > > Let me rephrase, drop the the performance counter ioctl and modulo memory pinning > i see no objection. In other word, i am not NACKING whole patchset i am NACKING > the performance ioctl. > > Again this is another argument for round trip to the kernel. As inside kernel you > could properly do exclusive gpu counter access accross single user cmd buffer > execution. > >>>> >>>>> I only see issues with that. 
First and foremost i would >>>>> need to see solid figures that kernel ioctl or syscall has a higher an >>>>> overhead that is measurable in any meaning full way against a simple >>>>> function call. I know the userspace command ring is a big marketing features >>>>> that please ignorant userspace programmer. But really this only brings issues >>>>> and for absolutely not upside afaict. >>>> Really ? You think that doing a context switch to kernel space, with all its >>>> overhead, is _not_ more expansive than just calling a function in userspace >>>> which only puts a buffer on a ring and writes a doorbell ? > > I am saying the overhead is not that big and it probably will not matter in most > usecase. For instance i did wrote the most useless kernel module that add two > number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and > it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so > ioctl is 13 times slower. > > Now if there is enough data that shows that a significant percentage of jobs > submited to the GPU will take less that 0.35microsecond then yes userspace > scheduling does make sense. But so far all we have is handwaving with no data > to support any facts. > > > Now if we want to schedule from userspace than you will need to do something > about the pinning, something that gives control to kernel so that kernel can > unpin when it wants and move object when it wants no matter what userspace is > doing. > >>>>> >>>>> So i would rather see a very simple ioctl that write the doorbell and might >>>>> do more than that in case of ring/queue overcommit where it would first have >>>>> to wait for a free ring/queue to schedule stuff. This would also allow sane >>>>> implementation of things like performance counter that could be acquire by >>>>> kernel for duration of a job submitted by userspace. While still not optimal >>>>> this would be better that userspace locking. >>>>> >>>>> >>>>> I might have more thoughts once i am done with all the patches. >>>>> >>>>> Cheers, >>>>> Jérôme >>>>> >>>>>> >>>>>> Original Cover Letter: >>>>>> >>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver >>>>>> for radeon-family GPUs. >>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share >>>>>> system resources more effectively via HW features including shared pageable >>>>>> memory, userspace-accessible work queues, and platform-level atomics. In >>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea >>>>>> Islands family of GPUs also performs HW-level validation of commands passed >>>>>> in through the queues (aka rings). >>>>>> >>>>>> The code in this patch set is intended to serve both as a sample driver for >>>>>> other HSA-compatible hardware devices and as a production driver for >>>>>> radeon-family processors. The code is architected to support multiple CPUs >>>>>> each with connected GPUs, although the current implementation focuses on a >>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel >>>>>> graphics driver (kgd). >>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware >>>>>> functionality between HSA compute and regular gfx/compute (memory, >>>>>> interrupts, registers), while other functionality has been added >>>>>> specifically for HSA compute (hw scheduler for virtualized compute rings). 
>>>>>> All shared hardware is owned by the radeon graphics driver, and an interface >>>>>> between kfd and kgd allows the kfd to make use of those shared resources, >>>>>> while HSA-specific functionality is managed directly by kfd by submitting >>>>>> packets into an HSA-specific command queue (the "HIQ"). >>>>>> >>>>>> During kfd module initialization a char device node (/dev/kfd) is created >>>>>> (surviving until module exit), with ioctls for queue creation & management, >>>>>> and data structures are initialized for managing HSA device topology. >>>>>> The rest of the initialization is driven by calls from the radeon kgd at the >>>>>> following points : >>>>>> >>>>>> - radeon_init (kfd_init) >>>>>> - radeon_exit (kfd_fini) >>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init) >>>>>> - radeon_driver_unload_kms (kfd_device_fini) >>>>>> >>>>>> During the probe and init processing per-device data structures are >>>>>> established which connect to the associated graphics kernel driver. This >>>>>> information is exposed to userspace via sysfs, along with a version number >>>>>> allowing userspace to determine if a topology change has occurred while it >>>>>> was reading from sysfs. >>>>>> The interface between kfd and kgd also allows the kfd to request buffer >>>>>> management services from kgd, and allows kgd to route interrupt requests to >>>>>> kfd code since the interrupt block is shared between regular >>>>>> graphics/compute and HSA compute subsystems in the GPU. >>>>>> >>>>>> The kfd code works with an open source usermode library ("libhsakmt") which >>>>>> is in the final stages of IP review and should be published in a separate >>>>>> repo over the next few days. >>>>>> The code operates in one of three modes, selectable via the sched_policy >>>>>> module parameter : >>>>>> >>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within >>>>>> CP, and allows oversubscription (more queues than HW slots) >>>>>> - sched_policy=1 also uses HW scheduling but does not allow >>>>>> oversubscription, so create_queue requests fail when we run out of HW slots >>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns >>>>>> queues to HW slots by programming registers >>>>>> >>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so >>>>>> has less test coverage than the other options. Default in the current code >>>>>> is "HW scheduling without oversubscription" since that is where we have the >>>>>> most test coverage but we expect to change the default to "HW scheduling >>>>>> with oversubscription" after further testing. This effectively removes the >>>>>> HW limit on the number of work queues available to applications. >>>>>> >>>>>> Programs running on the GPU are associated with an address space through the >>>>>> VMID field, which is translated to a unique PASID at access time via a set >>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) >>>>>> are partitioned (under control of the radeon kgd) between current >>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The >>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used, >>>>>> and by driver code if HW scheduling is not being used. >>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the >>>>>> earlier kernel-managed write pointer registers. 
Doorbells use a separate BAR >>>>>> dedicated for this purpose, and pages within the doorbell aperture are >>>>>> mapped to userspace (each page mapped to only one user address space). >>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing >>>>>> userspace code to safely manage work queues (rings) without requiring a >>>>>> kernel call for every ring update. >>>>>> First step for an application process is to open the kfd device. Calls to >>>>>> open create a kfd "process" structure only for the first thread of the >>>>>> process. Subsequent open calls are checked to see if they are from processes >>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process >>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated >>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory >>>>>> accessible to the GPU. >>>>>> Next step is for the application to collect topology information via sysfs. >>>>>> This gives userspace enough information to be able to identify specific >>>>>> nodes (processors) in subsequent queue management calls. Application >>>>>> processes can create queues on multiple processors, and processors support >>>>>> queues from multiple processes. >>>>>> At this point the application can create work queues in userspace memory and >>>>>> pass them through the usermode library to kfd to have them mapped onto HW >>>>>> queue slots so that commands written to the queues can be executed by the >>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code >>>>>> is device-specific. >>>>>> Written by John Bridgman <John.Bridgman@amd.com> >>>>>> >>>>>> >>>>>> Alexey Skidanov (1): >>>>>> amdkfd: Implement the Get Process Aperture IOCTL >>>>>> >>>>>> Andrew Lewycky (3): >>>>>> amdkfd: Add basic modules to amdkfd >>>>>> amdkfd: Add interrupt handling module >>>>>> amdkfd: Implement the Set Memory Policy IOCTL >>>>>> >>>>>> Ben Goz (8): >>>>>> amdkfd: Add queue module >>>>>> amdkfd: Add mqd_manager module >>>>>> amdkfd: Add kernel queue module >>>>>> amdkfd: Add module parameter of scheduling policy >>>>>> amdkfd: Add packet manager module >>>>>> amdkfd: Add process queue manager module >>>>>> amdkfd: Add device queue manager module >>>>>> amdkfd: Implement the create/destroy/update queue IOCTLs >>>>>> >>>>>> Evgeny Pinchuk (3): >>>>>> amdkfd: Add topology module to amdkfd >>>>>> amdkfd: Implement the Get Clock Counters IOCTL >>>>>> amdkfd: Implement the PMC Acquire/Release IOCTLs >>>>>> >>>>>> Oded Gabbay (10): >>>>>> mm: Add kfd_process pointer to mm_struct >>>>>> drm/radeon: reduce number of free VMIDs and pipes in KV >>>>>> drm/radeon/cik: Don't touch int of pipes 1-7 >>>>>> drm/radeon: Report doorbell configuration to amdkfd >>>>>> drm/radeon: adding synchronization for GRBM GFX >>>>>> drm/radeon: Add radeon <--> amdkfd interface >>>>>> Update MAINTAINERS and CREDITS files with amdkfd info >>>>>> amdkfd: Add IOCTL set definitions of amdkfd >>>>>> amdkfd: Add amdkfd skeleton driver >>>>>> amdkfd: Add binding/unbinding calls to amd_iommu driver >>>>>> >>>>>> CREDITS | 7 + >>>>>> MAINTAINERS | 10 + >>>>>> drivers/gpu/drm/radeon/Kconfig | 2 + >>>>>> drivers/gpu/drm/radeon/Makefile | 3 + >>>>>> drivers/gpu/drm/radeon/amdkfd/Kconfig | 10 + >>>>>> drivers/gpu/drm/radeon/amdkfd/Makefile | 14 + >>>>>> drivers/gpu/drm/radeon/amdkfd/cik_mqds.h | 185 +++ >>>>>> drivers/gpu/drm/radeon/amdkfd/cik_regs.h | 220 ++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c | 123 ++ >>>>>> 
drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c | 518 +++++++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_crat.h | 294 +++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_device.c | 254 ++++ >>>>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.c | 985 ++++++++++++++++ >>>>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.h | 101 ++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c | 264 +++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c | 161 +++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c | 305 +++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h | 66 ++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_module.c | 131 +++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c | 291 +++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h | 54 + >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c | 488 ++++++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c | 97 ++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h | 682 +++++++++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h | 107 ++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_priv.h | 466 ++++++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_process.c | 405 +++++++ >>>>>> .../drm/radeon/amdkfd/kfd_process_queue_manager.c | 343 ++++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_queue.c | 109 ++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.c | 1207 >>>>>> ++++++++++++++++++++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.h | 168 +++ >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c | 96 ++ >>>>>> drivers/gpu/drm/radeon/cik.c | 154 +-- >>>>>> drivers/gpu/drm/radeon/cik_reg.h | 65 ++ >>>>>> drivers/gpu/drm/radeon/cikd.h | 51 +- >>>>>> drivers/gpu/drm/radeon/radeon.h | 9 + >>>>>> drivers/gpu/drm/radeon/radeon_device.c | 32 + >>>>>> drivers/gpu/drm/radeon/radeon_drv.c | 5 + >>>>>> drivers/gpu/drm/radeon/radeon_kfd.c | 566 +++++++++ >>>>>> drivers/gpu/drm/radeon/radeon_kfd.h | 119 ++ >>>>>> drivers/gpu/drm/radeon/radeon_kms.c | 7 + >>>>>> include/linux/mm_types.h | 14 + >>>>>> include/uapi/linux/kfd_ioctl.h | 133 +++ >>>>>> 43 files changed, 9226 insertions(+), 95 deletions(-) >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h >>>>>> create mode 
100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c >>>>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h >>>>>> create mode 100644 include/uapi/linux/kfd_ioctl.h >>>>>> >>>>>> -- >>>>>> 1.9.1 >>>>>> >>>> >>> >> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 49+ messages in thread
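The 128MB worst case that keeps coming up above falls directly out of the allocation granularity: 151 bytes per MQD rounded up to a 256-byte slot, multiplied by the 512K queues KV can expose. The short program below spells out that arithmetic, together with the kind of percentage-of-local-memory cap Oded floats for machines with less VRAM; the 512MB VRAM size and the 10% figure are invented for illustration and are not taken from the driver.

/*
 * Back-of-the-envelope check of the MQD numbers discussed above, plus a
 * hypothetical cap that limits queue creation to a percentage of local
 * memory. The VRAM size and percentage are illustration values only.
 */
#include <stdio.h>

#define MQD_SIZE_BYTES		151UL		/* "each mqd is 151 bytes" */
#define MQD_ALIGN		256UL		/* "allocation is done in 256 alignment" */
#define QUEUES_PER_PROCESS	1024UL		/* "up to 1K queues per process" */
#define TOTAL_QUEUES		(512UL * 1024UL)	/* "a total of 512K queues" */

static unsigned long align_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

int main(void)
{
	unsigned long per_mqd = align_up(MQD_SIZE_BYTES, MQD_ALIGN);
	unsigned long worst_case = per_mqd * TOTAL_QUEUES;

	printf("per-MQD allocation:     %lu bytes\n", per_mqd);
	printf("implied process limit:  %lu\n", TOTAL_QUEUES / QUEUES_PER_PROCESS);
	printf("worst-case pinned MQDs: %lu MB\n", worst_case >> 20);

	/* Hypothetical policy: spend at most 'pct' percent of VRAM on MQDs. */
	unsigned long vram_bytes = 512UL << 20, pct = 10;
	unsigned long budget = vram_bytes / 100 * pct;

	printf("queue limit with %lu MB VRAM and a %lu%% cap: %lu\n",
	       vram_bytes >> 20, pct, budget / per_mqd);
	return 0;
}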
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 17:42 ` Oded Gabbay @ 2014-07-21 18:14 ` Jerome Glisse 2014-07-21 18:36 ` Oded Gabbay 0 siblings, 1 reply; 49+ messages in thread From: Jerome Glisse @ 2014-07-21 18:14 UTC (permalink / raw) To: Oded Gabbay Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: > On 21/07/14 18:54, Jerome Glisse wrote: > > On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: > >> On 21/07/14 16:39, Christian Konig wrote: > >>> Am 21.07.2014 14:36, schrieb Oded Gabbay: > >>>> On 20/07/14 20:46, Jerome Glisse wrote: > >>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: > >>>>>> Forgot to cc mailing list on cover letter. Sorry. > >>>>>> > >>>>>> As a continuation to the existing discussion, here is a v2 patch series > >>>>>> restructured with a cleaner history and no totally-different-early-versions > >>>>>> of the code. > >>>>>> > >>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them > >>>>>> are modifications to radeon driver and 18 of them include only amdkfd code. > >>>>>> There is no code going away or even modified between patches, only added. > >>>>>> > >>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under > >>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver > >>>>>> is an AMD-only driver at this point. Having said that, we do foresee a > >>>>>> generic hsa framework being implemented in the future and in that case, we > >>>>>> will adjust amdkfd to work within that framework. > >>>>>> > >>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to > >>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is > >>>>>> contained in its own folder. The amdkfd folder was put under the radeon > >>>>>> folder because the only AMD gfx driver in the Linux kernel at this point > >>>>>> is the radeon driver. Having said that, we will probably need to move it > >>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx > >>>>>> drivers. > >>>>>> > >>>>>> For people who like to review using git, the v2 patch set is located at: > >>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 > >>>>>> > >>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> > >>>>> > >>>>> So quick comments before i finish going over all patches. There is many > >>>>> things that need more documentation espacialy as of right now there is > >>>>> no userspace i can go look at. > >>>> So quick comments on some of your questions but first of all, thanks for the > >>>> time you dedicated to review the code. > >>>>> > >>>>> There few show stopper, biggest one is gpu memory pinning this is a big > >>>>> no, that would need serious arguments for any hope of convincing me on > >>>>> that side. > >>>> We only do gpu memory pinning for kernel objects. There are no userspace > >>>> objects that are pinned on the gpu memory in our driver. If that is the case, > >>>> is it still a show stopper ? > >>>> > >>>> The kernel objects are: > >>>> - pipelines (4 per device) > >>>> - mqd per hiq (only 1 per device) > >>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for > >>>> a total of 512K queues. 
Each mqd is 151 bytes, but the allocation is done in > >>>> 256 alignment. So total *possible* memory is 128MB > >>>> - kernel queue (only 1 per device) > >>>> - fence address for kernel queue > >>>> - runlists for the CP (1 or 2 per device) > >>> > >>> The main questions here are if it's avoid able to pin down the memory and if the > >>> memory is pinned down at driver load, by request from userspace or by anything > >>> else. > >>> > >>> As far as I can see only the "mqd per userspace queue" might be a bit > >>> questionable, everything else sounds reasonable. > >>> > >>> Christian. > >> > >> Most of the pin downs are done on device initialization. > >> The "mqd per userspace" is done per userspace queue creation. However, as I > >> said, it has an upper limit of 128MB on KV, and considering the 2G local > >> memory, I think it is OK. > >> The runlists are also done on userspace queue creation/deletion, but we only > >> have 1 or 2 runlists per device, so it is not that bad. > > > > 2G local memory ? You can not assume anything on userside configuration some > > one might build an hsa computer with 512M and still expect a functioning > > desktop. > First of all, I'm only considering Kaveri computer, not "hsa" computer. > Second, I would imagine we can build some protection around it, like > checking total local memory and limit number of queues based on some > percentage of that total local memory. So, if someone will have only > 512M, he will be able to open less queues. > > > > > > I need to go look into what all this mqd is for, what it does and what it is > > about. But pinning is really bad and this is an issue with userspace command > > scheduling an issue that obviously AMD fails to take into account in design > > phase. > Maybe, but that is the H/W design non-the-less. We can't very well > change the H/W. You can not change the hardware but it is not an excuse to allow bad design to sneak in software to work around that. So i would rather penalize bad hardware design and have command submission in the kernel, until AMD fix its hardware to allow proper scheduling by the kernel and proper control by the kernel. Because really where we want to go is having GPU closer to a CPU in term of scheduling capacity and once we get there we want the kernel to always be able to take over and do whatever it wants behind process back. > >>> > >>>>> > >>>>> It might be better to add a drivers/gpu/drm/amd directory and add common > >>>>> stuff there. > >>>>> > >>>>> Given that this is not intended to be final HSA api AFAICT then i would > >>>>> say this far better to avoid the whole kfd module and add ioctl to radeon. > >>>>> This would avoid crazy communication btw radeon and kfd. > >>>>> > >>>>> The whole aperture business needs some serious explanation. Especialy as > >>>>> you want to use userspace address there is nothing to prevent userspace > >>>>> program from allocating things at address you reserve for lds, scratch, > >>>>> ... only sane way would be to move those lds, scratch inside the virtual > >>>>> address reserved for kernel (see kernel memory map). > >>>>> > >>>>> The whole business of locking performance counter for exclusive per process > >>>>> access is a big NO. Which leads me to the questionable usefullness of user > >>>>> space command ring. > >>>> That's like saying: "Which leads me to the questionable usefulness of HSA". 
I > >>>> find it analogous to a situation where a network maintainer nacking a driver > >>>> for a network card, which is slower than a different network card. Doesn't > >>>> seem reasonable this situation is would happen. He would still put both the > >>>> drivers in the kernel because people want to use the H/W and its features. So, > >>>> I don't think this is a valid reason to NACK the driver. > > > > Let me rephrase, drop the the performance counter ioctl and modulo memory pinning > > i see no objection. In other word, i am not NACKING whole patchset i am NACKING > > the performance ioctl. > > > > Again this is another argument for round trip to the kernel. As inside kernel you > > could properly do exclusive gpu counter access accross single user cmd buffer > > execution. > > > >>>> > >>>>> I only see issues with that. First and foremost i would > >>>>> need to see solid figures that kernel ioctl or syscall has a higher an > >>>>> overhead that is measurable in any meaning full way against a simple > >>>>> function call. I know the userspace command ring is a big marketing features > >>>>> that please ignorant userspace programmer. But really this only brings issues > >>>>> and for absolutely not upside afaict. > >>>> Really ? You think that doing a context switch to kernel space, with all its > >>>> overhead, is _not_ more expansive than just calling a function in userspace > >>>> which only puts a buffer on a ring and writes a doorbell ? > > > > I am saying the overhead is not that big and it probably will not matter in most > > usecase. For instance i did wrote the most useless kernel module that add two > > number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and > > it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so > > ioctl is 13 times slower. > > > > Now if there is enough data that shows that a significant percentage of jobs > > submited to the GPU will take less that 0.35microsecond then yes userspace > > scheduling does make sense. But so far all we have is handwaving with no data > > to support any facts. > > > > > > Now if we want to schedule from userspace than you will need to do something > > about the pinning, something that gives control to kernel so that kernel can > > unpin when it wants and move object when it wants no matter what userspace is > > doing. > > > >>>>> > >>>>> So i would rather see a very simple ioctl that write the doorbell and might > >>>>> do more than that in case of ring/queue overcommit where it would first have > >>>>> to wait for a free ring/queue to schedule stuff. This would also allow sane > >>>>> implementation of things like performance counter that could be acquire by > >>>>> kernel for duration of a job submitted by userspace. While still not optimal > >>>>> this would be better that userspace locking. > >>>>> > >>>>> > >>>>> I might have more thoughts once i am done with all the patches. > >>>>> > >>>>> Cheers, > >>>>> Jerome > >>>>> > >>>>>> > >>>>>> Original Cover Letter: > >>>>>> > >>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver > >>>>>> for radeon-family GPUs. > >>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share > >>>>>> system resources more effectively via HW features including shared pageable > >>>>>> memory, userspace-accessible work queues, and platform-level atomics. 
In > >>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea > >>>>>> Islands family of GPUs also performs HW-level validation of commands passed > >>>>>> in through the queues (aka rings). > >>>>>> > >>>>>> The code in this patch set is intended to serve both as a sample driver for > >>>>>> other HSA-compatible hardware devices and as a production driver for > >>>>>> radeon-family processors. The code is architected to support multiple CPUs > >>>>>> each with connected GPUs, although the current implementation focuses on a > >>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel > >>>>>> graphics driver (kgd). > >>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware > >>>>>> functionality between HSA compute and regular gfx/compute (memory, > >>>>>> interrupts, registers), while other functionality has been added > >>>>>> specifically for HSA compute (hw scheduler for virtualized compute rings). > >>>>>> All shared hardware is owned by the radeon graphics driver, and an interface > >>>>>> between kfd and kgd allows the kfd to make use of those shared resources, > >>>>>> while HSA-specific functionality is managed directly by kfd by submitting > >>>>>> packets into an HSA-specific command queue (the "HIQ"). > >>>>>> > >>>>>> During kfd module initialization a char device node (/dev/kfd) is created > >>>>>> (surviving until module exit), with ioctls for queue creation & management, > >>>>>> and data structures are initialized for managing HSA device topology. > >>>>>> The rest of the initialization is driven by calls from the radeon kgd at the > >>>>>> following points : > >>>>>> > >>>>>> - radeon_init (kfd_init) > >>>>>> - radeon_exit (kfd_fini) > >>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init) > >>>>>> - radeon_driver_unload_kms (kfd_device_fini) > >>>>>> > >>>>>> During the probe and init processing per-device data structures are > >>>>>> established which connect to the associated graphics kernel driver. This > >>>>>> information is exposed to userspace via sysfs, along with a version number > >>>>>> allowing userspace to determine if a topology change has occurred while it > >>>>>> was reading from sysfs. > >>>>>> The interface between kfd and kgd also allows the kfd to request buffer > >>>>>> management services from kgd, and allows kgd to route interrupt requests to > >>>>>> kfd code since the interrupt block is shared between regular > >>>>>> graphics/compute and HSA compute subsystems in the GPU. > >>>>>> > >>>>>> The kfd code works with an open source usermode library ("libhsakmt") which > >>>>>> is in the final stages of IP review and should be published in a separate > >>>>>> repo over the next few days. > >>>>>> The code operates in one of three modes, selectable via the sched_policy > >>>>>> module parameter : > >>>>>> > >>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within > >>>>>> CP, and allows oversubscription (more queues than HW slots) > >>>>>> - sched_policy=1 also uses HW scheduling but does not allow > >>>>>> oversubscription, so create_queue requests fail when we run out of HW slots > >>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns > >>>>>> queues to HW slots by programming registers > >>>>>> > >>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so > >>>>>> has less test coverage than the other options. 
Default in the current code > >>>>>> is "HW scheduling without oversubscription" since that is where we have the > >>>>>> most test coverage but we expect to change the default to "HW scheduling > >>>>>> with oversubscription" after further testing. This effectively removes the > >>>>>> HW limit on the number of work queues available to applications. > >>>>>> > >>>>>> Programs running on the GPU are associated with an address space through the > >>>>>> VMID field, which is translated to a unique PASID at access time via a set > >>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) > >>>>>> are partitioned (under control of the radeon kgd) between current > >>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The > >>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used, > >>>>>> and by driver code if HW scheduling is not being used. > >>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the > >>>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR > >>>>>> dedicated for this purpose, and pages within the doorbell aperture are > >>>>>> mapped to userspace (each page mapped to only one user address space). > >>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing > >>>>>> userspace code to safely manage work queues (rings) without requiring a > >>>>>> kernel call for every ring update. > >>>>>> First step for an application process is to open the kfd device. Calls to > >>>>>> open create a kfd "process" structure only for the first thread of the > >>>>>> process. Subsequent open calls are checked to see if they are from processes > >>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process > >>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated > >>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory > >>>>>> accessible to the GPU. > >>>>>> Next step is for the application to collect topology information via sysfs. > >>>>>> This gives userspace enough information to be able to identify specific > >>>>>> nodes (processors) in subsequent queue management calls. Application > >>>>>> processes can create queues on multiple processors, and processors support > >>>>>> queues from multiple processes. > >>>>>> At this point the application can create work queues in userspace memory and > >>>>>> pass them through the usermode library to kfd to have them mapped onto HW > >>>>>> queue slots so that commands written to the queues can be executed by the > >>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code > >>>>>> is device-specific. 
> >>>>>> Written by John Bridgman <John.Bridgman@amd.com> > >>>>>> > >>>>>> > >>>>>> Alexey Skidanov (1): > >>>>>> amdkfd: Implement the Get Process Aperture IOCTL > >>>>>> > >>>>>> Andrew Lewycky (3): > >>>>>> amdkfd: Add basic modules to amdkfd > >>>>>> amdkfd: Add interrupt handling module > >>>>>> amdkfd: Implement the Set Memory Policy IOCTL > >>>>>> > >>>>>> Ben Goz (8): > >>>>>> amdkfd: Add queue module > >>>>>> amdkfd: Add mqd_manager module > >>>>>> amdkfd: Add kernel queue module > >>>>>> amdkfd: Add module parameter of scheduling policy > >>>>>> amdkfd: Add packet manager module > >>>>>> amdkfd: Add process queue manager module > >>>>>> amdkfd: Add device queue manager module > >>>>>> amdkfd: Implement the create/destroy/update queue IOCTLs > >>>>>> > >>>>>> Evgeny Pinchuk (3): > >>>>>> amdkfd: Add topology module to amdkfd > >>>>>> amdkfd: Implement the Get Clock Counters IOCTL > >>>>>> amdkfd: Implement the PMC Acquire/Release IOCTLs > >>>>>> > >>>>>> Oded Gabbay (10): > >>>>>> mm: Add kfd_process pointer to mm_struct > >>>>>> drm/radeon: reduce number of free VMIDs and pipes in KV > >>>>>> drm/radeon/cik: Don't touch int of pipes 1-7 > >>>>>> drm/radeon: Report doorbell configuration to amdkfd > >>>>>> drm/radeon: adding synchronization for GRBM GFX > >>>>>> drm/radeon: Add radeon <--> amdkfd interface > >>>>>> Update MAINTAINERS and CREDITS files with amdkfd info > >>>>>> amdkfd: Add IOCTL set definitions of amdkfd > >>>>>> amdkfd: Add amdkfd skeleton driver > >>>>>> amdkfd: Add binding/unbinding calls to amd_iommu driver > >>>>>> > >>>>>> CREDITS | 7 + > >>>>>> MAINTAINERS | 10 + > >>>>>> drivers/gpu/drm/radeon/Kconfig | 2 + > >>>>>> drivers/gpu/drm/radeon/Makefile | 3 + > >>>>>> drivers/gpu/drm/radeon/amdkfd/Kconfig | 10 + > >>>>>> drivers/gpu/drm/radeon/amdkfd/Makefile | 14 + > >>>>>> drivers/gpu/drm/radeon/amdkfd/cik_mqds.h | 185 +++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/cik_regs.h | 220 ++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c | 123 ++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c | 518 +++++++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_crat.h | 294 +++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_device.c | 254 ++++ > >>>>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.c | 985 ++++++++++++++++ > >>>>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.h | 101 ++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c | 264 +++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c | 161 +++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c | 305 +++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h | 66 ++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_module.c | 131 +++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c | 291 +++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h | 54 + > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c | 488 ++++++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c | 97 ++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h | 682 +++++++++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h | 107 ++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_priv.h | 466 ++++++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_process.c | 405 +++++++ > >>>>>> .../drm/radeon/amdkfd/kfd_process_queue_manager.c | 343 ++++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_queue.c | 109 ++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.c | 1207 > >>>>>> ++++++++++++++++++++ > >>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.h | 168 +++ > >>>>>> 
drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c | 96 ++ > >>>>>> drivers/gpu/drm/radeon/cik.c | 154 +-- > >>>>>> drivers/gpu/drm/radeon/cik_reg.h | 65 ++ > >>>>>> drivers/gpu/drm/radeon/cikd.h | 51 +- > >>>>>> drivers/gpu/drm/radeon/radeon.h | 9 + > >>>>>> drivers/gpu/drm/radeon/radeon_device.c | 32 + > >>>>>> drivers/gpu/drm/radeon/radeon_drv.c | 5 + > >>>>>> drivers/gpu/drm/radeon/radeon_kfd.c | 566 +++++++++ > >>>>>> drivers/gpu/drm/radeon/radeon_kfd.h | 119 ++ > >>>>>> drivers/gpu/drm/radeon/radeon_kms.c | 7 + > >>>>>> include/linux/mm_types.h | 14 + > >>>>>> include/uapi/linux/kfd_ioctl.h | 133 +++ > >>>>>> 43 files changed, 9226 insertions(+), 95 deletions(-) > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h > >>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c > >>>>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h > >>>>>> create mode 100644 include/uapi/linux/kfd_ioctl.h > >>>>>> > >>>>>> -- > >>>>>> 1.9.1 > >>>>>> > >>>> > >>> > >> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
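The VMID partitioning described in the cover letter above (16 VMIDs, 8 kept by regular gfx/compute and 8 handed to HSA compute, each bound to a PASID through per-VMID mapping registers) can be pictured with a small sketch. The register offsets and the write_mmio() helper are hypothetical stand-ins, not the definitions from cik_regs.h:

```c
/*
 * Sketch of the 8/8 VMID split and the VMID-to-PASID mapping described in
 * the cover letter.  write_mmio() and the register offsets are hypothetical
 * stand-ins -- the real driver programs the CIK mapping registers from the
 * radeon kgd side (or leaves it to the CP HW scheduler when HWS is used).
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOTAL_VMIDS     16
#define KFD_VMID_FIRST   8      /* VMIDs 0..7 stay with gfx/compute, 8..15 go to HSA */

#define VMID_PASID_MAPPING(vmid) (0x1000u + 4u * (vmid))   /* hypothetical */
#define PASID_MAPPING_VALID      (1u << 31)

static void write_mmio(uint32_t reg, uint32_t val)          /* stub for illustration */
{
	printf("reg 0x%04x <- 0x%08x\n", reg, val);
}

static uint16_t vmid_pasid[TOTAL_VMIDS];    /* PASID currently bound to each HSA VMID */

/* Bind a process PASID to a free HSA VMID (only needed when HWS is off). */
static bool bind_pasid_to_vmid(uint16_t pasid)
{
	for (int vmid = KFD_VMID_FIRST; vmid < TOTAL_VMIDS; vmid++) {
		if (vmid_pasid[vmid] == 0) {
			vmid_pasid[vmid] = pasid;
			write_mmio(VMID_PASID_MAPPING(vmid),
				   PASID_MAPPING_VALID | pasid);
			return true;
		}
	}
	return false;   /* all 8 HSA VMIDs are in use */
}

int main(void)
{
	return bind_pasid_to_vmid(42) ? 0 : 1;
}
```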
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 18:14 ` Jerome Glisse @ 2014-07-21 18:36 ` Oded Gabbay 2014-07-21 18:59 ` Jerome Glisse 0 siblings, 1 reply; 49+ messages in thread From: Oded Gabbay @ 2014-07-21 18:36 UTC (permalink / raw) To: Jerome Glisse Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On 21/07/14 21:14, Jerome Glisse wrote: > On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: >> On 21/07/14 18:54, Jerome Glisse wrote: >>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: >>>> On 21/07/14 16:39, Christian König wrote: >>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: >>>>>> On 20/07/14 20:46, Jerome Glisse wrote: >>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >>>>>>>> Forgot to cc mailing list on cover letter. Sorry. >>>>>>>> >>>>>>>> As a continuation to the existing discussion, here is a v2 patch series >>>>>>>> restructured with a cleaner history and no totally-different-early-versions >>>>>>>> of the code. >>>>>>>> >>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them >>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code. >>>>>>>> There is no code going away or even modified between patches, only added. >>>>>>>> >>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under >>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver >>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a >>>>>>>> generic hsa framework being implemented in the future and in that case, we >>>>>>>> will adjust amdkfd to work within that framework. >>>>>>>> >>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to >>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is >>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon >>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point >>>>>>>> is the radeon driver. Having said that, we will probably need to move it >>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx >>>>>>>> drivers. >>>>>>>> >>>>>>>> For people who like to review using git, the v2 patch set is located at: >>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 >>>>>>>> >>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> >>>>>>> >>>>>>> So quick comments before i finish going over all patches. There is many >>>>>>> things that need more documentation espacialy as of right now there is >>>>>>> no userspace i can go look at. >>>>>> So quick comments on some of your questions but first of all, thanks for the >>>>>> time you dedicated to review the code. >>>>>>> >>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big >>>>>>> no, that would need serious arguments for any hope of convincing me on >>>>>>> that side. >>>>>> We only do gpu memory pinning for kernel objects. There are no userspace >>>>>> objects that are pinned on the gpu memory in our driver. If that is the case, >>>>>> is it still a show stopper ? >>>>>> >>>>>> The kernel objects are: >>>>>> - pipelines (4 per device) >>>>>> - mqd per hiq (only 1 per device) >>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for >>>>>> a total of 512K queues. 
Each mqd is 151 bytes, but the allocation is done in >>>>>> 256 alignment. So total *possible* memory is 128MB >>>>>> - kernel queue (only 1 per device) >>>>>> - fence address for kernel queue >>>>>> - runlists for the CP (1 or 2 per device) >>>>> >>>>> The main questions here are if it's avoid able to pin down the memory and if the >>>>> memory is pinned down at driver load, by request from userspace or by anything >>>>> else. >>>>> >>>>> As far as I can see only the "mqd per userspace queue" might be a bit >>>>> questionable, everything else sounds reasonable. >>>>> >>>>> Christian. >>>> >>>> Most of the pin downs are done on device initialization. >>>> The "mqd per userspace" is done per userspace queue creation. However, as I >>>> said, it has an upper limit of 128MB on KV, and considering the 2G local >>>> memory, I think it is OK. >>>> The runlists are also done on userspace queue creation/deletion, but we only >>>> have 1 or 2 runlists per device, so it is not that bad. >>> >>> 2G local memory ? You can not assume anything on userside configuration some >>> one might build an hsa computer with 512M and still expect a functioning >>> desktop. >> First of all, I'm only considering Kaveri computer, not "hsa" computer. >> Second, I would imagine we can build some protection around it, like >> checking total local memory and limit number of queues based on some >> percentage of that total local memory. So, if someone will have only >> 512M, he will be able to open less queues. >> >> >>> >>> I need to go look into what all this mqd is for, what it does and what it is >>> about. But pinning is really bad and this is an issue with userspace command >>> scheduling an issue that obviously AMD fails to take into account in design >>> phase. >> Maybe, but that is the H/W design non-the-less. We can't very well >> change the H/W. > > You can not change the hardware but it is not an excuse to allow bad design to > sneak in software to work around that. So i would rather penalize bad hardware > design and have command submission in the kernel, until AMD fix its hardware to > allow proper scheduling by the kernel and proper control by the kernel. I'm sorry but I do *not* think this is a bad design. S/W scheduling in the kernel can not, IMO, scale well to 100K queues and 10K processes. > Because really where we want to go is having GPU closer to a CPU in term of scheduling > capacity and once we get there we want the kernel to always be able to take over > and do whatever it wants behind process back. Who do you refer to when you say "we" ? AFAIK, the hw scheduling direction is where AMD is now and where it is heading in the future. That doesn't preclude the option to allow the kernel to take over and do what he wants. I agree that in KV we have a problem where we can't do a mid-wave preemption, so theoretically, a long running compute kernel can make things messy, but in Carrizo, we will have this ability. Having said that, it will only be through the CP H/W scheduling. So AMD is _not_ going to abandon H/W scheduling. You can dislike it, but this is the situation. > >>>>> >>>>>>> >>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common >>>>>>> stuff there. >>>>>>> >>>>>>> Given that this is not intended to be final HSA api AFAICT then i would >>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon. >>>>>>> This would avoid crazy communication btw radeon and kfd. >>>>>>> >>>>>>> The whole aperture business needs some serious explanation. 
Especialy as >>>>>>> you want to use userspace address there is nothing to prevent userspace >>>>>>> program from allocating things at address you reserve for lds, scratch, >>>>>>> ... only sane way would be to move those lds, scratch inside the virtual >>>>>>> address reserved for kernel (see kernel memory map). >>>>>>> >>>>>>> The whole business of locking performance counter for exclusive per process >>>>>>> access is a big NO. Which leads me to the questionable usefullness of user >>>>>>> space command ring. >>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I >>>>>> find it analogous to a situation where a network maintainer nacking a driver >>>>>> for a network card, which is slower than a different network card. Doesn't >>>>>> seem reasonable this situation is would happen. He would still put both the >>>>>> drivers in the kernel because people want to use the H/W and its features. So, >>>>>> I don't think this is a valid reason to NACK the driver. >>> >>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning >>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING >>> the performance ioctl. >>> >>> Again this is another argument for round trip to the kernel. As inside kernel you >>> could properly do exclusive gpu counter access accross single user cmd buffer >>> execution. >>> >>>>>> >>>>>>> I only see issues with that. First and foremost i would >>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an >>>>>>> overhead that is measurable in any meaning full way against a simple >>>>>>> function call. I know the userspace command ring is a big marketing features >>>>>>> that please ignorant userspace programmer. But really this only brings issues >>>>>>> and for absolutely not upside afaict. >>>>>> Really ? You think that doing a context switch to kernel space, with all its >>>>>> overhead, is _not_ more expansive than just calling a function in userspace >>>>>> which only puts a buffer on a ring and writes a doorbell ? >>> >>> I am saying the overhead is not that big and it probably will not matter in most >>> usecase. For instance i did wrote the most useless kernel module that add two >>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and >>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so >>> ioctl is 13 times slower. >>> >>> Now if there is enough data that shows that a significant percentage of jobs >>> submited to the GPU will take less that 0.35microsecond then yes userspace >>> scheduling does make sense. But so far all we have is handwaving with no data >>> to support any facts. >>> >>> >>> Now if we want to schedule from userspace than you will need to do something >>> about the pinning, something that gives control to kernel so that kernel can >>> unpin when it wants and move object when it wants no matter what userspace is >>> doing. >>> >>>>>>> >>>>>>> So i would rather see a very simple ioctl that write the doorbell and might >>>>>>> do more than that in case of ring/queue overcommit where it would first have >>>>>>> to wait for a free ring/queue to schedule stuff. This would also allow sane >>>>>>> implementation of things like performance counter that could be acquire by >>>>>>> kernel for duration of a job submitted by userspace. While still not optimal >>>>>>> this would be better that userspace locking. >>>>>>> >>>>>>> >>>>>>> I might have more thoughts once i am done with all the patches. 
>>>>>>> >>>>>>> Cheers, >>>>>>> Jérôme >>>>>>> >>>>>>>> >>>>>>>> Original Cover Letter: >>>>>>>> >>>>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver >>>>>>>> for radeon-family GPUs. >>>>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share >>>>>>>> system resources more effectively via HW features including shared pageable >>>>>>>> memory, userspace-accessible work queues, and platform-level atomics. In >>>>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea >>>>>>>> Islands family of GPUs also performs HW-level validation of commands passed >>>>>>>> in through the queues (aka rings). >>>>>>>> >>>>>>>> The code in this patch set is intended to serve both as a sample driver for >>>>>>>> other HSA-compatible hardware devices and as a production driver for >>>>>>>> radeon-family processors. The code is architected to support multiple CPUs >>>>>>>> each with connected GPUs, although the current implementation focuses on a >>>>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel >>>>>>>> graphics driver (kgd). >>>>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware >>>>>>>> functionality between HSA compute and regular gfx/compute (memory, >>>>>>>> interrupts, registers), while other functionality has been added >>>>>>>> specifically for HSA compute (hw scheduler for virtualized compute rings). >>>>>>>> All shared hardware is owned by the radeon graphics driver, and an interface >>>>>>>> between kfd and kgd allows the kfd to make use of those shared resources, >>>>>>>> while HSA-specific functionality is managed directly by kfd by submitting >>>>>>>> packets into an HSA-specific command queue (the "HIQ"). >>>>>>>> >>>>>>>> During kfd module initialization a char device node (/dev/kfd) is created >>>>>>>> (surviving until module exit), with ioctls for queue creation & management, >>>>>>>> and data structures are initialized for managing HSA device topology. >>>>>>>> The rest of the initialization is driven by calls from the radeon kgd at the >>>>>>>> following points : >>>>>>>> >>>>>>>> - radeon_init (kfd_init) >>>>>>>> - radeon_exit (kfd_fini) >>>>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init) >>>>>>>> - radeon_driver_unload_kms (kfd_device_fini) >>>>>>>> >>>>>>>> During the probe and init processing per-device data structures are >>>>>>>> established which connect to the associated graphics kernel driver. This >>>>>>>> information is exposed to userspace via sysfs, along with a version number >>>>>>>> allowing userspace to determine if a topology change has occurred while it >>>>>>>> was reading from sysfs. >>>>>>>> The interface between kfd and kgd also allows the kfd to request buffer >>>>>>>> management services from kgd, and allows kgd to route interrupt requests to >>>>>>>> kfd code since the interrupt block is shared between regular >>>>>>>> graphics/compute and HSA compute subsystems in the GPU. >>>>>>>> >>>>>>>> The kfd code works with an open source usermode library ("libhsakmt") which >>>>>>>> is in the final stages of IP review and should be published in a separate >>>>>>>> repo over the next few days. 
>>>>>>>> The code operates in one of three modes, selectable via the sched_policy >>>>>>>> module parameter : >>>>>>>> >>>>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within >>>>>>>> CP, and allows oversubscription (more queues than HW slots) >>>>>>>> - sched_policy=1 also uses HW scheduling but does not allow >>>>>>>> oversubscription, so create_queue requests fail when we run out of HW slots >>>>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns >>>>>>>> queues to HW slots by programming registers >>>>>>>> >>>>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so >>>>>>>> has less test coverage than the other options. Default in the current code >>>>>>>> is "HW scheduling without oversubscription" since that is where we have the >>>>>>>> most test coverage but we expect to change the default to "HW scheduling >>>>>>>> with oversubscription" after further testing. This effectively removes the >>>>>>>> HW limit on the number of work queues available to applications. >>>>>>>> >>>>>>>> Programs running on the GPU are associated with an address space through the >>>>>>>> VMID field, which is translated to a unique PASID at access time via a set >>>>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) >>>>>>>> are partitioned (under control of the radeon kgd) between current >>>>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The >>>>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used, >>>>>>>> and by driver code if HW scheduling is not being used. >>>>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the >>>>>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR >>>>>>>> dedicated for this purpose, and pages within the doorbell aperture are >>>>>>>> mapped to userspace (each page mapped to only one user address space). >>>>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing >>>>>>>> userspace code to safely manage work queues (rings) without requiring a >>>>>>>> kernel call for every ring update. >>>>>>>> First step for an application process is to open the kfd device. Calls to >>>>>>>> open create a kfd "process" structure only for the first thread of the >>>>>>>> process. Subsequent open calls are checked to see if they are from processes >>>>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process >>>>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated >>>>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory >>>>>>>> accessible to the GPU. >>>>>>>> Next step is for the application to collect topology information via sysfs. >>>>>>>> This gives userspace enough information to be able to identify specific >>>>>>>> nodes (processors) in subsequent queue management calls. Application >>>>>>>> processes can create queues on multiple processors, and processors support >>>>>>>> queues from multiple processes. >>>>>>>> At this point the application can create work queues in userspace memory and >>>>>>>> pass them through the usermode library to kfd to have them mapped onto HW >>>>>>>> queue slots so that commands written to the queues can be executed by the >>>>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code >>>>>>>> is device-specific. 
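The three modes above are selected through the amdkfd sched_policy module parameter. A minimal sketch of how such a parameter is typically declared in a kernel module follows; the enum names are assumptions, only the three values and their meaning come from the cover letter:

```c
/*
 * Sketch of the sched_policy module parameter described above.  The enum
 * names are assumptions; the three modes and their meaning come from the
 * cover letter (0 = HW scheduling with oversubscription, 1 = HW scheduling
 * without oversubscription, 2 = no HW scheduling, debug/bring-up only).
 */
#include <linux/module.h>
#include <linux/moduleparam.h>

enum kfd_sched_policy {
	KFD_SCHED_POLICY_HWS = 0,            /* CP MEC scheduler, oversubscription allowed */
	KFD_SCHED_POLICY_HWS_NO_OVERSUB = 1, /* create_queue fails when HW slots run out   */
	KFD_SCHED_POLICY_NO_HWS = 2,         /* driver assigns queues by programming registers */
};

static int sched_policy = KFD_SCHED_POLICY_HWS_NO_OVERSUB;   /* current default */
module_param(sched_policy, int, 0444);
MODULE_PARM_DESC(sched_policy,
	"Scheduling policy (0 = HW scheduling with oversubscription, "
	"1 = HW scheduling without oversubscription, 2 = no HW scheduling)");

MODULE_LICENSE("GPL");
```

Loading with, for example, sched_policy=2 would then select the register-programming path that the cover letter reserves for debug and new-hardware bring-up.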
>>>>>>>> Written by John Bridgman <John.Bridgman@amd.com> >>>>>>>> >>>>>>>> >>>>>>>> Alexey Skidanov (1): >>>>>>>> amdkfd: Implement the Get Process Aperture IOCTL >>>>>>>> >>>>>>>> Andrew Lewycky (3): >>>>>>>> amdkfd: Add basic modules to amdkfd >>>>>>>> amdkfd: Add interrupt handling module >>>>>>>> amdkfd: Implement the Set Memory Policy IOCTL >>>>>>>> >>>>>>>> Ben Goz (8): >>>>>>>> amdkfd: Add queue module >>>>>>>> amdkfd: Add mqd_manager module >>>>>>>> amdkfd: Add kernel queue module >>>>>>>> amdkfd: Add module parameter of scheduling policy >>>>>>>> amdkfd: Add packet manager module >>>>>>>> amdkfd: Add process queue manager module >>>>>>>> amdkfd: Add device queue manager module >>>>>>>> amdkfd: Implement the create/destroy/update queue IOCTLs >>>>>>>> >>>>>>>> Evgeny Pinchuk (3): >>>>>>>> amdkfd: Add topology module to amdkfd >>>>>>>> amdkfd: Implement the Get Clock Counters IOCTL >>>>>>>> amdkfd: Implement the PMC Acquire/Release IOCTLs >>>>>>>> >>>>>>>> Oded Gabbay (10): >>>>>>>> mm: Add kfd_process pointer to mm_struct >>>>>>>> drm/radeon: reduce number of free VMIDs and pipes in KV >>>>>>>> drm/radeon/cik: Don't touch int of pipes 1-7 >>>>>>>> drm/radeon: Report doorbell configuration to amdkfd >>>>>>>> drm/radeon: adding synchronization for GRBM GFX >>>>>>>> drm/radeon: Add radeon <--> amdkfd interface >>>>>>>> Update MAINTAINERS and CREDITS files with amdkfd info >>>>>>>> amdkfd: Add IOCTL set definitions of amdkfd >>>>>>>> amdkfd: Add amdkfd skeleton driver >>>>>>>> amdkfd: Add binding/unbinding calls to amd_iommu driver >>>>>>>> >>>>>>>> CREDITS | 7 + >>>>>>>> MAINTAINERS | 10 + >>>>>>>> drivers/gpu/drm/radeon/Kconfig | 2 + >>>>>>>> drivers/gpu/drm/radeon/Makefile | 3 + >>>>>>>> drivers/gpu/drm/radeon/amdkfd/Kconfig | 10 + >>>>>>>> drivers/gpu/drm/radeon/amdkfd/Makefile | 14 + >>>>>>>> drivers/gpu/drm/radeon/amdkfd/cik_mqds.h | 185 +++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/cik_regs.h | 220 ++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c | 123 ++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c | 518 +++++++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_crat.h | 294 +++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_device.c | 254 ++++ >>>>>>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.c | 985 ++++++++++++++++ >>>>>>>> .../drm/radeon/amdkfd/kfd_device_queue_manager.h | 101 ++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c | 264 +++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c | 161 +++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c | 305 +++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h | 66 ++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_module.c | 131 +++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c | 291 +++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h | 54 + >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c | 488 ++++++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c | 97 ++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h | 682 +++++++++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h | 107 ++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_priv.h | 466 ++++++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_process.c | 405 +++++++ >>>>>>>> .../drm/radeon/amdkfd/kfd_process_queue_manager.c | 343 ++++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_queue.c | 109 ++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.c | 1207 >>>>>>>> ++++++++++++++++++++ >>>>>>>> drivers/gpu/drm/radeon/amdkfd/kfd_topology.h | 168 +++ >>>>>>>> 
drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c | 96 ++ >>>>>>>> drivers/gpu/drm/radeon/cik.c | 154 +-- >>>>>>>> drivers/gpu/drm/radeon/cik_reg.h | 65 ++ >>>>>>>> drivers/gpu/drm/radeon/cikd.h | 51 +- >>>>>>>> drivers/gpu/drm/radeon/radeon.h | 9 + >>>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 32 + >>>>>>>> drivers/gpu/drm/radeon/radeon_drv.c | 5 + >>>>>>>> drivers/gpu/drm/radeon/radeon_kfd.c | 566 +++++++++ >>>>>>>> drivers/gpu/drm/radeon/radeon_kfd.h | 119 ++ >>>>>>>> drivers/gpu/drm/radeon/radeon_kms.c | 7 + >>>>>>>> include/linux/mm_types.h | 14 + >>>>>>>> include/uapi/linux/kfd_ioctl.h | 133 +++ >>>>>>>> 43 files changed, 9226 insertions(+), 95 deletions(-) >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c >>>>>>>> create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h >>>>>>>> create mode 100644 include/uapi/linux/kfd_ioctl.h >>>>>>>> >>>>>>>> -- >>>>>>>> 1.9.1 >>>>>>>> >>>>>> >>>>> >>>> >> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
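For reference, the 128MB worst case quoted in this exchange follows directly from the numbers given: 1K queues per process, 512K queues in total, and each 151-byte MQD padded to a 256-byte allocation. A trivial check of that arithmetic:

```c
/* Worst-case pinned MQD memory on KV, using the figures quoted above:
 * 1K queues per process, 512K queues total (i.e. up to 512 processes),
 * each MQD padded from 151 bytes up to a 256-byte allocation. */
#include <stdio.h>

int main(void)
{
	const unsigned long queues_per_process = 1024;
	const unsigned long max_processes      = 512;
	const unsigned long mqd_alloc_size     = 256;   /* 151 bytes rounded up */

	unsigned long total_queues = queues_per_process * max_processes; /* 524288 */
	unsigned long bytes = total_queues * mqd_alloc_size;

	printf("%lu queues -> %lu MB of pinned MQDs worst case\n",
	       total_queues, bytes >> 20);              /* prints 524288 -> 128 MB */
	return 0;
}
```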
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 18:36 ` Oded Gabbay @ 2014-07-21 18:59 ` Jerome Glisse 2014-07-21 19:23 ` Oded Gabbay 0 siblings, 1 reply; 49+ messages in thread From: Jerome Glisse @ 2014-07-21 18:59 UTC (permalink / raw) To: Oded Gabbay Cc: Andrew Lewycky, Michel Dänzer, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote: > On 21/07/14 21:14, Jerome Glisse wrote: > > On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: > >> On 21/07/14 18:54, Jerome Glisse wrote: > >>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: > >>>> On 21/07/14 16:39, Christian Konig wrote: > >>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: > >>>>>> On 20/07/14 20:46, Jerome Glisse wrote: > >>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: > >>>>>>>> Forgot to cc mailing list on cover letter. Sorry. > >>>>>>>> > >>>>>>>> As a continuation to the existing discussion, here is a v2 patch series > >>>>>>>> restructured with a cleaner history and no totally-different-early-versions > >>>>>>>> of the code. > >>>>>>>> > >>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them > >>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code. > >>>>>>>> There is no code going away or even modified between patches, only added. > >>>>>>>> > >>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under > >>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver > >>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a > >>>>>>>> generic hsa framework being implemented in the future and in that case, we > >>>>>>>> will adjust amdkfd to work within that framework. > >>>>>>>> > >>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to > >>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is > >>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon > >>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point > >>>>>>>> is the radeon driver. Having said that, we will probably need to move it > >>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx > >>>>>>>> drivers. > >>>>>>>> > >>>>>>>> For people who like to review using git, the v2 patch set is located at: > >>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 > >>>>>>>> > >>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> > >>>>>>> > >>>>>>> So quick comments before i finish going over all patches. There is many > >>>>>>> things that need more documentation espacialy as of right now there is > >>>>>>> no userspace i can go look at. > >>>>>> So quick comments on some of your questions but first of all, thanks for the > >>>>>> time you dedicated to review the code. > >>>>>>> > >>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big > >>>>>>> no, that would need serious arguments for any hope of convincing me on > >>>>>>> that side. > >>>>>> We only do gpu memory pinning for kernel objects. There are no userspace > >>>>>> objects that are pinned on the gpu memory in our driver. If that is the case, > >>>>>> is it still a show stopper ? > >>>>>> > >>>>>> The kernel objects are: > >>>>>> - pipelines (4 per device) > >>>>>> - mqd per hiq (only 1 per device) > >>>>>> - mqd per userspace queue. 
On KV, we support up to 1K queues per process, for > >>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in > >>>>>> 256 alignment. So total *possible* memory is 128MB > >>>>>> - kernel queue (only 1 per device) > >>>>>> - fence address for kernel queue > >>>>>> - runlists for the CP (1 or 2 per device) > >>>>> > >>>>> The main questions here are if it's avoid able to pin down the memory and if the > >>>>> memory is pinned down at driver load, by request from userspace or by anything > >>>>> else. > >>>>> > >>>>> As far as I can see only the "mqd per userspace queue" might be a bit > >>>>> questionable, everything else sounds reasonable. > >>>>> > >>>>> Christian. > >>>> > >>>> Most of the pin downs are done on device initialization. > >>>> The "mqd per userspace" is done per userspace queue creation. However, as I > >>>> said, it has an upper limit of 128MB on KV, and considering the 2G local > >>>> memory, I think it is OK. > >>>> The runlists are also done on userspace queue creation/deletion, but we only > >>>> have 1 or 2 runlists per device, so it is not that bad. > >>> > >>> 2G local memory ? You can not assume anything on userside configuration some > >>> one might build an hsa computer with 512M and still expect a functioning > >>> desktop. > >> First of all, I'm only considering Kaveri computer, not "hsa" computer. > >> Second, I would imagine we can build some protection around it, like > >> checking total local memory and limit number of queues based on some > >> percentage of that total local memory. So, if someone will have only > >> 512M, he will be able to open less queues. > >> > >> > >>> > >>> I need to go look into what all this mqd is for, what it does and what it is > >>> about. But pinning is really bad and this is an issue with userspace command > >>> scheduling an issue that obviously AMD fails to take into account in design > >>> phase. > >> Maybe, but that is the H/W design non-the-less. We can't very well > >> change the H/W. > > > > You can not change the hardware but it is not an excuse to allow bad design to > > sneak in software to work around that. So i would rather penalize bad hardware > > design and have command submission in the kernel, until AMD fix its hardware to > > allow proper scheduling by the kernel and proper control by the kernel. > I'm sorry but I do *not* think this is a bad design. S/W scheduling in > the kernel can not, IMO, scale well to 100K queues and 10K processes. I am not advocating for having kernel decide down to the very last details. I am advocating for kernel being able to preempt at any time and be able to decrease or increase user queue priority so overall kernel is in charge of resources management and it can handle rogue client in proper fashion. > > > Because really where we want to go is having GPU closer to a CPU in term of scheduling > > capacity and once we get there we want the kernel to always be able to take over > > and do whatever it wants behind process back. > Who do you refer to when you say "we" ? AFAIK, the hw scheduling > direction is where AMD is now and where it is heading in the future. > That doesn't preclude the option to allow the kernel to take over and do > what he wants. I agree that in KV we have a problem where we can't do a > mid-wave preemption, so theoretically, a long running compute kernel can > make things messy, but in Carrizo, we will have this ability. Having > said that, it will only be through the CP H/W scheduling. 
So AMD is > _not_ going to abandon H/W scheduling. You can dislike it, but this is > the situation. We was for the overall Linux community but maybe i should not pretend to talk for anyone interested in having a common standard. My point is that current hardware do not have approriate hardware support for preemption hence, current hardware should use ioctl to schedule job and AMD should think a bit more on commiting to a design and handwaving any hardware short coming as something that can be work around in the software. The pinning thing is broken by design, only way to work around it is through kernel cmd queue scheduling that's a fact. Once hardware support proper preemption and allows to move around/evict buffer use on behalf of userspace command queue then we can allow userspace scheduling but until then my personnal opinion is that it should not be allowed and that people will have to pay the ioctl price which i proved to be small, because really if you 100K queue each with one job, i would not expect that all those 100K job will complete in less time than it takes to execute an ioctl ie by even if you do not have the ioctl delay what ever you schedule will have to wait on previously submited jobs. > > > >>>>> > >>>>>>> > >>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common > >>>>>>> stuff there. > >>>>>>> > >>>>>>> Given that this is not intended to be final HSA api AFAICT then i would > >>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon. > >>>>>>> This would avoid crazy communication btw radeon and kfd. > >>>>>>> > >>>>>>> The whole aperture business needs some serious explanation. Especialy as > >>>>>>> you want to use userspace address there is nothing to prevent userspace > >>>>>>> program from allocating things at address you reserve for lds, scratch, > >>>>>>> ... only sane way would be to move those lds, scratch inside the virtual > >>>>>>> address reserved for kernel (see kernel memory map). > >>>>>>> > >>>>>>> The whole business of locking performance counter for exclusive per process > >>>>>>> access is a big NO. Which leads me to the questionable usefullness of user > >>>>>>> space command ring. > >>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I > >>>>>> find it analogous to a situation where a network maintainer nacking a driver > >>>>>> for a network card, which is slower than a different network card. Doesn't > >>>>>> seem reasonable this situation is would happen. He would still put both the > >>>>>> drivers in the kernel because people want to use the H/W and its features. So, > >>>>>> I don't think this is a valid reason to NACK the driver. > >>> > >>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning > >>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING > >>> the performance ioctl. > >>> > >>> Again this is another argument for round trip to the kernel. As inside kernel you > >>> could properly do exclusive gpu counter access accross single user cmd buffer > >>> execution. > >>> > >>>>>> > >>>>>>> I only see issues with that. First and foremost i would > >>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an > >>>>>>> overhead that is measurable in any meaning full way against a simple > >>>>>>> function call. I know the userspace command ring is a big marketing features > >>>>>>> that please ignorant userspace programmer. 
But really this only brings issues > >>>>>>> and for absolutely not upside afaict. > >>>>>> Really ? You think that doing a context switch to kernel space, with all its > >>>>>> overhead, is _not_ more expansive than just calling a function in userspace > >>>>>> which only puts a buffer on a ring and writes a doorbell ? > >>> > >>> I am saying the overhead is not that big and it probably will not matter in most > >>> usecase. For instance i did wrote the most useless kernel module that add two > >>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and > >>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so > >>> ioctl is 13 times slower. > >>> > >>> Now if there is enough data that shows that a significant percentage of jobs > >>> submited to the GPU will take less that 0.35microsecond then yes userspace > >>> scheduling does make sense. But so far all we have is handwaving with no data > >>> to support any facts. > >>> > >>> > >>> Now if we want to schedule from userspace than you will need to do something > >>> about the pinning, something that gives control to kernel so that kernel can > >>> unpin when it wants and move object when it wants no matter what userspace is > >>> doing. > >>> > >>>>>>> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
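The 0.35 vs 0.025 microsecond figures Jerome cites above come from his adder test module. A userspace harness along those lines would look roughly like the sketch below; the device path, ioctl number and argument struct are placeholders rather than the actual interface of that module:

```c
/*
 * Rough userspace harness in the spirit of the adder test module cited
 * above: time N ioctl round trips against N plain function calls.  The
 * device path, ioctl number and argument struct are placeholders, not
 * the actual interface of that module.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>

struct adder_args { int a, b, result; };                  /* placeholder */
#define ADDER_IOC_ADD _IOWR('A', 0x01, struct adder_args) /* placeholder */

static int add_in_userspace(int a, int b) { return a + b; }

static double ns_between(struct timespec s, struct timespec e)
{
	return (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
}

int main(void)
{
	enum { N = 1000000 };
	struct timespec t0, t1;
	struct adder_args args = { 1, 2, 0 };
	int fd = open("/dev/adder", O_RDWR);   /* placeholder node */
	volatile int sink = 0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < N; i++)
		ioctl(fd, ADDER_IOC_ADD, &args);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("ioctl:    %.1f ns per call\n", ns_between(t0, t1) / N);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < N; i++)
		sink += add_in_userspace(i, 2);   /* volatile sink keeps the loop */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("function: %.1f ns per call\n", ns_between(t0, t1) / N);

	return 0;
}
```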
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 18:59 ` Jerome Glisse @ 2014-07-21 19:23 ` Oded Gabbay 2014-07-21 19:28 ` Jerome Glisse 2014-07-22 7:23 ` Daniel Vetter 0 siblings, 2 replies; 49+ messages in thread From: Oded Gabbay @ 2014-07-21 19:23 UTC (permalink / raw) To: Jerome Glisse Cc: Andrew Lewycky, Michel Dänzer, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton On 21/07/14 21:59, Jerome Glisse wrote: > On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote: >> On 21/07/14 21:14, Jerome Glisse wrote: >>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: >>>> On 21/07/14 18:54, Jerome Glisse wrote: >>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: >>>>>> On 21/07/14 16:39, Christian König wrote: >>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: >>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote: >>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry. >>>>>>>>>> >>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series >>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions >>>>>>>>>> of the code. >>>>>>>>>> >>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them >>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code. >>>>>>>>>> There is no code going away or even modified between patches, only added. >>>>>>>>>> >>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under >>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver >>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a >>>>>>>>>> generic hsa framework being implemented in the future and in that case, we >>>>>>>>>> will adjust amdkfd to work within that framework. >>>>>>>>>> >>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to >>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is >>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon >>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point >>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it >>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx >>>>>>>>>> drivers. >>>>>>>>>> >>>>>>>>>> For people who like to review using git, the v2 patch set is located at: >>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 >>>>>>>>>> >>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> >>>>>>>>> >>>>>>>>> So quick comments before i finish going over all patches. There is many >>>>>>>>> things that need more documentation espacialy as of right now there is >>>>>>>>> no userspace i can go look at. >>>>>>>> So quick comments on some of your questions but first of all, thanks for the >>>>>>>> time you dedicated to review the code. >>>>>>>>> >>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big >>>>>>>>> no, that would need serious arguments for any hope of convincing me on >>>>>>>>> that side. >>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace >>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case, >>>>>>>> is it still a show stopper ? 
>>>>>>>> >>>>>>>> The kernel objects are: >>>>>>>> - pipelines (4 per device) >>>>>>>> - mqd per hiq (only 1 per device) >>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for >>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in >>>>>>>> 256 alignment. So total *possible* memory is 128MB >>>>>>>> - kernel queue (only 1 per device) >>>>>>>> - fence address for kernel queue >>>>>>>> - runlists for the CP (1 or 2 per device) >>>>>>> >>>>>>> The main questions here are if it's avoid able to pin down the memory and if the >>>>>>> memory is pinned down at driver load, by request from userspace or by anything >>>>>>> else. >>>>>>> >>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit >>>>>>> questionable, everything else sounds reasonable. >>>>>>> >>>>>>> Christian. >>>>>> >>>>>> Most of the pin downs are done on device initialization. >>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I >>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local >>>>>> memory, I think it is OK. >>>>>> The runlists are also done on userspace queue creation/deletion, but we only >>>>>> have 1 or 2 runlists per device, so it is not that bad. >>>>> >>>>> 2G local memory ? You can not assume anything on userside configuration some >>>>> one might build an hsa computer with 512M and still expect a functioning >>>>> desktop. >>>> First of all, I'm only considering Kaveri computer, not "hsa" computer. >>>> Second, I would imagine we can build some protection around it, like >>>> checking total local memory and limit number of queues based on some >>>> percentage of that total local memory. So, if someone will have only >>>> 512M, he will be able to open less queues. >>>> >>>> >>>>> >>>>> I need to go look into what all this mqd is for, what it does and what it is >>>>> about. But pinning is really bad and this is an issue with userspace command >>>>> scheduling an issue that obviously AMD fails to take into account in design >>>>> phase. >>>> Maybe, but that is the H/W design non-the-less. We can't very well >>>> change the H/W. >>> >>> You can not change the hardware but it is not an excuse to allow bad design to >>> sneak in software to work around that. So i would rather penalize bad hardware >>> design and have command submission in the kernel, until AMD fix its hardware to >>> allow proper scheduling by the kernel and proper control by the kernel. >> I'm sorry but I do *not* think this is a bad design. S/W scheduling in >> the kernel can not, IMO, scale well to 100K queues and 10K processes. > > I am not advocating for having kernel decide down to the very last details. I am > advocating for kernel being able to preempt at any time and be able to decrease > or increase user queue priority so overall kernel is in charge of resources > management and it can handle rogue client in proper fashion. > >> >>> Because really where we want to go is having GPU closer to a CPU in term of scheduling >>> capacity and once we get there we want the kernel to always be able to take over >>> and do whatever it wants behind process back. >> Who do you refer to when you say "we" ? AFAIK, the hw scheduling >> direction is where AMD is now and where it is heading in the future. >> That doesn't preclude the option to allow the kernel to take over and do >> what he wants. 
I agree that in KV we have a problem where we can't do a >> mid-wave preemption, so theoretically, a long running compute kernel can >> make things messy, but in Carrizo, we will have this ability. Having >> said that, it will only be through the CP H/W scheduling. So AMD is >> _not_ going to abandon H/W scheduling. You can dislike it, but this is >> the situation. > > We was for the overall Linux community but maybe i should not pretend to talk > for anyone interested in having a common standard. > > My point is that current hardware do not have approriate hardware support for > preemption hence, current hardware should use ioctl to schedule job and AMD > should think a bit more on commiting to a design and handwaving any hardware > short coming as something that can be work around in the software. The pinning > thing is broken by design, only way to work around it is through kernel cmd > queue scheduling that's a fact. > > Once hardware support proper preemption and allows to move around/evict buffer > use on behalf of userspace command queue then we can allow userspace scheduling > but until then my personnal opinion is that it should not be allowed and that > people will have to pay the ioctl price which i proved to be small, because > really if you 100K queue each with one job, i would not expect that all those > 100K job will complete in less time than it takes to execute an ioctl ie by > even if you do not have the ioctl delay what ever you schedule will have to > wait on previously submited jobs. But Jerome, the core problem still remains in effect, even with your suggestion. If an application, either via userspace queue or via ioctl, submits a long-running kernel, than the CPU in general can't stop the GPU from running it. And if that kernel does while(1); than that's it, game's over, and no matter how you submitted the work. So I don't really see the big advantage in your proposal. Only in CZ we can stop this wave (by CP H/W scheduling only). What are you saying is basically I won't allow people to use compute on Linux KV system because it _may_ get the system stuck. So even if I really wanted to, and I may agree with you theoretically on that, I can't fulfill your desire to make the "kernel being able to preempt at any time and be able to decrease or increase user queue priority so overall kernel is in charge of resources management and it can handle rogue client in proper fashion". Not in KV, and I guess not in CZ as well. Oded > >>> >>>>>>> >>>>>>>>> >>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common >>>>>>>>> stuff there. >>>>>>>>> >>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would >>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon. >>>>>>>>> This would avoid crazy communication btw radeon and kfd. >>>>>>>>> >>>>>>>>> The whole aperture business needs some serious explanation. Especialy as >>>>>>>>> you want to use userspace address there is nothing to prevent userspace >>>>>>>>> program from allocating things at address you reserve for lds, scratch, >>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual >>>>>>>>> address reserved for kernel (see kernel memory map). >>>>>>>>> >>>>>>>>> The whole business of locking performance counter for exclusive per process >>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user >>>>>>>>> space command ring. 
>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I >>>>>>>> find it analogous to a situation where a network maintainer nacking a driver >>>>>>>> for a network card, which is slower than a different network card. Doesn't >>>>>>>> seem reasonable this situation is would happen. He would still put both the >>>>>>>> drivers in the kernel because people want to use the H/W and its features. So, >>>>>>>> I don't think this is a valid reason to NACK the driver. >>>>> >>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning >>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING >>>>> the performance ioctl. >>>>> >>>>> Again this is another argument for round trip to the kernel. As inside kernel you >>>>> could properly do exclusive gpu counter access accross single user cmd buffer >>>>> execution. >>>>> >>>>>>>> >>>>>>>>> I only see issues with that. First and foremost i would >>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an >>>>>>>>> overhead that is measurable in any meaning full way against a simple >>>>>>>>> function call. I know the userspace command ring is a big marketing features >>>>>>>>> that please ignorant userspace programmer. But really this only brings issues >>>>>>>>> and for absolutely not upside afaict. >>>>>>>> Really ? You think that doing a context switch to kernel space, with all its >>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace >>>>>>>> which only puts a buffer on a ring and writes a doorbell ? >>>>> >>>>> I am saying the overhead is not that big and it probably will not matter in most >>>>> usecase. For instance i did wrote the most useless kernel module that add two >>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and >>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so >>>>> ioctl is 13 times slower. >>>>> >>>>> Now if there is enough data that shows that a significant percentage of jobs >>>>> submited to the GPU will take less that 0.35microsecond then yes userspace >>>>> scheduling does make sense. But so far all we have is handwaving with no data >>>>> to support any facts. >>>>> >>>>> >>>>> Now if we want to schedule from userspace than you will need to do something >>>>> about the pinning, something that gives control to kernel so that kernel can >>>>> unpin when it wants and move object when it wants no matter what userspace is >>>>> doing. >>>>> >>>>>>>>> > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
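Jerome's counter-proposal in the quoted text is a simple per-submission ioctl that validates state, handles ring/queue overcommit and writes the doorbell from the kernel. A self-contained sketch of that shape, with every identifier hypothetical and nothing taken from the amdkfd patch set, is below:

```c
/*
 * Sketch of the alternative argued for above: userspace still fills its
 * own ring, but each submission is an ioctl, so the kernel gets a per-
 * submission point to validate state, wait for a free HW slot when queues
 * are oversubscribed, and ring the doorbell itself.  Every identifier here
 * is hypothetical; none of this is code from the amdkfd patch set.
 */
#include <errno.h>
#include <stdint.h>

struct user_queue {
	uint32_t ring_size;             /* entries in the userspace ring   */
	uint32_t write_index;           /* last index the kernel submitted */
	volatile uint32_t *doorbell;    /* kernel-owned doorbell mapping   */
	int has_hw_slot;
};

struct submit_args {
	uint32_t queue_id;
	uint32_t new_write_index;       /* how far userspace advanced the ring */
};

static uint32_t fake_doorbell;
static struct user_queue example_queue = {
	.ring_size = 4096, .doorbell = &fake_doorbell, .has_hw_slot = 1,
};

static struct user_queue *find_queue(uint32_t id)      /* stub lookup */
{
	return id == 0 ? &example_queue : 0;
}

static int wait_for_hw_slot(struct user_queue *q)      /* stub: wait on the runlist */
{
	(void)q;
	return 0;
}

static int submit_ioctl(struct submit_args *args)
{
	struct user_queue *q = find_queue(args->queue_id);

	if (!q || args->new_write_index >= q->ring_size)
		return -EINVAL;

	if (!q->has_hw_slot) {                  /* oversubscribed case */
		int r = wait_for_hw_slot(q);
		if (r)
			return r;
	}

	/*
	 * A kernel entry per submission is also where exclusive resources,
	 * e.g. GPU performance counters, could be acquired and released
	 * around a single user command buffer, as suggested in the thread.
	 */
	q->write_index = args->new_write_index;
	*q->doorbell = q->write_index;          /* the only doorbell write */
	return 0;
}

int main(void)
{
	struct submit_args args = { .queue_id = 0, .new_write_index = 8 };
	return submit_ioctl(&args) ? 1 : 0;
}
```

The trade-off being argued in this part of the thread is exactly the one behind the benchmark above: one ioctl of overhead per submission, in exchange for a kernel-controlled preemption and resource-management point.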
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 19:23 ` Oded Gabbay @ 2014-07-21 19:28 ` Jerome Glisse 2014-07-21 21:56 ` Oded Gabbay 2014-07-22 7:23 ` Daniel Vetter 1 sibling, 1 reply; 49+ messages in thread From: Jerome Glisse @ 2014-07-21 19:28 UTC (permalink / raw) To: Oded Gabbay Cc: Andrew Lewycky, Michel Dänzer, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote: > On 21/07/14 21:59, Jerome Glisse wrote: > > On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote: > >> On 21/07/14 21:14, Jerome Glisse wrote: > >>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: > >>>> On 21/07/14 18:54, Jerome Glisse wrote: > >>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: > >>>>>> On 21/07/14 16:39, Christian Konig wrote: > >>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: > >>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote: > >>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: > >>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry. > >>>>>>>>>> > >>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series > >>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions > >>>>>>>>>> of the code. > >>>>>>>>>> > >>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them > >>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code. > >>>>>>>>>> There is no code going away or even modified between patches, only added. > >>>>>>>>>> > >>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under > >>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver > >>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a > >>>>>>>>>> generic hsa framework being implemented in the future and in that case, we > >>>>>>>>>> will adjust amdkfd to work within that framework. > >>>>>>>>>> > >>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to > >>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is > >>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon > >>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point > >>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it > >>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx > >>>>>>>>>> drivers. > >>>>>>>>>> > >>>>>>>>>> For people who like to review using git, the v2 patch set is located at: > >>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 > >>>>>>>>>> > >>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> > >>>>>>>>> > >>>>>>>>> So quick comments before i finish going over all patches. There is many > >>>>>>>>> things that need more documentation espacialy as of right now there is > >>>>>>>>> no userspace i can go look at. > >>>>>>>> So quick comments on some of your questions but first of all, thanks for the > >>>>>>>> time you dedicated to review the code. > >>>>>>>>> > >>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big > >>>>>>>>> no, that would need serious arguments for any hope of convincing me on > >>>>>>>>> that side. > >>>>>>>> We only do gpu memory pinning for kernel objects. 
There are no userspace > >>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case, > >>>>>>>> is it still a show stopper ? > >>>>>>>> > >>>>>>>> The kernel objects are: > >>>>>>>> - pipelines (4 per device) > >>>>>>>> - mqd per hiq (only 1 per device) > >>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for > >>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in > >>>>>>>> 256 alignment. So total *possible* memory is 128MB > >>>>>>>> - kernel queue (only 1 per device) > >>>>>>>> - fence address for kernel queue > >>>>>>>> - runlists for the CP (1 or 2 per device) > >>>>>>> > >>>>>>> The main questions here are if it's avoid able to pin down the memory and if the > >>>>>>> memory is pinned down at driver load, by request from userspace or by anything > >>>>>>> else. > >>>>>>> > >>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit > >>>>>>> questionable, everything else sounds reasonable. > >>>>>>> > >>>>>>> Christian. > >>>>>> > >>>>>> Most of the pin downs are done on device initialization. > >>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I > >>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local > >>>>>> memory, I think it is OK. > >>>>>> The runlists are also done on userspace queue creation/deletion, but we only > >>>>>> have 1 or 2 runlists per device, so it is not that bad. > >>>>> > >>>>> 2G local memory ? You can not assume anything on userside configuration some > >>>>> one might build an hsa computer with 512M and still expect a functioning > >>>>> desktop. > >>>> First of all, I'm only considering Kaveri computer, not "hsa" computer. > >>>> Second, I would imagine we can build some protection around it, like > >>>> checking total local memory and limit number of queues based on some > >>>> percentage of that total local memory. So, if someone will have only > >>>> 512M, he will be able to open less queues. > >>>> > >>>> > >>>>> > >>>>> I need to go look into what all this mqd is for, what it does and what it is > >>>>> about. But pinning is really bad and this is an issue with userspace command > >>>>> scheduling an issue that obviously AMD fails to take into account in design > >>>>> phase. > >>>> Maybe, but that is the H/W design non-the-less. We can't very well > >>>> change the H/W. > >>> > >>> You can not change the hardware but it is not an excuse to allow bad design to > >>> sneak in software to work around that. So i would rather penalize bad hardware > >>> design and have command submission in the kernel, until AMD fix its hardware to > >>> allow proper scheduling by the kernel and proper control by the kernel. > >> I'm sorry but I do *not* think this is a bad design. S/W scheduling in > >> the kernel can not, IMO, scale well to 100K queues and 10K processes. > > > > I am not advocating for having kernel decide down to the very last details. I am > > advocating for kernel being able to preempt at any time and be able to decrease > > or increase user queue priority so overall kernel is in charge of resources > > management and it can handle rogue client in proper fashion. > > > >> > >>> Because really where we want to go is having GPU closer to a CPU in term of scheduling > >>> capacity and once we get there we want the kernel to always be able to take over > >>> and do whatever it wants behind process back. > >> Who do you refer to when you say "we" ? 
AFAIK, the hw scheduling > >> direction is where AMD is now and where it is heading in the future. > >> That doesn't preclude the option to allow the kernel to take over and do > >> what he wants. I agree that in KV we have a problem where we can't do a > >> mid-wave preemption, so theoretically, a long running compute kernel can > >> make things messy, but in Carrizo, we will have this ability. Having > >> said that, it will only be through the CP H/W scheduling. So AMD is > >> _not_ going to abandon H/W scheduling. You can dislike it, but this is > >> the situation. > > > > We was for the overall Linux community but maybe i should not pretend to talk > > for anyone interested in having a common standard. > > > > My point is that current hardware do not have approriate hardware support for > > preemption hence, current hardware should use ioctl to schedule job and AMD > > should think a bit more on commiting to a design and handwaving any hardware > > short coming as something that can be work around in the software. The pinning > > thing is broken by design, only way to work around it is through kernel cmd > > queue scheduling that's a fact. > > > > > Once hardware support proper preemption and allows to move around/evict buffer > > use on behalf of userspace command queue then we can allow userspace scheduling > > but until then my personnal opinion is that it should not be allowed and that > > people will have to pay the ioctl price which i proved to be small, because > > really if you 100K queue each with one job, i would not expect that all those > > 100K job will complete in less time than it takes to execute an ioctl ie by > > even if you do not have the ioctl delay what ever you schedule will have to > > wait on previously submited jobs. > > But Jerome, the core problem still remains in effect, even with your > suggestion. If an application, either via userspace queue or via ioctl, > submits a long-running kernel, than the CPU in general can't stop the > GPU from running it. And if that kernel does while(1); than that's it, > game's over, and no matter how you submitted the work. So I don't really > see the big advantage in your proposal. Only in CZ we can stop this wave > (by CP H/W scheduling only). What are you saying is basically I won't > allow people to use compute on Linux KV system because it _may_ get the > system stuck. > > So even if I really wanted to, and I may agree with you theoretically on > that, I can't fulfill your desire to make the "kernel being able to > preempt at any time and be able to decrease or increase user queue > priority so overall kernel is in charge of resources management and it > can handle rogue client in proper fashion". Not in KV, and I guess not > in CZ as well. > > Oded I do understand that but using kernel ioctl provide the same kind of control as we have now ie we can bind/unbind buffer on per command buffer submission basis, just like with current graphic or compute stuff. Yes current graphic and compute stuff can launch a while and never return back and yes currently we have nothing against that but we should and solution would be simple just kill the gpu thread. > > > > >>> > >>>>>>> > >>>>>>>>> > >>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common > >>>>>>>>> stuff there. > >>>>>>>>> > >>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would > >>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon. 
> >>>>>>>>> This would avoid crazy communication btw radeon and kfd. > >>>>>>>>> > >>>>>>>>> The whole aperture business needs some serious explanation. Especialy as > >>>>>>>>> you want to use userspace address there is nothing to prevent userspace > >>>>>>>>> program from allocating things at address you reserve for lds, scratch, > >>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual > >>>>>>>>> address reserved for kernel (see kernel memory map). > >>>>>>>>> > >>>>>>>>> The whole business of locking performance counter for exclusive per process > >>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user > >>>>>>>>> space command ring. > >>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I > >>>>>>>> find it analogous to a situation where a network maintainer nacking a driver > >>>>>>>> for a network card, which is slower than a different network card. Doesn't > >>>>>>>> seem reasonable this situation is would happen. He would still put both the > >>>>>>>> drivers in the kernel because people want to use the H/W and its features. So, > >>>>>>>> I don't think this is a valid reason to NACK the driver. > >>>>> > >>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning > >>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING > >>>>> the performance ioctl. > >>>>> > >>>>> Again this is another argument for round trip to the kernel. As inside kernel you > >>>>> could properly do exclusive gpu counter access accross single user cmd buffer > >>>>> execution. > >>>>> > >>>>>>>> > >>>>>>>>> I only see issues with that. First and foremost i would > >>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an > >>>>>>>>> overhead that is measurable in any meaning full way against a simple > >>>>>>>>> function call. I know the userspace command ring is a big marketing features > >>>>>>>>> that please ignorant userspace programmer. But really this only brings issues > >>>>>>>>> and for absolutely not upside afaict. > >>>>>>>> Really ? You think that doing a context switch to kernel space, with all its > >>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace > >>>>>>>> which only puts a buffer on a ring and writes a doorbell ? > >>>>> > >>>>> I am saying the overhead is not that big and it probably will not matter in most > >>>>> usecase. For instance i did wrote the most useless kernel module that add two > >>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and > >>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so > >>>>> ioctl is 13 times slower. > >>>>> > >>>>> Now if there is enough data that shows that a significant percentage of jobs > >>>>> submited to the GPU will take less that 0.35microsecond then yes userspace > >>>>> scheduling does make sense. But so far all we have is handwaving with no data > >>>>> to support any facts. > >>>>> > >>>>> > >>>>> Now if we want to schedule from userspace than you will need to do something > >>>>> about the pinning, something that gives control to kernel so that kernel can > >>>>> unpin when it wants and move object when it wants no matter what userspace is > >>>>> doing. > >>>>> > >>>>>>>>> > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . 
> > Don't email: email@kvack.org > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 49+ messages in thread
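For reference, the 128MB upper bound quoted above for pinned MQDs follows directly from the numbers given in the thread. Below is a minimal standalone sketch of the arithmetic, assuming 512 concurrent processes (512K total queues divided by 1K queues per process); the constants come from Oded's description, not from the amdkfd sources.

/* Back-of-the-envelope check of the "128MB possible pinned MQD memory"
 * figure quoted above. Standalone illustration only; the constants are
 * taken from the thread, not from the driver code. */
#include <stdio.h>

#define MAX_PROCESSES        512     /* assumed: 512K total queues / 1K per process */
#define MAX_QUEUES_PER_PROC  1024    /* "up to 1K queues per process" on KV */
#define MQD_SIZE_BYTES       151     /* "Each mqd is 151 bytes" */
#define MQD_ALIGNMENT        256     /* "the allocation is done in 256 alignment" */

int main(void)
{
	/* each MQD occupies one aligned slot, regardless of its 151-byte payload */
	unsigned long slot = (MQD_SIZE_BYTES + MQD_ALIGNMENT - 1) &
			     ~(unsigned long)(MQD_ALIGNMENT - 1);
	unsigned long total_queues = (unsigned long)MAX_PROCESSES * MAX_QUEUES_PER_PROC;
	unsigned long pinned_bytes = total_queues * slot;

	printf("MQD slot size:     %lu bytes\n", slot);
	printf("total queues:      %lu\n", total_queues);
	printf("pinned MQD memory: %lu MB\n", pinned_bytes >> 20);
	return 0;
}

On a 2G Kaveri part that worst case is about 6% of local memory, which is the proportion the rest of the thread argues about.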
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 19:28 ` Jerome Glisse @ 2014-07-21 21:56 ` Oded Gabbay 2014-07-21 23:05 ` Jerome Glisse 0 siblings, 1 reply; 49+ messages in thread From: Oded Gabbay @ 2014-07-21 21:56 UTC (permalink / raw) To: Jerome Glisse Cc: Andrew Lewycky, Michel Dänzer, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton On 21/07/14 22:28, Jerome Glisse wrote: > On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote: >> On 21/07/14 21:59, Jerome Glisse wrote: >>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote: >>>> On 21/07/14 21:14, Jerome Glisse wrote: >>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: >>>>>> On 21/07/14 18:54, Jerome Glisse wrote: >>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: >>>>>>>> On 21/07/14 16:39, Christian König wrote: >>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: >>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote: >>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry. >>>>>>>>>>>> >>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series >>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions >>>>>>>>>>>> of the code. >>>>>>>>>>>> >>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them >>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code. >>>>>>>>>>>> There is no code going away or even modified between patches, only added. >>>>>>>>>>>> >>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under >>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver >>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a >>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we >>>>>>>>>>>> will adjust amdkfd to work within that framework. >>>>>>>>>>>> >>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to >>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is >>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon >>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point >>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it >>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx >>>>>>>>>>>> drivers. >>>>>>>>>>>> >>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at: >>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 >>>>>>>>>>>> >>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> >>>>>>>>>>> >>>>>>>>>>> So quick comments before i finish going over all patches. There is many >>>>>>>>>>> things that need more documentation espacialy as of right now there is >>>>>>>>>>> no userspace i can go look at. >>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the >>>>>>>>>> time you dedicated to review the code. >>>>>>>>>>> >>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big >>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on >>>>>>>>>>> that side. >>>>>>>>>> We only do gpu memory pinning for kernel objects. 
There are no userspace >>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case, >>>>>>>>>> is it still a show stopper ? >>>>>>>>>> >>>>>>>>>> The kernel objects are: >>>>>>>>>> - pipelines (4 per device) >>>>>>>>>> - mqd per hiq (only 1 per device) >>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for >>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in >>>>>>>>>> 256 alignment. So total *possible* memory is 128MB >>>>>>>>>> - kernel queue (only 1 per device) >>>>>>>>>> - fence address for kernel queue >>>>>>>>>> - runlists for the CP (1 or 2 per device) >>>>>>>>> >>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the >>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything >>>>>>>>> else. >>>>>>>>> >>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit >>>>>>>>> questionable, everything else sounds reasonable. >>>>>>>>> >>>>>>>>> Christian. >>>>>>>> >>>>>>>> Most of the pin downs are done on device initialization. >>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I >>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local >>>>>>>> memory, I think it is OK. >>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only >>>>>>>> have 1 or 2 runlists per device, so it is not that bad. >>>>>>> >>>>>>> 2G local memory ? You can not assume anything on userside configuration some >>>>>>> one might build an hsa computer with 512M and still expect a functioning >>>>>>> desktop. >>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer. >>>>>> Second, I would imagine we can build some protection around it, like >>>>>> checking total local memory and limit number of queues based on some >>>>>> percentage of that total local memory. So, if someone will have only >>>>>> 512M, he will be able to open less queues. >>>>>> >>>>>> >>>>>>> >>>>>>> I need to go look into what all this mqd is for, what it does and what it is >>>>>>> about. But pinning is really bad and this is an issue with userspace command >>>>>>> scheduling an issue that obviously AMD fails to take into account in design >>>>>>> phase. >>>>>> Maybe, but that is the H/W design non-the-less. We can't very well >>>>>> change the H/W. >>>>> >>>>> You can not change the hardware but it is not an excuse to allow bad design to >>>>> sneak in software to work around that. So i would rather penalize bad hardware >>>>> design and have command submission in the kernel, until AMD fix its hardware to >>>>> allow proper scheduling by the kernel and proper control by the kernel. >>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in >>>> the kernel can not, IMO, scale well to 100K queues and 10K processes. >>> >>> I am not advocating for having kernel decide down to the very last details. I am >>> advocating for kernel being able to preempt at any time and be able to decrease >>> or increase user queue priority so overall kernel is in charge of resources >>> management and it can handle rogue client in proper fashion. >>> >>>> >>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling >>>>> capacity and once we get there we want the kernel to always be able to take over >>>>> and do whatever it wants behind process back. >>>> Who do you refer to when you say "we" ? 
AFAIK, the hw scheduling >>>> direction is where AMD is now and where it is heading in the future. >>>> That doesn't preclude the option to allow the kernel to take over and do >>>> what he wants. I agree that in KV we have a problem where we can't do a >>>> mid-wave preemption, so theoretically, a long running compute kernel can >>>> make things messy, but in Carrizo, we will have this ability. Having >>>> said that, it will only be through the CP H/W scheduling. So AMD is >>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is >>>> the situation. >>> >>> We was for the overall Linux community but maybe i should not pretend to talk >>> for anyone interested in having a common standard. >>> >>> My point is that current hardware do not have approriate hardware support for >>> preemption hence, current hardware should use ioctl to schedule job and AMD >>> should think a bit more on commiting to a design and handwaving any hardware >>> short coming as something that can be work around in the software. The pinning >>> thing is broken by design, only way to work around it is through kernel cmd >>> queue scheduling that's a fact. >> >>> >>> Once hardware support proper preemption and allows to move around/evict buffer >>> use on behalf of userspace command queue then we can allow userspace scheduling >>> but until then my personnal opinion is that it should not be allowed and that >>> people will have to pay the ioctl price which i proved to be small, because >>> really if you 100K queue each with one job, i would not expect that all those >>> 100K job will complete in less time than it takes to execute an ioctl ie by >>> even if you do not have the ioctl delay what ever you schedule will have to >>> wait on previously submited jobs. >> >> But Jerome, the core problem still remains in effect, even with your >> suggestion. If an application, either via userspace queue or via ioctl, >> submits a long-running kernel, than the CPU in general can't stop the >> GPU from running it. And if that kernel does while(1); than that's it, >> game's over, and no matter how you submitted the work. So I don't really >> see the big advantage in your proposal. Only in CZ we can stop this wave >> (by CP H/W scheduling only). What are you saying is basically I won't >> allow people to use compute on Linux KV system because it _may_ get the >> system stuck. >> >> So even if I really wanted to, and I may agree with you theoretically on >> that, I can't fulfill your desire to make the "kernel being able to >> preempt at any time and be able to decrease or increase user queue >> priority so overall kernel is in charge of resources management and it >> can handle rogue client in proper fashion". Not in KV, and I guess not >> in CZ as well. >> >> Oded > > I do understand that but using kernel ioctl provide the same kind of control > as we have now ie we can bind/unbind buffer on per command buffer submission > basis, just like with current graphic or compute stuff. > > Yes current graphic and compute stuff can launch a while and never return back > and yes currently we have nothing against that but we should and solution would > be simple just kill the gpu thread. > OK, so in that case, the kernel can simple unmap all the queues by simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are userspace, they will not be mapped to the internal CP scheduler. Does that satisfy the kernel control level you want ? 
Oded >> >>> >>>>> >>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common >>>>>>>>>>> stuff there. >>>>>>>>>>> >>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would >>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon. >>>>>>>>>>> This would avoid crazy communication btw radeon and kfd. >>>>>>>>>>> >>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as >>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace >>>>>>>>>>> program from allocating things at address you reserve for lds, scratch, >>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual >>>>>>>>>>> address reserved for kernel (see kernel memory map). >>>>>>>>>>> >>>>>>>>>>> The whole business of locking performance counter for exclusive per process >>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user >>>>>>>>>>> space command ring. >>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I >>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver >>>>>>>>>> for a network card, which is slower than a different network card. Doesn't >>>>>>>>>> seem reasonable this situation is would happen. He would still put both the >>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So, >>>>>>>>>> I don't think this is a valid reason to NACK the driver. >>>>>>> >>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning >>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING >>>>>>> the performance ioctl. >>>>>>> >>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you >>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer >>>>>>> execution. >>>>>>> >>>>>>>>>> >>>>>>>>>>> I only see issues with that. First and foremost i would >>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an >>>>>>>>>>> overhead that is measurable in any meaning full way against a simple >>>>>>>>>>> function call. I know the userspace command ring is a big marketing features >>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues >>>>>>>>>>> and for absolutely not upside afaict. >>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its >>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace >>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ? >>>>>>> >>>>>>> I am saying the overhead is not that big and it probably will not matter in most >>>>>>> usecase. For instance i did wrote the most useless kernel module that add two >>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and >>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so >>>>>>> ioctl is 13 times slower. >>>>>>> >>>>>>> Now if there is enough data that shows that a significant percentage of jobs >>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace >>>>>>> scheduling does make sense. But so far all we have is handwaving with no data >>>>>>> to support any facts. 
>>>>>>> >>>>>>> >>>>>>> Now if we want to schedule from userspace than you will need to do something >>>>>>> about the pinning, something that gives control to kernel so that kernel can >>>>>>> unpin when it wants and move object when it wants no matter what userspace is >>>>>>> doing. >>>>>>> >>>>>>>>>>> >>> >>> -- >>> To unsubscribe, send a message with 'unsubscribe linux-mm' in >>> the body to majordomo@kvack.org. For more info on Linux MM, >>> see: http://www.linux-mm.org/ . >>> Don't email: email@kvack.org >>> >> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 49+ messages in thread
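Oded's proposal above (the kernel writes an UNMAP_QUEUES packet to its HIQ so userspace queues are no longer mapped to the CP scheduler) boils down to a control path like the sketch below. Every identifier and value in it is a hypothetical stand-in for illustration; the real amdkfd packet formats and helpers live in the patches and are not quoted in this thread.

/* Illustration of the control path described above: the kernel uses its
 * kernel-owned HIQ to ask the CP to stop scheduling all userspace queues.
 * All names and values below are made up for the sketch. */
#include <stdbool.h>
#include <stdio.h>

struct hiq_packet {
	unsigned int opcode;     /* an UNMAP_QUEUES-style opcode */
	unsigned int filter;     /* which queues: all, per-process, single */
};

enum { OPCODE_UNMAP_QUEUES = 0xA2 };   /* invented value, not the real opcode */
enum { FILTER_ALL_QUEUES   = 0 };

/* stand-in for submitting a packet on the kernel-owned HIQ ring */
static bool hiq_submit(const struct hiq_packet *pkt)
{
	printf("HIQ <- opcode 0x%x filter %u\n", pkt->opcode, pkt->filter);
	return true;
}

/* stand-in for waiting on the fence the CP writes once the unmap is done */
static bool wait_for_unmap_fence(void)
{
	return true;
}

/* What "the kernel takes the GPU back" could look like at a high level. */
static int evict_all_user_queues(void)
{
	struct hiq_packet pkt = {
		.opcode = OPCODE_UNMAP_QUEUES,
		.filter = FILTER_ALL_QUEUES,
	};

	if (!hiq_submit(&pkt))
		return -1;
	if (!wait_for_unmap_fence())
		return -1;   /* preemption did not complete: the KV concern */
	return 0;            /* queues are off the CP scheduler */
}

int main(void)
{
	return evict_all_user_queues();
}

The open question Jerome raises in the next message is what happens to wavefronts that are already running when the unmap lands, which this sketch glosses over inside wait_for_unmap_fence().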
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 21:56 ` Oded Gabbay @ 2014-07-21 23:05 ` Jerome Glisse 2014-07-21 23:29 ` Bridgman, John 2014-07-22 8:05 ` Oded Gabbay 0 siblings, 2 replies; 49+ messages in thread From: Jerome Glisse @ 2014-07-21 23:05 UTC (permalink / raw) To: Oded Gabbay Cc: Andrew Lewycky, linux-mm, Michel Dänzer, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote: > On 21/07/14 22:28, Jerome Glisse wrote: > > On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote: > >> On 21/07/14 21:59, Jerome Glisse wrote: > >>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote: > >>>> On 21/07/14 21:14, Jerome Glisse wrote: > >>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: > >>>>>> On 21/07/14 18:54, Jerome Glisse wrote: > >>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: > >>>>>>>> On 21/07/14 16:39, Christian Konig wrote: > >>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: > >>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote: > >>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: > >>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry. > >>>>>>>>>>>> > >>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series > >>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions > >>>>>>>>>>>> of the code. > >>>>>>>>>>>> > >>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them > >>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code. > >>>>>>>>>>>> There is no code going away or even modified between patches, only added. > >>>>>>>>>>>> > >>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under > >>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver > >>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a > >>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we > >>>>>>>>>>>> will adjust amdkfd to work within that framework. > >>>>>>>>>>>> > >>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to > >>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is > >>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon > >>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point > >>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it > >>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx > >>>>>>>>>>>> drivers. > >>>>>>>>>>>> > >>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at: > >>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 > >>>>>>>>>>>> > >>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> > >>>>>>>>>>> > >>>>>>>>>>> So quick comments before i finish going over all patches. There is many > >>>>>>>>>>> things that need more documentation espacialy as of right now there is > >>>>>>>>>>> no userspace i can go look at. > >>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the > >>>>>>>>>> time you dedicated to review the code. 
> >>>>>>>>>>> > >>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big > >>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on > >>>>>>>>>>> that side. > >>>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace > >>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case, > >>>>>>>>>> is it still a show stopper ? > >>>>>>>>>> > >>>>>>>>>> The kernel objects are: > >>>>>>>>>> - pipelines (4 per device) > >>>>>>>>>> - mqd per hiq (only 1 per device) > >>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for > >>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in > >>>>>>>>>> 256 alignment. So total *possible* memory is 128MB > >>>>>>>>>> - kernel queue (only 1 per device) > >>>>>>>>>> - fence address for kernel queue > >>>>>>>>>> - runlists for the CP (1 or 2 per device) > >>>>>>>>> > >>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the > >>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything > >>>>>>>>> else. > >>>>>>>>> > >>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit > >>>>>>>>> questionable, everything else sounds reasonable. > >>>>>>>>> > >>>>>>>>> Christian. > >>>>>>>> > >>>>>>>> Most of the pin downs are done on device initialization. > >>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I > >>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local > >>>>>>>> memory, I think it is OK. > >>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only > >>>>>>>> have 1 or 2 runlists per device, so it is not that bad. > >>>>>>> > >>>>>>> 2G local memory ? You can not assume anything on userside configuration some > >>>>>>> one might build an hsa computer with 512M and still expect a functioning > >>>>>>> desktop. > >>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer. > >>>>>> Second, I would imagine we can build some protection around it, like > >>>>>> checking total local memory and limit number of queues based on some > >>>>>> percentage of that total local memory. So, if someone will have only > >>>>>> 512M, he will be able to open less queues. > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> I need to go look into what all this mqd is for, what it does and what it is > >>>>>>> about. But pinning is really bad and this is an issue with userspace command > >>>>>>> scheduling an issue that obviously AMD fails to take into account in design > >>>>>>> phase. > >>>>>> Maybe, but that is the H/W design non-the-less. We can't very well > >>>>>> change the H/W. > >>>>> > >>>>> You can not change the hardware but it is not an excuse to allow bad design to > >>>>> sneak in software to work around that. So i would rather penalize bad hardware > >>>>> design and have command submission in the kernel, until AMD fix its hardware to > >>>>> allow proper scheduling by the kernel and proper control by the kernel. > >>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in > >>>> the kernel can not, IMO, scale well to 100K queues and 10K processes. > >>> > >>> I am not advocating for having kernel decide down to the very last details. 
I am > >>> advocating for kernel being able to preempt at any time and be able to decrease > >>> or increase user queue priority so overall kernel is in charge of resources > >>> management and it can handle rogue client in proper fashion. > >>> > >>>> > >>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling > >>>>> capacity and once we get there we want the kernel to always be able to take over > >>>>> and do whatever it wants behind process back. > >>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling > >>>> direction is where AMD is now and where it is heading in the future. > >>>> That doesn't preclude the option to allow the kernel to take over and do > >>>> what he wants. I agree that in KV we have a problem where we can't do a > >>>> mid-wave preemption, so theoretically, a long running compute kernel can > >>>> make things messy, but in Carrizo, we will have this ability. Having > >>>> said that, it will only be through the CP H/W scheduling. So AMD is > >>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is > >>>> the situation. > >>> > >>> We was for the overall Linux community but maybe i should not pretend to talk > >>> for anyone interested in having a common standard. > >>> > >>> My point is that current hardware do not have approriate hardware support for > >>> preemption hence, current hardware should use ioctl to schedule job and AMD > >>> should think a bit more on commiting to a design and handwaving any hardware > >>> short coming as something that can be work around in the software. The pinning > >>> thing is broken by design, only way to work around it is through kernel cmd > >>> queue scheduling that's a fact. > >> > >>> > >>> Once hardware support proper preemption and allows to move around/evict buffer > >>> use on behalf of userspace command queue then we can allow userspace scheduling > >>> but until then my personnal opinion is that it should not be allowed and that > >>> people will have to pay the ioctl price which i proved to be small, because > >>> really if you 100K queue each with one job, i would not expect that all those > >>> 100K job will complete in less time than it takes to execute an ioctl ie by > >>> even if you do not have the ioctl delay what ever you schedule will have to > >>> wait on previously submited jobs. > >> > >> But Jerome, the core problem still remains in effect, even with your > >> suggestion. If an application, either via userspace queue or via ioctl, > >> submits a long-running kernel, than the CPU in general can't stop the > >> GPU from running it. And if that kernel does while(1); than that's it, > >> game's over, and no matter how you submitted the work. So I don't really > >> see the big advantage in your proposal. Only in CZ we can stop this wave > >> (by CP H/W scheduling only). What are you saying is basically I won't > >> allow people to use compute on Linux KV system because it _may_ get the > >> system stuck. > >> > >> So even if I really wanted to, and I may agree with you theoretically on > >> that, I can't fulfill your desire to make the "kernel being able to > >> preempt at any time and be able to decrease or increase user queue > >> priority so overall kernel is in charge of resources management and it > >> can handle rogue client in proper fashion". Not in KV, and I guess not > >> in CZ as well. 
> >> > >> Oded > > > > I do understand that but using kernel ioctl provide the same kind of control > > as we have now ie we can bind/unbind buffer on per command buffer submission > > basis, just like with current graphic or compute stuff. > > > > Yes current graphic and compute stuff can launch a while and never return back > > and yes currently we have nothing against that but we should and solution would > > be simple just kill the gpu thread. > > > OK, so in that case, the kernel can simple unmap all the queues by > simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are > userspace, they will not be mapped to the internal CP scheduler. > Does that satisfy the kernel control level you want ? This raises questions, what does happen to currently running thread when you unmap queue ? Do they keep running until done ? If not than this means this will break user application and those is not an acceptable solution. Otherwise, infrastructre inside radeon would be needed to force this queue unmap on bo_pin failure so gfx pinning can be retry. Also how do you cope with doorbell exhaustion ? Do you just plan to error out ? In which case this is another DDOS vector but only affecting the gpu. And there is many other questions that need answer, like my kernel memory map question because as of right now i assume that kfd allow any thread on the gpu to access any kernel memory. Otherthings are how ill formated packet are handled by the hardware ? I do not see any mecanism to deal with SIGBUS or SIGFAULT. Also it is a worrisome prospect of seeing resource management completely ignore for future AMD hardware. Kernel exist for a reason ! Kernel main purpose is to provide resource management if AMD fails to understand that, this is not looking good on long term and i expect none of the HSA technology will get momentum and i would certainly advocate against any use of it inside product i work on. Cheers, Jerome > > Oded > >> > >>> > >>>>> > >>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common > >>>>>>>>>>> stuff there. > >>>>>>>>>>> > >>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would > >>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon. > >>>>>>>>>>> This would avoid crazy communication btw radeon and kfd. > >>>>>>>>>>> > >>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as > >>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace > >>>>>>>>>>> program from allocating things at address you reserve for lds, scratch, > >>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual > >>>>>>>>>>> address reserved for kernel (see kernel memory map). > >>>>>>>>>>> > >>>>>>>>>>> The whole business of locking performance counter for exclusive per process > >>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user > >>>>>>>>>>> space command ring. > >>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I > >>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver > >>>>>>>>>> for a network card, which is slower than a different network card. Doesn't > >>>>>>>>>> seem reasonable this situation is would happen. He would still put both the > >>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So, > >>>>>>>>>> I don't think this is a valid reason to NACK the driver. 
> >>>>>>> > >>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning > >>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING > >>>>>>> the performance ioctl. > >>>>>>> > >>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you > >>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer > >>>>>>> execution. > >>>>>>> > >>>>>>>>>> > >>>>>>>>>>> I only see issues with that. First and foremost i would > >>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an > >>>>>>>>>>> overhead that is measurable in any meaning full way against a simple > >>>>>>>>>>> function call. I know the userspace command ring is a big marketing features > >>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues > >>>>>>>>>>> and for absolutely not upside afaict. > >>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its > >>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace > >>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ? > >>>>>>> > >>>>>>> I am saying the overhead is not that big and it probably will not matter in most > >>>>>>> usecase. For instance i did wrote the most useless kernel module that add two > >>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and > >>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so > >>>>>>> ioctl is 13 times slower. > >>>>>>> > >>>>>>> Now if there is enough data that shows that a significant percentage of jobs > >>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace > >>>>>>> scheduling does make sense. But so far all we have is handwaving with no data > >>>>>>> to support any facts. > >>>>>>> > >>>>>>> > >>>>>>> Now if we want to schedule from userspace than you will need to do something > >>>>>>> about the pinning, something that gives control to kernel so that kernel can > >>>>>>> unpin when it wants and move object when it wants no matter what userspace is > >>>>>>> doing. > >>>>>>> > >>>>>>>>>>> > >>> > >>> -- > >>> To unsubscribe, send a message with 'unsubscribe linux-mm' in > >>> the body to majordomo@kvack.org. For more info on Linux MM, > >>> see: http://www.linux-mm.org/ . > >>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > >>> > >> > > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/dri-devel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
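Jerome's 0.35us vs 0.025us comparison comes from his adder module, which is not reproduced here. The following is a rough, standalone analogue of that measurement: it times the cheapest possible kernel round trip (a raw getpid syscall) against a plain function call. Absolute numbers will differ from his, but the ratio illustrates the overhead being argued about.

/* Rough analogue of the ioctl-vs-function measurement quoted above.
 * This is not Jerome's adder module; it simply compares a trivial
 * syscall round trip with a no-op userspace call. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define ITERS 1000000UL

static volatile unsigned long sink;

/* keep the call from being inlined away so the userspace path is measured */
static __attribute__((noinline)) unsigned long noop_add(unsigned long a, unsigned long b)
{
	return a + b;
}

static double elapsed_ns(struct timespec s, struct timespec e)
{
	return (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
}

int main(void)
{
	struct timespec s, e;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &s);
	for (i = 0; i < ITERS; i++)
		sink = noop_add(i, 1);          /* userspace-only path */
	clock_gettime(CLOCK_MONOTONIC, &e);
	printf("function call: %.1f ns/iter\n", elapsed_ns(s, e) / ITERS);

	clock_gettime(CLOCK_MONOTONIC, &s);
	for (i = 0; i < ITERS; i++)
		sink = syscall(SYS_getpid);     /* kernel round trip */
	clock_gettime(CLOCK_MONOTONIC, &e);
	printf("syscall:       %.1f ns/iter\n", elapsed_ns(s, e) / ITERS);

	return 0;
}

Whether that overhead matters then depends, as Jerome says, on how many of the submitted jobs actually complete in well under a microsecond.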
* RE: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 23:05 ` Jerome Glisse @ 2014-07-21 23:29 ` Bridgman, John 2014-07-21 23:36 ` Jerome Glisse 2014-07-22 8:05 ` Oded Gabbay 1 sibling, 1 reply; 49+ messages in thread From: Bridgman, John @ 2014-07-21 23:29 UTC (permalink / raw) To: Jerome Glisse, Gabbay, Oded Cc: Lewycky, Andrew, Pinchuk, Evgeny, Daenzer, Michel, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Skidanov, Alexey, Andrew Morton >-----Original Message----- >From: dri-devel [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf >Of Jerome Glisse >Sent: Monday, July 21, 2014 7:06 PM >To: Gabbay, Oded >Cc: Lewycky, Andrew; Pinchuk, Evgeny; Daenzer, Michel; linux- >kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; linux-mm; >Skidanov, Alexey; Andrew Morton >Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver > >On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote: >> On 21/07/14 22:28, Jerome Glisse wrote: >> > On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote: >> >> On 21/07/14 21:59, Jerome Glisse wrote: >> >>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote: >> >>>> On 21/07/14 21:14, Jerome Glisse wrote: >> >>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: >> >>>>>> On 21/07/14 18:54, Jerome Glisse wrote: >> >>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: >> >>>>>>>> On 21/07/14 16:39, Christian König wrote: >> >>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: >> >>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote: >> >>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >> >>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry. >> >>>>>>>>>>>> >> >>>>>>>>>>>> As a continuation to the existing discussion, here is a >> >>>>>>>>>>>> v2 patch series restructured with a cleaner history and >> >>>>>>>>>>>> no totally-different-early-versions of the code. >> >>>>>>>>>>>> >> >>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 >> >>>>>>>>>>>> patches, where 5 of them are modifications to radeon driver >and 18 of them include only amdkfd code. >> >>>>>>>>>>>> There is no code going away or even modified between >patches, only added. >> >>>>>>>>>>>> >> >>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and >> >>>>>>>>>>>> moved to reside under drm/radeon/amdkfd. This move was >> >>>>>>>>>>>> done to emphasize the fact that this driver is an >> >>>>>>>>>>>> AMD-only driver at this point. Having said that, we do >> >>>>>>>>>>>> foresee a generic hsa framework being implemented in the >future and in that case, we will adjust amdkfd to work within that >framework. >> >>>>>>>>>>>> >> >>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx >> >>>>>>>>>>>> drivers, we want to keep it as a seperate driver from >> >>>>>>>>>>>> radeon. Therefore, the amdkfd code is contained in its >> >>>>>>>>>>>> own folder. The amdkfd folder was put under the radeon >> >>>>>>>>>>>> folder because the only AMD gfx driver in the Linux >> >>>>>>>>>>>> kernel at this point is the radeon driver. Having said >> >>>>>>>>>>>> that, we will probably need to move it (maybe to be directly >under drm) after we integrate with additional AMD gfx drivers. 
>> >>>>>>>>>>>> >> >>>>>>>>>>>> For people who like to review using git, the v2 patch set is >located at: >> >>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-nex >> >>>>>>>>>>>> t-3.17-v2 >> >>>>>>>>>>>> >> >>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> >> >>>>>>>>>>> >> >>>>>>>>>>> So quick comments before i finish going over all patches. >> >>>>>>>>>>> There is many things that need more documentation >> >>>>>>>>>>> espacialy as of right now there is no userspace i can go look at. >> >>>>>>>>>> So quick comments on some of your questions but first of >> >>>>>>>>>> all, thanks for the time you dedicated to review the code. >> >>>>>>>>>>> >> >>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning >> >>>>>>>>>>> this is a big no, that would need serious arguments for >> >>>>>>>>>>> any hope of convincing me on that side. >> >>>>>>>>>> We only do gpu memory pinning for kernel objects. There are >> >>>>>>>>>> no userspace objects that are pinned on the gpu memory in >> >>>>>>>>>> our driver. If that is the case, is it still a show stopper ? >> >>>>>>>>>> >> >>>>>>>>>> The kernel objects are: >> >>>>>>>>>> - pipelines (4 per device) >> >>>>>>>>>> - mqd per hiq (only 1 per device) >> >>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K >> >>>>>>>>>> queues per process, for a total of 512K queues. Each mqd is >> >>>>>>>>>> 151 bytes, but the allocation is done in >> >>>>>>>>>> 256 alignment. So total *possible* memory is 128MB >> >>>>>>>>>> - kernel queue (only 1 per device) >> >>>>>>>>>> - fence address for kernel queue >> >>>>>>>>>> - runlists for the CP (1 or 2 per device) >> >>>>>>>>> >> >>>>>>>>> The main questions here are if it's avoid able to pin down >> >>>>>>>>> the memory and if the memory is pinned down at driver load, >> >>>>>>>>> by request from userspace or by anything else. >> >>>>>>>>> >> >>>>>>>>> As far as I can see only the "mqd per userspace queue" might >> >>>>>>>>> be a bit questionable, everything else sounds reasonable. >> >>>>>>>>> >> >>>>>>>>> Christian. >> >>>>>>>> >> >>>>>>>> Most of the pin downs are done on device initialization. >> >>>>>>>> The "mqd per userspace" is done per userspace queue creation. >> >>>>>>>> However, as I said, it has an upper limit of 128MB on KV, and >> >>>>>>>> considering the 2G local memory, I think it is OK. >> >>>>>>>> The runlists are also done on userspace queue >> >>>>>>>> creation/deletion, but we only have 1 or 2 runlists per device, so >it is not that bad. >> >>>>>>> >> >>>>>>> 2G local memory ? You can not assume anything on userside >> >>>>>>> configuration some one might build an hsa computer with 512M >> >>>>>>> and still expect a functioning desktop. >> >>>>>> First of all, I'm only considering Kaveri computer, not "hsa" >computer. >> >>>>>> Second, I would imagine we can build some protection around it, >> >>>>>> like checking total local memory and limit number of queues >> >>>>>> based on some percentage of that total local memory. So, if >> >>>>>> someone will have only 512M, he will be able to open less queues. >> >>>>>> >> >>>>>> >> >>>>>>> >> >>>>>>> I need to go look into what all this mqd is for, what it does >> >>>>>>> and what it is about. But pinning is really bad and this is an >> >>>>>>> issue with userspace command scheduling an issue that >> >>>>>>> obviously AMD fails to take into account in design phase. >> >>>>>> Maybe, but that is the H/W design non-the-less. We can't very >> >>>>>> well change the H/W. 
>> >>>>> >> >>>>> You can not change the hardware but it is not an excuse to allow >> >>>>> bad design to sneak in software to work around that. So i would >> >>>>> rather penalize bad hardware design and have command submission >> >>>>> in the kernel, until AMD fix its hardware to allow proper scheduling >by the kernel and proper control by the kernel. >> >>>> I'm sorry but I do *not* think this is a bad design. S/W >> >>>> scheduling in the kernel can not, IMO, scale well to 100K queues and >10K processes. >> >>> >> >>> I am not advocating for having kernel decide down to the very last >> >>> details. I am advocating for kernel being able to preempt at any >> >>> time and be able to decrease or increase user queue priority so >> >>> overall kernel is in charge of resources management and it can handle >rogue client in proper fashion. >> >>> >> >>>> >> >>>>> Because really where we want to go is having GPU closer to a CPU >> >>>>> in term of scheduling capacity and once we get there we want the >> >>>>> kernel to always be able to take over and do whatever it wants >behind process back. >> >>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling >> >>>> direction is where AMD is now and where it is heading in the future. >> >>>> That doesn't preclude the option to allow the kernel to take over >> >>>> and do what he wants. I agree that in KV we have a problem where >> >>>> we can't do a mid-wave preemption, so theoretically, a long >> >>>> running compute kernel can make things messy, but in Carrizo, we >> >>>> will have this ability. Having said that, it will only be through >> >>>> the CP H/W scheduling. So AMD is _not_ going to abandon H/W >> >>>> scheduling. You can dislike it, but this is the situation. >> >>> >> >>> We was for the overall Linux community but maybe i should not >> >>> pretend to talk for anyone interested in having a common standard. >> >>> >> >>> My point is that current hardware do not have approriate hardware >> >>> support for preemption hence, current hardware should use ioctl to >> >>> schedule job and AMD should think a bit more on commiting to a >> >>> design and handwaving any hardware short coming as something that >> >>> can be work around in the software. The pinning thing is broken by >> >>> design, only way to work around it is through kernel cmd queue >scheduling that's a fact. >> >> >> >>> >> >>> Once hardware support proper preemption and allows to move >> >>> around/evict buffer use on behalf of userspace command queue then >> >>> we can allow userspace scheduling but until then my personnal >> >>> opinion is that it should not be allowed and that people will have >> >>> to pay the ioctl price which i proved to be small, because really >> >>> if you 100K queue each with one job, i would not expect that all >> >>> those 100K job will complete in less time than it takes to execute >> >>> an ioctl ie by even if you do not have the ioctl delay what ever you >schedule will have to wait on previously submited jobs. >> >> >> >> But Jerome, the core problem still remains in effect, even with >> >> your suggestion. If an application, either via userspace queue or >> >> via ioctl, submits a long-running kernel, than the CPU in general >> >> can't stop the GPU from running it. And if that kernel does >> >> while(1); than that's it, game's over, and no matter how you >> >> submitted the work. So I don't really see the big advantage in your >> >> proposal. Only in CZ we can stop this wave (by CP H/W scheduling >> >> only). 
What are you saying is basically I won't allow people to use >> >> compute on Linux KV system because it _may_ get the system stuck. >> >> >> >> So even if I really wanted to, and I may agree with you >> >> theoretically on that, I can't fulfill your desire to make the >> >> "kernel being able to preempt at any time and be able to decrease >> >> or increase user queue priority so overall kernel is in charge of >> >> resources management and it can handle rogue client in proper >> >> fashion". Not in KV, and I guess not in CZ as well. >> >> >> >> Oded >> > >> > I do understand that but using kernel ioctl provide the same kind of >> > control as we have now ie we can bind/unbind buffer on per command >> > buffer submission basis, just like with current graphic or compute stuff. >> > >> > Yes current graphic and compute stuff can launch a while and never >> > return back and yes currently we have nothing against that but we >> > should and solution would be simple just kill the gpu thread. >> > >> OK, so in that case, the kernel can simple unmap all the queues by >> simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues >> are userspace, they will not be mapped to the internal CP scheduler. >> Does that satisfy the kernel control level you want ? > >This raises questions, what does happen to currently running thread when >you unmap queue ? Do they keep running until done ? If not than this means >this will break user application and those is not an acceptable solution. > >Otherwise, infrastructre inside radeon would be needed to force this queue >unmap on bo_pin failure so gfx pinning can be retry. > >Also how do you cope with doorbell exhaustion ? Do you just plan to error >out ? >In which case this is another DDOS vector but only affecting the gpu. > >And there is many other questions that need answer, like my kernel memory >map question because as of right now i assume that kfd allow any thread on >the gpu to access any kernel memory. > >Otherthings are how ill formated packet are handled by the hardware ? I do >not see any mecanism to deal with SIGBUS or SIGFAULT. > > >Also it is a worrisome prospect of seeing resource management completely >ignore for future AMD hardware. Kernel exist for a reason ! Kernel main >purpose is to provide resource management if AMD fails to understand that, >this is not looking good on long term and i expect none of the HSA >technology will get momentum and i would certainly advocate against any >use of it inside product i work on. Hi Jerome; I was following along until the above comment. It seems to be the exact opposite of what Oded has been saying, which is that future AMD hardware *does* have more capabilities for resource management and that we do have some capabilities today. Can you help me understand what the comment it was based on ? Thanks, JB > >Cheers, >Jérôme > >> >> Oded >> >> >> >>> >> >>>>> >> >>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory >> >>>>>>>>>>> and add common stuff there. >> >>>>>>>>>>> >> >>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT >> >>>>>>>>>>> then i would say this far better to avoid the whole kfd module >and add ioctl to radeon. >> >>>>>>>>>>> This would avoid crazy communication btw radeon and kfd. >> >>>>>>>>>>> >> >>>>>>>>>>> The whole aperture business needs some serious >> >>>>>>>>>>> explanation. 
Especialy as you want to use userspace >> >>>>>>>>>>> address there is nothing to prevent userspace program from >> >>>>>>>>>>> allocating things at address you reserve for lds, scratch, >> >>>>>>>>>>> ... only sane way would be to move those lds, scratch inside >the virtual address reserved for kernel (see kernel memory map). >> >>>>>>>>>>> >> >>>>>>>>>>> The whole business of locking performance counter for >> >>>>>>>>>>> exclusive per process access is a big NO. Which leads me >> >>>>>>>>>>> to the questionable usefullness of user space command ring. >> >>>>>>>>>> That's like saying: "Which leads me to the questionable >> >>>>>>>>>> usefulness of HSA". I find it analogous to a situation >> >>>>>>>>>> where a network maintainer nacking a driver for a network >> >>>>>>>>>> card, which is slower than a different network card. >> >>>>>>>>>> Doesn't seem reasonable this situation is would happen. He >> >>>>>>>>>> would still put both the drivers in the kernel because people >want to use the H/W and its features. So, I don't think this is a valid reason to >NACK the driver. >> >>>>>>> >> >>>>>>> Let me rephrase, drop the the performance counter ioctl and >> >>>>>>> modulo memory pinning i see no objection. In other word, i am >> >>>>>>> not NACKING whole patchset i am NACKING the performance ioctl. >> >>>>>>> >> >>>>>>> Again this is another argument for round trip to the kernel. >> >>>>>>> As inside kernel you could properly do exclusive gpu counter >> >>>>>>> access accross single user cmd buffer execution. >> >>>>>>> >> >>>>>>>>>> >> >>>>>>>>>>> I only see issues with that. First and foremost i would >> >>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has >> >>>>>>>>>>> a higher an overhead that is measurable in any meaning >> >>>>>>>>>>> full way against a simple function call. I know the >> >>>>>>>>>>> userspace command ring is a big marketing features that >> >>>>>>>>>>> please ignorant userspace programmer. But really this only >brings issues and for absolutely not upside afaict. >> >>>>>>>>>> Really ? You think that doing a context switch to kernel >> >>>>>>>>>> space, with all its overhead, is _not_ more expansive than >> >>>>>>>>>> just calling a function in userspace which only puts a buffer on a >ring and writes a doorbell ? >> >>>>>>> >> >>>>>>> I am saying the overhead is not that big and it probably will >> >>>>>>> not matter in most usecase. For instance i did wrote the most >> >>>>>>> useless kernel module that add two number through an ioctl >> >>>>>>> (http://people.freedesktop.org/~glisse/adder.tar) and it takes >> >>>>>>> ~0.35microseconds with ioctl while function is ~0.025microseconds >so ioctl is 13 times slower. >> >>>>>>> >> >>>>>>> Now if there is enough data that shows that a significant >> >>>>>>> percentage of jobs submited to the GPU will take less that >> >>>>>>> 0.35microsecond then yes userspace scheduling does make sense. >> >>>>>>> But so far all we have is handwaving with no data to support any >facts. >> >>>>>>> >> >>>>>>> >> >>>>>>> Now if we want to schedule from userspace than you will need >> >>>>>>> to do something about the pinning, something that gives >> >>>>>>> control to kernel so that kernel can unpin when it wants and >> >>>>>>> move object when it wants no matter what userspace is doing. >> >>>>>>> >> >>>>>>>>>>> >> >>> >> >>> -- >> >>> To unsubscribe, send a message with 'unsubscribe linux-mm' in the >> >>> body to majordomo@kvack.org. For more info on Linux MM, >> >>> see: http://www.linux-mm.org/ . 
>> >>> Don't email: email@kvack.org >> >>> >> >> >> >> _______________________________________________ >> dri-devel mailing list >> dri-devel@lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/dri-devel >_______________________________________________ >dri-devel mailing list >dri-devel@lists.freedesktop.org >http://lists.freedesktop.org/mailman/listinfo/dri-devel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 49+ messages in thread
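Earlier in the thread Oded suggests capping the number of userspace queues at a percentage of total local memory, so that a 512M system simply gets fewer queues than a 2G one. Below is a minimal sketch of that kind of guard, using the 256-byte per-queue MQD slot from the thread and an invented 1% budget; the policy number and the function name are assumptions, not anything taken from the patches.

/* Sketch of the guard Oded proposes: cap userspace queues per device by a
 * fraction of local memory. Policy values are illustrative only. */
#include <stdio.h>

#define MQD_SLOT_BYTES  256   /* per-queue pinned allocation (256-aligned MQD) */
#define PINNED_PCT      1     /* assumed policy: at most 1% of VRAM for pinned MQDs */

static unsigned long max_queues_for_vram(unsigned long vram_bytes)
{
	unsigned long budget = vram_bytes / 100 * PINNED_PCT;
	return budget / MQD_SLOT_BYTES;
}

int main(void)
{
	unsigned long sizes_mb[] = { 512, 2048 };

	for (int i = 0; i < 2; i++) {
		unsigned long vram = sizes_mb[i] << 20;
		printf("%4lu MB VRAM -> up to %lu queues\n",
		       sizes_mb[i], max_queues_for_vram(vram));
	}
	return 0;
}

A per-device cap of this sort could also be one answer to the doorbell-exhaustion question Jerome raises, by erroring out once the budget is spent rather than pinning further memory.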
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 23:29 ` Bridgman, John @ 2014-07-21 23:36 ` Jerome Glisse 0 siblings, 0 replies; 49+ messages in thread From: Jerome Glisse @ 2014-07-21 23:36 UTC (permalink / raw) To: Bridgman, John Cc: Gabbay, Oded, Lewycky, Andrew, Pinchuk, Evgeny, Daenzer, Michel, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Skidanov, Alexey, Andrew Morton On Mon, Jul 21, 2014 at 11:29:23PM +0000, Bridgman, John wrote: > >> >> So even if I really wanted to, and I may agree with you > >> >> theoretically on that, I can't fulfill your desire to make the > >> >> "kernel being able to preempt at any time and be able to decrease > >> >> or increase user queue priority so overall kernel is in charge of > >> >> resources management and it can handle rogue client in proper > >> >> fashion". Not in KV, and I guess not in CZ as well. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > >Also it is a worrisome prospect of seeing resource management completely > >ignore for future AMD hardware. Kernel exist for a reason ! Kernel main > >purpose is to provide resource management if AMD fails to understand that, > >this is not looking good on long term and i expect none of the HSA > >technology will get momentum and i would certainly advocate against any > >use of it inside product i work on. > > Hi Jerome; > > I was following along until the above comment. It seems to be the exact opposite of what Oded has been saying, which is that future AMD hardware *does* have more capabilities for resource management and that we do have some capabilities today. Can you help me understand what the comment it was based on ? Highlighted above. Cheers, Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 23:05 ` Jerome Glisse 2014-07-21 23:29 ` Bridgman, John @ 2014-07-22 8:05 ` Oded Gabbay 1 sibling, 0 replies; 49+ messages in thread From: Oded Gabbay @ 2014-07-22 8:05 UTC (permalink / raw) To: Jerome Glisse Cc: Andrew Lewycky, linux-mm, Michel Dänzer, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, Alexey Skidanov, Andrew Morton, Dave Airlie, Bridgman, John, Deucher, Alexander, Joerg Roedel, Ben Goz, Christian König, Daniel Vetter, Sellek, Tom On 22/07/14 02:05, Jerome Glisse wrote: > On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote: >> On 21/07/14 22:28, Jerome Glisse wrote: >>> On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote: >>>> On 21/07/14 21:59, Jerome Glisse wrote: >>>>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote: >>>>>> On 21/07/14 21:14, Jerome Glisse wrote: >>>>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: >>>>>>>> On 21/07/14 18:54, Jerome Glisse wrote: >>>>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: >>>>>>>>>> On 21/07/14 16:39, Christian König wrote: >>>>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: >>>>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote: >>>>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >>>>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry. >>>>>>>>>>>>>> >>>>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series >>>>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions >>>>>>>>>>>>>> of the code. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them >>>>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code. >>>>>>>>>>>>>> There is no code going away or even modified between patches, only added. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under >>>>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver >>>>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a >>>>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we >>>>>>>>>>>>>> will adjust amdkfd to work within that framework. >>>>>>>>>>>>>> >>>>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to >>>>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is >>>>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon >>>>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point >>>>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it >>>>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx >>>>>>>>>>>>>> drivers. >>>>>>>>>>>>>> >>>>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at: >>>>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> >>>>>>>>>>>>> >>>>>>>>>>>>> So quick comments before i finish going over all patches. There is many >>>>>>>>>>>>> things that need more documentation espacialy as of right now there is >>>>>>>>>>>>> no userspace i can go look at. >>>>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the >>>>>>>>>>>> time you dedicated to review the code. 
>>>>>>>>>>>>> >>>>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big >>>>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on >>>>>>>>>>>>> that side. >>>>>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace >>>>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case, >>>>>>>>>>>> is it still a show stopper ? >>>>>>>>>>>> >>>>>>>>>>>> The kernel objects are: >>>>>>>>>>>> - pipelines (4 per device) >>>>>>>>>>>> - mqd per hiq (only 1 per device) >>>>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for >>>>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in >>>>>>>>>>>> 256 alignment. So total *possible* memory is 128MB >>>>>>>>>>>> - kernel queue (only 1 per device) >>>>>>>>>>>> - fence address for kernel queue >>>>>>>>>>>> - runlists for the CP (1 or 2 per device) >>>>>>>>>>> >>>>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the >>>>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything >>>>>>>>>>> else. >>>>>>>>>>> >>>>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit >>>>>>>>>>> questionable, everything else sounds reasonable. >>>>>>>>>>> >>>>>>>>>>> Christian. >>>>>>>>>> >>>>>>>>>> Most of the pin downs are done on device initialization. >>>>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I >>>>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local >>>>>>>>>> memory, I think it is OK. >>>>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only >>>>>>>>>> have 1 or 2 runlists per device, so it is not that bad. >>>>>>>>> >>>>>>>>> 2G local memory ? You can not assume anything on userside configuration some >>>>>>>>> one might build an hsa computer with 512M and still expect a functioning >>>>>>>>> desktop. >>>>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer. >>>>>>>> Second, I would imagine we can build some protection around it, like >>>>>>>> checking total local memory and limit number of queues based on some >>>>>>>> percentage of that total local memory. So, if someone will have only >>>>>>>> 512M, he will be able to open less queues. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> I need to go look into what all this mqd is for, what it does and what it is >>>>>>>>> about. But pinning is really bad and this is an issue with userspace command >>>>>>>>> scheduling an issue that obviously AMD fails to take into account in design >>>>>>>>> phase. >>>>>>>> Maybe, but that is the H/W design non-the-less. We can't very well >>>>>>>> change the H/W. >>>>>>> >>>>>>> You can not change the hardware but it is not an excuse to allow bad design to >>>>>>> sneak in software to work around that. So i would rather penalize bad hardware >>>>>>> design and have command submission in the kernel, until AMD fix its hardware to >>>>>>> allow proper scheduling by the kernel and proper control by the kernel. >>>>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in >>>>>> the kernel can not, IMO, scale well to 100K queues and 10K processes. >>>>> >>>>> I am not advocating for having kernel decide down to the very last details. 
I am >>>>> advocating for kernel being able to preempt at any time and be able to decrease >>>>> or increase user queue priority so overall kernel is in charge of resources >>>>> management and it can handle rogue client in proper fashion. >>>>> >>>>>> >>>>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling >>>>>>> capacity and once we get there we want the kernel to always be able to take over >>>>>>> and do whatever it wants behind process back. >>>>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling >>>>>> direction is where AMD is now and where it is heading in the future. >>>>>> That doesn't preclude the option to allow the kernel to take over and do >>>>>> what he wants. I agree that in KV we have a problem where we can't do a >>>>>> mid-wave preemption, so theoretically, a long running compute kernel can >>>>>> make things messy, but in Carrizo, we will have this ability. Having >>>>>> said that, it will only be through the CP H/W scheduling. So AMD is >>>>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is >>>>>> the situation. >>>>> >>>>> We was for the overall Linux community but maybe i should not pretend to talk >>>>> for anyone interested in having a common standard. >>>>> >>>>> My point is that current hardware do not have approriate hardware support for >>>>> preemption hence, current hardware should use ioctl to schedule job and AMD >>>>> should think a bit more on commiting to a design and handwaving any hardware >>>>> short coming as something that can be work around in the software. The pinning >>>>> thing is broken by design, only way to work around it is through kernel cmd >>>>> queue scheduling that's a fact. >>>> >>>>> >>>>> Once hardware support proper preemption and allows to move around/evict buffer >>>>> use on behalf of userspace command queue then we can allow userspace scheduling >>>>> but until then my personnal opinion is that it should not be allowed and that >>>>> people will have to pay the ioctl price which i proved to be small, because >>>>> really if you 100K queue each with one job, i would not expect that all those >>>>> 100K job will complete in less time than it takes to execute an ioctl ie by >>>>> even if you do not have the ioctl delay what ever you schedule will have to >>>>> wait on previously submited jobs. >>>> >>>> But Jerome, the core problem still remains in effect, even with your >>>> suggestion. If an application, either via userspace queue or via ioctl, >>>> submits a long-running kernel, than the CPU in general can't stop the >>>> GPU from running it. And if that kernel does while(1); than that's it, >>>> game's over, and no matter how you submitted the work. So I don't really >>>> see the big advantage in your proposal. Only in CZ we can stop this wave >>>> (by CP H/W scheduling only). What are you saying is basically I won't >>>> allow people to use compute on Linux KV system because it _may_ get the >>>> system stuck. >>>> >>>> So even if I really wanted to, and I may agree with you theoretically on >>>> that, I can't fulfill your desire to make the "kernel being able to >>>> preempt at any time and be able to decrease or increase user queue >>>> priority so overall kernel is in charge of resources management and it >>>> can handle rogue client in proper fashion". Not in KV, and I guess not >>>> in CZ as well. 
>>>> >>>> Oded >>> >>> I do understand that but using kernel ioctl provide the same kind of control >>> as we have now ie we can bind/unbind buffer on per command buffer submission >>> basis, just like with current graphic or compute stuff. >>> >>> Yes current graphic and compute stuff can launch a while and never return back >>> and yes currently we have nothing against that but we should and solution would >>> be simple just kill the gpu thread. >>> >> OK, so in that case, the kernel can simple unmap all the queues by >> simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are >> userspace, they will not be mapped to the internal CP scheduler. >> Does that satisfy the kernel control level you want ? > > This raises questions, what does happen to currently running thread when you > unmap queue ? Do they keep running until done ? If not than this means this > will break user application and those is not an acceptable solution. They keep running until they are done. However, their submission of workloads to their queues has no effect, of course. Maybe I should explain how this works from the userspace POV. When the userspace app wants to submit a work to the queue, it writes to 2 different locations, the doorbell and a wptr shadow (which is in system memory, viewable by the GPU). Every write to the doorbell triggers the CP (and other stuff) in the GPU. The CP then checks if the doorbell's queue is mapped. If so, than it handles this write. If not, it simply ignores it. So, when we do unmap queues, the CP will ignore the doorbell writes by the userspace app, however the app will not know that (unless it specifically waits for results). When the queue is re-mapped, the CP will take the wptr shadow and use that to re-synchronize itself with the queue. > > Otherwise, infrastructre inside radeon would be needed to force this queue > unmap on bo_pin failure so gfx pinning can be retry. If we fail to bo_pin than we of course unmap the queue and return -ENOMEM. I would like to add another information here that is relevant. I checked the code again, and the "mqd per userspace queue" allocation is done only on RADEON_GEM_DOMAIN_GTT, which AFAIK is *system memory* that is also mapped (and pinned) on the GART address space. Does that still counts as GPU memory from your POV ? Are you really concern about GART address space being exhausted ? Moreover, in all of our code, I don't see us using RADEON_GEM_DOMAIN_VRAM. We have a function in radeon_kfd.c called pool_to_domain, and you can see there that we map KGD_POOL_FRAMEBUFFER to RADEON_GEM_DOMAIN_VRAM. However, if you search for KGD_POOL_FRAMEBUFFER, you will see that we don't use it anywhere. > > Also how do you cope with doorbell exhaustion ? Do you just plan to error out ? > In which case this is another DDOS vector but only affecting the gpu. Yes, we plan to error out, but I don't see how we can defend from that. For a single process, we limit the queues to be 1K (as we assign 1 doorbell page per process, and each doorbell is 4 bytes). However, if someone would fork a lot of processes, and each of them will register and open 1K queues, than that would be a problem. But how can we recognize such an event and differentiate it from normal operation ? Did you have something specific in mind ? > > And there is many other questions that need answer, like my kernel memory map > question because as of right now i assume that kfd allow any thread on the gpu > to access any kernel memory. Actually, no. 
We don't allow any access from gpu kernels to the Linux kernel memory. Let me explain more. In KV, the GPU is responsible of telling the IOMMU whether the access is privileged or not. If the access is privileged, than the IOMMU can allow the GPU to access kernel memory. However, we never configure the GPU in our driver to issue privileged accesses. In CZ, this is solved by configuring the IOMMU to not allow privileged accesses. > > Otherthings are how ill formated packet are handled by the hardware ? I do not > see any mecanism to deal with SIGBUS or SIGFAULT. You are correct when you say you don't see any mechanism. We are now developing it :) Basically, there will be two new modules. The first one is the event module, which is already written and working. The second module is the exception handling module, which is now being developed and will be build upon the event module. The exception handling module will take care of ill formated packets and other exceptions from the GPU (that are not handled by radeon). > > > Also it is a worrisome prospect of seeing resource management completely ignore > for future AMD hardware. Kernel exist for a reason ! Kernel main purpose is to > provide resource management if AMD fails to understand that, this is not looking > good on long term and i expect none of the HSA technology will get momentum and > i would certainly advocate against any use of it inside product i work on. > So I made a mistake in writing that: "Not in KV, and I guess not in CZ as well" and I apologize for misleading you. What I needed to write was: "In KV, as a first generation HSA APU, we have limited ability to allow the kernel to preempt at any time and control user queue priority. However, in CZ we have dramatically improved control and resource management capabilities, that will allow the kernel to preempt at any time and also control user queue priority." So, as you can see, AMD fully understands that the kernel main purpose is to provide resource management and I hope this will make you recommend AMD H/W now and in the future. Oded > Cheers, > Jérôme > >> >> Oded >>>> >>>>> >>>>>>> >>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common >>>>>>>>>>>>> stuff there. >>>>>>>>>>>>> >>>>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would >>>>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon. >>>>>>>>>>>>> This would avoid crazy communication btw radeon and kfd. >>>>>>>>>>>>> >>>>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as >>>>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace >>>>>>>>>>>>> program from allocating things at address you reserve for lds, scratch, >>>>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual >>>>>>>>>>>>> address reserved for kernel (see kernel memory map). >>>>>>>>>>>>> >>>>>>>>>>>>> The whole business of locking performance counter for exclusive per process >>>>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user >>>>>>>>>>>>> space command ring. >>>>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I >>>>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver >>>>>>>>>>>> for a network card, which is slower than a different network card. Doesn't >>>>>>>>>>>> seem reasonable this situation is would happen. 
He would still put both the >>>>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So, >>>>>>>>>>>> I don't think this is a valid reason to NACK the driver. >>>>>>>>> >>>>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning >>>>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING >>>>>>>>> the performance ioctl. >>>>>>>>> >>>>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you >>>>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer >>>>>>>>> execution. >>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> I only see issues with that. First and foremost i would >>>>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an >>>>>>>>>>>>> overhead that is measurable in any meaning full way against a simple >>>>>>>>>>>>> function call. I know the userspace command ring is a big marketing features >>>>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues >>>>>>>>>>>>> and for absolutely not upside afaict. >>>>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its >>>>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace >>>>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ? >>>>>>>>> >>>>>>>>> I am saying the overhead is not that big and it probably will not matter in most >>>>>>>>> usecase. For instance i did wrote the most useless kernel module that add two >>>>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and >>>>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so >>>>>>>>> ioctl is 13 times slower. >>>>>>>>> >>>>>>>>> Now if there is enough data that shows that a significant percentage of jobs >>>>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace >>>>>>>>> scheduling does make sense. But so far all we have is handwaving with no data >>>>>>>>> to support any facts. >>>>>>>>> >>>>>>>>> >>>>>>>>> Now if we want to schedule from userspace than you will need to do something >>>>>>>>> about the pinning, something that gives control to kernel so that kernel can >>>>>>>>> unpin when it wants and move object when it wants no matter what userspace is >>>>>>>>> doing. >>>>>>>>> >>>>>>>>>>>>> >>>>> >>>>> -- >>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in >>>>> the body to majordomo@kvack.org. For more info on Linux MM, >>>>> see: http://www.linux-mm.org/ . >>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> >>>>> >>>> >> >> _______________________________________________ >> dri-devel mailing list >> dri-devel@lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/dri-devel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
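To make the submission path Oded describes in the message above concrete, here is a rough user-space-side sketch of the "write the packet, publish the wptr shadow, ring the doorbell" sequence. Every structure and function name is made up for illustration; this is not the amdkfd or HSA runtime interface, and a real implementation would use the platform's proper memory-ordering primitives rather than the GCC builtin shown here.

/* Illustrative user-space submission path: the application copies the packet
 * into its ring, updates a wptr shadow in system memory (visible to the GPU),
 * then writes the doorbell.  If the queue is currently unmapped, the CP
 * ignores the doorbell write and resynchronizes from the shadow on remap. */
#include <stdint.h>

struct user_queue {
    uint32_t *ring;                      /* queue ring buffer in user memory      */
    uint32_t  ring_dwords;               /* ring size in dwords (power of two)    */
    volatile uint32_t *wptr_shadow;      /* write-pointer shadow, GPU-visible     */
    volatile uint32_t *doorbell;         /* mapped doorbell page entry            */
    uint32_t  wptr;                      /* local copy of the write pointer       */
};

static void queue_submit(struct user_queue *q, const uint32_t *pkt, uint32_t ndwords)
{
    uint32_t i;

    /* 1. Copy the packet into the ring at the current write pointer. */
    for (i = 0; i < ndwords; i++)
        q->ring[(q->wptr + i) & (q->ring_dwords - 1)] = pkt[i];
    q->wptr += ndwords;

    /* 2. Publish the new write pointer in the shadow; this is what the CP
     *    uses to resynchronize when the kernel maps the queue again. */
    *q->wptr_shadow = q->wptr;

    /* Full memory barrier so ring and shadow updates are visible before the
     * doorbell write (stand-in for whatever flush the real hardware needs). */
    __sync_synchronize();

    /* 3. Ring the doorbell.  The CP only acts on it if this doorbell's queue
     *    is mapped; otherwise the write is ignored and the pending work is
     *    picked up from the wptr shadow on the next remap. */
    *q->doorbell = q->wptr;
}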
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 19:23 ` Oded Gabbay 2014-07-21 19:28 ` Jerome Glisse @ 2014-07-22 7:23 ` Daniel Vetter 2014-07-22 8:10 ` Oded Gabbay 1 sibling, 1 reply; 49+ messages in thread From: Daniel Vetter @ 2014-07-22 7:23 UTC (permalink / raw) To: Oded Gabbay Cc: Jerome Glisse, Andrew Lewycky, Michel Dänzer, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote: > But Jerome, the core problem still remains in effect, even with your > suggestion. If an application, either via userspace queue or via ioctl, > submits a long-running kernel, than the CPU in general can't stop the > GPU from running it. And if that kernel does while(1); than that's it, > game's over, and no matter how you submitted the work. So I don't really > see the big advantage in your proposal. Only in CZ we can stop this wave > (by CP H/W scheduling only). What are you saying is basically I won't > allow people to use compute on Linux KV system because it _may_ get the > system stuck. > > So even if I really wanted to, and I may agree with you theoretically on > that, I can't fulfill your desire to make the "kernel being able to > preempt at any time and be able to decrease or increase user queue > priority so overall kernel is in charge of resources management and it > can handle rogue client in proper fashion". Not in KV, and I guess not > in CZ as well. At least on intel the execlist stuff which is used for preemption can be used by both the cpu and the firmware scheduler. So we can actually preempt when doing cpu scheduling. It sounds like current amd hw doesn't have any preemption at all. And without preemption I don't think we should ever consider to allow userspace to directly submit stuff to the hw and overload. Imo the kernel _must_ sit in between and reject clients that don't behave. Of course you can only ever react (worst case with a gpu reset, there's code floating around for that on intel-gfx), but at least you can do something. If userspace has a direct submit path to the hw then this gets really tricky, if not impossible. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-22 7:23 ` Daniel Vetter @ 2014-07-22 8:10 ` Oded Gabbay 0 siblings, 0 replies; 49+ messages in thread From: Oded Gabbay @ 2014-07-22 8:10 UTC (permalink / raw) To: Jerome Glisse, Andrew Lewycky, Michel Dänzer, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Alexey Skidanov, Andrew Morton, Bridgman, John, Dave Airlie, Christian König, Joerg Roedel, Daniel Vetter, Sellek, Tom, Deucher, Alexander On 22/07/14 10:23, Daniel Vetter wrote: > On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote: >> But Jerome, the core problem still remains in effect, even with your >> suggestion. If an application, either via userspace queue or via ioctl, >> submits a long-running kernel, than the CPU in general can't stop the >> GPU from running it. And if that kernel does while(1); than that's it, >> game's over, and no matter how you submitted the work. So I don't really >> see the big advantage in your proposal. Only in CZ we can stop this wave >> (by CP H/W scheduling only). What are you saying is basically I won't >> allow people to use compute on Linux KV system because it _may_ get the >> system stuck. >> >> So even if I really wanted to, and I may agree with you theoretically on >> that, I can't fulfill your desire to make the "kernel being able to >> preempt at any time and be able to decrease or increase user queue >> priority so overall kernel is in charge of resources management and it >> can handle rogue client in proper fashion". Not in KV, and I guess not >> in CZ as well. > > At least on intel the execlist stuff which is used for preemption can be > used by both the cpu and the firmware scheduler. So we can actually > preempt when doing cpu scheduling. > > It sounds like current amd hw doesn't have any preemption at all. And > without preemption I don't think we should ever consider to allow > userspace to directly submit stuff to the hw and overload. Imo the kernel > _must_ sit in between and reject clients that don't behave. Of course you > can only ever react (worst case with a gpu reset, there's code floating > around for that on intel-gfx), but at least you can do something. > > If userspace has a direct submit path to the hw then this gets really > tricky, if not impossible. > -Daniel > Hi Daniel, See the email I just sent to Jerome regarding preemption. Bottom line, in KV, we can preempt running queues, except from the case of a stuck gpu kernel. In CZ, this was solved. So, in this regard, I don't think there is any difference between userspace queues and ioctl. Oded -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 13:39 ` Christian König 2014-07-21 14:12 ` Oded Gabbay @ 2014-07-21 15:25 ` Daniel Vetter 2014-07-21 15:58 ` Jerome Glisse 1 sibling, 1 reply; 49+ messages in thread From: Daniel Vetter @ 2014-07-21 15:25 UTC (permalink / raw) To: Christian König Cc: Oded Gabbay, Jerome Glisse, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian Konig wrote: > Am 21.07.2014 14:36, schrieb Oded Gabbay: > >On 20/07/14 20:46, Jerome Glisse wrote: > >>On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: > >>>Forgot to cc mailing list on cover letter. Sorry. > >>> > >>>As a continuation to the existing discussion, here is a v2 patch series > >>>restructured with a cleaner history and no > >>>totally-different-early-versions > >>>of the code. > >>> > >>>Instead of 83 patches, there are now a total of 25 patches, where 5 of > >>>them > >>>are modifications to radeon driver and 18 of them include only amdkfd > >>>code. > >>>There is no code going away or even modified between patches, only > >>>added. > >>> > >>>The driver was renamed from radeon_kfd to amdkfd and moved to reside > >>>under > >>>drm/radeon/amdkfd. This move was done to emphasize the fact that this > >>>driver > >>>is an AMD-only driver at this point. Having said that, we do foresee a > >>>generic hsa framework being implemented in the future and in that > >>>case, we > >>>will adjust amdkfd to work within that framework. > >>> > >>>As the amdkfd driver should support multiple AMD gfx drivers, we want > >>>to > >>>keep it as a seperate driver from radeon. Therefore, the amdkfd code is > >>>contained in its own folder. The amdkfd folder was put under the radeon > >>>folder because the only AMD gfx driver in the Linux kernel at this > >>>point > >>>is the radeon driver. Having said that, we will probably need to move > >>>it > >>>(maybe to be directly under drm) after we integrate with additional > >>>AMD gfx > >>>drivers. > >>> > >>>For people who like to review using git, the v2 patch set is located > >>>at: > >>>http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 > >>> > >>>Written by Oded Gabbayh <oded.gabbay@amd.com> > >> > >>So quick comments before i finish going over all patches. There is many > >>things that need more documentation espacialy as of right now there is > >>no userspace i can go look at. > >So quick comments on some of your questions but first of all, thanks for > >the time you dedicated to review the code. > >> > >>There few show stopper, biggest one is gpu memory pinning this is a big > >>no, that would need serious arguments for any hope of convincing me on > >>that side. > >We only do gpu memory pinning for kernel objects. There are no userspace > >objects that are pinned on the gpu memory in our driver. If that is the > >case, is it still a show stopper ? > > > >The kernel objects are: > >- pipelines (4 per device) > >- mqd per hiq (only 1 per device) > >- mqd per userspace queue. On KV, we support up to 1K queues per process, > >for a total of 512K queues. Each mqd is 151 bytes, but the allocation is > >done in 256 alignment. 
So total *possible* memory is 128MB > >- kernel queue (only 1 per device) > >- fence address for kernel queue > >- runlists for the CP (1 or 2 per device) > > The main questions here are if it's avoid able to pin down the memory and if > the memory is pinned down at driver load, by request from userspace or by > anything else. > > As far as I can see only the "mqd per userspace queue" might be a bit > questionable, everything else sounds reasonable. Aside, i915 perspective again (i.e. how we solved this): When scheduling away from contexts we unpin them and put them into the lru. And in the shrinker we have a last-ditch callback to switch to a default context (since you can't ever have no context once you've started) which means we can evict any context object if it's getting in the way. We must do that since the contexts have to be in global gtt, which is shared for scanouts. So fragmenting that badly with lots of context objects and other stuff is a no-go, since that means we'll start to fail pageflips. I don't know whether ttm has a ready-made concept for such opportunistically pinned stuff. I guess you could wire up the "switch to dflt context" action to the evict/move function if ttm wants to get rid of the currently used hw context. Oh and: This is another reason for letting the kernel schedule contexts, since you can't do this defrag trick if the gpu does all the scheduling itself. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
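As an aside, the 128MB worst case quoted above falls straight out of the numbers given in the thread; the few lines below just make the arithmetic explicit. The 512-process figure is implied by "1K queues per process, 512K queues total" rather than stated directly, so treat it as an assumption.

/* Worked arithmetic for the worst-case MQD footprint quoted above:
 * 1K queues per process, 512K queues total, and 256-byte allocation
 * granularity even though an MQD itself is only 151 bytes. */
#include <stdio.h>

int main(void)
{
    const unsigned long queues_per_process = 1024;
    const unsigned long max_processes      = 512;   /* implied by "512K queues total" */
    const unsigned long mqd_alloc_size     = 256;   /* 151-byte MQD, 256-byte aligned */

    unsigned long total_queues = queues_per_process * max_processes;
    unsigned long bytes = total_queues * mqd_alloc_size;

    /* prints: 524288 queues -> 128 MiB pinned worst case */
    printf("%lu queues -> %lu MiB pinned worst case\n", total_queues, bytes >> 20);
    return 0;
}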
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 15:25 ` Daniel Vetter @ 2014-07-21 15:58 ` Jerome Glisse 2014-07-21 17:05 ` Daniel Vetter 0 siblings, 1 reply; 49+ messages in thread From: Jerome Glisse @ 2014-07-21 15:58 UTC (permalink / raw) To: Christian König, Oded Gabbay, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote: > On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian Konig wrote: > > Am 21.07.2014 14:36, schrieb Oded Gabbay: > > >On 20/07/14 20:46, Jerome Glisse wrote: > > >>On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: > > >>>Forgot to cc mailing list on cover letter. Sorry. > > >>> > > >>>As a continuation to the existing discussion, here is a v2 patch series > > >>>restructured with a cleaner history and no > > >>>totally-different-early-versions > > >>>of the code. > > >>> > > >>>Instead of 83 patches, there are now a total of 25 patches, where 5 of > > >>>them > > >>>are modifications to radeon driver and 18 of them include only amdkfd > > >>>code. > > >>>There is no code going away or even modified between patches, only > > >>>added. > > >>> > > >>>The driver was renamed from radeon_kfd to amdkfd and moved to reside > > >>>under > > >>>drm/radeon/amdkfd. This move was done to emphasize the fact that this > > >>>driver > > >>>is an AMD-only driver at this point. Having said that, we do foresee a > > >>>generic hsa framework being implemented in the future and in that > > >>>case, we > > >>>will adjust amdkfd to work within that framework. > > >>> > > >>>As the amdkfd driver should support multiple AMD gfx drivers, we want > > >>>to > > >>>keep it as a seperate driver from radeon. Therefore, the amdkfd code is > > >>>contained in its own folder. The amdkfd folder was put under the radeon > > >>>folder because the only AMD gfx driver in the Linux kernel at this > > >>>point > > >>>is the radeon driver. Having said that, we will probably need to move > > >>>it > > >>>(maybe to be directly under drm) after we integrate with additional > > >>>AMD gfx > > >>>drivers. > > >>> > > >>>For people who like to review using git, the v2 patch set is located > > >>>at: > > >>>http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 > > >>> > > >>>Written by Oded Gabbayh <oded.gabbay@amd.com> > > >> > > >>So quick comments before i finish going over all patches. There is many > > >>things that need more documentation espacialy as of right now there is > > >>no userspace i can go look at. > > >So quick comments on some of your questions but first of all, thanks for > > >the time you dedicated to review the code. > > >> > > >>There few show stopper, biggest one is gpu memory pinning this is a big > > >>no, that would need serious arguments for any hope of convincing me on > > >>that side. > > >We only do gpu memory pinning for kernel objects. There are no userspace > > >objects that are pinned on the gpu memory in our driver. If that is the > > >case, is it still a show stopper ? > > > > > >The kernel objects are: > > >- pipelines (4 per device) > > >- mqd per hiq (only 1 per device) > > >- mqd per userspace queue. On KV, we support up to 1K queues per process, > > >for a total of 512K queues. Each mqd is 151 bytes, but the allocation is > > >done in 256 alignment. 
So total *possible* memory is 128MB > > >- kernel queue (only 1 per device) > > >- fence address for kernel queue > > >- runlists for the CP (1 or 2 per device) > > > > The main questions here are if it's avoid able to pin down the memory and if > > the memory is pinned down at driver load, by request from userspace or by > > anything else. > > > > As far as I can see only the "mqd per userspace queue" might be a bit > > questionable, everything else sounds reasonable. > > Aside, i915 perspective again (i.e. how we solved this): When scheduling > away from contexts we unpin them and put them into the lru. And in the > shrinker we have a last-ditch callback to switch to a default context > (since you can't ever have no context once you've started) which means we > can evict any context object if it's getting in the way. So Intel hardware report through some interrupt or some channel when it is not using a context ? ie kernel side get notification when some user context is done executing ? The issue with radeon hardware AFAICT is that the hardware do not report any thing about the userspace context running ie you do not get notification when a context is not use. Well AFAICT. Maybe hardware do provide that. Like the VMID is a limited resources so you have to dynamicly bind them so maybe we can only allocate pinned buffer for each VMID and then when binding a PASID to a VMID it also copy back pinned buffer to pasid unpinned copy. Cheers, Jerome > > We must do that since the contexts have to be in global gtt, which is > shared for scanouts. So fragmenting that badly with lots of context > objects and other stuff is a no-go, since that means we'll start to fail > pageflips. > > I don't know whether ttm has a ready-made concept for such > opportunistically pinned stuff. I guess you could wire up the "switch to > dflt context" action to the evict/move function if ttm wants to get rid of > the currently used hw context. > > Oh and: This is another reason for letting the kernel schedule contexts, > since you can't do this defrag trick if the gpu does all the scheduling > itself. > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/dri-devel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
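The VMID idea Jerome floats at the end of the message above could look roughly like the sketch below: only the handful of VMID slots keep pinned state, and a process's per-PASID copy stays evictable until it is actually bound. The slot count, the context size and every name here are assumptions made for illustration; this is not how radeon actually manages VMIDs.

/* Sketch of the suggestion above: pinned state exists only per VMID (a small,
 * fixed hardware resource); binding a PASID copies its evictable state into
 * the VMID slot, unbinding copies the hardware-updated state back out. */
#include <stdint.h>
#include <string.h>

#define NUM_VMIDS 16            /* assumption: small fixed pool of VMIDs        */
#define CTX_SIZE  256           /* assumption: per-context save area size       */

struct pasid_ctx {
    uint32_t pasid;
    uint8_t  state[CTX_SIZE];   /* unpinned, evictable copy                     */
};

struct vmid_slot {
    struct pasid_ctx *owner;    /* PASID currently bound, or NULL               */
    uint8_t pinned[CTX_SIZE];   /* the only copy that must stay pinned          */
};

static struct vmid_slot vmids[NUM_VMIDS];

/* Bind a PASID to a free VMID, loading its saved state into the pinned slot. */
static int vmid_bind(struct pasid_ctx *ctx)
{
    int i;

    for (i = 0; i < NUM_VMIDS; i++) {
        if (!vmids[i].owner) {
            memcpy(vmids[i].pinned, ctx->state, CTX_SIZE);
            vmids[i].owner = ctx;
            return i;           /* VMID to program into the hardware            */
        }
    }
    return -1;                  /* all VMIDs busy: caller must evict one first  */
}

/* Unbind: save the hardware-updated state back into the evictable copy. */
static void vmid_unbind(int vmid)
{
    struct vmid_slot *slot = &vmids[vmid];

    memcpy(slot->owner->state, slot->pinned, CTX_SIZE);
    slot->owner = NULL;
}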
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 15:58 ` Jerome Glisse @ 2014-07-21 17:05 ` Daniel Vetter 2014-07-21 17:28 ` Oded Gabbay 0 siblings, 1 reply; 49+ messages in thread From: Daniel Vetter @ 2014-07-21 17:05 UTC (permalink / raw) To: Jerome Glisse Cc: Christian König, Oded Gabbay, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote: > On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote: > > On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian Konig wrote: > > > Am 21.07.2014 14:36, schrieb Oded Gabbay: > > > >On 20/07/14 20:46, Jerome Glisse wrote: > > > >>On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: > > > >>>Forgot to cc mailing list on cover letter. Sorry. > > > >>> > > > >>>As a continuation to the existing discussion, here is a v2 patch series > > > >>>restructured with a cleaner history and no > > > >>>totally-different-early-versions > > > >>>of the code. > > > >>> > > > >>>Instead of 83 patches, there are now a total of 25 patches, where 5 of > > > >>>them > > > >>>are modifications to radeon driver and 18 of them include only amdkfd > > > >>>code. > > > >>>There is no code going away or even modified between patches, only > > > >>>added. > > > >>> > > > >>>The driver was renamed from radeon_kfd to amdkfd and moved to reside > > > >>>under > > > >>>drm/radeon/amdkfd. This move was done to emphasize the fact that this > > > >>>driver > > > >>>is an AMD-only driver at this point. Having said that, we do foresee a > > > >>>generic hsa framework being implemented in the future and in that > > > >>>case, we > > > >>>will adjust amdkfd to work within that framework. > > > >>> > > > >>>As the amdkfd driver should support multiple AMD gfx drivers, we want > > > >>>to > > > >>>keep it as a seperate driver from radeon. Therefore, the amdkfd code is > > > >>>contained in its own folder. The amdkfd folder was put under the radeon > > > >>>folder because the only AMD gfx driver in the Linux kernel at this > > > >>>point > > > >>>is the radeon driver. Having said that, we will probably need to move > > > >>>it > > > >>>(maybe to be directly under drm) after we integrate with additional > > > >>>AMD gfx > > > >>>drivers. > > > >>> > > > >>>For people who like to review using git, the v2 patch set is located > > > >>>at: > > > >>>http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 > > > >>> > > > >>>Written by Oded Gabbayh <oded.gabbay@amd.com> > > > >> > > > >>So quick comments before i finish going over all patches. There is many > > > >>things that need more documentation espacialy as of right now there is > > > >>no userspace i can go look at. > > > >So quick comments on some of your questions but first of all, thanks for > > > >the time you dedicated to review the code. > > > >> > > > >>There few show stopper, biggest one is gpu memory pinning this is a big > > > >>no, that would need serious arguments for any hope of convincing me on > > > >>that side. > > > >We only do gpu memory pinning for kernel objects. There are no userspace > > > >objects that are pinned on the gpu memory in our driver. If that is the > > > >case, is it still a show stopper ? > > > > > > > >The kernel objects are: > > > >- pipelines (4 per device) > > > >- mqd per hiq (only 1 per device) > > > >- mqd per userspace queue. 
On KV, we support up to 1K queues per process, > > > >for a total of 512K queues. Each mqd is 151 bytes, but the allocation is > > > >done in 256 alignment. So total *possible* memory is 128MB > > > >- kernel queue (only 1 per device) > > > >- fence address for kernel queue > > > >- runlists for the CP (1 or 2 per device) > > > > > > The main questions here are if it's avoid able to pin down the memory and if > > > the memory is pinned down at driver load, by request from userspace or by > > > anything else. > > > > > > As far as I can see only the "mqd per userspace queue" might be a bit > > > questionable, everything else sounds reasonable. > > > > Aside, i915 perspective again (i.e. how we solved this): When scheduling > > away from contexts we unpin them and put them into the lru. And in the > > shrinker we have a last-ditch callback to switch to a default context > > (since you can't ever have no context once you've started) which means we > > can evict any context object if it's getting in the way. > > So Intel hardware report through some interrupt or some channel when it is > not using a context ? ie kernel side get notification when some user context > is done executing ? Yes, as long as we do the scheduling with the cpu we get interrupts for context switches. The mechanic is already published in the execlist patches currently floating around. We get a special context switch interrupt. But we have this unpin logic already on the current code where we switch contexts through in-line cs commands from the kernel. There we obviously use the normal batch completion events. > The issue with radeon hardware AFAICT is that the hardware do not report any > thing about the userspace context running ie you do not get notification when > a context is not use. Well AFAICT. Maybe hardware do provide that. I'm not sure whether we can do the same trick with the hw scheduler. But then unpinning hw contexts will drain the pipeline anyway, so I guess we can just stop feeding the hw scheduler until it runs dry. And then unpin and evict. > Like the VMID is a limited resources so you have to dynamicly bind them so > maybe we can only allocate pinned buffer for each VMID and then when binding > a PASID to a VMID it also copy back pinned buffer to pasid unpinned copy. Yeah, pasid assignment will be fun. Not sure whether Jesse's patches will do this already. We _do_ already have fun with ctx id assigments though since we move them around (and the hw id is the ggtt address afaik). So we need to remap them already. Not sure on the details for pasid mapping, iirc it's a separate field somewhere in the context struct. Jesse knows the details. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 17:05 ` Daniel Vetter @ 2014-07-21 17:28 ` Oded Gabbay 2014-07-21 18:22 ` Daniel Vetter 0 siblings, 1 reply; 49+ messages in thread From: Oded Gabbay @ 2014-07-21 17:28 UTC (permalink / raw) To: Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On 21/07/14 20:05, Daniel Vetter wrote: > On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote: >> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote: >>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König wrote: >>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: >>>>> On 20/07/14 20:46, Jerome Glisse wrote: >>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >>>>>>> Forgot to cc mailing list on cover letter. Sorry. >>>>>>> >>>>>>> As a continuation to the existing discussion, here is a v2 patch series >>>>>>> restructured with a cleaner history and no >>>>>>> totally-different-early-versions >>>>>>> of the code. >>>>>>> >>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of >>>>>>> them >>>>>>> are modifications to radeon driver and 18 of them include only amdkfd >>>>>>> code. >>>>>>> There is no code going away or even modified between patches, only >>>>>>> added. >>>>>>> >>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside >>>>>>> under >>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this >>>>>>> driver >>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a >>>>>>> generic hsa framework being implemented in the future and in that >>>>>>> case, we >>>>>>> will adjust amdkfd to work within that framework. >>>>>>> >>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want >>>>>>> to >>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is >>>>>>> contained in its own folder. The amdkfd folder was put under the radeon >>>>>>> folder because the only AMD gfx driver in the Linux kernel at this >>>>>>> point >>>>>>> is the radeon driver. Having said that, we will probably need to move >>>>>>> it >>>>>>> (maybe to be directly under drm) after we integrate with additional >>>>>>> AMD gfx >>>>>>> drivers. >>>>>>> >>>>>>> For people who like to review using git, the v2 patch set is located >>>>>>> at: >>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2 >>>>>>> >>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com> >>>>>> >>>>>> So quick comments before i finish going over all patches. There is many >>>>>> things that need more documentation espacialy as of right now there is >>>>>> no userspace i can go look at. >>>>> So quick comments on some of your questions but first of all, thanks for >>>>> the time you dedicated to review the code. >>>>>> >>>>>> There few show stopper, biggest one is gpu memory pinning this is a big >>>>>> no, that would need serious arguments for any hope of convincing me on >>>>>> that side. >>>>> We only do gpu memory pinning for kernel objects. There are no userspace >>>>> objects that are pinned on the gpu memory in our driver. If that is the >>>>> case, is it still a show stopper ? >>>>> >>>>> The kernel objects are: >>>>> - pipelines (4 per device) >>>>> - mqd per hiq (only 1 per device) >>>>> - mqd per userspace queue. 
On KV, we support up to 1K queues per process, >>>>> for a total of 512K queues. Each mqd is 151 bytes, but the allocation is >>>>> done in 256 alignment. So total *possible* memory is 128MB >>>>> - kernel queue (only 1 per device) >>>>> - fence address for kernel queue >>>>> - runlists for the CP (1 or 2 per device) >>>> >>>> The main questions here are if it's avoid able to pin down the memory and if >>>> the memory is pinned down at driver load, by request from userspace or by >>>> anything else. >>>> >>>> As far as I can see only the "mqd per userspace queue" might be a bit >>>> questionable, everything else sounds reasonable. >>> >>> Aside, i915 perspective again (i.e. how we solved this): When scheduling >>> away from contexts we unpin them and put them into the lru. And in the >>> shrinker we have a last-ditch callback to switch to a default context >>> (since you can't ever have no context once you've started) which means we >>> can evict any context object if it's getting in the way. >> >> So Intel hardware report through some interrupt or some channel when it is >> not using a context ? ie kernel side get notification when some user context >> is done executing ? > > Yes, as long as we do the scheduling with the cpu we get interrupts for > context switches. The mechanic is already published in the execlist > patches currently floating around. We get a special context switch > interrupt. > > But we have this unpin logic already on the current code where we switch > contexts through in-line cs commands from the kernel. There we obviously > use the normal batch completion events. > >> The issue with radeon hardware AFAICT is that the hardware do not report any >> thing about the userspace context running ie you do not get notification when >> a context is not use. Well AFAICT. Maybe hardware do provide that. > > I'm not sure whether we can do the same trick with the hw scheduler. But > then unpinning hw contexts will drain the pipeline anyway, so I guess we > can just stop feeding the hw scheduler until it runs dry. And then unpin > and evict. So, I'm afraid but we can't do this for AMD Kaveri because: a. The hw scheduler doesn't inform us which queues it is going to execute next. We feed it a runlist of queues, which can be very large (we have a test that runs 1000 queues on the same runlist, but we can put a lot more). All the MQDs of those queues must be pinned in memory as long as the runlist is in effect. The runlist is in effect until either a queue is deleted or a queue is added (or something more extreme happens, like the process terminates). b. The hw scheduler takes care of VMID to PASID mapping. We don't program the ATC registers manually, the internal CP does that dynamically, so we basically have over-subscription of processes as well. Therefore, we can't ping MQDs based on VMID binding. I don't see AMD moving back to SW scheduling, as it doesn't scale well with the number of processes and queues and our next gen APU will have a lot more queues than what we have on KV Oded > >> Like the VMID is a limited resources so you have to dynamicly bind them so >> maybe we can only allocate pinned buffer for each VMID and then when binding >> a PASID to a VMID it also copy back pinned buffer to pasid unpinned copy. > > Yeah, pasid assignment will be fun. Not sure whether Jesse's patches will > do this already. We _do_ already have fun with ctx id assigments though > since we move them around (and the hw id is the ggtt address afaik). So we > need to remap them already. 
Not sure on the details for pasid mapping, > iirc it's a separate field somewhere in the context struct. Jesse knows > the details. > -Daniel > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
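For readers not familiar with the runlist model Oded describes above, the sketch below shows why every MQD referenced by a runlist has to stay pinned: the runlist entries carry GPU addresses that the CP hardware scheduler dereferences on its own, so nothing they point at may move until the whole list is rebuilt on the next queue create or destroy. All names, the entry layout and the helper functions are invented for illustration; this is not the amdkfd packet format.

/* Illustrative runlist construction: every MQD gets pinned to a stable GPU
 * address for as long as the submitted runlist remains in effect. */
#include <stddef.h>
#include <stdint.h>

struct mqd_bo;                          /* opaque buffer object holding one MQD */

struct queue {
    struct mqd_bo *mqd;
    struct queue  *next;
};

struct runlist_entry {                  /* invented layout, not the real packet */
    uint64_t mqd_gpu_addr;              /* the CP dereferences this directly    */
    uint32_t queue_id;
    uint32_t flags;
};

/* Assumed helpers: pinning returns a GPU address that stays valid until the
 * matching unpin, which must not happen before this runlist is retired. */
extern uint64_t mqd_pin(struct mqd_bo *bo);
extern int hw_sched_submit_runlist(const struct runlist_entry *rl, size_t n);

/* Rebuild and submit the runlist; in this model it is redone from scratch
 * whenever any queue is added or removed, and only then may the MQDs of
 * queues that dropped out of the list be unpinned. */
static int runlist_rebuild(struct queue *queues, struct runlist_entry *rl,
                           size_t max_entries)
{
    struct queue *q;
    size_t n = 0;

    for (q = queues; q && n < max_entries; q = q->next, n++) {
        rl[n].mqd_gpu_addr = mqd_pin(q->mqd);   /* pinned for the list's lifetime */
        rl[n].queue_id = (uint32_t)n;
        rl[n].flags = 0;
    }
    return hw_sched_submit_runlist(rl, n);
}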
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 17:28 ` Oded Gabbay @ 2014-07-21 18:22 ` Daniel Vetter 2014-07-21 18:41 ` Oded Gabbay 0 siblings, 1 reply; 49+ messages in thread From: Daniel Vetter @ 2014-07-21 18:22 UTC (permalink / raw) To: Oded Gabbay Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote: >> I'm not sure whether we can do the same trick with the hw scheduler. But >> then unpinning hw contexts will drain the pipeline anyway, so I guess we >> can just stop feeding the hw scheduler until it runs dry. And then unpin >> and evict. > So, I'm afraid but we can't do this for AMD Kaveri because: Well as long as you can drain the hw scheduler queue (and you can do that, worst case you have to unmap all the doorbells and other stuff to intercept further submission from userspace) you can evict stuff. And if we don't want compute to be a denial of service on the display side of the driver we need this ability. Now if you go through an ioctl instead of the doorbell (I agree with Jerome here, the doorbell should be supported by benchmarks on linux) this gets a bit easier, but it's not a requirement really. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 18:22 ` Daniel Vetter @ 2014-07-21 18:41 ` Oded Gabbay 2014-07-21 19:03 ` Jerome Glisse 0 siblings, 1 reply; 49+ messages in thread From: Oded Gabbay @ 2014-07-21 18:41 UTC (permalink / raw) To: Daniel Vetter Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On 21/07/14 21:22, Daniel Vetter wrote: > On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote: >>> I'm not sure whether we can do the same trick with the hw scheduler. But >>> then unpinning hw contexts will drain the pipeline anyway, so I guess we >>> can just stop feeding the hw scheduler until it runs dry. And then unpin >>> and evict. >> So, I'm afraid but we can't do this for AMD Kaveri because: > > Well as long as you can drain the hw scheduler queue (and you can do > that, worst case you have to unmap all the doorbells and other stuff > to intercept further submission from userspace) you can evict stuff. I can't drain the hw scheduler queue, as I can't do mid-wave preemption. Moreover, if I use the dequeue request register to preempt a queue during a dispatch it may be that some waves (wave groups actually) of the dispatch have not yet been created, and when I reactivate the mqd, they should be created but are not. However, this works fine if you use the HIQ. the CP ucode correctly saves and restores the state of an outstanding dispatch. I don't think we have access to the state from software at all, so it's not a bug, it is "as designed". > And if we don't want compute to be a denial of service on the display > side of the driver we need this ability. Now if you go through an > ioctl instead of the doorbell (I agree with Jerome here, the doorbell > should be supported by benchmarks on linux) this gets a bit easier, > but it's not a requirement really. > -Daniel > On KV, we have the theoretical option of DOS on the display side as we can't do a mid-wave preemption. On CZ, we won't have this problem. Oded -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 18:41 ` Oded Gabbay @ 2014-07-21 19:03 ` Jerome Glisse 2014-07-22 7:28 ` Daniel Vetter 0 siblings, 1 reply; 49+ messages in thread From: Jerome Glisse @ 2014-07-21 19:03 UTC (permalink / raw) To: Oded Gabbay Cc: Daniel Vetter, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote: > On 21/07/14 21:22, Daniel Vetter wrote: > > On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote: > >>> I'm not sure whether we can do the same trick with the hw scheduler. But > >>> then unpinning hw contexts will drain the pipeline anyway, so I guess we > >>> can just stop feeding the hw scheduler until it runs dry. And then unpin > >>> and evict. > >> So, I'm afraid but we can't do this for AMD Kaveri because: > > > > Well as long as you can drain the hw scheduler queue (and you can do > > that, worst case you have to unmap all the doorbells and other stuff > > to intercept further submission from userspace) you can evict stuff. > > I can't drain the hw scheduler queue, as I can't do mid-wave preemption. > Moreover, if I use the dequeue request register to preempt a queue > during a dispatch it may be that some waves (wave groups actually) of > the dispatch have not yet been created, and when I reactivate the mqd, > they should be created but are not. However, this works fine if you use > the HIQ. the CP ucode correctly saves and restores the state of an > outstanding dispatch. I don't think we have access to the state from > software at all, so it's not a bug, it is "as designed". > I think here Daniel is suggesting to unmapp the doorbell page, and track each write made by userspace to it and while unmapped wait for the gpu to drain or use some kind of fence on a special queue. Once GPU is drain we can move pinned buffer, then remap the doorbell and update it to the last value written by userspace which will resume execution to the next job. > > And if we don't want compute to be a denial of service on the display > > side of the driver we need this ability. Now if you go through an > > ioctl instead of the doorbell (I agree with Jerome here, the doorbell > > should be supported by benchmarks on linux) this gets a bit easier, > > but it's not a requirement really. > > -Daniel > > > On KV, we have the theoretical option of DOS on the display side as we > can't do a mid-wave preemption. On CZ, we won't have this problem. > > Oded -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
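Spelled out, the interception scheme Jerome sketches above would go roughly as follows: zap the user mapping of the doorbell page so submissions stop reaching the hardware, wait for the queues to drain, move the pinned buffers, then restore the mapping and poke the doorbell with the last write pointer userspace published (which, per Oded's earlier explanation, already sits in the wptr shadow in system memory). Every helper below is a stand-in for real mm/TTM/CP plumbing and none of it is existing radeon or amdkfd code.

/* Sketch of doorbell interception for eviction, per the discussion above. */
#include <stdint.h>

struct hsa_process;                 /* illustrative handle for one HSA process */

/* Assumed helpers, standing in for the real mm / GPU plumbing. */
extern void doorbell_zap_user_mapping(struct hsa_process *p);     /* unmap_mapping_range()-like */
extern void doorbell_restore_user_mapping(struct hsa_process *p);
extern void gpu_wait_queues_drained(struct hsa_process *p);       /* e.g. fence on a service queue */
extern void gpu_move_pinned_buffers(struct hsa_process *p);       /* the actual eviction work      */
extern uint32_t read_wptr_shadow(struct hsa_process *p);          /* shadow lives in system RAM    */
extern void write_doorbell(struct hsa_process *p, uint32_t wptr);

void evict_process_queues(struct hsa_process *p)
{
    /* 1. Stop new submissions: once the doorbell page is unmapped, userspace
     *    writes fault instead of reaching the hardware.  A real implementation
     *    would have the fault handler block or redirect to a scratch page
     *    rather than kill the process. */
    doorbell_zap_user_mapping(p);

    /* 2. Wait for the work already handed to the CP to run dry. */
    gpu_wait_queues_drained(p);

    /* 3. The pinned buffers (MQDs and friends) are now free to be moved. */
    gpu_move_pinned_buffers(p);

    /* 4. Restore the mapping and replay the last write pointer the process
     *    published in its wptr shadow, resuming at the next pending job. */
    doorbell_restore_user_mapping(p);
    write_doorbell(p, read_wptr_shadow(p));
}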
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-21 19:03 ` Jerome Glisse @ 2014-07-22 7:28 ` Daniel Vetter 2014-07-22 7:40 ` Daniel Vetter 2014-07-22 8:19 ` Oded Gabbay 0 siblings, 2 replies; 49+ messages in thread From: Daniel Vetter @ 2014-07-22 7:28 UTC (permalink / raw) To: Jerome Glisse Cc: Oded Gabbay, Daniel Vetter, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Mon, Jul 21, 2014 at 03:03:07PM -0400, Jerome Glisse wrote: > On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote: > > On 21/07/14 21:22, Daniel Vetter wrote: > > > On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote: > > >>> I'm not sure whether we can do the same trick with the hw scheduler. But > > >>> then unpinning hw contexts will drain the pipeline anyway, so I guess we > > >>> can just stop feeding the hw scheduler until it runs dry. And then unpin > > >>> and evict. > > >> So, I'm afraid but we can't do this for AMD Kaveri because: > > > > > > Well as long as you can drain the hw scheduler queue (and you can do > > > that, worst case you have to unmap all the doorbells and other stuff > > > to intercept further submission from userspace) you can evict stuff. > > > > I can't drain the hw scheduler queue, as I can't do mid-wave preemption. > > Moreover, if I use the dequeue request register to preempt a queue > > during a dispatch it may be that some waves (wave groups actually) of > > the dispatch have not yet been created, and when I reactivate the mqd, > > they should be created but are not. However, this works fine if you use > > the HIQ. the CP ucode correctly saves and restores the state of an > > outstanding dispatch. I don't think we have access to the state from > > software at all, so it's not a bug, it is "as designed". > > > > I think here Daniel is suggesting to unmapp the doorbell page, and track > each write made by userspace to it and while unmapped wait for the gpu to > drain or use some kind of fence on a special queue. Once GPU is drain we > can move pinned buffer, then remap the doorbell and update it to the last > value written by userspace which will resume execution to the next job. Exactly, just prevent userspace from submitting more. And if you have misbehaving userspace that submits too much, reset the gpu and tell it that you're sorry but won't schedule any more work. We have this already in i915 (since like all other gpus we're not preempting right now) and it works. There's some code floating around to even restrict the reset to _just_ the offending submission context, with nothing else getting corrupted. You can do all this with the doorbells and unmapping them, but it's a pain. Much easier if you have a real ioctl, and I haven't seen anyone with perf data indicating that an ioctl would be too much overhead on linux. Neither in this thread nor internally here at intel. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-22 7:28 ` Daniel Vetter @ 2014-07-22 7:40 ` Daniel Vetter 2014-07-22 8:21 ` Oded Gabbay 2014-07-22 8:19 ` Oded Gabbay 1 sibling, 1 reply; 49+ messages in thread From: Daniel Vetter @ 2014-07-22 7:40 UTC (permalink / raw) To: Jerome Glisse, Oded Gabbay, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Tue, Jul 22, 2014 at 09:28:51AM +0200, Daniel Vetter wrote: > On Mon, Jul 21, 2014 at 03:03:07PM -0400, Jerome Glisse wrote: > > On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote: > > > On 21/07/14 21:22, Daniel Vetter wrote: > > > > On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote: > > > >>> I'm not sure whether we can do the same trick with the hw scheduler. But > > > >>> then unpinning hw contexts will drain the pipeline anyway, so I guess we > > > >>> can just stop feeding the hw scheduler until it runs dry. And then unpin > > > >>> and evict. > > > >> So, I'm afraid but we can't do this for AMD Kaveri because: > > > > > > > > Well as long as you can drain the hw scheduler queue (and you can do > > > > that, worst case you have to unmap all the doorbells and other stuff > > > > to intercept further submission from userspace) you can evict stuff. > > > > > > I can't drain the hw scheduler queue, as I can't do mid-wave preemption. > > > Moreover, if I use the dequeue request register to preempt a queue > > > during a dispatch it may be that some waves (wave groups actually) of > > > the dispatch have not yet been created, and when I reactivate the mqd, > > > they should be created but are not. However, this works fine if you use > > > the HIQ. the CP ucode correctly saves and restores the state of an > > > outstanding dispatch. I don't think we have access to the state from > > > software at all, so it's not a bug, it is "as designed". > > > > > > > I think here Daniel is suggesting to unmapp the doorbell page, and track > > each write made by userspace to it and while unmapped wait for the gpu to > > drain or use some kind of fence on a special queue. Once GPU is drain we > > can move pinned buffer, then remap the doorbell and update it to the last > > value written by userspace which will resume execution to the next job. > > Exactly, just prevent userspace from submitting more. And if you have > misbehaving userspace that submits too much, reset the gpu and tell it > that you're sorry but won't schedule any more work. > > We have this already in i915 (since like all other gpus we're not > preempting right now) and it works. There's some code floating around to > even restrict the reset to _just_ the offending submission context, with > nothing else getting corrupted. > > You can do all this with the doorbells and unmapping them, but it's a > pain. Much easier if you have a real ioctl, and I haven't seen anyone with > perf data indicating that an ioctl would be too much overhead on linux. > Neither in this thread nor internally here at intel. Aside: Another reason why the ioctl is better than the doorbell is integration with other drivers. Yeah I know this is about compute, but sooner or later someone will want to e.g. post-proc video frames between the v4l capture device and the gpu mpeg encoder. Or something else fancy. 
Then you want to be able to somehow integrate into a cross-driver fence framework like android syncpts, and you can't do that without an ioctl for the compute submissions. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
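A rough picture of what the kernel gains from seeing each submission: it can attach a fence that other drivers wait on. The mock below is a userspace pthread model of such a fence, purely to show the signal/wait contract; it is not the kernel's struct fence API or android syncpts, and all names are made up for this example.

/*
 * Mock fence: one producer signals completion of a submitted job, any
 * number of consumers can wait for it before starting dependent work.
 */

#include <pthread.h>
#include <stdbool.h>

struct mock_fence {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool signaled;
};

static void mock_fence_init(struct mock_fence *f)
{
	pthread_mutex_init(&f->lock, NULL);
	pthread_cond_init(&f->cond, NULL);
	f->signaled = false;
}

/* Driver side: called when the submitted job completes on the GPU. */
static void mock_fence_signal(struct mock_fence *f)
{
	pthread_mutex_lock(&f->lock);
	f->signaled = true;
	pthread_cond_broadcast(&f->cond);
	pthread_mutex_unlock(&f->lock);
}

/* Consumer side: another driver that must not start before the job ends. */
static void mock_fence_wait(struct mock_fence *f)
{
	pthread_mutex_lock(&f->lock);
	while (!f->signaled)
		pthread_cond_wait(&f->cond, &f->lock);
	pthread_mutex_unlock(&f->lock);
}

With an ioctl-based submission path the kernel could create one of these per job and hand it to whoever needs to order work against it; with pure doorbell submission there is no point at which the kernel could attach it.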
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-22 7:40 ` Daniel Vetter @ 2014-07-22 8:21 ` Oded Gabbay 0 siblings, 0 replies; 49+ messages in thread From: Oded Gabbay @ 2014-07-22 8:21 UTC (permalink / raw) To: Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Sellek, Tom, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Christian König On 22/07/14 10:40, Daniel Vetter wrote: > On Tue, Jul 22, 2014 at 09:28:51AM +0200, Daniel Vetter wrote: >> On Mon, Jul 21, 2014 at 03:03:07PM -0400, Jerome Glisse wrote: >>> On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote: >>>> On 21/07/14 21:22, Daniel Vetter wrote: >>>>> On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote: >>>>>>> I'm not sure whether we can do the same trick with the hw scheduler. But >>>>>>> then unpinning hw contexts will drain the pipeline anyway, so I guess we >>>>>>> can just stop feeding the hw scheduler until it runs dry. And then unpin >>>>>>> and evict. >>>>>> So, I'm afraid but we can't do this for AMD Kaveri because: >>>>> >>>>> Well as long as you can drain the hw scheduler queue (and you can do >>>>> that, worst case you have to unmap all the doorbells and other stuff >>>>> to intercept further submission from userspace) you can evict stuff. >>>> >>>> I can't drain the hw scheduler queue, as I can't do mid-wave preemption. >>>> Moreover, if I use the dequeue request register to preempt a queue >>>> during a dispatch it may be that some waves (wave groups actually) of >>>> the dispatch have not yet been created, and when I reactivate the mqd, >>>> they should be created but are not. However, this works fine if you use >>>> the HIQ. the CP ucode correctly saves and restores the state of an >>>> outstanding dispatch. I don't think we have access to the state from >>>> software at all, so it's not a bug, it is "as designed". >>>> >>> >>> I think here Daniel is suggesting to unmapp the doorbell page, and track >>> each write made by userspace to it and while unmapped wait for the gpu to >>> drain or use some kind of fence on a special queue. Once GPU is drain we >>> can move pinned buffer, then remap the doorbell and update it to the last >>> value written by userspace which will resume execution to the next job. >> >> Exactly, just prevent userspace from submitting more. And if you have >> misbehaving userspace that submits too much, reset the gpu and tell it >> that you're sorry but won't schedule any more work. >> >> We have this already in i915 (since like all other gpus we're not >> preempting right now) and it works. There's some code floating around to >> even restrict the reset to _just_ the offending submission context, with >> nothing else getting corrupted. >> >> You can do all this with the doorbells and unmapping them, but it's a >> pain. Much easier if you have a real ioctl, and I haven't seen anyone with >> perf data indicating that an ioctl would be too much overhead on linux. >> Neither in this thread nor internally here at intel. > > Aside: Another reason why the ioctl is better than the doorbell is > integration with other drivers. Yeah I know this is about compute, but > sooner or later someone will want to e.g. post-proc video frames between > the v4l capture device and the gpu mpeg encoder. Or something else fancy. 
> > Then you want to be able to somehow integrate into a cross-driver fence > framework like android syncpts, and you can't do that without an ioctl for > the compute submissions. > -Daniel > I assume you are talking about interop between graphics and compute. For that, we have a module that is now being tested, and it indeed uses an ioctl to map a graphics object into the compute process's address space. However, after the translation is done, the work is done only in userspace. Oded -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-22 7:28 ` Daniel Vetter 2014-07-22 7:40 ` Daniel Vetter @ 2014-07-22 8:19 ` Oded Gabbay 2014-07-22 9:21 ` Daniel Vetter 1 sibling, 1 reply; 49+ messages in thread From: Oded Gabbay @ 2014-07-22 8:19 UTC (permalink / raw) To: Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On 22/07/14 10:28, Daniel Vetter wrote: > On Mon, Jul 21, 2014 at 03:03:07PM -0400, Jerome Glisse wrote: >> On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote: >>> On 21/07/14 21:22, Daniel Vetter wrote: >>>> On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote: >>>>>> I'm not sure whether we can do the same trick with the hw scheduler. But >>>>>> then unpinning hw contexts will drain the pipeline anyway, so I guess we >>>>>> can just stop feeding the hw scheduler until it runs dry. And then unpin >>>>>> and evict. >>>>> So, I'm afraid but we can't do this for AMD Kaveri because: >>>> >>>> Well as long as you can drain the hw scheduler queue (and you can do >>>> that, worst case you have to unmap all the doorbells and other stuff >>>> to intercept further submission from userspace) you can evict stuff. >>> >>> I can't drain the hw scheduler queue, as I can't do mid-wave preemption. >>> Moreover, if I use the dequeue request register to preempt a queue >>> during a dispatch it may be that some waves (wave groups actually) of >>> the dispatch have not yet been created, and when I reactivate the mqd, >>> they should be created but are not. However, this works fine if you use >>> the HIQ. the CP ucode correctly saves and restores the state of an >>> outstanding dispatch. I don't think we have access to the state from >>> software at all, so it's not a bug, it is "as designed". >>> >> >> I think here Daniel is suggesting to unmapp the doorbell page, and track >> each write made by userspace to it and while unmapped wait for the gpu to >> drain or use some kind of fence on a special queue. Once GPU is drain we >> can move pinned buffer, then remap the doorbell and update it to the last >> value written by userspace which will resume execution to the next job. > > Exactly, just prevent userspace from submitting more. And if you have > misbehaving userspace that submits too much, reset the gpu and tell it > that you're sorry but won't schedule any more work. I'm not sure how you intend to know if a userspace misbehaves or not. Can you elaborate ? Oded > > We have this already in i915 (since like all other gpus we're not > preempting right now) and it works. There's some code floating around to > even restrict the reset to _just_ the offending submission context, with > nothing else getting corrupted. > > You can do all this with the doorbells and unmapping them, but it's a > pain. Much easier if you have a real ioctl, and I haven't seen anyone with > perf data indicating that an ioctl would be too much overhead on linux. > Neither in this thread nor internally here at intel. > -Daniel > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-22 8:19 ` Oded Gabbay @ 2014-07-22 9:21 ` Daniel Vetter 2014-07-22 9:24 ` Daniel Vetter 2014-07-22 9:52 ` Oded Gabbay 0 siblings, 2 replies; 49+ messages in thread From: Daniel Vetter @ 2014-07-22 9:21 UTC (permalink / raw) To: Oded Gabbay Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote: >> Exactly, just prevent userspace from submitting more. And if you have >> misbehaving userspace that submits too much, reset the gpu and tell it >> that you're sorry but won't schedule any more work. > > I'm not sure how you intend to know if a userspace misbehaves or not. Can > you elaborate ? Well that's mostly policy, currently in i915 we only have a check for hangs, and if userspace hangs a bit too often then we stop it. I guess you can do that with the queue unmapping you've describe in reply to Jerome's mail. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
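A sketch of the kind of policy described above, counting hangs per context and banning a context that hangs too often. The threshold, window and names are made up for illustration; i915's real logic differs in detail.

#include <stdbool.h>
#include <time.h>

#define HANG_BAN_THRESHOLD  5     /* hangs tolerated ...          */
#define HANG_BAN_WINDOW_SEC 120   /* ... within this many seconds */

struct ctx_hang_stats {
	int hang_count;
	time_t window_start;
	bool banned;
};

/*
 * Called whenever hang detection fires for a context. Returns true once
 * the context should no longer be allowed to submit work.
 */
static bool ctx_note_hang(struct ctx_hang_stats *s)
{
	time_t now = time(NULL);

	/* start a fresh window on the first hang or after the window expires */
	if (s->hang_count == 0 || now - s->window_start > HANG_BAN_WINDOW_SEC) {
		s->window_start = now;
		s->hang_count = 0;
	}
	if (++s->hang_count >= HANG_BAN_THRESHOLD)
		s->banned = true;
	return s->banned;
}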
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-22 9:21 ` Daniel Vetter @ 2014-07-22 9:24 ` Daniel Vetter 2014-07-22 9:52 ` Oded Gabbay 1 sibling, 0 replies; 49+ messages in thread From: Daniel Vetter @ 2014-07-22 9:24 UTC (permalink / raw) To: Oded Gabbay Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm On Tue, Jul 22, 2014 at 11:21 AM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote: >>> Exactly, just prevent userspace from submitting more. And if you have >>> misbehaving userspace that submits too much, reset the gpu and tell it >>> that you're sorry but won't schedule any more work. >> >> I'm not sure how you intend to know if a userspace misbehaves or not. Can >> you elaborate ? > > Well that's mostly policy, currently in i915 we only have a check for > hangs, and if userspace hangs a bit too often then we stop it. I guess > you can do that with the queue unmapping you've describe in reply to > Jerome's mail. Not just graphics, and especially not just graphics from amd. My experience is that soc designers are _really_ good at stitching randoms stuff together. So you need to deal with non-radeon drivers very likely, too. Also the real problem isn't really the memory sharing - we have dma-buf already and could add a special mmap flag to make sure it will work with svm/iommuv2. The problem is synchronization (either with the new struct fence stuff from Maarten or with android syncpoints or something like that). And for that to be possible you need to go through the kernel. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-22 9:21 ` Daniel Vetter 2014-07-22 9:24 ` Daniel Vetter @ 2014-07-22 9:52 ` Oded Gabbay 2014-07-22 11:15 ` Daniel Vetter 1 sibling, 1 reply; 49+ messages in thread From: Oded Gabbay @ 2014-07-22 9:52 UTC (permalink / raw) To: Daniel Vetter Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Sellek, Tom On 22/07/14 12:21, Daniel Vetter wrote: > On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote: >>> Exactly, just prevent userspace from submitting more. And if you have >>> misbehaving userspace that submits too much, reset the gpu and tell it >>> that you're sorry but won't schedule any more work. >> >> I'm not sure how you intend to know if a userspace misbehaves or not. Can >> you elaborate ? > > Well that's mostly policy, currently in i915 we only have a check for > hangs, and if userspace hangs a bit too often then we stop it. I guess > you can do that with the queue unmapping you've describe in reply to > Jerome's mail. > -Daniel > What do you mean by hang ? Like the tdr mechanism in Windows (checks if a gpu job takes more than 2 seconds, I think, and if so, terminates the job). Oded -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-22 9:52 ` Oded Gabbay @ 2014-07-22 11:15 ` Daniel Vetter 2014-07-23 6:50 ` Oded Gabbay 0 siblings, 1 reply; 49+ messages in thread From: Daniel Vetter @ 2014-07-22 11:15 UTC (permalink / raw) To: Oded Gabbay Cc: Daniel Vetter, Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Sellek, Tom On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote: > On 22/07/14 12:21, Daniel Vetter wrote: > >On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote: > >>>Exactly, just prevent userspace from submitting more. And if you have > >>>misbehaving userspace that submits too much, reset the gpu and tell it > >>>that you're sorry but won't schedule any more work. > >> > >>I'm not sure how you intend to know if a userspace misbehaves or not. Can > >>you elaborate ? > > > >Well that's mostly policy, currently in i915 we only have a check for > >hangs, and if userspace hangs a bit too often then we stop it. I guess > >you can do that with the queue unmapping you've describe in reply to > >Jerome's mail. > >-Daniel > > > What do you mean by hang ? Like the tdr mechanism in Windows (checks if a > gpu job takes more than 2 seconds, I think, and if so, terminates the job). Essentially yes. But we also have some hw features to kill jobs quicker, e.g. for media workloads. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
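For reference, a TDR-style check reduces to a per-job deadline, something like the sketch below. This is purely illustrative; radeon and i915 implement their own per-ring lockup detection rather than this structure, and the field names are invented.

#include <stdbool.h>
#include <time.h>

struct gpu_job {
	time_t submitted;            /* when the job was handed to the HW */
	unsigned int timeout_sec;    /* ~2s for graphics, configurable    */
	bool completed;
};

/* Called from a periodic watchdog (a timer or workqueue in a real driver). */
static bool job_is_hung(const struct gpu_job *job)
{
	if (job->completed)
		return false;
	return time(NULL) - job->submitted > (time_t)job->timeout_sec;
}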
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-22 11:15 ` Daniel Vetter @ 2014-07-23 6:50 ` Oded Gabbay 2014-07-23 7:04 ` Christian König 2014-07-23 7:05 ` Daniel Vetter 0 siblings, 2 replies; 49+ messages in thread From: Oded Gabbay @ 2014-07-23 6:50 UTC (permalink / raw) To: Jerome Glisse, Christian König, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Sellek, Tom On 22/07/14 14:15, Daniel Vetter wrote: > On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote: >> On 22/07/14 12:21, Daniel Vetter wrote: >>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote: >>>>> Exactly, just prevent userspace from submitting more. And if you have >>>>> misbehaving userspace that submits too much, reset the gpu and tell it >>>>> that you're sorry but won't schedule any more work. >>>> >>>> I'm not sure how you intend to know if a userspace misbehaves or not. Can >>>> you elaborate ? >>> >>> Well that's mostly policy, currently in i915 we only have a check for >>> hangs, and if userspace hangs a bit too often then we stop it. I guess >>> you can do that with the queue unmapping you've describe in reply to >>> Jerome's mail. >>> -Daniel >>> >> What do you mean by hang ? Like the tdr mechanism in Windows (checks if a >> gpu job takes more than 2 seconds, I think, and if so, terminates the job). > > Essentially yes. But we also have some hw features to kill jobs quicker, > e.g. for media workloads. > -Daniel > Yeah, so this is what I'm talking about when I say that you and Jerome come from a graphics POV and amdkfd come from a compute POV, no offense intended. For compute jobs, we simply can't use this logic to terminate jobs. Graphics are mostly Real-Time while compute jobs can take from a few ms to a few hours!!! And I'm not talking about an entire application runtime but on a single submission of jobs by the userspace app. We have tests with jobs that take between 20-30 minutes to complete. In theory, we can even imagine a compute job which takes 1 or 2 days (on larger APUs). Now, I understand the question of how do we prevent the compute job from monopolizing the GPU, and internally here we have some ideas that we will probably share in the next few days, but my point is that I don't think we can terminate a compute job because it is running for more than x seconds. It is like you would terminate a CPU process which runs more than x seconds. I think this is a *very* important discussion (detecting a misbehaved compute process) and I would like to continue it, but I don't think moving the job submission from userspace control to kernel control will solve this core problem. Oded -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-23 6:50 ` Oded Gabbay @ 2014-07-23 7:04 ` Christian König 2014-07-23 13:39 ` Bridgman, John 2014-07-23 14:56 ` Jerome Glisse 2014-07-23 7:05 ` Daniel Vetter 1 sibling, 2 replies; 49+ messages in thread From: Christian König @ 2014-07-23 7:04 UTC (permalink / raw) To: Oded Gabbay, Jerome Glisse, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Sellek, Tom Am 23.07.2014 08:50, schrieb Oded Gabbay: > On 22/07/14 14:15, Daniel Vetter wrote: >> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote: >>> On 22/07/14 12:21, Daniel Vetter wrote: >>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> >>>> wrote: >>>>>> Exactly, just prevent userspace from submitting more. And if you >>>>>> have >>>>>> misbehaving userspace that submits too much, reset the gpu and >>>>>> tell it >>>>>> that you're sorry but won't schedule any more work. >>>>> >>>>> I'm not sure how you intend to know if a userspace misbehaves or >>>>> not. Can >>>>> you elaborate ? >>>> >>>> Well that's mostly policy, currently in i915 we only have a check for >>>> hangs, and if userspace hangs a bit too often then we stop it. I guess >>>> you can do that with the queue unmapping you've describe in reply to >>>> Jerome's mail. >>>> -Daniel >>>> >>> What do you mean by hang ? Like the tdr mechanism in Windows (checks >>> if a >>> gpu job takes more than 2 seconds, I think, and if so, terminates >>> the job). >> >> Essentially yes. But we also have some hw features to kill jobs quicker, >> e.g. for media workloads. >> -Daniel >> > > Yeah, so this is what I'm talking about when I say that you and Jerome > come from a graphics POV and amdkfd come from a compute POV, no > offense intended. > > For compute jobs, we simply can't use this logic to terminate jobs. > Graphics are mostly Real-Time while compute jobs can take from a few > ms to a few hours!!! And I'm not talking about an entire application > runtime but on a single submission of jobs by the userspace app. We > have tests with jobs that take between 20-30 minutes to complete. In > theory, we can even imagine a compute job which takes 1 or 2 days (on > larger APUs). > > Now, I understand the question of how do we prevent the compute job > from monopolizing the GPU, and internally here we have some ideas that > we will probably share in the next few days, but my point is that I > don't think we can terminate a compute job because it is running for > more than x seconds. It is like you would terminate a CPU process > which runs more than x seconds. Yeah that's why one of the first things I've did was making the timeout configurable in the radeon module. But it doesn't necessary needs be a timeout, we should also kill a running job submission if the CPU process associated with the job is killed. > I think this is a *very* important discussion (detecting a misbehaved > compute process) and I would like to continue it, but I don't think > moving the job submission from userspace control to kernel control > will solve this core problem. We need to get this topic solved, otherwise the driver won't make it upstream. Allowing userpsace to monopolizing resources either memory, CPU or GPU time or special things like counters etc... is a strict no go for a kernel module. I agree that moving the job submission from userpsace to kernel wouldn't solve this problem. 
As Daniel and I pointed out now multiple times it's rather easily possible to prevent further job submissions from userspace, in the worst case by unmapping the doorbell page. Moving it to an IOCTL would just make it a bit less complicated. Christian. > > Oded -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
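The configurable timeout mentioned above boils down to a module parameter that the lockup check consults. Below is a minimal fragment of how such a knob is usually exposed; the parameter name, default and semantics here are placeholders, not necessarily the ones radeon uses.

#include <linux/module.h>
#include <linux/moduleparam.h>

/* 0 disables the lockup check entirely; units are milliseconds. */
static int lockup_timeout_ms = 10000;
module_param(lockup_timeout_ms, int, 0444);
MODULE_PARM_DESC(lockup_timeout_ms,
		 "GPU lockup timeout in ms (0 disables the check)");

MODULE_LICENSE("GPL");

The hang check then compares elapsed time against lockup_timeout_ms and skips the check when it is zero, which is what would let a user with very long compute jobs opt out of the timeout.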
* RE: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-23 7:04 ` Christian König @ 2014-07-23 13:39 ` Bridgman, John 2014-07-23 14:56 ` Jerome Glisse 1 sibling, 0 replies; 49+ messages in thread From: Bridgman, John @ 2014-07-23 13:39 UTC (permalink / raw) To: Christian König, Gabbay, Oded, Jerome Glisse, David Airlie, Alex Deucher, Andrew Morton, Joerg Roedel, Lewycky, Andrew, Daenzer, Michel, Goz, Ben, Skidanov, Alexey, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Sellek, Tom >-----Original Message----- >From: Christian König [mailto:deathsimple@vodafone.de] >Sent: Wednesday, July 23, 2014 3:04 AM >To: Gabbay, Oded; Jerome Glisse; David Airlie; Alex Deucher; Andrew >Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew; Daenzer, Michel; >Goz, Ben; Skidanov, Alexey; linux-kernel@vger.kernel.org; dri- >devel@lists.freedesktop.org; linux-mm; Sellek, Tom >Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver > >Am 23.07.2014 08:50, schrieb Oded Gabbay: >> On 22/07/14 14:15, Daniel Vetter wrote: >>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote: >>>> On 22/07/14 12:21, Daniel Vetter wrote: >>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay ><oded.gabbay@amd.com> >>>>> wrote: >>>>>>> Exactly, just prevent userspace from submitting more. And if you >>>>>>> have misbehaving userspace that submits too much, reset the gpu >>>>>>> and tell it that you're sorry but won't schedule any more work. >>>>>> >>>>>> I'm not sure how you intend to know if a userspace misbehaves or >>>>>> not. Can you elaborate ? >>>>> >>>>> Well that's mostly policy, currently in i915 we only have a check >>>>> for hangs, and if userspace hangs a bit too often then we stop it. >>>>> I guess you can do that with the queue unmapping you've describe in >>>>> reply to Jerome's mail. >>>>> -Daniel >>>>> >>>> What do you mean by hang ? Like the tdr mechanism in Windows (checks >>>> if a gpu job takes more than 2 seconds, I think, and if so, >>>> terminates the job). >>> >>> Essentially yes. But we also have some hw features to kill jobs >>> quicker, e.g. for media workloads. >>> -Daniel >>> >> >> Yeah, so this is what I'm talking about when I say that you and Jerome >> come from a graphics POV and amdkfd come from a compute POV, no >> offense intended. >> >> For compute jobs, we simply can't use this logic to terminate jobs. >> Graphics are mostly Real-Time while compute jobs can take from a few >> ms to a few hours!!! And I'm not talking about an entire application >> runtime but on a single submission of jobs by the userspace app. We >> have tests with jobs that take between 20-30 minutes to complete. In >> theory, we can even imagine a compute job which takes 1 or 2 days (on >> larger APUs). >> >> Now, I understand the question of how do we prevent the compute job >> from monopolizing the GPU, and internally here we have some ideas that >> we will probably share in the next few days, but my point is that I >> don't think we can terminate a compute job because it is running for >> more than x seconds. It is like you would terminate a CPU process >> which runs more than x seconds. > >Yeah that's why one of the first things I've did was making the timeout >configurable in the radeon module. > >But it doesn't necessary needs be a timeout, we should also kill a running job >submission if the CPU process associated with the job is killed. 
> >> I think this is a *very* important discussion (detecting a misbehaved >> compute process) and I would like to continue it, but I don't think >> moving the job submission from userspace control to kernel control >> will solve this core problem. > >We need to get this topic solved, otherwise the driver won't make it >upstream. Allowing userpsace to monopolizing resources either memory, >CPU or GPU time or special things like counters etc... is a strict no go for a >kernel module. > >I agree that moving the job submission from userpsace to kernel wouldn't >solve this problem. As Daniel and I pointed out now multiple times it's rather >easily possible to prevent further job submissions from userspace, in the >worst case by unmapping the doorbell page. > >Moving it to an IOCTL would just make it a bit less complicated. Hi Christian; HSA uses usermode queues so that programs running on GPU can dispatch work to themselves or to other GPUs with a consistent dispatch mechanism for CPU and GPU code. We could potentially use s_msg and trap every GPU dispatch back through CPU code but that gets slow and ugly very quickly. > >Christian. > >> >> Oded -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-23 7:04 ` Christian König 2014-07-23 13:39 ` Bridgman, John @ 2014-07-23 14:56 ` Jerome Glisse 2014-07-23 19:49 ` Alex Deucher 1 sibling, 1 reply; 49+ messages in thread From: Jerome Glisse @ 2014-07-23 14:56 UTC (permalink / raw) To: Christian König Cc: Oded Gabbay, David Airlie, Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Sellek, Tom On Wed, Jul 23, 2014 at 09:04:24AM +0200, Christian Konig wrote: > Am 23.07.2014 08:50, schrieb Oded Gabbay: > >On 22/07/14 14:15, Daniel Vetter wrote: > >>On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote: > >>>On 22/07/14 12:21, Daniel Vetter wrote: > >>>>On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> > >>>>wrote: > >>>>>>Exactly, just prevent userspace from submitting more. And if you > >>>>>>have > >>>>>>misbehaving userspace that submits too much, reset the gpu and > >>>>>>tell it > >>>>>>that you're sorry but won't schedule any more work. > >>>>> > >>>>>I'm not sure how you intend to know if a userspace misbehaves or > >>>>>not. Can > >>>>>you elaborate ? > >>>> > >>>>Well that's mostly policy, currently in i915 we only have a check for > >>>>hangs, and if userspace hangs a bit too often then we stop it. I guess > >>>>you can do that with the queue unmapping you've describe in reply to > >>>>Jerome's mail. > >>>>-Daniel > >>>> > >>>What do you mean by hang ? Like the tdr mechanism in Windows (checks > >>>if a > >>>gpu job takes more than 2 seconds, I think, and if so, terminates the > >>>job). > >> > >>Essentially yes. But we also have some hw features to kill jobs quicker, > >>e.g. for media workloads. > >>-Daniel > >> > > > >Yeah, so this is what I'm talking about when I say that you and Jerome > >come from a graphics POV and amdkfd come from a compute POV, no offense > >intended. > > > >For compute jobs, we simply can't use this logic to terminate jobs. > >Graphics are mostly Real-Time while compute jobs can take from a few ms to > >a few hours!!! And I'm not talking about an entire application runtime but > >on a single submission of jobs by the userspace app. We have tests with > >jobs that take between 20-30 minutes to complete. In theory, we can even > >imagine a compute job which takes 1 or 2 days (on larger APUs). > > > >Now, I understand the question of how do we prevent the compute job from > >monopolizing the GPU, and internally here we have some ideas that we will > >probably share in the next few days, but my point is that I don't think we > >can terminate a compute job because it is running for more than x seconds. > >It is like you would terminate a CPU process which runs more than x > >seconds. > > Yeah that's why one of the first things I've did was making the timeout > configurable in the radeon module. > > But it doesn't necessary needs be a timeout, we should also kill a running > job submission if the CPU process associated with the job is killed. > > >I think this is a *very* important discussion (detecting a misbehaved > >compute process) and I would like to continue it, but I don't think moving > >the job submission from userspace control to kernel control will solve > >this core problem. > > We need to get this topic solved, otherwise the driver won't make it > upstream. Allowing userpsace to monopolizing resources either memory, CPU or > GPU time or special things like counters etc... 
is a strict no go for a > kernel module. > > I agree that moving the job submission from userpsace to kernel wouldn't > solve this problem. As Daniel and I pointed out now multiple times it's > rather easily possible to prevent further job submissions from userspace, in > the worst case by unmapping the doorbell page. > > Moving it to an IOCTL would just make it a bit less complicated. > It is not only complexity, my main concern is not really the amount of memory pinned (well it would be if it was vram which by the way you need to remove the api that allow to allocate vram just so that it clearly shows that vram is not allowed). Issue is with GPU address space fragmentation, new process hsa queue might be allocated in middle of gtt space and stays there for so long that i will forbid any big buffer to be bind to gtt. Thought with virtual address space for graphics this is less of an issue and only the kernel suffer but still it might block the kernel from evicting some VRAM because i can not bind a system buffer big enough to GTT because some GTT space is taken by some HSA queue. To mitigate this at very least, you need to implement special memory allocation inside ttm and radeon to force this per queue to be allocate for instance from top of GTT space. Like reserve top 8M of GTT and have it grow/shrink depending on number of queue. Cheers, Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
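A toy model of the mitigation proposed above, carving queue allocations downward from the top of the GTT aperture so long-lived queue buffers cannot fragment the rest of the range. It is deliberately simplified (no freeing or shrinking) and is not how ttm's range manager is actually structured; the names are invented for this example.

#include <stdint.h>

struct gtt_space {
	uint64_t size;        /* total GTT aperture size in bytes   */
	uint64_t top_used;    /* bytes reserved from the top so far */
};

/*
 * Hand out space for a long-lived queue buffer, growing downward from the
 * end of the aperture. Returns the GTT offset, or UINT64_MAX on failure.
 */
static uint64_t gtt_alloc_from_top(struct gtt_space *gtt, uint64_t bytes)
{
	if (gtt->top_used + bytes > gtt->size)
		return UINT64_MAX;
	gtt->top_used += bytes;
	return gtt->size - gtt->top_used;
}

/*
 * Ordinary bottom-up allocations are then confined to
 * [0, size - top_used) and never collide with the queue region.
 */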
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-23 14:56 ` Jerome Glisse @ 2014-07-23 19:49 ` Alex Deucher 2014-07-23 20:25 ` Jerome Glisse 0 siblings, 1 reply; 49+ messages in thread From: Alex Deucher @ 2014-07-23 19:49 UTC (permalink / raw) To: Jerome Glisse Cc: Christian König, Oded Gabbay, David Airlie, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Sellek, Tom On Wed, Jul 23, 2014 at 10:56 AM, Jerome Glisse <j.glisse@gmail.com> wrote: > On Wed, Jul 23, 2014 at 09:04:24AM +0200, Christian König wrote: >> Am 23.07.2014 08:50, schrieb Oded Gabbay: >> >On 22/07/14 14:15, Daniel Vetter wrote: >> >>On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote: >> >>>On 22/07/14 12:21, Daniel Vetter wrote: >> >>>>On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> >> >>>>wrote: >> >>>>>>Exactly, just prevent userspace from submitting more. And if you >> >>>>>>have >> >>>>>>misbehaving userspace that submits too much, reset the gpu and >> >>>>>>tell it >> >>>>>>that you're sorry but won't schedule any more work. >> >>>>> >> >>>>>I'm not sure how you intend to know if a userspace misbehaves or >> >>>>>not. Can >> >>>>>you elaborate ? >> >>>> >> >>>>Well that's mostly policy, currently in i915 we only have a check for >> >>>>hangs, and if userspace hangs a bit too often then we stop it. I guess >> >>>>you can do that with the queue unmapping you've describe in reply to >> >>>>Jerome's mail. >> >>>>-Daniel >> >>>> >> >>>What do you mean by hang ? Like the tdr mechanism in Windows (checks >> >>>if a >> >>>gpu job takes more than 2 seconds, I think, and if so, terminates the >> >>>job). >> >> >> >>Essentially yes. But we also have some hw features to kill jobs quicker, >> >>e.g. for media workloads. >> >>-Daniel >> >> >> > >> >Yeah, so this is what I'm talking about when I say that you and Jerome >> >come from a graphics POV and amdkfd come from a compute POV, no offense >> >intended. >> > >> >For compute jobs, we simply can't use this logic to terminate jobs. >> >Graphics are mostly Real-Time while compute jobs can take from a few ms to >> >a few hours!!! And I'm not talking about an entire application runtime but >> >on a single submission of jobs by the userspace app. We have tests with >> >jobs that take between 20-30 minutes to complete. In theory, we can even >> >imagine a compute job which takes 1 or 2 days (on larger APUs). >> > >> >Now, I understand the question of how do we prevent the compute job from >> >monopolizing the GPU, and internally here we have some ideas that we will >> >probably share in the next few days, but my point is that I don't think we >> >can terminate a compute job because it is running for more than x seconds. >> >It is like you would terminate a CPU process which runs more than x >> >seconds. >> >> Yeah that's why one of the first things I've did was making the timeout >> configurable in the radeon module. >> >> But it doesn't necessary needs be a timeout, we should also kill a running >> job submission if the CPU process associated with the job is killed. >> >> >I think this is a *very* important discussion (detecting a misbehaved >> >compute process) and I would like to continue it, but I don't think moving >> >the job submission from userspace control to kernel control will solve >> >this core problem. >> >> We need to get this topic solved, otherwise the driver won't make it >> upstream. 
Allowing userpsace to monopolizing resources either memory, CPU or >> GPU time or special things like counters etc... is a strict no go for a >> kernel module. >> >> I agree that moving the job submission from userpsace to kernel wouldn't >> solve this problem. As Daniel and I pointed out now multiple times it's >> rather easily possible to prevent further job submissions from userspace, in >> the worst case by unmapping the doorbell page. >> >> Moving it to an IOCTL would just make it a bit less complicated. >> > > It is not only complexity, my main concern is not really the amount of memory > pinned (well it would be if it was vram which by the way you need to remove > the api that allow to allocate vram just so that it clearly shows that vram is > not allowed). > > Issue is with GPU address space fragmentation, new process hsa queue might be > allocated in middle of gtt space and stays there for so long that i will forbid > any big buffer to be bind to gtt. Thought with virtual address space for graphics > this is less of an issue and only the kernel suffer but still it might block the > kernel from evicting some VRAM because i can not bind a system buffer big enough > to GTT because some GTT space is taken by some HSA queue. > > To mitigate this at very least, you need to implement special memory allocation > inside ttm and radeon to force this per queue to be allocate for instance from > top of GTT space. Like reserve top 8M of GTT and have it grow/shrink depending > on number of queue. This same sort of thing can already happen with gfx, although it's less likely since the workloads are usually shorter. That said, we can issue compute jobs right today with the current CS ioctl and we may end up with a buffer pinned in an inopportune spot. I'm not sure reserving a static pool at init really helps that much. If you aren't using any HSA apps, it just wastes gtt space. So you have a trade off: waste memory for a possibly unused MQD descriptor pool or allocate MQD descriptors on the fly, but possibly end up with a long running one stuck in a bad location. Additionally, we already have a ttm flag for whether we want to allocate from the top or bottom of the pool. We use it today for gfx depending on the buffer (e.g., buffers smaller than 512k are allocated from the bottom and buffers larger than 512 are allocated from the top). So we can't really re-size a static buffer easily as there may already be other buffers pinned up there. If we add sysfs controls to limit the amount of hsa processes, and queues per process so you could use this to dynamically limit the max amount gtt memory that would be in use for MQD descriptors. Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
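To illustrate the point about bounding MQD usage: with limits on the number of HSA processes and queues per process, the worst-case GTT consumption is a simple product. The numbers below are placeholders, not real amdkfd defaults or real MQD sizes.

#include <stdio.h>

int main(void)
{
	const unsigned int max_processes = 32;
	const unsigned int max_queues_per_process = 128;
	const unsigned int mqd_bytes = 512;   /* per-queue descriptor, placeholder */

	unsigned long long worst_case =
		(unsigned long long)max_processes *
		max_queues_per_process * mqd_bytes;

	printf("worst-case MQD GTT usage: %llu KiB\n", worst_case / 1024);
	return 0;
}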
* Re: [PATCH v2 00/25] AMDKFD kernel driver 2014-07-23 19:49 ` Alex Deucher @ 2014-07-23 20:25 ` Jerome Glisse 0 siblings, 0 replies; 49+ messages in thread From: Jerome Glisse @ 2014-07-23 20:25 UTC (permalink / raw) To: Alex Deucher Cc: Christian König, Oded Gabbay, David Airlie, Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, linux-mm, Sellek, Tom On Wed, Jul 23, 2014 at 03:49:57PM -0400, Alex Deucher wrote: > On Wed, Jul 23, 2014 at 10:56 AM, Jerome Glisse <j.glisse@gmail.com> wrote: > > On Wed, Jul 23, 2014 at 09:04:24AM +0200, Christian Konig wrote: > >> Am 23.07.2014 08:50, schrieb Oded Gabbay: > >> >On 22/07/14 14:15, Daniel Vetter wrote: > >> >>On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote: > >> >>>On 22/07/14 12:21, Daniel Vetter wrote: > >> >>>>On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> > >> >>>>wrote: > >> >>>>>>Exactly, just prevent userspace from submitting more. And if you > >> >>>>>>have > >> >>>>>>misbehaving userspace that submits too much, reset the gpu and > >> >>>>>>tell it > >> >>>>>>that you're sorry but won't schedule any more work. > >> >>>>> > >> >>>>>I'm not sure how you intend to know if a userspace misbehaves or > >> >>>>>not. Can > >> >>>>>you elaborate ? > >> >>>> > >> >>>>Well that's mostly policy, currently in i915 we only have a check for > >> >>>>hangs, and if userspace hangs a bit too often then we stop it. I guess > >> >>>>you can do that with the queue unmapping you've describe in reply to > >> >>>>Jerome's mail. > >> >>>>-Daniel > >> >>>> > >> >>>What do you mean by hang ? Like the tdr mechanism in Windows (checks > >> >>>if a > >> >>>gpu job takes more than 2 seconds, I think, and if so, terminates the > >> >>>job). > >> >> > >> >>Essentially yes. But we also have some hw features to kill jobs quicker, > >> >>e.g. for media workloads. > >> >>-Daniel > >> >> > >> > > >> >Yeah, so this is what I'm talking about when I say that you and Jerome > >> >come from a graphics POV and amdkfd come from a compute POV, no offense > >> >intended. > >> > > >> >For compute jobs, we simply can't use this logic to terminate jobs. > >> >Graphics are mostly Real-Time while compute jobs can take from a few ms to > >> >a few hours!!! And I'm not talking about an entire application runtime but > >> >on a single submission of jobs by the userspace app. We have tests with > >> >jobs that take between 20-30 minutes to complete. In theory, we can even > >> >imagine a compute job which takes 1 or 2 days (on larger APUs). > >> > > >> >Now, I understand the question of how do we prevent the compute job from > >> >monopolizing the GPU, and internally here we have some ideas that we will > >> >probably share in the next few days, but my point is that I don't think we > >> >can terminate a compute job because it is running for more than x seconds. > >> >It is like you would terminate a CPU process which runs more than x > >> >seconds. > >> > >> Yeah that's why one of the first things I've did was making the timeout > >> configurable in the radeon module. > >> > >> But it doesn't necessary needs be a timeout, we should also kill a running > >> job submission if the CPU process associated with the job is killed. 
> >> > >> >I think this is a *very* important discussion (detecting a misbehaved > >> >compute process) and I would like to continue it, but I don't think moving > >> >the job submission from userspace control to kernel control will solve > >> >this core problem. > >> > >> We need to get this topic solved, otherwise the driver won't make it > >> upstream. Allowing userpsace to monopolizing resources either memory, CPU or > >> GPU time or special things like counters etc... is a strict no go for a > >> kernel module. > >> > >> I agree that moving the job submission from userpsace to kernel wouldn't > >> solve this problem. As Daniel and I pointed out now multiple times it's > >> rather easily possible to prevent further job submissions from userspace, in > >> the worst case by unmapping the doorbell page. > >> > >> Moving it to an IOCTL would just make it a bit less complicated. > >> > > > > It is not only complexity, my main concern is not really the amount of memory > > pinned (well it would be if it was vram which by the way you need to remove > > the api that allow to allocate vram just so that it clearly shows that vram is > > not allowed). > > > > Issue is with GPU address space fragmentation, new process hsa queue might be > > allocated in middle of gtt space and stays there for so long that i will forbid > > any big buffer to be bind to gtt. Thought with virtual address space for graphics > > this is less of an issue and only the kernel suffer but still it might block the > > kernel from evicting some VRAM because i can not bind a system buffer big enough > > to GTT because some GTT space is taken by some HSA queue. > > > > To mitigate this at very least, you need to implement special memory allocation > > inside ttm and radeon to force this per queue to be allocate for instance from > > top of GTT space. Like reserve top 8M of GTT and have it grow/shrink depending > > on number of queue. > > This same sort of thing can already happen with gfx, although it's > less likely since the workloads are usually shorter. That said, we > can issue compute jobs right today with the current CS ioctl and we > may end up with a buffer pinned in an inopportune spot. I thought compute was using virtual address space (well on > cayman at least). > I'm not sure > reserving a static pool at init really helps that much. If you aren't > using any HSA apps, it just wastes gtt space. So you have a trade > off: waste memory for a possibly unused MQD descriptor pool or > allocate MQD descriptors on the fly, but possibly end up with a long > running one stuck in a bad location. Additionally, we already have a > ttm flag for whether we want to allocate from the top or bottom of the > pool. We use it today for gfx depending on the buffer (e.g., buffers > smaller than 512k are allocated from the bottom and buffers larger > than 512 are allocated from the top). So we can't really re-size a > static buffer easily as there may already be other buffers pinned up > there. Again here iirc only kernel use the GTT space everything else (userspace) is using virtual address space or am i forgeting something ? My point was not so much to be static but to enforce doing it from one end of the address space and to have shrink/grow depending on usage forcing anything else out of that range. On VM GPU only thing left using the "global" GTT is the kernel, it uses it for ring and for moving buffer around. 
I would assume that pinning ring buffers at the beginning of the address space, no matter what their size is, would be a good idea, as those will not fragment anyway, i.e. their lifetime is the lifetime of the driver. My point is that the HSA queue buffers can have a lifetime way bigger than anything we have now; right now we can bind/unbind any buffer between CS submissions, modulo OpenCL tasks. > > If we add sysfs controls to limit the amount of hsa processes, and > queues per process so you could use this to dynamically limit the max > amount gtt memory that would be in use for MQD descriptors. No, this cannot be set dynamically: once a process has created its queue it has it, and I see no channel to tell userspace: "sorry buddy but no more room for you" > > Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23  6:50 ` Oded Gabbay
  2014-07-23  7:04 ` Christian König
@ 2014-07-23  7:05 ` Daniel Vetter
  2014-07-23  8:35 ` Oded Gabbay
  2014-07-23 13:33 ` Bridgman, John
  1 sibling, 2 replies; 49+ messages in thread
From: Daniel Vetter @ 2014-07-23 7:05 UTC (permalink / raw)
To: Oded Gabbay
Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
    Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
    Michel Dänzer, Ben Goz, Alexey Skidanov,
    linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
    linux-mm, Sellek, Tom

On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com> wrote:
> On 22/07/14 14:15, Daniel Vetter wrote:
>>
>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>>>
>>> On 22/07/14 12:21, Daniel Vetter wrote:
>>>>
>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com>
>>>> wrote:
>>>>>>
>>>>>> Exactly, just prevent userspace from submitting more. And if you have
>>>>>> misbehaving userspace that submits too much, reset the gpu and tell it
>>>>>> that you're sorry but won't schedule any more work.
>>>>>
>>>>> I'm not sure how you intend to know if a userspace misbehaves or not.
>>>>> Can you elaborate ?
>>>>
>>>> Well that's mostly policy, currently in i915 we only have a check for
>>>> hangs, and if userspace hangs a bit too often then we stop it. I guess
>>>> you can do that with the queue unmapping you've describe in reply to
>>>> Jerome's mail.
>>>> -Daniel
>>>>
>>> What do you mean by hang ? Like the tdr mechanism in Windows (checks if a
>>> gpu job takes more than 2 seconds, I think, and if so, terminates the
>>> job).
>>
>> Essentially yes. But we also have some hw features to kill jobs quicker,
>> e.g. for media workloads.
>> -Daniel
>>
>
> Yeah, so this is what I'm talking about when I say that you and Jerome come
> from a graphics POV and amdkfd come from a compute POV, no offense intended.
>
> For compute jobs, we simply can't use this logic to terminate jobs. Graphics
> are mostly Real-Time while compute jobs can take from a few ms to a few
> hours!!! And I'm not talking about an entire application runtime but on a
> single submission of jobs by the userspace app. We have tests with jobs that
> take between 20-30 minutes to complete. In theory, we can even imagine a
> compute job which takes 1 or 2 days (on larger APUs).
>
> Now, I understand the question of how do we prevent the compute job from
> monopolizing the GPU, and internally here we have some ideas that we will
> probably share in the next few days, but my point is that I don't think we
> can terminate a compute job because it is running for more than x seconds.
> It is like you would terminate a CPU process which runs more than x seconds.
>
> I think this is a *very* important discussion (detecting a misbehaved
> compute process) and I would like to continue it, but I don't think moving
> the job submission from userspace control to kernel control will solve this
> core problem.

Well graphics gets away with cooperative scheduling since usually people
want to see stuff within a few frames, so we can legitimately kill jobs
after a fairly short timeout. Imo if you want to allow userspace to submit
compute jobs that are atomic and take a few minutes to hours with no
break-up in between and no hw means to preempt then that design is screwed
up. We really can't tell the core vm that "sorry we will hold onto these
gobloads of memory you really need now for another few hours". Pinning
memory like that essentially without a time limit is restricted to root.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
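For readers unfamiliar with the policy Daniel describes, here is a hedged
sketch of a cooperative-scheduling watchdog. It is not the i915 hangcheck
code; every structure name, helper, and threshold below is hypothetical, and
it only shows the shape of the idea: kill work that overruns a short budget,
and ban contexts that hang repeatedly.

#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

#define JOB_TIMEOUT	(2 * HZ)	/* ~2 s, in the spirit of a TDR limit */
#define MAX_HANGS	3		/* ban the context after repeated hangs */

struct gfx_context {
	int	hang_count;
	bool	banned;			/* no further submissions accepted */
};

struct gfx_job {
	struct delayed_work	watchdog;
	struct gfx_context	*ctx;
	bool			done;
};

/* Stub standing in for whatever engine-reset path the driver provides. */
static void gpu_reset_engine(struct gfx_context *ctx) { }

static void job_watchdog_fn(struct work_struct *work)
{
	struct gfx_job *job = container_of(to_delayed_work(work),
					   struct gfx_job, watchdog);

	if (job->done)
		return;			/* finished in time, nothing to do */

	/* The job overran its budget: treat it as a hang. */
	if (++job->ctx->hang_count >= MAX_HANGS)
		job->ctx->banned = true;

	gpu_reset_engine(job->ctx);
}

static void gfx_job_submit(struct gfx_job *job)
{
	INIT_DELAYED_WORK(&job->watchdog, job_watchdog_fn);
	schedule_delayed_work(&job->watchdog, JOB_TIMEOUT);
	/* ... actual ring submission would go here ... */
}

Long-running compute jobs break exactly the "finishes within a few frames"
assumption this depends on, which is the disconnect Oded raises in the reply
below.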
* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23  7:05 ` Daniel Vetter
@ 2014-07-23  8:35 ` Oded Gabbay
  2014-07-23 13:33 ` Bridgman, John
  1 sibling, 0 replies; 49+ messages in thread
From: Oded Gabbay @ 2014-07-23 8:35 UTC (permalink / raw)
To: Daniel Vetter
Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
    Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
    Michel Dänzer, Ben Goz, Alexey Skidanov,
    linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
    linux-mm, Sellek, Tom

On 23/07/14 10:05, Daniel Vetter wrote:
> On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com> wrote:
[...]
> Well graphics gets away with cooperative scheduling since usually people
> want to see stuff within a few frames, so we can legitimately kill jobs
> after a fairly short timeout. Imo if you want to allow userspace to submit
> compute jobs that are atomic and take a few minutes to hours with no
> break-up in between and no hw means to preempt then that design is screwed
> up. We really can't tell the core vm that "sorry we will hold onto these
> gobloads of memory you really need now for another few hours". Pinning
> memory like that essentially without a time limit is restricted to root.
> -Daniel
>

First of all, I don't see the relation to memory pinning here. I already said
on this thread that amdkfd does NOT pin local memory. The only memory we
allocate is system memory, and we map it to the GART, and we can limit that
memory by limiting the max # of queues and max # of processes through kernel
parameters. Most of the memory used is allocated via regular means by
userspace, and it is usually pageable.

Second, it is important to remember that this problem only exists in KV. In
CZ, the GPU can context switch between waves (by doing mid-wave preemption),
so even long-running waves get switched on and off constantly and there is no
monopolizing of GPU resources.

Third, even in KV, we can kill waves. The question is when, and how to
recognize that we need to. I think it would be sufficient for now if we
expose this ability to the kernel.

	Oded
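As a sketch of that third point, here is one possible shape for exposing a
"kill this queue's waves" control to the kernel on hardware without mid-wave
preemption (KV). The function names, the drain/reset split, and the error
handling are all hypothetical; the real mechanism would sit behind the HIQ /
HW-scheduler interface mentioned in the cover letter.

#include <linux/errno.h>
#include <linux/types.h>

struct hsa_queue;	/* opaque per-queue state owned by the driver */

/*
 * Hypothetical hook provided by the HW-scheduler path, e.g. implemented by
 * submitting an unmap-queue packet on the HIQ. 'reset_wavefronts' selects
 * between draining in-flight waves and killing them outright.
 */
int hiq_unmap_queue(struct hsa_queue *q, bool reset_wavefronts);

/*
 * Kill a runaway queue: first ask the scheduler to drain and unmap it, and
 * only if the waves do not drain in time fall back to resetting the
 * wavefronts.
 */
static int kill_runaway_queue(struct hsa_queue *q)
{
	int ret;

	ret = hiq_unmap_queue(q, false);	/* graceful drain */
	if (ret == -ETIME)
		ret = hiq_unmap_queue(q, true);	/* forceful wavefront kill */

	return ret;
}

When and on what evidence to invoke something like this is exactly the policy
question left open in the discussion above.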
* RE: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23  7:05 ` Daniel Vetter
  2014-07-23  8:35 ` Oded Gabbay
@ 2014-07-23 13:33 ` Bridgman, John
  2014-07-23 14:41 ` Daniel Vetter
  1 sibling, 1 reply; 49+ messages in thread
From: Bridgman, John @ 2014-07-23 13:33 UTC (permalink / raw)
To: Daniel Vetter, Gabbay, Oded
Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
    Andrew Morton, Joerg Roedel, Lewycky, Andrew, Daenzer, Michel,
    Goz, Ben, Skidanov, Alexey, linux-kernel@vger.kernel.org,
    dri-devel@lists.freedesktop.org, linux-mm, Sellek, Tom

>-----Original Message-----
>From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch]
>Sent: Wednesday, July 23, 2014 3:06 AM
>To: Gabbay, Oded
>Cc: Jerome Glisse; Christian König; David Airlie; Alex Deucher; Andrew
>Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew; Daenzer, Michel;
>Goz, Ben; Skidanov, Alexey; linux-kernel@vger.kernel.org;
>dri-devel@lists.freedesktop.org; linux-mm; Sellek, Tom
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
[...]
>Well graphics gets away with cooperative scheduling since usually people
>want to see stuff within a few frames, so we can legitimately kill jobs after a
>fairly short timeout. Imo if you want to allow userspace to submit compute
>jobs that are atomic and take a few minutes to hours with no break-up in
>between and no hw means to preempt then that design is screwed up. We
>really can't tell the core vm that "sorry we will hold onto these gobloads of
>memory you really need now for another few hours". Pinning memory like
>that essentially without a time limit is restricted to root.

Hi Daniel;

I don't really understand the reference to "gobloads of memory". Unlike
radeon graphics, the userspace data for HSA applications is maintained in
pageable system memory and accessed via the IOMMUv2 (ATC/PRI). The IOMMUv2
driver and the mm subsystem take care of faulting in memory pages as needed;
nothing is long-term pinned.

The only pinned memory we are talking about here is the per-queue and
per-process data structures in the driver, which are tiny by comparison.
Oded provided the "hardware limits" (i.e. an insane number of processes &
threads) for context, but real-world limits will be one or two orders of
magnitude lower. Agreed, we should have included those limits in the initial
code; that would have made the "real world" memory footprint much more
visible.

Make sense ?

>-Daniel
>--
>Daniel Vetter
>Software Engineer, Intel Corporation
>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
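To illustrate why the user data can stay pageable: the driver only has to
associate the process address space with a PASID on the GPU, and the IOMMUv2
then resolves GPU accesses (and faults) through the normal process page
tables. The sketch below shows that association; the amd_iommu_v2 calls are
quoted from memory of the 3.16-era API, so treat the exact signatures as
assumptions, and the PASID limit and error handling are simplified.

#include <linux/amd-iommu.h>
#include <linux/pci.h>
#include <linux/sched.h>

#define ASSUMED_MAX_PASIDS 32	/* illustrative only */

/* Once per device: enable IOMMUv2 (ATS/PASID/PRI) handling for the GPU. */
static int enable_iommuv2_for_gpu(struct pci_dev *gpu_pdev)
{
	return amd_iommu_init_device(gpu_pdev, ASSUMED_MAX_PASIDS);
}

/*
 * Once per process: bind the current process's page tables to the PASID the
 * GPU will put on its ATC traffic. From then on, GPU accesses to user
 * pointers are translated (and faulted in) via the mm subsystem, so the user
 * data itself never has to be pinned.
 */
static int bind_process_to_gpu(struct pci_dev *gpu_pdev, int pasid)
{
	return amd_iommu_bind_pasid(gpu_pdev, pasid, current);
}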
* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 13:33 ` Bridgman, John
@ 2014-07-23 14:41 ` Daniel Vetter
  2014-07-23 15:06 ` Bridgman, John
  0 siblings, 1 reply; 49+ messages in thread
From: Daniel Vetter @ 2014-07-23 14:41 UTC (permalink / raw)
To: Bridgman, John
Cc: Daniel Vetter, Gabbay, Oded, Jerome Glisse, Christian König,
    David Airlie, Alex Deucher, Andrew Morton, Joerg Roedel,
    Lewycky, Andrew, Daenzer, Michel, Goz, Ben, Skidanov, Alexey,
    linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
    linux-mm, Sellek, Tom

On Wed, Jul 23, 2014 at 01:33:24PM +0000, Bridgman, John wrote:
[...]
> Hi Daniel;
>
> I don't really understand the reference to "gobloads of memory". Unlike
> radeon graphics, the userspace data for HSA applications is maintained
> in pageable system memory and accessed via the IOMMUv2 (ATC/PRI). The
> IOMMUv2 driver and mm subsystem takes care of faulting in memory pages
> as needed, nothing is long-term pinned.

Yeah, I've lost that part of the equation a bit, since I've always thought
that proper faulting support without preemption is not really possible. I
guess those platforms completely stall on a fault until the ptes are all
set up?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
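For context on what "stall until the ptes are set up" means on the CPU side,
here is an illustrative sketch of a PRI/PPR fault-service path: the IOMMU
driver queues the fault to a workqueue, faults the page in on behalf of the
process, and then completes the fault response so the stalled GPU access can
be retried. The structure and helper names are invented, and the
get_user_pages() signature is the 3.16-era one as best remembered, so treat
the details as assumptions rather than the actual amd_iommu_v2 code.

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

struct gpu_fault {
	struct work_struct	work;
	struct task_struct	*task;		/* process owning the PASID */
	struct mm_struct	*mm;		/* its address space */
	unsigned long		address;	/* faulting virtual address */
	bool			write;
};

static void service_gpu_fault(struct work_struct *work)
{
	struct gpu_fault *f = container_of(work, struct gpu_fault, work);
	struct page *page;
	long got;

	down_read(&f->mm->mmap_sem);
	/* Fault the page in exactly as a CPU access would. */
	got = get_user_pages(f->task, f->mm, f->address, 1,
			     f->write, 0, &page, NULL);
	up_read(&f->mm->mmap_sem);

	if (got == 1)
		put_page(page);	/* pte is populated; no long-term pin is kept */

	/*
	 * Here the driver would complete the PPR so the stalled GPU thread
	 * retries the access; the replies below discuss what else on the
	 * hardware side stalls while this path runs.
	 */
}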
* RE: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 14:41 ` Daniel Vetter
@ 2014-07-23 15:06 ` Bridgman, John
  2014-07-23 15:12 ` Bridgman, John
  0 siblings, 1 reply; 49+ messages in thread
From: Bridgman, John @ 2014-07-23 15:06 UTC (permalink / raw)
To: Daniel Vetter
Cc: Daniel Vetter, Gabbay, Oded, Jerome Glisse, Christian König,
    David Airlie, Alex Deucher, Andrew Morton, Joerg Roedel,
    Lewycky, Andrew, Daenzer, Michel, Goz, Ben, Skidanov, Alexey,
    linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
    linux-mm, Sellek, Tom

>-----Original Message-----
>From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel Vetter
>Sent: Wednesday, July 23, 2014 10:42 AM
>To: Bridgman, John
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
[...]
>> I don't really understand the reference to "gobloads of memory".
>> Unlike radeon graphics, the userspace data for HSA applications is
>> maintained in pageable system memory and accessed via the IOMMUv2
>> (ATC/PRI). The IOMMUv2 driver and mm subsystem takes care of faulting
>> in memory pages as needed, nothing is long-term pinned.
>
>Yeah I've lost that part of the equation a bit since I've always thought that
>proper faulting support without preemption is not really possible. I guess
>those platforms completely stall on a fault until the ptes are all set up?

Correct. The GPU thread accessing the faulted page definitely stalls, but
processing can continue on other GPU threads.

I don't remember offhand how much of the GPU => ATC => IOMMUv2 => system RAM
path gets stalled (i.e. whether other HSA apps get blocked), but AFAIK
graphics processing (assuming it is not using the ATC path to system memory)
is not affected. I will double-check that, though; I haven't asked internally
for a couple of years, but I do remember concluding something along the lines
of "OK, that'll do" ;)

>-Daniel
>--
>Daniel Vetter
>Software Engineer, Intel Corporation
>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
* RE: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 15:06 ` Bridgman, John
@ 2014-07-23 15:12 ` Bridgman, John
  0 siblings, 0 replies; 49+ messages in thread
From: Bridgman, John @ 2014-07-23 15:12 UTC (permalink / raw)
To: Daniel Vetter
Cc: Lewycky, Andrew, linux-mm, Daniel Vetter, Daenzer, Michel,
    linux-kernel@vger.kernel.org, Sellek, Tom, Skidanov, Alexey,
    dri-devel@lists.freedesktop.org, Andrew Morton

>-----Original Message-----
>From: dri-devel [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf
>Of Bridgman, John
>Sent: Wednesday, July 23, 2014 11:07 AM
>To: Daniel Vetter
>Subject: RE: [PATCH v2 00/25] AMDKFD kernel driver
>
[...]
>>Yeah I've lost that part of the equation a bit since I've always
>>thought that proper faulting support without preemption is not really
>>possible. I guess those platforms completely stall on a fault until the
>>ptes are all set up?
>
>Correct. The GPU thread accessing the faulted page definitely stalls but
>processing can continue on other GPU threads.

Sorry, this may be oversimplified -- I'll double-check that internally. We
may stall the CU for the duration of the fault processing. Stay tuned.

>
>I don't remember offhand how much of the GPU=>ATC=>IOMMUv2=>system RAM
>path gets stalled (ie whether other HSA apps get blocked) but AFAIK
>graphics processing (assuming it is not using ATC path to system memory)
>is not affected. I will double-check that though, haven't asked internally
>for a couple of years but I do remember concluding something along the
>lines of "OK, that'll do" ;)
Thread overview: 49+ messages
2014-07-17 13:57 [PATCH v2 00/25] AMDKFD kernel driver Oded Gabbay
2014-07-20 17:46 ` Jerome Glisse
2014-07-21  3:03 ` Jerome Glisse
2014-07-21  7:01 ` Daniel Vetter
2014-07-21  9:34 ` Christian König
2014-07-21 12:36 ` Oded Gabbay
2014-07-21 13:39 ` Christian König
2014-07-21 14:12 ` Oded Gabbay
2014-07-21 15:54 ` Jerome Glisse
2014-07-21 17:42 ` Oded Gabbay
2014-07-21 18:14 ` Jerome Glisse
2014-07-21 18:36 ` Oded Gabbay
2014-07-21 18:59 ` Jerome Glisse
2014-07-21 19:23 ` Oded Gabbay
2014-07-21 19:28 ` Jerome Glisse
2014-07-21 21:56 ` Oded Gabbay
2014-07-21 23:05 ` Jerome Glisse
2014-07-21 23:29 ` Bridgman, John
2014-07-21 23:36 ` Jerome Glisse
2014-07-22  8:05 ` Oded Gabbay
2014-07-22  7:23 ` Daniel Vetter
2014-07-22  8:10 ` Oded Gabbay
2014-07-21 15:25 ` Daniel Vetter
2014-07-21 15:58 ` Jerome Glisse
2014-07-21 17:05 ` Daniel Vetter
2014-07-21 17:28 ` Oded Gabbay
2014-07-21 18:22 ` Daniel Vetter
2014-07-21 18:41 ` Oded Gabbay
2014-07-21 19:03 ` Jerome Glisse
2014-07-22  7:28 ` Daniel Vetter
2014-07-22  7:40 ` Daniel Vetter
2014-07-22  8:21 ` Oded Gabbay
2014-07-22  8:19 ` Oded Gabbay
2014-07-22  9:21 ` Daniel Vetter
2014-07-22  9:24 ` Daniel Vetter
2014-07-22  9:52 ` Oded Gabbay
2014-07-22 11:15 ` Daniel Vetter
2014-07-23  6:50 ` Oded Gabbay
2014-07-23  7:04 ` Christian König
2014-07-23 13:39 ` Bridgman, John
2014-07-23 14:56 ` Jerome Glisse
2014-07-23 19:49 ` Alex Deucher
2014-07-23 20:25 ` Jerome Glisse
2014-07-23  7:05 ` Daniel Vetter
2014-07-23  8:35 ` Oded Gabbay
2014-07-23 13:33 ` Bridgman, John
2014-07-23 14:41 ` Daniel Vetter
2014-07-23 15:06 ` Bridgman, John
2014-07-23 15:12 ` Bridgman, John