* [RFC PATCH 00/10] drm/panthor: Add user submission
@ 2024-08-28 17:25 Mihail Atanassov
2024-08-28 17:25 ` [PATCH 1/8] drm/panthor: Add uAPI to submit from user space Mihail Atanassov
` (9 more replies)
0 siblings, 10 replies; 28+ messages in thread
From: Mihail Atanassov @ 2024-08-28 17:25 UTC (permalink / raw)
To: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel, Mihail Atanassov
Hello all,
This series implements a mechanism to expose Mali CSF GPUs' queue
ringbuffers directly to userspace, along with paraphernalia to allow
userspace to control job synchronisation between the CPU and GPU.
The goal of these changes is to allow userspace to control work
submission to the FW/HW directly without kernel intervention in the
common case, thereby reducing context switching overhead. It also allows
for greater flexibility in the way work is enqueued in the ringbufs.
For example, the current kernel submit path only supports indirect
calls, which is inefficient for small command buffers. Userspace can
also skip unnecessary sync operations.
This is still a work in progress; there's an outstanding issue with
multiple processes using different submission flows triggering
scheduling bugs (e.g. the same group getting scheduled twice), but we'd
love to gather some feedback on the suitability of the approach in
general and see if there's a clear path to merging something like this
eventually.
I've also CC'd the AMD maintainers because they have done something
similar in the past[1], in case they want to chime in.
There are two uses of this new uAPI in Mesa, one in gallium/panfrost
(link TBD), and one in panvk [2].
The Gallium implementation is a naïve change just to switch the
submission model and exercise the new kernel code, and we don't plan
on pursuing this at this time.
The panvk driver changes are, however, a better representation of the
intent behind this new uAPI, so please consider that the reference
userspace. It, too, is still very much a work in progress.
* patch 1 adds all the uAPI changes;
* patch 2 implements the GROUP_CREATE ioctl changes necessary to expose
the required objects to userspace;
* patch 3 maps the doorbell pages, similarly to how the user I/O page is
mapped;
* patch 4 implements GROUP_KICK, which lets userspace request an
inactive group to be scheduled on the GPU;
* patches 5 & 6 implement XGS queues, a way for userspace to
synchronise GPU queue progress with DRM syncobjs;
* patches 7 & 8 add notification mechanisms for user & kernel to signal
changes to native GPU syncobjs.
[1] https://lore.kernel.org/amd-gfx/CADnq5_N61q_o+5WYUZsZ=qu7VmeXTFHQSxLwTco05gLzHaiswA@mail.gmail.com/t/#m116a36a598d8fad1329e053974ad37a4dc0f28ed
[2] https://gitlab.freedesktop.org/larsivsi/mesa/-/commits/panvk-v10-usersubmit?ref_type=heads
Ketil Johnsen (7):
drm/panthor: Add uAPI to submit from user space
drm/panthor: Extend GROUP_CREATE for user submission
drm/panthor: Map doorbell pages
drm/panthor: Add GROUP_KICK ioctl
drm/panthor: Factor out syncobj handling
drm/panthor: Implement XGS queues
drm/panthor: Add SYNC_UPDATE ioctl
Mihail Atanassov (1):
drm/panthor: Add sync_update eventfd handling
drivers/gpu/drm/panthor/Makefile | 4 +-
drivers/gpu/drm/panthor/panthor_device.c | 66 ++-
drivers/gpu/drm/panthor/panthor_device.h | 35 +-
drivers/gpu/drm/panthor/panthor_drv.c | 233 +++++++-
drivers/gpu/drm/panthor/panthor_fw.c | 2 +-
drivers/gpu/drm/panthor/panthor_sched.c | 408 +++++++++-----
drivers/gpu/drm/panthor/panthor_sched.h | 8 +-
drivers/gpu/drm/panthor/panthor_syncobj.c | 167 ++++++
drivers/gpu/drm/panthor/panthor_syncobj.h | 27 +
drivers/gpu/drm/panthor/panthor_xgs.c | 638 ++++++++++++++++++++++
drivers/gpu/drm/panthor/panthor_xgs.h | 42 ++
include/uapi/drm/panthor_drm.h | 243 +++++++-
12 files changed, 1696 insertions(+), 177 deletions(-)
create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.c
create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.h
create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.c
create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.h
--
2.45.0
* [PATCH 1/8] drm/panthor: Add uAPI to submit from user space
From: Mihail Atanassov @ 2024-08-28 17:25 UTC (permalink / raw)
To: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel, Mihail Atanassov, Mary Guillemard
From: Ketil Johnsen <ketil.johnsen@arm.com>
* Extend GROUP_CREATE to allow the kernel to expose the ringbuf, user
I/O and doorbell pages for each queue, and add eventfds to signal sync
updates and group state changes
* Add a KICK ioctl to re-evaluate queues within a group
* Add cross-group sync (XGS) queues as a mechanism to synchronise
between groups' queues across a VM
* Add a SYNC_UPDATE ioctl for user to notify kernel that it needs to
re-evaluate groups and XGS queues; use eventfds for kernel-to-user
notifications
Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
Co-developed-by: Mihail Atanassov <mihail.atanassov@arm.com>
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
---
This is going to conflict with [1] due to the driver version bump.
[1] https://lore.kernel.org/dri-devel/20240819112508.67988-2-mary.guillemard@collabora.com/
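To make the XGS semantics concrete, here is a small user-space model of how the four operation types act on a 64-bit XGS object, as described by the kerneldoc below. This is a sketch of the uAPI semantics only, not the kernel's implementation; in the kernel, a WAIT whose condition doesn't hold parks the queue rather than returning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors enum drm_panthor_xgs_ops from the uAPI header. */
enum xgs_op { XGS_WAIT_LE, XGS_WAIT_GT, XGS_SIGNAL_SET, XGS_SIGNAL_ADD };

/*
 * Evaluate one XGS operation against the current value of an XGS object.
 * WAIT ops report whether the condition currently holds; SIGNAL ops
 * mutate the object and always "complete".
 */
static bool xgs_op_eval(enum xgs_op op, uint64_t *obj, uint64_t value)
{
	switch (op) {
	case XGS_WAIT_LE:    return *obj <= value;
	case XGS_WAIT_GT:    return *obj > value;
	case XGS_SIGNAL_SET: *obj = value;  return true;
	case XGS_SIGNAL_ADD: *obj += value; return true;
	}
	return false;
}
```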
drivers/gpu/drm/panthor/panthor_drv.c | 3 +-
drivers/gpu/drm/panthor/panthor_sched.c | 2 +-
include/uapi/drm/panthor_drm.h | 243 +++++++++++++++++++++++-
3 files changed, 244 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index b5e7b919f241..4f1efe616698 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1372,6 +1372,7 @@ static void panthor_debugfs_init(struct drm_minor *minor)
/*
* PanCSF driver version:
* - 1.0 - initial interface
+ * - 1.1 - add user submission path
*/
static const struct drm_driver panthor_drm_driver = {
.driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
@@ -1385,7 +1386,7 @@ static const struct drm_driver panthor_drm_driver = {
.desc = "Panthor DRM driver",
.date = "20230801",
.major = 1,
- .minor = 0,
+ .minor = 1,
.gem_create_object = panthor_gem_create_object,
.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 463bcd3cf00f..b2cf053b3601 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3089,7 +3089,7 @@ int panthor_group_create(struct panthor_file *pfile,
u32 gid, i, suspend_size;
int ret;
- if (group_args->pad)
+ if (group_args->flags & ~DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT)
return -EINVAL;
if (group_args->priority > PANTHOR_CSG_PRIORITY_HIGH)
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index 926b1deb1116..1a6a3877e5b3 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -127,6 +127,21 @@ enum drm_panthor_ioctl_id {
/** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
DRM_PANTHOR_TILER_HEAP_DESTROY,
+
+ /** @DRM_PANTHOR_GROUP_KICK: Re-evaluate group for scheduling. */
+ DRM_PANTHOR_GROUP_KICK,
+
+ /** @DRM_PANTHOR_XGS_QUEUE_CREATE: Create a cross-group sync queue. */
+ DRM_PANTHOR_XGS_QUEUE_CREATE,
+
+ /** @DRM_PANTHOR_XGS_QUEUE_DESTROY: Destroy a cross-group sync queue. */
+ DRM_PANTHOR_XGS_QUEUE_DESTROY,
+
+ /** @DRM_PANTHOR_XGS_QUEUE_SUBMIT: Submit a cross-group sync op to an XGS queue. */
+ DRM_PANTHOR_XGS_QUEUE_SUBMIT,
+
+ /** @DRM_PANTHOR_SYNC_UPDATE: Notify kernel that a HW syncobj has been updated. */
+ DRM_PANTHOR_SYNC_UPDATE,
};
/**
@@ -170,6 +185,16 @@ enum drm_panthor_ioctl_id {
DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create)
#define DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY \
DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy)
+#define DRM_IOCTL_PANTHOR_GROUP_KICK \
+ DRM_IOCTL_PANTHOR(WR, GROUP_KICK, group_kick)
+#define DRM_IOCTL_PANTHOR_XGS_QUEUE_CREATE \
+ DRM_IOCTL_PANTHOR(WR, XGS_QUEUE_CREATE, xgs_queue_create)
+#define DRM_IOCTL_PANTHOR_XGS_QUEUE_DESTROY \
+ DRM_IOCTL_PANTHOR(WR, XGS_QUEUE_DESTROY, xgs_queue_destroy)
+#define DRM_IOCTL_PANTHOR_XGS_QUEUE_SUBMIT \
+ DRM_IOCTL_PANTHOR(WR, XGS_QUEUE_SUBMIT, xgs_queue_submit)
+#define DRM_IOCTL_PANTHOR_SYNC_UPDATE \
+ DRM_IOCTL_PANTHOR(WR, SYNC_UPDATE, sync_update)
/**
* DOC: IOCTL arguments
@@ -680,6 +705,40 @@ struct drm_panthor_queue_create {
/** @ringbuf_size: Size of the ring buffer to allocate to this queue. */
__u32 ringbuf_size;
+
+ /**
+ * @ringbuf_offset: file offset to map the ring buffer
+ *
+ * Returns a file offset which can be used to map the queue's ring buffer.
+ * The size of this buffer is as requested with @ringbuf_size.
+ *
+ * Only valid if DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT is set in
+ * the respective drm_panthor_group_create::flags, 0 otherwise.
+ */
+ __u64 ringbuf_offset;
+
+ /**
+ * @user_io_offset: file offset to map user input/output pages.
+ *
+ * Returns a file offset which can be used to map the user input page
+ * and the user output page (two consecutive pages).
+ *
+ * Only valid if DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT is set in
+ * the respective drm_panthor_group_create::flags, 0 otherwise.
+ */
+ __u64 user_io_offset;
+
+ /**
+ * @doorbell_offset: file offset to map doorbell page
+ *
+ * Returns a file offset which can be used to map the queue's HW doorbell page.
+ * Note: multiple queues can share the same HW doorbell page, in which case
+ * this offset will be the same for each of them.
+ *
+ * Only valid if DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT is set in
+ * the respective drm_panthor_group_create::flags, 0 otherwise.
+ */
+ __u64 doorbell_offset;
};
/**
@@ -696,6 +755,16 @@ enum drm_panthor_group_priority {
PANTHOR_GROUP_PRIORITY_HIGH,
};
+/**
+ * enum drm_panthor_group_create_flags - Flags for group creation
+ */
+enum drm_panthor_group_create_flags {
+ /**
+ * @DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT: Enable user submission
+ */
+ DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT = 1 << 0,
+};
+
/**
* struct drm_panthor_group_create - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_CREATE
*/
@@ -730,8 +799,8 @@ struct drm_panthor_group_create {
/** @priority: Group priority (see enum drm_panthor_group_priority). */
__u8 priority;
- /** @pad: Padding field, MBZ. */
- __u32 pad;
+ /** @flags: Combination of drm_panthor_group_create_flags flags */
+ __u32 flags;
/**
* @compute_core_mask: Mask encoding cores that can be used for compute jobs.
@@ -772,6 +841,23 @@ struct drm_panthor_group_create {
* destroying a group.
*/
__u32 group_handle;
+
+ /**
+ * @eventfd_sync_update: eventfd to increment for group SYNC_UPDATE events
+ *
+ * Only valid when DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT is set.
+ * Can be set to -1 if the UMD doesn't want to receive these events.
+ */
+ __s32 eventfd_sync_update;
+
+ /**
+ * @eventfd_group_state: eventfd to increment when group state has been changed
+ *
+ * Only valid when DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT is set.
+ * Can be set to -1 if the UMD doesn't want to receive these events.
+ * DRM_IOCTL_PANTHOR_GROUP_GET_STATE should be used to retrieve the new group state.
+ */
+ __s32 eventfd_group_state;
};
/**
@@ -955,6 +1041,159 @@ struct drm_panthor_tiler_heap_destroy {
__u32 pad;
};
+/**
+ * struct drm_panthor_group_kick - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_KICK
+ */
+struct drm_panthor_group_kick {
+ /** @handle: handle to group */
+ __u32 handle;
+
+ /** @queue_mask: mask with queues to kick */
+ __u32 queue_mask;
+};
+
+/**
+ * struct drm_panthor_xgs_queue_create - Arguments passed to DRM_IOCTL_PANTHOR_XGS_QUEUE_CREATE
+ */
+struct drm_panthor_xgs_queue_create {
+ /** @vm_id: VM id to associate this XGS queue with */
+ __u32 vm_id;
+
+ /**
+ * @eventfd_sync_update: eventfd to increment when an XGS object has been set
+ *
+ * Can be set to -1 if the UMD doesn't want to receive these events.
+ */
+ __s32 eventfd_sync_update;
+
+ /** @handle: Returned handle to the XGS queue created */
+ __u32 handle;
+
+ /** @pad: MBZ */
+ __u32 pad;
+};
+
+/**
+ * struct drm_panthor_xgs_queue_destroy - Arguments passed to DRM_IOCTL_PANTHOR_XGS_QUEUE_DESTROY
+ */
+struct drm_panthor_xgs_queue_destroy {
+ /** @handle: handle to XGS queue to destroy */
+ __u32 handle;
+
+ /** @pad: MBZ */
+ __u32 pad;
+};
+
+/**
+ * enum drm_panthor_xgs_ops - Types of XGS operations
+ */
+enum drm_panthor_xgs_ops {
+ /** @DRM_PANTHOR_XGS_OP_WAIT_LE: Wait for the XGS object to be less than or equal to the specified value */
+ DRM_PANTHOR_XGS_OP_WAIT_LE = 0,
+
+ /** @DRM_PANTHOR_XGS_OP_WAIT_GT: Wait for the XGS object to be greater than the specified value */
+ DRM_PANTHOR_XGS_OP_WAIT_GT,
+
+ /** @DRM_PANTHOR_XGS_OP_SIGNAL_SET: Set the XGS object to the specified value */
+ DRM_PANTHOR_XGS_OP_SIGNAL_SET,
+
+ /** @DRM_PANTHOR_XGS_OP_SIGNAL_ADD: Add the specified value to the XGS object */
+ DRM_PANTHOR_XGS_OP_SIGNAL_ADD
+};
+
+/**
+ * enum drm_panthor_xgs_op_format - Format of an XGS object
+ */
+enum drm_panthor_xgs_op_format {
+ /** @DRM_PANTHOR_XGS_OP_FORMAT_U32: XGS object is of 32-bit format */
+ DRM_PANTHOR_XGS_OP_FORMAT_U32 = 0,
+
+ /** @DRM_PANTHOR_XGS_OP_FORMAT_U64: XGS object is of 64-bit format */
+ DRM_PANTHOR_XGS_OP_FORMAT_U64
+};
+
+/**
+ * enum drm_panthor_xgs_op_flags - Flags for a XGS operation
+ */
+enum drm_panthor_xgs_op_flags {
+ /**
+ * @DRM_PANTHOR_XGS_OP_INHERIT_ERROR_WAIT: Propagate XGS wait errors to the XGS queue
+ *
+ * A WAIT operation on an XGS object which is signalled with an error will put
+ * this XGS queue in an error state. The error state is cleared on the next XGS
+ * submit with either the DRM_PANTHOR_XGS_QUEUE_SUBMIT_ERROR_BARRIER_PRE or
+ * DRM_PANTHOR_XGS_QUEUE_SUBMIT_ERROR_BARRIER_POST flag set.
+ *
+ * Note: errors during SIGNAL operations will always put the XGS queue in an
+ * error state.
+ */
+ DRM_PANTHOR_XGS_OP_INHERIT_ERROR_WAIT = 1 << 0
+};
+
+/**
+ * struct drm_panthor_xgs_op - XGS operation
+ */
+struct drm_panthor_xgs_op {
+ /** @addr: GPU address of XGS object */
+ __u64 addr;
+
+ /** @value: value to WAIT for or SET/ADD */
+ __u64 value;
+
+ /** @op: operation from enum drm_panthor_xgs_ops */
+ __u8 op;
+
+ /** @format: format from enum drm_panthor_xgs_op_format */
+ __u8 format;
+
+ /** @flags: flags from enum drm_panthor_xgs_op_flags */
+ __u16 flags;
+
+ /** @pad: MBZ */
+ __u32 pad;
+};
+
+/**
+ * enum drm_panthor_xgs_queue_submit_flags - Flags for XGS queue submission
+ */
+enum drm_panthor_xgs_queue_submit_flags {
+ /** @DRM_PANTHOR_XGS_QUEUE_SUBMIT_ERROR_BARRIER_PRE: Error barrier pre operation
+ *
+ * Clear queue error state before processing the XGS operations
+ */
+ DRM_PANTHOR_XGS_QUEUE_SUBMIT_ERROR_BARRIER_PRE = 1 << 0,
+
+ /** @DRM_PANTHOR_XGS_QUEUE_SUBMIT_ERROR_BARRIER_POST: Error barrier post operation
+ *
+ * Clear queue error state after processing the XGS operations
+ */
+ DRM_PANTHOR_XGS_QUEUE_SUBMIT_ERROR_BARRIER_POST = 1 << 1
+};
+
+/**
+ * struct drm_panthor_xgs_queue_submit - Arguments passed to DRM_IOCTL_PANTHOR_XGS_QUEUE_SUBMIT
+ */
+struct drm_panthor_xgs_queue_submit {
+ /** @handle: XGS queue handle to submit work to */
+ __u32 handle;
+
+ /** @flags: flags from enum drm_panthor_xgs_queue_submit_flags */
+ __u32 flags;
+
+ /** @ops: Array of struct drm_panthor_xgs_op operations */
+ struct drm_panthor_obj_array ops;
+
+ /** @syncs: Array of struct drm_panthor_sync_op sync operations */
+ struct drm_panthor_obj_array syncs;
+};
+
+/**
+ * struct drm_panthor_sync_update - Arguments passed to DRM_IOCTL_PANTHOR_SYNC_UPDATE
+ */
+struct drm_panthor_sync_update {
+ /** @pad: MBZ */
+ __u64 pad;
+};
+
#if defined(__cplusplus)
}
#endif
--
2.45.0
* [PATCH 2/8] drm/panthor: Extend GROUP_CREATE for user submission
From: Mihail Atanassov @ 2024-08-28 17:25 UTC (permalink / raw)
To: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel, Mihail Atanassov
From: Ketil Johnsen <ketil.johnsen@arm.com>
Allow userspace to mmap() the ring buffer, and the doorbell and user I/O
pages, so that it can submit work directly to queues.
Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
Co-developed-by: Mihail Atanassov <mihail.atanassov@arm.com>
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
Co-developed-by: Akash Goel <akash.goel@arm.com>
Signed-off-by: Akash Goel <akash.goel@arm.com>
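One detail worth calling out: the return path copies the per-queue mmap offsets back through the new panthor_set_uobj_array() helper, which tolerates a user-side stride different from the kernel object size so older and newer UMDs can interoperate. A plain user-space model of that stride-tolerant copy, with memcpy()/memset() standing in for copy_to_user()/clear_user(), purely for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Model of the kernel->user array copy: 'out' elements are 'stride' bytes
 * apart, 'in' elements are 'obj_size' bytes apart. When the user-side
 * struct is larger (newer UMD on older KMD), the excess bytes are zeroed,
 * mirroring clear_user() in the real helper.
 */
static int set_obj_array(void *out, uint32_t stride, uint32_t count,
			 const void *in, uint32_t obj_size, uint32_t min_stride)
{
	if (stride < min_stride)
		return -1; /* models the -EINVAL check */

	uint32_t cpy = stride < obj_size ? stride : obj_size;

	for (uint32_t i = 0; i < count; i++) {
		char *dst = (char *)out + (size_t)i * stride;

		memcpy(dst, (const char *)in + (size_t)i * obj_size, cpy);
		if (stride > obj_size)
			memset(dst + cpy, 0, stride - obj_size);
	}
	return 0;
}
```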
---
drivers/gpu/drm/panthor/panthor_device.h | 24 ++++
drivers/gpu/drm/panthor/panthor_drv.c | 69 ++++++++++-
drivers/gpu/drm/panthor/panthor_sched.c | 151 ++++++++++++++++++-----
drivers/gpu/drm/panthor/panthor_sched.h | 4 +-
4 files changed, 209 insertions(+), 39 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index e388c0472ba7..7c27dbba8270 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -171,6 +171,9 @@ struct panthor_file {
/** @ptdev: Device attached to this file. */
struct panthor_device *ptdev;
+ /** @drm_file: Corresponding drm_file */
+ struct drm_file *drm_file;
+
/** @vms: VM pool attached to this file. */
struct panthor_vm_pool *vms;
@@ -353,6 +356,27 @@ static int panthor_request_ ## __name ## _irq(struct panthor_device *ptdev, \
pirq); \
}
+/*
+ * We currently set the maximum number of groups per file to an arbitrarily low value.
+ * But this can be updated if we need more.
+ */
+#define MAX_GROUPS_PER_POOL 128
+
+/*
+ * The maximum page size supported by the driver, when running with a smaller
+ * page size only the first page at this offset is valid.
+ */
+#define DRM_PANTHOR_MAX_PAGE_SHIFT 16
+
+/* Carve out private MMIO offsets */
+#define PANTHOR_PRIVATE_MMIO_OFFSET (DRM_PANTHOR_USER_MMIO_OFFSET + (1ull << 42))
+
+/* Give out file offset for doorbell pages based on the group handle */
+#define PANTHOR_DOORBELL_OFFSET(group) (PANTHOR_PRIVATE_MMIO_OFFSET + \
+ ((group) << DRM_PANTHOR_MAX_PAGE_SHIFT))
+#define PANTHOR_DOORBELL_OFFSET_START PANTHOR_DOORBELL_OFFSET(0)
+#define PANTHOR_DOORBELL_OFFSET_END PANTHOR_DOORBELL_OFFSET(MAX_GROUPS_PER_POOL)
+
extern struct workqueue_struct *panthor_cleanup_wq;
#endif
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 4f1efe616698..0bd600c464b8 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -67,6 +67,54 @@ panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const v
return 0;
}
+/**
+ * panthor_set_uobj_array() - Copy a kernel object array into a user object array.
+ * @out: The object array to copy to.
+ * @min_stride: Minimum array stride.
+ * @obj_size: Kernel object size.
+ * @in: The kernel source array to copy from.
+ *
+ * Helper automating kernel -> user object copies.
+ *
+ * Don't use this function directly, use PANTHOR_UOBJ_SET_ARRAY() instead.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_set_uobj_array(const struct drm_panthor_obj_array *out, u32 min_stride, u32 obj_size,
+ const void *in)
+{
+ if (out->stride < min_stride)
+ return -EINVAL;
+
+ if (!out->count)
+ return 0;
+
+ if (obj_size == out->stride) {
+ if (copy_to_user(u64_to_user_ptr(out->array), in,
+ (unsigned long)obj_size * out->count))
+ return -EFAULT;
+ } else {
+ u32 cpy_elem_size = min_t(u32, out->stride, obj_size);
+ void __user *out_ptr = u64_to_user_ptr(out->array);
+ const void *in_ptr = in;
+
+ for (u32 i = 0; i < out->count; i++) {
+ if (copy_to_user(out_ptr, in_ptr, cpy_elem_size))
+ return -EFAULT;
+
+ if (out->stride > obj_size &&
+ clear_user(out_ptr + cpy_elem_size, out->stride - obj_size)) {
+ return -EFAULT;
+ }
+
+ out_ptr += out->stride;
+ in_ptr += obj_size;
+ }
+ }
+
+ return 0;
+}
+
/**
* panthor_get_uobj_array() - Copy a user object array into a kernel accessible object array.
* @in: The object array to copy.
@@ -182,6 +230,20 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
PANTHOR_UOBJ_MIN_SIZE(_src_obj), \
sizeof(_src_obj), &(_src_obj))
+/**
+ * PANTHOR_UOBJ_SET_ARRAY() - Copies from _src_array to @_dest_drm_panthor_obj_array.array.
+ * @_dest_drm_panthor_obj_array: The &struct drm_panthor_obj_array containing a __u64 raw
+ * pointer to the destination C array in user space and the size of each array
+ * element in user space (the 'stride').
+ * @_src_array: The source C array object in kernel space.
+ *
+ * Return: Error code. See panthor_set_uobj_array().
+ */
+#define PANTHOR_UOBJ_SET_ARRAY(_dest_drm_panthor_obj_array, _src_array) \
+ panthor_set_uobj_array(_dest_drm_panthor_obj_array, \
+ PANTHOR_UOBJ_MIN_SIZE((_src_array)[0]), \
+ sizeof((_src_array)[0]), _src_array)
+
/**
* PANTHOR_UOBJ_GET_ARRAY() - Copy a user object array to a kernel accessible
* object array.
@@ -1012,10 +1074,8 @@ static int panthor_ioctl_group_create(struct drm_device *ddev, void *data,
return ret;
ret = panthor_group_create(pfile, args, queue_args);
- if (ret >= 0) {
- args->group_handle = ret;
- ret = 0;
- }
+ if (!ret)
+ ret = PANTHOR_UOBJ_SET_ARRAY(&args->queues, queue_args);
kvfree(queue_args);
return ret;
@@ -1262,6 +1322,7 @@ panthor_open(struct drm_device *ddev, struct drm_file *file)
}
pfile->ptdev = ptdev;
+ pfile->drm_file = file;
ret = panthor_vm_pool_create(pfile);
if (ret)
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index b2cf053b3601..ad160a821957 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -384,6 +384,21 @@ struct panthor_queue {
*/
u8 doorbell_id;
+ /** @doorbell_offset: file offset user space can use to map the doorbell page */
+ u64 doorbell_offset;
+
+ /** @ringbuf_offset: file offset user space can use to map the ring buffer
+ *
+ * Only valid when group is created with DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT
+ */
+ u64 ringbuf_offset;
+
+ /** @user_io_offset: file offset user space can use to map the two user IO pages
+ *
+ * Only valid when group is created with DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT
+ */
+ u64 user_io_offset;
+
/**
* @priority: Priority of the queue inside the group.
*
@@ -524,6 +539,12 @@ struct panthor_group {
/** @ptdev: Device. */
struct panthor_device *ptdev;
+ /** @pfile: associated panthor_file */
+ struct panthor_file *pfile;
+
+ /** @handle: integer value used by user to refer to this group */
+ u32 handle;
+
/** @vm: VM bound to the group. */
struct panthor_vm *vm;
@@ -548,6 +569,9 @@ struct panthor_group {
/** @priority: Group priority (check panthor_csg_priority). */
u8 priority;
+ /** @user_submit: true if user space controls submission */
+ bool user_submit;
+
/** @blocked_queues: Bitmask reflecting the blocked queues. */
u32 blocked_queues;
@@ -708,12 +732,6 @@ struct panthor_group {
mod_delayed_work((sched)->wq, &(sched)->wname ## _work, delay); \
} while (0)
-/*
- * We currently set the maximum of groups per file to an arbitrary low value.
- * But this can be updated if we need more.
- */
-#define MAX_GROUPS_PER_POOL 128
-
/**
* struct panthor_group_pool - Group pool
*
@@ -836,6 +854,12 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
panthor_queue_put_syncwait_obj(queue);
+ if (queue->ringbuf_offset)
+ drm_vma_node_revoke(&queue->ringbuf->obj->vma_node, group->pfile->drm_file);
+
+ if (queue->user_io_offset)
+ drm_vma_node_revoke(&queue->iface.mem->obj->vma_node, group->pfile->drm_file);
+
panthor_kernel_bo_destroy(queue->ringbuf);
panthor_kernel_bo_destroy(queue->iface.mem);
@@ -1552,7 +1576,7 @@ static void csg_slot_sync_update_locked(struct panthor_device *ptdev,
lockdep_assert_held(&ptdev->scheduler->lock);
- if (group)
+ if (group && !group->user_submit)
group_queue_work(group, sync_upd);
sched_queue_work(ptdev->scheduler, sync_upd);
@@ -2019,10 +2043,12 @@ group_term_post_processing(struct panthor_group *group)
}
spin_unlock(&queue->fence_ctx.lock);
- /* Manually update the syncobj seqno to unblock waiters. */
- syncobj = group->syncobjs->kmap + (i * sizeof(*syncobj));
- syncobj->status = ~0;
- syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
+ if (!group->user_submit) {
+ /* Manually update the syncobj seqno to unblock waiters. */
+ syncobj = group->syncobjs->kmap + (i * sizeof(*syncobj));
+ syncobj->status = ~0;
+ syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
+ }
sched_queue_work(group->ptdev->scheduler, sync_upd);
}
dma_fence_end_signalling(cookie);
@@ -2785,6 +2811,9 @@ static void group_sync_upd_work(struct work_struct *work)
u32 queue_idx;
bool cookie;
+ if (drm_WARN_ON(&group->ptdev->base, group->user_submit))
+ return;
+
cookie = dma_fence_begin_signalling();
for (queue_idx = 0; queue_idx < group->queue_count; queue_idx++) {
struct panthor_queue *queue = group->queues[queue_idx];
@@ -3021,6 +3050,9 @@ group_create_queue(struct panthor_group *group,
if (args->priority > CSF_MAX_QUEUE_PRIO)
return ERR_PTR(-EINVAL);
+ if (args->ringbuf_offset || args->user_io_offset || args->doorbell_offset)
+ return ERR_PTR(-EINVAL);
+
queue = kzalloc(sizeof(*queue), GFP_KERNEL);
if (!queue)
return ERR_PTR(-ENOMEM);
@@ -3046,6 +3078,20 @@ group_create_queue(struct panthor_group *group,
if (ret)
goto err_free_queue;
+ if (group->user_submit) {
+ ret = drm_vma_node_allow(&queue->ringbuf->obj->vma_node, group->pfile->drm_file);
+ if (ret)
+ goto err_free_queue;
+
+ ret = drm_gem_create_mmap_offset(queue->ringbuf->obj);
+ if (ret) {
+ drm_vma_node_revoke(&queue->ringbuf->obj->vma_node, group->pfile->drm_file);
+ goto err_free_queue;
+ }
+
+ queue->ringbuf_offset = drm_vma_node_offset_addr(&queue->ringbuf->obj->vma_node);
+ }
+
queue->iface.mem = panthor_fw_alloc_queue_iface_mem(group->ptdev,
&queue->iface.input,
&queue->iface.output,
@@ -3056,6 +3102,21 @@ group_create_queue(struct panthor_group *group,
goto err_free_queue;
}
+ if (group->user_submit) {
+ ret = drm_vma_node_allow(&queue->iface.mem->obj->vma_node, group->pfile->drm_file);
+ if (ret)
+ goto err_free_queue;
+
+ ret = drm_gem_create_mmap_offset(queue->iface.mem->obj);
+ if (ret) {
+ drm_vma_node_revoke(&queue->iface.mem->obj->vma_node,
+ group->pfile->drm_file);
+ goto err_free_queue;
+ }
+
+ queue->user_io_offset = drm_vma_node_offset_addr(&queue->iface.mem->obj->vma_node);
+ }
+
ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
group->ptdev->scheduler->wq, 1,
args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
@@ -3075,11 +3136,9 @@ group_create_queue(struct panthor_group *group,
return ERR_PTR(ret);
}
-#define MAX_GROUPS_PER_POOL 128
-
int panthor_group_create(struct panthor_file *pfile,
- const struct drm_panthor_group_create *group_args,
- const struct drm_panthor_queue_create *queue_args)
+ struct drm_panthor_group_create *group_args,
+ struct drm_panthor_queue_create *queue_args)
{
struct panthor_device *ptdev = pfile->ptdev;
struct panthor_group_pool *gpool = pfile->groups;
@@ -3115,6 +3174,7 @@ int panthor_group_create(struct panthor_file *pfile,
group->csg_id = -1;
group->ptdev = ptdev;
+ group->pfile = pfile;
group->max_compute_cores = group_args->max_compute_cores;
group->compute_core_mask = group_args->compute_core_mask;
group->max_fragment_cores = group_args->max_fragment_cores;
@@ -3130,6 +3190,9 @@ int panthor_group_create(struct panthor_file *pfile,
INIT_WORK(&group->tiler_oom_work, group_tiler_oom_work);
INIT_WORK(&group->release_work, group_release_work);
+ if (group_args->flags & DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT)
+ group->user_submit = true;
+
group->vm = panthor_vm_pool_get_vm(pfile->vms, group_args->vm_id);
if (!group->vm) {
ret = -EINVAL;
@@ -3152,25 +3215,27 @@ int panthor_group_create(struct panthor_file *pfile,
goto err_put_group;
}
- group->syncobjs = panthor_kernel_bo_create(ptdev, group->vm,
- group_args->queues.count *
- sizeof(struct panthor_syncobj_64b),
- DRM_PANTHOR_BO_NO_MMAP,
- DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
- DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
- PANTHOR_VM_KERNEL_AUTO_VA);
- if (IS_ERR(group->syncobjs)) {
- ret = PTR_ERR(group->syncobjs);
- goto err_put_group;
+ if (!group->user_submit) {
+ group->syncobjs = panthor_kernel_bo_create(ptdev, group->vm,
+ group_args->queues.count *
+ sizeof(struct panthor_syncobj_64b),
+ DRM_PANTHOR_BO_NO_MMAP,
+ DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
+ DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
+ PANTHOR_VM_KERNEL_AUTO_VA);
+ if (IS_ERR(group->syncobjs)) {
+ ret = PTR_ERR(group->syncobjs);
+ goto err_put_group;
+ }
+
+ ret = panthor_kernel_bo_vmap(group->syncobjs);
+ if (ret)
+ goto err_put_group;
+
+ memset(group->syncobjs->kmap, 0,
+ group_args->queues.count * sizeof(struct panthor_syncobj_64b));
}
- ret = panthor_kernel_bo_vmap(group->syncobjs);
- if (ret)
- goto err_put_group;
-
- memset(group->syncobjs->kmap, 0,
- group_args->queues.count * sizeof(struct panthor_syncobj_64b));
-
for (i = 0; i < group_args->queues.count; i++) {
group->queues[i] = group_create_queue(group, &queue_args[i]);
if (IS_ERR(group->queues[i])) {
@@ -3188,6 +3253,21 @@ int panthor_group_create(struct panthor_file *pfile,
if (ret)
goto err_put_group;
+ group->handle = gid;
+ group_args->group_handle = gid;
+
+ if (group->user_submit) {
+ for (i = 0; i < group_args->queues.count; i++) {
+ /* All queues in group use the same HW doorbell */
+ group->queues[i]->doorbell_offset = PANTHOR_DOORBELL_OFFSET(gid - 1);
+
+ /* copy to queue_args so these values can be returned to user */
+ queue_args[i].doorbell_offset = group->queues[i]->doorbell_offset;
+ queue_args[i].ringbuf_offset = group->queues[i]->ringbuf_offset;
+ queue_args[i].user_io_offset = group->queues[i]->user_io_offset;
+ }
+ }
+
mutex_lock(&sched->reset.lock);
if (atomic_read(&sched->reset.in_progress)) {
panthor_group_stop(group);
@@ -3199,7 +3279,7 @@ int panthor_group_create(struct panthor_file *pfile,
}
mutex_unlock(&sched->reset.lock);
- return gid;
+ return 0;
err_put_group:
group_put(group);
@@ -3390,6 +3470,11 @@ panthor_job_create(struct panthor_file *pfile,
goto err_put_job;
}
+ if (job->group->user_submit) {
+ ret = -EINVAL;
+ goto err_put_job;
+ }
+
if (job->queue_idx >= job->group->queue_count ||
!job->group->queues[job->queue_idx]) {
ret = -EINVAL;
diff --git a/drivers/gpu/drm/panthor/panthor_sched.h b/drivers/gpu/drm/panthor/panthor_sched.h
index 3a30d2328b30..55b6534fa390 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.h
+++ b/drivers/gpu/drm/panthor/panthor_sched.h
@@ -19,8 +19,8 @@ struct panthor_group_pool;
struct panthor_job;
int panthor_group_create(struct panthor_file *pfile,
- const struct drm_panthor_group_create *group_args,
- const struct drm_panthor_queue_create *queue_args);
+ struct drm_panthor_group_create *group_args,
+ struct drm_panthor_queue_create *queue_args);
int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle);
int panthor_group_get_state(struct panthor_file *pfile,
struct drm_panthor_group_get_state *get_state);
--
2.45.0
* [PATCH 3/8] drm/panthor: Map doorbell pages
From: Mihail Atanassov @ 2024-08-28 17:25 UTC (permalink / raw)
To: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel, Mihail Atanassov
From: Ketil Johnsen <ketil.johnsen@arm.com>
Dinging the doorbell directly from userspace allows the user driver to
bypass the kernel if the group is already assigned and active.
Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
---
drivers/gpu/drm/panthor/panthor_device.c | 66 ++++++++++++++++++------
drivers/gpu/drm/panthor/panthor_device.h | 7 ++-
drivers/gpu/drm/panthor/panthor_drv.c | 3 +-
drivers/gpu/drm/panthor/panthor_fw.c | 2 +-
drivers/gpu/drm/panthor/panthor_sched.c | 46 ++++++++++++++++-
drivers/gpu/drm/panthor/panthor_sched.h | 2 +
6 files changed, 104 insertions(+), 22 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 4082c8f2951d..3c376e21bd2e 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -179,6 +179,14 @@ int panthor_device_init(struct panthor_device *ptdev)
if (ret)
return ret;
+ ptdev->dummy_doorbell_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!ptdev->dummy_doorbell_page)
+ return -ENOMEM;
+
+ ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_free_page,
+ page_address(ptdev->dummy_doorbell_page));
+ if (ret)
+ return ret;
/*
* Set the dummy page holding the latest flush to 1. This will cause the
* flush to be avoided, as we know it isn't necessary if the submission
@@ -343,41 +351,58 @@ const char *panthor_exception_name(struct panthor_device *ptdev, u32 exception_c
static vm_fault_t panthor_mmio_vm_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct panthor_device *ptdev = vma->vm_private_data;
+ struct panthor_file *pfile = vma->vm_private_data;
+ struct panthor_device *ptdev = pfile->ptdev;
u64 offset = (u64)vma->vm_pgoff << PAGE_SHIFT;
unsigned long pfn;
pgprot_t pgprot;
vm_fault_t ret;
bool active;
int cookie;
+ u32 group_handle;
+ u8 doorbell_id;
if (!drm_dev_enter(&ptdev->base, &cookie))
return VM_FAULT_SIGBUS;
- mutex_lock(&ptdev->pm.mmio_lock);
- active = atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE;
-
switch (offset) {
case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
- if (active)
+ mutex_lock(&ptdev->pm.mmio_lock);
+
+ pgprot = vma->vm_page_prot;
+
+ active = atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE;
+ if (active) {
pfn = __phys_to_pfn(ptdev->phys_addr + CSF_GPU_LATEST_FLUSH_ID);
- else
+ pgprot = pgprot_noncached(pgprot);
+ } else {
pfn = page_to_pfn(ptdev->pm.dummy_latest_flush);
+ }
+
+ ret = vmf_insert_pfn_prot(vma, vmf->address, pfn, pgprot);
+
+ mutex_unlock(&ptdev->pm.mmio_lock);
+
+ break;
+
+ case PANTHOR_DOORBELL_OFFSET_START ... (PANTHOR_DOORBELL_OFFSET_END - 1):
+ group_handle = PANTHOR_DOORBELL_GROUP_FROM_OFFSET(offset) + 1;
+
+ doorbell_id = panthor_sched_doorbell_id(pfile, group_handle);
+ if (doorbell_id != (u8)-1)
+ pfn = PHYS_PFN(ptdev->phys_addr + CSF_DOORBELL(doorbell_id));
+ else
+ pfn = page_to_pfn(ptdev->dummy_doorbell_page);
+
+ ret = vmf_insert_pfn_prot(vma, vmf->address, pfn,
+ pgprot_device(vma->vm_page_prot));
+
break;
default:
ret = VM_FAULT_SIGBUS;
- goto out_unlock;
}
- pgprot = vma->vm_page_prot;
- if (active)
- pgprot = pgprot_noncached(pgprot);
-
- ret = vmf_insert_pfn_prot(vma, vmf->address, pfn, pgprot);
-
-out_unlock:
- mutex_unlock(&ptdev->pm.mmio_lock);
drm_dev_exit(cookie);
return ret;
}
@@ -386,7 +411,7 @@ static const struct vm_operations_struct panthor_mmio_vm_ops = {
.fault = panthor_mmio_vm_fault,
};
-int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *vma)
+int panthor_device_mmap_io(struct panthor_file *pfile, struct vm_area_struct *vma)
{
u64 offset = (u64)vma->vm_pgoff << PAGE_SHIFT;
@@ -398,12 +423,19 @@ int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *
break;
+ case PANTHOR_DOORBELL_OFFSET_START ... (PANTHOR_DOORBELL_OFFSET_END - 1):
+ if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
+ (vma->vm_flags & (VM_READ | VM_EXEC)))
+ return -EINVAL;
+
+ break;
+
default:
return -EINVAL;
}
/* Defer actual mapping to the fault handler. */
- vma->vm_private_data = ptdev;
+ vma->vm_private_data = pfile;
vma->vm_ops = &panthor_mmio_vm_ops;
vm_flags_set(vma,
VM_IO | VM_DONTCOPY | VM_DONTEXPAND |
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 7c27dbba8270..87cce384e36a 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -162,6 +162,9 @@ struct panthor_device {
*/
struct page *dummy_latest_flush;
} pm;
+
+ /** @dummy_doorbell_page: dummy doorbell page when queue is not on HW */
+ struct page *dummy_doorbell_page;
};
/**
@@ -204,7 +207,7 @@ static inline bool panthor_device_reset_is_pending(struct panthor_device *ptdev)
return atomic_read(&ptdev->reset.pending) != 0;
}
-int panthor_device_mmap_io(struct panthor_device *ptdev,
+int panthor_device_mmap_io(struct panthor_file *pfile,
struct vm_area_struct *vma);
int panthor_device_resume(struct device *dev);
@@ -376,6 +379,8 @@ static int panthor_request_ ## __name ## _irq(struct panthor_device *ptdev, \
((group) << DRM_PANTHOR_MAX_PAGE_SHIFT))
#define PANTHOR_DOORBELL_OFFSET_START PANTHOR_DOORBELL_OFFSET(0)
#define PANTHOR_DOORBELL_OFFSET_END PANTHOR_DOORBELL_OFFSET(MAX_GROUPS_PER_POOL)
+#define PANTHOR_DOORBELL_GROUP_FROM_OFFSET(offset) \
+ ((offset - PANTHOR_PRIVATE_MMIO_OFFSET) >> DRM_PANTHOR_MAX_PAGE_SHIFT)
extern struct workqueue_struct *panthor_cleanup_wq;
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 0bd600c464b8..e391ab6aaab2 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1381,7 +1381,6 @@ static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
{
struct drm_file *file = filp->private_data;
struct panthor_file *pfile = file->driver_priv;
- struct panthor_device *ptdev = pfile->ptdev;
u64 offset = (u64)vma->vm_pgoff << PAGE_SHIFT;
int ret, cookie;
@@ -1404,7 +1403,7 @@ static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
#endif
if (offset >= DRM_PANTHOR_USER_MMIO_OFFSET)
- ret = panthor_device_mmap_io(ptdev, vma);
+ ret = panthor_device_mmap_io(pfile, vma);
else
ret = drm_gem_mmap(filp, vma);
diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index ef232c0c2049..ce51f75a419b 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -444,7 +444,7 @@ panthor_fw_alloc_queue_iface_mem(struct panthor_device *ptdev,
int ret;
mem = panthor_kernel_bo_create(ptdev, ptdev->fw->vm, SZ_8K,
- DRM_PANTHOR_BO_NO_MMAP,
+ 0,
DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
PANTHOR_VM_KERNEL_AUTO_VA);
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index ad160a821957..471bb8f2b44c 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -954,6 +954,10 @@ group_bind_locked(struct panthor_group *group, u32 csg_id)
for (u32 i = 0; i < group->queue_count; i++)
group->queues[i]->doorbell_id = csg_id + 1;
+ /* Unmap the dummy doorbell page (if mapped) */
+ unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
+ PANTHOR_DOORBELL_OFFSET(group->handle - 1), 0, 1);
+
csg_slot->group = group;
return 0;
@@ -990,6 +994,10 @@ group_unbind_locked(struct panthor_group *group)
for (u32 i = 0; i < group->queue_count; i++)
group->queues[i]->doorbell_id = -1;
+ /* Unmap the dummy doorbell page (if mapped) */
+ unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
+ PANTHOR_DOORBELL_OFFSET(group->handle - 1), 0, 1);
+
slot->group = NULL;
group_put(group);
@@ -1726,6 +1734,41 @@ void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events)
sched_queue_work(ptdev->scheduler, fw_events);
}
+/**
+ * panthor_sched_doorbell_id() - Get the doorbell ID for a given group.
+ * @pfile: Panthor file.
+ * @group_handle: Group handle.
+ *
+ * Return: a doorbell ID if valid, -1 otherwise
+ */
+u8 panthor_sched_doorbell_id(struct panthor_file *pfile, u32 group_handle)
+{
+ struct panthor_group_pool *gpool = pfile->groups;
+ struct panthor_scheduler *sched = pfile->ptdev->scheduler;
+ u8 doorbell_id;
+ struct panthor_group *group;
+
+ group = group_get(xa_load(&gpool->xa, group_handle));
+ if (!group)
+ return -1;
+
+ if (!group->queue_count) {
+ doorbell_id = -1;
+ goto err_put_group;
+ }
+
+ mutex_lock(&sched->lock);
+
+ /* In the current implementation, all queues of the same group share the same doorbell page. */
+ doorbell_id = group->queues[0]->doorbell_id;
+
+ mutex_unlock(&sched->lock);
+
+err_put_group:
+ group_put(group);
+ return doorbell_id;
+}
+
static const char *fence_get_driver_name(struct dma_fence *fence)
{
return "panthor";
@@ -3057,6 +3100,7 @@ group_create_queue(struct panthor_group *group,
if (!queue)
return ERR_PTR(-ENOMEM);
+ queue->doorbell_id = -1;
queue->fence_ctx.id = dma_fence_context_alloc(1);
spin_lock_init(&queue->fence_ctx.lock);
INIT_LIST_HEAD(&queue->fence_ctx.in_flight_jobs);
@@ -3065,7 +3109,7 @@ group_create_queue(struct panthor_group *group,
queue->ringbuf = panthor_kernel_bo_create(group->ptdev, group->vm,
args->ringbuf_size,
- DRM_PANTHOR_BO_NO_MMAP,
+ 0,
DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
PANTHOR_VM_KERNEL_AUTO_VA);
diff --git a/drivers/gpu/drm/panthor/panthor_sched.h b/drivers/gpu/drm/panthor/panthor_sched.h
index 55b6534fa390..0b3a2ee2a0ad 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.h
+++ b/drivers/gpu/drm/panthor/panthor_sched.h
@@ -47,4 +47,6 @@ void panthor_sched_resume(struct panthor_device *ptdev);
void panthor_sched_report_mmu_fault(struct panthor_device *ptdev);
void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events);
+u8 panthor_sched_doorbell_id(struct panthor_file *pfile, u32 group_handle);
+
#endif
--
2.45.0
* [PATCH 4/8] drm/panthor: Add GROUP_KICK ioctl
2024-08-28 17:25 [RFC PATCH 00/10] drm/panthor: Add user submission Mihail Atanassov
` (2 preceding siblings ...)
2024-08-28 17:25 ` [PATCH 3/8] drm/panthor: Map doorbell pages Mihail Atanassov
@ 2024-08-28 17:26 ` Mihail Atanassov
2024-08-28 17:26 ` [PATCH 5/8] drm/panthor: Factor out syncobj handling Mihail Atanassov
` (5 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Mihail Atanassov @ 2024-08-28 17:26 UTC (permalink / raw)
To: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel, Mihail Atanassov
From: Ketil Johnsen <ketil.johnsen@arm.com>
In the kernel submission flow, GROUP_SUBMIT ensures that the group
gets assigned to a CSG. Conversely, in the user submission flow, work
is added to the ring buffer without kernel supervision, so there needs
to be a mechanism to trigger rescheduling. Add a new GROUP_KICK ioctl
for this, keeping it distinct from the existing submit flow.
Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
---
drivers/gpu/drm/panthor/panthor_drv.c | 12 +++++++++++
drivers/gpu/drm/panthor/panthor_sched.c | 27 +++++++++++++++++++++++++
drivers/gpu/drm/panthor/panthor_sched.h | 1 +
3 files changed, 40 insertions(+)
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index e391ab6aaab2..ce2fdcd3fb42 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1305,6 +1305,17 @@ static int panthor_ioctl_vm_get_state(struct drm_device *ddev, void *data,
return 0;
}
+static int panthor_ioctl_group_kick(struct drm_device *ddev, void *data,
+ struct drm_file *file)
+{
+ struct drm_panthor_group_kick *args = data;
+ struct panthor_file *pfile = file->driver_priv;
+
+ panthor_sched_kick(pfile, args->handle, args->queue_mask);
+
+ return 0;
+}
+
static int
panthor_open(struct drm_device *ddev, struct drm_file *file)
{
@@ -1375,6 +1386,7 @@ static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
PANTHOR_IOCTL(TILER_HEAP_CREATE, tiler_heap_create, DRM_RENDER_ALLOW),
PANTHOR_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
PANTHOR_IOCTL(GROUP_SUBMIT, group_submit, DRM_RENDER_ALLOW),
+ PANTHOR_IOCTL(GROUP_KICK, group_kick, DRM_RENDER_ALLOW),
};
static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 471bb8f2b44c..3b56526a4b97 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -2845,6 +2845,33 @@ void panthor_sched_post_reset(struct panthor_device *ptdev, bool reset_failed)
}
}
+void panthor_sched_kick(struct panthor_file *pfile, u32 group_handle, u32 queue_mask)
+{
+ struct panthor_group_pool *gpool = pfile->groups;
+ struct panthor_scheduler *sched = pfile->ptdev->scheduler;
+ struct panthor_group *group;
+
+ group = group_get(xa_load(&gpool->xa, group_handle));
+ if (!group)
+ return;
+
+ if (!group->queue_count)
+ goto err_put_group;
+
+ mutex_lock(&sched->lock);
+
+ if (group->csg_id < 0)
+ group_schedule_locked(group, queue_mask);
+ else
+ /* All queues share the same doorbell page (for now), so we just need to ding one */
+ gpu_write(pfile->ptdev, CSF_DOORBELL(group->queues[0]->doorbell_id), 1);
+
+ mutex_unlock(&sched->lock);
+
+err_put_group:
+ group_put(group);
+}
+
static void group_sync_upd_work(struct work_struct *work)
{
struct panthor_group *group =
diff --git a/drivers/gpu/drm/panthor/panthor_sched.h b/drivers/gpu/drm/panthor/panthor_sched.h
index 0b3a2ee2a0ad..18fb7ad0952e 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.h
+++ b/drivers/gpu/drm/panthor/panthor_sched.h
@@ -48,5 +48,6 @@ void panthor_sched_report_mmu_fault(struct panthor_device *ptdev);
void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events);
u8 panthor_sched_doorbell_id(struct panthor_file *pfile, u32 group_handle);
+void panthor_sched_kick(struct panthor_file *pfile, u32 group_handle, u32 queue_mask);
#endif
--
2.45.0
* [PATCH 5/8] drm/panthor: Factor out syncobj handling
2024-08-28 17:25 [RFC PATCH 00/10] drm/panthor: Add user submission Mihail Atanassov
` (3 preceding siblings ...)
2024-08-28 17:26 ` [PATCH 4/8] drm/panthor: Add GROUP_KICK ioctl Mihail Atanassov
@ 2024-08-28 17:26 ` Mihail Atanassov
2024-08-28 17:26 ` [PATCH 6/8] drm/panthor: Implement XGS queues Mihail Atanassov
` (4 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Mihail Atanassov @ 2024-08-28 17:26 UTC (permalink / raw)
To: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel, Mihail Atanassov
From: Ketil Johnsen <ketil.johnsen@arm.com>
The code is needed both for the existing kernel submission path and for
implementing cross-group sync (XGS) queues, which bridge between
drm_syncobj and the HW syncobj primitives.
Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
---
drivers/gpu/drm/panthor/Makefile | 3 +-
drivers/gpu/drm/panthor/panthor_sched.c | 154 +++++---------------
drivers/gpu/drm/panthor/panthor_syncobj.c | 167 ++++++++++++++++++++++
drivers/gpu/drm/panthor/panthor_syncobj.h | 27 ++++
4 files changed, 234 insertions(+), 117 deletions(-)
create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.c
create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.h
diff --git a/drivers/gpu/drm/panthor/Makefile b/drivers/gpu/drm/panthor/Makefile
index 15294719b09c..0af27f33bfe2 100644
--- a/drivers/gpu/drm/panthor/Makefile
+++ b/drivers/gpu/drm/panthor/Makefile
@@ -9,6 +9,7 @@ panthor-y := \
panthor_gpu.o \
panthor_heap.o \
panthor_mmu.o \
- panthor_sched.o
+ panthor_sched.o \
+ panthor_syncobj.o
obj-$(CONFIG_DRM_PANTHOR) += panthor.o
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 3b56526a4b97..f272aeee8a8f 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -31,6 +31,7 @@
#include "panthor_mmu.h"
#include "panthor_regs.h"
#include "panthor_sched.h"
+#include "panthor_syncobj.h"
/**
* DOC: Scheduler
@@ -318,39 +319,6 @@ struct panthor_scheduler {
} reset;
};
-/**
- * struct panthor_syncobj_32b - 32-bit FW synchronization object
- */
-struct panthor_syncobj_32b {
- /** @seqno: Sequence number. */
- u32 seqno;
-
- /**
- * @status: Status.
- *
- * Not zero on failure.
- */
- u32 status;
-};
-
-/**
- * struct panthor_syncobj_64b - 64-bit FW synchronization object
- */
-struct panthor_syncobj_64b {
- /** @seqno: Sequence number. */
- u64 seqno;
-
- /**
- * @status: Status.
- *
- * Not zero on failure.
- */
- u32 status;
-
- /** @pad: MBZ. */
- u32 pad;
-};
-
/**
* struct panthor_queue - Execution queue
*/
@@ -445,17 +413,8 @@ struct panthor_queue {
/** @sync64: True if this is a 64-bit sync object. */
bool sync64;
- /** @bo: Buffer object holding the synchronization object. */
- struct drm_gem_object *obj;
-
- /** @offset: Offset of the synchronization object inside @bo. */
- u64 offset;
-
- /**
- * @kmap: Kernel mapping of the buffer object holding the
- * synchronization object.
- */
- void *kmap;
+ /** @syncobj: Wrapper for the syncobj in memory */
+ struct panthor_syncobj *syncobj;
} syncwait;
/** @fence_ctx: Fence context fields. */
@@ -794,53 +753,6 @@ struct panthor_job {
struct dma_fence *done_fence;
};
-static void
-panthor_queue_put_syncwait_obj(struct panthor_queue *queue)
-{
- if (queue->syncwait.kmap) {
- struct iosys_map map = IOSYS_MAP_INIT_VADDR(queue->syncwait.kmap);
-
- drm_gem_vunmap_unlocked(queue->syncwait.obj, &map);
- queue->syncwait.kmap = NULL;
- }
-
- drm_gem_object_put(queue->syncwait.obj);
- queue->syncwait.obj = NULL;
-}
-
-static void *
-panthor_queue_get_syncwait_obj(struct panthor_group *group, struct panthor_queue *queue)
-{
- struct panthor_device *ptdev = group->ptdev;
- struct panthor_gem_object *bo;
- struct iosys_map map;
- int ret;
-
- if (queue->syncwait.kmap)
- return queue->syncwait.kmap + queue->syncwait.offset;
-
- bo = panthor_vm_get_bo_for_va(group->vm,
- queue->syncwait.gpu_va,
- &queue->syncwait.offset);
- if (drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(bo)))
- goto err_put_syncwait_obj;
-
- queue->syncwait.obj = &bo->base.base;
- ret = drm_gem_vmap_unlocked(queue->syncwait.obj, &map);
- if (drm_WARN_ON(&ptdev->base, ret))
- goto err_put_syncwait_obj;
-
- queue->syncwait.kmap = map.vaddr;
- if (drm_WARN_ON(&ptdev->base, !queue->syncwait.kmap))
- goto err_put_syncwait_obj;
-
- return queue->syncwait.kmap + queue->syncwait.offset;
-
-err_put_syncwait_obj:
- panthor_queue_put_syncwait_obj(queue);
- return NULL;
-}
-
static void group_free_queue(struct panthor_group *group, struct panthor_queue *queue)
{
if (IS_ERR_OR_NULL(queue))
@@ -852,7 +764,7 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
if (queue->scheduler.ops)
drm_sched_fini(&queue->scheduler);
- panthor_queue_put_syncwait_obj(queue);
+ panthor_syncobj_release(queue->syncwait.syncobj);
if (queue->ringbuf_offset)
drm_vma_node_revoke(&queue->ringbuf->obj->vma_node, group->pfile->drm_file);
@@ -2065,7 +1977,6 @@ group_term_post_processing(struct panthor_group *group)
cookie = dma_fence_begin_signalling();
for (i = 0; i < group->queue_count; i++) {
struct panthor_queue *queue = group->queues[i];
- struct panthor_syncobj_64b *syncobj;
int err;
if (group->fatal_queues & BIT(i))
@@ -2086,12 +1997,13 @@ group_term_post_processing(struct panthor_group *group)
}
spin_unlock(&queue->fence_ctx.lock);
- if (!group->user_submit) {
+ if (!group->user_submit)
/* Manually update the syncobj seqno to unblock waiters. */
- syncobj = group->syncobjs->kmap + (i * sizeof(*syncobj));
- syncobj->status = ~0;
- syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
- }
+ panthor_syncobj_ptr64_signal_with_error(
+ group->syncobjs->kmap + (i * PANTHOR_SYNCOBJ64_SIZE),
+ atomic64_read(&queue->fence_ctx.seqno),
+ ~0);
+
sched_queue_work(group->ptdev->scheduler, sync_upd);
}
dma_fence_end_signalling(cookie);
@@ -2461,28 +2373,32 @@ static void tick_work(struct work_struct *work)
static int panthor_queue_eval_syncwait(struct panthor_group *group, u8 queue_idx)
{
struct panthor_queue *queue = group->queues[queue_idx];
- union {
- struct panthor_syncobj_64b sync64;
- struct panthor_syncobj_32b sync32;
- } *syncobj;
+ struct panthor_syncobj *syncobj;
bool result;
u64 value;
- syncobj = panthor_queue_get_syncwait_obj(group, queue);
- if (!syncobj)
- return -EINVAL;
+ if (!queue->syncwait.syncobj) {
+ syncobj = panthor_syncobj_create(group->ptdev,
+ group->vm,
+ queue->syncwait.gpu_va,
+ queue->syncwait.sync64);
+ if (IS_ERR_OR_NULL(syncobj))
+ return PTR_ERR(syncobj);
- value = queue->syncwait.sync64 ?
- syncobj->sync64.seqno :
- syncobj->sync32.seqno;
+ queue->syncwait.syncobj = syncobj;
+ }
+
+ value = panthor_syncobj_get_value(queue->syncwait.syncobj);
if (queue->syncwait.gt)
result = value > queue->syncwait.ref;
else
result = value <= queue->syncwait.ref;
- if (result)
- panthor_queue_put_syncwait_obj(queue);
+ if (result) {
+ panthor_syncobj_release(queue->syncwait.syncobj);
+ queue->syncwait.syncobj = NULL;
+ }
return result;
}
@@ -2887,16 +2803,22 @@ static void group_sync_upd_work(struct work_struct *work)
cookie = dma_fence_begin_signalling();
for (queue_idx = 0; queue_idx < group->queue_count; queue_idx++) {
struct panthor_queue *queue = group->queues[queue_idx];
- struct panthor_syncobj_64b *syncobj;
+ void *syncobj;
if (!queue)
continue;
- syncobj = group->syncobjs->kmap + (queue_idx * sizeof(*syncobj));
+ syncobj = group->syncobjs->kmap + (queue_idx * PANTHOR_SYNCOBJ64_SIZE);
spin_lock(&queue->fence_ctx.lock);
list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
- if (syncobj->seqno < job->done_fence->seqno)
+ u64 value;
+
+ if (!job->call_info.size)
+ continue;
+
+ value = panthor_syncobj_ptr64_get_value(syncobj);
+ if (value < job->done_fence->seqno)
break;
list_move_tail(&job->node, &done_jobs);
@@ -2928,7 +2850,7 @@ queue_run_job(struct drm_sched_job *sched_job)
ptdev->csif_info.unpreserved_cs_reg_count;
u64 val_reg = addr_reg + 2;
u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
- job->queue_idx * sizeof(struct panthor_syncobj_64b);
+ job->queue_idx * PANTHOR_SYNCOBJ64_SIZE;
u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
struct dma_fence *done_fence;
int ret;
@@ -3289,7 +3211,7 @@ int panthor_group_create(struct panthor_file *pfile,
if (!group->user_submit) {
group->syncobjs = panthor_kernel_bo_create(ptdev, group->vm,
group_args->queues.count *
- sizeof(struct panthor_syncobj_64b),
+ PANTHOR_SYNCOBJ64_SIZE,
DRM_PANTHOR_BO_NO_MMAP,
DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
@@ -3304,7 +3226,7 @@ int panthor_group_create(struct panthor_file *pfile,
goto err_put_group;
memset(group->syncobjs->kmap, 0,
- group_args->queues.count * sizeof(struct panthor_syncobj_64b));
+ group_args->queues.count * PANTHOR_SYNCOBJ64_SIZE);
}
for (i = 0; i < group_args->queues.count; i++) {
diff --git a/drivers/gpu/drm/panthor/panthor_syncobj.c b/drivers/gpu/drm/panthor/panthor_syncobj.c
new file mode 100644
index 000000000000..337f75bfa648
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_syncobj.c
@@ -0,0 +1,167 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2024 ARM Limited. All rights reserved. */
+
+#include <linux/iosys-map.h>
+
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_mmu.h"
+#include "panthor_syncobj.h"
+
+/**
+ * struct panthor_syncobj_32b - 32-bit FW synchronization object
+ */
+struct panthor_syncobj_32b {
+ /** @value: Value field. */
+ u32 value;
+
+ /**
+ * @error: Error status.
+ *
+ * Not zero on failure.
+ */
+ u32 error;
+};
+
+/**
+ * struct panthor_syncobj_64b - 64-bit FW synchronization object
+ */
+struct panthor_syncobj_64b {
+ /** @value: Value field. */
+ u64 value;
+
+ /**
+ * @error: Error status.
+ *
+ * Not zero on failure.
+ */
+ u32 error;
+
+ /** @pad: MBZ. */
+ u32 pad;
+};
+
+struct panthor_syncobj {
+ /** @bo: Buffer object holding the synchronization object. */
+ struct drm_gem_object *bo;
+
+ /** @offset: Offset of the synchronization object inside @bo. */
+ u64 offset;
+
+ /**
+ * @kmap: Kernel mapping of the buffer object holding the
+ * synchronization object.
+ */
+ void *kmap;
+
+ /** @ptr: CPU ptr to synchronization object */
+ union {
+ struct panthor_syncobj_64b sync64;
+ struct panthor_syncobj_32b sync32;
+ } *ptr;
+
+ /** @sync64: true for 64-bit synchronization object, otherwise 32-bit */
+ bool sync64;
+};
+
+
+
+struct panthor_syncobj *panthor_syncobj_create(struct panthor_device *ptdev,
+ struct panthor_vm *vm, u64 gpu_va,
+ bool sync64)
+{
+ struct panthor_gem_object *bo;
+ struct iosys_map map;
+ struct panthor_syncobj *syncobj;
+ int err;
+
+ syncobj = kzalloc(sizeof(*syncobj), GFP_KERNEL);
+ if (!syncobj) {
+ err = -ENOMEM;
+ goto err;
+ }
+
+ bo = panthor_vm_get_bo_for_va(vm, gpu_va, &syncobj->offset);
+ if (drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(bo))) {
+ err = -EINVAL;
+ goto err_free_syncobj;
+ }
+
+ syncobj->bo = &bo->base.base;
+
+ err = drm_gem_vmap_unlocked(syncobj->bo, &map);
+ if (drm_WARN_ON(&ptdev->base, err))
+ goto err_put_gem_object;
+
+ syncobj->kmap = map.vaddr;
+ syncobj->ptr = syncobj->kmap + syncobj->offset;
+ syncobj->sync64 = sync64;
+
+ return syncobj;
+
+err_put_gem_object:
+ drm_gem_object_put(syncobj->bo);
+err_free_syncobj:
+ kfree(syncobj);
+err:
+ return ERR_PTR(err);
+}
+
+void panthor_syncobj_release(struct panthor_syncobj *syncobj)
+{
+ if (syncobj) {
+ struct iosys_map map = IOSYS_MAP_INIT_VADDR(syncobj->kmap);
+
+ drm_gem_vunmap_unlocked(syncobj->bo, &map);
+ drm_gem_object_put(syncobj->bo);
+ kfree(syncobj);
+ }
+}
+
+u64 panthor_syncobj_get_value(struct panthor_syncobj *syncobj)
+{
+ return syncobj->sync64 ?
+ syncobj->ptr->sync64.value :
+ syncobj->ptr->sync32.value;
+}
+
+u32 panthor_syncobj_get_error(struct panthor_syncobj *syncobj)
+{
+ return syncobj->sync64 ?
+ syncobj->ptr->sync64.error :
+ syncobj->ptr->sync32.error;
+}
+
+void panthor_syncobj_signal(struct panthor_syncobj *syncobj, u64 value)
+{
+ if (syncobj->sync64)
+ syncobj->ptr->sync64.value = value;
+ else
+ syncobj->ptr->sync32.value = (u32)value;
+}
+
+void panthor_syncobj_signal_with_error(struct panthor_syncobj *syncobj, u64 value, u32 error)
+{
+ if (syncobj->sync64) {
+ syncobj->ptr->sync64.value = value;
+ syncobj->ptr->sync64.error = error;
+ } else {
+ syncobj->ptr->sync32.value = (u32)value;
+ syncobj->ptr->sync32.error = error;
+ }
+}
+
+u64 panthor_syncobj_ptr64_get_value(void *syncobj_ptr)
+{
+ struct panthor_syncobj_64b *syncobj = syncobj_ptr;
+
+ return syncobj->value;
+}
+
+void panthor_syncobj_ptr64_signal_with_error(void *syncobj_ptr, u64 value, u32 error)
+{
+ struct panthor_syncobj_64b *syncobj = syncobj_ptr;
+
+ syncobj->value = value;
+ syncobj->error = error;
+}
diff --git a/drivers/gpu/drm/panthor/panthor_syncobj.h b/drivers/gpu/drm/panthor/panthor_syncobj.h
new file mode 100644
index 000000000000..018cfc87cdaa
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_syncobj.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2024 ARM Limited. All rights reserved. */
+
+#ifndef __PANTHOR_SYNCOBJ_H__
+#define __PANTHOR_SYNCOBJ_H__
+
+#define PANTHOR_SYNCOBJ32_SIZE 8
+#define PANTHOR_SYNCOBJ64_SIZE 16
+
+struct panthor_syncobj;
+struct panthor_vm;
+
+struct panthor_syncobj *panthor_syncobj_create(struct panthor_device *ptdev,
+ struct panthor_vm *vm, u64 gpu_va,
+ bool sync64);
+void panthor_syncobj_release(struct panthor_syncobj *syncobj);
+
+u64 panthor_syncobj_get_value(struct panthor_syncobj *syncobj);
+u32 panthor_syncobj_get_error(struct panthor_syncobj *syncobj);
+
+void panthor_syncobj_signal(struct panthor_syncobj *syncobj, u64 value);
+void panthor_syncobj_signal_with_error(struct panthor_syncobj *syncobj, u64 value, u32 error);
+
+u64 panthor_syncobj_ptr64_get_value(void *syncobj_ptr);
+void panthor_syncobj_ptr64_signal_with_error(void *syncobj_ptr, u64 value, u32 error);
+
+#endif
--
2.45.0
* [PATCH 6/8] drm/panthor: Implement XGS queues
2024-08-28 17:25 [RFC PATCH 00/10] drm/panthor: Add user submission Mihail Atanassov
` (4 preceding siblings ...)
2024-08-28 17:26 ` [PATCH 5/8] drm/panthor: Factor out syncobj handling Mihail Atanassov
@ 2024-08-28 17:26 ` Mihail Atanassov
2024-09-03 19:17 ` Simona Vetter
2024-08-28 17:26 ` [PATCH 7/8] drm/panthor: Add sync_update eventfd handling Mihail Atanassov
` (3 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Mihail Atanassov @ 2024-08-28 17:26 UTC (permalink / raw)
To: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel, Mihail Atanassov
From: Ketil Johnsen <ketil.johnsen@arm.com>
Synchronisation between GPU queues in groups from the same VM is
achieved using either 32-bit or 64-bit synchronisation objects (see
panthor_syncobj.(c|h)). They are also the mechanism for syncing between
a queue and the kernel (or the user application). To support the
latter case, introduce cross-group sync (XGS) queues. Each is a DRM
scheduler/entity pair, associated with a VM, which runs XGS jobs --
WAITs or SETs/ADDs on those HW syncobjs -- to let a userspace driver
perform CPU-to-GPU (and vice versa) synchronisation and link it up
with DRM sync primitives.
Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
Co-developed-by: Akash Goel <akash.goel@arm.com>
Signed-off-by: Akash Goel <akash.goel@arm.com>
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
---
drivers/gpu/drm/panthor/Makefile | 3 +-
drivers/gpu/drm/panthor/panthor_device.h | 4 +
drivers/gpu/drm/panthor/panthor_drv.c | 123 ++++-
drivers/gpu/drm/panthor/panthor_sched.c | 25 +-
drivers/gpu/drm/panthor/panthor_sched.h | 1 +
drivers/gpu/drm/panthor/panthor_xgs.c | 638 +++++++++++++++++++++++
drivers/gpu/drm/panthor/panthor_xgs.h | 42 ++
7 files changed, 832 insertions(+), 4 deletions(-)
create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.c
create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.h
diff --git a/drivers/gpu/drm/panthor/Makefile b/drivers/gpu/drm/panthor/Makefile
index 0af27f33bfe2..7637bae47e26 100644
--- a/drivers/gpu/drm/panthor/Makefile
+++ b/drivers/gpu/drm/panthor/Makefile
@@ -10,6 +10,7 @@ panthor-y := \
panthor_heap.o \
panthor_mmu.o \
panthor_sched.o \
- panthor_syncobj.o
+ panthor_syncobj.o \
+ panthor_xgs.o
obj-$(CONFIG_DRM_PANTHOR) += panthor.o
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 87cce384e36a..1e98d2a856b7 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -17,6 +17,7 @@
#include <drm/gpu_scheduler.h>
#include <drm/panthor_drm.h>
+struct panthor_xgs_queue_pool;
struct panthor_csf;
struct panthor_csf_ctx;
struct panthor_device;
@@ -182,6 +183,9 @@ struct panthor_file {
/** @groups: Scheduling group pool attached to this file. */
struct panthor_group_pool *groups;
+
+ /** @xgs_queues: XGS queues attached to this file. */
+ struct panthor_xgs_queue_pool *xgs_queues;
};
int panthor_device_init(struct panthor_device *ptdev);
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index ce2fdcd3fb42..681ac09b6343 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -19,6 +19,7 @@
#include <drm/gpu_scheduler.h>
#include <drm/panthor_drm.h>
+#include "panthor_xgs.h"
#include "panthor_device.h"
#include "panthor_fw.h"
#include "panthor_gem.h"
@@ -215,7 +216,8 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
PANTHOR_UOBJ_DECL(struct drm_panthor_sync_op, timeline_value), \
PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
- PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs))
+ PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs), \
+ PANTHOR_UOBJ_DECL(struct drm_panthor_xgs_op, pad))
/**
* PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
@@ -1316,6 +1318,114 @@ static int panthor_ioctl_group_kick(struct drm_device *ddev, void *data,
return 0;
}
+static int panthor_ioctl_xgs_queue_create(struct drm_device *ddev, void *data,
+ struct drm_file *file)
+{
+ struct drm_panthor_xgs_queue_create *args = data;
+ struct panthor_file *pfile = file->driver_priv;
+
+ if (args->pad)
+ return -EINVAL;
+
+ return panthor_xgs_queue_create(pfile, args->vm_id,
+ args->eventfd_sync_update, &args->handle);
+}
+
+static int panthor_ioctl_xgs_queue_destroy(struct drm_device *ddev, void *data,
+ struct drm_file *file)
+{
+ struct drm_panthor_xgs_queue_destroy *args = data;
+ struct panthor_file *pfile = file->driver_priv;
+
+ if (args->pad)
+ return -EINVAL;
+
+ return panthor_xgs_queue_destroy(pfile, args->handle);
+}
+
+#define XGS_QUEUE_SUBMIT_FLAGS (DRM_PANTHOR_XGS_QUEUE_SUBMIT_ERROR_BARRIER_PRE | \
+ DRM_PANTHOR_XGS_QUEUE_SUBMIT_ERROR_BARRIER_POST)
+
+static int panthor_ioctl_xgs_queue_submit(struct drm_device *ddev, void *data,
+ struct drm_file *file)
+{
+ struct drm_panthor_xgs_queue_submit *args = data;
+ struct panthor_file *pfile = file->driver_priv;
+ struct drm_panthor_xgs_op *ops_args;
+ struct panthor_xgs_queue *queue;
+ struct panthor_submit_ctx ctx;
+ struct drm_sched_job *job;
+ struct panthor_vm *vm;
+ int ret;
+
+ if (args->flags & ~XGS_QUEUE_SUBMIT_FLAGS)
+ return -EINVAL;
+
+ if (args->ops.count) {
+ ret = PANTHOR_UOBJ_GET_ARRAY(ops_args, &args->ops);
+ if (ret)
+ return ret;
+ } else {
+ ops_args = NULL;
+ }
+
+ queue = panthor_xgs_queue_pool_get_xgs_queue(pfile->xgs_queues, args->handle);
+ if (!queue) {
+ ret = -EINVAL;
+ goto out_free_ops_args;
+ }
+
+ ret = panthor_submit_ctx_init(&ctx, file, 1);
+ if (ret)
+ goto out_put_queue;
+
+ /* Create job object */
+ job = panthor_xgs_job_create(queue, ops_args, args->ops.count);
+ if (IS_ERR(job)) {
+ ret = PTR_ERR(job);
+ goto out_cleanup_submit_ctx;
+ }
+
+ /* handed over to the job object */
+ ops_args = NULL;
+
+ /* attach sync operations */
+ ret = panthor_submit_ctx_add_job(&ctx, 0, job, &args->syncs);
+ if (ret)
+ goto out_cleanup_submit_ctx;
+
+ /* Collect signal operations on all jobs */
+ ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
+ if (ret)
+ goto out_cleanup_submit_ctx;
+
+ /* The queue already holds a VM ref, so we don't need to take an extra one */
+ vm = panthor_xgs_queue_vm(queue);
+
+ /* Acquire/prepare the resvs for the job */
+ drm_exec_until_all_locked(&ctx.exec) {
+ ret = panthor_vm_prepare_mapped_bos_resvs(&ctx.exec, vm, 1);
+ }
+
+ if (ret)
+ goto out_cleanup_submit_ctx;
+
+ /* Add deps, arm job fence and register the job fence to signal array */
+ ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx);
+ if (ret)
+ goto out_cleanup_submit_ctx;
+
+ /* Nothing can fail after that point */
+ panthor_submit_ctx_push_jobs(&ctx, panthor_xgs_job_update_resvs);
+
+out_cleanup_submit_ctx:
+ panthor_submit_ctx_cleanup(&ctx, panthor_xgs_job_put);
+out_put_queue:
+ panthor_xgs_queue_put(queue);
+out_free_ops_args:
+ kvfree(ops_args);
+
+ return ret;
+}
+
static int
panthor_open(struct drm_device *ddev, struct drm_file *file)
{
@@ -1343,9 +1453,16 @@ panthor_open(struct drm_device *ddev, struct drm_file *file)
if (ret)
goto err_destroy_vm_pool;
+ ret = panthor_xgs_queue_pool_create(pfile);
+ if (ret)
+ goto err_destroy_group_pool;
+
file->driver_priv = pfile;
return 0;
+err_destroy_group_pool:
+ panthor_group_pool_destroy(pfile);
+
err_destroy_vm_pool:
panthor_vm_pool_destroy(pfile);
@@ -1363,6 +1480,7 @@ panthor_postclose(struct drm_device *ddev, struct drm_file *file)
struct panthor_file *pfile = file->driver_priv;
panthor_group_pool_destroy(pfile);
+ panthor_xgs_queue_pool_destroy(pfile);
panthor_vm_pool_destroy(pfile);
kfree(pfile);
@@ -1387,6 +1505,9 @@ static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
PANTHOR_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
PANTHOR_IOCTL(GROUP_SUBMIT, group_submit, DRM_RENDER_ALLOW),
PANTHOR_IOCTL(GROUP_KICK, group_kick, DRM_RENDER_ALLOW),
+ PANTHOR_IOCTL(XGS_QUEUE_CREATE, xgs_queue_create, DRM_RENDER_ALLOW),
+ PANTHOR_IOCTL(XGS_QUEUE_DESTROY, xgs_queue_destroy, DRM_RENDER_ALLOW),
+ PANTHOR_IOCTL(XGS_QUEUE_SUBMIT, xgs_queue_submit, DRM_RENDER_ALLOW),
};
static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index f272aeee8a8f..92172b2c6253 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -22,6 +22,7 @@
#include <linux/platform_device.h>
#include <linux/pm_runtime.h>
+#include "panthor_xgs.h"
#include "panthor_devfreq.h"
#include "panthor_device.h"
#include "panthor_fw.h"
@@ -1496,8 +1497,13 @@ static void csg_slot_sync_update_locked(struct panthor_device *ptdev,
lockdep_assert_held(&ptdev->scheduler->lock);
- if (group && !group->user_submit)
- group_queue_work(group, sync_upd);
+ if (group) {
+ /* Rerun XGS jobs immediately, as this can potentially unblock the group */
+ panthor_xgs_queue_pool_recheck(group->pfile);
+
+ if (!group->user_submit)
+ group_queue_work(group, sync_upd);
+ }
sched_queue_work(ptdev->scheduler, sync_upd);
}
@@ -1691,9 +1697,15 @@ static const char *queue_fence_get_timeline_name(struct dma_fence *fence)
return "queue-fence";
}
+static void job_fence_free(struct dma_fence *fence)
+{
+ dma_fence_free(fence);
+}
+
static const struct dma_fence_ops panthor_queue_fence_ops = {
.get_driver_name = fence_get_driver_name,
.get_timeline_name = queue_fence_get_timeline_name,
+ .release = job_fence_free,
};
struct panthor_csg_slots_upd_ctx {
@@ -2431,6 +2443,10 @@ static void sync_upd_work(struct work_struct *work)
if (unblocked_queues) {
group->blocked_queues &= ~unblocked_queues;
+ /* Sync updates from XGS queue could happen when we are not ticking */
+ if (sched->resched_target == U64_MAX)
+ immediate_tick = true;
+
if (group->csg_id < 0) {
list_move(&group->run_node,
&sched->groups.runnable[group->priority]);
@@ -2788,6 +2804,11 @@ void panthor_sched_kick(struct panthor_file *pfile, u32 group_handle, u32 queue_
group_put(group);
}
+void panthor_sched_sync_update(struct panthor_device *ptdev)
+{
+ sched_queue_work(ptdev->scheduler, sync_upd);
+}
+
static void group_sync_upd_work(struct work_struct *work)
{
struct panthor_group *group =
diff --git a/drivers/gpu/drm/panthor/panthor_sched.h b/drivers/gpu/drm/panthor/panthor_sched.h
index 18fb7ad0952e..2cb58c66b8ac 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.h
+++ b/drivers/gpu/drm/panthor/panthor_sched.h
@@ -49,5 +49,6 @@ void panthor_sched_report_fw_events(struct panthor_device *ptdev, u32 events);
u8 panthor_sched_doorbell_id(struct panthor_file *pfile, u32 group_handle);
void panthor_sched_kick(struct panthor_file *pfile, u32 group_handle, u32 queue_mask);
+void panthor_sched_sync_update(struct panthor_device *ptdev);
#endif
diff --git a/drivers/gpu/drm/panthor/panthor_xgs.c b/drivers/gpu/drm/panthor/panthor_xgs.c
new file mode 100644
index 000000000000..a900badb9224
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_xgs.c
@@ -0,0 +1,638 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2024 ARM Limited. All rights reserved. */
+
+#include <drm/drm_drv.h>
+#include <drm/drm_gem.h>
+#include <drm/drm_managed.h>
+#include <drm/gpu_scheduler.h>
+#include <drm/panthor_drm.h>
+
+#include <linux/iosys-map.h>
+
+#include "panthor_xgs.h"
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_mmu.h"
+#include "panthor_sched.h"
+#include "panthor_syncobj.h"
+
+#define JOB_TIMEOUT_MS 5000
+
+/**
+ * struct panthor_xgs_queue - Cross-group sync (XGS) queue
+ */
+struct panthor_xgs_queue {
+ /** @refcount: Reference count */
+ struct kref refcount;
+
+ /** @lock: Lock to synchronize access to this queue */
+ struct mutex lock;
+
+ /** @handle: integer value used by user to refer to this queue */
+ u32 handle;
+
+ /** @ptdev: Panthor device for which this queue was created */
+ struct panthor_device *ptdev;
+
+ /** @vm: VM associated with this XGS queue */
+ struct panthor_vm *vm;
+
+ /** @eventfd_sync_update_ctx: eventfd context to signal on XGS set ops */
+ struct eventfd_ctx *eventfd_sync_update_ctx;
+
+ /** @scheduler: scheduler instance used to execute XGS jobs */
+ struct drm_gpu_scheduler scheduler;
+
+ /** @entity: Scheduler entity with XGS jobs */
+ struct drm_sched_entity entity;
+
+ /** @fence_ctx: Fence context fields. */
+ struct {
+ /** @lock: Used to protect access to all fences allocated by this context. */
+ spinlock_t lock;
+
+ /**
+ * @id: Fence context ID.
+ *
+ * Allocated with dma_fence_context_alloc().
+ */
+ u64 id;
+
+ /** @seqno: Sequence number of the last initialized fence. */
+ atomic64_t seqno;
+
+ /**
+ * @in_flight_jobs: List containing all in-flight jobs.
+ *
+ * Used to keep track of jobs and signal panthor_xgs_job::done_fence
+ * when a synchronisation object the job waits on is updated.
+ */
+ struct list_head in_flight_jobs;
+ } fence_ctx;
+
+ /** @destroyed: True if queue is marked for destruction and should not be used */
+ bool destroyed;
+
+ /**
+ * @release_work: Work used to release XGS queue resources.
+ *
+ * We need to postpone the queue release to avoid a deadlock;
+ * otherwise "free_job" could end up calling back into DRM sched.
+ */
+ struct work_struct release_work;
+};
+
+/*
+ * We currently cap the number of XGS queues per file at an arbitrarily
+ * low value; it can be raised if we ever need more.
+ */
+#define MAX_XGS_QUEUES_PER_POOL 128
+
+/**
+ * struct panthor_xgs_queue_pool - XGS queue pool
+ *
+ * Each file gets assigned an XGS queue pool.
+ */
+struct panthor_xgs_queue_pool {
+ /** @xa: Xarray used to manage XGS queue handles. */
+ struct xarray xa;
+};
+
+/**
+ * struct panthor_xgs_job - Cross-group sync (XGS) job
+ */
+struct panthor_xgs_job {
+ /** @base: Inherit from drm_sched_job. */
+ struct drm_sched_job base;
+
+ /** @refcount: Reference count. */
+ struct kref refcount;
+
+ /** @queue: XGS queue this job will be pushed to. */
+ struct panthor_xgs_queue *queue;
+
+ /** @ops: List of XGS operations to execute */
+ struct drm_panthor_xgs_op *ops;
+
+ /** @ops_count: Number of operations in the ops array */
+ u32 ops_count;
+
+ /** @done_fence: Fence signaled when the job is finished or cancelled. */
+ struct dma_fence *done_fence;
+
+ /** @node: Node used to insert job into in_flight_jobs list of queue */
+ struct list_head node;
+
+};
+
+static int panthor_xgs_try_run_job(struct panthor_xgs_job *job);
+
+static const char *xgs_fence_get_driver_name(struct dma_fence *fence)
+{
+ return "panthor";
+}
+
+static const char *xgs_fence_get_timeline_name(struct dma_fence *fence)
+{
+ return "xgs-fence";
+}
+
+static void xgs_fence_free(struct dma_fence *fence)
+{
+ dma_fence_free(fence);
+}
+
+static const struct dma_fence_ops panthor_xgs_fence_ops = {
+ .get_driver_name = xgs_fence_get_driver_name,
+ .get_timeline_name = xgs_fence_get_timeline_name,
+ .release = xgs_fence_free,
+};
+
+static void xgs_queue_release_work(struct work_struct *work)
+{
+ struct panthor_xgs_queue *queue = container_of(work, struct panthor_xgs_queue,
+ release_work);
+
+ if (queue->entity.fence_context)
+ drm_sched_entity_destroy(&queue->entity);
+
+ if (queue->scheduler.ops)
+ drm_sched_fini(&queue->scheduler);
+
+ panthor_vm_put(queue->vm);
+
+ if (queue->eventfd_sync_update_ctx)
+ eventfd_ctx_put(queue->eventfd_sync_update_ctx);
+
+ kfree(queue);
+}
+
+static void xgs_queue_release(struct kref *kref)
+{
+ struct panthor_xgs_queue *queue = container_of(kref, struct panthor_xgs_queue, refcount);
+ struct panthor_device *ptdev = queue->ptdev;
+
+ drm_WARN_ON(&ptdev->base, !list_empty(&queue->fence_ctx.in_flight_jobs));
+
+ queue_work(panthor_cleanup_wq, &queue->release_work);
+}
+
+static struct panthor_xgs_queue *xgs_queue_get(struct panthor_xgs_queue *queue)
+{
+ if (queue)
+ kref_get(&queue->refcount);
+
+ return queue;
+}
+
+static void xgs_queue_recheck(struct panthor_xgs_queue *queue)
+{
+ struct panthor_xgs_job *job, *tmp;
+ int ret;
+
+ mutex_lock(&queue->lock);
+
+ list_for_each_entry_safe(job, tmp, &queue->fence_ctx.in_flight_jobs, node) {
+ ret = panthor_xgs_try_run_job(job);
+
+ if (ret != -EBUSY) {
+ /* completed or failed */
+ list_del_init(&job->node);
+
+ if (ret)
+ dma_fence_set_error(job->done_fence, -ETIMEDOUT);
+
+ dma_fence_signal(job->done_fence);
+
+ /* Ditch ref we took when adding it to the in_flight_jobs */
+ panthor_xgs_job_put(&job->base);
+ }
+ }
+
+ mutex_unlock(&queue->lock);
+}
+
+/* Internal flag to mark operations as completed. Must not clash with drm_panthor_xgs_op_flags */
+#define DRM_PANTHOR_XGS_OP_COMPLETED (1 << 15)
+
+static int panthor_xgs_wait(struct panthor_xgs_queue *queue, struct drm_panthor_xgs_op *xgs_op)
+{
+ struct panthor_device *ptdev = queue->ptdev;
+ struct panthor_syncobj *syncobj;
+ int ret;
+ u64 value;
+
+ if (xgs_op->flags & DRM_PANTHOR_XGS_OP_COMPLETED)
+ return 0;
+
+ syncobj = panthor_syncobj_create(ptdev, queue->vm, xgs_op->addr,
+ xgs_op->format == DRM_PANTHOR_XGS_OP_FORMAT_U64);
+ if (IS_ERR_OR_NULL(syncobj))
+ return PTR_ERR(syncobj);
+
+ value = panthor_syncobj_get_value(syncobj);
+
+ ret = -EBUSY;
+
+ if (xgs_op->op == DRM_PANTHOR_XGS_OP_WAIT_LE) {
+ if (value <= xgs_op->value) {
+ ret = 0;
+ xgs_op->flags |= DRM_PANTHOR_XGS_OP_COMPLETED;
+ }
+ } else if (xgs_op->op == DRM_PANTHOR_XGS_OP_WAIT_GT) {
+ if (value > xgs_op->value) {
+ ret = 0;
+ xgs_op->flags |= DRM_PANTHOR_XGS_OP_COMPLETED;
+ }
+ }
+
+ panthor_syncobj_release(syncobj);
+
+ return ret;
+}
+
+static void
+panthor_xgs_signal(struct panthor_xgs_queue *queue, struct drm_panthor_xgs_op *xgs_op, u32 error)
+{
+ struct panthor_device *ptdev = queue->ptdev;
+ struct panthor_syncobj *syncobj;
+ u64 value;
+
+ if (xgs_op->flags & DRM_PANTHOR_XGS_OP_COMPLETED)
+ return;
+
+ syncobj = panthor_syncobj_create(ptdev, queue->vm, xgs_op->addr,
+ xgs_op->format == DRM_PANTHOR_XGS_OP_FORMAT_U64);
+ if (IS_ERR_OR_NULL(syncobj))
+ return;
+
+ value = panthor_syncobj_get_value(syncobj);
+
+ if (xgs_op->op == DRM_PANTHOR_XGS_OP_SIGNAL_SET)
+ value = xgs_op->value;
+ else if (xgs_op->op == DRM_PANTHOR_XGS_OP_SIGNAL_ADD)
+ value += xgs_op->value;
+
+ if (!error)
+ panthor_syncobj_signal(syncobj, value);
+ else
+ panthor_syncobj_signal_with_error(syncobj, value, error);
+
+ panthor_sched_sync_update(ptdev);
+
+ if (queue->eventfd_sync_update_ctx)
+ eventfd_signal(queue->eventfd_sync_update_ctx);
+
+ xgs_op->flags |= DRM_PANTHOR_XGS_OP_COMPLETED;
+
+ panthor_syncobj_release(syncobj);
+}
+
+static int panthor_xgs_try_run_job(struct panthor_xgs_job *job)
+{
+ int i;
+ int err_wait = 0;
+ struct drm_panthor_xgs_op *xgs_op;
+
+ lockdep_assert_held(&job->queue->lock);
+
+ for (i = 0; i < job->ops_count; i++) {
+ xgs_op = &job->ops[i];
+
+ switch (xgs_op->op & ~DRM_PANTHOR_XGS_OP_COMPLETED) {
+ case DRM_PANTHOR_XGS_OP_WAIT_LE:
+ case DRM_PANTHOR_XGS_OP_WAIT_GT:
+ if (!err_wait)
+ err_wait = panthor_xgs_wait(job->queue, &job->ops[i]);
+ if (err_wait == -EBUSY)
+ return err_wait;
+ break;
+ case DRM_PANTHOR_XGS_OP_SIGNAL_SET:
+ case DRM_PANTHOR_XGS_OP_SIGNAL_ADD:
+ panthor_xgs_signal(job->queue, &job->ops[i], err_wait);
+ break;
+ default:
+ /* unknown operation, assume this could be a critical error */
+ err_wait = -EINVAL;
+ break;
+ }
+ }
+
+ return err_wait;
+}
+
+static struct dma_fence *panthor_xgs_run_job(struct drm_sched_job *sched_job)
+{
+ struct panthor_xgs_job *job = container_of(sched_job, struct panthor_xgs_job, base);
+ struct panthor_xgs_queue *queue = job->queue;
+ struct dma_fence *done_fence;
+ int ret;
+
+ mutex_lock(&queue->lock);
+
+ ret = panthor_xgs_try_run_job(job);
+ if (ret == -EBUSY) {
+ dma_fence_init(job->done_fence,
+ &panthor_xgs_fence_ops,
+ &queue->fence_ctx.lock,
+ queue->fence_ctx.id,
+ atomic64_inc_return(&queue->fence_ctx.seqno));
+
+ done_fence = dma_fence_get(job->done_fence);
+ panthor_xgs_job_get(&job->base);
+
+ list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
+
+ } else if (ret) {
+ done_fence = ERR_PTR(ret);
+ } else {
+ /* job completed immediately, no need to return fence */
+ done_fence = NULL;
+ }
+
+ mutex_unlock(&queue->lock);
+
+ return done_fence;
+}
+
+static enum drm_gpu_sched_stat
+panthor_xgs_job_timedout(struct drm_sched_job *sched_job)
+{
+ struct panthor_xgs_job *job = container_of(sched_job, struct panthor_xgs_job, base);
+ struct panthor_xgs_queue *queue = job->queue;
+ int ret;
+
+ mutex_lock(&queue->lock);
+
+ list_del_init(&job->node);
+
+ /* Ditch ref we took when adding it to the in_flight_jobs */
+ panthor_xgs_job_put(&job->base);
+
+ ret = panthor_xgs_try_run_job(job);
+ if (ret)
+ dma_fence_set_error(job->done_fence, -ETIMEDOUT);
+
+ mutex_unlock(&queue->lock);
+
+ dma_fence_signal(job->done_fence);
+
+ panthor_xgs_job_put(sched_job);
+
+ return DRM_GPU_SCHED_STAT_NOMINAL;
+}
+
+static const struct drm_sched_backend_ops panthor_xgs_sched_ops = {
+ .run_job = panthor_xgs_run_job,
+ .timedout_job = panthor_xgs_job_timedout,
+ .free_job = panthor_xgs_job_put,
+};
+
+int panthor_xgs_queue_create(struct panthor_file *pfile, u32 vm_id,
+ int eventfd_sync_update, u32 *handle)
+{
+ struct panthor_device *ptdev = pfile->ptdev;
+ struct panthor_xgs_queue_pool *xgs_queue_pool = pfile->xgs_queues;
+ struct panthor_xgs_queue *queue;
+ struct drm_gpu_scheduler *drm_sched;
+ int ret;
+ int qid;
+
+ queue = kzalloc(sizeof(*queue), GFP_KERNEL);
+ if (!queue)
+ return -ENOMEM;
+
+ kref_init(&queue->refcount);
+ INIT_LIST_HEAD(&queue->fence_ctx.in_flight_jobs);
+ INIT_WORK(&queue->release_work, xgs_queue_release_work);
+ queue->ptdev = ptdev;
+
+ ret = drmm_mutex_init(&ptdev->base, &queue->lock);
+ if (ret)
+ goto err_put_queue;
+
+ if (eventfd_sync_update >= 0) {
+ queue->eventfd_sync_update_ctx = eventfd_ctx_fdget(eventfd_sync_update);
+ ret = PTR_ERR_OR_ZERO(queue->eventfd_sync_update_ctx);
+ if (ret)
+ goto err_put_queue;
+ }
+
+ queue->vm = panthor_vm_pool_get_vm(pfile->vms, vm_id);
+ if (!queue->vm) {
+ ret = -EINVAL;
+ goto err_put_queue;
+ }
+
+ ret = drm_sched_init(&queue->scheduler, &panthor_xgs_sched_ops,
+ NULL, 1, 1, 0,
+ msecs_to_jiffies(JOB_TIMEOUT_MS),
+ NULL, NULL,
+ "panthor_xgs",
+ ptdev->base.dev);
+ if (ret)
+ goto err_put_queue;
+
+ drm_sched = &queue->scheduler;
+ ret = drm_sched_entity_init(&queue->entity, 0, &drm_sched, 1, NULL);
+ if (ret)
+ goto err_put_queue;
+
+ queue->fence_ctx.id = dma_fence_context_alloc(1);
+ spin_lock_init(&queue->fence_ctx.lock);
+
+ ret = xa_alloc(&xgs_queue_pool->xa, &qid, queue,
+ XA_LIMIT(1, MAX_XGS_QUEUES_PER_POOL), GFP_KERNEL);
+ if (ret)
+ goto err_put_queue;
+
+ queue->handle = qid;
+ *handle = qid;
+
+ return 0;
+
+err_put_queue:
+ panthor_xgs_queue_put(queue);
+ return ret;
+}
+
+int panthor_xgs_queue_destroy(struct panthor_file *pfile, u32 handle)
+{
+ struct panthor_xgs_queue_pool *pool = pfile->xgs_queues;
+ struct panthor_xgs_queue *queue;
+ struct panthor_xgs_job *job, *tmp;
+ LIST_HEAD(faulty_jobs);
+ int err = -ECANCELED;
+
+ queue = xa_erase(&pool->xa, handle);
+ if (!queue)
+ return -EINVAL;
+
+ queue->destroyed = true;
+
+ mutex_lock(&queue->lock);
+
+ list_for_each_entry_safe(job, tmp, &queue->fence_ctx.in_flight_jobs, node) {
+ list_move_tail(&job->node, &faulty_jobs);
+ dma_fence_set_error(job->done_fence, err);
+ dma_fence_signal(job->done_fence);
+ }
+
+ mutex_unlock(&queue->lock);
+
+ list_for_each_entry_safe(job, tmp, &faulty_jobs, node) {
+ list_del_init(&job->node);
+ /* Ditch ref we took when adding it to the in_flight_jobs */
+ panthor_xgs_job_put(&job->base);
+ }
+
+ panthor_xgs_queue_put(queue);
+
+ return 0;
+}
+
+void panthor_xgs_queue_put(struct panthor_xgs_queue *queue)
+{
+ if (queue)
+ kref_put(&queue->refcount, xgs_queue_release);
+}
+
+struct panthor_vm *panthor_xgs_queue_vm(struct panthor_xgs_queue *queue)
+{
+ return queue->vm;
+}
+
+int panthor_xgs_queue_pool_create(struct panthor_file *pfile)
+{
+ struct panthor_xgs_queue_pool *pool;
+
+ pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+ if (!pool)
+ return -ENOMEM;
+
+ xa_init_flags(&pool->xa, XA_FLAGS_ALLOC1);
+ pfile->xgs_queues = pool;
+ return 0;
+}
+
+void panthor_xgs_queue_pool_destroy(struct panthor_file *pfile)
+{
+ struct panthor_xgs_queue_pool *pool = pfile->xgs_queues;
+ struct panthor_xgs_queue *queue;
+ unsigned long i;
+
+ if (IS_ERR_OR_NULL(pool))
+ return;
+
+ xa_for_each(&pool->xa, i, queue)
+ panthor_xgs_queue_destroy(pfile, i);
+
+ xa_destroy(&pool->xa);
+ kfree(pool);
+ pfile->xgs_queues = NULL;
+}
+
+struct panthor_xgs_queue *panthor_xgs_queue_pool_get_xgs_queue(struct panthor_xgs_queue_pool *pool,
+ u32 handle)
+{
+ struct panthor_xgs_queue *queue;
+
+ queue = xgs_queue_get(xa_load(&pool->xa, handle));
+
+ return queue;
+}
+
+void panthor_xgs_queue_pool_recheck(struct panthor_file *ptfile)
+{
+ unsigned long i;
+ struct panthor_xgs_queue *queue;
+
+ xa_for_each(&ptfile->xgs_queues->xa, i, queue)
+ xgs_queue_recheck(queue);
+}
+
+struct drm_sched_job *panthor_xgs_job_create(struct panthor_xgs_queue *queue,
+ struct drm_panthor_xgs_op *ops, u32 ops_count)
+{
+ struct panthor_xgs_job *job;
+ int ret;
+
+ job = kzalloc(sizeof(*job), GFP_KERNEL);
+ if (!job)
+ return ERR_PTR(-ENOMEM);
+
+ kref_init(&job->refcount);
+ INIT_LIST_HEAD(&job->node);
+
+ job->queue = xgs_queue_get(queue);
+ if (!job->queue) {
+ ret = -EINVAL;
+ goto err_put_job;
+ }
+
+ job->done_fence = kzalloc(sizeof(*job->done_fence), GFP_KERNEL);
+ if (!job->done_fence) {
+ ret = -ENOMEM;
+ goto err_put_job;
+ }
+
+ ret = drm_sched_job_init(&job->base, &queue->entity, 1, queue);
+ if (ret)
+ goto err_put_job;
+
+ /* take ownership of ops array */
+ job->ops = ops;
+ job->ops_count = ops_count;
+
+ return &job->base;
+
+err_put_job:
+ panthor_xgs_job_put(&job->base);
+ return ERR_PTR(ret);
+}
+
+static void xgs_job_release(struct kref *ref)
+{
+ struct panthor_xgs_job *job = container_of(ref, struct panthor_xgs_job, refcount);
+
+ drm_WARN_ON(&job->queue->ptdev->base, !list_empty(&job->node));
+
+ if (job->base.s_fence)
+ drm_sched_job_cleanup(&job->base);
+
+ if (job->done_fence && job->done_fence->ops)
+ dma_fence_put(job->done_fence);
+ else
+ dma_fence_free(job->done_fence);
+
+ panthor_xgs_queue_put(job->queue);
+ kvfree(job->ops);
+ kfree(job);
+}
+
+struct drm_sched_job *panthor_xgs_job_get(struct drm_sched_job *sched_job)
+{
+ if (sched_job) {
+ struct panthor_xgs_job *job = container_of(sched_job, struct panthor_xgs_job, base);
+
+ kref_get(&job->refcount);
+ }
+
+ return sched_job;
+}
+
+void panthor_xgs_job_put(struct drm_sched_job *sched_job)
+{
+ struct panthor_xgs_job *job = container_of(sched_job, struct panthor_xgs_job, base);
+
+ if (sched_job)
+ kref_put(&job->refcount, xgs_job_release);
+}
+
+void panthor_xgs_job_update_resvs(struct drm_exec *exec, struct drm_sched_job *sched_job)
+{
+ struct panthor_xgs_job *job = container_of(sched_job, struct panthor_xgs_job, base);
+
+ panthor_vm_update_resvs(job->queue->vm, exec, &sched_job->s_fence->finished,
+ DMA_RESV_USAGE_BOOKKEEP, DMA_RESV_USAGE_WRITE);
+}
diff --git a/drivers/gpu/drm/panthor/panthor_xgs.h b/drivers/gpu/drm/panthor/panthor_xgs.h
new file mode 100644
index 000000000000..fa7dd5e5ef83
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_xgs.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2024 ARM Limited. All rights reserved. */
+
+#ifndef __PANTHOR_XGS_H__
+#define __PANTHOR_XGS_H__
+
+struct drm_exec;
+struct drm_panthor_xgs_op;
+struct drm_panthor_xgs_queue_create;
+struct drm_sched_job;
+struct panthor_xgs_queue;
+struct panthor_xgs_queue_pool;
+struct panthor_file;
+struct panthor_vm;
+
+int panthor_xgs_queue_create(struct panthor_file *pfile, u32 vm_id,
+ int eventfd_sync_update, u32 *handle);
+int panthor_xgs_queue_destroy(struct panthor_file *pfile, u32 handle);
+
+void panthor_xgs_queue_put(struct panthor_xgs_queue *queue);
+
+struct panthor_vm *panthor_xgs_queue_vm(struct panthor_xgs_queue *queue);
+
+int panthor_xgs_queue_pool_create(struct panthor_file *pfile);
+void panthor_xgs_queue_pool_destroy(struct panthor_file *pfile);
+
+struct panthor_xgs_queue *
+panthor_xgs_queue_pool_get_xgs_queue(struct panthor_xgs_queue_pool *pool, u32 handle);
+
+void panthor_xgs_queue_pool_recheck(struct panthor_file *ptfile);
+
+struct drm_sched_job *
+panthor_xgs_job_create(struct panthor_xgs_queue *queue,
+ struct drm_panthor_xgs_op *ops,
+ u32 ops_count);
+
+void panthor_xgs_job_put(struct drm_sched_job *sched_job);
+struct drm_sched_job *panthor_xgs_job_get(struct drm_sched_job *sched_job);
+
+void panthor_xgs_job_update_resvs(struct drm_exec *exec, struct drm_sched_job *sched_job);
+
+#endif
--
2.45.0
* [PATCH 7/8] drm/panthor: Add sync_update eventfd handling
2024-08-28 17:25 [RFC PATCH 00/10] drm/panthor: Add user submission Mihail Atanassov
` (5 preceding siblings ...)
2024-08-28 17:26 ` [PATCH 6/8] drm/panthor: Implement XGS queues Mihail Atanassov
@ 2024-08-28 17:26 ` Mihail Atanassov
2024-08-28 17:26 ` [PATCH 8/8] drm/panthor: Add SYNC_UPDATE ioctl Mihail Atanassov
` (2 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Mihail Atanassov @ 2024-08-28 17:26 UTC (permalink / raw)
To: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel, Mihail Atanassov
Expose the SYNC_UPDATE event to userspace so it can respond to changes
in syncobj state.
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
---
drivers/gpu/drm/panthor/panthor_sched.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 92172b2c6253..67c27fcc3345 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -643,6 +643,9 @@ struct panthor_group {
* panthor_group::groups::waiting list.
*/
struct list_head wait_node;
+
+ /** @eventfd_sync_update_ctx: eventfd context to signal on GPU_SYNC_UPDATE */
+ struct eventfd_ctx *eventfd_sync_update_ctx;
};
/**
@@ -797,6 +800,10 @@ static void group_release_work(struct work_struct *work)
panthor_kernel_bo_destroy(group->syncobjs);
panthor_vm_put(group->vm);
+
+ if (group->eventfd_sync_update_ctx)
+ eventfd_ctx_put(group->eventfd_sync_update_ctx);
+
kfree(group);
}
@@ -1501,6 +1508,9 @@ static void csg_slot_sync_update_locked(struct panthor_device *ptdev,
/* Rerun XGS jobs immediately, as this can potentially unblock the group */
panthor_xgs_queue_pool_recheck(group->pfile);
+ if (group->eventfd_sync_update_ctx)
+ eventfd_signal(group->eventfd_sync_update_ctx);
+
if (!group->user_submit)
group_queue_work(group, sync_upd);
}
@@ -3204,9 +3214,18 @@ int panthor_group_create(struct panthor_file *pfile,
INIT_WORK(&group->tiler_oom_work, group_tiler_oom_work);
INIT_WORK(&group->release_work, group_release_work);
- if (group_args->flags & DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT)
+ if (group_args->flags & DRM_PANTHOR_GROUP_CREATE_USER_SUBMIT) {
group->user_submit = true;
+ if (group_args->eventfd_sync_update >= 0) {
+ group->eventfd_sync_update_ctx = eventfd_ctx_fdget(
+ group_args->eventfd_sync_update);
+ ret = PTR_ERR_OR_ZERO(group->eventfd_sync_update_ctx);
+ if (ret)
+ goto err_put_group;
+ }
+ }
+
group->vm = panthor_vm_pool_get_vm(pfile->vms, group_args->vm_id);
if (!group->vm) {
ret = -EINVAL;
--
2.45.0
* [PATCH 8/8] drm/panthor: Add SYNC_UPDATE ioctl
2024-08-28 17:25 [RFC PATCH 00/10] drm/panthor: Add user submission Mihail Atanassov
` (6 preceding siblings ...)
2024-08-28 17:26 ` [PATCH 7/8] drm/panthor: Add sync_update eventfd handling Mihail Atanassov
@ 2024-08-28 17:26 ` Mihail Atanassov
2024-08-29 9:40 ` [RFC PATCH 00/10] drm/panthor: Add user submission Christian König
2024-09-03 12:27 ` Ketil Johnsen
9 siblings, 0 replies; 28+ messages in thread
From: Mihail Atanassov @ 2024-08-28 17:26 UTC (permalink / raw)
To: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel, Mihail Atanassov
From: Ketil Johnsen <ketil.johnsen@arm.com>
Just as the kernel driver notifies userspace of syncobj modifications, a
notification in the reverse direction is needed so panthor can
re-evaluate any queues (GPU or XGS) that were blocked on wait
operations.
Signed-off-by: Mihail Atanassov <mihail.atanassov@arm.com>
Signed-off-by: Ketil Johnsen <ketil.johnsen@arm.com>
---
drivers/gpu/drm/panthor/panthor_drv.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 681ac09b6343..1734a52dc3b5 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -1426,6 +1426,28 @@ static int panthor_ioctl_xgs_queue_submit(struct drm_device *ddev, void *data,
return ret;
}
+static int panthor_ioctl_sync_update(struct drm_device *ddev, void *data,
+ struct drm_file *file)
+{
+ struct drm_panthor_sync_update *args = data;
+ struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
+ struct panthor_file *pfile = file->driver_priv;
+
+ if (args->pad)
+ return -EINVAL;
+
+ if (pm_runtime_get_if_in_use(ddev->dev)) {
+ /* All sync wait ops will be re-evaluated when we ring the global doorbell */
+ gpu_write(ptdev, CSF_DOORBELL(0), 1);
+ pm_runtime_put_autosuspend(ddev->dev);
+ }
+
+ panthor_sched_sync_update(ptdev);
+ panthor_xgs_queue_pool_recheck(pfile);
+
+ return 0;
+}
+
static int
panthor_open(struct drm_device *ddev, struct drm_file *file)
{
@@ -1508,6 +1530,7 @@ static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
PANTHOR_IOCTL(XGS_QUEUE_CREATE, xgs_queue_create, DRM_RENDER_ALLOW),
PANTHOR_IOCTL(XGS_QUEUE_DESTROY, xgs_queue_destroy, DRM_RENDER_ALLOW),
PANTHOR_IOCTL(XGS_QUEUE_SUBMIT, xgs_queue_submit, DRM_RENDER_ALLOW),
+ PANTHOR_IOCTL(SYNC_UPDATE, sync_update, DRM_RENDER_ALLOW),
};
static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
--
2.45.0
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-08-28 17:25 [RFC PATCH 00/10] drm/panthor: Add user submission Mihail Atanassov
` (7 preceding siblings ...)
2024-08-28 17:26 ` [PATCH 8/8] drm/panthor: Add SYNC_UPDATE ioctl Mihail Atanassov
@ 2024-08-29 9:40 ` Christian König
2024-08-29 13:37 ` Steven Price
2024-08-29 13:48 ` Ketil Johnsen
2024-09-03 12:27 ` Ketil Johnsen
9 siblings, 2 replies; 28+ messages in thread
From: Christian König @ 2024-08-29 9:40 UTC (permalink / raw)
To: Mihail Atanassov, linux-kernel, Boris Brezillon, Liviu Dudau,
Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher, Xinhui Pan,
Shashank Sharma, Ketil Johnsen, Akash Goel
Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
> Hello all,
>
> This series implements a mechanism to expose Mali CSF GPUs' queue
> ringbuffers directly to userspace, along with paraphernalia to allow
> userspace to control job synchronisation between the CPU and GPU.
>
> The goal of these changes is to allow userspace to control work
> submission to the FW/HW directly without kernel intervention in the
> common case, thereby reducing context switching overhead. It also allows
> for greater flexibility in the way work is enqueued in the ringbufs.
> For example, the current kernel submit path only supports indirect
> calls, which is inefficient for small command buffers. Userspace can
> also skip unnecessary sync operations.
Question is how do you guarantee forward progress for fence signaling?
E.g. when are fences created and published? How do they signal?
How are dependencies handled? How can the kernel suspend a userspace queue?
How does memory management work in this case?
Regards,
Christian.
>
> This is still a work-in-progress, there's an outstanding issue with
> multiple processes using different submission flows triggering
> scheduling bugs (e.g. the same group getting scheduled twice), but we'd
> love to gather some feedback on the suitability of the approach in
> general and see if there's a clear path to merging something like this
> eventually.
>
> I've also CCd AMD maintainers because they have in the past done
> something similar[1], in case they want to chime in.
>
> There are two uses of this new uAPI in Mesa, one in gallium/panfrost
> (link TBD), and one in panvk [2].
>
> The Gallium implementation is a naïve change just to switch the
> submission model and exercise the new kernel code, and we don't plan
> on pursuing this at this time.
>
> The panvk driver changes are, however, a better representation of the
> intent behind this new uAPI, so please consider that as the reference
> userspace. It is still very much also a work in progress.
>
> * patch 1 adds all the uAPI changes;
> * patch 2 implements the GROUP_CREATE ioctl changes necessary to expose
> the required objects to userspace;
> * patch 3 maps the doorbell pages, similarly to how the user I/O page is
> mapped;
> * patch 4 implements GROUP_KICK, which lets userspace request an
> inactive group to be scheduled on the GPU;
> * patches 5 & 6 implement XGS queues, a way for userspace to
> synchronise GPU queue progress with DRM syncobjs;
> * patches 7 & 8 add notification mechanisms for user & kernel to signal
> changes to native GPU syncobjs.
>
> [1] https://lore.kernel.org/amd-gfx/CADnq5_N61q_o+5WYUZsZ=qu7VmeXTFHQSxLwTco05gLzHaiswA@mail.gmail.com/t/#m116a36a598d8fad1329e053974ad37a4dc0f28ed
> [2] https://gitlab.freedesktop.org/larsivsi/mesa/-/commits/panvk-v10-usersubmit?ref_type=heads
>
> Ketil Johnsen (7):
> drm/panthor: Add uAPI to submit from user space
> drm/panthor: Extend GROUP_CREATE for user submission
> drm/panthor: Map doorbell pages
> drm/panthor: Add GROUP_KICK ioctl
> drm/panthor: Factor out syncobj handling
> drm/panthor: Implement XGS queues
> drm/panthor: Add SYNC_UPDATE ioctl
>
> Mihail Atanassov (1):
> drm/panthor: Add sync_update eventfd handling
>
> drivers/gpu/drm/panthor/Makefile | 4 +-
> drivers/gpu/drm/panthor/panthor_device.c | 66 ++-
> drivers/gpu/drm/panthor/panthor_device.h | 35 +-
> drivers/gpu/drm/panthor/panthor_drv.c | 233 +++++++-
> drivers/gpu/drm/panthor/panthor_fw.c | 2 +-
> drivers/gpu/drm/panthor/panthor_sched.c | 408 +++++++++-----
> drivers/gpu/drm/panthor/panthor_sched.h | 8 +-
> drivers/gpu/drm/panthor/panthor_syncobj.c | 167 ++++++
> drivers/gpu/drm/panthor/panthor_syncobj.h | 27 +
> drivers/gpu/drm/panthor/panthor_xgs.c | 638 ++++++++++++++++++++++
> drivers/gpu/drm/panthor/panthor_xgs.h | 42 ++
> include/uapi/drm/panthor_drm.h | 243 +++++++-
> 12 files changed, 1696 insertions(+), 177 deletions(-)
> create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.h
> create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.c
> create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.h
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-08-29 9:40 ` [RFC PATCH 00/10] drm/panthor: Add user submission Christian König
@ 2024-08-29 13:37 ` Steven Price
[not found] ` <96ef7ae3-4df1-4859-8672-453055bbfe96@amd.com>
2024-08-29 13:48 ` Ketil Johnsen
1 sibling, 1 reply; 28+ messages in thread
From: Steven Price @ 2024-08-29 13:37 UTC (permalink / raw)
To: Christian König, Mihail Atanassov, linux-kernel,
Boris Brezillon, Liviu Dudau
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher, Xinhui Pan,
Shashank Sharma, Ketil Johnsen, Akash Goel
Hi Christian,
Mihail should be able to give more definitive answers, but I think I can
answer your questions.
On 29/08/2024 10:40, Christian König wrote:
> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
>> Hello all,
>>
>> This series implements a mechanism to expose Mali CSF GPUs' queue
>> ringbuffers directly to userspace, along with paraphernalia to allow
>> userspace to control job synchronisation between the CPU and GPU.
>>
>> The goal of these changes is to allow userspace to control work
>> submission to the FW/HW directly without kernel intervention in the
>> common case, thereby reducing context switching overhead. It also allows
>> for greater flexibility in the way work is enqueued in the ringbufs.
>> For example, the current kernel submit path only supports indirect
>> calls, which is inefficient for small command buffers. Userspace can
>> also skip unnecessary sync operations.
> Question is how do you guarantee forward progress for fence signaling?
A timeout. Although looking at it I think it's probably set too high
currently:
> +#define JOB_TIMEOUT_MS 5000
But basically the XGS queue is a DRM scheduler just like a normal GPU
queue and the jobs have a timeout. If the timeout is hit then any fences
will be signalled (with an error).
> E.g. when are fences created and published? How do they signal?
>
> How are dependencies handled? How can the kernel suspend a userspace
> queue?
The actual userspace queue can be suspended. This is actually a
combination of firmware and kernel driver, and this functionality is
already present without the user submission. The firmware will multiplex
multiple 'groups' onto the hardware, and if there are too many for the
firmware then the kernel multiplexes the extra groups onto the ones the
firmware supports.
I haven't studied Mihail's series in detail yet, but if I understand
correctly, the XGS queues are handled separately and are not suspended
when the hardware queues are suspended. I guess this might be an area
for improvement and might explain the currently very high timeout (to
deal with the case where the actual GPU work has been suspended).
> How does memory management work in this case?
I'm not entirely sure what you mean here. If you are referring to the
potential memory issues with signalling path then this should be handled
by the timeout - although I haven't studied the code to check for bugs here.
The actual new XGS queues don't allocate/free memory during the queue
execution - so it's just the memory usage related to fences (and the
other work which could be blocked on the fence).
In terms of memory management for the GPU work itself, this is handled
the same as before. The VM_BIND mechanism allows dependencies to be
created between syncobjs and VM operations, with XGS these can then be
tied to GPU HW events.
Fundamentally (modulo bugs) there is little change compared to kernel
submission - it's already fairly trivial to write a GPU job which will
make no forward progress (a 'while (1)' equivalent job). The only
difference here is that XGS makes this 'easy' and doesn't involve the
GPU spinning. Either way we rely on a timeout to recover from these
situations.
Thanks,
Steve
> [snip]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-08-29 9:40 ` [RFC PATCH 00/10] drm/panthor: Add user submission Christian König
2024-08-29 13:37 ` Steven Price
@ 2024-08-29 13:48 ` Ketil Johnsen
1 sibling, 0 replies; 28+ messages in thread
From: Ketil Johnsen @ 2024-08-29 13:48 UTC (permalink / raw)
To: Christian König, Mihail Atanassov, linux-kernel,
Boris Brezillon, Liviu Dudau, Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher, Xinhui Pan,
Shashank Sharma, Akash Goel
On 29/08/2024 11:40, Christian König wrote:
> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
>> Hello all,
>>
>> This series implements a mechanism to expose Mali CSF GPUs' queue
>> ringbuffers directly to userspace, along with paraphernalia to allow
>> userspace to control job synchronisation between the CPU and GPU.
>>
>> The goal of these changes is to allow userspace to control work
>> submission to the FW/HW directly without kernel intervention in the
>> common case, thereby reducing context switching overhead. It also allows
>> for greater flexibility in the way work is enqueued in the ringbufs.
>> For example, the current kernel submit path only supports indirect
>> calls, which is inefficient for small command buffers. Userspace can
>> also skip unnecessary sync operations.
>
> Question is how do you guarantee forward progress for fence signaling?
>
> E.g. when are fences created and published? How do they signal?
The current XGS queue is built upon an instance of the DRM scheduler, and
XGS jobs which cannot complete immediately are assigned a fence (as one
would expect). This proposal relies on the DRM scheduler timeout to ensure
forward progress if user space has encoded a "bad" stream of commands.
PS: We have tried to account for error propagation in case of timeouts,
but the implementation in this area is most likely incomplete at the
moment (not tested).
> How are dependencies handled? How can the kernel suspend a userspace
> queue?
Mali FW will send IDLE interrupts when a group has become idle. This is
already (mostly) handled in Panthor today.
There is of course a race to handle here, between the GPU going IDLE and
user space submitting new work.
I'm actually working on this part right now. As this patchset stands, the
RTPM suspend callback doesn't check whether user space has managed to
submit more work in the timeframe between "IDLE" and the callback. I just
need to correctly "unbind" the group/queues, unmap the I/O pages used,
and abort the RTPM suspend if I detect that user space has managed to
submit more work.
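That race can be reduced to a small model in C: the suspend callback must abort if any submission arrived after the IDLE event. All names here are illustrative, not Panthor's, and the single atomic counter stands in for the real doorbell/ringbuffer state.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Toy model of the IDLE-vs-new-work race in the RTPM suspend path. */
struct toy_group {
	atomic_uint pending;	/* doorbell rings since the last drain */
};

/* Userspace path: write the ringbuffer, then ring the doorbell. */
static void toy_user_submit(struct toy_group *g)
{
	atomic_fetch_add(&g->pending, 1);
}

/* Called when the FW reports the group idle and the work is drained. */
static void toy_drain(struct toy_group *g)
{
	atomic_store(&g->pending, 0);
}

/* RTPM suspend callback: abort with -EBUSY if work snuck in between
 * the IDLE event and this callback, so a queue is never suspended
 * with unprocessed submissions. */
static int toy_rtpm_suspend(struct toy_group *g)
{
	if (atomic_load(&g->pending) != 0)
		return -EBUSY;
	return 0;	/* safe to unmap I/O pages and power down */
}
```

Returning `-EBUSY` from a runtime-PM suspend callback keeps the device active, which is exactly the "abort the RTPM suspend" behaviour described above.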
> How does memory management work in this case?
Not sure exactly what you refer to here. There has basically been no
change to how we handle memory.
If you think of how GPU jobs and VM_BIND interact, then the change here
is that it is now the XGS job and VM_BIND which interact. XGS takes on
the same duties as the GPU jobs with kernel submission (in this regard).
Actually, if you look at the submission flow for GPU jobs and XGS jobs, you
will find that they are virtually identical when it comes to setting up
fences and dependencies. Same goes for VM_BIND jobs.
--
Regards,
Ketil
> [snip]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-08-28 17:25 [RFC PATCH 00/10] drm/panthor: Add user submission Mihail Atanassov
` (8 preceding siblings ...)
2024-08-29 9:40 ` [RFC PATCH 00/10] drm/panthor: Add user submission Christian König
@ 2024-09-03 12:27 ` Ketil Johnsen
9 siblings, 0 replies; 28+ messages in thread
From: Ketil Johnsen @ 2024-09-03 12:27 UTC (permalink / raw)
To: Mihail Atanassov, linux-kernel, Boris Brezillon, Liviu Dudau,
Steven Price
Cc: dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Akash Goel
On 28.08.2024 19:25, Mihail Atanassov wrote:
> Hello all,
>
> This series implements a mechanism to expose Mali CSF GPUs' queue
> ringbuffers directly to userspace, along with paraphernalia to allow
> userspace to control job synchronisation between the CPU and GPU.
>
> The goal of these changes is to allow userspace to control work
> submission to the FW/HW directly without kernel intervention in the
> common case, thereby reducing context switching overhead. It also allows
> for greater flexibility in the way work is enqueued in the ringbufs.
> For example, the current kernel submit path only supports indirect
> calls, which is inefficient for small command buffers. Userspace can
> also skip unnecessary sync operations.
>
> This is still a work-in-progress, there's an outstanding issue with
> multiple processes using different submission flows triggering
> scheduling bugs (e.g. the same group getting scheduled twice), but we'd
> love to gather some feedback on the suitability of the approach in
> general and see if there's a clear path to merging something like this
> eventually.
>
> I've also CCd AMD maintainers because they have in the past done
> something similar[1], in case they want to chime in.
>
> There are two uses of this new uAPI in Mesa, one in gallium/panfrost
> (link TBD), and one in panvk [2].
Gallium/Panfrost changes to make use of this new user submission API can
now be found here:
https://gitlab.freedesktop.org/ketil.johnsen/mesa/-/commits/panthor_usersubmit/?ref_type=heads
It is worth repeating that this is just a dumb switch from kernel
submission to user submission for the Panfrost Gallium driver, with no optimizations
attempted. We have no concrete plans to pursue upstreaming of this for
the time being. Panvk is our focus in that regard.
--
Regards,
Ketil Johnsen
> [snip]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 6/8] drm/panthor: Implement XGS queues
2024-08-28 17:26 ` [PATCH 6/8] drm/panthor: Implement XGS queues Mihail Atanassov
@ 2024-09-03 19:17 ` Simona Vetter
0 siblings, 0 replies; 28+ messages in thread
From: Simona Vetter @ 2024-09-03 19:17 UTC (permalink / raw)
To: Mihail Atanassov
Cc: linux-kernel, Boris Brezillon, Liviu Dudau, Steven Price,
dri-devel, Daniel Vetter, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Christian König, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel
On Wed, Aug 28, 2024 at 06:26:02PM +0100, Mihail Atanassov wrote:
> +int panthor_xgs_queue_create(struct panthor_file *pfile, u32 vm_id,
> + int eventfd_sync_update, u32 *handle)
> +{
> + struct panthor_device *ptdev = pfile->ptdev;
> + struct panthor_xgs_queue_pool *xgs_queue_pool = pfile->xgs_queues;
> + struct panthor_xgs_queue *queue;
> + struct drm_gpu_scheduler *drm_sched;
> + int ret;
> + int qid;
> +
> + queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> + if (!queue)
> + return -ENOMEM;
> +
> + kref_init(&queue->refcount);
> + INIT_LIST_HEAD(&queue->fence_ctx.in_flight_jobs);
> + INIT_WORK(&queue->release_work, xgs_queue_release_work);
> + queue->ptdev = ptdev;
> +
> + ret = drmm_mutex_init(&ptdev->base, &queue->lock);
This is guaranteed buggy, because you kzalloc the queue, with its own
refcount, but then tie the mutex cleanup to the entirely different
lifetime of the drm_device.
Just spotted this while reading around.
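The fix implied here is to tie the mutex's lifetime to the refcounted object itself, destroying it in the object's release path rather than via `drmm_mutex_init()` (whose cleanup runs at `drm_device` teardown, an unrelated lifetime). A minimal userspace-C model of that pattern, with a plain counter standing in for `struct kref` and all names illustrative:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Toy refcounted queue: the lock is created and destroyed together
 * with the object, never tied to some other object's lifetime. */
struct toy_queue {
	int refcount;		/* stand-in for struct kref */
	pthread_mutex_t lock;	/* stand-in for a kernel mutex */
};

static struct toy_queue *toy_queue_create(void)
{
	struct toy_queue *q = calloc(1, sizeof(*q));

	if (!q)
		return NULL;
	q->refcount = 1;
	pthread_mutex_init(&q->lock, NULL);	/* init with the object... */
	return q;
}

static void toy_queue_get(struct toy_queue *q)
{
	q->refcount++;
}

static void toy_queue_put(struct toy_queue *q)
{
	if (--q->refcount == 0) {
		pthread_mutex_destroy(&q->lock);	/* ...destroy with it */
		free(q);
	}
}
```

With `drmm_mutex_init()`, by contrast, the destroy would run when the `drm_device` is torn down, which can happen long after (or before) the last queue reference is dropped.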
-Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
[not found] ` <96ef7ae3-4df1-4859-8672-453055bbfe96@amd.com>
@ 2024-09-03 21:11 ` Simona Vetter
2024-09-04 7:49 ` Christian König
0 siblings, 1 reply; 28+ messages in thread
From: Simona Vetter @ 2024-09-03 21:11 UTC (permalink / raw)
To: Christian König
Cc: Steven Price, Mihail Atanassov, linux-kernel, Boris Brezillon,
Liviu Dudau, dri-devel, Daniel Vetter, David Airlie,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Xinhui Pan, Shashank Sharma, Ketil Johnsen, Akash Goel
On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
> Hi Steven,
>
> Am 29.08.24 um 15:37 schrieb Steven Price:
> > Hi Christian,
> >
> > Mihail should be able to give more definitive answers, but I think I can
> > answer your questions.
> >
> > On 29/08/2024 10:40, Christian König wrote:
> > > Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
> > > > Hello all,
> > > >
> > > > This series implements a mechanism to expose Mali CSF GPUs' queue
> > > > ringbuffers directly to userspace, along with paraphernalia to allow
> > > > userspace to control job synchronisation between the CPU and GPU.
> > > >
> > > > The goal of these changes is to allow userspace to control work
> > > > submission to the FW/HW directly without kernel intervention in the
> > > > common case, thereby reducing context switching overhead. It also allows
> > > > for greater flexibility in the way work is enqueued in the ringbufs.
> > > > For example, the current kernel submit path only supports indirect
> > > > calls, which is inefficient for small command buffers. Userspace can
> > > > also skip unnecessary sync operations.
> > > Question is how do you guarantee forward progress for fence signaling?
> > A timeout. Although looking at it I think it's probably set too high
> > currently:
> >
> > > +#define JOB_TIMEOUT_MS 5000
> > But basically the XGS queue is a DRM scheduler just like a normal GPU
> > queue and the jobs have a timeout. If the timeout is hit then any fences
> > will be signalled (with an error).
>
> Mhm, that is unfortunately exactly what I feared.
>
> > > E.g. when are fences created and published? How do they signal?
> > >
> > > How are dependencies handled? How can the kernel suspend a userspace
> > > queue?
> > The actual userspace queue can be suspended. This is actually a
> > combination of firmware and kernel driver, and this functionality is
> > already present without the user submission. The firmware will multiplex
> > multiple 'groups' onto the hardware, and if there are too many for the
> > firmware then the kernel multiplexes the extra groups onto the ones the
> > firmware supports.
>
> How do you guarantee forward progress and that resuming of suspended queues
> doesn't end up in a circular dependency?
>
> > I haven't studied Mihail's series in detail yet, but if I understand
> > correctly, the XGS queues are handled separately and are not suspended
> > when the hardware queues are suspended. I guess this might be an area
> > for improvement and might explain the currently very high timeout (to
> > deal with the case where the actual GPU work has been suspended).
> >
> > > How does memory management work in this case?
> > I'm not entirely sure what you mean here. If you are referring to the
> > potential memory issues with signalling path then this should be handled
> > by the timeout - although I haven't studied the code to check for bugs here.
>
> You might have misunderstood my question (and I might misunderstand the
> code), but at first glance it strongly sounds like the current approach will
> be NAKed.
>
> > The actual new XGS queues don't allocate/free memory during the queue
> > execution - so it's just the memory usage related to fences (and the
> > other work which could be blocked on the fence).
>
> But the kernel and the hardware could suspend the queues, right?
>
> > In terms of memory management for the GPU work itself, this is handled
> > the same as before. The VM_BIND mechanism allows dependencies to be
> > created between syncobjs and VM operations, with XGS these can then be
> > tied to GPU HW events.
>
> I don't know the details, but that again strongly sounds like that won't
> work.
>
> What you need is to somehow guarantee that work doesn't run into memory
> management deadlocks which are resolved by timeouts.
>
> Please read up here on why that stuff isn't allowed: https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
panthor doesn't yet have a shrinker, so all memory is pinned, which means
memory management easy mode.
But also this means there might be an uapi design bug in here, and we
really don't want to commit to that. My stance is that panthor should gain
a proper shrinker first, which means there will be some changes here too.
And then we can properly evaluate this. As-is it's a bit too much on the
toy end of things.
That aside, I've thought this all through with tons of people, and I do
think it's all possible.
-Sima
>
> Regards,
> Christian.
>
> >
> >
> > Fundamentally (modulo bugs) there is little change compared to kernel
> > submission - it's already fairly trivial to write GPU job which will
> > make no forward progress (a 'while (1)' equivalent job). The only
> > difference here is that XGS makes this 'easy' and doesn't involve the
> > GPU spinning. Either way we rely on a timeout to recover from these
> > situations.
> >
> > Thanks,
> > Steve
> >
> > > Regards,
> > > Christian.
> > >
> > > > This is still a work-in-progress, there's an outstanding issue with
> > > > multiple processes using different submission flows triggering
> > > > scheduling bugs (e.g. the same group getting scheduled twice), but we'd
> > > > love to gather some feedback on the suitability of the approach in
> > > > general and see if there's a clear path to merging something like this
> > > > eventually.
> > > >
> > > > I've also CCd AMD maintainers because they have in the past done
> > > > something similar[1], in case they want to chime in.
> > > >
> > > > There are two uses of this new uAPI in Mesa, one in gallium/panfrost
> > > > (link TBD), and one in panvk [2].
> > > >
> > > > The Gallium implementation is a naïve change just to switch the
> > > > submission model and exercise the new kernel code, and we don't plan
> > > > on pursuing this at this time.
> > > >
> > > > The panvk driver changes are, however, a better representation of the
> > > > intent behind this new uAPI, so please consider that as the reference
> > > > userspace. It is still very much also a work in progress.
> > > >
> > > > * patch 1 adds all the uAPI changes;
> > > > * patch 2 implements the GROUP_CREATE ioctl changes necessary to expose
> > > > the required objects to userspace;
> > > > * patch 3 maps the doorbell pages, similarly to how the user I/O
> > > > page is
> > > > mapped;
> > > > * patch 4 implements GROUP_KICK, which lets userspace request an
> > > > inactive group to be scheduled on the GPU;
> > > > * patches 5 & 6 implement XGS queues, a way for userspace to
> > > > synchronise GPU queue progress with DRM syncobjs;
> > > > * patches 7 & 8 add notification mechanisms for user & kernel to signal
> > > > changes to native GPU syncobjs.
> > > >
> > > > [1]
> > > > https://lore.kernel.org/amd-gfx/CADnq5_N61q_o+5WYUZsZ=qu7VmeXTFHQSxLwTco05gLzHaiswA@mail.gmail.com/t/#m116a36a598d8fad1329e053974ad37a4dc0f28ed
> > > > [2]
> > > > https://gitlab.freedesktop.org/larsivsi/mesa/-/commits/panvk-v10-usersubmit?ref_type=heads
> > > >
> > > > Ketil Johnsen (7):
> > > > drm/panthor: Add uAPI to submit from user space
> > > > drm/panthor: Extend GROUP_CREATE for user submission
> > > > drm/panthor: Map doorbell pages
> > > > drm/panthor: Add GROUP_KICK ioctl
> > > > drm/panthor: Factor out syncobj handling
> > > > drm/panthor: Implement XGS queues
> > > > drm/panthor: Add SYNC_UPDATE ioctl
> > > >
> > > > Mihail Atanassov (1):
> > > > drm/panthor: Add sync_update eventfd handling
> > > >
> > > > drivers/gpu/drm/panthor/Makefile | 4 +-
> > > > drivers/gpu/drm/panthor/panthor_device.c | 66 ++-
> > > > drivers/gpu/drm/panthor/panthor_device.h | 35 +-
> > > > drivers/gpu/drm/panthor/panthor_drv.c | 233 +++++++-
> > > > drivers/gpu/drm/panthor/panthor_fw.c | 2 +-
> > > > drivers/gpu/drm/panthor/panthor_sched.c | 408 +++++++++-----
> > > > drivers/gpu/drm/panthor/panthor_sched.h | 8 +-
> > > > drivers/gpu/drm/panthor/panthor_syncobj.c | 167 ++++++
> > > > drivers/gpu/drm/panthor/panthor_syncobj.h | 27 +
> > > > drivers/gpu/drm/panthor/panthor_xgs.c | 638 ++++++++++++++++++++++
> > > > drivers/gpu/drm/panthor/panthor_xgs.h | 42 ++
> > > > include/uapi/drm/panthor_drm.h | 243 +++++++-
> > > > 12 files changed, 1696 insertions(+), 177 deletions(-)
> > > > create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.c
> > > > create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.h
> > > > create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.c
> > > > create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.h
> > > >
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-09-03 21:11 ` Simona Vetter
@ 2024-09-04 7:49 ` Christian König
2024-09-04 9:31 ` Steven Price
2024-09-04 9:43 ` Simona Vetter
0 siblings, 2 replies; 28+ messages in thread
From: Christian König @ 2024-09-04 7:49 UTC (permalink / raw)
To: Steven Price, Mihail Atanassov, linux-kernel, Boris Brezillon,
Liviu Dudau, dri-devel, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher, Xinhui Pan,
Shashank Sharma, Ketil Johnsen, Akash Goel
Am 03.09.24 um 23:11 schrieb Simona Vetter:
> On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
>> Hi Steven,
>>
>> Am 29.08.24 um 15:37 schrieb Steven Price:
>>> Hi Christian,
>>>
>>> Mihail should be able to give more definitive answers, but I think I can
>>> answer your questions.
>>>
>>> On 29/08/2024 10:40, Christian König wrote:
>>>> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
>>>>> Hello all,
>>>>>
>>>>> This series implements a mechanism to expose Mali CSF GPUs' queue
>>>>> ringbuffers directly to userspace, along with paraphernalia to allow
>>>>> userspace to control job synchronisation between the CPU and GPU.
>>>>>
>>>>> The goal of these changes is to allow userspace to control work
>>>>> submission to the FW/HW directly without kernel intervention in the
>>>>> common case, thereby reducing context switching overhead. It also allows
>>>>> for greater flexibility in the way work is enqueued in the ringbufs.
>>>>> For example, the current kernel submit path only supports indirect
>>>>> calls, which is inefficient for small command buffers. Userspace can
>>>>> also skip unnecessary sync operations.
>>>> Question is how do you guarantee forward progress for fence signaling?
>>> A timeout. Although looking at it I think it's probably set too high
>>> currently:
>>>
>>>> +#define JOB_TIMEOUT_MS 5000
>>> But basically the XGS queue is a DRM scheduler just like a normal GPU
>>> queue and the jobs have a timeout. If the timeout is hit then any fences
>>> will be signalled (with an error).
>> Mhm, that is unfortunately exactly what I feared.
>>
>>>> E.g. when are fences created and published? How do they signal?
>>>>
>>>> How are dependencies handled? How can the kernel suspend an userspace
>>>> queue?
>>> The actual userspace queue can be suspended. This is actually a
>>> combination of firmware and kernel driver, and this functionality is
>>> already present without the user submission. The firmware will multiplex
>>> multiple 'groups' onto the hardware, and if there are too many for the
>>> firmware then the kernel multiplexes the extra groups onto the ones the
>>> firmware supports.
>> How do you guarantee forward progress and that resuming of suspended queues
>> doesn't end up in a circle dependency?
>>
>>> I haven't studied Mihail's series in detail yet, but if I understand
>>> correctly, the XGS queues are handled separately and are not suspended
>>> when the hardware queues are suspended. I guess this might be an area
>>> for improvement and might explain the currently very high timeout (to
>>> deal with the case where the actual GPU work has been suspended).
>>>
>>>> How does memory management work in this case?
>>> I'm not entirely sure what you mean here. If you are referring to the
>>> potential memory issues with signalling path then this should be handled
>>> by the timeout - although I haven't studied the code to check for bugs here.
>> You might have misunderstood my question (and I might misunderstand the
>> code), but on first glance it strongly sounds like the current approach will
>> be NAKed.
>>
>>> The actual new XGS queues don't allocate/free memory during the queue
>>> execution - so it's just the memory usage related to fences (and the
>>> other work which could be blocked on the fence).
>> But the kernel and the hardware could suspend the queues, right?
>>
>>> In terms of memory management for the GPU work itself, this is handled
>>> the same as before. The VM_BIND mechanism allows dependencies to be
>>> created between syncobjs and VM operations, with XGS these can then be
>>> tied to GPU HW events.
>> I don't know the details, but that again strongly sounds like that won't
>> work.
>>
>> What you need is to somehow guarantee that work doesn't run into memory
>> management deadlocks which are resolved by timeouts.
>>
>> Please read up here on why that stuff isn't allowed: https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> memory management easy mode.
Ok, that at least makes things work for the moment.
> But also this means there might be an uapi design bug in here, and we
> really don't want to commit to that. My stance is that panthor should gain
> a proper shrinker first, which means there will be some changes here too.
> And then we can properly evaluate this. As-is it's a bit too much on the
> toy end of things.
I wouldn't say toy end, it looks rather fleshed out to me.
But I agree that the people who design the UAPI need to be aware of the
restrictions.
>
> That aside, I've thought this all through with tons of people, and I do
> think it's all possible.
It's certainly possible, we have user queue patches for amdgpu in the
pipeline as well.
It's just really really really hard to get right without creating some
circular dependencies and deadlocks in between.
If I got free beer every time somebody came up with a broken
dma_fence design, I would probably end up an alcoholic without spending a
single penny.
Christian.
> -Sima
>
>> Regards,
>> Christian.
>>
>>>
>>> Fundamentally (modulo bugs) there is little change compared to kernel
>>> submission - it's already fairly trivial to write GPU job which will
>>> make no forward progress (a 'while (1)' equivalent job). The only
>>> difference here is that XGS makes this 'easy' and doesn't involve the
>>> GPU spinning. Either way we rely on a timeout to recover from these
>>> situations.
>>>
>>> Thanks,
>>> Steve
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> This is still a work-in-progress, there's an outstanding issue with
>>>>> multiple processes using different submission flows triggering
>>>>> scheduling bugs (e.g. the same group getting scheduled twice), but we'd
>>>>> love to gather some feedback on the suitability of the approach in
>>>>> general and see if there's a clear path to merging something like this
>>>>> eventually.
>>>>>
>>>>> I've also CCd AMD maintainers because they have in the past done
>>>>> something similar[1], in case they want to chime in.
>>>>>
>>>>> There are two uses of this new uAPI in Mesa, one in gallium/panfrost
>>>>> (link TBD), and one in panvk [2].
>>>>>
>>>>> The Gallium implementation is a naïve change just to switch the
>>>>> submission model and exercise the new kernel code, and we don't plan
>>>>> on pursuing this at this time.
>>>>>
>>>>> The panvk driver changes are, however, a better representation of the
>>>>> intent behind this new uAPI, so please consider that as the reference
>>>>> userspace. It is still very much also a work in progress.
>>>>>
>>>>> * patch 1 adds all the uAPI changes;
>>>>> * patch 2 implements the GROUP_CREATE ioctl changes necessary to expose
>>>>> the required objects to userspace;
>>>>> * patch 3 maps the doorbell pages, similarly to how the user I/O
>>>>> page is
>>>>> mapped;
>>>>> * patch 4 implements GROUP_KICK, which lets userspace request an
>>>>> inactive group to be scheduled on the GPU;
>>>>> * patches 5 & 6 implement XGS queues, a way for userspace to
>>>>> synchronise GPU queue progress with DRM syncobjs;
>>>>> * patches 7 & 8 add notification mechanisms for user & kernel to signal
>>>>> changes to native GPU syncobjs.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/amd-gfx/CADnq5_N61q_o+5WYUZsZ=qu7VmeXTFHQSxLwTco05gLzHaiswA@mail.gmail.com/t/#m116a36a598d8fad1329e053974ad37a4dc0f28ed
>>>>> [2]
>>>>> https://gitlab.freedesktop.org/larsivsi/mesa/-/commits/panvk-v10-usersubmit?ref_type=heads
>>>>>
>>>>> Ketil Johnsen (7):
>>>>> drm/panthor: Add uAPI to submit from user space
>>>>> drm/panthor: Extend GROUP_CREATE for user submission
>>>>> drm/panthor: Map doorbell pages
>>>>> drm/panthor: Add GROUP_KICK ioctl
>>>>> drm/panthor: Factor out syncobj handling
>>>>> drm/panthor: Implement XGS queues
>>>>> drm/panthor: Add SYNC_UPDATE ioctl
>>>>>
>>>>> Mihail Atanassov (1):
>>>>> drm/panthor: Add sync_update eventfd handling
>>>>>
>>>>> drivers/gpu/drm/panthor/Makefile | 4 +-
>>>>> drivers/gpu/drm/panthor/panthor_device.c | 66 ++-
>>>>> drivers/gpu/drm/panthor/panthor_device.h | 35 +-
>>>>> drivers/gpu/drm/panthor/panthor_drv.c | 233 +++++++-
>>>>> drivers/gpu/drm/panthor/panthor_fw.c | 2 +-
>>>>> drivers/gpu/drm/panthor/panthor_sched.c | 408 +++++++++-----
>>>>> drivers/gpu/drm/panthor/panthor_sched.h | 8 +-
>>>>> drivers/gpu/drm/panthor/panthor_syncobj.c | 167 ++++++
>>>>> drivers/gpu/drm/panthor/panthor_syncobj.h | 27 +
>>>>> drivers/gpu/drm/panthor/panthor_xgs.c | 638 ++++++++++++++++++++++
>>>>> drivers/gpu/drm/panthor/panthor_xgs.h | 42 ++
>>>>> include/uapi/drm/panthor_drm.h | 243 +++++++-
>>>>> 12 files changed, 1696 insertions(+), 177 deletions(-)
>>>>> create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.c
>>>>> create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.h
>>>>> create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.c
>>>>> create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.h
>>>>>
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-09-04 7:49 ` Christian König
@ 2024-09-04 9:31 ` Steven Price
2024-09-04 11:23 ` Boris Brezillon
2024-09-04 9:43 ` Simona Vetter
1 sibling, 1 reply; 28+ messages in thread
From: Steven Price @ 2024-09-04 9:31 UTC (permalink / raw)
To: Christian König, Mihail Atanassov, linux-kernel,
Boris Brezillon, Liviu Dudau, dri-devel, David Airlie,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Xinhui Pan, Shashank Sharma, Ketil Johnsen, Akash Goel
On 04/09/2024 08:49, Christian König wrote:
> Am 03.09.24 um 23:11 schrieb Simona Vetter:
>> On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
>>> Hi Steven,
>>>
>>> Am 29.08.24 um 15:37 schrieb Steven Price:
>>>> Hi Christian,
>>>>
>>>> Mihail should be able to give more definitive answers, but I think I
>>>> can
>>>> answer your questions.
>>>>
>>>> On 29/08/2024 10:40, Christian König wrote:
>>>>> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
>>>>>> Hello all,
>>>>>>
>>>>>> This series implements a mechanism to expose Mali CSF GPUs' queue
>>>>>> ringbuffers directly to userspace, along with paraphernalia to allow
>>>>>> userspace to control job synchronisation between the CPU and GPU.
>>>>>>
>>>>>> The goal of these changes is to allow userspace to control work
>>>>>> submission to the FW/HW directly without kernel intervention in the
>>>>>> common case, thereby reducing context switching overhead. It also
>>>>>> allows
>>>>>> for greater flexibility in the way work is enqueued in the ringbufs.
>>>>>> For example, the current kernel submit path only supports indirect
>>>>>> calls, which is inefficient for small command buffers. Userspace can
>>>>>> also skip unnecessary sync operations.
>>>>> Question is how do you guarantee forward progress for fence signaling?
>>>> A timeout. Although looking at it I think it's probably set too high
>>>> currently:
>>>>
>>>>> +#define JOB_TIMEOUT_MS 5000
>>>> But basically the XGS queue is a DRM scheduler just like a normal GPU
>>>> queue and the jobs have a timeout. If the timeout is hit then any
>>>> fences
>>>> will be signalled (with an error).
>>> Mhm, that is unfortunately exactly what I feared.
>>>
>>>>> E.g. when are fences created and published? How do they signal?
>>>>>
>>>>> How are dependencies handled? How can the kernel suspend an userspace
>>>>> queue?
>>>> The actual userspace queue can be suspended. This is actually a
>>>> combination of firmware and kernel driver, and this functionality is
>>>> already present without the user submission. The firmware will
>>>> multiplex
>>>> multiple 'groups' onto the hardware, and if there are too many for the
>>>> firmware then the kernel multiplexes the extra groups onto the ones the
>>>> firmware supports.
>>> How do you guarantee forward progress and that resuming of suspended
>>> queues
>>> doesn't end up in a circle dependency?
I'm not entirely sure what you mean by "guarantee" here - the kernel by
itself only guarantees forward progress by means of timeouts. User
space can 'easily' shoot itself in the foot by using an XGS queue to
block waiting on a GPU event which will never happen.
However, dependencies between applications (and/or other device drivers)
will only occur via dma fences, and an unsignalled fence will only be
returned when there is a path forward to signal it. So it shouldn't be
possible to create a dependency loop between contexts (or command stream
groups, to use the Mali jargon).
Because the groups can't have dependency cycles, it should be possible to
suspend/resume them without deadlocks.
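To make the timeout guarantee concrete, here's a small userspace C model of what's described above (all names are invented for this sketch; this is not panthor or drm_sched code): a fence is signalled either by the GPU event arriving or, failing that, by the job timeout firing with an error, so a wait always terminates.

```c
#include <stdbool.h>

struct model_fence {
    bool signalled;
    int  error;  /* 0 on success, negative when force-signalled */
};

enum { MODEL_JOB_TIMEOUT_MS = 5000 };

/* The GPU event the XGS job was waiting on actually happened. */
static void model_fence_signal(struct model_fence *f)
{
    if (!f->signalled) {
        f->signalled = true;
        f->error = 0;
    }
}

/* The scheduler timeout fired: force-signal the fence with an error. */
static void model_fence_timeout(struct model_fence *f)
{
    if (!f->signalled) {
        f->signalled = true;
        f->error = -1;
    }
}

/*
 * Simulated wait, advancing time in 1 ms steps. Because the timeout
 * path always runs eventually, the loop terminates even when the GPU
 * event never arrives - the waiter cannot block forever.
 */
static int model_fence_wait(struct model_fence *f, bool event_fires,
                            int event_ms)
{
    for (int now = 0; !f->signalled; now++) {
        if (event_fires && now >= event_ms)
            model_fence_signal(f);
        else if (now >= MODEL_JOB_TIMEOUT_MS)
            model_fence_timeout(f);
    }
    return f->error;
}
```

The "stuck" case here is exactly the foot-gun mentioned above: a wait on an event that never happens, recovered only by the timeout.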
>>>> I haven't studied Mihail's series in detail yet, but if I understand
>>>> correctly, the XGS queues are handled separately and are not suspended
>>>> when the hardware queues are suspended. I guess this might be an area
>>>> for improvement and might explain the currently very high timeout (to
>>>> deal with the case where the actual GPU work has been suspended).
>>>>
>>>>> How does memory management work in this case?
>>>> I'm not entirely sure what you mean here. If you are referring to the
>>>> potential memory issues with signalling path then this should be
>>>> handled
>>>> by the timeout - although I haven't studied the code to check for
>>>> bugs here.
>>> You might have misunderstood my question (and I might misunderstand the
>>> code), but on first glance it strongly sounds like the current
>>> approach will
>>> be NAKed.
>>>
>>>> The actual new XGS queues don't allocate/free memory during the queue
>>>> execution - so it's just the memory usage related to fences (and the
>>>> other work which could be blocked on the fence).
>>> But the kernel and the hardware could suspend the queues, right?
>>>
>>>> In terms of memory management for the GPU work itself, this is handled
>>>> the same as before. The VM_BIND mechanism allows dependencies to be
>>>> created between syncobjs and VM operations, with XGS these can then be
>>>> tied to GPU HW events.
>>> I don't know the details, but that again strongly sounds like that won't
>>> work.
>>>
>>> What you need is to somehow guarantee that work doesn't run into memory
>>> management deadlocks which are resolved by timeouts.
>>>
>>> Please read up here on why that stuff isn't allowed:
>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>> memory management easy mode.
>
> Ok, that at least makes things work for the moment.
Ah, perhaps this should have been spelt out more clearly ;)
The VM_BIND mechanism that's already in place jumps through some hoops
to ensure that memory is preallocated when the memory operations are
enqueued. So any memory required should have been allocated before any
sync object is returned. We're aware of the issue with memory
allocations on the signalling path and are trying to ensure that we
don't hit it.
I'm hoping that we don't need a shrinker which deals with (active) GPU
memory with our design. Memory which user space thinks the GPU might
need should be pinned before the GPU work is submitted. APIs which
require any form of 'paging in' of data would need to be implemented by
the GPU work completing and being resubmitted by user space after the
memory changes (i.e. there could be a DMA fence pending on the GPU work).
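As a sketch of that ordering (illustrative userspace C with invented names, not the actual VM_BIND code): every allocation happens at enqueue time, before the sync object is published, and the signalling path only consumes what was preallocated.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Instrumentation: counts any allocation made on the signalling path. */
static bool on_signal_path;
static int  allocs_on_signal_path;

static void *tracked_alloc(size_t n)
{
    if (on_signal_path)
        allocs_on_signal_path++;
    return malloc(n);
}

struct bind_op {
    void *pt_prealloc;      /* page-table memory reserved at enqueue time */
    bool  fence_signalled;  /* stand-in for the op's out-fence */
};

/* Enqueue: do every allocation now, before the fence becomes visible. */
static int bind_enqueue(struct bind_op *op)
{
    op->pt_prealloc = tracked_alloc(4096);
    if (!op->pt_prealloc)
        return -1;
    op->fence_signalled = false;  /* fence is published from here on */
    return 0;
}

/* Execute/signal: only consume what was preallocated, no allocator calls. */
static void bind_execute(struct bind_op *op)
{
    on_signal_path = true;
    /* ... install the mappings using op->pt_prealloc ... */
    op->fence_signalled = true;
    on_signal_path = false;
}
```

The invariant the real code aims for is the one the instrumentation checks: zero allocations between fence publication and fence signalling.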
>> But also this means there might be an uapi design bug in here, and we
>> really don't want to commit to that. My stance is that panthor should
>> gain
>> a proper shrinker first, which means there will be some changes here too.
>> And then we can properly evaluate this. As-is it's a bit too much on the
>> toy end of things.
>
> I wouldn't say toy end, it looks rather fleshed out to me.
>
> But I agree that the people who design the UAPI need to be aware of the
> restrictions.
Yes, I'm aware this could restrict the design in the future. This is of
course why it's an RFC; we welcome discussion on what problems we
could be setting ourselves up for!
>>
>> That aside, I've thought this all through with tons of people, and I do
>> think it's all possible.
>
> It's certainly possible, we have user queue patches for amdgpu in the
> pipeline as well.
>
> It's just really really really hard to get right without creating some
> circular dependencies and deadlocks in between.
>
> If I got free beer every time somebody came up with a broken
> dma_fence design, I would probably end up an alcoholic without spending a
> single penny.
I'll try hard to restrict your beer intake ;)
Thanks,
Steve
> Christian.
>
>> -Sima
>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Fundamentally (modulo bugs) there is little change compared to kernel
>>>> submission - it's already fairly trivial to write GPU job which will
>>>> make no forward progress (a 'while (1)' equivalent job). The only
>>>> difference here is that XGS makes this 'easy' and doesn't involve the
>>>> GPU spinning. Either way we rely on a timeout to recover from these
>>>> situations.
>>>>
>>>> Thanks,
>>>> Steve
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> This is still a work-in-progress, there's an outstanding issue with
>>>>>> multiple processes using different submission flows triggering
>>>>>> scheduling bugs (e.g. the same group getting scheduled twice), but
>>>>>> we'd
>>>>>> love to gather some feedback on the suitability of the approach in
>>>>>> general and see if there's a clear path to merging something like
>>>>>> this
>>>>>> eventually.
>>>>>>
>>>>>> I've also CCd AMD maintainers because they have in the past done
>>>>>> something similar[1], in case they want to chime in.
>>>>>>
>>>>>> There are two uses of this new uAPI in Mesa, one in gallium/panfrost
>>>>>> (link TBD), and one in panvk [2].
>>>>>>
>>>>>> The Gallium implementation is a naïve change just to switch the
>>>>>> submission model and exercise the new kernel code, and we don't plan
>>>>>> on pursuing this at this time.
>>>>>>
>>>>>> The panvk driver changes are, however, a better representation of the
>>>>>> intent behind this new uAPI, so please consider that as the reference
>>>>>> userspace. It is still very much also a work in progress.
>>>>>>
>>>>>> * patch 1 adds all the uAPI changes;
>>>>>> * patch 2 implements the GROUP_CREATE ioctl changes necessary
>>>>>> to expose
>>>>>> the required objects to userspace;
>>>>>> * patch 3 maps the doorbell pages, similarly to how the user I/O
>>>>>> page is
>>>>>> mapped;
>>>>>> * patch 4 implements GROUP_KICK, which lets userspace request an
>>>>>> inactive group to be scheduled on the GPU;
>>>>>> * patches 5 & 6 implement XGS queues, a way for userspace to
>>>>>> synchronise GPU queue progress with DRM syncobjs;
>>>>>> * patches 7 & 8 add notification mechanisms for user & kernel
>>>>>> to signal
>>>>>> changes to native GPU syncobjs.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/amd-gfx/CADnq5_N61q_o+5WYUZsZ=qu7VmeXTFHQSxLwTco05gLzHaiswA@mail.gmail.com/t/#m116a36a598d8fad1329e053974ad37a4dc0f28ed
>>>>>> [2]
>>>>>> https://gitlab.freedesktop.org/larsivsi/mesa/-/commits/panvk-v10-usersubmit?ref_type=heads
>>>>>>
>>>>>> Ketil Johnsen (7):
>>>>>> drm/panthor: Add uAPI to submit from user space
>>>>>> drm/panthor: Extend GROUP_CREATE for user submission
>>>>>> drm/panthor: Map doorbell pages
>>>>>> drm/panthor: Add GROUP_KICK ioctl
>>>>>> drm/panthor: Factor out syncobj handling
>>>>>> drm/panthor: Implement XGS queues
>>>>>> drm/panthor: Add SYNC_UPDATE ioctl
>>>>>>
>>>>>> Mihail Atanassov (1):
>>>>>> drm/panthor: Add sync_update eventfd handling
>>>>>>
>>>>>> drivers/gpu/drm/panthor/Makefile | 4 +-
>>>>>> drivers/gpu/drm/panthor/panthor_device.c | 66 ++-
>>>>>> drivers/gpu/drm/panthor/panthor_device.h | 35 +-
>>>>>> drivers/gpu/drm/panthor/panthor_drv.c | 233 +++++++-
>>>>>> drivers/gpu/drm/panthor/panthor_fw.c | 2 +-
>>>>>> drivers/gpu/drm/panthor/panthor_sched.c | 408 +++++++++-----
>>>>>> drivers/gpu/drm/panthor/panthor_sched.h | 8 +-
>>>>>> drivers/gpu/drm/panthor/panthor_syncobj.c | 167 ++++++
>>>>>> drivers/gpu/drm/panthor/panthor_syncobj.h | 27 +
>>>>>> drivers/gpu/drm/panthor/panthor_xgs.c | 638
>>>>>> ++++++++++++++++++++++
>>>>>> drivers/gpu/drm/panthor/panthor_xgs.h | 42 ++
>>>>>> include/uapi/drm/panthor_drm.h | 243 +++++++-
>>>>>> 12 files changed, 1696 insertions(+), 177 deletions(-)
>>>>>> create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.c
>>>>>> create mode 100644 drivers/gpu/drm/panthor/panthor_syncobj.h
>>>>>> create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.c
>>>>>> create mode 100644 drivers/gpu/drm/panthor/panthor_xgs.h
>>>>>>
>
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-09-04 7:49 ` Christian König
2024-09-04 9:31 ` Steven Price
@ 2024-09-04 9:43 ` Simona Vetter
1 sibling, 0 replies; 28+ messages in thread
From: Simona Vetter @ 2024-09-04 9:43 UTC (permalink / raw)
To: Christian König
Cc: Steven Price, Mihail Atanassov, linux-kernel, Boris Brezillon,
Liviu Dudau, dri-devel, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher, Xinhui Pan,
Shashank Sharma, Ketil Johnsen, Akash Goel
On Wed, Sep 04, 2024 at 09:49:44AM +0200, Christian König wrote:
> Am 03.09.24 um 23:11 schrieb Simona Vetter:
> > On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
> > > Hi Steven,
> > >
> > > Am 29.08.24 um 15:37 schrieb Steven Price:
> > > > Hi Christian,
> > > >
> > > > Mihail should be able to give more definitive answers, but I think I can
> > > > answer your questions.
> > > >
> > > > On 29/08/2024 10:40, Christian König wrote:
> > > > > Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
> > > > > > Hello all,
> > > > > >
> > > > > > This series implements a mechanism to expose Mali CSF GPUs' queue
> > > > > > ringbuffers directly to userspace, along with paraphernalia to allow
> > > > > > userspace to control job synchronisation between the CPU and GPU.
> > > > > >
> > > > > > The goal of these changes is to allow userspace to control work
> > > > > > submission to the FW/HW directly without kernel intervention in the
> > > > > > common case, thereby reducing context switching overhead. It also allows
> > > > > > for greater flexibility in the way work is enqueued in the ringbufs.
> > > > > > For example, the current kernel submit path only supports indirect
> > > > > > calls, which is inefficient for small command buffers. Userspace can
> > > > > > also skip unnecessary sync operations.
> > > > > Question is how do you guarantee forward progress for fence signaling?
> > > > A timeout. Although looking at it I think it's probably set too high
> > > > currently:
> > > >
> > > > > +#define JOB_TIMEOUT_MS 5000
> > > > But basically the XGS queue is a DRM scheduler just like a normal GPU
> > > > queue and the jobs have a timeout. If the timeout is hit then any fences
> > > > will be signalled (with an error).
> > > Mhm, that is unfortunately exactly what I feared.
> > >
> > > > > E.g. when are fences created and published? How do they signal?
> > > > >
> > > > > How are dependencies handled? How can the kernel suspend an userspace
> > > > > queue?
> > > > The actual userspace queue can be suspended. This is actually a
> > > > combination of firmware and kernel driver, and this functionality is
> > > > already present without the user submission. The firmware will multiplex
> > > > multiple 'groups' onto the hardware, and if there are too many for the
> > > > firmware then the kernel multiplexes the extra groups onto the ones the
> > > > firmware supports.
> > > How do you guarantee forward progress and that resuming of suspended queues
> > > doesn't end up in a circle dependency?
> > >
> > > > I haven't studied Mihail's series in detail yet, but if I understand
> > > > correctly, the XGS queues are handled separately and are not suspended
> > > > when the hardware queues are suspended. I guess this might be an area
> > > > for improvement and might explain the currently very high timeout (to
> > > > deal with the case where the actual GPU work has been suspended).
> > > >
> > > > > How does memory management work in this case?
> > > > I'm not entirely sure what you mean here. If you are referring to the
> > > > potential memory issues with signalling path then this should be handled
> > > > by the timeout - although I haven't studied the code to check for bugs here.
> > > You might have misunderstood my question (and I might misunderstand the
> > > code), but on first glance it strongly sounds like the current approach will
> > > be NAKed.
> > >
> > > > The actual new XGS queues don't allocate/free memory during the queue
> > > > execution - so it's just the memory usage related to fences (and the
> > > > other work which could be blocked on the fence).
> > > But the kernel and the hardware could suspend the queues, right?
> > >
> > > > In terms of memory management for the GPU work itself, this is handled
> > > > the same as before. The VM_BIND mechanism allows dependencies to be
> > > > created between syncobjs and VM operations, with XGS these can then be
> > > > tied to GPU HW events.
> > > I don't know the details, but that again strongly sounds like that won't
> > > work.
> > >
> > > What you need is to somehow guarantee that work doesn't run into memory
> > > management deadlocks which are resolved by timeouts.
> > >
> > > Please read up here on why that stuff isn't allowed: https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> > panthor doesn't yet have a shrinker, so all memory is pinned, which means
> > memory management easy mode.
>
> Ok, that at least makes things work for the moment.
>
> > But also this means there might be an uapi design bug in here, and we
> > really don't want to commit to that. My stance is that panthor should gain
> > a proper shrinker first, which means there will be some changes here too.
> > And then we can properly evaluate this. As-is it's a bit too much on the
> > toy end of things.
>
> I wouldn't say toy end, it looks rather fleshed out to me.
>
> But I agree that the people who design the UAPI need to be aware of the
> restrictions.
The thing is, if you have pinned memory there are no restrictions on
userspace sync, you can do whatever you want. And then you can still bake
that into a dma_fence and it will not deadlock (as long as there's a
timeout and a forward progress guarantee), because it can never feed back
into a memory management fence: there are none.
If we merge that uapi we're forever condemned to pinning all memory, which
I don't think is good because sooner or later every serious gpu driver
seems to gain some shrinker support.
> > That aside, I've thought this all through with tons of people, and I do
> > think it's all possible.
>
> It's certainly possible, we have user queue patches for amdgpu in the
> pipeline as well.
>
> It's just really really really hard to get right without creating some
> circular dependencies and deadlocks in between.
>
> If I got free beer every time somebody came up with a broken dma_fence
> design, I would probably end up an alcoholic without spending a single penny.
Yeah that's exactly my fear. It's possible, but even if you understand the
abstract rules and implement the kernel correctly it's really hard to get
the userspace right: there are way too many broken ways, and just about
barely one correct way to implement vulkan winsys dma_fence with userspace
submit.
This is why I called userspace direct submit without a shrinker a toy,
because it makes the userspace trivially easy, and trivially easy to get
wrong. Even if the kernel side looks all great and polished and solid.
If we do already have a shrinker that exposes these issues, then userspace
getting stuff wrong is a userspace bug. If we start out without a
shrinker, and userspace gets it wrong, then ever adding a shrinker is a
kernel regression.
And "you cannot ever add dynamic memory management to this driver" is a
uapi regression situation I don't think we want to get into, because
there's no way out. i915-gem folks managed to pull that trick off too, and
Dave and me were left with no other choice than to flat-out revert that uapi
and break userspace.
This is not something I want to make a repeat experience.
So what I think should be done, pretty much a carbon copy of the xe plan:
- Prerequisite: Have a shrinker or other dynamic memory management, or
you just won't hit all the interesting bugs.
- Implement support for long running contexts that do not have dma_fence
end-of-batch semantics, with the preempt context fence trick. Since this
would be the 3rd driver after amdkfd and xe I think it's time for a
little helper, which would be very little code and lots of documentation
explaining the concept.
But if that helper ties in with drm_exec and drm_gpuvm then I think it
would still be really good as a tiny little library itself and not just
as a good place to put documentation. Extracting that from xe_lr.c
should be fairly simple. amdkfd could also be used as inspiration at
least.
- With this you can start using userspace memory fences, as long as you
never need to bake them into a dma_fence or syncobj that actually waits
(instead of waiting until rendering finishes and then handing over a
signalled dma_fence or something like that).
- Implement userspace submit with this infrastructure.
- Implement support for baking dma_fence with all this, relying on the
vulkan winsys fence guarantee that all the indefinite fences are
resolved and queued up, or the application will get a
VK_ERROR_DEVICE_LOST thrown its way.
This is really hard, and I don't think anyone's made it happen yet in a
real prototype with a full-fledged userspace-memory-fence vulkan driver.
- Bonus for adding memory fences to drm_syncobj as future fences, for
sharing. Could be done anytime you have long running context support.
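As a rough illustration of the preempt context fence trick referenced in
the plan above, here is a hedged Python sketch (not kernel code; all names
are invented, not the xe/amdkfd APIs): the fence does not signal at
end-of-batch, and enabling signalling suspends the long-running context
first, so reclaim can wait on the fence without depending on the workload
ever finishing.

```python
class LongRunningContext:
    def __init__(self):
        self.running = True

    def suspend(self):
        # In a real driver this would preempt the HW context.
        self.running = False


class PreemptContextFence:
    """Fence tied to context preemption, not to batch completion."""

    def __init__(self, ctx):
        self.ctx = ctx
        self.signalled = False

    def enable_signaling(self):
        # Only when someone (e.g. the shrinker) actually waits on the
        # fence do we suspend the context, then signal.
        self.ctx.suspend()
        self.signalled = True


ctx = LongRunningContext()
fence = PreemptContextFence(ctx)
assert ctx.running and not fence.signalled
fence.enable_signaling()   # eviction path waits -> context preempted
assert fence.signalled and not ctx.running
```

The point of the trick is that the fence can always make forward progress
(preemption is under kernel/firmware control), so it is safe to wait on it
from memory-management paths.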
Alternative plan:
- Implement userspace submit, but _every_ job still goes through a kernel
drm_sched submission and the context doesn't run without a job. That way
you can drive userspace-submit only hardware, but with the old
end-of-batch dma_fence kernel submission model, and also the userspace
submit thread for future fences that haven't materialized yet in a
drm_syncobj. This might be a good interim stop-gap for gl/vulkan winsys
drivers, but people seem a lot more interested in going full userspace
submit with userspace memory fences for compute use cases. The two could
be done in parallel, though.
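The alternative plan above can be sketched as follows (hedged Python
model, not the drm_sched API; names are illustrative): userspace writes
batches into the ring directly, but every batch is still mirrored by a
kernel scheduler job whose dma_fence signals at end-of-batch, preserving
the old winsys semantics.

```python
from collections import deque


class Fence:
    def __init__(self):
        self.signalled = False

    def signal(self):
        self.signalled = True


class MirrorScheduler:
    """One kernel job (and fence) per userspace ring-buffer batch."""

    def __init__(self):
        self.pending = deque()

    def submit_batch(self):
        # Kernel-side mirror of a batch userspace already enqueued.
        f = Fence()
        self.pending.append(f)
        return f

    def batch_done(self):
        # HW/firmware reports end-of-batch; signal the oldest fence.
        self.pending.popleft().signal()


sched = MirrorScheduler()
f = sched.submit_batch()   # ioctl per batch, as in the old model
sched.batch_done()
assert f.signalled and not sched.pending
```

The context never runs without a kernel job in flight, which is what keeps
the end-of-batch dma_fence model sound.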
Unless you absolutely know what you're doing, you cannot use userspace
memory fences with this. The only cases I've seen are where you submit
jobs to multiple engines as an atomic unit (like intel media workloads
split across engines), or one after the other where sync is guaranteed
to go only one way (like the run-ahead cs job radeonsi iirc has).
tldr; expect pain, I'm sorry.
Cheers, Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-09-04 9:31 ` Steven Price
@ 2024-09-04 11:23 ` Boris Brezillon
[not found] ` <a5a53492-9651-403e-b613-91ef0b9e80b6@amd.com>
0 siblings, 1 reply; 28+ messages in thread
From: Boris Brezillon @ 2024-09-04 11:23 UTC (permalink / raw)
To: Steven Price
Cc: Christian König, Mihail Atanassov, linux-kernel, Liviu Dudau,
dri-devel, David Airlie, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Alex Deucher, Xinhui Pan, Shashank Sharma,
Ketil Johnsen, Akash Goel
On Wed, 4 Sep 2024 10:31:36 +0100
Steven Price <steven.price@arm.com> wrote:
> On 04/09/2024 08:49, Christian König wrote:
> > Am 03.09.24 um 23:11 schrieb Simona Vetter:
> >> On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:
> >>> Hi Steven,
> >>>
> >>> Am 29.08.24 um 15:37 schrieb Steven Price:
> >>>> Hi Christian,
> >>>>
> >>>> Mihail should be able to give more definitive answers, but I think I
> >>>> can
> >>>> answer your questions.
> >>>>
> >>>> On 29/08/2024 10:40, Christian König wrote:
> >>>>> Am 28.08.24 um 19:25 schrieb Mihail Atanassov:
> >>>>>> Hello all,
> >>>>>>
> >>>>>> This series implements a mechanism to expose Mali CSF GPUs' queue
> >>>>>> ringbuffers directly to userspace, along with paraphernalia to allow
> >>>>>> userspace to control job synchronisation between the CPU and GPU.
> >>>>>>
> >>>>>> The goal of these changes is to allow userspace to control work
> >>>>>> submission to the FW/HW directly without kernel intervention in the
> >>>>>> common case, thereby reducing context switching overhead. It also
> >>>>>> allows
> >>>>>> for greater flexibility in the way work is enqueued in the ringbufs.
> >>>>>> For example, the current kernel submit path only supports indirect
> >>>>>> calls, which is inefficient for small command buffers. Userspace can
> >>>>>> also skip unnecessary sync operations.
> >>>>> Question is how do you guarantee forward progress for fence signaling?
> >>>> A timeout. Although looking at it I think it's probably set too high
> >>>> currently:
> >>>>
> >>>>> +#define JOB_TIMEOUT_MS 5000
> >>>> But basically the XGS queue is a DRM scheduler just like a normal GPU
> >>>> queue and the jobs have a timeout. If the timeout is hit then any
> >>>> fences
> >>>> will be signalled (with an error).
> >>> Mhm, that is unfortunately exactly what I feared.
> >>>
> >>>>> E.g. when are fences created and published? How do they signal?
> >>>>>
> >>>>> How are dependencies handled? How can the kernel suspend an userspace
> >>>>> queue?
> >>>> The actual userspace queue can be suspended. This is actually a
> >>>> combination of firmware and kernel driver, and this functionality is
> >>>> already present without the user submission. The firmware will
> >>>> multiplex
> >>>> multiple 'groups' onto the hardware, and if there are too many for the
> >>>> firmware then the kernel multiplexes the extra groups onto the ones the
> >>>> firmware supports.
> >>> How do you guarantee forward progress and that resuming of suspended
> >>> queues
> >>> doesn't end up in a circle dependency?
>
> I'm not entirely sure what you mean by "guarantee" here - the kernel by
> itself only guarantees forward progress by the means of timeouts. User
> space can 'easily' shoot itself in the foot by using a XGS queue to
> block waiting on a GPU event which will never happen.
>
> However dependencies between applications (and/or other device drivers)
> will only occur via dma fences and an unsignalled fence will only be
> returned when there is a path forward to signal it. So it shouldn't be
> possible to create a dependency loop between contexts (or command stream
> groups to use the Mali jargon).
>
> Because the groups can't have dependency cycles it should be possible to
> suspend/resume them without deadlocks.
>
> >>>> I haven't studied Mihail's series in detail yet, but if I understand
> >>>> correctly, the XGS queues are handled separately and are not suspended
> >>>> when the hardware queues are suspended. I guess this might be an area
> >>>> for improvement and might explain the currently very high timeout (to
> >>>> deal with the case where the actual GPU work has been suspended).
> >>>>
> >>>>> How does memory management work in this case?
> >>>> I'm not entirely sure what you mean here. If you are referring to the
> >>>> potential memory issues with signalling path then this should be
> >>>> handled
> >>>> by the timeout - although I haven't studied the code to check for
> >>>> bugs here.
> >>> You might have misunderstood my question (and I might misunderstand the
> >>> code), but on first glance it strongly sounds like the current
> >>> approach will
> >>> be NAKed.
> >>>
> >>>> The actual new XGS queues don't allocate/free memory during the queue
> >>>> execution - so it's just the memory usage related to fences (and the
> >>>> other work which could be blocked on the fence).
> >>> But the kernel and the hardware could suspend the queues, right?
> >>>
> >>>> In terms of memory management for the GPU work itself, this is handled
> >>>> the same as before. The VM_BIND mechanism allows dependencies to be
> >>>> created between syncobjs and VM operations, with XGS these can then be
> >>>> tied to GPU HW events.
> >>> I don't know the details, but that again strongly sounds like that won't
> >>> work.
> >>>
> >>> What you need is to somehow guarantee that work doesn't run into memory
> >>> management deadlocks which are resolved by timeouts.
> >>>
> >>> Please read up here on why that stuff isn't allowed:
> >>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> >> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> >> memory management easy mode.
> >
> > Ok, that at least makes things work for the moment.
>
> Ah, perhaps this should have been spelt out more clearly ;)
>
> The VM_BIND mechanism that's already in place jumps through some hoops
> to ensure that memory is preallocated when the memory operations are
> enqueued. So any memory required should have been allocated before any
> sync object is returned. We're aware of the issue with memory
> allocations on the signalling path and trying to ensure that we don't
> have that.
>
> I'm hoping that we don't need a shrinker which deals with (active) GPU
> memory with our design.
That's actually what we were planning to do: the panthor shrinker was
about to rely on fences attached to GEM objects to know if it can
reclaim the memory. This design relies on each job attaching its fence
to the GEM mapped to the VM at the time the job is submitted, such that
memory that's in-use or about-to-be-used doesn't vanish before the GPU
is done.
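The reclaim rule described above can be modelled roughly like this
(purely illustrative Python, no real panthor structures): a GEM object is
reclaimable only once every job fence attached to it has signalled.

```python
class Fence:
    def __init__(self, signalled=False):
        self.signalled = signalled


class GemObject:
    def __init__(self):
        self.fences = []


def job_submit(gems_mapped_to_vm):
    # Each job attaches its fence to every GEM mapped to the VM at
    # submit time, so in-use memory can't vanish under the GPU.
    f = Fence()
    for g in gems_mapped_to_vm:
        g.fences.append(f)
    return f


def reclaimable(g):
    return all(f.signalled for f in g.fences)


bo = GemObject()
f = job_submit([bo])
assert not reclaimable(bo)   # in use (or about to be) by the GPU
f.signalled = True
assert reclaimable(bo)       # GPU done, safe to reclaim
```

This is exactly the step that stops working with pure user submission,
since the kernel no longer sees each job to attach the fence.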
> Memory which user space thinks the GPU might
> need should be pinned before the GPU work is submitted. APIs which
> require any form of 'paging in' of data would need to be implemented by
> the GPU work completing and being resubmitted by user space after the
> memory changes (i.e. there could be a DMA fence pending on the GPU work).
Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
that means we can't really transparently swap out GPU memory, or we
have to constantly pin/unpin around each job, which means even more
ioctl()s than we have now. Another option would be to add the XGS fence
to the BOs attached to the VM, assuming it's created before the job
submission itself, but you're no longer reducing the number of user <->
kernel round trips if you do that, because you now have to create an
XSG job for each submission, so you basically get back to one ioctl()
per submission.
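A back-of-envelope version of the round-trip argument above (the numbers
are illustrative, not measurements): any per-job kernel object, whether a
pin/unpin pair or an XGS job, puts you back at one or more ioctl()s per
submission, which is what user submission was meant to avoid.

```python
# Approximate ioctl() count per GPU job under each model discussed.
IOCTLS_PER_SUBMIT = {
    "kernel_submit": 1,      # current panthor submit path
    "per_job_pin_unpin": 2,  # pin + unpin around each job
    "per_job_xgs_fence": 1,  # one XGS job created per submission
    "long_term_pin": 0,      # pin once up front, submit many times
}

# The per-job variants are no cheaper than kernel submission...
assert IOCTLS_PER_SUBMIT["per_job_xgs_fence"] >= IOCTLS_PER_SUBMIT["kernel_submit"]
# ...only long-term pinning actually removes the round trip.
assert IOCTLS_PER_SUBMIT["long_term_pin"] == 0
```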
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
[not found] ` <a5a53492-9651-403e-b613-91ef0b9e80b6@amd.com>
@ 2024-09-04 12:46 ` Steven Price
2024-09-04 13:20 ` Boris Brezillon
2024-09-04 13:36 ` Christian König
2024-09-24 9:30 ` Simona Vetter
1 sibling, 2 replies; 28+ messages in thread
From: Steven Price @ 2024-09-04 12:46 UTC (permalink / raw)
To: Christian König, Boris Brezillon
Cc: Mihail Atanassov, linux-kernel, Liviu Dudau, dri-devel,
David Airlie, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
Alex Deucher, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel
On 04/09/2024 12:34, Christian König wrote:
> Hi Boris,
>
> Am 04.09.24 um 13:23 schrieb Boris Brezillon:
>>>>>> Please read up here on why that stuff isn't allowed:
>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>>>>> memory management easy mode.
>>>> Ok, that at least makes things work for the moment.
>>> Ah, perhaps this should have been spelt out more clearly ;)
>>>
>>> The VM_BIND mechanism that's already in place jumps through some hoops
>>> to ensure that memory is preallocated when the memory operations are
>>> enqueued. So any memory required should have been allocated before any
>>> sync object is returned. We're aware of the issue with memory
>>> allocations on the signalling path and trying to ensure that we don't
>>> have that.
>>>
>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
>>> memory with our design.
>> That's actually what we were planning to do: the panthor shrinker was
>> about to rely on fences attached to GEM objects to know if it can
>> reclaim the memory. This design relies on each job attaching its fence
>> to the GEM mapped to the VM at the time the job is submitted, such that
>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
>> is done.
How progressed is this shrinker? It would be good to have an RFC so that
we can look to see how user submission could fit in with it.
> Yeah and exactly that doesn't work any more when you are using user
> queues, because the kernel has no opportunity to attach a fence for each
> submission.
User submission requires a cooperating user space[1]. So obviously user
space would need to ensure any BOs that it expects will be accessed to
be in some way pinned. Since the expectation of user space submission is
that we're reducing kernel involvement, I'd also expect these to be
fairly long-term pins.
[1] Obviously with a timer to kill things from a malicious user space.
The (closed) 'kbase' driver has a shrinker, but it is only used on a
subset of memory, and it's up to user space to ensure that it keeps the
relevant parts pinned (or, more specifically, that it doesn't mark them
to be discarded if there's memory pressure). Not that I think we should
be taking its model as a reference here.
>>> Memory which user space thinks the GPU might
>>> need should be pinned before the GPU work is submitted. APIs which
>>> require any form of 'paging in' of data would need to be implemented by
>>> the GPU work completing and being resubmitted by user space after the
>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
>> that means we can't really transparently swap out GPU memory, or we
>> have to constantly pin/unpin around each job, which means even more
>> ioctl()s than we have now. Another option would be to add the XGS fence
>> to the BOs attached to the VM, assuming it's created before the job
>> submission itself, but you're no longer reducing the number of user <->
>> kernel round trips if you do that, because you now have to create an
>> XSG job for each submission, so you basically get back to one ioctl()
>> per submission.
As you say the granularity of pinning has to be fairly coarse for user
space submission to make sense. My assumption (could be wildly wrong)
was that most memory would be pinned whenever a context is rendering.
> For AMDGPU we are currently working on the following solution with
> memory management and user queues:
>
> 1. User queues are created through an kernel IOCTL, submissions work by
> writing into a ring buffer and ringing a doorbell.
>
> 2. Each queue can request the kernel to create fences for the currently
> pushed work for a queues which can then be attached to BOs, syncobjs,
> syncfiles etc...
>
> 3. Additional to that we have and eviction/preemption fence attached to
> all BOs, page tables, whatever resources we need.
>
> 4. When this eviction fences are requested to signal they first wait for
> all submission fences and then suspend the user queues and block
> creating new submission fences until the queues are restarted again.
>
> This way you can still do your memory management inside the kernel (e.g.
> move BOs from local to system memory) or even completely suspend and
> resume applications without their interaction, but as Sima said it is
> just horrible complicated to get right.
>
> We have been working on this for like two years now and it still could
> be that we missed something since it is not in production testing yet.
I'm not entirely sure I follow how this doesn't create a dependency
cycle. From your description it sounds like you create a fence from the
user space queue which is then used to prevent eviction of the BOs needed.
So to me it sounds like:
1. Attach fence to BOs to prevent eviction.
2. User space submits work to the ring buffer, rings doorbell.
3. Call into the kernel to create the fence for step 1.
Which is obviously broken. What am I missing?
One other thing to note is that Mali doesn't have local memory - so the
only benefit of unpinning is if we want to swap to disk (or zram etc).
Steve
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-09-04 12:46 ` Steven Price
@ 2024-09-04 13:20 ` Boris Brezillon
2024-09-04 13:35 ` Steven Price
2024-09-04 13:36 ` Christian König
1 sibling, 1 reply; 28+ messages in thread
From: Boris Brezillon @ 2024-09-04 13:20 UTC (permalink / raw)
To: Steven Price, Adrián Larumbe
Cc: Christian König, Mihail Atanassov, linux-kernel, Liviu Dudau,
dri-devel, David Airlie, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Alex Deucher, Xinhui Pan, Shashank Sharma,
Ketil Johnsen, Akash Goel
+ Adrian, who has been looking at the shrinker stuff for Panthor
On Wed, 4 Sep 2024 13:46:12 +0100
Steven Price <steven.price@arm.com> wrote:
> On 04/09/2024 12:34, Christian König wrote:
> > Hi Boris,
> >
> > Am 04.09.24 um 13:23 schrieb Boris Brezillon:
> >>>>>> Please read up here on why that stuff isn't allowed:
> >>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> >>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> >>>>> memory management easy mode.
> >>>> Ok, that at least makes things work for the moment.
> >>> Ah, perhaps this should have been spelt out more clearly ;)
> >>>
> >>> The VM_BIND mechanism that's already in place jumps through some hoops
> >>> to ensure that memory is preallocated when the memory operations are
> >>> enqueued. So any memory required should have been allocated before any
> >>> sync object is returned. We're aware of the issue with memory
> >>> allocations on the signalling path and trying to ensure that we don't
> >>> have that.
> >>>
> >>> I'm hoping that we don't need a shrinker which deals with (active) GPU
> >>> memory with our design.
> >> That's actually what we were planning to do: the panthor shrinker was
> >> about to rely on fences attached to GEM objects to know if it can
> >> reclaim the memory. This design relies on each job attaching its fence
> >> to the GEM mapped to the VM at the time the job is submitted, such that
> >> memory that's in-use or about-to-be-used doesn't vanish before the GPU
> >> is done.
>
> How progressed is this shrinker?
We don't have code yet. All we know is that we want to re-use Dmitry's
generic GEM-SHMEM shrinker implementation [1], and adjust it to match
the VM model, which means not tracking things at the BO granularity,
but at the VM granularity. Actually it has to be a hybrid model, where
shared BOs (those imported/exported) are tracked individually, while
all private BOs are checked simultaneously (since they all share the VM
resv object).
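The hybrid model could look roughly like this (hedged sketch, invented
names, not the dma_resv API): shared (imported/exported) BOs carry their
own reservation object, while all private BOs share the VM's, so a single
idleness check covers every private BO at once.

```python
class Fence:
    def __init__(self):
        self.signalled = False


class Resv:
    def __init__(self):
        self.fences = []

    def idle(self):
        return all(f.signalled for f in self.fences)


class BO:
    def __init__(self, vm_resv=None):
        # Private BOs share the VM resv; shared BOs get their own.
        self.resv = vm_resv if vm_resv is not None else Resv()


vm_resv = Resv()
privates = [BO(vm_resv) for _ in range(3)]
shared = BO()

job = Fence()
vm_resv.fences.append(job)            # one fence covers all privates
assert not any(b.resv.idle() for b in privates)
assert shared.resv.idle()             # shared BO tracked on its own
job.signalled = True
assert all(b.resv.idle() for b in privates)
```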
> It would be good to have an RFC so that
> we can look to see how user submission could fit in with it.
Unfortunately, we don't have that yet :-(. All we have is a rough idea
of how things will work, which is basically how TTM reclaim works, but
adapted to GEM.
>
> > Yeah and exactly that doesn't work any more when you are using user
> > queues, because the kernel has no opportunity to attach a fence for each
> > submission.
>
> User submission requires a cooperating user space[1]. So obviously user
> space would need to ensure any BOs that it expects will be accessed to
> be in some way pinned. Since the expectation of user space submission is
> that we're reducing kernel involvement, I'd also expect these to be
> fairly long-term pins.
>
> [1] Obviously with a timer to kill things from a malicious user space.
>
> The (closed) 'kbase' driver has a shrinker but is only used on a subset
> of memory and it's up to user space to ensure that it keeps the relevant
> parts pinned (or more specifically not marking them to be discarded if
> there's memory pressure). Not that I think we should be taking it's
> model as a reference here.
>
> >>> Memory which user space thinks the GPU might
> >>> need should be pinned before the GPU work is submitted. APIs which
> >>> require any form of 'paging in' of data would need to be implemented by
> >>> the GPU work completing and being resubmitted by user space after the
> >>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
> >> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> >> that means we can't really transparently swap out GPU memory, or we
> >> have to constantly pin/unpin around each job, which means even more
> >> ioctl()s than we have now. Another option would be to add the XGS fence
> >> to the BOs attached to the VM, assuming it's created before the job
> >> submission itself, but you're no longer reducing the number of user <->
> >> kernel round trips if you do that, because you now have to create an
> >> XSG job for each submission, so you basically get back to one ioctl()
> >> per submission.
>
> As you say the granularity of pinning has to be fairly coarse for user
> space submission to make sense. My assumption (could be wildly wrong)
> was that most memory would be pinned whenever a context is rendering.
The granularity of pinning (in terms of which regions are pinned) is not
really the problem, we can just assume anything that's mapped to the VM
will be used by the GPU (which is what we're planning to do for kernel
submission BTW). The problem is making the timeslice during
which VM memory is considered unreclaimable as short as possible, such
that the system can reclaim memory under mem pressure. Ideally, you want
to pin memory as long as you have jobs queued/running, and allow for
reclaim when the GPU context is idle.
We might be able to involve the panthor_scheduler for usermode queues,
such that a context that's eligible for scheduling first gets its VM
mappings pinned (fence creation + assignment to the VM/BO resvs), and
things get reclaimable again when the group is evicted from the CSG
slot. That implies evicting idle groups more aggressively than we do
now, but there's probably a way around it.
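The lifecycle suggested above could be sketched as follows (hypothetical
names, not the panthor_scheduler API): VM mappings become unreclaimable
when the group gets a CSG slot, and reclaimable again once the group is
evicted and the fence signals.

```python
class VM:
    def __init__(self):
        self.unreclaimable = False


class Group:
    def __init__(self, vm):
        self.vm = vm
        self.on_slot = False

    def schedule(self):
        # Fence creation + assignment to the VM/BO resvs happens here.
        self.on_slot = True
        self.vm.unreclaimable = True

    def evict(self):
        # Group leaves the CSG slot; fence signals, reclaim is allowed.
        self.on_slot = False
        self.vm.unreclaimable = False


g = Group(VM())
g.schedule()
assert g.vm.unreclaimable      # jobs queued/running: memory pinned
g.evict()
assert not g.vm.unreclaimable  # idle: system can reclaim under pressure
```

The goal is to keep the unreclaimable window as short as the actual
queued/running work, rather than the lifetime of the context.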
>
> > For AMDGPU we are currently working on the following solution with
> > memory management and user queues:
> >
> > 1. User queues are created through an kernel IOCTL, submissions work by
> > writing into a ring buffer and ringing a doorbell.
> >
> > 2. Each queue can request the kernel to create fences for the currently
> > pushed work for a queues which can then be attached to BOs, syncobjs,
> > syncfiles etc...
> >
> > 3. Additional to that we have and eviction/preemption fence attached to
> > all BOs, page tables, whatever resources we need.
> >
> > 4. When this eviction fences are requested to signal they first wait for
> > all submission fences and then suspend the user queues and block
> > creating new submission fences until the queues are restarted again.
> >
> > This way you can still do your memory management inside the kernel (e.g.
> > move BOs from local to system memory) or even completely suspend and
> > resume applications without their interaction, but as Sima said it is
> > just horrible complicated to get right.
> >
> > We have been working on this for like two years now and it still could
> > be that we missed something since it is not in production testing yet.
>
> I'm not entirely sure I follow how this doesn't create a dependency
> cycle. From your description it sounds like you create a fence from the
> user space queue which is then used to prevent eviction of the BOs needed.
>
> So to me it sounds like:
>
> 1. Attach fence to BOs to prevent eviction.
>
> 2. User space submits work to the ring buffer, rings doorbell.
>
> 3. Call into the kernel to create the fence for step 1.
>
> Which is obviously broken. What am I missing?
>
> One other thing to note is that Mali doesn't have local memory - so the
> only benefit of unpinning is if we want to swap to disk (or zram etc).
Which would be good to have, IMHO. If we don't do the implicit swapout
based on some sort of least-recently-used VM, we have to rely on
userspace freeing its buffers (or flagging them reclaimable) as soon as
they are no longer used by the GPU.
[1]https://patchwork.kernel.org/project/dri-devel/cover/20240105184624.508603-1-dmitry.osipenko@collabora.com/
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-09-04 13:20 ` Boris Brezillon
@ 2024-09-04 13:35 ` Steven Price
2024-09-04 13:58 ` Boris Brezillon
0 siblings, 1 reply; 28+ messages in thread
From: Steven Price @ 2024-09-04 13:35 UTC (permalink / raw)
To: Boris Brezillon, Adrián Larumbe
Cc: Christian König, Mihail Atanassov, linux-kernel, Liviu Dudau,
dri-devel, David Airlie, Maarten Lankhorst, Maxime Ripard,
Thomas Zimmermann, Alex Deucher, Xinhui Pan, Shashank Sharma,
Ketil Johnsen, Akash Goel
On 04/09/2024 14:20, Boris Brezillon wrote:
> + Adrian, who has been looking at the shrinker stuff for Panthor
>
> On Wed, 4 Sep 2024 13:46:12 +0100
> Steven Price <steven.price@arm.com> wrote:
>
>> On 04/09/2024 12:34, Christian König wrote:
>>> Hi Boris,
>>>
>>> Am 04.09.24 um 13:23 schrieb Boris Brezillon:
>>>>>>>> Please read up here on why that stuff isn't allowed:
>>>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
>>>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>>>>>>> memory management easy mode.
>>>>>> Ok, that at least makes things work for the moment.
>>>>> Ah, perhaps this should have been spelt out more clearly ;)
>>>>>
>>>>> The VM_BIND mechanism that's already in place jumps through some hoops
>>>>> to ensure that memory is preallocated when the memory operations are
>>>>> enqueued. So any memory required should have been allocated before any
>>>>> sync object is returned. We're aware of the issue with memory
>>>>> allocations on the signalling path and trying to ensure that we don't
>>>>> have that.
>>>>>
>>>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
>>>>> memory with our design.
>>>> That's actually what we were planning to do: the panthor shrinker was
>>>> about to rely on fences attached to GEM objects to know if it can
>>>> reclaim the memory. This design relies on each job attaching its fence
>>>> to the GEM mapped to the VM at the time the job is submitted, such that
>>>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
>>>> is done.
>>
>> How progressed is this shrinker?
>
> We don't have code yet. All we know is that we want to re-use Dmitry's
> generic GEM-SHMEM shrinker implementation [1], and adjust it to match
> the VM model, which means not tracking things at the BO granularity,
> but at the VM granularity. Actually it has to be an hybrid model, where
> shared BOs (those imported/exported) are tracked individually, while
> all private BOs are checked simultaneously (since they all share the VM
> resv object).
>
>> It would be good to have an RFC so that
>> we can look to see how user submission could fit in with it.
>
> Unfortunately, we don't have that yet :-(. All we have is a rough idea
> of how things will work, which is basically how TTM reclaim works, but
> adapted to GEM.
Fair enough, thanks for the description. We might need to coordinate to
get this looked at sooner if it's going to be blocking user submission.
>>
>>> Yeah and exactly that doesn't work any more when you are using user
>>> queues, because the kernel has no opportunity to attach a fence for each
>>> submission.
>>
>> User submission requires a cooperating user space[1]. So obviously user
>> space would need to ensure any BOs that it expects will be accessed to
>> be in some way pinned. Since the expectation of user space submission is
>> that we're reducing kernel involvement, I'd also expect these to be
>> fairly long-term pins.
>>
>> [1] Obviously with a timer to kill things from a malicious user space.
>>
>> The (closed) 'kbase' driver has a shrinker but is only used on a subset
>> of memory and it's up to user space to ensure that it keeps the relevant
>> parts pinned (or more specifically not marking them to be discarded if
>> there's memory pressure). Not that I think we should be taking it's
>> model as a reference here.
>>
>>>>> Memory which user space thinks the GPU might
>>>>> need should be pinned before the GPU work is submitted. APIs which
>>>>> require any form of 'paging in' of data would need to be implemented by
>>>>> the GPU work completing and being resubmitted by user space after the
>>>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
>>>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
>>>> that means we can't really transparently swap out GPU memory, or we
>>>> have to constantly pin/unpin around each job, which means even more
>>>> ioctl()s than we have now. Another option would be to add the XGS fence
>>>> to the BOs attached to the VM, assuming it's created before the job
>>>> submission itself, but you're no longer reducing the number of user <->
>>>> kernel round trips if you do that, because you now have to create an
>>>> XSG job for each submission, so you basically get back to one ioctl()
>>>> per submission.
>>
>> As you say the granularity of pinning has to be fairly coarse for user
>> space submission to make sense. My assumption (could be wildly wrong)
>> was that most memory would be pinned whenever a context is rendering.
>
> The granularity of pinning (in term of which regions are pinned) is not
> really the problem, we can just assume anything that's mapped to the VM
> will be used by the GPU (which is what we're planning to do for kernel
> submission BTW). The problem is making the timeslice during
> which VM memory is considered unreclaimable as short as possible, such
> that the system can reclaim memory under mem pressure. Ideally, you want
> to pin memory as long as you have jobs queued/running, and allow for
> reclaim when the GPU context is idle.
>
> We might be able to involve the panthor_scheduler for usermode queues,
> such that a context that's eligible for scheduling first gets its VM
> mappings pinned (fence creation + assignment to the VM/BO resvs), and
> things get reclaimable again when the group is evicted from the CSG
> slot. That implies evicting idle groups more aggressively than we do
> know, but there's probably a way around it.
Can we evict idle groups from the shrinker? User space already has to do
a little dance to work out whether it needs to "kick" the kernel to do
the submission. So it would be quite reasonable to extend the kick to
also pin the VM(s). The shrinker could then proactively evict idle
groups, and at that point it would be possible to unpin the memory.
I'm probably missing something, but that seems like a fairly solid
solution to me.
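A minimal sketch of that proposal (invented names, purely illustrative):
the userspace "kick" pins the VM as a side effect, and the shrinker
proactively evicts idle groups, unpinning their memory.

```python
class Group:
    def __init__(self):
        self.idle = True
        self.pinned = False

    def kick(self):
        # Userspace kick: the group becomes active, its VM gets pinned.
        self.idle = False
        self.pinned = True

    def finish(self):
        self.idle = True   # stays pinned until the shrinker runs


def shrinker_scan(groups):
    # Under memory pressure, evict idle groups and unpin their memory.
    evicted = 0
    for g in groups:
        if g.idle and g.pinned:
            g.pinned = False
            evicted += 1
    return evicted


groups = [Group() for _ in range(2)]
groups[0].kick()
groups[0].finish()
assert shrinker_scan(groups) == 1   # only the idle, pinned group goes
assert not groups[0].pinned
```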
>>
>>> For AMDGPU we are currently working on the following solution with
>>> memory management and user queues:
>>>
>>> 1. User queues are created through an kernel IOCTL, submissions work by
>>> writing into a ring buffer and ringing a doorbell.
>>>
>>> 2. Each queue can request the kernel to create fences for the currently
>>> pushed work for a queues which can then be attached to BOs, syncobjs,
>>> syncfiles etc...
>>>
>>> 3. Additional to that we have and eviction/preemption fence attached to
>>> all BOs, page tables, whatever resources we need.
>>>
>>> 4. When this eviction fences are requested to signal they first wait for
>>> all submission fences and then suspend the user queues and block
>>> creating new submission fences until the queues are restarted again.
>>>
>>> This way you can still do your memory management inside the kernel (e.g.
>>> move BOs from local to system memory) or even completely suspend and
>>> resume applications without their interaction, but as Sima said it is
>>> just horrible complicated to get right.
>>>
>>> We have been working on this for like two years now and it still could
>>> be that we missed something since it is not in production testing yet.
>>
>> I'm not entirely sure I follow how this doesn't create a dependency
>> cycle. From your description it sounds like you create a fence from the
>> user space queue which is then used to prevent eviction of the BOs needed.
>>
>> So to me it sounds like:
>>
>> 1. Attach fence to BOs to prevent eviction.
>>
>> 2. User space submits work to the ring buffer, rings doorbell.
>>
>> 3. Call into the kernel to create the fence for step 1.
>>
>> Which is obviously broken. What am I missing?
>>
>> One other thing to note is that Mali doesn't have local memory - so the
>> only benefit of unpinning is if we want to swap to disk (or zram etc).
>
> Which would be good to have, IMHO. If we don't do the implicit swapout
> based on some sort of least-recently-used-VM, we have to rely on
> userspace freeing its buffer (or flagging them reclaimable) as soon as
> they are no longer used by the GPU.
It's an awkward one because really you do want user space to free
buffers - generally the user space driver has a fairly large number of
buffers which are for temporary storage (and kept around to avoid the
overhead of freeing/allocating each frame). I know we've resorted to
idle timers in user space in an attempt to do this in the past - but
it's ugly.
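For reference, the idle-timer trick described above can be sketched as a toy model: a userspace BO cache that drops its temporary buffers once the context has been idle past a threshold. The names and the timeout value are illustrative, not a real driver API:

```c
/*
 * Toy model of a userspace BO cache with an idle timer. Illustrative
 * only; in a real driver the free path would call gem_close() etc.
 */
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 8
#define IDLE_TIMEOUT_NS (2ULL * 1000 * 1000 * 1000) /* 2 s, arbitrary */

struct bo_cache {
	int cached[CACHE_SLOTS];        /* nonzero = slot holds a cached BO */
	unsigned long long last_use_ns; /* timestamp of last GPU activity */
};

static void bo_cache_mark_active(struct bo_cache *c, unsigned long long now)
{
	c->last_use_ns = now;
}

/* Called from a periodic timer: drop the whole cache once the context
 * has been idle longer than the threshold. Returns the number freed. */
static int bo_cache_idle_tick(struct bo_cache *c, unsigned long long now)
{
	int freed = 0;

	if (now - c->last_use_ns < IDLE_TIMEOUT_NS)
		return 0;
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		if (c->cached[i]) {
			c->cached[i] = 0;   /* would free the BO here */
			freed++;
		}
	}
	return freed;
}
```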
Actual swapout can work, but it incurs a heavy penalty when the
application becomes active again. But I agree we should probably be
aiming to add support for this.
Steve
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-09-04 12:46 ` Steven Price
2024-09-04 13:20 ` Boris Brezillon
@ 2024-09-04 13:36 ` Christian König
1 sibling, 0 replies; 28+ messages in thread
From: Christian König @ 2024-09-04 13:36 UTC (permalink / raw)
To: Steven Price, Boris Brezillon
Cc: Mihail Atanassov, linux-kernel, Liviu Dudau, dri-devel,
David Airlie, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
Alex Deucher, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel
On 04.09.24 at 14:46, Steven Price wrote:
> On 04/09/2024 12:34, Christian König wrote:
>> Hi Boris,
>>
>> On 04.09.24 at 13:23, Boris Brezillon wrote:
>>>>>>> Please read up here on why that stuff isn't allowed:
>>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
>>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>>>>>> memory management easy mode.
>>>>> Ok, that at least makes things work for the moment.
>>>> Ah, perhaps this should have been spelt out more clearly ;)
>>>>
>>>> The VM_BIND mechanism that's already in place jumps through some hoops
>>>> to ensure that memory is preallocated when the memory operations are
>>>> enqueued. So any memory required should have been allocated before any
>>>> sync object is returned. We're aware of the issue with memory
>>>> allocations on the signalling path and trying to ensure that we don't
>>>> have that.
>>>>
>>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
>>>> memory with our design.
>>> That's actually what we were planning to do: the panthor shrinker was
>>> about to rely on fences attached to GEM objects to know if it can
>>> reclaim the memory. This design relies on each job attaching its fence
>>> to the GEM mapped to the VM at the time the job is submitted, such that
>>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
>>> is done.
> How progressed is this shrinker? It would be good to have an RFC so that
> we can look to see how user submission could fit in with it.
>
>> Yeah and exactly that doesn't work any more when you are using user
>> queues, because the kernel has no opportunity to attach a fence for each
>> submission.
> User submission requires a cooperating user space[1]. So obviously user
> space would need to ensure any BOs that it expects will be accessed are
> in some way pinned. Since the expectation of user space submission is
> that we're reducing kernel involvement, I'd also expect these to be
> fairly long-term pins.
>
> [1] Obviously with a timer to kill things from a malicious user space.
>
> The (closed) 'kbase' driver has a shrinker, but it is only used on a subset
> of memory and it's up to user space to ensure that it keeps the relevant
> parts pinned (or more specifically not marking them to be discarded if
> there's memory pressure). Not that I think we should be taking its
> model as a reference here.
>
>>>> Memory which user space thinks the GPU might
>>>> need should be pinned before the GPU work is submitted. APIs which
>>>> require any form of 'paging in' of data would need to be implemented by
>>>> the GPU work completing and being resubmitted by user space after the
>>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
>>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
>>> that means we can't really transparently swap out GPU memory, or we
>>> have to constantly pin/unpin around each job, which means even more
>>> ioctl()s than we have now. Another option would be to add the XGS fence
>>> to the BOs attached to the VM, assuming it's created before the job
>>> submission itself, but you're no longer reducing the number of user <->
>>> kernel round trips if you do that, because you now have to create an
>>> XSG job for each submission, so you basically get back to one ioctl()
>>> per submission.
> As you say the granularity of pinning has to be fairly coarse for user
> space submission to make sense. My assumption (could be wildly wrong)
> was that most memory would be pinned whenever a context is rendering.
>
>> For AMDGPU we are currently working on the following solution with
>> memory management and user queues:
>>
>> 1. User queues are created through a kernel IOCTL; submissions work by
>> writing into a ring buffer and ringing a doorbell.
>>
>> 2. Each queue can request the kernel to create fences for the currently
>> pushed work on a queue, which can then be attached to BOs, syncobjs,
>> syncfiles etc...
>>
>> 3. In addition to that we have an eviction/preemption fence attached to
>> all BOs, page tables, whatever resources we need.
>>
>> 4. When these eviction fences are requested to signal, they first wait for
>> all submission fences and then suspend the user queues and block
>> creating new submission fences until the queues are restarted again.
>>
>> This way you can still do your memory management inside the kernel (e.g.
>> move BOs from local to system memory) or even completely suspend and
>> resume applications without their interaction, but as Sima said it is
>> just horribly complicated to get right.
>>
>> We have been working on this for like two years now and it still could
>> be that we missed something since it is not in production testing yet.
> I'm not entirely sure I follow how this doesn't create a dependency
> cycle. From your description it sounds like you create a fence from the
> user space queue which is then used to prevent eviction of the BOs needed.
>
> So to me it sounds like:
>
> 1. Attach fence to BOs to prevent eviction.
>
> 2. User space submits work to the ring buffer, rings doorbell.
>
> 3. Call into the kernel to create the fence for step 1.
>
> Which is obviously broken. What am I missing?
Those are two different fences, we have an eviction fence and a
submission fence.
The eviction fence takes care of making sure the memory is not moved
while submissions are running.
And the submission fence is the actual signal to completion.
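Put differently, the wait graph only ever has edges from the eviction fence to submission fences, never the reverse; that is what keeps it acyclic. A toy check of that property (illustrative only, not kernel code):

```c
/*
 * Toy wait-graph check for the two-fence scheme: the eviction fence
 * waits on submission fences, but submission fences never wait on the
 * eviction fence, so the graph stays acyclic. Illustrative only.
 */
#include <assert.h>
#include <stdbool.h>

enum { F_SUBMIT_A, F_SUBMIT_B, F_EVICT, F_COUNT };

/* waits_on[x][y] means fence x waits on fence y before signalling. */
static bool waits_on[F_COUNT][F_COUNT];

static bool reaches(int from, int to, bool *visited)
{
	if (from == to)
		return true;
	visited[from] = true;
	for (int i = 0; i < F_COUNT; i++)
		if (waits_on[from][i] && !visited[i] && reaches(i, to, visited))
			return true;
	return false;
}

/* A cycle exists if some fence transitively waits on itself. */
static bool has_cycle(void)
{
	for (int i = 0; i < F_COUNT; i++)
		for (int j = 0; j < F_COUNT; j++) {
			bool visited[F_COUNT] = { false };

			if (i != j && waits_on[i][j] && reaches(j, i, visited))
				return true;
		}
	return false;
}
```

The "obviously broken" ordering from the question would correspond to adding an edge from a submission fence back to the eviction fence, which immediately shows up as a cycle.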
> One other thing to note is that Mali doesn't have local memory - so the
> only benefit of unpinning is if we want to swap to disk (or zram etc).
Yeah that makes sense, but the problem remains that with a userspace
submission model you completely nail down that pinning.
Regards,
Christian.
>
> Steve
>
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-09-04 13:35 ` Steven Price
@ 2024-09-04 13:58 ` Boris Brezillon
0 siblings, 0 replies; 28+ messages in thread
From: Boris Brezillon @ 2024-09-04 13:58 UTC (permalink / raw)
To: Steven Price
Cc: Adrián Larumbe, Christian König, Mihail Atanassov,
linux-kernel, Liviu Dudau, dri-devel, David Airlie,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Xinhui Pan, Shashank Sharma, Ketil Johnsen, Akash Goel
On Wed, 4 Sep 2024 14:35:12 +0100
Steven Price <steven.price@arm.com> wrote:
> On 04/09/2024 14:20, Boris Brezillon wrote:
> > + Adrian, who has been looking at the shrinker stuff for Panthor
> >
> > On Wed, 4 Sep 2024 13:46:12 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >
> >> On 04/09/2024 12:34, Christian König wrote:
> >>> Hi Boris,
> >>>
> >>> On 04.09.24 at 13:23, Boris Brezillon wrote:
> >>>>>>>> Please read up here on why that stuff isn't allowed:
> >>>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> >>>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> >>>>>>> memory management easy mode.
> >>>>>> Ok, that at least makes things work for the moment.
> >>>>> Ah, perhaps this should have been spelt out more clearly ;)
> >>>>>
> >>>>> The VM_BIND mechanism that's already in place jumps through some hoops
> >>>>> to ensure that memory is preallocated when the memory operations are
> >>>>> enqueued. So any memory required should have been allocated before any
> >>>>> sync object is returned. We're aware of the issue with memory
> >>>>> allocations on the signalling path and trying to ensure that we don't
> >>>>> have that.
> >>>>>
> >>>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
> >>>>> memory with our design.
> >>>> That's actually what we were planning to do: the panthor shrinker was
> >>>> about to rely on fences attached to GEM objects to know if it can
> >>>> reclaim the memory. This design relies on each job attaching its fence
> >>>> to the GEM mapped to the VM at the time the job is submitted, such that
> >>>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
> >>>> is done.
> >>
> >> How progressed is this shrinker?
> >
> > We don't have code yet. All we know is that we want to re-use Dmitry's
> > generic GEM-SHMEM shrinker implementation [1], and adjust it to match
> > the VM model, which means not tracking things at the BO granularity,
> > but at the VM granularity. Actually it has to be a hybrid model, where
> > shared BOs (those imported/exported) are tracked individually, while
> > all private BOs are checked simultaneously (since they all share the VM
> > resv object).
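A toy sketch of that hybrid tracking, where private BOs share the VM's reservation object (one busy check covers them all) and imported/exported BOs carry their own. All names are illustrative:

```c
/*
 * Toy model of hybrid reclaim tracking: private BOs point at the VM's
 * reservation, shared BOs at their own. Illustrative names only.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct resv { int active_fences; };

struct bo {
	struct resv *resv;      /* own resv (shared BO) or the VM's resv */
	bool shared;            /* imported/exported BO */
};

struct vm { struct resv resv; };

/* A BO is reclaimable when its reservation holds no active fences. */
static bool bo_reclaimable(const struct bo *bo)
{
	return bo->resv->active_fences == 0;
}

/* Count how many BOs the shrinker could reclaim right now. */
static size_t count_reclaimable(const struct bo *bos, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (bo_reclaimable(&bos[i]))
			count++;
	return count;
}
```

A single fence on the VM resv makes every private BO unreclaimable at once, while shared BOs are still checked individually.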
> >
> >> It would be good to have an RFC so that
> >> we can look to see how user submission could fit in with it.
> >
> > Unfortunately, we don't have that yet :-(. All we have is a rough idea
> > of how things will work, which is basically how TTM reclaim works, but
> > adapted to GEM.
>
> Fair enough, thanks for the description. We might need to coordinate to
> get this looked at sooner if it's going to be blocking user submission.
>
> >>
> >>> Yeah and exactly that doesn't work any more when you are using user
> >>> queues, because the kernel has no opportunity to attach a fence for each
> >>> submission.
> >>
> >> User submission requires a cooperating user space[1]. So obviously user
> >> space would need to ensure any BOs that it expects will be accessed are
> >> in some way pinned. Since the expectation of user space submission is
> >> that we're reducing kernel involvement, I'd also expect these to be
> >> fairly long-term pins.
> >>
> >> [1] Obviously with a timer to kill things from a malicious user space.
> >>
> >> The (closed) 'kbase' driver has a shrinker, but it is only used on a subset
> >> of memory and it's up to user space to ensure that it keeps the relevant
> >> parts pinned (or more specifically not marking them to be discarded if
> >> there's memory pressure). Not that I think we should be taking its
> >> model as a reference here.
> >>
> >>>>> Memory which user space thinks the GPU might
> >>>>> need should be pinned before the GPU work is submitted. APIs which
> >>>>> require any form of 'paging in' of data would need to be implemented by
> >>>>> the GPU work completing and being resubmitted by user space after the
> >>>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
> >>>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> >>>> that means we can't really transparently swap out GPU memory, or we
> >>>> have to constantly pin/unpin around each job, which means even more
> >>>> ioctl()s than we have now. Another option would be to add the XGS fence
> >>>> to the BOs attached to the VM, assuming it's created before the job
> >>>> submission itself, but you're no longer reducing the number of user <->
> >>>> kernel round trips if you do that, because you now have to create an
> >>>> XSG job for each submission, so you basically get back to one ioctl()
> >>>> per submission.
> >>
> >> As you say the granularity of pinning has to be fairly coarse for user
> >> space submission to make sense. My assumption (could be wildly wrong)
> >> was that most memory would be pinned whenever a context is rendering.
> >
> > The granularity of pinning (in term of which regions are pinned) is not
> > really the problem, we can just assume anything that's mapped to the VM
> > will be used by the GPU (which is what we're planning to do for kernel
> > submission BTW). The problem is making the timeslice during
> > which VM memory is considered unreclaimable as short as possible, such
> > that the system can reclaim memory under mem pressure. Ideally, you want
> > to pin memory as long as you have jobs queued/running, and allow for
> > reclaim when the GPU context is idle.
> >
> > We might be able to involve the panthor_scheduler for usermode queues,
> > such that a context that's eligible for scheduling first gets its VM
> > mappings pinned (fence creation + assignment to the VM/BO resvs), and
> > things get reclaimable again when the group is evicted from the CSG
> > slot. That implies evicting idle groups more aggressively than we do
> > now, but there's probably a way around it.
>
> Can we evict idle groups from the shrinker? User space already has to do
> a little dance to work out whether it needs to "kick" the kernel to do
> the submission. So it would be quite reasonable to extend the kick to
> also pin the VM(s). The shrinker could then proactively evict idle
> groups, and at that point it would be possible to unpin the memory.
>
> I'm probably missing something, but that seems like a fairly solid
> solution to me.
The locking might be tricky to get right, but other than that, it looks
like an option that could work. It's definitely worth investigating.
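A toy model of that kick/evict dance, assuming the userspace "kick" also pins the VM and the shrinker proactively evicts idle groups. All names are hypothetical:

```c
/*
 * Toy model of the kick/pin + shrinker-evict scheme discussed above.
 * Hypothetical names; not real panthor code.
 */
#include <assert.h>
#include <stdbool.h>

struct group {
	bool resident;   /* occupies a CSG slot */
	bool busy;       /* has queued/running jobs */
	bool vm_pinned;  /* VM mappings are unreclaimable */
};

/* Userspace kick: make the group resident and pin its VM mappings. */
static void group_kick(struct group *g)
{
	g->resident = true;
	g->busy = true;
	g->vm_pinned = true;
}

static void group_idle(struct group *g)
{
	g->busy = false;
}

/* Shrinker path: evict an idle resident group so its VM becomes
 * reclaimable again. Returns true if anything was unpinned. */
static bool shrinker_try_evict(struct group *g)
{
	if (!g->resident || g->busy)
		return false;
	g->resident = false;
	g->vm_pinned = false;
	return true;
}
```

Memory stays pinned only while a group actually has work, which is exactly the "as short a timeslice as possible" goal described above.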
>
> >>
> >>> For AMDGPU we are currently working on the following solution with
> >>> memory management and user queues:
> >>>
> >>> 1. User queues are created through a kernel IOCTL; submissions work by
> >>> writing into a ring buffer and ringing a doorbell.
> >>>
> >>> 2. Each queue can request the kernel to create fences for the currently
> >>> pushed work on a queue, which can then be attached to BOs, syncobjs,
> >>> syncfiles etc...
> >>>
> >>> 3. In addition to that we have an eviction/preemption fence attached to
> >>> all BOs, page tables, whatever resources we need.
> >>>
> >>> 4. When these eviction fences are requested to signal, they first wait for
> >>> all submission fences and then suspend the user queues and block
> >>> creating new submission fences until the queues are restarted again.
> >>>
> >>> This way you can still do your memory management inside the kernel (e.g.
> >>> move BOs from local to system memory) or even completely suspend and
> >>> resume applications without their interaction, but as Sima said it is
> >>> just horribly complicated to get right.
> >>>
> >>> We have been working on this for like two years now and it still could
> >>> be that we missed something since it is not in production testing yet.
> >>
> >> I'm not entirely sure I follow how this doesn't create a dependency
> >> cycle. From your description it sounds like you create a fence from the
> >> user space queue which is then used to prevent eviction of the BOs needed.
> >>
> >> So to me it sounds like:
> >>
> >> 1. Attach fence to BOs to prevent eviction.
> >>
> >> 2. User space submits work to the ring buffer, rings doorbell.
> >>
> >> 3. Call into the kernel to create the fence for step 1.
> >>
> >> Which is obviously broken. What am I missing?
> >>
> >> One other thing to note is that Mali doesn't have local memory - so the
> >> only benefit of unpinning is if we want to swap to disk (or zram etc).
> >
> > Which would be good to have, IMHO. If we don't do the implicit swapout
> > based on some sort of least-recently-used-VM, we have to rely on
> > userspace freeing its buffers (or flagging them reclaimable) as soon as
> > they are no longer used by the GPU.
>
> It's an awkward one because really you do want user space to free
> buffers - generally the user space driver has a fairly large number of
> buffers which are for temporary storage (and kept around to avoid the
> overhead of freeing/allocating each frame). I know we've resorted to
> idle timers in user space in an attempt to do this in the past - but
> it's ugly.
Fair enough.
>
> Actual swapout can work, but it incurs a heavy penalty when the
> application becomes active again. But I agree we should probably be
> aiming to add support for this.
I guess the other case is a gazillion GPU contexts, each keeping
buffers around for good reason. If most of those contexts are idle, I
guess swapping out mostly inactive contexts is better than having the
OOM killer kick in to reclaim memory. And yes, there's a hit any time an
inactive context gets active again, if and only if the system was under
memory pressure in the meantime. So I guess the slowdown is acceptable
in that case.
Worth mentioning that, IIRC, MSM already moved away from explicit BO
reclaim (BO flagged reclaimable from userspace when they're back in the
userspace BO cache) in favor of an LRU-based swapout. Dmitry's work is
also centered around this LRU-based swapout model, even though it also
supports the model we have in panfrost, lima, vc4, etc. So it looks
like the new trend for embedded GPU drivers is to defer this memory
reclaim responsibility entirely to the kernel rather than letting
userspace decide which bits are reclaimable.
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
[not found] ` <a5a53492-9651-403e-b613-91ef0b9e80b6@amd.com>
2024-09-04 12:46 ` Steven Price
@ 2024-09-24 9:30 ` Simona Vetter
2024-11-08 22:27 ` Matthew Brost
1 sibling, 1 reply; 28+ messages in thread
From: Simona Vetter @ 2024-09-24 9:30 UTC (permalink / raw)
To: Christian König
Cc: Boris Brezillon, Steven Price, Mihail Atanassov, linux-kernel,
Liviu Dudau, dri-devel, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher, Xinhui Pan,
Shashank Sharma, Ketil Johnsen, Akash Goel
Apologies for the late reply ...
On Wed, Sep 04, 2024 at 01:34:18PM +0200, Christian König wrote:
> Hi Boris,
>
> On 04.09.24 at 13:23, Boris Brezillon wrote:
> > > > > > Please read up here on why that stuff isn't allowed:
> > > > > > https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> > > > > panthor doesn't yet have a shrinker, so all memory is pinned, which means
> > > > > memory management easy mode.
> > > > Ok, that at least makes things work for the moment.
> > > Ah, perhaps this should have been spelt out more clearly ;)
> > >
> > > The VM_BIND mechanism that's already in place jumps through some hoops
> > > to ensure that memory is preallocated when the memory operations are
> > > enqueued. So any memory required should have been allocated before any
> > > sync object is returned. We're aware of the issue with memory
> > > allocations on the signalling path and trying to ensure that we don't
> > > have that.
> > >
> > > I'm hoping that we don't need a shrinker which deals with (active) GPU
> > > memory with our design.
> > That's actually what we were planning to do: the panthor shrinker was
> > about to rely on fences attached to GEM objects to know if it can
> > reclaim the memory. This design relies on each job attaching its fence
> > to the GEM mapped to the VM at the time the job is submitted, such that
> > memory that's in-use or about-to-be-used doesn't vanish before the GPU
> > is done.
>
> Yeah and exactly that doesn't work any more when you are using user queues,
> because the kernel has no opportunity to attach a fence for each submission.
>
> > > Memory which user space thinks the GPU might
> > > need should be pinned before the GPU work is submitted. APIs which
> > > require any form of 'paging in' of data would need to be implemented by
> > > the GPU work completing and being resubmitted by user space after the
> > > memory changes (i.e. there could be a DMA fence pending on the GPU work).
> > Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> > that means we can't really transparently swap out GPU memory, or we
> > have to constantly pin/unpin around each job, which means even more
> > ioctl()s than we have now. Another option would be to add the XGS fence
> > to the BOs attached to the VM, assuming it's created before the job
> > submission itself, but you're no longer reducing the number of user <->
> > kernel round trips if you do that, because you now have to create an
> > XSG job for each submission, so you basically get back to one ioctl()
> > per submission.
>
> For AMDGPU we are currently working on the following solution with memory
> management and user queues:
>
> 1. User queues are created through a kernel IOCTL; submissions work by
> writing into a ring buffer and ringing a doorbell.
>
> 2. Each queue can request the kernel to create fences for the currently
> pushed work on a queue, which can then be attached to BOs, syncobjs,
> syncfiles etc...
>
> 3. In addition to that we have an eviction/preemption fence attached to all
> BOs, page tables, whatever resources we need.
>
> 4. When these eviction fences are requested to signal, they first wait for all
> submission fences and then suspend the user queues and block creating new
> submission fences until the queues are restarted again.
Yup this works, at least when I play it out in my head.
Note that the completion fence is only deadlock free if userspace is
really, really careful. Which in practice means you need the very
carefully constructed rules for e.g. vulkan winsys fences, otherwise you
do indeed deadlock.
But if you keep that promise in mind, then it works, and step 2 is
entirely optional, which means we can start userspace in a pure long-running
compute mode where there's only eviction/preemption fences. And then if
userspace needs a vulkan winsys fence, we can create that with step 2 as
needed.
But the important part is that you need really strict rules on userspace
for when step 2 is ok and won't result in deadlocks. And those rules are
uapi, which is why I think doing this in panthor without the shrinker and
eviction fences (i.e. steps 3&4 above) is a very bad mistake.
> This way you can still do your memory management inside the kernel (e.g.
> move BOs from local to system memory) or even completely suspend and resume
> applications without their interaction, but as Sima said it is just horribly
> complicated to get right.
>
> We have been working on this for like two years now and it still could be
> that we missed something since it is not in production testing yet.
Ack.
-Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-09-24 9:30 ` Simona Vetter
@ 2024-11-08 22:27 ` Matthew Brost
2024-11-11 12:57 ` Christian König
0 siblings, 1 reply; 28+ messages in thread
From: Matthew Brost @ 2024-11-08 22:27 UTC (permalink / raw)
To: Simona Vetter
Cc: Christian König, Boris Brezillon, Steven Price,
Mihail Atanassov, linux-kernel, Liviu Dudau, dri-devel,
David Airlie, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
Alex Deucher, Xinhui Pan, Shashank Sharma, Ketil Johnsen,
Akash Goel
On Tue, Sep 24, 2024 at 11:30:53AM +0200, Simona Vetter wrote:
> Apologies for the late reply ...
>
Also late reply, just read this.
> On Wed, Sep 04, 2024 at 01:34:18PM +0200, Christian König wrote:
> > Hi Boris,
> >
> > On 04.09.24 at 13:23, Boris Brezillon wrote:
> > > > > > > Please read up here on why that stuff isn't allowed:
> > > > > > > https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> > > > > > panthor doesn't yet have a shrinker, so all memory is pinned, which means
> > > > > > memory management easy mode.
> > > > > Ok, that at least makes things work for the moment.
> > > > Ah, perhaps this should have been spelt out more clearly ;)
> > > >
> > > > The VM_BIND mechanism that's already in place jumps through some hoops
> > > > to ensure that memory is preallocated when the memory operations are
> > > > enqueued. So any memory required should have been allocated before any
> > > > sync object is returned. We're aware of the issue with memory
> > > > allocations on the signalling path and trying to ensure that we don't
> > > > have that.
> > > >
> > > > I'm hoping that we don't need a shrinker which deals with (active) GPU
> > > > memory with our design.
> > > That's actually what we were planning to do: the panthor shrinker was
> > > about to rely on fences attached to GEM objects to know if it can
> > > reclaim the memory. This design relies on each job attaching its fence
> > > to the GEM mapped to the VM at the time the job is submitted, such that
> > > memory that's in-use or about-to-be-used doesn't vanish before the GPU
> > > is done.
> >
> > Yeah and exactly that doesn't work any more when you are using user queues,
> > because the kernel has no opportunity to attach a fence for each submission.
> >
> > > > Memory which user space thinks the GPU might
> > > > need should be pinned before the GPU work is submitted. APIs which
> > > > require any form of 'paging in' of data would need to be implemented by
> > > > the GPU work completing and being resubmitted by user space after the
> > > > memory changes (i.e. there could be a DMA fence pending on the GPU work).
> > > Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> > > that means we can't really transparently swap out GPU memory, or we
> > > have to constantly pin/unpin around each job, which means even more
> > > ioctl()s than we have now. Another option would be to add the XGS fence
> > > to the BOs attached to the VM, assuming it's created before the job
> > > submission itself, but you're no longer reducing the number of user <->
> > > kernel round trips if you do that, because you now have to create an
> > > XSG job for each submission, so you basically get back to one ioctl()
> > > per submission.
> >
> > For AMDGPU we are currently working on the following solution with memory
> > management and user queues:
> >
> > 1. User queues are created through a kernel IOCTL; submissions work by
> > writing into a ring buffer and ringing a doorbell.
> >
> > 2. Each queue can request the kernel to create fences for the currently
> > pushed work on a queue, which can then be attached to BOs, syncobjs,
> > syncfiles etc...
> >
> > 3. In addition to that we have an eviction/preemption fence attached to all
> > BOs, page tables, whatever resources we need.
> >
> > 4. When these eviction fences are requested to signal, they first wait for all
> > submission fences and then suspend the user queues and block creating new
> > submission fences until the queues are restarted again.
>
> Yup this works, at least when I play it out in my head.
>
I just started experimenting with user submission in Xe last week and
ended up landing on a different PoC, blissfully unaware of future fences /
the Mesa submit thread. However, after Sima filled me in, I’ve essentially
landed on exactly what Christian is describing in Xe. I haven’t coded it
yet, but have the design in my head.
I also generally agree with Sima’s comments about having a somewhat
generic preempt fence (Christian refers to this as an eviction fence)
as well.
Additionally, I’m thinking it might be beneficial for us to add a new
'preempt' dma-resv slot to track these, which would make it easier to
enforce the ordering of submission fence signaling before preempt
fences.
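A toy sketch of such a 'preempt' slot, loosely following the ordered DMA_RESV_USAGE_* semantics where waiting for one usage class also waits for all stricter classes. The USAGE_PREEMPT value and its position in the ordering are hypothetical, mirroring the proposal rather than any existing uapi:

```c
/*
 * Toy model of a dma-resv with an extra 'preempt' usage slot.
 * Hypothetical; real dma_resv has KERNEL/WRITE/READ/BOOKKEEP only.
 */
#include <assert.h>
#include <stddef.h>

enum resv_usage {          /* ordered: lower value = stricter */
	USAGE_KERNEL,
	USAGE_WRITE,
	USAGE_READ,
	USAGE_PREEMPT,     /* the proposed new slot */
	USAGE_BOOKKEEP,
	USAGE_COUNT,
};

struct toy_resv { int fences[USAGE_COUNT]; };

static void resv_add_fence(struct toy_resv *r, enum resv_usage u)
{
	r->fences[u]++;
}

/* Number of fences a waiter for class 'u' must wait on: its own class
 * and every stricter one, mirroring dma_resv usage semantics. A
 * preempt-class waiter therefore always waits for submission fences. */
static int resv_wait_count(const struct toy_resv *r, enum resv_usage u)
{
	int n = 0;

	for (int i = 0; i <= (int)u; i++)
		n += r->fences[i];
	return n;
}
```

Because USAGE_PREEMPT sorts after the submission classes, the ordering "submission fences signal before preempt fences are acted on" falls out of the slot layout itself.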
Depending on bandwidth, I may post an RFC to the list soon. I’ll also
gauge the interest and bandwidth from our Mesa team to begin UMD work.
Matt
> Note that the completion fence is only deadlock free if userspace is
> really, really careful. Which in practice means you need the very
> carefully constructed rules for e.g. vulkan winsys fences, otherwise you
> do indeed deadlock.
>
> But if you keep that promise in mind, then it works, and step 2 is
> entirely optional, which means we can start userspace in a pure long-running
> compute mode where there's only eviction/preemption fences. And then if
> userspace needs a vulkan winsys fence, we can create that with step 2 as
> needed.
>
> But the important part is that you need really strict rules on userspace
> for when step 2 is ok and won't result in deadlocks. And those rules are
> uapi, which is why I think doing this in panthor without the shrinker and
> eviction fences (i.e. steps 3&4 above) is a very bad mistake.
>
> > This way you can still do your memory management inside the kernel (e.g.
> > move BOs from local to system memory) or even completely suspend and resume
> > applications without their interaction, but as Sima said it is just horribly
> > complicated to get right.
> >
> > We have been working on this for like two years now and it still could be
> > that we missed something since it is not in production testing yet.
>
> Ack.
> -Sima
> --
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-11-08 22:27 ` Matthew Brost
@ 2024-11-11 12:57 ` Christian König
2024-11-11 20:29 ` Matthew Brost
0 siblings, 1 reply; 28+ messages in thread
From: Christian König @ 2024-11-11 12:57 UTC (permalink / raw)
To: Matthew Brost, Simona Vetter
Cc: Boris Brezillon, Steven Price, Mihail Atanassov, linux-kernel,
Liviu Dudau, dri-devel, David Airlie, Maarten Lankhorst,
Maxime Ripard, Thomas Zimmermann, Alex Deucher, Xinhui Pan,
Shashank Sharma, Ketil Johnsen, Akash Goel
On 08.11.24 at 23:27, Matthew Brost wrote:
> On Tue, Sep 24, 2024 at 11:30:53AM +0200, Simona Vetter wrote:
>> Apologies for the late reply ...
>>
> Also late reply, just read this.
>
>> On Wed, Sep 04, 2024 at 01:34:18PM +0200, Christian König wrote:
>>> Hi Boris,
>>>
>>> On 04.09.24 at 13:23, Boris Brezillon wrote:
>>>>>>>> Please read up here on why that stuff isn't allowed:
>>>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
>>>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
>>>>>>> memory management easy mode.
>>>>>> Ok, that at least makes things work for the moment.
>>>>> Ah, perhaps this should have been spelt out more clearly ;)
>>>>>
>>>>> The VM_BIND mechanism that's already in place jumps through some hoops
>>>>> to ensure that memory is preallocated when the memory operations are
>>>>> enqueued. So any memory required should have been allocated before any
>>>>> sync object is returned. We're aware of the issue with memory
>>>>> allocations on the signalling path and trying to ensure that we don't
>>>>> have that.
>>>>>
>>>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
>>>>> memory with our design.
>>>> That's actually what we were planning to do: the panthor shrinker was
>>>> about to rely on fences attached to GEM objects to know if it can
>>>> reclaim the memory. This design relies on each job attaching its fence
>>>> to the GEM mapped to the VM at the time the job is submitted, such that
>>>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
>>>> is done.
>>> Yeah and exactly that doesn't work any more when you are using user queues,
>>> because the kernel has no opportunity to attach a fence for each submission.
>>>
>>>>> Memory which user space thinks the GPU might
>>>>> need should be pinned before the GPU work is submitted. APIs which
>>>>> require any form of 'paging in' of data would need to be implemented by
>>>>> the GPU work completing and being resubmitted by user space after the
>>>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
>>>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
>>>> that means we can't really transparently swap out GPU memory, or we
>>>> have to constantly pin/unpin around each job, which means even more
>>>> ioctl()s than we have now. Another option would be to add the XGS fence
>>>> to the BOs attached to the VM, assuming it's created before the job
>>>> submission itself, but you're no longer reducing the number of user <->
>>>> kernel round trips if you do that, because you now have to create an
>>>> XGS job for each submission, so you basically get back to one ioctl()
>>>> per submission.
>>> For AMDGPU we are currently working on the following solution with memory
>>> management and user queues:
>>>
>>> 1. User queues are created through a kernel IOCTL, submissions work by
>>> writing into a ring buffer and ringing a doorbell.
>>>
>>> 2. Each queue can request the kernel to create fences for the currently
>>> pushed work on a queue, which can then be attached to BOs, syncobjs,
>>> syncfiles etc...
>>>
>>> 3. In addition to that we have an eviction/preemption fence attached to all
>>> BOs, page tables, and whatever resources we need.
>>>
>>> 4. When these eviction fences are requested to signal, they first wait for all
>>> submission fences, then suspend the user queues and block creating new
>>> submission fences until the queues are restarted again.
>> Yup this works, at least when I play it out in my head.
>>
> I just started experimenting with user submission in Xe last week and
> ended up landing on a different PoC, blissfully unaware of future fences /
> the Mesa submit thread. However, after Sima filled me in, I’ve essentially
> landed on exactly what Christian is describing in Xe. I haven’t coded it
> yet, but have the design in my head.
Sounds like going over that design again and again was time well invested.
And yeah, we have it working, and so far it really looks like it works.
> I also generally agree with Sima’s comments about having a somewhat
> generic preempt fence (Christian refers to this as an eviction fence)
> as well.
Well, that is really a bikeshed.
I don't care if we call it preempt fence or eviction fence as long as
everybody understands what that thing is supposed to do.
Probably something we should document.
> Additionally, I’m thinking it might be beneficial for us to add a new
> 'preempt' dma-resv slot to track these, which would make it easier to
> enforce the ordering of submission fence signaling before preempt
> fences.
That's exactly what DMA_RESV_USAGE_BOOKKEEP is good for.
And yes, I spent really *a lot of time* planning this :)
> Depending on bandwidth, I may post an RFC to the list soon. I’ll also
> gauge the interest and bandwidth from our Mesa team to begin UMD work.
Please loop me in as well.
Regards,
Christian.
>
> Matt
>
>> Note that the completion fence is only deadlock free if userspace is
>> really, really careful. Which in practice means you need the very
>> carefully constructed rules for e.g. vulkan winsys fences, otherwise you
>> do indeed deadlock.
>>
>> But if you keep that promise in mind, then it works, and step 2 is
>> entirely optional, which means we can start userspace in a pure long-running
>> compute mode where there's only eviction/preemption fences. And then if
>> userspace needs a vulkan winsys fence, we can create that with step 2 as
>> needed.
>>
>> But the important part is that you need really strict rules on userspace
>> for when step 2 is ok and won't result in deadlocks. And those rules are
>> uapi, which is why I think doing this in panthor without the shrinker and
>> eviction fences (i.e. steps 3&4 above) is a very bad mistake.
>>
>>> This way you can still do your memory management inside the kernel (e.g.
>>> move BOs from local to system memory) or even completely suspend and resume
>>> applications without their interaction, but as Sima said it is just horribly
>>> complicated to get right.
>>>
>>> We have been working on this for like two years now and it still could be
>>> that we missed something since it is not in production testing yet.
>> Ack.
>> -Sima
>> --
>> Simona Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/10] drm/panthor: Add user submission
2024-11-11 12:57 ` Christian König
@ 2024-11-11 20:29 ` Matthew Brost
0 siblings, 0 replies; 28+ messages in thread
From: Matthew Brost @ 2024-11-11 20:29 UTC (permalink / raw)
To: Christian König
Cc: Simona Vetter, Boris Brezillon, Steven Price, Mihail Atanassov,
linux-kernel, Liviu Dudau, dri-devel, David Airlie,
Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Alex Deucher,
Xinhui Pan, Shashank Sharma, Ketil Johnsen, Akash Goel
On Mon, Nov 11, 2024 at 01:57:15PM +0100, Christian König wrote:
> Am 08.11.24 um 23:27 schrieb Matthew Brost:
> > On Tue, Sep 24, 2024 at 11:30:53AM +0200, Simona Vetter wrote:
> > > Apologies for the late reply ...
> > >
> > Also late reply, just read this.
> >
> > > On Wed, Sep 04, 2024 at 01:34:18PM +0200, Christian König wrote:
> > > > Hi Boris,
> > > >
> > > > Am 04.09.24 um 13:23 schrieb Boris Brezillon:
> > > > > > > > > Please read up here on why that stuff isn't allowed:
> > > > > > > > > https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> > > > > > > > panthor doesn't yet have a shrinker, so all memory is pinned, which means
> > > > > > > > memory management easy mode.
> > > > > > > Ok, that at least makes things work for the moment.
> > > > > > Ah, perhaps this should have been spelt out more clearly ;)
> > > > > >
> > > > > > The VM_BIND mechanism that's already in place jumps through some hoops
> > > > > > to ensure that memory is preallocated when the memory operations are
> > > > > > enqueued. So any memory required should have been allocated before any
> > > > > > sync object is returned. We're aware of the issue with memory
> > > > > > allocations on the signalling path and are trying to ensure that we don't
> > > > > > have that.
> > > > > >
> > > > > > I'm hoping that we don't need a shrinker which deals with (active) GPU
> > > > > > memory with our design.
> > > > > That's actually what we were planning to do: the panthor shrinker was
> > > > > about to rely on fences attached to GEM objects to know if it can
> > > > > reclaim the memory. This design relies on each job attaching its fence
> > > > > to the GEM mapped to the VM at the time the job is submitted, such that
> > > > > memory that's in-use or about-to-be-used doesn't vanish before the GPU
> > > > > is done.
> > > > Yeah and exactly that doesn't work any more when you are using user queues,
> > > > because the kernel has no opportunity to attach a fence for each submission.
> > > >
> > > > > > Memory which user space thinks the GPU might
> > > > > > need should be pinned before the GPU work is submitted. APIs which
> > > > > > require any form of 'paging in' of data would need to be implemented by
> > > > > > the GPU work completing and being resubmitted by user space after the
> > > > > > memory changes (i.e. there could be a DMA fence pending on the GPU work).
> > > > > Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> > > > > that means we can't really transparently swap out GPU memory, or we
> > > > > have to constantly pin/unpin around each job, which means even more
> > > > > ioctl()s than we have now. Another option would be to add the XGS fence
> > > > > to the BOs attached to the VM, assuming it's created before the job
> > > > > submission itself, but you're no longer reducing the number of user <->
> > > > > kernel round trips if you do that, because you now have to create an
> > > > > XGS job for each submission, so you basically get back to one ioctl()
> > > > > per submission.
> > > > For AMDGPU we are currently working on the following solution with memory
> > > > management and user queues:
> > > >
> > > > 1. User queues are created through a kernel IOCTL, submissions work by
> > > > writing into a ring buffer and ringing a doorbell.
> > > >
> > > > 2. Each queue can request the kernel to create fences for the currently
> > > > pushed work on a queue, which can then be attached to BOs, syncobjs,
> > > > syncfiles etc...
> > > >
> > > > 3. In addition to that we have an eviction/preemption fence attached to all
> > > > BOs, page tables, and whatever resources we need.
> > > >
> > > > 4. When these eviction fences are requested to signal, they first wait for all
> > > > submission fences, then suspend the user queues and block creating new
> > > > submission fences until the queues are restarted again.
> > > Yup this works, at least when I play it out in my head.
> > >
> > I just started experimenting with user submission in Xe last week and
> > ended up landing on a different PoC, blissfully unaware of future fences /
> > the Mesa submit thread. However, after Sima filled me in, I’ve essentially
> > landed on exactly what Christian is describing in Xe. I haven’t coded it
> > yet, but have the design in my head.
>
> Sounds like going over that design again and again was time well invested.
>
> And yeah, we have it working, and so far it really looks like it works.
>
From the progress I've made, yeah, I think this can work with some
cooperation from user space. I'm engaging with the Mesa team this week to
get a clearer picture on that end.
> > I also generally agree with Sima’s comments about having a somewhat
> > generic preempt fence (Christian refers to this as an eviction fence)
> > as well.
>
> Well, that is really a bikeshed.
>
> I don't care if we call it preempt fence or eviction fence as long as
> everybody understands what that thing is supposed to do.
>
> Probably something we should document.
>
Agreed, something common / documented would be good given how tricky this
will be to get correct.
> > Additionally, I’m thinking it might be beneficial for us to add a new
> > 'preempt' dma-resv slot to track these, which would make it easier to
> > enforce the ordering of submission fence signaling before preempt
> > fences.
>
> That's exactly what DMA_RESV_USAGE_BOOKKEEP is good for.
>
We can move this specific discussion to the RFC I posted. Saw you
replied there and will respond shortly / once I have followed up on a
few things on my end.
> And yes, I spent really *a lot of time* planning this :)
>
I've looked at the list, and a bunch of patches there modify code
which is not merged to drm-tip, e.g. this patch [1].
Trying to piece together what AMD is doing compared to what I had in
mind is a bit difficult without being able to directly look at the code.
Any chance you have a public repo I can look at?
[1] https://patchwork.freedesktop.org/patch/622074/?series=140648&rev=1
> > Depending on bandwidth, I may post an RFC to the list soon. I’ll also
> > gauge the interest and bandwidth from our Mesa team to begin UMD work.
>
> Please loop me in as well.
>
Will do. This is coming together pretty quickly, so it shouldn't be too long
until I can post something or push a public branch. I think the Mesa side is
going to be the more difficult piece, and I'm unsure how much interest there
will be to move this along.
Matt
> Regards,
> Christian.
>
> >
> > Matt
> >
> > > Note that the completion fence is only deadlock free if userspace is
> > > really, really careful. Which in practice means you need the very
> > > carefully constructed rules for e.g. vulkan winsys fences, otherwise you
> > > do indeed deadlock.
> > >
> > > But if you keep that promise in mind, then it works, and step 2 is
> > > entirely optional, which means we can start userspace in a pure long-running
> > > compute mode where there's only eviction/preemption fences. And then if
> > > userspace needs a vulkan winsys fence, we can create that with step 2 as
> > > needed.
> > >
> > > But the important part is that you need really strict rules on userspace
> > > for when step 2 is ok and won't result in deadlocks. And those rules are
> > > uapi, which is why I think doing this in panthor without the shrinker and
> > > eviction fences (i.e. steps 3&4 above) is a very bad mistake.
> > >
> > > > This way you can still do your memory management inside the kernel (e.g.
> > > > move BOs from local to system memory) or even completely suspend and resume
> > > > applications without their interaction, but as Sima said it is just horribly
> > > > complicated to get right.
> > > >
> > > > We have been working on this for like two years now and it still could be
> > > > that we missed something since it is not in production testing yet.
> > > Ack.
> > > -Sima
> > > --
> > > Simona Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
end of thread, other threads:[~2024-11-11 20:29 UTC | newest]
Thread overview: 28+ messages
2024-08-28 17:25 [RFC PATCH 00/10] drm/panthor: Add user submission Mihail Atanassov
2024-08-28 17:25 ` [PATCH 1/8] drm/panthor: Add uAPI to submit from user space Mihail Atanassov
2024-08-28 17:25 ` [PATCH 2/8] drm/panthor: Extend GROUP_CREATE for user submission Mihail Atanassov
2024-08-28 17:25 ` [PATCH 3/8] drm/panthor: Map doorbell pages Mihail Atanassov
2024-08-28 17:26 ` [PATCH 4/8] drm/panthor: Add GROUP_KICK ioctl Mihail Atanassov
2024-08-28 17:26 ` [PATCH 5/8] drm/panthor: Factor out syncobj handling Mihail Atanassov
2024-08-28 17:26 ` [PATCH 6/8] drm/panthor: Implement XGS queues Mihail Atanassov
2024-09-03 19:17 ` Simona Vetter
2024-08-28 17:26 ` [PATCH 7/8] drm/panthor: Add sync_update eventfd handling Mihail Atanassov
2024-08-28 17:26 ` [PATCH 8/8] drm/panthor: Add SYNC_UPDATE ioctl Mihail Atanassov
2024-08-29 9:40 ` [RFC PATCH 00/10] drm/panthor: Add user submission Christian König
2024-08-29 13:37 ` Steven Price
[not found] ` <96ef7ae3-4df1-4859-8672-453055bbfe96@amd.com>
2024-09-03 21:11 ` Simona Vetter
2024-09-04 7:49 ` Christian König
2024-09-04 9:31 ` Steven Price
2024-09-04 11:23 ` Boris Brezillon
[not found] ` <a5a53492-9651-403e-b613-91ef0b9e80b6@amd.com>
2024-09-04 12:46 ` Steven Price
2024-09-04 13:20 ` Boris Brezillon
2024-09-04 13:35 ` Steven Price
2024-09-04 13:58 ` Boris Brezillon
2024-09-04 13:36 ` Christian König
2024-09-24 9:30 ` Simona Vetter
2024-11-08 22:27 ` Matthew Brost
2024-11-11 12:57 ` Christian König
2024-11-11 20:29 ` Matthew Brost
2024-09-04 9:43 ` Simona Vetter
2024-08-29 13:48 ` Ketil Johnsen
2024-09-03 12:27 ` Ketil Johnsen