From: Demi Marie Obenour <demi@invisiblethingslab.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Matthew Brost" <matthew.brost@intel.com>,
intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: himal.prasad.ghimiray@intel.com, apopple@nvidia.com,
airlied@gmail.com, simona.vetter@ffwll.ch,
felix.kuehling@amd.com, dakr@kernel.org
Subject: Re: [PATCH v5 00/32] Introduce GPU SVM and Xe SVM implementation
Date: Fri, 14 Feb 2025 13:36:10 -0500
Message-ID: <Z6-NHgRbMzhkFYcq@itl-email>
In-Reply-To: <ae45297e3132f13c6d5113aefeaed2c91ed7010c.camel@linux.intel.com>
On Fri, Feb 14, 2025 at 05:26:48PM +0100, Thomas Hellström wrote:
> Hi!
>
> On Fri, 2025-02-14 at 11:14 -0500, Demi Marie Obenour wrote:
> > On Fri, Feb 14, 2025 at 09:47:13AM +0100, Thomas Hellström wrote:
> > > Hi
> > >
> > > On Thu, 2025-02-13 at 16:23 -0500, Demi Marie Obenour wrote:
> > > > On Wed, Feb 12, 2025 at 06:10:40PM -0800, Matthew Brost wrote:
> > > > > Version 5 of GPU SVM. Thanks to everyone (especially Sima, Thomas,
> > > > > Alistair, Himal) for their numerous reviews on revision 1, 2, 3 and
> > > > > for helping to address many design issues.
> > > > >
> > > > > This version has been tested with IGT [1] on PVC, BMG, and LNL. Also
> > > > > tested with level0 (UMD) PR [2].
> > > >
> > > > What is the plan to deal with not being able to preempt while a page
> > > > fault is pending? This seems like an easy DoS vector. My understanding
> > > > is that SVM is mostly used by compute workloads on headless systems.
> > > > Recent AMD client GPUs don't support SVM, so programs that want to run
> > > > on client systems should not require SVM if they wish to be portable.
> > > >
> > > > Given the potential for abuse, I think it would be best to require
> > > > explicit administrator opt-in to enable SVM, along with possibly having
> > > > a timeout to resolve a page fault (after which the context is killed).
> > > > Since I expect most uses of SVM to be in the datacenter space (for the
> > > > reasons mentioned above), I don't believe this will be a major
> > > > limitation in practice. Programs that wish to run on client systems
> > > > already need to use explicit memory transfer or pinned userptr, and
> > > > administrators of compute clusters should be willing to enable this
> > > > feature because only one workload will be using a GPU at a time.
> > >
> > > While not directly having addressed the potential DoS issue you
> > > mention, there is an associated deadlock possibility that may happen
> > > due to not being able to preempt a pending pagefault. That is if a
> > > dma-fence job is requiring the same resources held up by the pending
> > > page-fault, and then the pagefault servicing is dependent on that
> > > dma-fence to be signaled in one way or another.
> > >
> > > That deadlock is handled by only allowing either page-faulting jobs or
> > > dma-fence jobs on a resource (hw engine or hw engine group) that can be
> > > used by both at a time, blocking synchronously in the exec IOCTL until
> > > the resource is available for the job type. That means LR jobs wait
> > > for all dma-fence jobs to complete, and dma-fence jobs wait for all LR
> > > jobs to preempt. So a dma-fence job wait could easily mean "wait for
> > > all outstanding pagefaults to be serviced".
> > >
> > > Whether, on the other hand, that is a real DoS we need to care about,
> > > is probably a topic for debate. The directions we've had so far are
> > > that it's not. Nothing is held up indefinitely, what's held up can be
> > > Ctrl-C'd by the user and core mm memory management is not blocked since
> > > mmu_notifiers can execute to completion and shrinkers / eviction can
> > > execute while a page-fault is pending.
> >
> > The problem is that a program that uses a page-faulting job can lock out
> > all other programs on the system from using the GPU for an indefinite
> > period of time. In a GUI session, this means a frozen UI, which makes
> > recovery basically impossible without drastic measures (like rebooting
> > or logging in over SSH). That counts as a quite effective denial of
> > service from an end-user perspective, and unless I am mistaken it would
> > be very easy to trigger by accident: just start a page-faulting job that
> > loops forever.
>
> I think the easiest remedy for this is that if a page-faulting job is
> either by purpose or mistake crafted in such a way that it holds up
> preemption when preemption is needed (like in the case I described, a
> dma-fence job is submitted) the driver will hit a preemption timeout
> and kill the pagefaulting job. (I think that is already handled in all
> cases in the xe driver but I would need to double check). So this would
> then boil down to the system administrator configuring the preemption
> timeout.

That makes sense! That turns a DoS into "Don't submit pagefaulting jobs
on an interactive system, they won't be reliable."
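
To make the behaviour discussed in this thread concrete, here is a toy
model of the arbitration: a shared engine group runs either page-faulting
(LR) jobs or dma-fence jobs, a submission of the other type blocks until
the running jobs have drained, and a job that does not yield within the
preemption timeout is killed. This is only a sketch. Every name in it
(Mode, Job, EngineGroup, yields_after, preempt_timeout) is hypothetical
and is not taken from the xe driver, time is counted in abstract ticks,
and a single timeout is applied in both directions for brevity, which the
real driver may well not do.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    DMA_FENCE = "dma-fence"
    FAULT = "page-faulting (LR)"


@dataclass
class Job:
    name: str
    mode: Mode
    # Ticks until the job completes or yields to a preemption request.
    # None models a job that never yields, e.g. one that loops forever.
    yields_after: Optional[int] = 0
    killed: bool = False


@dataclass
class EngineGroup:
    """Hypothetical stand-in for a hw engine group shared by both job types."""
    preempt_timeout: int  # ticks; the knob the administrator would configure
    mode: Mode = Mode.DMA_FENCE
    running: List[Job] = field(default_factory=list)

    def submit(self, job: Job) -> int:
        """Model of a blocking exec: returns how many ticks the caller waited."""
        waited = 0
        # Only one job type may own the group at a time, so a job of the
        # other type must first wait for everything running to drain.
        if job.mode is not self.mode and self.running:
            waited = self._drain()
        self.mode = job.mode
        self.running.append(job)
        return waited

    def _drain(self) -> int:
        """Wait for running jobs to yield; kill those that exceed the timeout."""
        waited = 0
        for job in self.running:
            if job.yields_after is None or job.yields_after > self.preempt_timeout:
                job.killed = True
                waited = max(waited, self.preempt_timeout)
            else:
                waited = max(waited, job.yields_after)
        self.running.clear()
        return waited


# A page-faulting job that never yields, followed by a dma-fence job
# (think of a compositor) that needs the same engine group.
group = EngineGroup(preempt_timeout=5)
spinner = Job("spinner", Mode.FAULT, yields_after=None)
group.submit(spinner)
compositor = Job("compositor", Mode.DMA_FENCE)
waited = group.submit(compositor)
```

In this model the dma-fence submitter waits at most preempt_timeout ticks
and the looping page-faulting job is the one that gets killed, which is
why such jobs would not be reliable on an interactive system.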
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab