All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: "Simona Vetter" <simona.vetter@ffwll.ch>,
	"Christian König" <christian.koenig@amd.com>
Cc: Matthew Brost <matthew.brost@intel.com>, dri-devel@lists.freedesktop.org
Subject: Re: [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header
Date: Wed, 02 Oct 2024 13:30:30 +0200	[thread overview]
Message-ID: <4966db2ca3009967de47af018c3e10a30dbacd08.camel@linux.intel.com> (raw)
In-Reply-To: <4c634e5c1bd9907f315d8b3535ebb6154819d5ea.camel@linux.intel.com>

Hi, Christian,

Ping? Can i get an ack to proceed with this?

Thanks,
Thomas



On Wed, 2024-09-18 at 14:57 +0200, Thomas Hellström wrote:
> Sima, Christian
> 
> I've updated the shrinker series now with a guarded for_each macro
> instead:
> 
> https://patchwork.freedesktop.org/patch/614514/?series=131815&rev=9
> 
> (Note I forgot to remove the export of the previous LRU walker).
> 
>  so the midlayer argument is now not an issue anymore. The cleanup.h
> guard provides some additional protection against drivers exiting the
> LRU loop early.
> 
> So remaining is the question whether the driver is allowed to discard
> a
> suggested bo to shrink from TTM.
> 
> Arguments for:
> 
> 1) Not allowing that would require teaching TTM about purgeable
> objects.
> 2) Devices who need the blitter during shrinking would want to punt
> runtime_pm_get() to kswapd to avoid sleeping direct reclaim.
> 3) If those devices end up blitting (LNL) to be able to shrink, they
> would want to punt waiting for the fence to signal to kswapd to avoid
> waiting in direct reclaim.
> 4) It looks like we need to resort to folio_trylock in the shmem
> backup
> backend when shrinking is called for gfp_t = GFP_NOFS. A failing
> trylock will require a new bo.
> 
> Arguments against:
> None really. I thought the idea of demidlayering would be to allow
> the
> driver more freedom.
> 
> So any feedback appreciated. If that is found acceptable we can
> proceed
> with reviewing this patch and also with the shrinker series.
> 
> Thanks,
> Thomas
> 
> 
> On Mon, 2024-09-02 at 13:07 +0200, Daniel Vetter wrote:
> > On Wed, Aug 28, 2024 at 02:20:34PM +0200, Christian König wrote:
> > > Am 27.08.24 um 19:53 schrieb Daniel Vetter:
> > > > On Tue, Aug 27, 2024 at 06:52:13PM +0200, Daniel Vetter wrote:
> > > > > On Thu, Aug 22, 2024 at 03:19:29PM +0200, Christian König
> > > > > wrote:
> > > > > > Completely agree that this is complicated, but I still
> > > > > > don't
> > > > > > see the need
> > > > > > for it.
> > > > > > 
> > > > > > Drivers just need to use pm_runtime_get_if_in_use() inside
> > > > > > the shrinker and
> > > > > > postpone all hw activity until resume.
> > > > > Not good enough, at least long term I think. Also postponing
> > > > > hw
> > > > > activity
> > > > > to resume doesn't solve the deadlock issue, if you still need
> > > > > to grab ttm
> > > > > locks on resume.
> > > > Pondered this specific aspect some more, and I think you still
> > > > have a race
> > > > here (even if you avoid the deadlock): If the condiditional
> > > > rpm_get call
> > > > fails there's no guarantee that the device will suspend/resume
> > > > and clean
> > > > up the GART mapping.
> > > 
> > > Well I think we have a major disconnect here. When the device is
> > > powered
> > > down there is no GART mapping to clean up any more.
> > > 
> > > In other words GART is a table in local memory (VRAM) when the
> > > device is
> > > powered down this table is completely destroyed. Any BO which was
> > > mapped
> > > inside this table is now not mapped any more.
> > > 
> > > So when the shrinker wants to evict a BO which is marked as
> > > mapped
> > > to GART
> > > and the device is powered down we just skip the GART unmapping
> > > part
> > > because
> > > that has already implicitly happened during power down.
> > > 
> > > Before mapping any BO into the GART again we power the GPU up
> > > through the
> > > runtime PM calls. And while powering it up again the GART is
> > > restored.
> > 
> > My point is that you can't tell whether the device will power down
> > or
> > not,
> > you can only tell whether there's a chance it might be powering
> > down
> > and
> > so you can't get at the rpm reference without deadlock issues.
> > 
> > > > The race gets a bit smaller if you use
> > > > pm_runtime_get_if_active(), but even then you might catch it
> > > > right when
> > > > resume almost finished.
> > > 
> > > What race are you talking about?
> > > 
> > > The worst thing which could happen is that we restore a GART
> > > entry
> > > which
> > > isn't needed any more, but that is pretty much irrelevant since
> > > we
> > > only
> > > clear them to avoid some hw bugs.
> > 
> > The race I'm seeing is where you thought the GART entry is not
> > issue,
> > tossed an object, but the device didn't suspend, so might still use
> > it.
> > 
> > I guess if we're clearly separating the sw allocation of the TTM_TT
> > with
> > the physical entries in the GART that should all work, but feels a
> > bit
> > tricky. The race I've seen is essentially these two getting out of
> > sync.
> > 
> > So maybe it was me who's stuck.
> > 
> > What I wonder is whether it works in practice, since on the restore
> > side
> > you need to get some locks to figure out which gart mappings exist
> > and
> > need restoring. And that's the same locks as the shrinker needs to
> > figure
> > out whether it might need to reap a gart mapping.
> > 
> > Or do you just copy the gart entries over and restore them exactly
> > as-is,
> > so that there's no shared locks?
> > 
> > > > That means we'll have ttm bo hanging around with GART
> > > > allocations/mappings
> > > > which aren't actually valid anymore (since they might escape
> > > > the
> > > > cleanup
> > > > upon resume due to the race). That doesn't feel like a solid
> > > > design
> > > > either.
> > > 
> > > I'm most likely missing something, but I'm really scratching my
> > > head where
> > > you see a problem here.
> > 
> > I guess one issue is that at least traditionally, igfx drivers have
> > nested
> > runtime pm within dma_resv lock. And dgpu drivers the other way
> > round.
> > Which is a bit awkward if you're trying for common code.
> > 
> > Cheers, Sima
> 


  reply	other threads:[~2024-10-02 11:30 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-10 12:42 Using drm_exec in TTM Christian König
2024-07-10 12:42 ` [PATCH 1/7] dma-buf/dma-resv: Introduce dma_resv_trylock_ctx() Christian König
2024-07-10 12:42 ` [PATCH 2/7] drm/exec: don't immediately add the prelocked obj Christian König
2024-07-10 12:42 ` [PATCH 3/7] drm/exec: provide trylock interface for eviction Christian König
2024-07-10 12:42 ` [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header Christian König
2024-07-10 18:19   ` Matthew Brost
2024-07-11 12:01     ` Christian König
2024-07-11 16:00       ` Matthew Brost
2024-08-06  8:29       ` Thomas Hellström
2024-08-19 11:03         ` Christian König
2024-08-19 11:38           ` Thomas Hellström
2024-08-19 14:14             ` Daniel Vetter
2024-08-19 15:26               ` Christian König
2024-08-20 10:37                 ` Thomas Hellström
2024-08-20 15:45                   ` Christian König
2024-08-20 16:00                     ` Thomas Hellström
2024-08-21  8:14                       ` Christian König
2024-08-21  8:57                         ` Thomas Hellström
2024-08-21  9:31                           ` Thomas Hellström
2024-08-21  9:48                           ` Christian König
2024-08-21 12:00                             ` Thomas Hellström
2024-08-22  6:47                               ` Thomas Hellström
2024-08-22  7:55                                 ` Christian König
2024-08-22  8:21                                   ` Thomas Hellström
2024-08-22  8:36                                     ` Thomas Hellström
2024-08-22  9:29                                     ` Christian König
2024-08-22 13:16                                       ` Thomas Hellström
2024-08-27 16:58                                       ` Daniel Vetter
2024-08-22  9:23                         ` Daniel Vetter
2024-08-22 13:19                           ` Christian König
2024-08-27 16:52                             ` Daniel Vetter
2024-08-27 17:53                               ` Daniel Vetter
2024-08-28 12:20                                 ` Christian König
2024-08-28 14:05                                   ` Thomas Hellström
2024-08-28 15:25                                     ` Christian König
2024-08-28 15:35                                       ` Alex Deucher
2024-08-28 15:45                                       ` Thomas Hellström
2024-09-02 11:07                                   ` Daniel Vetter
2024-09-18 12:57                                     ` Thomas Hellström
2024-10-02 11:30                                       ` Thomas Hellström [this message]
2024-10-02 11:32                                         ` Christian König
2024-10-02 11:36                                           ` Thomas Hellström
2024-10-07  9:08                                       ` Christian König
2024-10-09 13:39                                         ` Thomas Hellström
2024-10-09 14:17                                           ` Thomas Hellström
2024-10-10  8:00                                             ` Christian König
2024-10-11 16:52                                               ` Thomas Hellström
2024-08-27 18:24                               ` Alex Deucher
2024-09-02 11:00                                 ` Daniel Vetter
2024-08-21  7:02                     ` Thomas Hellström
2024-07-10 12:42 ` [PATCH 5/7] drm/ttm: move needs_unlock into the walk Christian König
2024-07-10 12:43 ` [PATCH 6/7] drm/ttm: support using drm_exec during eviction v2 Christian König
2024-07-10 12:43 ` [PATCH 7/7] drm/amdgpu: use drm_exec during BO validation Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4966db2ca3009967de47af018c3e10a30dbacd08.camel@linux.intel.com \
    --to=thomas.hellstrom@linux.intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=matthew.brost@intel.com \
    --cc=simona.vetter@ffwll.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.