From mboxrd@z Thu Jan 1 00:00:00 1970 From: Thomas Hellstrom Subject: Re: buggy/weird behavior in ttm Date: Thu, 11 Oct 2012 21:26:08 +0200 Message-ID: <50771D50.5000104@vmware.com> References: <5076DCAD.6070606@canonical.com> <5076FA78.9090507@vmware.com> <50771307.3080807@canonical.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: Received: from smtp-outbound-1.vmware.com (smtp-outbound-1.vmware.com [208.91.2.12]) by gabe.freedesktop.org (Postfix) with ESMTP id F38E6A08D0 for ; Thu, 11 Oct 2012 12:26:12 -0700 (PDT) In-Reply-To: <50771307.3080807@canonical.com> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org Errors-To: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org To: Maarten Lankhorst Cc: dri-devel@lists.freedesktop.org List-Id: dri-devel@lists.freedesktop.org On 10/11/2012 08:42 PM, Maarten Lankhorst wrote: > >> Anyway, if you plan to remove the fence lock and protect it with reserve, you must >> make sure that a waiting reserve is never done in a destruction path. I think this >> mostly concerns the nvidia driver. > Well I don't think any lock should ever be held during destruction time, What I mean is, that *something* needs to protect the fence pointer. Currently it's the fence lock, and I was assuming you'd protect it with reserve. And neither TTM nor Nvidia should, when a resource is about to be freed, be forced to *block* waiting for reserve just to access the fence pointer. When and if you have a solution that fulfills those requirements, I'm ready to review it. > >>> - no_wait_reserve is ignored if no_wait_gpu is false >>> ttm_bo_reserve_locked can only return true if no_wait_reserve is true, but >>> subsequently it will do a wait_unreserved if no_wait_gpu is false. >>> I'm planning on removing this argument and act like it is always true, since >>> nothing on the lru list should fail to reserve currently. >> Yes, since all buffers that are reserved are removed from the LRU list, there >> should never be a waiting reserve on them, so no_wait_reserve can be removed >> from ttm_mem_evict_first, ttm_bo_evict and possibly other functions in the call chain. > I suppose there will stay a small race though, Hmm, where? >>> - effectively unlimited callchain between some functions that all go through >>> ttm_mem_evict_first: >>> >>> /------------------------\ >>> ttm_mem_evict_first - ttm_bo_evict - -ttm_bo_mem_space - ttm_bo_mem_force_space - ttm_mem_evict_first >>> \ ttm_bo_handle_move_mem / >>> I'm not surprised that there was a deadlock before, it seems to me it would >>> be pretty suicidal to ever do a blocking reserve on any of those lists, >>> lockdep would be all over you for this. >> Well, at first this may look worse than it actually is. The driver's eviction memory order determines the recursion depth >> and typically it's 0 or 1, since subsequent ttm_mem_evict_first should never touch the same LRU lists as the first one. >> What would typically happen is that a BO is evicted from VRAM to TT, and if there is no space in TT, another BO is evicted >> to system memory, and the chain is terminated. However a driver could set up any eviction order but that would be >> a BUG. >> >> But in essence, as you say, even with a small recursion depth, a waiting reserve could cause a deadlock. >> But there should be no waiting reserves in the eviction path currently. > Partially true, ttm_bo_cleanup_refs is currently capable of blocking reserve. > Fixing this might mean that ttm_mem_evict_first may need to become more aggressive, > since it seems all the callers of this function assume that ttm_mem_evict_first can only fail > if there is really nothing more to free and blocking nested would really upset lockdep > and leave you open to the same deadlocks. I can't see how the waiting reserve in ttm_bo_cleanup_refs would cause a deadlock, because the buffer about to be reserved is always *last* in a reservation sequence, and the reservation is always released (or the buffer destroyed) before trying to reserve another buffer. Technically the buffer is not looked up from a LRU list but from the delayed delete list. Could you describe such a deadlock case? (There is a bug in there, though that the reservation is not released if the buffer is no longer on the reservation list here): if (unlikely(ret != 0) || list_empty(&bo->ddestroy)) { spin_unlock(&glob->lru_lock); return ret; } > > An unexpected bonus from adding this skipping would be that the atomic lru_list > removal on unreserve guarantee becomes a lot easier to drop while keeping behavior > exactly the same. Only the swapout code would need to be adjusted for to try the whole > list. Maybe ttm_bo_delayed_delete with remove_all = false too would change behavior > slightly too, but this seems to be less likely, as it could only ever happen on delayed > destruction of imported dma-buf's. May I kindly remind you that the atomic lru_list removal stays in TTM until we've had a high level design discussion weighing it against an extended reservation object API. > > Not true, ttm_bo_move_memcpy has no interruptible parameter, so it's not compatible with driver.move. > The drivers that do use this function have set up their own alias that calls that function instead, > so I was thinking of dropping those parameters from memcpy since they're unused and the drivers > that could point their move member to it don't do it, so no point in having compatibility... OK, then I'm OK with parameter changes in those functions. > > ~Maarten >