* [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete @ 2010-01-18 18:47 Luca Barbieri 2010-01-18 19:40 ` Thomas Hellstrom 0 siblings, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-18 18:47 UTC (permalink / raw) To: airlied-cv59FeDIM0c Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Luca Barbieri ttm_bo_delayed_delete has a race condition, because after we do: kref_put(&nentry->list_kref, ttm_bo_release_list); we are not holding the list lock and not holding any reference to objects, and thus every bo in the list can be removed and freed at this point. However, we then use the next pointer we stored, which is not guaranteed to be valid. This was apparently the cause of some Nouveau oopses I experienced. This patch rewrites the function so that it keeps the reference to nentry until nentry itself is freed and we already got a reference to nentry->next. It should now be correct and free of races, but please double check this. Signed-off-by: Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> --- drivers/gpu/drm/ttm/ttm_bo.c | 58 +++++++++++++++++------------------------ 1 files changed, 24 insertions(+), 34 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 2920f9a..1daa2f1 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -523,52 +523,42 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all) static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all) { struct ttm_bo_global *glob = bdev->glob; - struct ttm_buffer_object *entry, *nentry; - struct list_head *list, *next; - int ret; + struct ttm_buffer_object *entry; + int ret = 0; spin_lock(&glob->lru_lock); - list_for_each_safe(list, next, &bdev->ddestroy) { - entry = list_entry(list, struct ttm_buffer_object, ddestroy); - nentry = NULL; + if (list_empty(&bdev->ddestroy)) { + spin_unlock(&glob->lru_lock); + return 0; + } - /* - * Protect the next list entry from destruction while we - * unlock the lru_lock. - */ + entry = list_first_entry(&bdev->ddestroy, + struct ttm_buffer_object, ddestroy); + kref_get(&entry->list_kref); - if (next != &bdev->ddestroy) { - nentry = list_entry(next, struct ttm_buffer_object, - ddestroy); + for (;;) { + struct ttm_buffer_object *nentry = NULL; + + if (!list_empty(&entry->ddestroy) + && entry->ddestroy.next != &bdev->ddestroy) { + nentry = list_entry(entry->ddestroy.next, + struct ttm_buffer_object, ddestroy); kref_get(&nentry->list_kref); } - kref_get(&entry->list_kref); spin_unlock(&glob->lru_lock); ret = ttm_bo_cleanup_refs(entry, remove_all); kref_put(&entry->list_kref, ttm_bo_release_list); + entry = nentry; - spin_lock(&glob->lru_lock); - if (nentry) { - bool next_onlist = !list_empty(next); - spin_unlock(&glob->lru_lock); - kref_put(&nentry->list_kref, ttm_bo_release_list); - spin_lock(&glob->lru_lock); - /* - * Someone might have raced us and removed the - * next entry from the list. We don't bother restarting - * list traversal. - */ - - if (!next_onlist) - break; - } - if (ret) + if (ret || !entry) break; + + spin_lock(&glob->lru_lock); } - ret = !list_empty(&bdev->ddestroy); - spin_unlock(&glob->lru_lock); + if (entry) + kref_put(&entry->list_kref, ttm_bo_release_list); return ret; } -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete 2010-01-18 18:47 [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete Luca Barbieri @ 2010-01-18 19:40 ` Thomas Hellstrom [not found] ` <4B54B949.9010906-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-18 19:40 UTC (permalink / raw) To: Luca Barbieri Cc: airlied@linux.ie, nouveau@lists.freedesktop.org, dri-devel@lists.sourceforge.net Luca, Good catch. Some comments inline: Luca Barbieri wrote: > + entry = list_first_entry(&bdev->ddestroy, > + struct ttm_buffer_object, ddestroy); > + kref_get(&entry->list_kref); > > - if (next != &bdev->ddestroy) { > - nentry = list_entry(next, struct ttm_buffer_object, > - ddestroy); > + for (;;) { > + struct ttm_buffer_object *nentry = NULL; > + > + if (!list_empty(&entry->ddestroy) > + && entry->ddestroy.next != &bdev->ddestroy) { > + nentry = list_entry(entry->ddestroy.next, > + struct ttm_buffer_object, ddestroy); > I'm not sure it's considered ok to access the list_head members directly in new code. Would nentry=list_first_entry(&entry->ddestroy, ....) work? > kref_get(&nentry->list_kref); > } > else unlock and break? That would save an unnecessary call to ttm_bo_cleanup_refs. > - kref_get(&entry->list_kref); > > spin_unlock(&glob->lru_lock); > ret = ttm_bo_cleanup_refs(entry, remove_all); > kref_put(&entry->list_kref, ttm_bo_release_list); > + entry = nentry; > Here nentry may have been removed from the list by another process, which would trigger the unnecessary call, mentioned above. /Thomas ------------------------------------------------------------------------------ Throughout its 18-year history, RSA Conference consistently attracts the world's best and brightest in the field, creating opportunities for Conference attendees to learn about information security's most important issues through interactions with peers, luminaries and emerging and established companies. http://p.sf.net/sfu/rsaconf-dev2dev -- ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <4B54B949.9010906-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <4B54B949.9010906-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org> @ 2010-01-18 22:33 ` Luca Barbieri [not found] ` <ff13bc9a1001181433v1694e681l9f7ce9d880132dc3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-18 22:33 UTC (permalink / raw) To: Thomas Hellstrom Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org [-- Attachment #1: Type: text/plain, Size: 951 bytes --] > Would nentry=list_first_entry(&entry->ddestroy, ....) work? Yes, it seems a bit less intuitive, but if that's the accepted practice, let's do that instead. > Here nentry may have been removed from the list by another process, which > would trigger the unnecessary call, mentioned above. You are right. I attached a revised patch. It's only compile tested, but the changes are small and it should hopefully work. Note that in principle we could remove the special-case code for the list head but that would require pretending the list head is actually inside a ttm_buffer_object and adding a flag to not do the unlock/cleanup/put/lock on it, which seems bad. The whole function seems more complicated than needed, but I can't find a way to do it with less code. If we could keep glob->lru_lock while calling ttm_bo_cleanup_refs things would be much simpler, but that would require intrusive changes and may not be possible. [-- Attachment #2: 0001-drm-ttm-Fix-race-condition-in-ttm_bo_delayed_delete-.patch --] [-- Type: text/x-diff, Size: 3346 bytes --] From 68972c220abe3ce19eb046d72fa218646168adc7 Mon Sep 17 00:00:00 2001 From: Luca Barbieri <luca@luca-barbieri.com> Date: Mon, 18 Jan 2010 19:34:53 +0100 Subject: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v2) ttm_bo_delayed_delete has a race condition, because after we do: kref_put(&nentry->list_kref, ttm_bo_release_list); we are not holding the list lock and not holding any reference to objects, and thus every bo in the list can be removed and freed at this point. However, we then use the next pointer we stored, which is not guaranteed to be valid. This was apparently the cause of some Nouveau oopses I experienced. This patch rewrites the function so that it keeps the reference to nentry until nentry itself is freed and we already got a reference to nentry->next. It should now be correct and free of races, but please double check this. Updated according to Thomas Hellstrom's feedback. Signed-off-by: Luca Barbieri <luca@luca-barbieri.com> --- drivers/gpu/drm/ttm/ttm_bo.c | 58 ++++++++++++++++++----------------------- 1 files changed, 26 insertions(+), 32 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 2920f9a..5dfa41f 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -523,52 +523,46 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all) static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all) { struct ttm_bo_global *glob = bdev->glob; - struct ttm_buffer_object *entry, *nentry; - struct list_head *list, *next; - int ret; + struct ttm_buffer_object *entry; + int ret = 0; spin_lock(&glob->lru_lock); - list_for_each_safe(list, next, &bdev->ddestroy) { - entry = list_entry(list, struct ttm_buffer_object, ddestroy); - nentry = NULL; + if (list_empty(&bdev->ddestroy)) { + spin_unlock(&glob->lru_lock); + return 0; + } - /* - * Protect the next list entry from destruction while we - * unlock the lru_lock. - */ + entry = list_first_entry(&bdev->ddestroy, + struct ttm_buffer_object, ddestroy); + kref_get(&entry->list_kref); + + for (;;) { + struct ttm_buffer_object *nentry = NULL; - if (next != &bdev->ddestroy) { - nentry = list_entry(next, struct ttm_buffer_object, - ddestroy); + if (entry->ddestroy.next != &bdev->ddestroy) { + nentry = list_first_entry(&entry->ddestroy, + struct ttm_buffer_object, ddestroy); kref_get(&nentry->list_kref); } - kref_get(&entry->list_kref); spin_unlock(&glob->lru_lock); ret = ttm_bo_cleanup_refs(entry, remove_all); kref_put(&entry->list_kref, ttm_bo_release_list); + entry = nentry; + + if (ret || !entry) + break; spin_lock(&glob->lru_lock); - if (nentry) { - bool next_onlist = !list_empty(next); + + if (list_empty(&entry->ddestroy)) { spin_unlock(&glob->lru_lock); - kref_put(&nentry->list_kref, ttm_bo_release_list); - spin_lock(&glob->lru_lock); - /* - * Someone might have raced us and removed the - * next entry from the list. We don't bother restarting - * list traversal. - */ - - if (!next_onlist) - break; - } - if (ret) break; + } } - ret = !list_empty(&bdev->ddestroy); - spin_unlock(&glob->lru_lock); + if (entry) + kref_put(&entry->list_kref, ttm_bo_release_list); return ret; } -- 1.6.3.3 [-- Attachment #3: Type: text/plain, Size: 181 bytes --] _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau ^ permalink raw reply related [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001181433v1694e681l9f7ce9d880132dc3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001181433v1694e681l9f7ce9d880132dc3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-20 11:28 ` Thomas Hellstrom [not found] ` <4B56E8EE.8090706-4+hqylr40dJg9hUCZPvPmw@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-20 11:28 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Luca Barbieri wrote: >> Would nentry=list_first_entry(&entry->ddestroy, ....) work? >> > Yes, it seems a bit less intuitive, but if that's the accepted > practice, let's do that instead. > > >> Here nentry may have been removed from the list by another process, which >> would trigger the unnecessary call, mentioned above. >> > You are right. > > I attached a revised patch. > It's only compile tested, but the changes are small and it should > hopefully work. > Yes, it looks correct. Although it seems a little unintuitive to enter the loop with the spinlock held, and exit it with the spinlock not held. I've attached yet another patch to have that fixed. Could you take a look at whether it seems OK with you and, in that case, repost it for Dave? > The whole function seems more complicated than needed, but I can't > find a way to do it with less code. If we could keep glob->lru_lock > while calling ttm_bo_cleanup_refs things would be much simpler, but > that would require intrusive changes and may not be possible. > Yes, one of the rules of TTM is to avoid calling any kref_put() function that may free an object with a spinlock or a mutex held, since the free function must be able to take whatever mutex or spinlock it likes, otherwise we'd end up in a complete mess of recursive locks and lock dependency errors. Since therefore ttm_bo_cleanup_refs would need to temporarily release any lock held at call time, we wouldn't be better off. /Thomas ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <4B56E8EE.8090706-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <4B56E8EE.8090706-4+hqylr40dJg9hUCZPvPmw@public.gmane.org> @ 2010-01-20 12:11 ` Thomas Hellstrom 0 siblings, 0 replies; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-20 12:11 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org [-- Attachment #1: Type: text/plain, Size: 362 bytes --] Thomas Hellstrom wrote: > > Yes, it looks correct. Although it seems a little unintuitive to enter > the loop with the spinlock held, and exit it with the spinlock not held. > I've attached yet another patch to have that fixed. Could you take a > look at whether it seems OK with you and, in that case, repost it for Dave? > > ... And the patch. /Thomas [-- Attachment #2: 0001-drm-ttm-Fix-race-condition-in-ttm_bo_delayed_delete-.patch --] [-- Type: text/x-patch, Size: 3410 bytes --] From 68972c220abe3ce19eb046d72fa218646168adc7 Mon Sep 17 00:00:00 2001 From: Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> Date: Mon, 18 Jan 2010 19:34:53 +0100 Subject: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v2) ttm_bo_delayed_delete has a race condition, because after we do: kref_put(&nentry->list_kref, ttm_bo_release_list); we are not holding the list lock and not holding any reference to objects, and thus every bo in the list can be removed and freed at this point. However, we then use the next pointer we stored, which is not guaranteed to be valid. This was apparently the cause of some Nouveau oopses I experienced. This patch rewrites the function so that it keeps the reference to nentry until nentry itself is freed and we already got a reference to nentry->next. It should now be correct and free of races, but please double check this. Updated according to Thomas Hellstrom's feedback. Signed-off-by: Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> --- drivers/gpu/drm/ttm/ttm_bo.c | 58 ++++++++++++++++++----------------------- 1 files changed, 26 insertions(+), 32 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 2920f9a..5dfa41f 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -523,52 +523,46 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all) static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all) { struct ttm_bo_global *glob = bdev->glob; - struct ttm_buffer_object *entry, *nentry; - struct list_head *list, *next; - int ret; + struct ttm_buffer_object *entry; + int ret = 0; spin_lock(&glob->lru_lock); - list_for_each_safe(list, next, &bdev->ddestroy) { - entry = list_entry(list, struct ttm_buffer_object, ddestroy); - nentry = NULL; + if (list_empty(&bdev->ddestroy)) { + spin_unlock(&glob->lru_lock); + return 0; + } - /* - * Protect the next list entry from destruction while we - * unlock the lru_lock. - */ + entry = list_first_entry(&bdev->ddestroy, + struct ttm_buffer_object, ddestroy); + kref_get(&entry->list_kref); + + for (;;) { + struct ttm_buffer_object *nentry = NULL; - if (next != &bdev->ddestroy) { - nentry = list_entry(next, struct ttm_buffer_object, - ddestroy); + if (entry->ddestroy.next != &bdev->ddestroy) { + nentry = list_first_entry(&entry->ddestroy, + struct ttm_buffer_object, ddestroy); kref_get(&nentry->list_kref); } - kref_get(&entry->list_kref); spin_unlock(&glob->lru_lock); ret = ttm_bo_cleanup_refs(entry, remove_all); kref_put(&entry->list_kref, ttm_bo_release_list); + entry = nentry; + + if (ret || !entry) + break; spin_lock(&glob->lru_lock); - if (nentry) { - bool next_onlist = !list_empty(next); + + if (list_empty(&entry->ddestroy)) { spin_unlock(&glob->lru_lock); - kref_put(&nentry->list_kref, ttm_bo_release_list); - spin_lock(&glob->lru_lock); - /* - * Someone might have raced us and removed the - * next entry from the list. We don't bother restarting - * list traversal. - */ - - if (!next_onlist) - break; - } - if (ret) break; + } } - ret = !list_empty(&bdev->ddestroy); - spin_unlock(&glob->lru_lock); + if (entry) + kref_put(&entry->list_kref, ttm_bo_release_list); return ret; } -- 1.6.3.3 [-- Attachment #3: Type: text/plain, Size: 181 bytes --] _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete @ 2010-01-20 12:11 ` Thomas Hellstrom 0 siblings, 0 replies; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-20 12:11 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org [-- Attachment #1: Type: text/plain, Size: 362 bytes --] Thomas Hellstrom wrote: > > Yes, it looks correct. Although it seems a little unintuitive to enter > the loop with the spinlock held, and exit it with the spinlock not held. > I've attached yet another patch to have that fixed. Could you take a > look at whether it seems OK with you and, in that case, repost it for Dave? > > ... And the patch. /Thomas [-- Attachment #2: 0001-drm-ttm-Fix-race-condition-in-ttm_bo_delayed_delete-.patch --] [-- Type: text/x-patch, Size: 3411 bytes --] >From 68972c220abe3ce19eb046d72fa218646168adc7 Mon Sep 17 00:00:00 2001 From: Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> Date: Mon, 18 Jan 2010 19:34:53 +0100 Subject: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v2) ttm_bo_delayed_delete has a race condition, because after we do: kref_put(&nentry->list_kref, ttm_bo_release_list); we are not holding the list lock and not holding any reference to objects, and thus every bo in the list can be removed and freed at this point. However, we then use the next pointer we stored, which is not guaranteed to be valid. This was apparently the cause of some Nouveau oopses I experienced. This patch rewrites the function so that it keeps the reference to nentry until nentry itself is freed and we already got a reference to nentry->next. It should now be correct and free of races, but please double check this. Updated according to Thomas Hellstrom's feedback. Signed-off-by: Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> --- drivers/gpu/drm/ttm/ttm_bo.c | 58 ++++++++++++++++++----------------------- 1 files changed, 26 insertions(+), 32 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 2920f9a..5dfa41f 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -523,52 +523,46 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all) static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all) { struct ttm_bo_global *glob = bdev->glob; - struct ttm_buffer_object *entry, *nentry; - struct list_head *list, *next; - int ret; + struct ttm_buffer_object *entry; + int ret = 0; spin_lock(&glob->lru_lock); - list_for_each_safe(list, next, &bdev->ddestroy) { - entry = list_entry(list, struct ttm_buffer_object, ddestroy); - nentry = NULL; + if (list_empty(&bdev->ddestroy)) { + spin_unlock(&glob->lru_lock); + return 0; + } - /* - * Protect the next list entry from destruction while we - * unlock the lru_lock. - */ + entry = list_first_entry(&bdev->ddestroy, + struct ttm_buffer_object, ddestroy); + kref_get(&entry->list_kref); + + for (;;) { + struct ttm_buffer_object *nentry = NULL; - if (next != &bdev->ddestroy) { - nentry = list_entry(next, struct ttm_buffer_object, - ddestroy); + if (entry->ddestroy.next != &bdev->ddestroy) { + nentry = list_first_entry(&entry->ddestroy, + struct ttm_buffer_object, ddestroy); kref_get(&nentry->list_kref); } - kref_get(&entry->list_kref); spin_unlock(&glob->lru_lock); ret = ttm_bo_cleanup_refs(entry, remove_all); kref_put(&entry->list_kref, ttm_bo_release_list); + entry = nentry; + + if (ret || !entry) + break; spin_lock(&glob->lru_lock); - if (nentry) { - bool next_onlist = !list_empty(next); + + if (list_empty(&entry->ddestroy)) { spin_unlock(&glob->lru_lock); - kref_put(&nentry->list_kref, ttm_bo_release_list); - spin_lock(&glob->lru_lock); - /* - * Someone might have raced us and removed the - * next entry from the list. We don't bother restarting - * list traversal. - */ - - if (!next_onlist) - break; - } - if (ret) break; + } } - ret = !list_empty(&bdev->ddestroy); - spin_unlock(&glob->lru_lock); + if (entry) + kref_put(&entry->list_kref, ttm_bo_release_list); return ret; } -- 1.6.3.3 [-- Attachment #3: Type: text/plain, Size: 181 bytes --] _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau ^ permalink raw reply related [flat|nested] 34+ messages in thread
[parent not found: <4B56F308.5090603-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <4B56F308.5090603-4+hqylr40dJg9hUCZPvPmw@public.gmane.org> @ 2010-01-20 12:16 ` Thomas Hellstrom 0 siblings, 0 replies; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-20 12:16 UTC (permalink / raw) To: Thomas Hellstrom Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Luca Barbieri, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org [-- Attachment #1: Type: text/plain, Size: 510 bytes --] Thomas Hellstrom wrote: > Thomas Hellstrom wrote: > >> Yes, it looks correct. Although it seems a little unintuitive to enter >> the loop with the spinlock held, and exit it with the spinlock not held. >> I've attached yet another patch to have that fixed. Could you take a >> look at whether it seems OK with you and, in that case, repost it for Dave? >> >> >> > ... And the patch. > > /Thomas > > Hmm, It seems I should've stayed in bed today. Hopefully this is the right one... /Thomas [-- Attachment #2: 0001-drm-ttm-Fix-race-condition-in-ttm_bo_delayed_delete.patch --] [-- Type: text/x-patch, Size: 3514 bytes --] From b3dc72cf74d1b8e0eb5f5423fbb0ac975f147f4c Mon Sep 17 00:00:00 2001 From: Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> Date: Mon, 18 Jan 2010 19:34:53 +0100 Subject: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v2) ttm_bo_delayed_delete has a race condition, because after we do: kref_put(&nentry->list_kref, ttm_bo_release_list); we are not holding the list lock and not holding any reference to objects, and thus every bo in the list can be removed and freed at this point. However, we then use the next pointer we stored, which is not guaranteed to be valid. This was apparently the cause of some Nouveau oopses I experienced. This patch rewrites the function so that it keeps the reference to nentry until nentry itself is freed and we already got a reference to nentry->next. It should now be correct and free of races, but please double check this. Updated according to Thomas Hellstrom's feedback. Signed-off-by: Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> Signed-off-by: Thomas Hellstrom <thellstrom-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org> --- drivers/gpu/drm/ttm/ttm_bo.c | 58 ++++++++++++++++++------------------------ 1 files changed, 25 insertions(+), 33 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index c7733c3..1a3e909 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -524,52 +524,44 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all) static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all) { struct ttm_bo_global *glob = bdev->glob; - struct ttm_buffer_object *entry, *nentry; - struct list_head *list, *next; - int ret; + struct ttm_buffer_object *entry = NULL; + int ret = 0; spin_lock(&glob->lru_lock); - list_for_each_safe(list, next, &bdev->ddestroy) { - entry = list_entry(list, struct ttm_buffer_object, ddestroy); - nentry = NULL; + if (list_empty(&bdev->ddestroy)) + goto out_unlock; - /* - * Protect the next list entry from destruction while we - * unlock the lru_lock. - */ + entry = list_first_entry(&bdev->ddestroy, + struct ttm_buffer_object, ddestroy); + kref_get(&entry->list_kref); + + for (;;) { + struct ttm_buffer_object *nentry = NULL; - if (next != &bdev->ddestroy) { - nentry = list_entry(next, struct ttm_buffer_object, - ddestroy); + if (entry->ddestroy.next != &bdev->ddestroy) { + nentry = list_first_entry(&entry->ddestroy, + struct ttm_buffer_object, ddestroy); kref_get(&nentry->list_kref); } - kref_get(&entry->list_kref); spin_unlock(&glob->lru_lock); ret = ttm_bo_cleanup_refs(entry, remove_all); kref_put(&entry->list_kref, ttm_bo_release_list); + entry = nentry; + + if (ret || !entry) + goto out; spin_lock(&glob->lru_lock); - if (nentry) { - bool next_onlist = !list_empty(next); - spin_unlock(&glob->lru_lock); - kref_put(&nentry->list_kref, ttm_bo_release_list); - spin_lock(&glob->lru_lock); - /* - * Someone might have raced us and removed the - * next entry from the list. We don't bother restarting - * list traversal. - */ - - if (!next_onlist) - break; - } - if (ret) + if (list_empty(&entry->ddestroy)) break; } - ret = !list_empty(&bdev->ddestroy); - spin_unlock(&glob->lru_lock); +out_unlock: + spin_unlock(&glob->lru_lock); +out: + if (entry) + kref_put(&entry->list_kref, ttm_bo_release_list); return ret; } -- 1.6.2.5 [-- Attachment #3: Type: text/plain, Size: 181 bytes --] _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete @ 2010-01-20 12:16 ` Thomas Hellstrom 0 siblings, 0 replies; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-20 12:16 UTC (permalink / raw) To: Thomas Hellstrom Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Luca Barbieri, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org [-- Attachment #1: Type: text/plain, Size: 510 bytes --] Thomas Hellstrom wrote: > Thomas Hellstrom wrote: > >> Yes, it looks correct. Although it seems a little unintuitive to enter >> the loop with the spinlock held, and exit it with the spinlock not held. >> I've attached yet another patch to have that fixed. Could you take a >> look at whether it seems OK with you and, in that case, repost it for Dave? >> >> >> > ... And the patch. > > /Thomas > > Hmm, It seems I should've stayed in bed today. Hopefully this is the right one... /Thomas [-- Attachment #2: 0001-drm-ttm-Fix-race-condition-in-ttm_bo_delayed_delete.patch --] [-- Type: text/x-patch, Size: 3515 bytes --] >From b3dc72cf74d1b8e0eb5f5423fbb0ac975f147f4c Mon Sep 17 00:00:00 2001 From: Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> Date: Mon, 18 Jan 2010 19:34:53 +0100 Subject: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v2) ttm_bo_delayed_delete has a race condition, because after we do: kref_put(&nentry->list_kref, ttm_bo_release_list); we are not holding the list lock and not holding any reference to objects, and thus every bo in the list can be removed and freed at this point. However, we then use the next pointer we stored, which is not guaranteed to be valid. This was apparently the cause of some Nouveau oopses I experienced. This patch rewrites the function so that it keeps the reference to nentry until nentry itself is freed and we already got a reference to nentry->next. It should now be correct and free of races, but please double check this. Updated according to Thomas Hellstrom's feedback. Signed-off-by: Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> Signed-off-by: Thomas Hellstrom <thellstrom-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org> --- drivers/gpu/drm/ttm/ttm_bo.c | 58 ++++++++++++++++++------------------------ 1 files changed, 25 insertions(+), 33 deletions(-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index c7733c3..1a3e909 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -524,52 +524,44 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all) static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all) { struct ttm_bo_global *glob = bdev->glob; - struct ttm_buffer_object *entry, *nentry; - struct list_head *list, *next; - int ret; + struct ttm_buffer_object *entry = NULL; + int ret = 0; spin_lock(&glob->lru_lock); - list_for_each_safe(list, next, &bdev->ddestroy) { - entry = list_entry(list, struct ttm_buffer_object, ddestroy); - nentry = NULL; + if (list_empty(&bdev->ddestroy)) + goto out_unlock; - /* - * Protect the next list entry from destruction while we - * unlock the lru_lock. - */ + entry = list_first_entry(&bdev->ddestroy, + struct ttm_buffer_object, ddestroy); + kref_get(&entry->list_kref); + + for (;;) { + struct ttm_buffer_object *nentry = NULL; - if (next != &bdev->ddestroy) { - nentry = list_entry(next, struct ttm_buffer_object, - ddestroy); + if (entry->ddestroy.next != &bdev->ddestroy) { + nentry = list_first_entry(&entry->ddestroy, + struct ttm_buffer_object, ddestroy); kref_get(&nentry->list_kref); } - kref_get(&entry->list_kref); spin_unlock(&glob->lru_lock); ret = ttm_bo_cleanup_refs(entry, remove_all); kref_put(&entry->list_kref, ttm_bo_release_list); + entry = nentry; + + if (ret || !entry) + goto out; spin_lock(&glob->lru_lock); - if (nentry) { - bool next_onlist = !list_empty(next); - spin_unlock(&glob->lru_lock); - kref_put(&nentry->list_kref, ttm_bo_release_list); - spin_lock(&glob->lru_lock); - /* - * Someone might have raced us and removed the - * next entry from the list. We don't bother restarting - * list traversal. - */ - - if (!next_onlist) - break; - } - if (ret) + if (list_empty(&entry->ddestroy)) break; } - ret = !list_empty(&bdev->ddestroy); - spin_unlock(&glob->lru_lock); +out_unlock: + spin_unlock(&glob->lru_lock); +out: + if (entry) + kref_put(&entry->list_kref, ttm_bo_release_list); return ret; } -- 1.6.2.5 [-- Attachment #3: Type: text/plain, Size: 181 bytes --] _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau ^ permalink raw reply related [flat|nested] 34+ messages in thread
[parent not found: <4B56F401.8070700-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <4B56F401.8070700-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org> @ 2010-01-20 19:22 ` Luca Barbieri [not found] ` <ff13bc9a1001201122y110fb003k704bc6d05d2aea07-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-20 19:22 UTC (permalink / raw) To: Thomas Hellstrom Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom Yes it's fine. I sent your patch to Dave with an expanded commit comment for merging. Here is a possible redesign of the mechanism inspired by this issue. It seems that what we are racing against is buffer eviction, due to delayed deletion buffers being still kept on the LRU list. I'm wondering if the delayed deletion design could be changed as follows: 1. Remove to-be-deleted buffers from the LRU list before putting them on delayed delete 2. Change buffer eviction to first do a delayed deletion pass. This should be cheap (and cheaper than the current code) because delayed deletion stops at the first unsignaled fence. 3. Add a new "delayed deletion" lock/semaphore. Then, have ttm_bo_delayed_delete take it for reading and keep it across the function. 4. Inside the delayed deletion lock, grab the LRU lock, copy the delayed delete list head to a local variable, set it to empty and release the lock. 5. Iterate on the privately held list with list_for_each_safe 6. At the end of ttm_bo_delayed_delete, retake the LRU lock and readd the remaining part of our private list at the head of the global list This would prevent uselessly trying to evict delayed-delete buffers (which should be processed in fence order and not LRU order), and also prevent concurrent delayed deletions, which should be more harmful than useful. Furthermore, it should be possible to get rid of list locking in the following way: 1. BOs to be delayed-deleted are added to the head of the initial delayed deletion single linked list, using atomic cmpxchg 2. ttm_bo_delayed_delete takes the delayed deletion lock and grabs the list with an atomic xchg of the head with zero 3. It reverses the list in place, processes the entries and puts them at the end of a second single linked list, the recurring delayed deletion list 4. It processes the recurring delayed deletion list, cleaning up the BOs 5. Finally, the delayed deletion lock is released This makes adding to the delayed deletion list lockless. The LRU list instead inherently needs to be doubly linked, so only RCU could make it lockless, and it seems that may require using an external list node structure (so readers don't suddenly jump to the most recent node), and that would not be a win (except with *lots* of CPUs). Besides, most DRM drivers (except vmware) are taking the BKL around all ioctls and (except nouveau) use a single pushbuffer, so this is a bit moot anyway. What do you think? Anyway, this, if done, would be for the next merge window, or later, while the current fix ought to be merged now. ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001201122y110fb003k704bc6d05d2aea07-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001201122y110fb003k704bc6d05d2aea07-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-20 20:16 ` Thomas Hellstrom [not found] ` <4B5764BA.7080801-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-20 20:16 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom Luca Barbieri wrote: > Yes it's fine. I sent your patch to Dave with an expanded commit > comment for merging. > > Here is a possible redesign of the mechanism inspired by this issue. > It seems that what we are racing against is buffer eviction, due to > delayed deletion buffers being still kept on the LRU list. > > I'm wondering if the delayed deletion design could be changed as follows: > 1. Remove to-be-deleted buffers from the LRU list before putting them > on delayed delete > 2. Change buffer eviction to first do a delayed deletion pass. This > should be cheap (and cheaper than the current code) because delayed > deletion stops at the first unsignaled fence. > 3. Add a new "delayed deletion" lock/semaphore. Then, have > ttm_bo_delayed_delete take it for reading and keep it across the > function. > 4. Inside the delayed deletion lock, grab the LRU lock, copy the > delayed delete list head to a local variable, set it to empty and > release the lock. > 5. Iterate on the privately held list with list_for_each_safe > 6. At the end of ttm_bo_delayed_delete, retake the LRU lock and readd > the remaining part of our private list at the head of the global list > > This would prevent uselessly trying to evict delayed-delete buffers > (which should be processed in fence order and not LRU order), and also > prevent concurrent delayed deletions, which should be more harmful > than useful. > Yes. Note that there is a problem with this. If we need to find space for a buffer, we need to be able to guarantee that we've evicted *all* evictable buffers. Even those on the delayed delete list. Otherwise we have no means of guaranteeing a certain available GPU memory space to user-space, so that user-space can predict when to flush. If we remove delayed-delete buffers from the LRU lists, we'd need to empty the delayed-delete list to guarantee that, and we may end up waiting for buffers that aren't even in the relevant memory region. Also note that the delayed delete list is not in fence order but in deletion-time order, which perhaps gives room for more optimizations. > Furthermore, it should be possible to get rid of list locking in the > following way: > 1. BOs to be delayed-deleted are added to the head of the initial > delayed deletion single linked list, using atomic cmpxchg > 2. ttm_bo_delayed_delete takes the delayed deletion lock and grabs the > list with an atomic xchg of the head with zero > 3. It reverses the list in place, processes the entries and puts them > at the end of a second single linked list, the recurring delayed > deletion list > 4. It processes the recurring delayed deletion list, cleaning up the BOs > 5. Finally, the delayed deletion lock is released > > This makes adding to the delayed deletion list lockless. > But isn't an atomic cmpxchg about as costly as a spinlock? > The LRU list instead inherently needs to be doubly linked, so only RCU > could make it lockless, and it seems that may require using an > external list node structure (so readers don't suddenly jump to the > most recent node), and that would not be a win (except with *lots* of > CPUs). > Besides, most DRM drivers (except vmware) are taking the BKL around > all ioctls and (except nouveau) use a single pushbuffer, so this is a > bit moot anyway. > > What do you think? > In general, the more locks we can get rid of, the better, so I'm for that. Note, however, that reserving a buffer and simultaneously taking it off lru lists need to be an atomic operation. The lru lock is currently used to assure that, and I guess we need to preserve that in some way. /Thomas ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <4B5764BA.7080801-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <4B5764BA.7080801-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org> @ 2010-01-20 20:45 ` Luca Barbieri [not found] ` <ff13bc9a1001201245g6ee25219q851b7989968f4c7b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-20 20:45 UTC (permalink / raw) To: Thomas Hellstrom Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom > Also note that the delayed delete list is not in fence order but in > deletion-time order, which perhaps gives room for more optimizations. You are right. I think then that ttm_bo_delayed_delete may still need to be changed, because it stops when ttm_bo_cleanup_refs returns -EBUSY, which happens when a fence has not been reached. This means that a buffer will need to wait for all previously deleted buffers to become unused, even if it is unused itself. Is this acceptable? What if we get rid of the delayed destroy list, and instead append buffers to be deleted to their fence object, and delete them when the fence is signaled? This also allows to do it more naturally, since the fence object can just keep a normal reference to the buffers it fences, and unreference them on expiration. Then there needs to be no special "delayed destruction" logic, and it would work as if the GPU were keeping a reference to the buffer itself, using fences as a proxy to have the CPU do that work for the GPU. Then the delayed work is no longer "periodically destroy buffers" but rather "periodically check if fences are expired", naturally stopping at the first unexpired one. Drivers that support IRQs on fences could also do the work in the interrupt handler/tasklet instead, avoid the delay jiffies magic number. This may need a NAPI-like interrupt mitigation middle layer for optimal results though. > But isn't an atomic cmpxchg about as costly as a spinlock? I think it's cheaper on all architectures, as otherwise it would be mostly pointless to have it, since you can emulate it with a spinlock. ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001201245g6ee25219q851b7989968f4c7b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001201245g6ee25219q851b7989968f4c7b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-20 20:58 ` Luca Barbieri [not found] ` <ff13bc9a1001201258i37ee7354gb305aa98afae3716-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2010-01-20 21:04 ` Thomas Hellstrom 1 sibling, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-20 20:58 UTC (permalink / raw) To: Thomas Hellstrom Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom When designing this, we should also keep in mind that some drivers (e.g. nouveau) have multiple FIFO channels, and thus we would like a buffer to be referenced for reading by multiple channels at once (and be destroyed only when all fences are expired, obviously). Also, hardware may support on-GPU inter-channel synchronization, and then multiple references may be for writing too. If we use an external dynamically allocated channel/buffer list node, we can support this (if the kernel allocators aren't fast enough, which they should be, we can just keep released ones linked to the bo to speed allocations). Note that in nouveau there is a small hardware limit to channels (up to 128 on nv50), but future hardware may possibly support unlimited channels. ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001201258i37ee7354gb305aa98afae3716-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001201258i37ee7354gb305aa98afae3716-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-20 21:11 ` Thomas Hellstrom 0 siblings, 0 replies; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-20 21:11 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Luca Barbieri wrote: > When designing this, we should also keep in mind that some drivers > (e.g. nouveau) have multiple FIFO channels, and thus we would like a > buffer to be referenced for reading by multiple channels at once (and > be destroyed only when all fences are expired, obviously). > Also, hardware may support on-GPU inter-channel synchronization, and > then multiple references may be for writing too. > In the context of the current code, I've been thinking of having a list of fences on each bo to support multiple readers, and also to try to deal with the problem of simultaneous GPU- and CPU readers. But if the hardware supports on-GPU inter-channel synchronization, I figure the code should be smart enough to only wait on the "last" write fence? /Thomas ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001201245g6ee25219q851b7989968f4c7b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2010-01-20 20:58 ` Luca Barbieri @ 2010-01-20 21:04 ` Thomas Hellstrom [not found] ` <4B576FF5.9040907-4+hqylr40dJg9hUCZPvPmw@public.gmane.org> 1 sibling, 1 reply; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-20 21:04 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Luca Barbieri wrote: >> Also note that the delayed delete list is not in fence order but in >> deletion-time order, which perhaps gives room for more optimizations. >> > You are right. > I think then that ttm_bo_delayed_delete may still need to be changed, > because it stops when ttm_bo_cleanup_refs returns -EBUSY, which > happens when a fence has not been reached. > This means that a buffer will need to wait for all previously deleted > buffers to become unused, even if it is unused itself. > Is this acceptable? > Yes, I think it's acceptable if you view it in the context that the most important buffer resources (GPU memory space and physical system memory) are immediately reclaimable through the eviction- and swapping mechanisms. > What if we get rid of the delayed destroy list, and instead append > buffers to be deleted to their fence object, and delete them when the > fence is signaled? > > This also allows to do it more naturally, since the fence object can > just keep a normal reference to the buffers it fences, and unreference > them on expiration. > > Then there needs to be no special "delayed destruction" logic, and it > would work as if the GPU were keeping a reference to the buffer > itself, using fences as a proxy to have the CPU do that work for the > GPU. > > Then the delayed work is no longer "periodically destroy buffers" but > rather "periodically check if fences are expired", naturally stopping > at the first unexpired one. > Drivers that support IRQs on fences could also do the work in the > interrupt handler/tasklet instead, avoid the delay jiffies magic > number. This may need a NAPI-like interrupt mitigation middle layer > for optimal results though. > > Yes, I think that this way, it should definitely be possible to find a more optimal solution. One should keep in mind, however, that we'll probably not able to destroy buffers from within an atomic context, which means we have to schedule a workqueue to do that task. We had to do a similar thing in the Poulsbo driver and it turned out that we could save a significant amount of CPU by using a delayed workqueue, collecting objects and destroying them periodically. /Thomas ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <4B576FF5.9040907-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <4B576FF5.9040907-4+hqylr40dJg9hUCZPvPmw@public.gmane.org> @ 2010-01-21 3:49 ` Luca Barbieri [not found] ` <ff13bc9a1001201949l4691f202v2a2874b9cef86f37-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-21 3:49 UTC (permalink / raw) To: Thomas Hellstrom Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org > We had to do a similar thing in the > Poulsbo driver and it turned out that we could save a significant amount of > CPU by using a delayed workqueue, collecting objects and destroying them > periodically. Yes, indeed, we don't really care about a fence expiring unless we want to use that buffer or we are out of memory. Processing each fence, especially if there is an interrupt per fence, seems unnecessary, since you can just do that work in the cases mentioned above. So, here is a new proposal that should have the best features of all previously considered methods. Assume first that there is a single FIFO channel. We introduce a list of fence nodes, each containing: - A list of buffers NOT to be destroyed having the fence as their sync object (or possibly one for each memory type) - A delayed-destroy list of buffers having the fence as their sync object Fence nodes are added at the end upon emission. They are removed and freed when the buffer count drops to 0 (meaning all buffers have been moved to later fences). Thus, the number of fence nodes does not grow without bounds, but is bounded by the number of buffers. . The LRU lists are changed to only contain unfenced buffers and buffers are initially put on it. When a buffer is fenced, it is removed from the LRU list or its current fence list, and added to the new fence (possibly destroying the old fence node in the process). The reference count of buffer objects is only increased when they are put in a delayed destruction list. Upon deletion, they are destroyed immediately if they are on the LRU list and moved to the corresponding delayed-delete list if they are in the fenced list. ttm_bo_delayed_delete is no longer called by any workqueue, but only on when we run out of memory, before we try eviction. It processes the list until the first unsignalled fence, destroying all buffers it finds and freeing all the fence nodes. All the buffers in the alive lists are put in the LRU list, in the correct order. We may either keep an alive list per memory type, or sort the buffers by memory type (so they are put in the memory type LRU list) at this point Thus, after ttm_bo_delayed_delete, there is the following scenario: 1. Inactive buffers with no references are destroyed 2. Inactive buffers with references are in the LRU list 3. Fenced buffers with no references are in the delayed-destroy list attached to their fence 4. Fenced buffers with references are in the alive list attached to their fence This should have about the same CPU and memory usage of the current code, but the following advantages: 1. Delayed deletion does not run periodically, but exactly when needed (at allocation time, before eviction) 2. Delayed deletion always deletes all buffers that can be deleted, since it processes them in fence order 3. The LRU list at eviction time contains exactly all evictable buffers 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so only one node is necessary 5. Once buffer deletion determines a fence is signalled, it can destroyed all its to-be-destroyed buffers without any more checking 6. Some fence logic (such as signalling of all fences on channel forced closing) can be moved from drivers to TTM 7. Some channel logic can be moved from drivers to TTM 8. The code should be simpler and more elegant Note that using a buffer with the CPU doesn't change its position in the LRU list. This is good, because evicting a buffer used by the CPU is actually a good thing, since it will move to a region of memory slower for the GPU but possibly faster for the CPU. However, for buffers in system memory, if the LRU list is used for swapping, CPU access should move the buffer to the front of list (using the LRU list for this is probably a bad idea though, since system memory swapping should be at page granularity). For multiple channels, things are slightly more complex. First, we need to built the fence data structure for each channel. Also, we need the fence/buffer nodes to be separate Delayed destruction continues to work, but the reference count needs to be set to the number of channels fencing the buffer on destruction. Putting buffers in the LRU list can be done by doing n-way merging between the channel fence lists, assigning each fence a global sequence number, and using that to merge and put back buffers in the LRU. n-way merging can be done with a small heap on the stack on current hardware where channels are limited. Furthermore, we need to keep a reference count, so that buffers are put in the LRU (or destroyed) only after they are off all the channels. The LRU list semantics are relaxed. If a buffer has both its fence emitted before another buffer, and also signaled before it, then it will be considered as "less recently" used, and the opposite thing happens if both are after. Otherwise, the order is undetermined. This should still guarantee good eviction behavior, and allows to have the LRU list only contain evictable buffers, while not requiring to use a dynamic priority queue. Otherwise, a red-black tree (or an indirect heap with back references) indexed by the global fence sequence number can guarantee strict emission-time LRU semantics. This seems unnecessary though. Thoughts? Opinions? Any possible improvement? ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001201949l4691f202v2a2874b9cef86f37-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001201949l4691f202v2a2874b9cef86f37-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-21 12:29 ` Jerome Glisse [not found] ` <20100121122920.GB3837-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-21 12:53 ` Thomas Hellstrom 1 sibling, 1 reply; 34+ messages in thread From: Jerome Glisse @ 2010-01-21 12:29 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote: > > We had to do a similar thing in the > > Poulsbo driver and it turned out that we could save a significant amount of > > CPU by using a delayed workqueue, collecting objects and destroying them > > periodically. > Yes, indeed, we don't really care about a fence expiring unless we > want to use that buffer or we are out of memory. > Processing each fence, especially if there is an interrupt per fence, > seems unnecessary, since you can just do that work in the cases > mentioned above. > > So, here is a new proposal that should have the best features of all > previously considered methods. > Assume first that there is a single FIFO channel. > > We introduce a list of fence nodes, each containing: > - A list of buffers NOT to be destroyed having the fence as their sync > object (or possibly one for each memory type) > - A delayed-destroy list of buffers having the fence as their sync object > > Fence nodes are added at the end upon emission. > They are removed and freed when the buffer count drops to 0 (meaning > all buffers have been moved to later fences). > Thus, the number of fence nodes does not grow without bounds, but is > bounded by the number of buffers. > . > The LRU lists are changed to only contain unfenced buffers and buffers > are initially put on it. > When a buffer is fenced, it is removed from the LRU list or its > current fence list, and added to the new fence (possibly destroying > the old fence node in the process). > The reference count of buffer objects is only increased when they are > put in a delayed destruction list. > > Upon deletion, they are destroyed immediately if they are on the LRU > list and moved to the corresponding delayed-delete list if they are in > the fenced list. > > ttm_bo_delayed_delete is no longer called by any workqueue, but only > on when we run out of memory, before we try eviction. > It processes the list until the first unsignalled fence, destroying > all buffers it finds and freeing all the fence nodes. > All the buffers in the alive lists are put in the LRU list, in the > correct order. > We may either keep an alive list per memory type, or sort the buffers > by memory type (so they are put in the memory type LRU list) at this > point > > Thus, after ttm_bo_delayed_delete, there is the following scenario: > 1. Inactive buffers with no references are destroyed > 2. Inactive buffers with references are in the LRU list > 3. Fenced buffers with no references are in the delayed-destroy list > attached to their fence > 4. Fenced buffers with references are in the alive list attached to their fence > > This should have about the same CPU and memory usage of the current > code, but the following advantages: > 1. Delayed deletion does not run periodically, but exactly when needed > (at allocation time, before eviction) > 2. Delayed deletion always deletes all buffers that can be deleted, > since it processes them in fence order > 3. The LRU list at eviction time contains exactly all evictable buffers > 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so > only one node is necessary > 5. Once buffer deletion determines a fence is signalled, it can > destroyed all its to-be-destroyed buffers without any more checking > 6. Some fence logic (such as signalling of all fences on channel > forced closing) can be moved from drivers to TTM > 7. Some channel logic can be moved from drivers to TTM > 8. The code should be simpler and more elegant > > Note that using a buffer with the CPU doesn't change its position in > the LRU list. > This is good, because evicting a buffer used by the CPU is actually a > good thing, since it will move to a region of memory slower for the > GPU but possibly faster for the CPU. > However, for buffers in system memory, if the LRU list is used for > swapping, CPU access should move the buffer to the front of list > (using the LRU list for this is probably a bad idea though, since > system memory swapping should be at page granularity). > > For multiple channels, things are slightly more complex. > First, we need to built the fence data structure for each channel. > Also, we need the fence/buffer nodes to be separate > Delayed destruction continues to work, but the reference count needs > to be set to the number of channels fencing the buffer on destruction. > Putting buffers in the LRU list can be done by doing n-way merging > between the channel fence lists, assigning each fence a global > sequence number, and using that to merge and put back buffers in the > LRU. > n-way merging can be done with a small heap on the stack on current > hardware where channels are limited. > Furthermore, we need to keep a reference count, so that buffers are > put in the LRU (or destroyed) only after they are off all the > channels. > > The LRU list semantics are relaxed. If a buffer has both its fence > emitted before another buffer, and also signaled before it, then it > will be considered as "less recently" used, and the opposite thing > happens if both are after. Otherwise, the order is undetermined. > > This should still guarantee good eviction behavior, and allows to have > the LRU list only contain evictable buffers, while not requiring to > use a dynamic priority queue. > Otherwise, a red-black tree (or an indirect heap with back references) > indexed by the global fence sequence number can guarantee strict > emission-time LRU semantics. This seems unnecessary though. > > Thoughts? Opinions? > Any possible improvement? > I love the first solution but solution to multi fifo side looks overkilling. So i am wondering if we aren't tackling the issue from the wrong side. Below is slightly different approach to the issue. Idea is that memory management is about having bo in place for the GPU. So instead of looking at the problem from the bo side, look at it from the GPU address space side. For simplicity here i am using only one space VRAM (but GTT or others can be deal the sameway). So now we split the problem to look at it as a transaction any time the driver want to do somethings it ask to ttm to create new transaction each transaction is associated to a fence and list of buffer with their possible placement (i will explain how it can fit multi fifo gpu well too and how it also allow easy support to GPU with virtual address space which is somethings slowly becoming a reality i think). TTM receive a transaction it takes a transaction lock and give transaction number to this new transaction. Then it goes over the BO list and try to come up with the best placement for each of them. To do so TTM know the situation of the address space from the previous active recorded transaction. Let N be the previous active transaction and N+1 the new one. BO are described as Name(offset,size), vram size is 500: Trans[N]: A(0,100), B(100,200), C(400,100) Trans[N+1]: wants A(?,100), and D(?,200) So TTM need to evict either B or C to put D. Now if for each transaction we record for each segment allocated the bo the fence(s) & associated transaction number to this we can support cleverly the multi fifo GPU. We need to associate with each a fence an fifoID (could be named more genericly channel or somethings else as others hw have similar things for instance separate dma unit). Now in order to avoid having fifo waiting too much on each others all we need to do is evict the bo which either have the same fifoID that the transaction N+1 (ie transaction N+1 will be executed in the same fifo so we know that we don't have to wait for the other fifo) or the BO with the oldest transaction number/fence ie the one that will most likely soon won't be needed anymore by GPU. The TTM transaction function is also responsible for handling BO deletion, each time a BO is bound to GPU space (ie in one of the active transaction) TTM takes a ref. When a new transaction comme TTM query for each segement if all fence are expired or not. If all fence are expired and if there is only 1 ref left then that means the BO have been deleted and we could simply remove it from the space (GPU is done with it and userspace don't care about it anymore). Also TTM will had the N+1 fence to the fence list of BO already in a current suitable position, in the example above, the segment where A is will be associated with fence N and fence N + 1. A segment is idle when all the fences associated to it are signaled (this way we support multi-fifo). Once placement are selected TTM transaction record N+1 as the new current state. Here is pseudo code as it speaks more. struct ttm_trans_bo { struct list_head list; struct bo *bo; struct fence_nodes *fences; struct placement pl; u64 gpuaddr; }; struct ttm_trans { struct fence *fence; // transaction fence u32 step; // transaction step struct list_head bos; // list of struct ttm_trans_bo struct list_head evicts; // list of struct ttm_trans_bo }; int ttm_trans_new(ttmdev, struct ttm_trans *t); Upon successfull return ttm_trans_new will fill evicts list with a list of buffer needed to be evicted for this transactions and will complete bos list by setting gpuaddr and narrowing done the placement info to the placement ttm picked up. Then there is 2 differents possibility. If evicts list is empty driver can directly move bo into the respective gpuaddr and schedule the command it wants to execute. If evicts list is not empty then driver as to go through it and for each entry it has to go through fences and wait for them to be signaled before evicting the bo, once all the bo are evicted it can go through bos list and load bo at the proper address. Using a list of fences for each evicted bos is to allow having a bo being shared btw fifos. This design put the burden on the driver, for instance on some hw with multi-fifo the hw can schedule every transaction into the GPU fifo ring: -wait fence A,B,C.... -evict bo .... -copy bo ... -command stream -evict transaction fence GPU need to have wait for functionality and way to wait for fence (depend on the hw) but also need to be able to change the GART by itself. On GPU with no such support it will have to work with the CPU TTM can provide helper workqueue to do this in the most reasonable way. This design depreciated all bo move functionality, and all bo validation one. There is no more delayed delete list or rcu list per space but rather a transaction struct per space recording the current state of the placement from CPU point of view, this state is being updated by querying fence each time there is a call to ttm_trans_new. The pinned buffer can be associated to some transaction that never ends until unpinned. This design also have nice feature for instance when pinninp unpinning a buffer you can provide in the evicts list the list of buffer to unpin and in the bos list the list of buffer to pin, this way TTM has a change to place the new pinned buffer as the same place as the old one. I also believe it fits well to GPU with multi-fifos and last it could also be cleverly use with GPU with virtual address space for each fifos. Cheers, Jerome ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <20100121122920.GB3837-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <20100121122920.GB3837-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2010-01-21 12:59 ` Thomas Hellstrom [not found] ` <4B584FAE.8040801-4+hqylr40dJg9hUCZPvPmw@public.gmane.org> 2010-01-21 15:14 ` Luca Barbieri 1 sibling, 1 reply; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-21 12:59 UTC (permalink / raw) To: Jerome Glisse Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Luca Barbieri, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom Jerome Glisse wrote: > On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote: > >>> We had to do a similar thing in the >>> Poulsbo driver and it turned out that we could save a significant amount of >>> CPU by using a delayed workqueue, collecting objects and destroying them >>> periodically. >>> >> Yes, indeed, we don't really care about a fence expiring unless we >> want to use that buffer or we are out of memory. >> Processing each fence, especially if there is an interrupt per fence, >> seems unnecessary, since you can just do that work in the cases >> mentioned above. >> >> So, here is a new proposal that should have the best features of all >> previously considered methods. >> Assume first that there is a single FIFO channel. >> >> We introduce a list of fence nodes, each containing: >> - A list of buffers NOT to be destroyed having the fence as their sync >> object (or possibly one for each memory type) >> - A delayed-destroy list of buffers having the fence as their sync object >> >> Fence nodes are added at the end upon emission. >> They are removed and freed when the buffer count drops to 0 (meaning >> all buffers have been moved to later fences). >> Thus, the number of fence nodes does not grow without bounds, but is >> bounded by the number of buffers. >> . >> The LRU lists are changed to only contain unfenced buffers and buffers >> are initially put on it. >> When a buffer is fenced, it is removed from the LRU list or its >> current fence list, and added to the new fence (possibly destroying >> the old fence node in the process). >> The reference count of buffer objects is only increased when they are >> put in a delayed destruction list. >> >> Upon deletion, they are destroyed immediately if they are on the LRU >> list and moved to the corresponding delayed-delete list if they are in >> the fenced list. >> >> ttm_bo_delayed_delete is no longer called by any workqueue, but only >> on when we run out of memory, before we try eviction. >> It processes the list until the first unsignalled fence, destroying >> all buffers it finds and freeing all the fence nodes. >> All the buffers in the alive lists are put in the LRU list, in the >> correct order. >> We may either keep an alive list per memory type, or sort the buffers >> by memory type (so they are put in the memory type LRU list) at this >> point >> >> Thus, after ttm_bo_delayed_delete, there is the following scenario: >> 1. Inactive buffers with no references are destroyed >> 2. Inactive buffers with references are in the LRU list >> 3. Fenced buffers with no references are in the delayed-destroy list >> attached to their fence >> 4. Fenced buffers with references are in the alive list attached to their fence >> >> This should have about the same CPU and memory usage of the current >> code, but the following advantages: >> 1. Delayed deletion does not run periodically, but exactly when needed >> (at allocation time, before eviction) >> 2. Delayed deletion always deletes all buffers that can be deleted, >> since it processes them in fence order >> 3. The LRU list at eviction time contains exactly all evictable buffers >> 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so >> only one node is necessary >> 5. Once buffer deletion determines a fence is signalled, it can >> destroyed all its to-be-destroyed buffers without any more checking >> 6. Some fence logic (such as signalling of all fences on channel >> forced closing) can be moved from drivers to TTM >> 7. Some channel logic can be moved from drivers to TTM >> 8. The code should be simpler and more elegant >> >> Note that using a buffer with the CPU doesn't change its position in >> the LRU list. >> This is good, because evicting a buffer used by the CPU is actually a >> good thing, since it will move to a region of memory slower for the >> GPU but possibly faster for the CPU. >> However, for buffers in system memory, if the LRU list is used for >> swapping, CPU access should move the buffer to the front of list >> (using the LRU list for this is probably a bad idea though, since >> system memory swapping should be at page granularity). >> >> For multiple channels, things are slightly more complex. >> First, we need to built the fence data structure for each channel. >> Also, we need the fence/buffer nodes to be separate >> Delayed destruction continues to work, but the reference count needs >> to be set to the number of channels fencing the buffer on destruction. >> Putting buffers in the LRU list can be done by doing n-way merging >> between the channel fence lists, assigning each fence a global >> sequence number, and using that to merge and put back buffers in the >> LRU. >> n-way merging can be done with a small heap on the stack on current >> hardware where channels are limited. >> Furthermore, we need to keep a reference count, so that buffers are >> put in the LRU (or destroyed) only after they are off all the >> channels. >> >> The LRU list semantics are relaxed. If a buffer has both its fence >> emitted before another buffer, and also signaled before it, then it >> will be considered as "less recently" used, and the opposite thing >> happens if both are after. Otherwise, the order is undetermined. >> >> This should still guarantee good eviction behavior, and allows to have >> the LRU list only contain evictable buffers, while not requiring to >> use a dynamic priority queue. >> Otherwise, a red-black tree (or an indirect heap with back references) >> indexed by the global fence sequence number can guarantee strict >> emission-time LRU semantics. This seems unnecessary though. >> >> Thoughts? Opinions? >> Any possible improvement? >> >> > > I love the first solution but solution to multi fifo side > looks overkilling. So i am wondering if we aren't tackling > the issue from the wrong side. Below is slightly different > approach to the issue. > > Idea is that memory management is about having bo in place > for the GPU. So instead of looking at the problem from the > bo side, look at it from the GPU address space side. > > For simplicity here i am using only one space VRAM (but GTT > or others can be deal the sameway). So now we split the > problem to look at it as a transaction any time the driver > want to do somethings it ask to ttm to create new transaction > each transaction is associated to a fence and list of buffer > with their possible placement (i will explain how it can > fit multi fifo gpu well too and how it also allow easy support > to GPU with virtual address space which is somethings slowly > becoming a reality i think). > > TTM receive a transaction it takes a transaction lock and give > transaction number to this new transaction. Then it goes over > the BO list and try to come up with the best placement for > each of them. > Jerome, This seems to have a fundamental flaw of not allowing simultaneous command submission. With the current code, A process blocked trying to validate its buffers will not block other processes from validating, and we should keep it that way. /Thomas ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <4B584FAE.8040801-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <4B584FAE.8040801-4+hqylr40dJg9hUCZPvPmw@public.gmane.org> @ 2010-01-25 8:14 ` Jerome Glisse [not found] ` <20100125081444.GA23124-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Jerome Glisse @ 2010-01-25 8:14 UTC (permalink / raw) To: Thomas Hellstrom Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Luca Barbieri, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom On Thu, Jan 21, 2010 at 01:59:26PM +0100, Thomas Hellstrom wrote: > Jerome Glisse wrote: > >On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote: > >>>We had to do a similar thing in the > >>>Poulsbo driver and it turned out that we could save a significant amount of > >>>CPU by using a delayed workqueue, collecting objects and destroying them > >>>periodically. > >>Yes, indeed, we don't really care about a fence expiring unless we > >>want to use that buffer or we are out of memory. > >>Processing each fence, especially if there is an interrupt per fence, > >>seems unnecessary, since you can just do that work in the cases > >>mentioned above. > >> > >>So, here is a new proposal that should have the best features of all > >>previously considered methods. > >>Assume first that there is a single FIFO channel. > >> > >>We introduce a list of fence nodes, each containing: > >>- A list of buffers NOT to be destroyed having the fence as their sync > >>object (or possibly one for each memory type) > >>- A delayed-destroy list of buffers having the fence as their sync object > >> > >>Fence nodes are added at the end upon emission. > >>They are removed and freed when the buffer count drops to 0 (meaning > >>all buffers have been moved to later fences). > >>Thus, the number of fence nodes does not grow without bounds, but is > >>bounded by the number of buffers. > >>. > >>The LRU lists are changed to only contain unfenced buffers and buffers > >>are initially put on it. > >>When a buffer is fenced, it is removed from the LRU list or its > >>current fence list, and added to the new fence (possibly destroying > >>the old fence node in the process). > >>The reference count of buffer objects is only increased when they are > >>put in a delayed destruction list. > >> > >>Upon deletion, they are destroyed immediately if they are on the LRU > >>list and moved to the corresponding delayed-delete list if they are in > >>the fenced list. > >> > >>ttm_bo_delayed_delete is no longer called by any workqueue, but only > >>on when we run out of memory, before we try eviction. > >>It processes the list until the first unsignalled fence, destroying > >>all buffers it finds and freeing all the fence nodes. > >>All the buffers in the alive lists are put in the LRU list, in the > >>correct order. > >>We may either keep an alive list per memory type, or sort the buffers > >>by memory type (so they are put in the memory type LRU list) at this > >>point > >> > >>Thus, after ttm_bo_delayed_delete, there is the following scenario: > >>1. Inactive buffers with no references are destroyed > >>2. Inactive buffers with references are in the LRU list > >>3. Fenced buffers with no references are in the delayed-destroy list > >>attached to their fence > >>4. Fenced buffers with references are in the alive list attached to their fence > >> > >>This should have about the same CPU and memory usage of the current > >>code, but the following advantages: > >>1. Delayed deletion does not run periodically, but exactly when needed > >>(at allocation time, before eviction) > >>2. Delayed deletion always deletes all buffers that can be deleted, > >>since it processes them in fence order > >>3. The LRU list at eviction time contains exactly all evictable buffers > >>4. Each buffer is always on an LRU _or_ on a delayed destroy list, so > >>only one node is necessary > >>5. Once buffer deletion determines a fence is signalled, it can > >>destroyed all its to-be-destroyed buffers without any more checking > >>6. Some fence logic (such as signalling of all fences on channel > >>forced closing) can be moved from drivers to TTM > >>7. Some channel logic can be moved from drivers to TTM > >>8. The code should be simpler and more elegant > >> > >>Note that using a buffer with the CPU doesn't change its position in > >>the LRU list. > >>This is good, because evicting a buffer used by the CPU is actually a > >>good thing, since it will move to a region of memory slower for the > >>GPU but possibly faster for the CPU. > >>However, for buffers in system memory, if the LRU list is used for > >>swapping, CPU access should move the buffer to the front of list > >>(using the LRU list for this is probably a bad idea though, since > >>system memory swapping should be at page granularity). > >> > >>For multiple channels, things are slightly more complex. > >>First, we need to built the fence data structure for each channel. > >>Also, we need the fence/buffer nodes to be separate > >>Delayed destruction continues to work, but the reference count needs > >>to be set to the number of channels fencing the buffer on destruction. > >>Putting buffers in the LRU list can be done by doing n-way merging > >>between the channel fence lists, assigning each fence a global > >>sequence number, and using that to merge and put back buffers in the > >>LRU. > >>n-way merging can be done with a small heap on the stack on current > >>hardware where channels are limited. > >>Furthermore, we need to keep a reference count, so that buffers are > >>put in the LRU (or destroyed) only after they are off all the > >>channels. > >> > >>The LRU list semantics are relaxed. If a buffer has both its fence > >>emitted before another buffer, and also signaled before it, then it > >>will be considered as "less recently" used, and the opposite thing > >>happens if both are after. Otherwise, the order is undetermined. > >> > >>This should still guarantee good eviction behavior, and allows to have > >>the LRU list only contain evictable buffers, while not requiring to > >>use a dynamic priority queue. > >>Otherwise, a red-black tree (or an indirect heap with back references) > >>indexed by the global fence sequence number can guarantee strict > >>emission-time LRU semantics. This seems unnecessary though. > >> > >>Thoughts? Opinions? > >>Any possible improvement? > >> > > > >I love the first solution but solution to multi fifo side > >looks overkilling. So i am wondering if we aren't tackling > >the issue from the wrong side. Below is slightly different > >approach to the issue. > > > >Idea is that memory management is about having bo in place > >for the GPU. So instead of looking at the problem from the > >bo side, look at it from the GPU address space side. > > > >For simplicity here i am using only one space VRAM (but GTT > >or others can be deal the sameway). So now we split the > >problem to look at it as a transaction any time the driver > >want to do somethings it ask to ttm to create new transaction > >each transaction is associated to a fence and list of buffer > >with their possible placement (i will explain how it can > >fit multi fifo gpu well too and how it also allow easy support > >to GPU with virtual address space which is somethings slowly > >becoming a reality i think). > > > >TTM receive a transaction it takes a transaction lock and give > >transaction number to this new transaction. Then it goes over > >the BO list and try to come up with the best placement for > >each of them. > Jerome, This seems to have a fundamental flaw of not allowing > simultaneous command submission. > > With the current code, A process blocked trying to validate its > buffers will not block other processes from validating, and we > should keep it that way. > > /Thomas > Idea is who comes first is first to be serve :) and it can deal with multiple command submission if the hw can deal with it (sync stuff i am briefly describing in multi fifo part). But yes if process A is first in the kernel and process B could have been executed right away, A will win and have right to evict others, i don't think it will hurt perf, i think most of the usecase we will have enough memory and better userspace driver that the eviction from vram should become rare enough to be noticeable. Cheers, Jerome ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <20100125081444.GA23124-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <20100125081444.GA23124-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2010-01-25 20:51 ` Thomas Hellstrom 0 siblings, 0 replies; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-25 20:51 UTC (permalink / raw) To: Jerome Glisse Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Luca Barbieri, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom Jerome Glisse wrote: > On Thu, Jan 21, 2010 at 01:59:26PM +0100, Thomas Hellstrom wrote: > >> Jerome Glisse wrote: >> >>> On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote: >>> >>>>> We had to do a similar thing in the >>>>> Poulsbo driver and it turned out that we could save a significant amount of >>>>> CPU by using a delayed workqueue, collecting objects and destroying them >>>>> periodically. >>>>> >>>> Yes, indeed, we don't really care about a fence expiring unless we >>>> want to use that buffer or we are out of memory. >>>> Processing each fence, especially if there is an interrupt per fence, >>>> seems unnecessary, since you can just do that work in the cases >>>> mentioned above. >>>> >>>> So, here is a new proposal that should have the best features of all >>>> previously considered methods. >>>> Assume first that there is a single FIFO channel. >>>> >>>> We introduce a list of fence nodes, each containing: >>>> - A list of buffers NOT to be destroyed having the fence as their sync >>>> object (or possibly one for each memory type) >>>> - A delayed-destroy list of buffers having the fence as their sync object >>>> >>>> Fence nodes are added at the end upon emission. >>>> They are removed and freed when the buffer count drops to 0 (meaning >>>> all buffers have been moved to later fences). >>>> Thus, the number of fence nodes does not grow without bounds, but is >>>> bounded by the number of buffers. >>>> . >>>> The LRU lists are changed to only contain unfenced buffers and buffers >>>> are initially put on it. >>>> When a buffer is fenced, it is removed from the LRU list or its >>>> current fence list, and added to the new fence (possibly destroying >>>> the old fence node in the process). >>>> The reference count of buffer objects is only increased when they are >>>> put in a delayed destruction list. >>>> >>>> Upon deletion, they are destroyed immediately if they are on the LRU >>>> list and moved to the corresponding delayed-delete list if they are in >>>> the fenced list. >>>> >>>> ttm_bo_delayed_delete is no longer called by any workqueue, but only >>>> on when we run out of memory, before we try eviction. >>>> It processes the list until the first unsignalled fence, destroying >>>> all buffers it finds and freeing all the fence nodes. >>>> All the buffers in the alive lists are put in the LRU list, in the >>>> correct order. >>>> We may either keep an alive list per memory type, or sort the buffers >>>> by memory type (so they are put in the memory type LRU list) at this >>>> point >>>> >>>> Thus, after ttm_bo_delayed_delete, there is the following scenario: >>>> 1. Inactive buffers with no references are destroyed >>>> 2. Inactive buffers with references are in the LRU list >>>> 3. Fenced buffers with no references are in the delayed-destroy list >>>> attached to their fence >>>> 4. Fenced buffers with references are in the alive list attached to their fence >>>> >>>> This should have about the same CPU and memory usage of the current >>>> code, but the following advantages: >>>> 1. Delayed deletion does not run periodically, but exactly when needed >>>> (at allocation time, before eviction) >>>> 2. Delayed deletion always deletes all buffers that can be deleted, >>>> since it processes them in fence order >>>> 3. The LRU list at eviction time contains exactly all evictable buffers >>>> 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so >>>> only one node is necessary >>>> 5. Once buffer deletion determines a fence is signalled, it can >>>> destroyed all its to-be-destroyed buffers without any more checking >>>> 6. Some fence logic (such as signalling of all fences on channel >>>> forced closing) can be moved from drivers to TTM >>>> 7. Some channel logic can be moved from drivers to TTM >>>> 8. The code should be simpler and more elegant >>>> >>>> Note that using a buffer with the CPU doesn't change its position in >>>> the LRU list. >>>> This is good, because evicting a buffer used by the CPU is actually a >>>> good thing, since it will move to a region of memory slower for the >>>> GPU but possibly faster for the CPU. >>>> However, for buffers in system memory, if the LRU list is used for >>>> swapping, CPU access should move the buffer to the front of list >>>> (using the LRU list for this is probably a bad idea though, since >>>> system memory swapping should be at page granularity). >>>> >>>> For multiple channels, things are slightly more complex. >>>> First, we need to built the fence data structure for each channel. >>>> Also, we need the fence/buffer nodes to be separate >>>> Delayed destruction continues to work, but the reference count needs >>>> to be set to the number of channels fencing the buffer on destruction. >>>> Putting buffers in the LRU list can be done by doing n-way merging >>>> between the channel fence lists, assigning each fence a global >>>> sequence number, and using that to merge and put back buffers in the >>>> LRU. >>>> n-way merging can be done with a small heap on the stack on current >>>> hardware where channels are limited. >>>> Furthermore, we need to keep a reference count, so that buffers are >>>> put in the LRU (or destroyed) only after they are off all the >>>> channels. >>>> >>>> The LRU list semantics are relaxed. If a buffer has both its fence >>>> emitted before another buffer, and also signaled before it, then it >>>> will be considered as "less recently" used, and the opposite thing >>>> happens if both are after. Otherwise, the order is undetermined. >>>> >>>> This should still guarantee good eviction behavior, and allows to have >>>> the LRU list only contain evictable buffers, while not requiring to >>>> use a dynamic priority queue. >>>> Otherwise, a red-black tree (or an indirect heap with back references) >>>> indexed by the global fence sequence number can guarantee strict >>>> emission-time LRU semantics. This seems unnecessary though. >>>> >>>> Thoughts? Opinions? >>>> Any possible improvement? >>>> >>>> >>> I love the first solution but solution to multi fifo side >>> looks overkilling. So i am wondering if we aren't tackling >>> the issue from the wrong side. Below is slightly different >>> approach to the issue. >>> >>> Idea is that memory management is about having bo in place >>> for the GPU. So instead of looking at the problem from the >>> bo side, look at it from the GPU address space side. >>> >>> For simplicity here i am using only one space VRAM (but GTT >>> or others can be deal the sameway). So now we split the >>> problem to look at it as a transaction any time the driver >>> want to do somethings it ask to ttm to create new transaction >>> each transaction is associated to a fence and list of buffer >>> with their possible placement (i will explain how it can >>> fit multi fifo gpu well too and how it also allow easy support >>> to GPU with virtual address space which is somethings slowly >>> becoming a reality i think). >>> >>> TTM receive a transaction it takes a transaction lock and give >>> transaction number to this new transaction. Then it goes over >>> the BO list and try to come up with the best placement for >>> each of them. >>> >> Jerome, This seems to have a fundamental flaw of not allowing >> simultaneous command submission. >> >> With the current code, A process blocked trying to validate its >> buffers will not block other processes from validating, and we >> should keep it that way. >> >> /Thomas >> >> > > Idea is who comes first is first to be serve :) and it can deal > with multiple command submission if the hw can deal with it > (sync stuff i am briefly describing in multi fifo part). But > yes if process A is first in the kernel and process B could > have been executed right away, A will win and have right to > evict others, i don't think it will hurt perf, i think most > of the usecase we will have enough memory and better userspace > driver that the eviction from vram should become rare enough > to be noticeable. > > Cheers, > Jerome > No, IMHO this is a big step back to "let's serialise everything with a single so we don't have to bother". Imagine a process blocking on VRAM space, and another process waiting that requires only AGP + pinned VRAM memory. That process shouldn't be blocked. A design rule of TTM is to *never* keep a global lock when waiting for the GPU, and I want to keep it that way. /Thomas ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <20100121122920.GB3837-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-21 12:59 ` Thomas Hellstrom @ 2010-01-21 15:14 ` Luca Barbieri [not found] ` <ff13bc9a1001210714m34c3976etc7680f056cb55453-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-21 15:14 UTC (permalink / raw) To: Jerome Glisse Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom I'm not sure I understand your proposal correctly. It seems your proposoal is similar to mine, replacing the term "fence nodes" with "ttm transactions", but I'm not sure if I understand it correctly Here is some pseudocode for a improved, simplified version of my proposal. It is modified so that there are no longer distinct alive/destroy lists, but buffers are destroyed if their ref count is zero. list_head ram_lru_list; /* list of bos */ list_head vram_unfenced_lru_list; /* list of bos */ list_head gart_unfenced_lru_list; /* list of bos */ atomic uint64_t next_seq_num; // if read_list and write_list are empty, the buffer is unfenced and MUST be in an unfenced lru list // otherwise, it is fenced and MUST be, if not zombie, on some read_list/write_list, or if zombie, on some unfenced_list struct ttm_buffer_object { kref_t ref; list_head read_list; /* list of bo_fence_nodes */ list_head write_list; /* list of bo_fence_nodes */ list_head unfenced_list; /* list entry in [ram|[gart|vram]_unfenced]_lru_list */ [...] }; // TODO: we could embed just the first two members in the ttm_buffer_object, and steal the lowest bit on the fence pointer to signify that // this would optimize for the very common single-fence case struct ttm_bo_fence_node { list_head bo_list; /* list entry in bo.[read|write]_list */ struct ttm_fence_node* fence; struct ttm_buffer_object* bo; list_head fence_list; /* list entry in fence.[vram|gart|destroy]_list */ }; struct ttm_fence { void* sync_object; /* may also turned into an object containing a ttm_fence at the start */ uint64_t seq_num; /* unique value in order of kmalloc of this ttm_fence */ list_head bo_list; /* list of bo_fence_nodes */ }; struct ttm_channel { list_head fence_list; /* list of ttm_fence_node */ }; ttm_flush_fences() { list_head vram_list[MAX_CHANNELS]; list_head gart_list[MAX_CHANNELS]; foreach channel { foreach fence in channel { if(not driver->fence_expired(fence->sync_object)) break; foreach bo_fence_node in fence.bo_list { remove bo_fence_node if bo.read_list and bo.write_list are both empty { if bo.refcount is zero destroy else { append to [vram|gart]_list[channel] } } } } } // this is the n-way merge of vram_list[]s into the lru list while(vram_list[]s are not all empty) { // this can use either a simple scan, or an heap find channel such that first_entry(vram_list[channel]).fence.seq_num is smallest remove first_entry(vram_list[channel]) and put the bo at the recent end of vram_unfenced_lru_list } same thing for gart; } // assume buffer is reserved, use current mechanism or mutex for that // channel may be null for CPU waits ttm_bo_wait(bo, channel, wait_for_write) { foreach fence_node in bo.write_list { if(fence_node.channel != channel) driver->wait_fence(fence_node.fence.sync_object); } if(!wait_for_write) return; foreach fence_node in bo.read_list { if(fence_node.channel != channel) driver->wait_fence(fence_node.fence.sync_object); } } ttm_out_of_memory() takes memory_alloc_mutex { retry: ttm_flush_fences(); if(we have enough space now) return; foreach in [vram|gart]_unfenced_lru_list { evict that buffer if it's big enough, no need to check fences this uses the current "ghost" mechanism for accelerated moves emit evicted_buffer_fence for after emission } if we didn't find a big enough buffer, evict the biggest buffer (also consider empty space around it in size) if(we have enough space now) return; if(burn_cpu_time_but_have_lowest_latencies) { while(!driver->fence_expired(evicted_bo->sync_object) and we don't have enough space) { driver->wait_for_any_fence_to_expire(); ttm_flush_fences(); } } else ttm_bo_wait(evicted_bo 0) goto retry; } // assume the buffer has already been put in the desired memory space // also assume we already waited for the buffer fences ttm_add_buffer_to_fence(fence, bo) { remove bo from unfenced lru list if it's on it for the none or single bo_fence_node in bo.read_list or bo.write_list with bo_fence_node.fence.channel == fence.channel { remove bo_fence_node from bo_fence_node.fence.[gart|vram]_list if(bo_fence_node.fence has all lists empty) remove from channel and free the fence bo_fence_node.fence remove bo_fence_node from bo.[read|list]_list } create a new bo_fence node and use it to add the bo to the fence } ttm_bo_refcount_drops_to_zero(bo) { if(bo.read_list is empty and bo.write_list is empty) free(bo); // otherwise ttm_flush_fences takes care of this, no need to do anything here } linux_get_free_page_is_out_of_memory { ttm_flush_fences(); // destroy any buffers we can destroy } use_buffer_with_cpu(bo) { if(bo is in vram) return; // CPU access should not prevent eviction of VRAM buffers if(bo is unfenced and thus on an lru_list) move to recent end of lru_list it is on } This code does not itself handle inter-channel synchronization, which is can be done by driver in accelerated moves and before calling ttm_add_buffer_to_fence(fence, bo) Drivers without on-GPU synchronization can just do ttm_bo_wait(bo, channel), others can use whatever scheme they think is best. Actually we may not need to add any "sync domain" concept to TTM, as it seems possible to do synchronization in the driver, at least in this scheme. ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001210714m34c3976etc7680f056cb55453-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001210714m34c3976etc7680f056cb55453-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-25 8:20 ` Jerome Glisse 0 siblings, 0 replies; 34+ messages in thread From: Jerome Glisse @ 2010-01-25 8:20 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom On Thu, Jan 21, 2010 at 04:14:39PM +0100, Luca Barbieri wrote: > I'm not sure I understand your proposal correctly. > It seems your proposoal is similar to mine, replacing the term "fence > nodes" with "ttm transactions", but I'm not sure if I understand it > correctly > > Here is some pseudocode for a improved, simplified version of my proposal. > It is modified so that there are no longer distinct alive/destroy > lists, but buffers are destroyed if their ref count is zero. > > list_head ram_lru_list; /* list of bos */ > list_head vram_unfenced_lru_list; /* list of bos */ > list_head gart_unfenced_lru_list; /* list of bos */ > > atomic uint64_t next_seq_num; > > // if read_list and write_list are empty, the buffer is unfenced and > MUST be in an unfenced lru list > // otherwise, it is fenced and MUST be, if not zombie, on some > read_list/write_list, or if zombie, on some unfenced_list > struct ttm_buffer_object > { > kref_t ref; > list_head read_list; /* list of bo_fence_nodes */ > list_head write_list; /* list of bo_fence_nodes */ > list_head unfenced_list; /* list entry in > [ram|[gart|vram]_unfenced]_lru_list */ > [...] > }; > > // TODO: we could embed just the first two members in the > ttm_buffer_object, and steal the lowest bit on the fence pointer to > signify that > // this would optimize for the very common single-fence case > struct ttm_bo_fence_node > { > list_head bo_list; /* list entry in bo.[read|write]_list */ > struct ttm_fence_node* fence; > > struct ttm_buffer_object* bo; > list_head fence_list; /* list entry in fence.[vram|gart|destroy]_list */ > }; > > struct ttm_fence > { > void* sync_object; /* may also turned into an object containing a > ttm_fence at the start */ > uint64_t seq_num; /* unique value in order of kmalloc of this ttm_fence */ > list_head bo_list; /* list of bo_fence_nodes */ > }; > > struct ttm_channel > { > list_head fence_list; /* list of ttm_fence_node */ > }; > > ttm_flush_fences() > { > list_head vram_list[MAX_CHANNELS]; > list_head gart_list[MAX_CHANNELS]; > foreach channel > { > foreach fence in channel > { > if(not driver->fence_expired(fence->sync_object)) > break; > foreach bo_fence_node in fence.bo_list > { > remove bo_fence_node > if bo.read_list and bo.write_list are both empty > { > if bo.refcount is zero > destroy > else > { > append to [vram|gart]_list[channel] > } > } > } > } > } > > // this is the n-way merge of vram_list[]s into the lru list > while(vram_list[]s are not all empty) > { > // this can use either a simple scan, or an heap > find channel such that > first_entry(vram_list[channel]).fence.seq_num is smallest > > remove first_entry(vram_list[channel]) and put the bo at the > recent end of vram_unfenced_lru_list > } > > same thing for gart; > } > > // assume buffer is reserved, use current mechanism or mutex for that > // channel may be null for CPU waits > ttm_bo_wait(bo, channel, wait_for_write) > { > foreach fence_node in bo.write_list > { > if(fence_node.channel != channel) > driver->wait_fence(fence_node.fence.sync_object); > } > > if(!wait_for_write) > return; > > foreach fence_node in bo.read_list > { > if(fence_node.channel != channel) > driver->wait_fence(fence_node.fence.sync_object); > } > } > > ttm_out_of_memory() takes memory_alloc_mutex > { > retry: > ttm_flush_fences(); > if(we have enough space now) > return; > foreach in [vram|gart]_unfenced_lru_list > { > evict that buffer if it's big enough, no need to check fences > this uses the current "ghost" mechanism for accelerated moves > emit evicted_buffer_fence for after emission > } > if we didn't find a big enough buffer, evict the biggest buffer > (also consider empty space around it in size) > if(we have enough space now) > return; > if(burn_cpu_time_but_have_lowest_latencies) > { > while(!driver->fence_expired(evicted_bo->sync_object) and we > don't have enough space) > { > driver->wait_for_any_fence_to_expire(); > ttm_flush_fences(); > } > } > else > ttm_bo_wait(evicted_bo 0) > goto retry; > } > > // assume the buffer has already been put in the desired memory space > // also assume we already waited for the buffer fences > ttm_add_buffer_to_fence(fence, bo) > { > remove bo from unfenced lru list if it's on it > for the none or single bo_fence_node in bo.read_list or > bo.write_list with bo_fence_node.fence.channel == fence.channel > { > remove bo_fence_node from bo_fence_node.fence.[gart|vram]_list > if(bo_fence_node.fence has all lists empty) > remove from channel and free the fence bo_fence_node.fence > remove bo_fence_node from bo.[read|list]_list > } > > create a new bo_fence node and use it to add the bo to the fence > } > > ttm_bo_refcount_drops_to_zero(bo) > { > if(bo.read_list is empty and bo.write_list is empty) > free(bo); > // otherwise ttm_flush_fences takes care of this, no need to do anything here > } > > linux_get_free_page_is_out_of_memory > { > ttm_flush_fences(); // destroy any buffers we can destroy > } > > use_buffer_with_cpu(bo) > { > if(bo is in vram) > return; // CPU access should not prevent eviction of VRAM buffers > if(bo is unfenced and thus on an lru_list) > move to recent end of lru_list it is on > } > > This code does not itself handle inter-channel synchronization, which > is can be done by driver in accelerated moves and before calling > ttm_add_buffer_to_fence(fence, bo) > Drivers without on-GPU synchronization can just do ttm_bo_wait(bo, > channel), others can use whatever scheme they think is best. > > Actually we may not need to add any "sync domain" concept to TTM, as > it seems possible to do synchronization in the driver, at least in > this scheme. > My proposal is a more radical change in which we leave the BO POV managing to go for a GPU memory space managing. It's more like doing a transactional view of GPU in which for each transaction you ask a placement for the BO. Idea is that GPU with enough hw capacity can easily queue BO moves and forget about them and it should also allow simplier code for sync. But this was just an idea. Cheers, Jerome ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001201949l4691f202v2a2874b9cef86f37-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2010-01-21 12:29 ` Jerome Glisse @ 2010-01-21 12:53 ` Thomas Hellstrom [not found] ` <4B584E48.8020806-4+hqylr40dJg9hUCZPvPmw@public.gmane.org> 1 sibling, 1 reply; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-21 12:53 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Luca Barbieri wrote: >> We had to do a similar thing in the >> Poulsbo driver and it turned out that we could save a significant amount of >> CPU by using a delayed workqueue, collecting objects and destroying them >> periodically. >> > Yes, indeed, we don't really care about a fence expiring unless we > want to use that buffer or we are out of memory. > Processing each fence, especially if there is an interrupt per fence, > seems unnecessary, since you can just do that work in the cases > mentioned above. > > So, here is a new proposal that should have the best features of all > previously considered methods. > Assume first that there is a single FIFO channel. > > We introduce a list of fence nodes, each containing: > - A list of buffers NOT to be destroyed having the fence as their sync > object (or possibly one for each memory type) > - A delayed-destroy list of buffers having the fence as their sync object > > Fence nodes are added at the end upon emission. > They are removed and freed when the buffer count drops to 0 (meaning > all buffers have been moved to later fences). > Thus, the number of fence nodes does not grow without bounds, but is > bounded by the number of buffers. > . > The LRU lists are changed to only contain unfenced buffers and buffers > are initially put on it. > When a buffer is fenced, it is removed from the LRU list or its > current fence list, and added to the new fence (possibly destroying > the old fence node in the process). > The reference count of buffer objects is only increased when they are > put in a delayed destruction list. > > Upon deletion, they are destroyed immediately if they are on the LRU > list and moved to the corresponding delayed-delete list if they are in > the fenced list. > > ttm_bo_delayed_delete is no longer called by any workqueue, but only > on when we run out of memory, before we try eviction. > It processes the list until the first unsignalled fence, destroying > all buffers it finds and freeing all the fence nodes. > All the buffers in the alive lists are put in the LRU list, in the > correct order. > We may either keep an alive list per memory type, or sort the buffers > by memory type (so they are put in the memory type LRU list) at this > point > > Thus, after ttm_bo_delayed_delete, there is the following scenario: > 1. Inactive buffers with no references are destroyed > 2. Inactive buffers with references are in the LRU list > 3. Fenced buffers with no references are in the delayed-destroy list > attached to their fence > 4. Fenced buffers with references are in the alive list attached to their fence > > This should have about the same CPU and memory usage of the current > code, but the following advantages: > 1. Delayed deletion does not run periodically, but exactly when needed > (at allocation time, before eviction) > 2. Delayed deletion always deletes all buffers that can be deleted, > since it processes them in fence order > 3. The LRU list at eviction time contains exactly all evictable buffers > 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so > only one node is necessary > 5. Once buffer deletion determines a fence is signalled, it can > destroyed all its to-be-destroyed buffers without any more checking > 6. Some fence logic (such as signalling of all fences on channel > forced closing) can be moved from drivers to TTM > 7. Some channel logic can be moved from drivers to TTM > 8. The code should be simpler and more elegant > > Note that using a buffer with the CPU doesn't change its position in > the LRU list. > This is good, because evicting a buffer used by the CPU is actually a > good thing, since it will move to a region of memory slower for the > GPU but possibly faster for the CPU. > However, for buffers in system memory, if the LRU list is used for > swapping, CPU access should move the buffer to the front of list > (using the LRU list for this is probably a bad idea though, since > system memory swapping should be at page granularity). > > For multiple channels, things are slightly more complex. > First, we need to built the fence data structure for each channel. > Also, we need the fence/buffer nodes to be separate > Delayed destruction continues to work, but the reference count needs > to be set to the number of channels fencing the buffer on destruction. > Putting buffers in the LRU list can be done by doing n-way merging > between the channel fence lists, assigning each fence a global > sequence number, and using that to merge and put back buffers in the > LRU. > n-way merging can be done with a small heap on the stack on current > hardware where channels are limited. > Furthermore, we need to keep a reference count, so that buffers are > put in the LRU (or destroyed) only after they are off all the > channels. > > The LRU list semantics are relaxed. If a buffer has both its fence > emitted before another buffer, and also signaled before it, then it > will be considered as "less recently" used, and the opposite thing > happens if both are after. Otherwise, the order is undetermined. > > This should still guarantee good eviction behavior, and allows to have > the LRU list only contain evictable buffers, while not requiring to > use a dynamic priority queue. > Otherwise, a red-black tree (or an indirect heap with back references) > indexed by the global fence sequence number can guarantee strict > emission-time LRU semantics. This seems unnecessary though. > > Thoughts? Opinions? > Any possible improvement? > Hi, Luca. I have some comments, but I'm heading off for vacation and back in a week. At a first glance: 1) We probably *will* need a delayed destroyed workqueue to avoid wasting memory that otherwise should be freed to the system. At the very least, the delayed delete process should optionally be run by a system shrinker. 2) Fences in TTM are currently not necessarily strictly ordered, and sequence numbers are hidden from the bo code. This means, for a given FIFO, fence sequence 3 may expire before fence sequence 2, depending on the usage of the buffer. The old Poulsbo and the experimental openChrome drivers are using this feature, and also the long deprecated intel TTM implementation. The reason for this is that the same FIFO may hand commands to different HW enigines, and the engines execute in parallel. With the asynchronous memory management I've discussed earlier, there will be a driver callback to order two fences from the same FIFO, but that callback will have to either idle one of the engines, flush the GPU caches or insert some kind of syncing primitive into the command stream. If we need to order all fences, we will do unnecessary idling / flushing / syncing. I'll comment a bit more when I get back. /Thomas ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <4B584E48.8020806-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <4B584E48.8020806-4+hqylr40dJg9hUCZPvPmw@public.gmane.org> @ 2010-01-21 13:40 ` Luca Barbieri [not found] ` <ff13bc9a1001210540t7dc2b7a7p38359ca82b8b3eb4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-21 13:40 UTC (permalink / raw) To: Thomas Hellstrom Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org > At a first glance: > > 1) We probably *will* need a delayed destroyed workqueue to avoid wasting > memory that otherwise should be freed to the system. At the very least, the > delayed delete process should optionally be run by a system shrinker. You are right. For VRAM we don't care since we are the only user, while for system backed memory some delayed destruction will be needed. The logical extension of the scheme would be for the Linux page allocator/swapper to check for TTM buffers to destroy when it would otherwise shrink caches, try to swap and/or wait on swap to happen. Not sure whether there are existing hooks for this or where exactly to hook this code. > 2) Fences in TTM are currently not necessarily strictly ordered, and > sequence numbers are hidden from the bo code. This means, for a given FIFO, > fence sequence 3 may expire before fence sequence 2, depending on the usage > of the buffer. My definition of "channel" (I sometimes used FIFO incorrectly as a synonym of that) is exactly a set of fences that are strictly ordered. If the card has multiple HW engines, each is considered a different channel (so that a channel becomes a (fifo, engine) pair). We may need however to add the concept of a "sync domain" that would be a set of channels that support on-GPU synchronization against each other. This would model hardware where channels with the same FIFO can be synchronized together but those with different FIFOs don't, and also multi-core GPUs where synchronization might be available only inside each core and not across cores. To sum it up, a GPU consists of a set of sync domains, each consisting of a set of channels, each consisting of a sequence of fences, with the following rules: 1. Fences within the same channel expire in order 2. If channels A and B belong to the same sync domain, it's possible to emit a fence on A that is guaranteed to expire after an arbitrary fence of B Whether channels have the same FIFO or not is essentially a driver implementation detail, and what TTM cares about is if they are in the same sync domain. [I just made up "sync domain" here: is there a standard term?] This assumes that the "synchronizability" graph is a disjoint union of complete graphs. Is there any example where it is not so? Also, does this actually model correctly Poulsbo, or am I wrong? Note that we could use CPU mediation more than we currently do. For instance now Nouveau, to do inter-channel synchronization, simply waits on the fence with the CPU immediately synchronously, while it could instead queue the commands in software, and with an interrupt/delayed mechanism submit them to hardware once the fence to be waited for is expired. ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001210540t7dc2b7a7p38359ca82b8b3eb4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001210540t7dc2b7a7p38359ca82b8b3eb4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-21 14:07 ` Francisco Jerez [not found] ` <87vdeveekk.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org> 2010-01-21 14:23 ` Thomas Hellstrom 1 sibling, 1 reply; 34+ messages in thread From: Francisco Jerez @ 2010-01-21 14:07 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom [-- Attachment #1.1.1: Type: text/plain, Size: 3292 bytes --] Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> writes: >> At a first glance: >> >> 1) We probably *will* need a delayed destroyed workqueue to avoid wasting >> memory that otherwise should be freed to the system. At the very least, the >> delayed delete process should optionally be run by a system shrinker. > You are right. For VRAM we don't care since we are the only user, > while for system backed memory some delayed destruction will be > needed. > The logical extension of the scheme would be for the Linux page > allocator/swapper to check for TTM buffers to destroy when it would > otherwise shrink caches, try to swap and/or wait on swap to happen. > Not sure whether there are existing hooks for this or where exactly to > hook this code. > >> 2) Fences in TTM are currently not necessarily strictly ordered, and >> sequence numbers are hidden from the bo code. This means, for a given FIFO, >> fence sequence 3 may expire before fence sequence 2, depending on the usage >> of the buffer. > > My definition of "channel" (I sometimes used FIFO incorrectly as a > synonym of that) is exactly a set of fences that are strictly ordered. > If the card has multiple HW engines, each is considered a different > channel (so that a channel becomes a (fifo, engine) pair). > > We may need however to add the concept of a "sync domain" that would > be a set of channels that support on-GPU synchronization against each > other. > This would model hardware where channels with the same FIFO can be > synchronized together but those with different FIFOs don't, and also > multi-core GPUs where synchronization might be available only inside > each core and not across cores. > > To sum it up, a GPU consists of a set of sync domains, each consisting > of a set of channels, each consisting of a sequence of fences, with > the following rules: > 1. Fences within the same channel expire in order > 2. If channels A and B belong to the same sync domain, it's possible > to emit a fence on A that is guaranteed to expire after an arbitrary > fence of B > > Whether channels have the same FIFO or not is essentially a driver > implementation detail, and what TTM cares about is if they are in the > same sync domain. > > [I just made up "sync domain" here: is there a standard term?] > > This assumes that the "synchronizability" graph is a disjoint union of > complete graphs. Is there any example where it is not so? > Also, does this actually model correctly Poulsbo, or am I wrong? > > Note that we could use CPU mediation more than we currently do. > For instance now Nouveau, to do inter-channel synchronization, simply > waits on the fence with the CPU immediately synchronously, while it > could instead queue the commands in software, and with an > interrupt/delayed mechanism submit them to hardware once the fence to > be waited for is expired. Nvidia cards have a synchronization primitive that could be used to synchronize several FIFOs in hardware (AKA semaphores, see [1] for an example). > _______________________________________________ > Nouveau mailing list > Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org > http://lists.freedesktop.org/mailman/listinfo/nouveau [1] http://lists.freedesktop.org/archives/nouveau/2009-December/004514.html [-- Attachment #1.2: Type: application/pgp-signature, Size: 197 bytes --] [-- Attachment #2: Type: text/plain, Size: 181 bytes --] _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <87vdeveekk.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <87vdeveekk.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org> @ 2010-01-21 14:17 ` Luca Barbieri [not found] ` <ff13bc9a1001210617qe37ab7at949d545216693608-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-21 14:17 UTC (permalink / raw) To: Francisco Jerez Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom > Nvidia cards have a synchronization primitive that could be used to > synchronize several FIFOs in hardware (AKA semaphores, see [1] for an > example). Does this operate wholly on the GPU on all nVidia cards? It seems that at least on some GPUs this will trigger "software methods" that are basically a way for the GPU to trigger an interrupt and stop the FIFO until the CPU handles the interrupt and restarts it. Also, is there a way on nVidia cards to get interrupts on fences, but only where the fence sequence number is higher than a dynamically set value? (so that one could sleep for fence X without getting an interrupt for every single fence before that) If not, it could possibly be hacked around by reading from a DMA object at the address of the fence sequence number and then resizing the DMA object so that addresses from a certain point on would trigger a protection fault interrupt. ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001210617qe37ab7at949d545216693608-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001210617qe37ab7at949d545216693608-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-21 14:44 ` Francisco Jerez [not found] ` <87pr53ecwd.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Francisco Jerez @ 2010-01-21 14:44 UTC (permalink / raw) To: Luca Barbieri Cc: Thomas Hellstrom, Thomas Hellstrom, airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org [-- Attachment #1.1.1: Type: text/plain, Size: 1334 bytes --] Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> writes: >> Nvidia cards have a synchronization primitive that could be used to >> synchronize several FIFOs in hardware (AKA semaphores, see [1] for an >> example). > > Does this operate wholly on the GPU on all nVidia cards? > > It seems that at least on some GPUs this will trigger "software > methods" that are basically a way for the GPU to trigger an interrupt > and stop the FIFO until the CPU handles the interrupt and restarts it. > Yes, it could be handled without software methods, Maarten (CC'ed) was looking into this, IIRC. > Also, is there a way on nVidia cards to get interrupts on fences, but > only where the fence sequence number is higher than a dynamically set > value? (so that one could sleep for fence X without getting an > interrupt for every single fence before that) > > If not, it could possibly be hacked around by reading from a DMA > object at the address of the fence sequence number and then resizing > the DMA object so that addresses from a certain point on would trigger > a protection fault interrupt. I don't think you can safely modify a DMA object without stalling the card completely, but i think you could use e.g. PGRAPH NOTIFY interrupts and disable them by flipping a bit in PGRAPH when you stop caring about them. [-- Attachment #1.2: Type: application/pgp-signature, Size: 197 bytes --] [-- Attachment #2: Type: text/plain, Size: 181 bytes --] _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <87pr53ecwd.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <87pr53ecwd.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org> @ 2010-01-21 15:36 ` Luca Barbieri [not found] ` <ff13bc9a1001210736k71740417r7b129c70374fece3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2010-01-21 15:39 ` Maarten Maathuis 1 sibling, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-21 15:36 UTC (permalink / raw) To: Francisco Jerez Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom >> If not, it could possibly be hacked around by reading from a DMA >> object at the address of the fence sequence number and then resizing >> the DMA object so that addresses from a certain point on would trigger >> a protection fault interrupt. > > I don't think you can safely modify a DMA object without stalling the > card completely, but i think you could use e.g. PGRAPH NOTIFY interrupts > and disable them by flipping a bit in PGRAPH when you stop caring about > them. The problem is that one needs to disable them *before* the one he cares about. Suppose the card is at fence 0 and we are interested in fence 1000 expiring. If we just enable interrupts, then we are going to be interrupted uselessly 1000 times. Instead, we would like to tell the card "send me interrupts for fences, but only for fence number 1000 (or higher)". This way one could for instance render a whole scene, and then desiring to read it into the CPU, just ask for an interrupt once rendering is done (i.e. wait for the framebuffer fence) and get a single interrupt, while we cleanly sleep undisturbed in the meantime. Basically, it would just need some way of *conditionally* causing interrupts. If there is none, then maybe we could insert a call to a "fence mini-pushbuffer" filled with NOPs that could be overwritten with an interrupt request on demand? Or alternatively, construct such a pushbuffer with the 2D or 3D engines, starting from our "1000" input and the fence value generated by the 3D engine? (this is likely to be slow though). Or some hack like the DMA object resizing? (would it crash the GPU? or just not work due to it caching the previous size?) ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001210736k71740417r7b129c70374fece3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001210736k71740417r7b129c70374fece3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-21 16:18 ` Francisco Jerez [not found] ` <87y6jrbfef.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Francisco Jerez @ 2010-01-21 16:18 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom [-- Attachment #1.1.1: Type: text/plain, Size: 2068 bytes --] Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> writes: >>> If not, it could possibly be hacked around by reading from a DMA >>> object at the address of the fence sequence number and then resizing >>> the DMA object so that addresses from a certain point on would trigger >>> a protection fault interrupt. >> >> I don't think you can safely modify a DMA object without stalling the >> card completely, but i think you could use e.g. PGRAPH NOTIFY interrupts >> and disable them by flipping a bit in PGRAPH when you stop caring about >> them. > > The problem is that one needs to disable them *before* the one he cares about. > > Suppose the card is at fence 0 and we are interested in fence 1000 expiring. That isn't the most common case. You typically don't expect the card to get further than 3-4 pushbufs behind the CPU. > > If we just enable interrupts, then we are going to be interrupted > uselessly 1000 times. > Instead, we would like to tell the card "send me interrupts for > fences, but only for fence number 1000 (or higher)". > > This way one could for instance render a whole scene, and then > desiring to read it into the CPU, just ask for an interrupt once > rendering is done (i.e. wait for the framebuffer fence) and get a > single interrupt, while we cleanly sleep undisturbed in the meantime. > > Basically, it would just need some way of *conditionally* causing interrupts. > If there is none, then maybe we could insert a call to a "fence > mini-pushbuffer" filled with NOPs that could be overwritten with an > interrupt request on demand? The card also keeps some internal FIFO caches, so it seems to me that getting that safe of races and efficient at the same time could be a bit non-trivial. > Or alternatively, construct such a pushbuffer with the 2D or 3D > engines, starting from our "1000" input and the fence value generated > by the 3D engine? (this is likely to be slow though). > Or some hack like the DMA object resizing? (would it crash the GPU? or > just not work due to it caching the previous size?) [-- Attachment #1.2: Type: application/pgp-signature, Size: 197 bytes --] [-- Attachment #2: Type: text/plain, Size: 181 bytes --] _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <87y6jrbfef.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <87y6jrbfef.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org> @ 2010-01-21 16:30 ` Luca Barbieri [not found] ` <ff13bc9a1001210830g2653f672r918f8cb90cbf6170-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-21 16:30 UTC (permalink / raw) To: Francisco Jerez Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom > The card also keeps some internal FIFO caches, so it seems to me that > getting that safe of races and efficient at the same time could be a bit > non-trivial. Are they flushable? It seems this *might* do the job: if (!pfifo->cache_flush(dev)) return; pfifo->reassign(dev, false); pfifo->cache_flush(dev); pfifo->cache_pull(dev, false); pfifo->cache_pull(dev, true); pfifo->reassign(dev, true); ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001210830g2653f672r918f8cb90cbf6170-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001210830g2653f672r918f8cb90cbf6170-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-21 17:39 ` Francisco Jerez 0 siblings, 0 replies; 34+ messages in thread From: Francisco Jerez @ 2010-01-21 17:39 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom [-- Attachment #1.1.1: Type: text/plain, Size: 588 bytes --] Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> writes: >> The card also keeps some internal FIFO caches, so it seems to me that >> getting that safe of races and efficient at the same time could be a bit >> non-trivial. > > Are they flushable? > It seems this *might* do the job: > if (!pfifo->cache_flush(dev)) > return; > pfifo->reassign(dev, false); > pfifo->cache_flush(dev); > pfifo->cache_pull(dev, false); > > > pfifo->cache_pull(dev, true); > pfifo->reassign(dev, true); Yeah, that would probably do the job, but it's likely to be too slow to be useful. [-- Attachment #1.2: Type: application/pgp-signature, Size: 197 bytes --] [-- Attachment #2: Type: text/plain, Size: 181 bytes --] _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <87pr53ecwd.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org> 2010-01-21 15:36 ` Luca Barbieri @ 2010-01-21 15:39 ` Maarten Maathuis [not found] ` <6d4bc9fc1001210739r2bb8b4c4i18590376a1628d82-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 34+ messages in thread From: Maarten Maathuis @ 2010-01-21 15:39 UTC (permalink / raw) To: Francisco Jerez Cc: Luca Barbieri, Thomas Hellstrom, airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On Thu, Jan 21, 2010 at 3:44 PM, Francisco Jerez <currojerez-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org> wrote: > Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> writes: > >>> Nvidia cards have a synchronization primitive that could be used to >>> synchronize several FIFOs in hardware (AKA semaphores, see [1] for an >>> example). >> >> Does this operate wholly on the GPU on all nVidia cards? >> >> It seems that at least on some GPUs this will trigger "software >> methods" that are basically a way for the GPU to trigger an interrupt >> and stop the FIFO until the CPU handles the interrupt and restarts it. >> Yes this triggers a few software methods, nvidia hardware is a bit lacking in this area. > Yes, it could be handled without software methods, Maarten (CC'ed) was > looking into this, IIRC. Doing it without software methods means you need to have a semaphore that "exists" in the cpu domain and therefore cannot be used again until the cpu is sure it's done. So that will probably require a rotating queue of several semaphores and a fallback to cpu waiting. It should not be that hard to prototype in kernel, although we currently assume notifiers (little chunks of memory that amongst other things are used by semaphore) to be in channel private memory, so while in kernel it's probably doable, there will need to be api changes for the userspace side of it. Since the respective chunk of memory needs to have a dma object on more than one channel. This should probably be dealt with in combination with calims desire to be able to have dma objects for random buffer objects for use with gpu queries. It doesn't seem impossible, but it won't exactly be trivial either. The sanest model would probably involve an ioctl that takes two channels and does whatever is needed to make it synchronised. But given the recent discussion/noise about doubts regarding our mixed userspace/kernel model, we should consider this carefully. That said, i doubt it would be sanely possible with a full userspace model like the one nvidia uses. > >> Also, is there a way on nVidia cards to get interrupts on fences, but >> only where the fence sequence number is higher than a dynamically set >> value? (so that one could sleep for fence X without getting an >> interrupt for every single fence before that) >> >> If not, it could possibly be hacked around by reading from a DMA >> object at the address of the fence sequence number and then resizing >> the DMA object so that addresses from a certain point on would trigger >> a protection fault interrupt. > > I don't think you can safely modify a DMA object without stalling the > card completely, but i think you could use e.g. PGRAPH NOTIFY interrupts > and disable them by flipping a bit in PGRAPH when you stop caring about > them. > ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <6d4bc9fc1001210739r2bb8b4c4i18590376a1628d82-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <6d4bc9fc1001210739r2bb8b4c4i18590376a1628d82-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-21 15:56 ` Luca Barbieri [not found] ` <ff13bc9a1001210756r1e627146w89fb2138ca77e6b5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 34+ messages in thread From: Luca Barbieri @ 2010-01-21 15:56 UTC (permalink / raw) To: Maarten Maathuis Cc: Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom, airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org > Doing it without software methods means you need to have a semaphore > that "exists" in the cpu domain and therefore cannot be used again > until the cpu is sure it's done. So that will probably require a > rotating queue of several semaphores and a fallback to cpu waiting. Why would it need a semaphore in CPU domain? Couldn't it work this way: 1. Allocate a BO in the kernel 2. Emit on one channel a "notify" request on that BO 3. Emit on the other channel a "wait" request on that BO 4. Emit a fence on the wait channel 5. Use the existing delayed-destroy mechanism to get rid of the BO once the wait channel fence expires Then the GPU would presumably switch away from the "waiting" context and not reexecute it until the notify happens, or something similar. The problem is, of course, whether the GPU supports the "wait" request. ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <ff13bc9a1001210756r1e627146w89fb2138ca77e6b5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001210756r1e627146w89fb2138ca77e6b5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-01-21 16:02 ` Maarten Maathuis 0 siblings, 0 replies; 34+ messages in thread From: Maarten Maathuis @ 2010-01-21 16:02 UTC (permalink / raw) To: Luca Barbieri Cc: Thomas Hellstrom, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom, airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org On Thu, Jan 21, 2010 at 4:56 PM, Luca Barbieri <luca-Ukmtq+NC3rhBHFWNQifrYwC/G2K4zDHf@public.gmane.org> wrote: >> Doing it without software methods means you need to have a semaphore >> that "exists" in the cpu domain and therefore cannot be used again >> until the cpu is sure it's done. So that will probably require a >> rotating queue of several semaphores and a fallback to cpu waiting. > > Why would it need a semaphore in CPU domain? > Couldn't it work this way: > 1. Allocate a BO in the kernel Who will set the initial "not-done" value? > 2. Emit on one channel a "notify" request on that BO > 3. Emit on the other channel a "wait" request on that BO > 4. Emit a fence on the wait channel > 5. Use the existing delayed-destroy mechanism to get rid of the BO > once the wait channel fence expires > > Then the GPU would presumably switch away from the "waiting" context > and not reexecute it until the notify happens, or something similar. > > The problem is, of course, whether the GPU supports the "wait" request. > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete [not found] ` <ff13bc9a1001210540t7dc2b7a7p38359ca82b8b3eb4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2010-01-21 14:07 ` Francisco Jerez @ 2010-01-21 14:23 ` Thomas Hellstrom 1 sibling, 0 replies; 34+ messages in thread From: Thomas Hellstrom @ 2010-01-21 14:23 UTC (permalink / raw) To: Luca Barbieri Cc: airlied-cv59FeDIM0c@public.gmane.org, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, dri-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Thomas Hellstrom Luca Barbieri wrote: >> At a first glance: >> >> 1) We probably *will* need a delayed destroyed workqueue to avoid wasting >> memory that otherwise should be freed to the system. At the very least, the >> delayed delete process should optionally be run by a system shrinker. >> > You are right. For VRAM we don't care since we are the only user, > while for system backed memory some delayed destruction will be > needed. > The logical extension of the scheme would be for the Linux page > allocator/swapper to check for TTM buffers to destroy when it would > otherwise shrink caches, try to swap and/or wait on swap to happen. > Not sure whether there are existing hooks for this or where exactly to > hook this code. > > I think there are existing hooks for this, but I haven't yet figured out how they work. >> 2) Fences in TTM are currently not necessarily strictly ordered, and >> sequence numbers are hidden from the bo code. This means, for a given FIFO, >> fence sequence 3 may expire before fence sequence 2, depending on the usage >> of the buffer. >> > > My definition of "channel" (I sometimes used FIFO incorrectly as a > synonym of that) is exactly a set of fences that are strictly ordered. > If the card has multiple HW engines, each is considered a different > channel (so that a channel becomes a (fifo, engine) pair). > > We may need however to add the concept of a "sync domain" that would > be a set of channels that support on-GPU synchronization against each > other. > This would model hardware where channels with the same FIFO can be > synchronized together but those with different FIFOs don't, and also > multi-core GPUs where synchronization might be available only inside > each core and not across cores. > > To sum it up, a GPU consists of a set of sync domains, each consisting > of a set of channels, each consisting of a sequence of fences, with > the following rules: > 1. Fences within the same channel expire in order > 2. If channels A and B belong to the same sync domain, it's possible > to emit a fence on A that is guaranteed to expire after an arbitrary > fence of B > > Whether channels have the same FIFO or not is essentially a driver > implementation detail, and what TTM cares about is if they are in the > same sync domain. > > [I just made up "sync domain" here: is there a standard term?] > > This assumes that the "synchronizability" graph is a disjoint union of > complete graphs. Is there any example where it is not so? > Also, does this actually model correctly Poulsbo, or am I wrong? > Let me give some usage examples for intel, and Poulsbo. The synchronization for these are / were modeled with fences with "stages", (called fence_types). The signaling of one stage set a bit in a bit mask. A bo was considered idle when (bo->required_stages & fence->signaled_stages == bo->required_stages). Intel would have had the stages command_submitted, read_flushed, write_flushed. A command buffer would have been idle if the fence had signaled command_submitted, whereas a render buffer would've been idle on command_submitted | write_flushed. A write flush would've had to be issued separately. Typically when the buffer was put on the delayed delete queue or when the bo_idle function was called, which really minimized the amount of needed flushing. An executed write flush would signal write_flushed on all fences whose sequence had passed. An explicit ordering of fences here would've meant queueing a read-write-flush in between them. For Poulsbo, the GPU was modeled with a programming stage, a binner stage, a rasterizer stage and a feedback stage. Each stage of completion set a signaled_stage bit in the fence. Now, a vertex buffer would be idle when the fence signaled binner_done, (which could happen well before rasterization complete of the previous command sequence). It's true that one could use "feedback done" for all buffers but a low-memory-footprint system quickly ran out of vertex buffer space, so quick reuse of those were essential. So to summarize, the usage of the buffer together with the signaled state of the fence object really determines whether the buffer is idle. I think in your model, the Poulsbo GPU would've been a sync domain and the "programmer", binner, rasterizer and "feedback engine" would've been separate channels. The Intel case would perhaps have been a bit more tricky. /Thomas > Note that we could use CPU mediation more than we currently do. > For instance now Nouveau, to do inter-channel synchronization, simply > waits on the fence with the CPU immediately synchronously, while it > could instead queue the commands in software, and with an > interrupt/delayed mechanism submit them to hardware once the fence to > be waited for is expired. > ^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2010-01-25 20:51 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-01-18 18:47 [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete Luca Barbieri
2010-01-18 19:40 ` Thomas Hellstrom
[not found] ` <4B54B949.9010906-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
2010-01-18 22:33 ` Luca Barbieri
[not found] ` <ff13bc9a1001181433v1694e681l9f7ce9d880132dc3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-20 11:28 ` Thomas Hellstrom
[not found] ` <4B56E8EE.8090706-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>
2010-01-20 12:11 ` Thomas Hellstrom
2010-01-20 12:11 ` Thomas Hellstrom
[not found] ` <4B56F308.5090603-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>
2010-01-20 12:16 ` Thomas Hellstrom
2010-01-20 12:16 ` Thomas Hellstrom
[not found] ` <4B56F401.8070700-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
2010-01-20 19:22 ` Luca Barbieri
[not found] ` <ff13bc9a1001201122y110fb003k704bc6d05d2aea07-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-20 20:16 ` Thomas Hellstrom
[not found] ` <4B5764BA.7080801-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
2010-01-20 20:45 ` Luca Barbieri
[not found] ` <ff13bc9a1001201245g6ee25219q851b7989968f4c7b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-20 20:58 ` Luca Barbieri
[not found] ` <ff13bc9a1001201258i37ee7354gb305aa98afae3716-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-20 21:11 ` Thomas Hellstrom
2010-01-20 21:04 ` Thomas Hellstrom
[not found] ` <4B576FF5.9040907-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>
2010-01-21 3:49 ` Luca Barbieri
[not found] ` <ff13bc9a1001201949l4691f202v2a2874b9cef86f37-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-21 12:29 ` Jerome Glisse
[not found] ` <20100121122920.GB3837-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2010-01-21 12:59 ` Thomas Hellstrom
[not found] ` <4B584FAE.8040801-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>
2010-01-25 8:14 ` Jerome Glisse
[not found] ` <20100125081444.GA23124-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2010-01-25 20:51 ` Thomas Hellstrom
2010-01-21 15:14 ` Luca Barbieri
[not found] ` <ff13bc9a1001210714m34c3976etc7680f056cb55453-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-25 8:20 ` Jerome Glisse
2010-01-21 12:53 ` Thomas Hellstrom
[not found] ` <4B584E48.8020806-4+hqylr40dJg9hUCZPvPmw@public.gmane.org>
2010-01-21 13:40 ` Luca Barbieri
[not found] ` <ff13bc9a1001210540t7dc2b7a7p38359ca82b8b3eb4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-21 14:07 ` Francisco Jerez
[not found] ` <87vdeveekk.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org>
2010-01-21 14:17 ` Luca Barbieri
[not found] ` <ff13bc9a1001210617qe37ab7at949d545216693608-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-21 14:44 ` Francisco Jerez
[not found] ` <87pr53ecwd.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org>
2010-01-21 15:36 ` Luca Barbieri
[not found] ` <ff13bc9a1001210736k71740417r7b129c70374fece3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-21 16:18 ` Francisco Jerez
[not found] ` <87y6jrbfef.fsf-sGOZH3hwPm2sTnJN9+BGXg@public.gmane.org>
2010-01-21 16:30 ` Luca Barbieri
[not found] ` <ff13bc9a1001210830g2653f672r918f8cb90cbf6170-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-21 17:39 ` Francisco Jerez
2010-01-21 15:39 ` Maarten Maathuis
[not found] ` <6d4bc9fc1001210739r2bb8b4c4i18590376a1628d82-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-21 15:56 ` Luca Barbieri
[not found] ` <ff13bc9a1001210756r1e627146w89fb2138ca77e6b5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-21 16:02 ` Maarten Maathuis
2010-01-21 14:23 ` Thomas Hellstrom
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.