dri-devel.lists.freedesktop.org archive mirror
* DRM scheduler issue with spsc_queue
@ 2025-06-13 17:01 Matthew Brost
  2025-06-13 17:26 ` Christian König
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Brost @ 2025-06-13 17:01 UTC (permalink / raw)
  To: dakr, pstanner, christian.koenig, simona.vetter, dri-devel

All,

After about six hours of debugging, I found an issue in a fairly
aggressive test case involving the DRM scheduler function
drm_sched_entity_push_job. The problem is that spsc_queue_push does not
correctly return "first" on a job push, causing the queue to fail to run
even though it is ready.

I know this sounds a bit insane, but I assure you it’s happening and is
quite reproducible. I'm working off a pull of drm-tip from a few days
ago + some local change to Xe's memory management, with a Kconfig that
has no debug options enabled. I’m not sure if there’s a bug somewhere in
the kernel related to barriers or atomics in the recent drm-tip. That
seems unlikely—but just as unlikely is that this bug has existed for a
while without being triggered until now.

I've verified the hang in several ways: using printks, adding a debugfs
entry to manually kick the DRM scheduler queue when it's stuck (which
gets it unstuck), and replacing the SPSC queue with one guarded by a
spinlock (which completely fixes the issue).

That last point raises a big question: why are we using a convoluted
lockless algorithm here instead of a simple spinlock? This isn't a
critical path—and even if it were, how much performance benefit are we
actually getting from the lockless design? Probably very little.

Any objections to me rewriting this around a spinlock-based design? My
head hurts from chasing this bug, and I feel like this is the best way
forward rather than wasting more time here.

Matt

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: DRM scheduler issue with spsc_queue
  2025-06-13 17:01 DRM scheduler issue with spsc_queue Matthew Brost
@ 2025-06-13 17:26 ` Christian König
  2025-06-13 19:11   ` Matthew Brost
  0 siblings, 1 reply; 7+ messages in thread
From: Christian König @ 2025-06-13 17:26 UTC (permalink / raw)
  To: Matthew Brost, dakr, pstanner, simona.vetter, dri-devel

On 6/13/25 19:01, Matthew Brost wrote:
> All,
> 
> After about six hours of debugging, I found an issue in a fairly
> aggressive test case involving the DRM scheduler function
> drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> correctly return first on a job push, causing the queue to fail to run
> even though it is ready.
> 
> I know this sounds a bit insane, but I assure you it’s happening and is
> quite reproducible. I'm working off a pull of drm-tip from a few days
> ago + some local change to Xe's memory management, with a Kconfig that
> has no debug options enabled. I’m not sure if there’s a bug somewhere in
> the kernel related to barriers or atomics in the recent drm-tip. That
> seems unlikely—but just as unlikely is that this bug has existed for a
> while without being triggered until now.
> 
> I've verified the hang in several ways: using printks, adding a debugfs
> entry to manually kick the DRM scheduler queue when it's stuck (which
> gets it unstuck), and replacing the SPSC queue with one guarded by a
> spinlock (which completely fixes the issue).
> 
> That last point raises a big question: why are we using a convoluted
> lockless algorithm here instead of a simple spinlock? This isn't a
> critical path—and even if it were, how much performance benefit are we
> actually getting from the lockless design? Probably very little.
> 
> Any objections to me rewriting this around a spinlock-based design? My
> head hurts from chasing this bug, and I feel like this is the best way
> forward rather than wasting more time here.

Well, the spsc queue is standard code I have used in previous projects, and we have never experienced any issue with it.

This is a massively performance critical code path and we need to make sure that we move as few cache lines as possible between the producer and consumer side.

That was the reason why we replaced the spinlock with the spsc queue before.

Regards,
Christian.

> 
> Matt



* Re: DRM scheduler issue with spsc_queue
  2025-06-13 17:26 ` Christian König
@ 2025-06-13 19:11   ` Matthew Brost
  2025-06-16  7:24     ` Christian König
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Brost @ 2025-06-13 19:11 UTC (permalink / raw)
  To: Christian König; +Cc: dakr, pstanner, simona.vetter, dri-devel

On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
> On 6/13/25 19:01, Matthew Brost wrote:
> > All,
> > 
> > After about six hours of debugging, I found an issue in a fairly
> > aggressive test case involving the DRM scheduler function
> > drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> > correctly return first on a job push, causing the queue to fail to run
> > even though it is ready.
> > 
> > I know this sounds a bit insane, but I assure you it’s happening and is
> > quite reproducible. I'm working off a pull of drm-tip from a few days
> > ago + some local change to Xe's memory management, with a Kconfig that
> > has no debug options enabled. I’m not sure if there’s a bug somewhere in
> > the kernel related to barriers or atomics in the recent drm-tip. That
> > seems unlikely—but just as unlikely is that this bug has existed for a
> > while without being triggered until now.
> > 
> > I've verified the hang in several ways: using printks, adding a debugfs
> > entry to manually kick the DRM scheduler queue when it's stuck (which
> > gets it unstuck), and replacing the SPSC queue with one guarded by a
> > spinlock (which completely fixes the issue).
> > 
> > That last point raises a big question: why are we using a convoluted
> > lockless algorithm here instead of a simple spinlock? This isn't a
> > critical path—and even if it were, how much performance benefit are we
> > actually getting from the lockless design? Probably very little.
> > 
> > Any objections to me rewriting this around a spinlock-based design? My
> > head hurts from chasing this bug, and I feel like this is the best way
> > forward rather than wasting more time here.
> 
> Well the spsc queue is some standard code I used in previous projects and we have never experienced any issue with that.
> 

I can quite clearly see this not working on my test setup. I can (kinda)
explain it a bit more.

Look at this code:

 65 static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
 66 {
 67         struct spsc_node **tail;
 68
 69         node->next = NULL;
 70
 71         preempt_disable();
 72
 73         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
 74         WRITE_ONCE(*tail, node);
 75         atomic_inc(&queue->job_count);
 76
 77         /*
 78          * In case of first element verify new node will be visible to the consumer
 79          * thread when we ping the kernel thread that there is new work to do.
 80          */
 81         smp_wmb();
 82
 83         preempt_enable();
 84
 85         return tail == &queue->head;
 86 }

Between the execution of atomic_long_xchg and atomic_inc, the submission
worker can dequeue the previous last job (dropping job_count to zero),
run it, observe that job_count == 0 (drm_sched_entity_is_ready), and go
to sleep. Meanwhile, this push has already swapped in the new tail, so
it returns false (i.e., not the first job), and the caller does not
requeue the submit worker.

The race window here is tiny, and I would have thought preempt_disable
makes it impossible (unless preempt_disable is broken in drm-tip from a
few days ago), so I'm still a bit perplexed by it. But again, I assure
you this is, in fact, happening on my test setup. My test case is an
SVM one, which involves all sorts of CPU/GPU faults, migrations, etc.;
I'm not sure whether that contributes. I can show this race occurring
in dmesg if you need proof.

The change below fixes the problem. I'm going to post it to unblock
myself.

diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
index 125f096c88cb..ee9df8cc67b7 100644
--- a/include/drm/spsc_queue.h
+++ b/include/drm/spsc_queue.h
@@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
 
         preempt_disable();
 
+        atomic_inc(&queue->job_count);
+        smp_mb__after_atomic();
+
         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
         WRITE_ONCE(*tail, node);
-        atomic_inc(&queue->job_count);
 
         /*
          * In case of first element verify new node will be visible to the consumer
> This is a massively performance critical code path and we need to make sure that we move as few cache lines as possible between the producer and consumer side.
> 

Well, I can't say I buy this argument. If you can show me any real
workload where using a spinlock here vs. going lockless makes a
measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
violates the DRM locking guidelines we're all supposed to follow… But
anyway, I'll go ahead with the fix above.

Matt

> That was the reason why we replaced the spinlock with the spsc queue before.
> 
> Regards,
> Christian.
> 
> > 
> > Matt
> 


* Re: DRM scheduler issue with spsc_queue
  2025-06-13 19:11   ` Matthew Brost
@ 2025-06-16  7:24     ` Christian König
  2025-06-16  7:42       ` Matthew Brost
  2025-06-16 10:42       ` Simona Vetter
  0 siblings, 2 replies; 7+ messages in thread
From: Christian König @ 2025-06-16  7:24 UTC (permalink / raw)
  To: Matthew Brost; +Cc: dakr, pstanner, simona.vetter, dri-devel

On 6/13/25 21:11, Matthew Brost wrote:
> On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
>> On 6/13/25 19:01, Matthew Brost wrote:
>>> All,
>>>
>>> After about six hours of debugging, I found an issue in a fairly
>>> aggressive test case involving the DRM scheduler function
>>> drm_sched_entity_push_job. The problem is that spsc_queue_push does not
>>> correctly return first on a job push, causing the queue to fail to run
>>> even though it is ready.
>>>
>>> I know this sounds a bit insane, but I assure you it’s happening and is
>>> quite reproducible. I'm working off a pull of drm-tip from a few days
>>> ago + some local change to Xe's memory management, with a Kconfig that
>>> has no debug options enabled. I’m not sure if there’s a bug somewhere in
>>> the kernel related to barriers or atomics in the recent drm-tip. That
>>> seems unlikely—but just as unlikely is that this bug has existed for a
>>> while without being triggered until now.
>>>
>>> I've verified the hang in several ways: using printks, adding a debugfs
>>> entry to manually kick the DRM scheduler queue when it's stuck (which
>>> gets it unstuck), and replacing the SPSC queue with one guarded by a
>>> spinlock (which completely fixes the issue).
>>>
>>> That last point raises a big question: why are we using a convoluted
>>> lockless algorithm here instead of a simple spinlock? This isn't a
>>> critical path—and even if it were, how much performance benefit are we
>>> actually getting from the lockless design? Probably very little.
>>>
>>> Any objections to me rewriting this around a spinlock-based design? My
>>> head hurts from chasing this bug, and I feel like this is the best way
>>> forward rather than wasting more time here.
>>
>> Well the spsc queue is some standard code I used in previous projects and we have never experienced any issue with that.
>>
> 
> I can quite clearly see this not working on my test setup. I can (kinda)
> explain it a bit more.
> 
> Look at this code:
> 
>  65 static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
>  66 {
>  67         struct spsc_node **tail;
>  68
>  69         node->next = NULL;
>  70
>  71         preempt_disable();
>  72
>  73         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
>  74         WRITE_ONCE(*tail, node);
>  75         atomic_inc(&queue->job_count);
>  76
>  77         /*
>  78          * In case of first element verify new node will be visible to the consumer
>  79          * thread when we ping the kernel thread that there is new work to do.
>  80          */
>  81         smp_wmb();
>  82
>  83         preempt_enable();
>  84
>  85         return tail == &queue->head;
>  86 }
> 
> Between the execution of atomic_long_xchg and atomic_inc, the submission
> worker could dequeue the previous last job, reducing job_count to zero,
> run the job, observe that job_count == 0 (drm_sched_entity_is_ready),
> and then go to sleep. This function has swapped for the previous tail,
> so it returns false (i.e., not the first, and does not requeue the
> submit worker at caller).
> 
> The race window here is tiny, and I would think preempt_disable would
> make it impossible (or preempt_disable is broken drm-tip a few days
> ago), so I’m still a bit perplexed by it. But again, I assure you this
> is, in fact, happening on my test setup. My test case is an SVM one,
> which involves all sorts of CPU/GPU faults, migrations, etc. Not sure if
> that contributes. I can show this race occurring in dmesg if you need
> proof.
> 
> The change below fixes the problem. I'm going to post it to unblock
> myself.
> 
> diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
> index 125f096c88cb..ee9df8cc67b7 100644
> --- a/include/drm/spsc_queue.h
> +++ b/include/drm/spsc_queue.h
> @@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue
> *queue, struct spsc_node *n
> 
>         preempt_disable();
> 
> +       atomic_inc(&queue->job_count);
> +       smp_mb__after_atomic();
> +
>         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail,
> (long)&node->next);
>         WRITE_ONCE(*tail, node);
> -       atomic_inc(&queue->job_count);
> 
>         /*
>          * In case of first element verify new node will be visible to
>          * the consumer

I need to double check the code path once more, but it could certainly be that we got this wrong.

The code originated in userspace, where atomics are also memory barriers. Sima had to point out that we have to manually add smp_mb()s here to make it work.

>> This is a massively performance critical code path and we need to make sure that we move as few cache lines as possible between the producer and consumer side.
>>
> 
> Well, I can't say I buy this argument. If you can show me any real
> workload where using a spinlock here vs. going lockless makes a
> measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
> violates the DRM locking guidelines we're all supposed to follow… But
> anyway, I'll go ahead with the fix above.

I probably have to take that back, see another comment below.

> 
> Matt
> 
>> That was the reason why we replaced the spinlock with the spsc queue before.

Well, we replaced spinlock+kfifo with spsc for the round-robin peeking implementation.

Background is that because of the spinlock even a "peek" transfers the cache line as a write to the scheduler thread, and when you do that "peek" even on the idle entities, it doesn't scale at all.

Since we now have the FIFO implementation and avoid peeking into the job queue of idle entities all the time, that might not suck that badly any more.

Regards,
Christian. 

>>
>> Regards,
>> Christian.
>>
>>>
>>> Matt
>>



* Re: DRM scheduler issue with spsc_queue
  2025-06-16  7:24     ` Christian König
@ 2025-06-16  7:42       ` Matthew Brost
  2025-06-16 10:42       ` Simona Vetter
  1 sibling, 0 replies; 7+ messages in thread
From: Matthew Brost @ 2025-06-16  7:42 UTC (permalink / raw)
  To: Christian König; +Cc: dakr, pstanner, simona.vetter, dri-devel

On Mon, Jun 16, 2025 at 09:24:38AM +0200, Christian König wrote:
> On 6/13/25 21:11, Matthew Brost wrote:
> > On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
> >> On 6/13/25 19:01, Matthew Brost wrote:
> >>> All,
> >>>
> >>> After about six hours of debugging, I found an issue in a fairly
> >>> aggressive test case involving the DRM scheduler function
> >>> drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> >>> correctly return first on a job push, causing the queue to fail to run
> >>> even though it is ready.
> >>>
> >>> I know this sounds a bit insane, but I assure you it’s happening and is
> >>> quite reproducible. I'm working off a pull of drm-tip from a few days
> >>> ago + some local change to Xe's memory management, with a Kconfig that
> >>> has no debug options enabled. I’m not sure if there’s a bug somewhere in
> >>> the kernel related to barriers or atomics in the recent drm-tip. That
> >>> seems unlikely—but just as unlikely is that this bug has existed for a
> >>> while without being triggered until now.
> >>>
> >>> I've verified the hang in several ways: using printks, adding a debugfs
> >>> entry to manually kick the DRM scheduler queue when it's stuck (which
> >>> gets it unstuck), and replacing the SPSC queue with one guarded by a
> >>> spinlock (which completely fixes the issue).
> >>>
> >>> That last point raises a big question: why are we using a convoluted
> >>> lockless algorithm here instead of a simple spinlock? This isn't a
> >>> critical path—and even if it were, how much performance benefit are we
> >>> actually getting from the lockless design? Probably very little.
> >>>
> >>> Any objections to me rewriting this around a spinlock-based design? My
> >>> head hurts from chasing this bug, and I feel like this is the best way
> >>> forward rather than wasting more time here.
> >>
> >> Well the spsc queue is some standard code I used in previous projects and we have never experienced any issue with that.
> >>
> > 
> > I can quite clearly see this not working on my test setup. I can (kinda)
> > explain it a bit more.
> > 
> > Look at this code:
> > 
> >  65 static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> >  66 {
> >  67         struct spsc_node **tail;
> >  68
> >  69         node->next = NULL;
> >  70
> >  71         preempt_disable();
> >  72
> >  73         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> >  74         WRITE_ONCE(*tail, node);
> >  75         atomic_inc(&queue->job_count);
> >  76
> >  77         /*
> >  78          * In case of first element verify new node will be visible to the consumer
> >  79          * thread when we ping the kernel thread that there is new work to do.
> >  80          */
> >  81         smp_wmb();
> >  82
> >  83         preempt_enable();
> >  84
> >  85         return tail == &queue->head;
> >  86 }
> > 
> > Between the execution of atomic_long_xchg and atomic_inc, the submission
> > worker could dequeue the previous last job, reducing job_count to zero,
> > run the job, observe that job_count == 0 (drm_sched_entity_is_ready),
> > and then go to sleep. This function has swapped for the previous tail,
> > so it returns false (i.e., not the first, and does not requeue the
> > submit worker at caller).
> > 
> > The race window here is tiny, and I would think preempt_disable would
> > make it impossible (or preempt_disable is broken drm-tip a few days
> > ago), so I’m still a bit perplexed by it. But again, I assure you this
> > is, in fact, happening on my test setup. My test case is an SVM one,
> > which involves all sorts of CPU/GPU faults, migrations, etc. Not sure if
> > that contributes. I can show this race occurring in dmesg if you need
> > proof.
> > 
> > The change below fixes the problem. I'm going to post it to unblock
> > myself.
> > 
> > diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
> > index 125f096c88cb..ee9df8cc67b7 100644
> > --- a/include/drm/spsc_queue.h
> > +++ b/include/drm/spsc_queue.h
> > @@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue
> > *queue, struct spsc_node *n
> > 
> >         preempt_disable();
> > 
> > +       atomic_inc(&queue->job_count);
> > +       smp_mb__after_atomic();
> > +
> >         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail,
> > (long)&node->next);
> >         WRITE_ONCE(*tail, node);
> > -       atomic_inc(&queue->job_count);
> > 
> >         /*
> >          * In case of first element verify new node will be visible to
> >          * the consumer
> 
> I need to double check the code path once more, but that we got this wrong while could certainly be.
> 
> The code originated in userspace and atomics are also memory barriers there. Sima had to point out that we have to manually add smp_mb()s here to make it work. 
> 
> >> This is a massively performance critical code path and we need to make sure that we move as few cache lines as possible between the producer and consumer side.
> >>
> > 
> > Well, I can't say I buy this argument. If you can show me any real
> > workload where using a spinlock here vs. going lockless makes a
> > measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
> > violates the DRM locking guidelines we're all supposed to follow… But
> > anyway, I'll go ahead with the fix above.
> 
> I probably have to take that back, see another comment below.
> 
> > 
> > Matt
> > 
> >> That was the reason why we replaced the spinlock with the spsc queue before.
> 
> Well we replaced spinlock+kfifo with spsc the for round robing peeking implementation.
> 
> Background is that because of the spinlock even a "peek" transfers the cache line as write to the sheduler thread, and when you do the "peek" even on the idle entities then that starts to not scale at all.
> 
> Since we now have the FIFO implementation and avoiding peeking all the time into the job queue of idle entities that might as well not suck that badly any more.
> 

I think if the consumer (scheduler) always tried to peek—e.g., by
checking the SPSC tail rather than the job count—the current SPSC code
would work as is.

Matt
 
> Regards,
> Christian. 
> 
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Matt
> >>
> 


* Re: DRM scheduler issue with spsc_queue
  2025-06-16  7:24     ` Christian König
  2025-06-16  7:42       ` Matthew Brost
@ 2025-06-16 10:42       ` Simona Vetter
  2025-06-20 16:45         ` Matthew Brost
  1 sibling, 1 reply; 7+ messages in thread
From: Simona Vetter @ 2025-06-16 10:42 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, dakr, pstanner, simona.vetter, dri-devel

On Mon, Jun 16, 2025 at 09:24:38AM +0200, Christian König wrote:
> On 6/13/25 21:11, Matthew Brost wrote:
> > On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
> >> On 6/13/25 19:01, Matthew Brost wrote:
> >>> All,
> >>>
> >>> After about six hours of debugging, I found an issue in a fairly
> >>> aggressive test case involving the DRM scheduler function
> >>> drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> >>> correctly return first on a job push, causing the queue to fail to run
> >>> even though it is ready.
> >>>
> >>> I know this sounds a bit insane, but I assure you it’s happening and is
> >>> quite reproducible. I'm working off a pull of drm-tip from a few days
> >>> ago + some local change to Xe's memory management, with a Kconfig that
> >>> has no debug options enabled. I’m not sure if there’s a bug somewhere in
> >>> the kernel related to barriers or atomics in the recent drm-tip. That
> >>> seems unlikely—but just as unlikely is that this bug has existed for a
> >>> while without being triggered until now.
> >>>
> >>> I've verified the hang in several ways: using printks, adding a debugfs
> >>> entry to manually kick the DRM scheduler queue when it's stuck (which
> >>> gets it unstuck), and replacing the SPSC queue with one guarded by a
> >>> spinlock (which completely fixes the issue).
> >>>
> >>> That last point raises a big question: why are we using a convoluted
> >>> lockless algorithm here instead of a simple spinlock? This isn't a
> >>> critical path—and even if it were, how much performance benefit are we
> >>> actually getting from the lockless design? Probably very little.
> >>>
> >>> Any objections to me rewriting this around a spinlock-based design? My
> >>> head hurts from chasing this bug, and I feel like this is the best way
> >>> forward rather than wasting more time here.
> >>
> >> Well the spsc queue is some standard code I used in previous projects and we have never experienced any issue with that.
> >>
> > 
> > I can quite clearly see this not working on my test setup. I can (kinda)
> > explain it a bit more.
> > 
> > Look at this code:
> > 
> >  65 static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> >  66 {
> >  67         struct spsc_node **tail;
> >  68
> >  69         node->next = NULL;
> >  70
> >  71         preempt_disable();
> >  72
> >  73         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> >  74         WRITE_ONCE(*tail, node);
> >  75         atomic_inc(&queue->job_count);
> >  76
> >  77         /*
> >  78          * In case of first element verify new node will be visible to the consumer
> >  79          * thread when we ping the kernel thread that there is new work to do.
> >  80          */
> >  81         smp_wmb();
> >  82
> >  83         preempt_enable();
> >  84
> >  85         return tail == &queue->head;
> >  86 }
> > 
> > Between the execution of atomic_long_xchg and atomic_inc, the submission
> > worker could dequeue the previous last job, reducing job_count to zero,
> > run the job, observe that job_count == 0 (drm_sched_entity_is_ready),
> > and then go to sleep. This function has swapped for the previous tail,
> > so it returns false (i.e., not the first, and does not requeue the
> > submit worker at caller).
> > 
> > The race window here is tiny, and I would think preempt_disable would
> > make it impossible (or preempt_disable is broken drm-tip a few days
> > ago), so I’m still a bit perplexed by it. But again, I assure you this
> > is, in fact, happening on my test setup. My test case is an SVM one,
> > which involves all sorts of CPU/GPU faults, migrations, etc. Not sure if
> > that contributes. I can show this race occurring in dmesg if you need
> > proof.
> > 
> > The change below fixes the problem. I'm going to post it to unblock
> > myself.
> > 
> > diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
> > index 125f096c88cb..ee9df8cc67b7 100644
> > --- a/include/drm/spsc_queue.h
> > +++ b/include/drm/spsc_queue.h
> > @@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue
> > *queue, struct spsc_node *n
> > 
> >         preempt_disable();
> > 
> > +       atomic_inc(&queue->job_count);
> > +       smp_mb__after_atomic();
> > +
> >         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail,
> > (long)&node->next);
> >         WRITE_ONCE(*tail, node);
> > -       atomic_inc(&queue->job_count);
> > 
> >         /*
> >          * In case of first element verify new node will be visible to
> >          * the consumer
> 
> I need to double check the code path once more, but that we got this wrong while could certainly be.
> 
> The code originated in userspace and atomics are also memory barriers
> there. Sima had to point out that we have to manually add smp_mb()s here
> to make it work. 

Yeah, last time I looked (which I think was years ago) spsc lacked the
needed barriers, and even more it lacked the comments to explain how the
barriers pair up (since some of them are only implied). I kinda decided
that it's above my skill level to really dig into the entire thing and do
a formal proof and all that, which is imo what we need for lockless code
in the kernel, or it's just going to be busted in strange ways.

Note that thanks to compilers becoming really good, you need barriers even
on x86.
-Sima

> 
> >> This is a massively performance critical code path and we need to make sure that we move as few cache lines as possible between the producer and consumer side.
> >>
> > 
> > Well, I can't say I buy this argument. If you can show me any real
> > workload where using a spinlock here vs. going lockless makes a
> > measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
> > violates the DRM locking guidelines we're all supposed to follow… But
> > anyway, I'll go ahead with the fix above.
> 
> I probably have to take that back, see another comment below.
> 
> > 
> > Matt
> > 
> >> That was the reason why we replaced the spinlock with the spsc queue before.
> 
> Well we replaced spinlock+kfifo with spsc the for round robing peeking implementation.
> 
> Background is that because of the spinlock even a "peek" transfers the cache line as write to the sheduler thread, and when you do the "peek" even on the idle entities then that starts to not scale at all.
> 
> Since we now have the FIFO implementation and avoiding peeking all the time into the job queue of idle entities that might as well not suck that badly any more.
> 
> Regards,
> Christian. 
> 
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Matt
> >>
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: DRM scheduler issue with spsc_queue
  2025-06-16 10:42       ` Simona Vetter
@ 2025-06-20 16:45         ` Matthew Brost
  0 siblings, 0 replies; 7+ messages in thread
From: Matthew Brost @ 2025-06-20 16:45 UTC (permalink / raw)
  To: Simona Vetter; +Cc: Christian König, dakr, pstanner, dri-devel

On Mon, Jun 16, 2025 at 12:42:53PM +0200, Simona Vetter wrote:
> On Mon, Jun 16, 2025 at 09:24:38AM +0200, Christian König wrote:
> > On 6/13/25 21:11, Matthew Brost wrote:
> > > On Fri, Jun 13, 2025 at 07:26:22PM +0200, Christian König wrote:
> > >> On 6/13/25 19:01, Matthew Brost wrote:
> > >>> All,
> > >>>
> > >>> After about six hours of debugging, I found an issue in a fairly
> > >>> aggressive test case involving the DRM scheduler function
> > >>> drm_sched_entity_push_job. The problem is that spsc_queue_push does not
> > >>> correctly return first on a job push, causing the queue to fail to run
> > >>> even though it is ready.
> > >>>
> > >>> I know this sounds a bit insane, but I assure you it’s happening and is
> > >>> quite reproducible. I'm working off a pull of drm-tip from a few days
> > >>> ago + some local change to Xe's memory management, with a Kconfig that
> > >>> has no debug options enabled. I’m not sure if there’s a bug somewhere in
> > >>> the kernel related to barriers or atomics in the recent drm-tip. That
> > >>> seems unlikely—but just as unlikely is that this bug has existed for a
> > >>> while without being triggered until now.
> > >>>
> > >>> I've verified the hang in several ways: using printks, adding a debugfs
> > >>> entry to manually kick the DRM scheduler queue when it's stuck (which
> > >>> gets it unstuck), and replacing the SPSC queue with one guarded by a
> > >>> spinlock (which completely fixes the issue).
> > >>>
> > >>> That last point raises a big question: why are we using a convoluted
> > >>> lockless algorithm here instead of a simple spinlock? This isn't a
> > >>> critical path—and even if it were, how much performance benefit are we
> > >>> actually getting from the lockless design? Probably very little.
> > >>>
> > >>> Any objections to me rewriting this around a spinlock-based design? My
> > >>> head hurts from chasing this bug, and I feel like this is the best way
> > >>> forward rather than wasting more time here.
> > >>
> > >> Well the spsc queue is some standard code I used in previous projects and we have never experienced any issue with that.
> > >>
> > > 
> > > I can quite clearly see this not working on my test setup. I can (kinda)
> > > explain it a bit more.
> > > 
> > > Look at this code:
> > > 
> > >  65 static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> > >  66 {
> > >  67         struct spsc_node **tail;
> > >  68
> > >  69         node->next = NULL;
> > >  70
> > >  71         preempt_disable();
> > >  72
> > >  73         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> > >  74         WRITE_ONCE(*tail, node);
> > >  75         atomic_inc(&queue->job_count);
> > >  76
> > >  77         /*
> > >  78          * In case of first element verify new node will be visible to the consumer
> > >  79          * thread when we ping the kernel thread that there is new work to do.
> > >  80          */
> > >  81         smp_wmb();
> > >  82
> > >  83         preempt_enable();
> > >  84
> > >  85         return tail == &queue->head;
> > >  86 }
> > > 
> > > Between the execution of atomic_long_xchg and atomic_inc, the submission
> > > worker could dequeue the previous last job, reducing job_count to zero,
> > > run the job, observe that job_count == 0 (drm_sched_entity_is_ready),
> > > and then go to sleep. Meanwhile this push has swapped in after the
> > > previous tail, so it returns false (i.e., not the first element), and
> > > the caller therefore does not requeue the submit worker.
> > > 
> > > The race window here is tiny, and I would have thought preempt_disable
> > > would make it impossible (or preempt_disable is broken in drm-tip from
> > > a few days ago), so I’m still a bit perplexed by it. But again, I
> > > assure you this
> > > is, in fact, happening on my test setup. My test case is an SVM one,
> > > which involves all sorts of CPU/GPU faults, migrations, etc. Not sure if
> > > that contributes. I can show this race occurring in dmesg if you need
> > > proof.
> > > 
> > > The change below fixes the problem. I'm going to post it to unblock
> > > myself.
> > > 
> > > diff --git a/include/drm/spsc_queue.h b/include/drm/spsc_queue.h
> > > index 125f096c88cb..ee9df8cc67b7 100644
> > > --- a/include/drm/spsc_queue.h
> > > +++ b/include/drm/spsc_queue.h
> > > @@ -70,9 +70,11 @@ static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
> > > 
> > >         preempt_disable();
> > > 
> > > +       atomic_inc(&queue->job_count);
> > > +       smp_mb__after_atomic();
> > > +
> > >         tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
> > >         WRITE_ONCE(*tail, node);
> > > -       atomic_inc(&queue->job_count);
> > > 
> > >         /*
> > >          * In case of first element verify new node will be visible to the consumer
> > 
> > I need to double-check the code path once more, but it could certainly be that we got this wrong.
> > 
> > The code originated in userspace and atomics are also memory barriers
> > there. Sima had to point out that we have to manually add smp_mb()s here
> > to make it work. 
> 
> Yeah last time I looked (which I think was years ago) spsc lacked the
> needed barriers, and lacked even more the comments to explain how the
> barriers match up (since some are there as implied ones). I kinda decided
> that it's above my skill level to really dig into the entire thing and do
> a formal proof and all that, which is imo what we need for lockless code
> in the kernel or it's just going to be busted in strange ways.
> 

Yeah, it took me a while to figure out what the SPSC queue was doing too. I
can't confirm it is correct after my change either.

> Note that thanks to compilers becoming really good you need barriers even
> on x86.

Ping on this patch [1]. Any objection to me merging?

Matt

[1] https://patchwork.freedesktop.org/series/150260/

> -Sima
> 
> > 
> > >> This is a massively performance-critical code path and we need to make sure that we move as few cache lines as possible between the producer and consumer side.
> > >>
> > > 
> > > Well, I can't say I buy this argument. If you can show me any real
> > > workload where using a spinlock here vs. going lockless makes a
> > > measurable impact, I'd eat my hat. Also, FWIW, this code seemingly
> > > violates the DRM locking guidelines we're all supposed to follow… But
> > > anyway, I'll go ahead with the fix above.
> > 
> > I probably have to take that back, see another comment below.
> > 
> > > 
> > > Matt
> > > 
> > >> That was the reason why we replaced the spinlock with the spsc queue before.
> > 
> > Well, we replaced spinlock+kfifo with the spsc queue for the round-robin peeking implementation.
> > 
> > Background is that because of the spinlock even a "peek" transfers the cache line as a write to the scheduler thread, and when you do that "peek" even on idle entities it doesn't scale at all.
> > 
> > Since we now have the FIFO implementation and avoid peeking into the job queue of idle entities all the time, that might not hurt as badly any more.
> > 
> > Regards,
> > Christian. 
> > 
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > >>>
> > >>> Matt
> > >>
> > 
> 
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



Thread overview: 7+ messages
2025-06-13 17:01 DRM scheduler issue with spsc_queue Matthew Brost
2025-06-13 17:26 ` Christian König
2025-06-13 19:11   ` Matthew Brost
2025-06-16  7:24     ` Christian König
2025-06-16  7:42       ` Matthew Brost
2025-06-16 10:42       ` Simona Vetter
2025-06-20 16:45         ` Matthew Brost
