From: Matthew Brost <matthew.brost@intel.com>
To: Tejun Heo <tj@kernel.org>
Cc: <htejun@gmail.com>, Lucas De Marchi <lucas.demarchi@intel.com>,
<intel-xe@lists.freedesktop.org>,
<thomas.hellstrom@linux.intel.com>, <rodrigo.vivi@intel.com>
Subject: Re: [PATCH 0/3] Rework work queue usage
Date: Fri, 29 Mar 2024 16:52:33 +0000 [thread overview]
Message-ID: <Zgbx0cWZXvpT29ql@DUT025-TGLU.fm.intel.com> (raw)
In-Reply-To: <ZgXHonaCuJgnCJgR@slm.duckdns.org>
On Thu, Mar 28, 2024 at 09:40:18AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 28, 2024 at 07:30:41PM +0000, Matthew Brost wrote:
> > The test creates hundreds of exec queues that can all be preempted in
> > parallel. In the current code this results in each exec queue kicking a
> > worker which is scheduled on the system_unbound_wq; these workers wait
> > and sleep (using a waitqueue) on signaling from another worker. The
> > other worker, which is also scheduled on system_unbound_wq, is
> > processing a queue which interacts with the GPU. I'm thinking the
> > worker which interacts with the hardware gets starved by the waiters,
> > resulting in a deadlock.
> >
> > This patch changes the waiters to use a device-private ordered work
> > queue so we have at most one waiter at a time. Regardless of the new
> > work queue behavior, this is a better design.
> >
> > It is beyond my knowledge whether the old behavior, albeit poorly
> > designed, should still work with the work queue changes in 6.9.
>
> Ah, okay, I think you're hitting the max_active limit which regulates the
> maximum number of work items which can be in flight at any given time. Is
> the test machine a NUMA setup by any chance?
>
Not a NUMA setup in this case.
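
For reference, a rough sketch of the pattern the test hits (the names
below are illustrative, not the actual Xe symbols): each waiter work
item sleeps on a waitqueue until a signaler work item, queued on the
same system_unbound_wq, gets to run. Once the in-flight waiters use up
the workqueue's max_active slots, the signaler can never start:

/*
 * Illustrative only - not the actual Xe code. Hundreds of waiter work
 * items are queued on system_unbound_wq and each sleeps until a
 * signaler work item, also on system_unbound_wq, runs. Once the
 * in-flight waiters reach the workqueue's max_active limit, the
 * signaler can no longer start and everything stalls.
 */
#include <linux/wait.h>
#include <linux/workqueue.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wait);
static bool example_signaled;

static void example_waiter_fn(struct work_struct *w)
{
	/* Occupies a max_active slot for as long as it sleeps. */
	wait_event(example_wait, READ_ONCE(example_signaled));
}

static void example_signaler_fn(struct work_struct *w)
{
	WRITE_ONCE(example_signaled, true);
	wake_up_all(&example_wait);
}
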
> We went through a couple of changes in terms of how max_active is enforced
> on NUMA machines. Originally, we applied it per-node, i.e. if you have
> max_active of 16, each node would be able to have 16 work items in flight
> at any given time. While introducing the affinity stuff, the enforcement
> became per-CPU, i.e. each CPU would get 16 work items, which didn't turn
> out well for some workloads. v6.9 changes it so that it's always applied
> to the whole system for unbound workqueues, whether NUMA or not.
>
> system_unbound_wq is created with max_active set at WQ_MAX_ACTIVE, which
> happens to be 512. If you stuff more concurrent work items into it which
> have inter-dependencies - i.e. completion of one work item depends on
> another - it can deadlock, which isn't too unlikely given a lot of basic
> kernel infra depends on system_unbound_wq.
>
That appears to be what is happening before my series.
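
A rough sketch of what patch 1 does to avoid this (identifiers below
are illustrative, not the exact Xe symbols): the preempt fence waiters
move to a device-private ordered workqueue, so at most one waiter is in
flight at a time and the waiters no longer compete for
system_unbound_wq's global max_active slots:

/*
 * Minimal sketch of the approach, assuming hypothetical names - not
 * the exact Xe code. An ordered workqueue executes at most one work
 * item at a time, so the sleeping waiters cannot exhaust a shared
 * max_active budget.
 */
#include <linux/workqueue.h>

struct example_device {
	struct workqueue_struct *preempt_fence_wq;
};

static int example_device_init(struct example_device *dev)
{
	/* Ordered: at most one work item executes at any given time. */
	dev->preempt_fence_wq =
		alloc_ordered_workqueue("example-preempt-fence", 0);
	if (!dev->preempt_fence_wq)
		return -ENOMEM;
	return 0;
}

static void example_queue_waiter(struct example_device *dev,
				 struct work_struct *work)
{
	/* Waiters no longer share system_unbound_wq with GPU-facing work. */
	queue_work(dev->preempt_fence_wq, work);
}
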
> > > > I think we need some of this information in the commit message in patch
> > > > 1. Because patch 1 simply says it's moving to a device private wq to
> > > > avoid hogging the system one, but the issue is much more serious.
> > > >
> > > > Also, is the "Fixes:" really correct? It seems more like a regression
> > > > from the wq changes and there could be other drivers showing similar
> > > > issues now. But it could also be my lack of understanding of the real
> > > > issue.
> > >
> > > I don't have enough context to tell whether this is a workqueue problem but
> > > if so we should definitely fix workqueue.
> >
> > It is beyond my knowledge whether the old behavior, albeit poorly
> > designed, should still work with the work queue changes in 6.9.
>
> So, yeah, in this case, it makes sense to separate it out to a separate
> workqueue.
>
Agree. Thanks for your time.
Matt
> Thanks.
>
> --
> tejun
Thread overview: 19+ messages
2024-03-28 18:21 [PATCH 0/3] Rework work queue usage Matthew Brost
2024-03-28 18:21 ` [PATCH 1/3] drm/xe: Use ordered wq for preempt fence waiting Matthew Brost
2024-03-28 18:56 ` Matt Roper
2024-03-28 19:00 ` Matthew Brost
2024-04-01 19:37 ` Matthew Brost
2024-03-28 18:21 ` [PATCH 2/3] drm/xe: Use device, gt ordered work queues for resource cleanup Matthew Brost
2024-03-28 18:21 ` [PATCH 3/3] drm/xe: Use ordered WQ for TLB invalidation fences Matthew Brost
2024-03-28 19:02 ` [PATCH 0/3] Rework work queue usage Lucas De Marchi
2024-03-28 19:13 ` htejun
2024-03-28 19:30 ` Matthew Brost
2024-03-28 19:40 ` Tejun Heo
2024-03-29 16:52 ` Matthew Brost [this message]
2024-03-29 1:59 ` ✓ CI.Patch_applied: success for " Patchwork
2024-03-29 2:00 ` ✓ CI.checkpatch: " Patchwork
2024-03-29 2:00 ` ✓ CI.KUnit: " Patchwork
2024-03-29 2:12 ` ✓ CI.Build: " Patchwork
2024-03-29 2:15 ` ✓ CI.Hooks: " Patchwork
2024-03-29 2:16 ` ✓ CI.checksparse: " Patchwork
2024-03-29 2:53 ` ✗ CI.BAT: failure " Patchwork