* [RFC] drm/i915: Move execlists irq handler to a bottom half @ 2016-03-22 17:30 Tvrtko Ursulin 2016-03-23 7:02 ` ✓ Fi.CI.BAT: success for " Patchwork ` (5 more replies) 0 siblings, 6 replies; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-22 17:30 UTC (permalink / raw) To: Intel-gfx From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Doing a lot of work in the interrupt handler introduces huge latencies to the system as a whole. The most dramatic effect can be seen by running an all-engine stress test like igt/gem_exec_nop/all where, when the kernel config is lean enough, the whole system can be brought into multi-second periods of complete non-interactivity. That can look, for example, like this: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] Modules linked in: [redacted for brevity] CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 Workqueue: i915 gen6_pm_rps_work [i915] task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 Call Trace: <IRQ> [<ffffffff8104a716>] irq_exit+0x86/0x90 [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 <EOI> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 [<ffffffff8105ab29>] process_one_work+0x139/0x350 [<ffffffff8105b186>] worker_thread+0x126/0x490 [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 [<ffffffff8105fa64>] kthread+0xc4/0xe0 [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 I could not explain, or find a code path which would explain, a +20 second lockup, but from some instrumentation it was apparent that the interrupts-off proportion of time was between 10-25% under heavy load, which is quite bad. By moving the GT interrupt handling to a tasklet in the simplest way, the problem above disappears completely. Also, gem_latency -n 100 shows 25% better throughput and CPU usage, and 14% better latencies.
I did not find any gains or regressions with Synmark2 or GLbench under light testing. More benchmarking is certainly required. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> --- drivers/gpu/drm/i915/i915_irq.c | 2 +- drivers/gpu/drm/i915/intel_lrc.c | 19 +++++++++++++------ drivers/gpu/drm/i915/intel_lrc.h | 1 - drivers/gpu/drm/i915/intel_ringbuffer.h | 1 + 4 files changed, 15 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 8f3e3309c3ab..e68134347007 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -1324,7 +1324,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift) if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) notify_ring(engine); if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift)) - intel_lrc_irq_handler(engine); + tasklet_schedule(&engine->irq_tasklet); } static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv, diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 6916991bdceb..283426c02f8b 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -538,21 +538,23 @@ get_context_status(struct intel_engine_cs *engine, unsigned int read_pointer, /** * intel_lrc_irq_handler() - handle Context Switch interrupts - * @ring: Engine Command Streamer to handle. + * @engine: Engine Command Streamer to handle. * * Check the unread Context Status Buffers and manage the submission of new * contexts to the ELSP accordingly. */ -void intel_lrc_irq_handler(struct intel_engine_cs *engine) +void intel_lrc_irq_handler(unsigned long data) { + struct intel_engine_cs *engine = (struct intel_engine_cs *)data; struct drm_i915_private *dev_priv = engine->dev->dev_private; u32 status_pointer; unsigned int read_pointer, write_pointer; u32 csb[GEN8_CSB_ENTRIES][2]; unsigned int csb_read = 0, i; unsigned int submit_contexts = 0; + unsigned long flags; - spin_lock(&dev_priv->uncore.lock); + spin_lock_irqsave(&dev_priv->uncore.lock, flags); intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine)); @@ -579,9 +581,9 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) engine->next_context_status_buffer << 8)); intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); - spin_unlock(&dev_priv->uncore.lock); + spin_unlock_irqrestore(&dev_priv->uncore.lock, flags); - spin_lock(&engine->execlist_lock); + spin_lock_irqsave(&engine->execlist_lock, flags); for (i = 0; i < csb_read; i++) { if (unlikely(csb[i][0] & GEN8_CTX_STATUS_PREEMPTED)) { @@ -604,7 +606,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) execlists_context_unqueue(engine); } - spin_unlock(&engine->execlist_lock); + spin_unlock_irqrestore(&engine->execlist_lock, flags); if (unlikely(submit_contexts > 2)) DRM_ERROR("More than two context complete events?\n"); @@ -2020,6 +2022,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine) if (!intel_engine_initialized(engine)) return; + tasklet_kill(&engine->irq_tasklet); + dev_priv = engine->dev->dev_private; if (engine->buffer) { @@ -2093,6 +2097,9 @@ logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine) INIT_LIST_HEAD(&engine->execlist_retired_req_list); spin_lock_init(&engine->execlist_lock); + tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler, + (unsigned long)engine); + 
logical_ring_init_platform_invariants(engine); ret = i915_cmd_parser_init_ring(engine); diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h index a17cb12221ba..0b0853eee91e 100644 --- a/drivers/gpu/drm/i915/intel_lrc.h +++ b/drivers/gpu/drm/i915/intel_lrc.h @@ -118,7 +118,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params, struct drm_i915_gem_execbuffer2 *args, struct list_head *vmas); -void intel_lrc_irq_handler(struct intel_engine_cs *engine); void intel_execlists_retire_requests(struct intel_engine_cs *engine); #endif /* _INTEL_LRC_H_ */ diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h index 221a94627aab..29810cba8a8c 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.h +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h @@ -266,6 +266,7 @@ struct intel_engine_cs { } semaphore; /* Execlists */ + struct tasklet_struct irq_tasklet; spinlock_t execlist_lock; struct list_head execlist_queue; struct list_head execlist_retired_req_list; -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 32+ messages in thread
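The bottom-half pattern the patch adopts reduces to a small amount of boilerplate. The sketch below is purely illustrative -- the names (my_engine, my_bh_func and so on) are hypothetical, not i915 code -- but it uses the same kernel APIs as the diff above: tasklet_init() at setup, tasklet_schedule() from the hard-IRQ handler, and tasklet_kill() at teardown.

/* Minimal illustration of the hard-IRQ + tasklet split; all names are hypothetical. */
#include <linux/interrupt.h>
#include <linux/spinlock.h>

struct my_engine {
	struct tasklet_struct irq_tasklet;
	spinlock_t lock;
	/* submission state, status buffer bookkeeping, ... */
};

/* Bottom half: runs in softirq context with hard interrupts enabled. */
static void my_bh_func(unsigned long data)
{
	struct my_engine *engine = (struct my_engine *)data;

	spin_lock(&engine->lock);
	/* drain context status buffers, submit the next contexts, ... */
	spin_unlock(&engine->lock);
}

/* Called from the hard-IRQ handler: defer the heavy lifting. */
static void my_engine_irq(struct my_engine *engine)
{
	tasklet_schedule(&engine->irq_tasklet);
}

static void my_engine_init(struct my_engine *engine)
{
	spin_lock_init(&engine->lock);
	tasklet_init(&engine->irq_tasklet, my_bh_func, (unsigned long)engine);
}

static void my_engine_cleanup(struct my_engine *engine)
{
	/* wait for a running tasklet and prevent it from being rescheduled */
	tasklet_kill(&engine->irq_tasklet);
}

Note that a tasklet runs with hard interrupts enabled, which is presumably why the diff also switches the handler's spin_lock()/spin_unlock() calls to the irqsave/irqrestore variants: the locks it takes can now be contended by code still running in hard-IRQ context.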
* ✓ Fi.CI.BAT: success for drm/i915: Move execlists irq handler to a bottom half 2016-03-22 17:30 [RFC] drm/i915: Move execlists irq handler to a bottom half Tvrtko Ursulin @ 2016-03-23 7:02 ` Patchwork 2016-03-23 9:07 ` [RFC] " Daniel Vetter ` (4 subsequent siblings) 5 siblings, 0 replies; 32+ messages in thread From: Patchwork @ 2016-03-23 7:02 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: intel-gfx == Series Details == Series: drm/i915: Move execlists irq handler to a bottom half URL : https://patchwork.freedesktop.org/series/4764/ State : success == Summary == Series 4764v1 drm/i915: Move execlists irq handler to a bottom half http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/1/mbox/ Test pm_rpm: Subgroup basic-pci-d3-state: dmesg-warn -> PASS (byt-nuc) bdw-nuci7 total:192 pass:180 dwarn:0 dfail:0 fail:0 skip:12 bdw-ultra total:192 pass:171 dwarn:0 dfail:0 fail:0 skip:21 bsw-nuc-2 total:192 pass:153 dwarn:2 dfail:0 fail:0 skip:37 byt-nuc total:192 pass:157 dwarn:0 dfail:0 fail:0 skip:35 hsw-brixbox total:192 pass:170 dwarn:0 dfail:0 fail:0 skip:22 hsw-gt2 total:192 pass:175 dwarn:0 dfail:0 fail:0 skip:17 ivb-t430s total:192 pass:167 dwarn:0 dfail:0 fail:0 skip:25 skl-i5k-2 total:192 pass:169 dwarn:0 dfail:0 fail:0 skip:23 skl-i7k-2 total:192 pass:169 dwarn:0 dfail:0 fail:0 skip:23 snb-dellxps total:192 pass:158 dwarn:0 dfail:0 fail:0 skip:34 snb-x220t total:192 pass:158 dwarn:0 dfail:0 fail:1 skip:33 Results at /archive/results/CI_IGT_test/Patchwork_1681/ 83ed25fa1b956275542da63eb98dc8fd2291329d drm-intel-nightly: 2016y-03m-22d-15h-20m-55s UTC integration manifest cfad732aadf4f153d718068b3dd8c7980a7d0dd3 drm/i915: Move execlists irq handler to a bottom half _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC] drm/i915: Move execlists irq handler to a bottom half 2016-03-22 17:30 [RFC] drm/i915: Move execlists irq handler to a bottom half Tvrtko Ursulin 2016-03-23 7:02 ` ✓ Fi.CI.BAT: success for " Patchwork @ 2016-03-23 9:07 ` Daniel Vetter 2016-03-23 9:14 ` Chris Wilson 2016-03-23 14:57 ` [RFC v2] " Tvrtko Ursulin ` (3 subsequent siblings) 5 siblings, 1 reply; 32+ messages in thread From: Daniel Vetter @ 2016-03-23 9:07 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: Intel-gfx On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote: > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > Doing a lot of work in the interrupt handler introduces huge > latencies to the system as a whole. > > Most dramatic effect can be seen by running an all engine > stress test like igt/gem_exec_nop/all where, when the kernel > config is lean enough, the whole system can be brought into > multi-second periods of complete non-interactivty. That can > look for example like this: > > NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] > Modules linked in: [redacted for brevity] > CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 > Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 > Workqueue: i915 gen6_pm_rps_work [i915] > task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 > RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 > RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 > RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 > RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 > RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 > R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 > R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 > FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Stack: > 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a > 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 > 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 > Call Trace: > <IRQ> > [<ffffffff8104a716>] irq_exit+0x86/0x90 > [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 > [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 > <EOI> > [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] > [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 > [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] > [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 > [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] > [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] > [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] > [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] > [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 > [<ffffffff8105ab29>] process_one_work+0x139/0x350 > [<ffffffff8105b186>] worker_thread+0x126/0x490 > [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 > [<ffffffff8105fa64>] kthread+0xc4/0xe0 > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 > [<ffffffff8105f9a0>] ? 
kthread_create_on_node+0x170/0x170 > > I could not explain, or find a code path, which would explain > a +20 second lockup, but from some instrumentation it was > apparent the interrupts off proportion of time was between > 10-25% under heavy load which is quite bad. > > By moving the GT interrupt handling to a tasklet in a most > simple way, the problem above disappears completely. > > Also, gem_latency -n 100 shows 25% better throughput and CPU > usage, and 14% better latencies. > > I did not find any gains or regressions with Synmark2 or > GLbench under light testing. More benchmarking is certainly > required. > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> I thought tasklets are considered unpopular nowadays? They still steal cpu time, just have the benefit of not also disabling hard interrupts. There should be mitigation though to offload these softinterrupts to threads. Have you tried to create a threaded interrupt thread just for these pins instead? A bit of boilerplate, but not much using the genirq stuff iirc. Anyway just an idea to play with/benchmark on top of this one here. -Daniel > --- > drivers/gpu/drm/i915/i915_irq.c | 2 +- > drivers/gpu/drm/i915/intel_lrc.c | 19 +++++++++++++------ > drivers/gpu/drm/i915/intel_lrc.h | 1 - > drivers/gpu/drm/i915/intel_ringbuffer.h | 1 + > 4 files changed, 15 insertions(+), 8 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c > index 8f3e3309c3ab..e68134347007 100644 > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -1324,7 +1324,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift) > if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) > notify_ring(engine); > if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift)) > - intel_lrc_irq_handler(engine); > + tasklet_schedule(&engine->irq_tasklet); > } > > static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv, > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c > index 6916991bdceb..283426c02f8b 100644 > --- a/drivers/gpu/drm/i915/intel_lrc.c > +++ b/drivers/gpu/drm/i915/intel_lrc.c > @@ -538,21 +538,23 @@ get_context_status(struct intel_engine_cs *engine, unsigned int read_pointer, > > /** > * intel_lrc_irq_handler() - handle Context Switch interrupts > - * @ring: Engine Command Streamer to handle. > + * @engine: Engine Command Streamer to handle. > * > * Check the unread Context Status Buffers and manage the submission of new > * contexts to the ELSP accordingly. 
> */ > -void intel_lrc_irq_handler(struct intel_engine_cs *engine) > +void intel_lrc_irq_handler(unsigned long data) > { > + struct intel_engine_cs *engine = (struct intel_engine_cs *)data; > struct drm_i915_private *dev_priv = engine->dev->dev_private; > u32 status_pointer; > unsigned int read_pointer, write_pointer; > u32 csb[GEN8_CSB_ENTRIES][2]; > unsigned int csb_read = 0, i; > unsigned int submit_contexts = 0; > + unsigned long flags; > > - spin_lock(&dev_priv->uncore.lock); > + spin_lock_irqsave(&dev_priv->uncore.lock, flags); > intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); > > status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine)); > @@ -579,9 +581,9 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) > engine->next_context_status_buffer << 8)); > > intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); > - spin_unlock(&dev_priv->uncore.lock); > + spin_unlock_irqrestore(&dev_priv->uncore.lock, flags); > > - spin_lock(&engine->execlist_lock); > + spin_lock_irqsave(&engine->execlist_lock, flags); > > for (i = 0; i < csb_read; i++) { > if (unlikely(csb[i][0] & GEN8_CTX_STATUS_PREEMPTED)) { > @@ -604,7 +606,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) > execlists_context_unqueue(engine); > } > > - spin_unlock(&engine->execlist_lock); > + spin_unlock_irqrestore(&engine->execlist_lock, flags); > > if (unlikely(submit_contexts > 2)) > DRM_ERROR("More than two context complete events?\n"); > @@ -2020,6 +2022,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine) > if (!intel_engine_initialized(engine)) > return; > > + tasklet_kill(&engine->irq_tasklet); > + > dev_priv = engine->dev->dev_private; > > if (engine->buffer) { > @@ -2093,6 +2097,9 @@ logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine) > INIT_LIST_HEAD(&engine->execlist_retired_req_list); > spin_lock_init(&engine->execlist_lock); > > + tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler, > + (unsigned long)engine); > + > logical_ring_init_platform_invariants(engine); > > ret = i915_cmd_parser_init_ring(engine); > diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h > index a17cb12221ba..0b0853eee91e 100644 > --- a/drivers/gpu/drm/i915/intel_lrc.h > +++ b/drivers/gpu/drm/i915/intel_lrc.h > @@ -118,7 +118,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params, > struct drm_i915_gem_execbuffer2 *args, > struct list_head *vmas); > > -void intel_lrc_irq_handler(struct intel_engine_cs *engine); > void intel_execlists_retire_requests(struct intel_engine_cs *engine); > > #endif /* _INTEL_LRC_H_ */ > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h > index 221a94627aab..29810cba8a8c 100644 > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h > @@ -266,6 +266,7 @@ struct intel_engine_cs { > } semaphore; > > /* Execlists */ > + struct tasklet_struct irq_tasklet; > spinlock_t execlist_lock; > struct list_head execlist_queue; > struct list_head execlist_retired_req_list; > -- > 1.9.1 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply 
[flat|nested] 32+ messages in thread
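For comparison, the genirq route Daniel suggests centres on request_threaded_irq(), which splits a handler into a hard-IRQ part and a handler thread that the core creates and wakes. The sketch below is a hedged illustration of that shape only -- the names are hypothetical, and the real i915 wiring would differ since the whole device shares a single interrupt line.

/* Hedged sketch of a threaded interrupt handler; my_device and friends
 * are hypothetical, not existing i915 code. */
#include <linux/interrupt.h>

struct my_device {
	/* register mapping, per-engine execlist state, ... */
	int dummy;
};

/* Hard-IRQ half: runs with interrupts disabled, so keep it short.
 * Ack the hardware and ask the core to wake the handler thread. */
static irqreturn_t my_hardirq(int irq, void *arg)
{
	/* read/clear IIR here; return IRQ_NONE if the interrupt was not ours */
	return IRQ_WAKE_THREAD;
}

/* Threaded half: runs in a dedicated kernel thread (process context),
 * so it is preemptible and its CPU time is visible in the process table. */
static irqreturn_t my_irq_thread(int irq, void *arg)
{
	struct my_device *dev = arg;

	/* drain context status buffers, submit the next contexts, ... */
	(void)dev;
	return IRQ_HANDLED;
}

static int my_request_irq(struct my_device *dev, int irq)
{
	return request_threaded_irq(irq, my_hardirq, my_irq_thread,
				    IRQF_SHARED, "my-device", dev);
}

The kthread approach discussed below builds a similar dedicated thread by hand, trading the genirq boilerplate for direct control over when the thread is woken.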
* Re: [RFC] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 9:07 ` [RFC] " Daniel Vetter @ 2016-03-23 9:14 ` Chris Wilson 2016-03-23 10:08 ` Tvrtko Ursulin 0 siblings, 1 reply; 32+ messages in thread From: Chris Wilson @ 2016-03-23 9:14 UTC (permalink / raw) To: Daniel Vetter; +Cc: Intel-gfx On Wed, Mar 23, 2016 at 10:07:35AM +0100, Daniel Vetter wrote: > On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote: > > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > > > Doing a lot of work in the interrupt handler introduces huge > > latencies to the system as a whole. > > > > Most dramatic effect can be seen by running an all engine > > stress test like igt/gem_exec_nop/all where, when the kernel > > config is lean enough, the whole system can be brought into > > multi-second periods of complete non-interactivty. That can > > look for example like this: > > > > NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] > > Modules linked in: [redacted for brevity] > > CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 > > Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 > > Workqueue: i915 gen6_pm_rps_work [i915] > > task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 > > RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 > > RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 > > RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 > > RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 > > RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 > > R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 > > R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 > > FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > Stack: > > 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a > > 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 > > 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 > > Call Trace: > > <IRQ> > > [<ffffffff8104a716>] irq_exit+0x86/0x90 > > [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 > > [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 > > <EOI> > > [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] > > [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 > > [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] > > [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 > > [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] > > [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] > > [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] > > [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] > > [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 > > [<ffffffff8105ab29>] process_one_work+0x139/0x350 > > [<ffffffff8105b186>] worker_thread+0x126/0x490 > > [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 > > [<ffffffff8105fa64>] kthread+0xc4/0xe0 > > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > > [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 > > [<ffffffff8105f9a0>] ? 
kthread_create_on_node+0x170/0x170 > > > > I could not explain, or find a code path, which would explain > > a +20 second lockup, but from some instrumentation it was > > apparent the interrupts off proportion of time was between > > 10-25% under heavy load which is quite bad. > > > > By moving the GT interrupt handling to a tasklet in a most > > simple way, the problem above disappears completely. > > > > Also, gem_latency -n 100 shows 25% better throughput and CPU > > usage, and 14% better latencies. Forgot gem_syslatency, since that does reflect UX under load really startlingly well. > > I did not find any gains or regressions with Synmark2 or > > GLbench under light testing. More benchmarking is certainly > > required. > > Bugzilla? > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > I thought tasklets are considered unpopular nowadays? They still steal cpu > time, just have the benefit of not also disabling hard interrupts. There > should be mitigation though to offload these softinterrupts to threads. > Have you tried to create a threaded interrupt thread just for these pins > instead? A bit of boilerplate, but not much using the genirq stuff iirc. Ah, you haven't been reading patches. Yes, there's been a patch to fix the hardlockup using kthreads for a few months. Tvrtko is trying to move this forward since he too has found a way of locking up his machine using execlist under load. So far kthreads seems to have a slight edge in the benchmarks, or rather using tasklet I have some very wild results on Braswell. Using tasklets the CPU time is accounted to the process (i.e. whoever was running at the time of the irq, typically the benchmark), using kthread we have independent entries in the process table/top (which is quite nice to see just how much time is been eaten up by the context-switches). Benchmarks still progessing, haven't yet got on to the latency measureemnts.... -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
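Chris's kthread patches are not included in this thread, so the following is only a rough, hypothetical sketch of what a hand-rolled per-engine submission thread tends to look like (every name is made up); it is here to show the extra moving parts -- wakeups, wait conditions, stopping -- that the one-line tasklet_schedule() avoids.

/* Hypothetical per-engine kthread bottom half; not the actual patches
 * referred to above. */
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/atomic.h>
#include <linux/err.h>

struct my_engine {
	struct task_struct *submit_thread;
	wait_queue_head_t submit_wq;
	atomic_t irq_pending;
};

/* Called from the hard-IRQ handler: mark work and wake the thread. */
static void my_engine_irq(struct my_engine *engine)
{
	atomic_set(&engine->irq_pending, 1);
	wake_up(&engine->submit_wq);
}

static int my_submit_thread_fn(void *arg)
{
	struct my_engine *engine = arg;

	while (!kthread_should_stop()) {
		wait_event_interruptible(engine->submit_wq,
					 atomic_read(&engine->irq_pending) ||
					 kthread_should_stop());

		if (atomic_xchg(&engine->irq_pending, 0)) {
			/* drain context status buffers, submit contexts, ... */
		}
	}
	return 0;
}

static int my_engine_start_thread(struct my_engine *engine)
{
	init_waitqueue_head(&engine->submit_wq);
	atomic_set(&engine->irq_pending, 0);
	engine->submit_thread = kthread_run(my_submit_thread_fn, engine,
					    "my-engine");
	return PTR_ERR_OR_ZERO(engine->submit_thread);
}

This also makes the accounting point concrete: time spent in such a thread shows up against its own entry in top, whereas tasklet time is charged to whichever task the softirq interrupted, unless the work is punted to ksoftirqd or IRQ_TIME_ACCOUNTING is enabled, as discussed below.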
* Re: [RFC] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 9:14 ` Chris Wilson @ 2016-03-23 10:08 ` Tvrtko Ursulin 2016-03-23 11:31 ` Chris Wilson 0 siblings, 1 reply; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-23 10:08 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Intel-gfx On 23/03/16 09:14, Chris Wilson wrote: > On Wed, Mar 23, 2016 at 10:07:35AM +0100, Daniel Vetter wrote: >> On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote: >>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >>> >>> Doing a lot of work in the interrupt handler introduces huge >>> latencies to the system as a whole. >>> >>> Most dramatic effect can be seen by running an all engine >>> stress test like igt/gem_exec_nop/all where, when the kernel >>> config is lean enough, the whole system can be brought into >>> multi-second periods of complete non-interactivty. That can >>> look for example like this: >>> >>> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] >>> Modules linked in: [redacted for brevity] >>> CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 >>> Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 >>> Workqueue: i915 gen6_pm_rps_work [i915] >>> task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 >>> RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 >>> RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 >>> RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 >>> RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 >>> RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 >>> R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 >>> R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 >>> FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> Stack: >>> 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a >>> 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 >>> 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 >>> Call Trace: >>> <IRQ> >>> [<ffffffff8104a716>] irq_exit+0x86/0x90 >>> [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 >>> [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 >>> <EOI> >>> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] >>> [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 >>> [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] >>> [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 >>> [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] >>> [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] >>> [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] >>> [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] >>> [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 >>> [<ffffffff8105ab29>] process_one_work+0x139/0x350 >>> [<ffffffff8105b186>] worker_thread+0x126/0x490 >>> [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 >>> [<ffffffff8105fa64>] kthread+0xc4/0xe0 >>> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 >>> [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 >>> [<ffffffff8105f9a0>] ? 
kthread_create_on_node+0x170/0x170 >>> >>> I could not explain, or find a code path, which would explain >>> a +20 second lockup, but from some instrumentation it was >>> apparent the interrupts off proportion of time was between >>> 10-25% under heavy load which is quite bad. >>> >>> By moving the GT interrupt handling to a tasklet in a most >>> simple way, the problem above disappears completely. >>> >>> Also, gem_latency -n 100 shows 25% better throughput and CPU >>> usage, and 14% better latencies. > > Forgot gem_syslatency, since that does reflect UX under load really > startlingly well. gem_syslatency, before: gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us with tasklet: gem_syslatency: cycles=808907, latency mean=53.133us max=1640us gem_syslatency: cycles=862154, latency mean=62.778us max=2117us gem_syslatency: cycles=856039, latency mean=58.079us max=2123us gem_syslatency: cycles=841683, latency mean=56.914us max=1667us Is this smaller throughput and better latency? gem_syslatency -n, before: gem_syslatency: cycles=0, latency mean=2.446us max=18us gem_syslatency: cycles=0, latency mean=7.220us max=37us gem_syslatency: cycles=0, latency mean=6.949us max=36us gem_syslatency: cycles=0, latency mean=5.931us max=39us with tasklet: gem_syslatency: cycles=0, latency mean=2.477us max=5us gem_syslatency: cycles=0, latency mean=2.471us max=6us gem_syslatency: cycles=0, latency mean=2.696us max=24us gem_syslatency: cycles=0, latency mean=6.414us max=39us This looks potentially the same or very similar. May need more runs to get a definitive picture. gem_exec_nop/all has a huge improvement also, if we ignore the fact it locks up the system with the curret irq handler on full tilt, when I limit the max GPU frequency a bit it can avoid that problem but tasklets make it twice as fast here. >>> I did not find any gains or regressions with Synmark2 or >>> GLbench under light testing. More benchmarking is certainly >>> required. >>> > > Bugzilla? You think it is OK to continue sharing your one, https://bugs.freedesktop.org/show_bug.cgi?id=93467? >>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >>> Cc: Chris Wilson <chris@chris-wilson.co.uk> >> >> I thought tasklets are considered unpopular nowadays? They still steal cpu Did not know, last (and first) time I've used them was ~15 years ago. :) You got any links to read about it? Since (see below) I am not sure they "steal" CPU time. >> time, just have the benefit of not also disabling hard interrupts. There >> should be mitigation though to offload these softinterrupts to threads. >> Have you tried to create a threaded interrupt thread just for these pins >> instead? A bit of boilerplate, but not much using the genirq stuff iirc. > > Ah, you haven't been reading patches. Yes, there's been a patch to fix > the hardlockup using kthreads for a few months. Tvrtko is trying to move > this forward since he too has found a way of locking up his machine > using execlist under load. Correct. > So far kthreads seems to have a slight edge in the benchmarks, or rather > using tasklet I have some very wild results on Braswell. Using tasklets > the CPU time is accounted to the process (i.e. 
whoever was running at > the time of the irq, typically the benchmark), using kthread we have I thought they run from ksoftirqd so the CPU time is accounted against it. And looking at top, that even seems what actually happens. > independent entries in the process table/top (which is quite nice to see > just how much time is been eaten up by the context-switches). > > Benchmarks still progessing, haven't yet got on to the latency > measureemnts.... My tasklets hack required surprisingly little code change, at least if there are not some missed corner cases to handle, but I don't mind your threads either. Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 10:08 ` Tvrtko Ursulin @ 2016-03-23 11:31 ` Chris Wilson 2016-03-23 12:46 ` Tvrtko Ursulin 0 siblings, 1 reply; 32+ messages in thread From: Chris Wilson @ 2016-03-23 11:31 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: Intel-gfx On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote: > > On 23/03/16 09:14, Chris Wilson wrote: > >On Wed, Mar 23, 2016 at 10:07:35AM +0100, Daniel Vetter wrote: > >>On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote: > >>>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > >>> > >>>Doing a lot of work in the interrupt handler introduces huge > >>>latencies to the system as a whole. > >>> > >>>Most dramatic effect can be seen by running an all engine > >>>stress test like igt/gem_exec_nop/all where, when the kernel > >>>config is lean enough, the whole system can be brought into > >>>multi-second periods of complete non-interactivty. That can > >>>look for example like this: > >>> > >>> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] > >>> Modules linked in: [redacted for brevity] > >>> CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 > >>> Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 > >>> Workqueue: i915 gen6_pm_rps_work [i915] > >>> task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 > >>> RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 > >>> RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 > >>> RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 > >>> RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 > >>> RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 > >>> R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 > >>> R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 > >>> FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 > >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 > >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > >>> Stack: > >>> 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a > >>> 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 > >>> 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 > >>> Call Trace: > >>> <IRQ> > >>> [<ffffffff8104a716>] irq_exit+0x86/0x90 > >>> [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 > >>> [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 > >>> <EOI> > >>> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] > >>> [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 > >>> [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] > >>> [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 > >>> [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] > >>> [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] > >>> [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] > >>> [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] > >>> [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 > >>> [<ffffffff8105ab29>] process_one_work+0x139/0x350 > >>> [<ffffffff8105b186>] worker_thread+0x126/0x490 > >>> [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 > >>> [<ffffffff8105fa64>] kthread+0xc4/0xe0 > >>> [<ffffffff8105f9a0>] ? 
kthread_create_on_node+0x170/0x170 > >>> [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 > >>> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > >>> > >>>I could not explain, or find a code path, which would explain > >>>a +20 second lockup, but from some instrumentation it was > >>>apparent the interrupts off proportion of time was between > >>>10-25% under heavy load which is quite bad. > >>> > >>>By moving the GT interrupt handling to a tasklet in a most > >>>simple way, the problem above disappears completely. > >>> > >>>Also, gem_latency -n 100 shows 25% better throughput and CPU > >>>usage, and 14% better latencies. > > > >Forgot gem_syslatency, since that does reflect UX under load really > >startlingly well. > > gem_syslatency, before: > > gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us > gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us > gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us > gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us > > with tasklet: > > gem_syslatency: cycles=808907, latency mean=53.133us max=1640us > gem_syslatency: cycles=862154, latency mean=62.778us max=2117us > gem_syslatency: cycles=856039, latency mean=58.079us max=2123us > gem_syslatency: cycles=841683, latency mean=56.914us max=1667us > > Is this smaller throughput and better latency? Yeah. I wasn't expecting the smaller throughput, but the impact on other users is massive. You should be able to feel the difference if you try to use the machine whilst gem_syslatency or gem_exec_nop is running, a delay of up to 2s in responding to human input can be annoying! > gem_syslatency -n, before: > > gem_syslatency: cycles=0, latency mean=2.446us max=18us > gem_syslatency: cycles=0, latency mean=7.220us max=37us > gem_syslatency: cycles=0, latency mean=6.949us max=36us > gem_syslatency: cycles=0, latency mean=5.931us max=39us > > with tasklet: > > gem_syslatency: cycles=0, latency mean=2.477us max=5us > gem_syslatency: cycles=0, latency mean=2.471us max=6us > gem_syslatency: cycles=0, latency mean=2.696us max=24us > gem_syslatency: cycles=0, latency mean=6.414us max=39us > > This looks potentially the same or very similar. May need more runs > to get a definitive picture. -n should be unaffected, since it measures the background without gem operations (cycles=0), so should tell us how stable the numbers are for the timers. > gem_exec_nop/all has a huge improvement also, if we ignore the fact > it locks up the system with the curret irq handler on full tilt, > when I limit the max GPU frequency a bit it can avoid that problem > but tasklets make it twice as fast here. Yes, with threaded submission can then concurrently submit requests to multiple rings. I take it you have a 2-processor machine? We should ideally see linear scaling upto min(num_engines, nproc-1) if we assume that one cpu is enough to sustain gem_execbuf() ioctls. > >>>I did not find any gains or regressions with Synmark2 or > >>>GLbench under light testing. More benchmarking is certainly > >>>required. > >>> > > > >Bugzilla? > > You think it is OK to continue sharing your one, > https://bugs.freedesktop.org/show_bug.cgi?id=93467? Yes, it fixes the same freeze (and we've removed the loop from chv irq so there really shouldn't be any others left!) > >>>Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > >>>Cc: Chris Wilson <chris@chris-wilson.co.uk> > >> > >>I thought tasklets are considered unpopular nowadays? 
They still steal cpu > > Did not know, last (and first) time I've used them was ~15 years > ago. :) You got any links to read about it? Since (see below) I am > not sure they "steal" CPU time. > > >>time, just have the benefit of not also disabling hard interrupts. There > >>should be mitigation though to offload these softinterrupts to threads. > >>Have you tried to create a threaded interrupt thread just for these pins > >>instead? A bit of boilerplate, but not much using the genirq stuff iirc. > > > >Ah, you haven't been reading patches. Yes, there's been a patch to fix > >the hardlockup using kthreads for a few months. Tvrtko is trying to move > >this forward since he too has found a way of locking up his machine > >using execlist under load. > > Correct. > > >So far kthreads seems to have a slight edge in the benchmarks, or rather > >using tasklet I have some very wild results on Braswell. Using tasklets > >the CPU time is accounted to the process (i.e. whoever was running at > > the time of the irq, typically the benchmark), using kthread we have > > I thought they run from ksoftirqd so the CPU time is accounted > against it. And looking at top, that even seems what actually > happens. Not for me. :| Though I'm using simple CPU time accounting. > >independent entries in the process table/top (which is quite nice to see > >just how much time is been eaten up by the context-switches). > > > >Benchmarks still progessing, haven't yet got on to the latency > >measureemnts.... > > My tasklets hack required surprisingly little code change, at least > if there are not some missed corner cases to handle, but I don't > mind your threads either. Yes, though when moving to kthreads I dropped the requirement for spin_lock_irq(engine->execlists_lock) and so there is a large amount of fluff in changing those callsites to spin_lock(). (For tasklet, we could argue that requirement is now changed to spin_lock_bh()...) The real meat of the change is that with kthreads we have to worry about doing the scheduling() ourselves, and that impacts upon the forcewake dance so certainly more complex than tasklets! I liked how simple this patch is and so far it looks as good as making our own kthread. The biggest difference really is just who gets the CPU time! -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
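On the parenthetical above about the execlist_lock requirement becoming spin_lock_bh(): the general rule is that any process-context path sharing a lock with a tasklet must block softirqs while holding it, otherwise the tasklet can fire on the same CPU and deadlock on the held lock. A generic illustration follows (hypothetical names, not the actual i915 call sites).

/* Generic illustration of the spin_lock_bh() rule; names are hypothetical. */
#include <linux/interrupt.h>
#include <linux/spinlock.h>
#include <linux/list.h>

static DEFINE_SPINLOCK(queue_lock);
static LIST_HEAD(submit_queue);

/* Tasklet (softirq context): a plain spin_lock() is enough here, since
 * softirqs do not preempt each other on the same CPU. */
static void submit_tasklet(unsigned long data)
{
	spin_lock(&queue_lock);
	/* dequeue and submit requests ... */
	spin_unlock(&queue_lock);
}

/* Process context (e.g. the submission ioctl path): the _bh variant keeps
 * the tasklet from running on this CPU while the lock is held. */
static void queue_request(struct list_head *rq_link)
{
	spin_lock_bh(&queue_lock);
	list_add_tail(rq_link, &submit_queue);
	spin_unlock_bh(&queue_lock);
}

The irqsave variants used in the RFC patch are also safe, just stricter than necessary once nothing takes these locks from hard-IRQ context any more -- which is presumably the further improvement Tvrtko says he will try below.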
* Re: [RFC] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 11:31 ` Chris Wilson @ 2016-03-23 12:46 ` Tvrtko Ursulin 2016-03-23 12:56 ` Chris Wilson 0 siblings, 1 reply; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-23 12:46 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Intel-gfx On 23/03/16 11:31, Chris Wilson wrote: > On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote: >> >> On 23/03/16 09:14, Chris Wilson wrote: >>> On Wed, Mar 23, 2016 at 10:07:35AM +0100, Daniel Vetter wrote: >>>> On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote: >>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >>>>> >>>>> Doing a lot of work in the interrupt handler introduces huge >>>>> latencies to the system as a whole. >>>>> >>>>> Most dramatic effect can be seen by running an all engine >>>>> stress test like igt/gem_exec_nop/all where, when the kernel >>>>> config is lean enough, the whole system can be brought into >>>>> multi-second periods of complete non-interactivty. That can >>>>> look for example like this: >>>>> >>>>> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] >>>>> Modules linked in: [redacted for brevity] >>>>> CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 >>>>> Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 >>>>> Workqueue: i915 gen6_pm_rps_work [i915] >>>>> task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 >>>>> RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 >>>>> RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 >>>>> RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 >>>>> RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 >>>>> RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 >>>>> R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 >>>>> R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 >>>>> FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 >>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>>>> Stack: >>>>> 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a >>>>> 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 >>>>> 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 >>>>> Call Trace: >>>>> <IRQ> >>>>> [<ffffffff8104a716>] irq_exit+0x86/0x90 >>>>> [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 >>>>> [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 >>>>> <EOI> >>>>> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] >>>>> [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 >>>>> [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] >>>>> [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 >>>>> [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] >>>>> [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] >>>>> [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] >>>>> [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] >>>>> [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 >>>>> [<ffffffff8105ab29>] process_one_work+0x139/0x350 >>>>> [<ffffffff8105b186>] worker_thread+0x126/0x490 >>>>> [<ffffffff8105b060>] ? 
rescuer_thread+0x320/0x320 >>>>> [<ffffffff8105fa64>] kthread+0xc4/0xe0 >>>>> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 >>>>> [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 >>>>> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 >>>>> >>>>> I could not explain, or find a code path, which would explain >>>>> a +20 second lockup, but from some instrumentation it was >>>>> apparent the interrupts off proportion of time was between >>>>> 10-25% under heavy load which is quite bad. >>>>> >>>>> By moving the GT interrupt handling to a tasklet in a most >>>>> simple way, the problem above disappears completely. >>>>> >>>>> Also, gem_latency -n 100 shows 25% better throughput and CPU >>>>> usage, and 14% better latencies. >>> >>> Forgot gem_syslatency, since that does reflect UX under load really >>> startlingly well. >> >> gem_syslatency, before: >> >> gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us >> gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us >> gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us >> gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us >> >> with tasklet: >> >> gem_syslatency: cycles=808907, latency mean=53.133us max=1640us >> gem_syslatency: cycles=862154, latency mean=62.778us max=2117us >> gem_syslatency: cycles=856039, latency mean=58.079us max=2123us >> gem_syslatency: cycles=841683, latency mean=56.914us max=1667us >> >> Is this smaller throughput and better latency? > > Yeah. I wasn't expecting the smaller throughput, but the impact on other > users is massive. You should be able to feel the difference if you try > to use the machine whilst gem_syslatency or gem_exec_nop is running, a > delay of up to 2s in responding to human input can be annoying! Yes, impact is easily felt. >> gem_exec_nop/all has a huge improvement also, if we ignore the fact >> it locks up the system with the curret irq handler on full tilt, >> when I limit the max GPU frequency a bit it can avoid that problem >> but tasklets make it twice as fast here. > > Yes, with threaded submission can then concurrently submit requests to > multiple rings. I take it you have a 2-processor machine? We should > ideally see linear scaling upto min(num_engines, nproc-1) if we assume > that one cpu is enough to sustain gem_execbuf() ioctls. 2C/4T correct. >>>>> I did not find any gains or regressions with Synmark2 or >>>>> GLbench under light testing. More benchmarking is certainly >>>>> required. >>>>> >>> >>> Bugzilla? >> >> You think it is OK to continue sharing your one, >> https://bugs.freedesktop.org/show_bug.cgi?id=93467? > > Yes, it fixes the same freeze (and we've removed the loop from chv irq > so there really shouldn't be any others left!) I don't see that has been merged. Is it all ready CI wise so we could? >>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk> >>>> >>>> I thought tasklets are considered unpopular nowadays? They still steal cpu >> >> Did not know, last (and first) time I've used them was ~15 years >> ago. :) You got any links to read about it? Since (see below) I am >> not sure they "steal" CPU time. >> >>>> time, just have the benefit of not also disabling hard interrupts. There >>>> should be mitigation though to offload these softinterrupts to threads. >>>> Have you tried to create a threaded interrupt thread just for these pins >>>> instead? A bit of boilerplate, but not much using the genirq stuff iirc. 
>>> >>> Ah, you haven't been reading patches. Yes, there's been a patch to fix >>> the hardlockup using kthreads for a few months. Tvrtko is trying to move >>> this forward since he too has found a way of locking up his machine >>> using execlist under load. >> >> Correct. >> >>> So far kthreads seems to have a slight edge in the benchmarks, or rather >>> using tasklet I have some very wild results on Braswell. Using tasklets >>> the CPU time is accounted to the process (i.e. whoever was running at >>> the time of the irq, typically the benchmark), using kthread we have >> >> I thought they run from ksoftirqd so the CPU time is accounted >> against it. And looking at top, that even seems what actually >> happens. > > Not for me. :| Though I'm using simple CPU time accounting. I suppose it must be this one you don't have then: config IRQ_TIME_ACCOUNTING bool "Fine granularity task level IRQ time accounting" depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL help Select this option to enable fine granularity task irq time accounting. This is done by reading a timestamp on each transitions between softirq and hardirq state, so there can be a small performance impact. >>> independent entries in the process table/top (which is quite nice to see >>> just how much time is been eaten up by the context-switches). >>> >>> Benchmarks still progessing, haven't yet got on to the latency >>> measureemnts.... >> >> My tasklets hack required surprisingly little code change, at least >> if there are not some missed corner cases to handle, but I don't >> mind your threads either. > > Yes, though when moving to kthreads I dropped the requirement for > spin_lock_irq(engine->execlists_lock) and so there is a large amount of > fluff in changing those callsites to spin_lock(). (For tasklet, we could > argue that requirement is now changed to spin_lock_bh()...) The real meat Ooops yes, _bh variant is the correct one. I wonder if that would further improve things. Will try. > of the change is that with kthreads we have to worry about doing the > scheduling() ourselves, and that impacts upon the forcewake dance so > certainly more complex than tasklets! I liked how simple this patch is > and so far it looks as good as making our own kthread. The biggest > difference really is just who gets the CPU time! Note that in that respect it is then no worse than the current situation wrt CPU time accounting. Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 12:46 ` Tvrtko Ursulin @ 2016-03-23 12:56 ` Chris Wilson 2016-03-23 15:23 ` Tvrtko Ursulin 2016-03-24 12:18 ` [PATCH v3] " Tvrtko Ursulin 0 siblings, 2 replies; 32+ messages in thread From: Chris Wilson @ 2016-03-23 12:56 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: Intel-gfx On Wed, Mar 23, 2016 at 12:46:35PM +0000, Tvrtko Ursulin wrote: > > On 23/03/16 11:31, Chris Wilson wrote: > >On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote: > >>You think it is OK to continue sharing your one, > >>https://bugs.freedesktop.org/show_bug.cgi?id=93467? > > > >Yes, it fixes the same freeze (and we've removed the loop from chv irq > >so there really shouldn't be any others left!) > > I don't see that has been merged. Is it all ready CI wise so we could? On the CI ping: id:20160314103014.30028.12472@emeril.freedesktop.org == Summary == Series 4298v3 drm/i915: Exit cherryview_irq_handler() after one pass http://patchwork.freedesktop.org/api/1.0/series/4298/revisions/3/mbox/ Test drv_module_reload_basic: pass -> SKIP (skl-i5k-2) pass -> INCOMPLETE (bsw-nuc-2) Test gem_ringfill: Subgroup basic-default-s3: dmesg-warn -> PASS (bsw-nuc-2) Test gem_tiled_pread_basic: incomplete -> PASS (byt-nuc) Test kms_flip: Subgroup basic-flip-vs-dpms: dmesg-warn -> PASS (bdw-ultra) Subgroup basic-flip-vs-modeset: incomplete -> PASS (bsw-nuc-2) Test kms_pipe_crc_basic: Subgroup suspend-read-crc-pipe-a: incomplete -> PASS (hsw-gt2) Test pm_rpm: Subgroup basic-pci-d3-state: dmesg-warn -> PASS (bsw-nuc-2) bdw-nuci7 total:194 pass:182 dwarn:0 dfail:0 fail:0 skip:12 bdw-ultra total:194 pass:173 dwarn:0 dfail:0 fail:0 skip:21 bsw-nuc-2 total:189 pass:151 dwarn:0 dfail:0 fail:0 skip:37 byt-nuc total:194 pass:154 dwarn:4 dfail:0 fail:1 skip:35 hsw-brixbox total:194 pass:172 dwarn:0 dfail:0 fail:0 skip:22 hsw-gt2 total:194 pass:176 dwarn:1 dfail:0 fail:0 skip:17 ivb-t430s total:194 pass:169 dwarn:0 dfail:0 fail:0 skip:25 skl-i5k-2 total:194 pass:170 dwarn:0 dfail:0 fail:0 skip:24 skl-i7k-2 total:194 pass:171 dwarn:0 dfail:0 fail:0 skip:23 snb-dellxps total:194 pass:159 dwarn:1 dfail:0 fail:0 skip:34 snb-x220t total:194 pass:159 dwarn:1 dfail:0 fail:1 skip:33 Results at /archive/results/CI_IGT_test/Patchwork_1589/ 3e5ecc8c5ff80cb1fb635ce1cf16b7cd4cfb1979 drm-intel-nightly: 2016y-03m-14d-09h-06m-00s UTC integration manifest 7928c2133b16eb2f26866ca05d1cb7bb6d41c765 drm/i915: Exit cherryview_irq_handler() after one pass == drv_module_reload_basic is weird, but it appears the hiccup CI had on the previous run were external (and affected several CI runs afaict). -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 12:56 ` Chris Wilson @ 2016-03-23 15:23 ` Tvrtko Ursulin 2016-03-24 9:37 ` Tvrtko Ursulin 2016-03-24 12:18 ` [PATCH v3] " Tvrtko Ursulin 1 sibling, 1 reply; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-23 15:23 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Intel-gfx On 23/03/16 12:56, Chris Wilson wrote: > On Wed, Mar 23, 2016 at 12:46:35PM +0000, Tvrtko Ursulin wrote: >> >> On 23/03/16 11:31, Chris Wilson wrote: >>> On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote: >>>> You think it is OK to continue sharing your one, >>>> https://bugs.freedesktop.org/show_bug.cgi?id=93467? >>> >>> Yes, it fixes the same freeze (and we've removed the loop from chv irq >>> so there really shouldn't be any others left!) >> >> I don't see that has been merged. Is it all ready CI wise so we could? > > On the CI ping: > id:20160314103014.30028.12472@emeril.freedesktop.org > > == Summary == > > Series 4298v3 drm/i915: Exit cherryview_irq_handler() after one pass > http://patchwork.freedesktop.org/api/1.0/series/4298/revisions/3/mbox/ > > Test drv_module_reload_basic: > pass -> SKIP (skl-i5k-2) > pass -> INCOMPLETE (bsw-nuc-2) > Test gem_ringfill: > Subgroup basic-default-s3: > dmesg-warn -> PASS (bsw-nuc-2) > Test gem_tiled_pread_basic: > incomplete -> PASS (byt-nuc) > Test kms_flip: > Subgroup basic-flip-vs-dpms: > dmesg-warn -> PASS (bdw-ultra) > Subgroup basic-flip-vs-modeset: > incomplete -> PASS (bsw-nuc-2) > Test kms_pipe_crc_basic: > Subgroup suspend-read-crc-pipe-a: > incomplete -> PASS (hsw-gt2) > Test pm_rpm: > Subgroup basic-pci-d3-state: > dmesg-warn -> PASS (bsw-nuc-2) > > bdw-nuci7 total:194 pass:182 dwarn:0 dfail:0 fail:0 skip:12 > bdw-ultra total:194 pass:173 dwarn:0 dfail:0 fail:0 skip:21 > bsw-nuc-2 total:189 pass:151 dwarn:0 dfail:0 fail:0 skip:37 > byt-nuc total:194 pass:154 dwarn:4 dfail:0 fail:1 skip:35 > hsw-brixbox total:194 pass:172 dwarn:0 dfail:0 fail:0 skip:22 > hsw-gt2 total:194 pass:176 dwarn:1 dfail:0 fail:0 skip:17 > ivb-t430s total:194 pass:169 dwarn:0 dfail:0 fail:0 skip:25 > skl-i5k-2 total:194 pass:170 dwarn:0 dfail:0 fail:0 skip:24 > skl-i7k-2 total:194 pass:171 dwarn:0 dfail:0 fail:0 skip:23 > snb-dellxps total:194 pass:159 dwarn:1 dfail:0 fail:0 skip:34 > snb-x220t total:194 pass:159 dwarn:1 dfail:0 fail:1 skip:33 > > Results at /archive/results/CI_IGT_test/Patchwork_1589/ > > 3e5ecc8c5ff80cb1fb635ce1cf16b7cd4cfb1979 drm-intel-nightly: 2016y-03m-14d-09h-06m-00s UTC integration manifest > 7928c2133b16eb2f26866ca05d1cb7bb6d41c765 drm/i915: Exit cherryview_irq_handler() after one pass > > == > > drv_module_reload_basic is weird, but it appears the hiccup CI had on the > previous run were external (and affected several CI runs afaict). That part is goog then, but I am not sure what to do with the incomplete run. Maybe have it do another one? Although that is quite weak. Problem is it has no other hangs with that test in the history. Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 15:23 ` Tvrtko Ursulin @ 2016-03-24 9:37 ` Tvrtko Ursulin 0 siblings, 0 replies; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-24 9:37 UTC (permalink / raw) To: Chris Wilson, Daniel Vetter, Intel-gfx On 23/03/16 15:23, Tvrtko Ursulin wrote: > > On 23/03/16 12:56, Chris Wilson wrote: >> On Wed, Mar 23, 2016 at 12:46:35PM +0000, Tvrtko Ursulin wrote: >>> >>> On 23/03/16 11:31, Chris Wilson wrote: >>>> On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote: >>>>> You think it is OK to continue sharing your one, >>>>> https://bugs.freedesktop.org/show_bug.cgi?id=93467? >>>> >>>> Yes, it fixes the same freeze (and we've removed the loop from chv irq >>>> so there really shouldn't be any others left!) >>> >>> I don't see that has been merged. Is it all ready CI wise so we could? >> >> On the CI ping: >> id:20160314103014.30028.12472@emeril.freedesktop.org >> >> == Summary == >> >> Series 4298v3 drm/i915: Exit cherryview_irq_handler() after one pass >> http://patchwork.freedesktop.org/api/1.0/series/4298/revisions/3/mbox/ >> >> Test drv_module_reload_basic: >> pass -> SKIP (skl-i5k-2) >> pass -> INCOMPLETE (bsw-nuc-2) >> Test gem_ringfill: >> Subgroup basic-default-s3: >> dmesg-warn -> PASS (bsw-nuc-2) >> Test gem_tiled_pread_basic: >> incomplete -> PASS (byt-nuc) >> Test kms_flip: >> Subgroup basic-flip-vs-dpms: >> dmesg-warn -> PASS (bdw-ultra) >> Subgroup basic-flip-vs-modeset: >> incomplete -> PASS (bsw-nuc-2) >> Test kms_pipe_crc_basic: >> Subgroup suspend-read-crc-pipe-a: >> incomplete -> PASS (hsw-gt2) >> Test pm_rpm: >> Subgroup basic-pci-d3-state: >> dmesg-warn -> PASS (bsw-nuc-2) >> >> bdw-nuci7 total:194 pass:182 dwarn:0 dfail:0 fail:0 >> skip:12 >> bdw-ultra total:194 pass:173 dwarn:0 dfail:0 fail:0 >> skip:21 >> bsw-nuc-2 total:189 pass:151 dwarn:0 dfail:0 fail:0 >> skip:37 >> byt-nuc total:194 pass:154 dwarn:4 dfail:0 fail:1 >> skip:35 >> hsw-brixbox total:194 pass:172 dwarn:0 dfail:0 fail:0 >> skip:22 >> hsw-gt2 total:194 pass:176 dwarn:1 dfail:0 fail:0 >> skip:17 >> ivb-t430s total:194 pass:169 dwarn:0 dfail:0 fail:0 >> skip:25 >> skl-i5k-2 total:194 pass:170 dwarn:0 dfail:0 fail:0 >> skip:24 >> skl-i7k-2 total:194 pass:171 dwarn:0 dfail:0 fail:0 >> skip:23 >> snb-dellxps total:194 pass:159 dwarn:1 dfail:0 fail:0 >> skip:34 >> snb-x220t total:194 pass:159 dwarn:1 dfail:0 fail:1 >> skip:33 >> >> Results at /archive/results/CI_IGT_test/Patchwork_1589/ >> >> 3e5ecc8c5ff80cb1fb635ce1cf16b7cd4cfb1979 drm-intel-nightly: >> 2016y-03m-14d-09h-06m-00s UTC integration manifest >> 7928c2133b16eb2f26866ca05d1cb7bb6d41c765 drm/i915: Exit >> cherryview_irq_handler() after one pass >> >> == >> >> drv_module_reload_basic is weird, but it appears the hiccup CI had on the >> previous run were external (and affected several CI runs afaict). > > That part is goog then, but I am not sure what to do with the incomplete > run. Maybe have it do another one? Although that is quite weak. Problem > is it has no other hangs with that test in the history. goog yes :) I got a bsw-nuc2 hang in results yesterday for a series which I don't think could really have caused it. So I think there might be something really wrong either with that machine or with the driver reload on chv/bsw in general. 
Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v3] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 12:56 ` Chris Wilson 2016-03-23 15:23 ` Tvrtko Ursulin @ 2016-03-24 12:18 ` Tvrtko Ursulin 2016-03-24 22:24 ` Chris Wilson 1 sibling, 1 reply; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-24 12:18 UTC (permalink / raw) To: Intel-gfx From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Doing a lot of work in the interrupt handler introduces huge latencies to the system as a whole. Most dramatic effect can be seen by running an all engine stress test like igt/gem_exec_nop/all where, when the kernel config is lean enough, the whole system can be brought into multi-second periods of complete non-interactivty. That can look for example like this: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] Modules linked in: [redacted for brevity] CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 Workqueue: i915 gen6_pm_rps_work [i915] task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 Call Trace: <IRQ> [<ffffffff8104a716>] irq_exit+0x86/0x90 [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 <EOI> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 [<ffffffff8105ab29>] process_one_work+0x139/0x350 [<ffffffff8105b186>] worker_thread+0x126/0x490 [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 [<ffffffff8105fa64>] kthread+0xc4/0xe0 [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 I could not explain, or find a code path, which would explain a +20 second lockup, but from some instrumentation it was apparent the interrupts off proportion of time was between 10-25% under heavy load which is quite bad. When a interrupt "cliff" is reached, which was >~320k irq/s on my machine, the whole system goes into a terrible state of the above described multi-second lockups. 
By moving the GT interrupt handling to a tasklet in a most simple way, the problem above disappears completely. Testing the effect on sytem-wide latencies using igt/gem_syslatency shows the following before this patch: gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us This shows that the unrelated process is experiencing huge delays in its wake-up latency. After the patch the results look like this: gem_syslatency: cycles=808907, latency mean=53.133us max=1640us gem_syslatency: cycles=862154, latency mean=62.778us max=2117us gem_syslatency: cycles=856039, latency mean=58.079us max=2123us gem_syslatency: cycles=841683, latency mean=56.914us max=1667us Showing a huge improvement in the unrelated process wake-up latency. It also shows an approximate halving in the number of total empty batches submitted during the test. This may not be worrying since the test puts the driver under a very unrealistic load with ncpu threads doing empty batch submission to all GPU engines each. More interesting scenario with regards to throughput is "gem_latency -n 100" which shows 25% better throughput and CPU usage, and 14% better dispatch latencies. I did not find any gains or regressions with Synmark2 or GLbench under light testing. More benchmarking is certainly required. v2: * execlists_lock should be taken as spin_lock_bh when queuing work from userspace now. (Chris Wilson) * uncore.lock must be taken with spin_lock_irq when submitting requests since that now runs from either softirq or process context. v3: * Expanded commit message with more testing data; * converted missed locking sites to _bh; * added execlist_lock comment. 
(Chris Wilson) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Testcase: igt/gem_exec_nop/all Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350 Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> --- drivers/gpu/drm/i915/i915_debugfs.c | 5 ++--- drivers/gpu/drm/i915/i915_gem.c | 8 ++++---- drivers/gpu/drm/i915/i915_irq.c | 2 +- drivers/gpu/drm/i915/intel_lrc.c | 28 ++++++++++++++++------------ drivers/gpu/drm/i915/intel_lrc.h | 1 - drivers/gpu/drm/i915/intel_ringbuffer.h | 3 ++- 6 files changed, 25 insertions(+), 22 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c index e0ba3e38000f..daf8e76966ae 100644 --- a/drivers/gpu/drm/i915/i915_debugfs.c +++ b/drivers/gpu/drm/i915/i915_debugfs.c @@ -2092,7 +2092,6 @@ static int i915_execlists(struct seq_file *m, void *data) for_each_engine(engine, dev_priv, ring_id) { struct drm_i915_gem_request *head_req = NULL; int count = 0; - unsigned long flags; seq_printf(m, "%s\n", engine->name); @@ -2119,13 +2118,13 @@ static int i915_execlists(struct seq_file *m, void *data) i, status, ctx_id); } - spin_lock_irqsave(&engine->execlist_lock, flags); + spin_lock_bh(&engine->execlist_lock); list_for_each(cursor, &engine->execlist_queue) count++; head_req = list_first_entry_or_null(&engine->execlist_queue, struct drm_i915_gem_request, execlist_link); - spin_unlock_irqrestore(&engine->execlist_lock, flags); + spin_unlock_bh(&engine->execlist_lock); seq_printf(m, "\t%d requests in queue\n", count); if (head_req) { diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 8588c83abb35..898a99a630b8 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2840,13 +2840,13 @@ static void i915_gem_reset_engine_cleanup(struct drm_i915_private *dev_priv, */ if (i915.enable_execlists) { - spin_lock_irq(&engine->execlist_lock); + spin_lock_bh(&engine->execlist_lock); /* list_splice_tail_init checks for empty lists */ list_splice_tail_init(&engine->execlist_queue, &engine->execlist_retired_req_list); - spin_unlock_irq(&engine->execlist_lock); + spin_unlock_bh(&engine->execlist_lock); intel_execlists_retire_requests(engine); } @@ -2968,9 +2968,9 @@ i915_gem_retire_requests(struct drm_device *dev) i915_gem_retire_requests_ring(engine); idle &= list_empty(&engine->request_list); if (i915.enable_execlists) { - spin_lock_irq(&engine->execlist_lock); + spin_lock_bh(&engine->execlist_lock); idle &= list_empty(&engine->execlist_queue); - spin_unlock_irq(&engine->execlist_lock); + spin_unlock_bh(&engine->execlist_lock); intel_execlists_retire_requests(engine); } diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index a55a7cc317f8..0637c01d3c01 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -1324,7 +1324,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift) if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) notify_ring(engine); if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift)) - intel_lrc_irq_handler(engine); + tasklet_schedule(&engine->irq_tasklet); } static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv, diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 40ef4eaf580f..2d5e833b14d1 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -418,20 +418,18 @@ static void execlists_submit_requests(struct drm_i915_gem_request *rq0, 
{ struct drm_i915_private *dev_priv = rq0->i915; - /* BUG_ON(!irqs_disabled()); */ - execlists_update_context(rq0); if (rq1) execlists_update_context(rq1); - spin_lock(&dev_priv->uncore.lock); + spin_lock_irq(&dev_priv->uncore.lock); intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); execlists_elsp_write(rq0, rq1); intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); - spin_unlock(&dev_priv->uncore.lock); + spin_unlock_irq(&dev_priv->uncore.lock); } static void execlists_context_unqueue(struct intel_engine_cs *engine) @@ -538,13 +536,14 @@ get_context_status(struct intel_engine_cs *engine, unsigned int read_pointer, /** * intel_lrc_irq_handler() - handle Context Switch interrupts - * @ring: Engine Command Streamer to handle. + * @engine: Engine Command Streamer to handle. * * Check the unread Context Status Buffers and manage the submission of new * contexts to the ELSP accordingly. */ -void intel_lrc_irq_handler(struct intel_engine_cs *engine) +void intel_lrc_irq_handler(unsigned long data) { + struct intel_engine_cs *engine = (struct intel_engine_cs *)data; struct drm_i915_private *dev_priv = engine->dev->dev_private; u32 status_pointer; unsigned int read_pointer, write_pointer; @@ -552,7 +551,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) unsigned int csb_read = 0, i; unsigned int submit_contexts = 0; - spin_lock(&dev_priv->uncore.lock); + spin_lock_irq(&dev_priv->uncore.lock); intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine)); @@ -579,7 +578,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) engine->next_context_status_buffer << 8)); intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); - spin_unlock(&dev_priv->uncore.lock); + spin_unlock_irq(&dev_priv->uncore.lock); spin_lock(&engine->execlist_lock); @@ -621,7 +620,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request) i915_gem_request_reference(request); - spin_lock_irq(&engine->execlist_lock); + spin_lock_bh(&engine->execlist_lock); list_for_each_entry(cursor, &engine->execlist_queue, execlist_link) if (++num_elements > 2) @@ -646,7 +645,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request) if (num_elements == 0) execlists_context_unqueue(engine); - spin_unlock_irq(&engine->execlist_lock); + spin_unlock_bh(&engine->execlist_lock); } static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req) @@ -1033,9 +1032,9 @@ void intel_execlists_retire_requests(struct intel_engine_cs *engine) return; INIT_LIST_HEAD(&retired_list); - spin_lock_irq(&engine->execlist_lock); + spin_lock_bh(&engine->execlist_lock); list_replace_init(&engine->execlist_retired_req_list, &retired_list); - spin_unlock_irq(&engine->execlist_lock); + spin_unlock_bh(&engine->execlist_lock); list_for_each_entry_safe(req, tmp, &retired_list, execlist_link) { struct intel_context *ctx = req->ctx; @@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine) if (!intel_engine_initialized(engine)) return; + tasklet_kill(&engine->irq_tasklet); + dev_priv = engine->dev->dev_private; if (engine->buffer) { @@ -2089,6 +2090,9 @@ logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine) INIT_LIST_HEAD(&engine->execlist_retired_req_list); spin_lock_init(&engine->execlist_lock); + tasklet_init(&engine->irq_tasklet, + intel_lrc_irq_handler, (unsigned long)engine); + logical_ring_init_platform_invariants(engine); ret = i915_cmd_parser_init_ring(engine); 
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h index a17cb12221ba..0b0853eee91e 100644 --- a/drivers/gpu/drm/i915/intel_lrc.h +++ b/drivers/gpu/drm/i915/intel_lrc.h @@ -118,7 +118,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params, struct drm_i915_gem_execbuffer2 *args, struct list_head *vmas); -void intel_lrc_irq_handler(struct intel_engine_cs *engine); void intel_execlists_retire_requests(struct intel_engine_cs *engine); #endif /* _INTEL_LRC_H_ */ diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h index 221a94627aab..18074ab55f61 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.h +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h @@ -266,7 +266,8 @@ struct intel_engine_cs { } semaphore; /* Execlists */ - spinlock_t execlist_lock; + struct tasklet_struct irq_tasklet; + spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */ struct list_head execlist_queue; struct list_head execlist_retired_req_list; unsigned int next_context_status_buffer; -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 32+ messages in thread
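As background to the v2/v3 locking notes above, here is a minimal generic sketch of the rule being applied (illustrative names only, none of this is i915 code; it uses the tasklet API of the time, where the handler takes an unsigned long): once the only atomic-context user of a lock is a tasklet rather than a hard irq, process-context callers take it with spin_lock_bh() and the tasklet itself can use a plain spin_lock(), while locks that may still be taken from hard-irq context keep the _irq variants.

#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(queue_lock);     /* shared by tasklet and process context only */
static LIST_HEAD(queue);

/* Bottom half: runs in softirq context, cannot be preempted by process
 * context or by another softirq on the same CPU, so a plain spin_lock()
 * is sufficient here. */
static void queue_tasklet_func(unsigned long data)
{
        spin_lock(&queue_lock);
        /* ... drain the queue ... */
        spin_unlock(&queue_lock);
}

static DECLARE_TASKLET(queue_tasklet, queue_tasklet_func, 0);

/* Process context: _bh keeps the tasklet from running on this CPU while
 * the lock is held; disabling hard irqs is not needed because the hard
 * irq only schedules the tasklet and never takes queue_lock itself. */
static void queue_submit(struct list_head *entry)
{
        spin_lock_bh(&queue_lock);
        list_add_tail(entry, &queue);
        spin_unlock_bh(&queue_lock);

        tasklet_schedule(&queue_tasklet);
}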
* Re: [PATCH v3] drm/i915: Move execlists irq handler to a bottom half 2016-03-24 12:18 ` [PATCH v3] " Tvrtko Ursulin @ 2016-03-24 22:24 ` Chris Wilson 2016-03-25 12:56 ` Chris Wilson 0 siblings, 1 reply; 32+ messages in thread From: Chris Wilson @ 2016-03-24 22:24 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: Intel-gfx On Thu, Mar 24, 2016 at 12:18:49PM +0000, Tvrtko Ursulin wrote: > /** > * intel_lrc_irq_handler() - handle Context Switch interrupts > - * @ring: Engine Command Streamer to handle. > + * @engine: Engine Command Streamer to handle. > * > * Check the unread Context Status Buffers and manage the submission of new > * contexts to the ELSP accordingly. > */ > -void intel_lrc_irq_handler(struct intel_engine_cs *engine) > +void intel_lrc_irq_handler(unsigned long data) This should now be made static. > @@ -552,7 +551,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) > unsigned int csb_read = 0, i; > unsigned int submit_contexts = 0; > > - spin_lock(&dev_priv->uncore.lock); > + spin_lock_irq(&dev_priv->uncore.lock); > intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); > > status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine)); > @@ -579,7 +578,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) > engine->next_context_status_buffer << 8)); > > intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); > - spin_unlock(&dev_priv->uncore.lock); > + spin_unlock_irq(&dev_priv->uncore.lock); We can actually guard this section with just intel_uncore_forcewake_get/intel_uncore_forcewake_put and drop the explicit spin_lock_irq(uncore.lock). We know no one else can do simultaneous access to this cacheline of mmio space (though that shouldn't affect any of these machines!) and so can reduce the critical section. That comes at the cost of the extra lock/unlock, though, versus the potential for multiple cores to be reading from mmio. > @@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine) > if (!intel_engine_initialized(engine)) > return; > > + tasklet_kill(&engine->irq_tasklet); Imre suggested that we assert that the irq_tasklet is idle: WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->irq_tasklet.state)); -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v3] drm/i915: Move execlists irq handler to a bottom half 2016-03-24 22:24 ` Chris Wilson @ 2016-03-25 12:56 ` Chris Wilson 2016-04-04 11:11 ` [PATCH v4] " Tvrtko Ursulin 0 siblings, 1 reply; 32+ messages in thread From: Chris Wilson @ 2016-03-25 12:56 UTC (permalink / raw) To: Tvrtko Ursulin, Intel-gfx, Tvrtko Ursulin On Thu, Mar 24, 2016 at 10:24:49PM +0000, Chris Wilson wrote: > On Thu, Mar 24, 2016 at 12:18:49PM +0000, Tvrtko Ursulin wrote: > > @@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine) > > if (!intel_engine_initialized(engine)) > > return; > > > > + tasklet_kill(&engine->irq_tasklet); > > Imre suggested that we assert that the irq_tasklet is idle: > WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->irq_tasklet.state)); Whilst it may not be necessary in cleanup(), we should put a tasklet_kill() in i915_reset_engine_cleanup(). -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
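The two teardown idioms discussed above, as a generic sketch (illustrative function names, not the i915 entry points; the relevant flag is TASKLET_STATE_SCHED, as the v4 patch below also uses): tasklet_kill() waits for a scheduled or running tasklet to finish before the state it touches is reclaimed, and a WARN_ON on the SCHED bit documents the expectation that nothing can still be scheduling it by final cleanup.

#include <linux/bitops.h>
#include <linux/interrupt.h>
#include <linux/kernel.h>

/* Reset path: the bottom half may be scheduled or running, so wait for
 * it to complete before reclaiming the state it operates on. */
static void example_reset_cleanup(struct tasklet_struct *t)
{
        tasklet_kill(t);
}

/* Final cleanup: by now nothing should be able to schedule the tasklet,
 * so a pending one indicates a bug; warn, and kill it anyway. */
static void example_final_cleanup(struct tasklet_struct *t)
{
        if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &t->state)))
                tasklet_kill(t);
}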
* [PATCH v4] drm/i915: Move execlists irq handler to a bottom half 2016-03-25 12:56 ` Chris Wilson @ 2016-04-04 11:11 ` Tvrtko Ursulin 2016-04-04 11:27 ` Chris Wilson 0 siblings, 1 reply; 32+ messages in thread From: Tvrtko Ursulin @ 2016-04-04 11:11 UTC (permalink / raw) To: Intel-gfx From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Doing a lot of work in the interrupt handler introduces huge latencies to the system as a whole. Most dramatic effect can be seen by running an all engine stress test like igt/gem_exec_nop/all where, when the kernel config is lean enough, the whole system can be brought into multi-second periods of complete non-interactivty. That can look for example like this: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] Modules linked in: [redacted for brevity] CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 Workqueue: i915 gen6_pm_rps_work [i915] task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 Call Trace: <IRQ> [<ffffffff8104a716>] irq_exit+0x86/0x90 [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 <EOI> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 [<ffffffff8105ab29>] process_one_work+0x139/0x350 [<ffffffff8105b186>] worker_thread+0x126/0x490 [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 [<ffffffff8105fa64>] kthread+0xc4/0xe0 [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 I could not explain, or find a code path, which would explain a +20 second lockup, but from some instrumentation it was apparent the interrupts off proportion of time was between 10-25% under heavy load which is quite bad. When a interrupt "cliff" is reached, which was >~320k irq/s on my machine, the whole system goes into a terrible state of the above described multi-second lockups. 
By moving the GT interrupt handling to a tasklet in a most simple way, the problem above disappears completely. Testing the effect on sytem-wide latencies using igt/gem_syslatency shows the following before this patch: gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us This shows that the unrelated process is experiencing huge delays in its wake-up latency. After the patch the results look like this: gem_syslatency: cycles=808907, latency mean=53.133us max=1640us gem_syslatency: cycles=862154, latency mean=62.778us max=2117us gem_syslatency: cycles=856039, latency mean=58.079us max=2123us gem_syslatency: cycles=841683, latency mean=56.914us max=1667us Showing a huge improvement in the unrelated process wake-up latency. It also shows an approximate halving in the number of total empty batches submitted during the test. This may not be worrying since the test puts the driver under a very unrealistic load with ncpu threads doing empty batch submission to all GPU engines each. Another benefit compared to the hard-irq handling is that now work on all engines can be dispatched in parallel since we can have up to number of CPUs active tasklets. (While previously a single hard-irq would serially dispatch on one engine after another.) More interesting scenario with regards to throughput is "gem_latency -n 100" which shows 25% better throughput and CPU usage, and 14% better dispatch latencies. I did not find any gains or regressions with Synmark2 or GLbench under light testing. More benchmarking is certainly required. v2: * execlists_lock should be taken as spin_lock_bh when queuing work from userspace now. (Chris Wilson) * uncore.lock must be taken with spin_lock_irq when submitting requests since that now runs from either softirq or process context. v3: * Expanded commit message with more testing data; * converted missed locking sites to _bh; * added execlist_lock comment. (Chris Wilson) v4: * Mention dispatch parallelism in commit. (Chris Wilson) * Do not hold uncore.lock over MMIO reads since the block is already serialised per-engine via the tasklet itself. (Chris Wilson) * intel_lrc_irq_handler should be static. (Chris Wilson) * Cancel/sync the tasklet on GPU reset. (Chris Wilson) * Document and WARN that tasklet cannot be active/pending on engine cleanup. 
(Chris Wilson/Imre Deak) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Imre Deak <imre.deak@intel.com> Testcase: igt/gem_exec_nop/all Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350 Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> --- drivers/gpu/drm/i915/i915_debugfs.c | 5 ++--- drivers/gpu/drm/i915/i915_gem.c | 10 +++++---- drivers/gpu/drm/i915/i915_irq.c | 2 +- drivers/gpu/drm/i915/intel_lrc.c | 36 ++++++++++++++++++++------------- drivers/gpu/drm/i915/intel_lrc.h | 1 - drivers/gpu/drm/i915/intel_ringbuffer.h | 3 ++- 6 files changed, 33 insertions(+), 24 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c index 0b25228c202e..a2e3af02c292 100644 --- a/drivers/gpu/drm/i915/i915_debugfs.c +++ b/drivers/gpu/drm/i915/i915_debugfs.c @@ -2100,7 +2100,6 @@ static int i915_execlists(struct seq_file *m, void *data) for_each_engine(engine, dev_priv) { struct drm_i915_gem_request *head_req = NULL; int count = 0; - unsigned long flags; seq_printf(m, "%s\n", engine->name); @@ -2127,13 +2126,13 @@ static int i915_execlists(struct seq_file *m, void *data) i, status, ctx_id); } - spin_lock_irqsave(&engine->execlist_lock, flags); + spin_lock_bh(&engine->execlist_lock); list_for_each(cursor, &engine->execlist_queue) count++; head_req = list_first_entry_or_null(&engine->execlist_queue, struct drm_i915_gem_request, execlist_link); - spin_unlock_irqrestore(&engine->execlist_lock, flags); + spin_unlock_bh(&engine->execlist_lock); seq_printf(m, "\t%d requests in queue\n", count); if (head_req) { diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index ca96fc12cdf4..40f90c7e718a 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2842,13 +2842,15 @@ static void i915_gem_reset_engine_cleanup(struct drm_i915_private *dev_priv, */ if (i915.enable_execlists) { - spin_lock_irq(&engine->execlist_lock); + /* Ensure irq handler finishes or is cancelled. 
*/ + tasklet_kill(&engine->irq_tasklet); + spin_lock_bh(&engine->execlist_lock); /* list_splice_tail_init checks for empty lists */ list_splice_tail_init(&engine->execlist_queue, &engine->execlist_retired_req_list); + spin_unlock_bh(&engine->execlist_lock); - spin_unlock_irq(&engine->execlist_lock); intel_execlists_retire_requests(engine); } @@ -2968,9 +2970,9 @@ i915_gem_retire_requests(struct drm_device *dev) i915_gem_retire_requests_ring(engine); idle &= list_empty(&engine->request_list); if (i915.enable_execlists) { - spin_lock_irq(&engine->execlist_lock); + spin_lock_bh(&engine->execlist_lock); idle &= list_empty(&engine->execlist_queue); - spin_unlock_irq(&engine->execlist_lock); + spin_unlock_bh(&engine->execlist_lock); intel_execlists_retire_requests(engine); } diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index a9c18137e3f5..c85eb8dec2dc 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -1323,7 +1323,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift) if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) notify_ring(engine); if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift)) - intel_lrc_irq_handler(engine); + tasklet_schedule(&engine->irq_tasklet); } static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv, diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 5d4ca3b11ae2..a1db6a02cf23 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -131,6 +131,7 @@ * preemption, but just sampling the new tail pointer). * */ +#include <linux/interrupt.h> #include <drm/drmP.h> #include <drm/i915_drm.h> @@ -418,20 +419,18 @@ static void execlists_submit_requests(struct drm_i915_gem_request *rq0, { struct drm_i915_private *dev_priv = rq0->i915; - /* BUG_ON(!irqs_disabled()); */ - execlists_update_context(rq0); if (rq1) execlists_update_context(rq1); - spin_lock(&dev_priv->uncore.lock); + spin_lock_irq(&dev_priv->uncore.lock); intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); execlists_elsp_write(rq0, rq1); intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); - spin_unlock(&dev_priv->uncore.lock); + spin_unlock_irq(&dev_priv->uncore.lock); } static void execlists_context_unqueue(struct intel_engine_cs *engine) @@ -538,13 +537,14 @@ get_context_status(struct intel_engine_cs *engine, unsigned int read_pointer, /** * intel_lrc_irq_handler() - handle Context Switch interrupts - * @ring: Engine Command Streamer to handle. + * @engine: Engine Command Streamer to handle. * * Check the unread Context Status Buffers and manage the submission of new * contexts to the ELSP accordingly. 
*/ -void intel_lrc_irq_handler(struct intel_engine_cs *engine) +static void intel_lrc_irq_handler(unsigned long data) { + struct intel_engine_cs *engine = (struct intel_engine_cs *)data; struct drm_i915_private *dev_priv = engine->dev->dev_private; u32 status_pointer; unsigned int read_pointer, write_pointer; @@ -552,8 +552,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) unsigned int csb_read = 0, i; unsigned int submit_contexts = 0; - spin_lock(&dev_priv->uncore.lock); - intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL); status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine)); @@ -578,8 +577,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK, engine->next_context_status_buffer << 8)); - intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); - spin_unlock(&dev_priv->uncore.lock); + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL); spin_lock(&engine->execlist_lock); @@ -621,7 +619,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request) i915_gem_request_reference(request); - spin_lock_irq(&engine->execlist_lock); + spin_lock_bh(&engine->execlist_lock); list_for_each_entry(cursor, &engine->execlist_queue, execlist_link) if (++num_elements > 2) @@ -646,7 +644,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request) if (num_elements == 0) execlists_context_unqueue(engine); - spin_unlock_irq(&engine->execlist_lock); + spin_unlock_bh(&engine->execlist_lock); } static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req) @@ -1033,9 +1031,9 @@ void intel_execlists_retire_requests(struct intel_engine_cs *engine) return; INIT_LIST_HEAD(&retired_list); - spin_lock_irq(&engine->execlist_lock); + spin_lock_bh(&engine->execlist_lock); list_replace_init(&engine->execlist_retired_req_list, &retired_list); - spin_unlock_irq(&engine->execlist_lock); + spin_unlock_bh(&engine->execlist_lock); list_for_each_entry_safe(req, tmp, &retired_list, execlist_link) { struct intel_context *ctx = req->ctx; @@ -2016,6 +2014,13 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine) if (!intel_engine_initialized(engine)) return; + /* + * Tasklet cannot be active at this point due intel_mark_active/idle + * so this is just for documentation. 
+ */ + if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->irq_tasklet.state))) + tasklet_kill(&engine->irq_tasklet); + dev_priv = engine->dev->dev_private; if (engine->buffer) { @@ -2089,6 +2094,9 @@ logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine) INIT_LIST_HEAD(&engine->execlist_retired_req_list); spin_lock_init(&engine->execlist_lock); + tasklet_init(&engine->irq_tasklet, + intel_lrc_irq_handler, (unsigned long)engine); + logical_ring_init_platform_invariants(engine); ret = i915_cmd_parser_init_ring(engine); diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h index a17cb12221ba..0b0853eee91e 100644 --- a/drivers/gpu/drm/i915/intel_lrc.h +++ b/drivers/gpu/drm/i915/intel_lrc.h @@ -118,7 +118,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params, struct drm_i915_gem_execbuffer2 *args, struct list_head *vmas); -void intel_lrc_irq_handler(struct intel_engine_cs *engine); void intel_execlists_retire_requests(struct intel_engine_cs *engine); #endif /* _INTEL_LRC_H_ */ diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h index 221a94627aab..18074ab55f61 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.h +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h @@ -266,7 +266,8 @@ struct intel_engine_cs { } semaphore; /* Execlists */ - spinlock_t execlist_lock; + struct tasklet_struct irq_tasklet; + spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */ struct list_head execlist_queue; struct list_head execlist_retired_req_list; unsigned int next_context_status_buffer; -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 32+ messages in thread
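To illustrate the dispatch-parallelism point in the v4 commit message, a generic sketch (made-up structures, not the i915 ones): each engine owns its own tasklet, and while a single tasklet instance never runs on two CPUs at once, tasklets belonging to different engines can run concurrently on different CPUs, up to the number of CPUs, whereas the old hard-irq path processed the engines one after another.

#include <linux/bitops.h>
#include <linux/interrupt.h>

#define NUM_ENGINES 4

struct example_engine {
        struct tasklet_struct irq_tasklet;
        /* ... per-engine submission state ... */
};

static struct example_engine engines[NUM_ENGINES];

/* Per-engine bottom half; distinct instances may run in parallel on
 * different CPUs. */
static void engine_irq_tasklet(unsigned long data)
{
        struct example_engine *e = (struct example_engine *)data;

        /* ... process this engine's context-switch events ... */
        (void)e;
}

static void engines_init(void)
{
        int i;

        for (i = 0; i < NUM_ENGINES; i++)
                tasklet_init(&engines[i].irq_tasklet, engine_irq_tasklet,
                             (unsigned long)&engines[i]);
}

/* Hard-irq top half: just kick the tasklet of every engine flagged in
 * the pending mask and return. */
static void engines_irq_dispatch(unsigned long pending_mask)
{
        int i;

        for (i = 0; i < NUM_ENGINES; i++)
                if (pending_mask & BIT(i))
                        tasklet_schedule(&engines[i].irq_tasklet);
}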
* Re: [PATCH v4] drm/i915: Move execlists irq handler to a bottom half 2016-04-04 11:11 ` [PATCH v4] " Tvrtko Ursulin @ 2016-04-04 11:27 ` Chris Wilson 2016-04-04 12:51 ` Tvrtko Ursulin 0 siblings, 1 reply; 32+ messages in thread From: Chris Wilson @ 2016-04-04 11:27 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: Intel-gfx On Mon, Apr 04, 2016 at 12:11:56PM +0100, Tvrtko Ursulin wrote: > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > Doing a lot of work in the interrupt handler introduces huge > latencies to the system as a whole. > > Most dramatic effect can be seen by running an all engine > stress test like igt/gem_exec_nop/all where, when the kernel > config is lean enough, the whole system can be brought into > multi-second periods of complete non-interactivty. That can > look for example like this: > > NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] > Modules linked in: [redacted for brevity] > CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 > Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 > Workqueue: i915 gen6_pm_rps_work [i915] > task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 > RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 > RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 > RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 > RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 > RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 > R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 > R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 > FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Stack: > 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a > 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 > 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 > Call Trace: > <IRQ> > [<ffffffff8104a716>] irq_exit+0x86/0x90 > [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 > [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 > <EOI> > [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] > [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 > [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] > [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 > [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] > [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] > [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] > [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] > [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 > [<ffffffff8105ab29>] process_one_work+0x139/0x350 > [<ffffffff8105b186>] worker_thread+0x126/0x490 > [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 > [<ffffffff8105fa64>] kthread+0xc4/0xe0 > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > > I could not explain, or find a code path, which would explain > a +20 second lockup, but from some instrumentation it was > apparent the interrupts off proportion of time was between > 10-25% under heavy load which is quite bad. 
> > When a interrupt "cliff" is reached, which was >~320k irq/s on > my machine, the whole system goes into a terrible state of the > above described multi-second lockups. > > By moving the GT interrupt handling to a tasklet in a most > simple way, the problem above disappears completely. > > Testing the effect on sytem-wide latencies using > igt/gem_syslatency shows the following before this patch: > > gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us > gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us > gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us > gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us > > This shows that the unrelated process is experiencing huge > delays in its wake-up latency. After the patch the results > look like this: > > gem_syslatency: cycles=808907, latency mean=53.133us max=1640us > gem_syslatency: cycles=862154, latency mean=62.778us max=2117us > gem_syslatency: cycles=856039, latency mean=58.079us max=2123us > gem_syslatency: cycles=841683, latency mean=56.914us max=1667us > > Showing a huge improvement in the unrelated process wake-up > latency. It also shows an approximate halving in the number > of total empty batches submitted during the test. This may > not be worrying since the test puts the driver under > a very unrealistic load with ncpu threads doing empty batch > submission to all GPU engines each. > > Another benefit compared to the hard-irq handling is that now > work on all engines can be dispatched in parallel since we can > have up to number of CPUs active tasklets. (While previously > a single hard-irq would serially dispatch on one engine after > another.) > > More interesting scenario with regards to throughput is > "gem_latency -n 100" which shows 25% better throughput and > CPU usage, and 14% better dispatch latencies. > > I did not find any gains or regressions with Synmark2 or > GLbench under light testing. More benchmarking is certainly > required. > > v2: > * execlists_lock should be taken as spin_lock_bh when > queuing work from userspace now. (Chris Wilson) > * uncore.lock must be taken with spin_lock_irq when > submitting requests since that now runs from either > softirq or process context. > > v3: > * Expanded commit message with more testing data; > * converted missed locking sites to _bh; > * added execlist_lock comment. (Chris Wilson) > > v4: > * Mention dispatch parallelism in commit. (Chris Wilson) > * Do not hold uncore.lock over MMIO reads since the block > is already serialised per-engine via the tasklet itself. > (Chris Wilson) > * intel_lrc_irq_handler should be static. (Chris Wilson) > * Cancel/sync the tasklet on GPU reset. (Chris Wilson) > * Document and WARN that tasklet cannot be active/pending > on engine cleanup. (Chris Wilson/Imre Deak) > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Imre Deak <imre.deak@intel.com> > Testcase: igt/gem_exec_nop/all > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350 > Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Didn't spot anything else in testing over the last week. There are a number of follow up improvements we can make to intel_lrc_irq_handler() to halve its execution time (minor improvement to execlist throughput, major improvement to syslatency) which focus on streamlining the register reads and context-unqueueing (but the patches I have depend on esoteric features like struct_mutex-less requests). 
Lgtm, -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v4] drm/i915: Move execlists irq handler to a bottom half 2016-04-04 11:27 ` Chris Wilson @ 2016-04-04 12:51 ` Tvrtko Ursulin 0 siblings, 0 replies; 32+ messages in thread From: Tvrtko Ursulin @ 2016-04-04 12:51 UTC (permalink / raw) To: Chris Wilson, Intel-gfx, Tvrtko Ursulin, Imre Deak On 04/04/16 12:27, Chris Wilson wrote: > On Mon, Apr 04, 2016 at 12:11:56PM +0100, Tvrtko Ursulin wrote: >> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >> >> Doing a lot of work in the interrupt handler introduces huge >> latencies to the system as a whole. >> >> Most dramatic effect can be seen by running an all engine >> stress test like igt/gem_exec_nop/all where, when the kernel >> config is lean enough, the whole system can be brought into >> multi-second periods of complete non-interactivty. That can >> look for example like this: >> >> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] >> Modules linked in: [redacted for brevity] >> CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 >> Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 >> Workqueue: i915 gen6_pm_rps_work [i915] >> task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 >> RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 >> RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 >> RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 >> RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 >> RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 >> R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 >> R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 >> FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Stack: >> 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a >> 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 >> 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 >> Call Trace: >> <IRQ> >> [<ffffffff8104a716>] irq_exit+0x86/0x90 >> [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 >> [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 >> <EOI> >> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] >> [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 >> [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] >> [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 >> [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] >> [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] >> [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] >> [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] >> [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 >> [<ffffffff8105ab29>] process_one_work+0x139/0x350 >> [<ffffffff8105b186>] worker_thread+0x126/0x490 >> [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 >> [<ffffffff8105fa64>] kthread+0xc4/0xe0 >> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 >> [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 >> [<ffffffff8105f9a0>] ? 
kthread_create_on_node+0x170/0x170 >> >> I could not explain, or find a code path, which would explain >> a +20 second lockup, but from some instrumentation it was >> apparent the interrupts off proportion of time was between >> 10-25% under heavy load which is quite bad. >> >> When a interrupt "cliff" is reached, which was >~320k irq/s on >> my machine, the whole system goes into a terrible state of the >> above described multi-second lockups. >> >> By moving the GT interrupt handling to a tasklet in a most >> simple way, the problem above disappears completely. >> >> Testing the effect on sytem-wide latencies using >> igt/gem_syslatency shows the following before this patch: >> >> gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us >> gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us >> gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us >> gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us >> >> This shows that the unrelated process is experiencing huge >> delays in its wake-up latency. After the patch the results >> look like this: >> >> gem_syslatency: cycles=808907, latency mean=53.133us max=1640us >> gem_syslatency: cycles=862154, latency mean=62.778us max=2117us >> gem_syslatency: cycles=856039, latency mean=58.079us max=2123us >> gem_syslatency: cycles=841683, latency mean=56.914us max=1667us >> >> Showing a huge improvement in the unrelated process wake-up >> latency. It also shows an approximate halving in the number >> of total empty batches submitted during the test. This may >> not be worrying since the test puts the driver under >> a very unrealistic load with ncpu threads doing empty batch >> submission to all GPU engines each. >> >> Another benefit compared to the hard-irq handling is that now >> work on all engines can be dispatched in parallel since we can >> have up to number of CPUs active tasklets. (While previously >> a single hard-irq would serially dispatch on one engine after >> another.) >> >> More interesting scenario with regards to throughput is >> "gem_latency -n 100" which shows 25% better throughput and >> CPU usage, and 14% better dispatch latencies. >> >> I did not find any gains or regressions with Synmark2 or >> GLbench under light testing. More benchmarking is certainly >> required. >> >> v2: >> * execlists_lock should be taken as spin_lock_bh when >> queuing work from userspace now. (Chris Wilson) >> * uncore.lock must be taken with spin_lock_irq when >> submitting requests since that now runs from either >> softirq or process context. >> >> v3: >> * Expanded commit message with more testing data; >> * converted missed locking sites to _bh; >> * added execlist_lock comment. (Chris Wilson) >> >> v4: >> * Mention dispatch parallelism in commit. (Chris Wilson) >> * Do not hold uncore.lock over MMIO reads since the block >> is already serialised per-engine via the tasklet itself. >> (Chris Wilson) >> * intel_lrc_irq_handler should be static. (Chris Wilson) >> * Cancel/sync the tasklet on GPU reset. (Chris Wilson) >> * Document and WARN that tasklet cannot be active/pending >> on engine cleanup. 
(Chris Wilson/Imre Deak) >> >> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >> Cc: Chris Wilson <chris@chris-wilson.co.uk> >> Cc: Imre Deak <imre.deak@intel.com> >> Testcase: igt/gem_exec_nop/all >> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350 >> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> > > Didn't spot anything else in testing over the last week. There are a > number of follow up improvements we can make to intel_lrc_irq_handler() > to halve its execution time (minor improvement to execlist throughput, > major improvement to syslatency) which focus on streamlining the > register reads and context-unqueueing (but the patches I have depend on > esoteric features like struct_mutex-less requests). Would that be just the "drm/i915: Move releasing of the GEM request from free to retire/cancel" patch, or more? The former could be pulled out from your pile easily; even if it doesn't solve anything major on its own, it is a nice cleanup. Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* [RFC v2] drm/i915: Move execlists irq handler to a bottom half 2016-03-22 17:30 [RFC] drm/i915: Move execlists irq handler to a bottom half Tvrtko Ursulin 2016-03-23 7:02 ` ✓ Fi.CI.BAT: success for " Patchwork 2016-03-23 9:07 ` [RFC] " Daniel Vetter @ 2016-03-23 14:57 ` Tvrtko Ursulin 2016-03-24 10:56 ` Chris Wilson 2016-03-24 15:56 ` Imre Deak 2016-03-23 15:32 ` ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev2) Patchwork ` (2 subsequent siblings) 5 siblings, 2 replies; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-23 14:57 UTC (permalink / raw) To: Intel-gfx From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Doing a lot of work in the interrupt handler introduces huge latencies to the system as a whole. Most dramatic effect can be seen by running an all engine stress test like igt/gem_exec_nop/all where, when the kernel config is lean enough, the whole system can be brought into multi-second periods of complete non-interactivty. That can look for example like this: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] Modules linked in: [redacted for brevity] CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 Workqueue: i915 gen6_pm_rps_work [i915] task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 Call Trace: <IRQ> [<ffffffff8104a716>] irq_exit+0x86/0x90 [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 <EOI> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 [<ffffffff8105ab29>] process_one_work+0x139/0x350 [<ffffffff8105b186>] worker_thread+0x126/0x490 [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 [<ffffffff8105fa64>] kthread+0xc4/0xe0 [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 [<ffffffff8105f9a0>] ? 
kthread_create_on_node+0x170/0x170 I could not explain, or find a code path, which would explain a +20 second lockup, but from some instrumentation it was apparent the interrupts off proportion of time was between 10-25% under heavy load which is quite bad. By moving the GT interrupt handling to a tasklet in a most simple way, the problem above disappears completely. Also, gem_latency -n 100 shows 25% better throughput and CPU usage, and 14% better latencies. I did not find any gains or regressions with Synmark2 or GLbench under light testing. More benchmarking is certainly required. v2: * execlists_lock should be taken as spin_lock_bh when queuing work from userspace now. (Chris Wilson) * uncore.lock must be taken with spin_lock_irq when submitting requests since that now runs from either softirq or process context. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> --- drivers/gpu/drm/i915/i915_irq.c | 2 +- drivers/gpu/drm/i915/intel_lrc.c | 24 ++++++++++++++---------- drivers/gpu/drm/i915/intel_lrc.h | 1 - drivers/gpu/drm/i915/intel_ringbuffer.h | 1 + 4 files changed, 16 insertions(+), 12 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 8f3e3309c3ab..e68134347007 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -1324,7 +1324,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift) if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) notify_ring(engine); if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift)) - intel_lrc_irq_handler(engine); + tasklet_schedule(&engine->irq_tasklet); } static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv, diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 67592f8354d6..b3b62b3cd90d 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -418,20 +418,18 @@ static void execlists_submit_requests(struct drm_i915_gem_request *rq0, { struct drm_i915_private *dev_priv = rq0->i915; - /* BUG_ON(!irqs_disabled()); */ - execlists_update_context(rq0); if (rq1) execlists_update_context(rq1); - spin_lock(&dev_priv->uncore.lock); + spin_lock_irq(&dev_priv->uncore.lock); intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); execlists_elsp_write(rq0, rq1); intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); - spin_unlock(&dev_priv->uncore.lock); + spin_unlock_irq(&dev_priv->uncore.lock); } static void execlists_context_unqueue(struct intel_engine_cs *engine) @@ -536,13 +534,14 @@ get_context_status(struct drm_i915_private *dev_priv, u32 csb_base, /** * intel_lrc_irq_handler() - handle Context Switch interrupts - * @ring: Engine Command Streamer to handle. + * @engine: Engine Command Streamer to handle. * * Check the unread Context Status Buffers and manage the submission of new * contexts to the ELSP accordingly. 
*/ -void intel_lrc_irq_handler(struct intel_engine_cs *engine) +void intel_lrc_irq_handler(unsigned long data) { + struct intel_engine_cs *engine = (struct intel_engine_cs *)data; struct drm_i915_private *dev_priv = engine->dev->dev_private; u32 status_pointer; unsigned int read_pointer, write_pointer; @@ -551,7 +550,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) unsigned int csb_read = 0, i; unsigned int submit_contexts = 0; - spin_lock(&dev_priv->uncore.lock); + spin_lock_irq(&dev_priv->uncore.lock); intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine)); @@ -579,7 +578,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine) engine->next_context_status_buffer << 8)); intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); - spin_unlock(&dev_priv->uncore.lock); + spin_unlock_irq(&dev_priv->uncore.lock); spin_lock(&engine->execlist_lock); @@ -621,7 +620,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request) i915_gem_request_reference(request); - spin_lock_irq(&engine->execlist_lock); + spin_lock_bh(&engine->execlist_lock); list_for_each_entry(cursor, &engine->execlist_queue, execlist_link) if (++num_elements > 2) @@ -646,7 +645,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request) if (num_elements == 0) execlists_context_unqueue(engine); - spin_unlock_irq(&engine->execlist_lock); + spin_unlock_bh(&engine->execlist_lock); } static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req) @@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine) if (!intel_engine_initialized(engine)) return; + tasklet_kill(&engine->irq_tasklet); + dev_priv = engine->dev->dev_private; if (engine->buffer) { @@ -2089,6 +2090,9 @@ logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine) INIT_LIST_HEAD(&engine->execlist_retired_req_list); spin_lock_init(&engine->execlist_lock); + tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler, + (unsigned long)engine); + logical_ring_init_platform_invariants(engine); ret = i915_cmd_parser_init_ring(engine); diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h index 6690d93d603f..efcbd7bf9cc9 100644 --- a/drivers/gpu/drm/i915/intel_lrc.h +++ b/drivers/gpu/drm/i915/intel_lrc.h @@ -123,7 +123,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params, struct drm_i915_gem_execbuffer2 *args, struct list_head *vmas); -void intel_lrc_irq_handler(struct intel_engine_cs *engine); void intel_execlists_retire_requests(struct intel_engine_cs *engine); #endif /* _INTEL_LRC_H_ */ diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h index 221a94627aab..29810cba8a8c 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.h +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h @@ -266,6 +266,7 @@ struct intel_engine_cs { } semaphore; /* Execlists */ + struct tasklet_struct irq_tasklet; spinlock_t execlist_lock; struct list_head execlist_queue; struct list_head execlist_retired_req_list; -- 1.9.1 _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply related [flat|nested] 32+ messages in thread
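For reference, the mechanism the patch switches to is the standard Linux tasklet bottom half: the hard interrupt handler does the minimum and defers the context-status processing to softirq context, where it no longer contributes to interrupts-off time. Below is a minimal, self-contained sketch of that pattern with made-up names (my_engine, my_irq_handler and friends are placeholders, not the i915 structures); it illustrates the technique only, not the driver code itself.

  #include <linux/interrupt.h>
  #include <linux/spinlock.h>

  struct my_engine {
          struct tasklet_struct irq_tasklet;
          spinlock_t lock;                /* guards state shared with the tasklet */
  };

  /* Bottom half: runs in softirq context with interrupts enabled, so heavy
   * work here no longer adds IRQ-off latency to the rest of the system. */
  static void my_engine_tasklet(unsigned long data)
  {
          struct my_engine *engine = (struct my_engine *)data;

          spin_lock(&engine->lock);
          /* ... read context status and submit follow-up work here ... */
          spin_unlock(&engine->lock);
  }

  /* Top half: just kick the tasklet and return. */
  static irqreturn_t my_irq_handler(int irq, void *arg)
  {
          struct my_engine *engine = arg;

          tasklet_schedule(&engine->irq_tasklet);
          return IRQ_HANDLED;
  }

  static void my_engine_init(struct my_engine *engine)
  {
          spin_lock_init(&engine->lock);
          tasklet_init(&engine->irq_tasklet,
                       my_engine_tasklet, (unsigned long)engine);
  }

  static void my_engine_cleanup(struct my_engine *engine)
  {
          tasklet_kill(&engine->irq_tasklet);     /* flush a scheduled or running instance */
  }

A scheduled tasklet runs in softirq context on the CPU that scheduled it, shortly after the hard interrupt returns, which keeps dispatch latency low while letting other interrupts and the scheduler make progress in between.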
* Re: [RFC v2] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 14:57 ` [RFC v2] " Tvrtko Ursulin @ 2016-03-24 10:56 ` Chris Wilson 2016-03-24 11:50 ` Tvrtko Ursulin 2016-03-24 15:56 ` Imre Deak 1 sibling, 1 reply; 32+ messages in thread From: Chris Wilson @ 2016-03-24 10:56 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: Intel-gfx On Wed, Mar 23, 2016 at 02:57:36PM +0000, Tvrtko Ursulin wrote: > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > Doing a lot of work in the interrupt handler introduces huge > latencies to the system as a whole. > > Most dramatic effect can be seen by running an all engine > stress test like igt/gem_exec_nop/all where, when the kernel > config is lean enough, the whole system can be brought into > multi-second periods of complete non-interactivty. That can > look for example like this: > > NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] > Modules linked in: [redacted for brevity] > CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 > Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 > Workqueue: i915 gen6_pm_rps_work [i915] > task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 > RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 > RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 > RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 > RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 > RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 > R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 > R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 > FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Stack: > 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a > 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 > 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 > Call Trace: > <IRQ> > [<ffffffff8104a716>] irq_exit+0x86/0x90 > [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 > [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 > <EOI> > [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] > [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 > [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] > [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 > [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] > [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] > [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] > [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] > [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 > [<ffffffff8105ab29>] process_one_work+0x139/0x350 > [<ffffffff8105b186>] worker_thread+0x126/0x490 > [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 > [<ffffffff8105fa64>] kthread+0xc4/0xe0 > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > > I could not explain, or find a code path, which would explain > a +20 second lockup, but from some instrumentation it was > apparent the interrupts off proportion of time was between > 10-25% under heavy load which is quite bad. 
> > By moving the GT interrupt handling to a tasklet in a most > simple way, the problem above disappears completely. Perfect segue into gem_syslatency. I think gem_syslatency is the better tool to correlate disruptive system behaviour. And then continue on with gem_latency to demonstrate that it doesn't adversely affect our performance. > Also, gem_latency -n 100 shows 25% better throughput and CPU > usage, and 14% better latencies. Mention the benefits of parallelising dispatch. As fairly hit-and-miss as perf testing is on these machines, it is looking in favour of using tasklets vs the rt kthread. The numbers swing between 2-10%, but consistently improve in the nop sync latencies. There are still several hours to go in this run before we cover the dispatch latencies, but so far reasonable. (Hmm, looks like there may be a possible degradation on the single nop dispatch but an improvement on the continuous nop dispatch.) > I did not find any gains or regressions with Synmark2 or > GLbench under light testing. More benchmarking is certainly > required. > > v2: > * execlists_lock should be taken as spin_lock_bh when > queuing work from userspace now. (Chris Wilson) > * uncore.lock must be taken with spin_lock_irq when > submitting requests since that now runs from either > softirq or process context. There are a couple of execlist_lock usages outside of intel_lrc that may or may not be useful to convert (low frequency reset / debug paths, so way off the critical paths, but consistency in locking is invaluable). > >> + tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler, >> + (unsigned long)engine); I like trying to split lines to cluster arguments if possible. Here I think intel_lrc_irq_handler pairs with engine, tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler, (unsigned long)engine); *shrug* > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h > index 221a94627aab..29810cba8a8c 100644 > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h > @@ -266,6 +266,7 @@ struct intel_engine_cs { > } semaphore; > > /* Execlists */ > + struct tasklet_struct irq_tasklet; > spinlock_t execlist_lock; spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */ It's looking good, but once this run completes, I'm going to repeat it just to confirm how stable my numbers are. Critical bugfix, improvements, simpler patch than my kthread implementation, Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
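To spell out the locking point made above: once the execlist queue is consumed from a tasklet, any process-context user of the same lock has to keep bottom halves off its CPU (spin_lock_bh), otherwise the softirq could run on top of the critical section and deadlock on the lock; inside the tasklet itself a plain spin_lock is enough because softirqs do not nest on a CPU. A separate self-contained sketch of that discipline, again with generic stand-in names rather than the i915 code:

  #include <linux/interrupt.h>
  #include <linux/list.h>
  #include <linux/spinlock.h>

  struct my_engine {
          struct tasklet_struct irq_tasklet;
          spinlock_t queue_lock;  /* used inside tasklet, use spin_lock_bh */
          struct list_head queue;
  };

  /* Process context (e.g. the submission ioctl path). */
  static void my_queue_request(struct my_engine *engine, struct list_head *req)
  {
          spin_lock_bh(&engine->queue_lock);      /* blocks the local tasklet */
          list_add_tail(req, &engine->queue);
          spin_unlock_bh(&engine->queue_lock);

          tasklet_schedule(&engine->irq_tasklet);
  }

  /* Softirq context: cannot be preempted by another softirq on this CPU,
   * so a plain spin_lock() on the same lock is sufficient here. */
  static void my_engine_tasklet(unsigned long data)
  {
          struct my_engine *engine = (struct my_engine *)data;
          struct list_head *pos, *tmp;

          spin_lock(&engine->queue_lock);
          list_for_each_safe(pos, tmp, &engine->queue) {
                  list_del(pos);
                  /* ... submit the request to the hardware ... */
          }
          spin_unlock(&engine->queue_lock);
  }

The uncore.lock side of the patch goes the other way: since the submission path can now run from either softirq or process context, v2 takes that lock with spin_lock_irq there, as described in the changelog above.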
* Re: [RFC v2] drm/i915: Move execlists irq handler to a bottom half 2016-03-24 10:56 ` Chris Wilson @ 2016-03-24 11:50 ` Tvrtko Ursulin 2016-03-24 12:58 ` Tvrtko Ursulin 0 siblings, 1 reply; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-24 11:50 UTC (permalink / raw) To: Chris Wilson, Intel-gfx, Tvrtko Ursulin On 24/03/16 10:56, Chris Wilson wrote: > On Wed, Mar 23, 2016 at 02:57:36PM +0000, Tvrtko Ursulin wrote: >> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >> >> Doing a lot of work in the interrupt handler introduces huge >> latencies to the system as a whole. >> >> Most dramatic effect can be seen by running an all engine >> stress test like igt/gem_exec_nop/all where, when the kernel >> config is lean enough, the whole system can be brought into >> multi-second periods of complete non-interactivty. That can >> look for example like this: >> >> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] >> Modules linked in: [redacted for brevity] >> CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 >> Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 >> Workqueue: i915 gen6_pm_rps_work [i915] >> task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 >> RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 >> RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 >> RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 >> RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 >> RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 >> R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 >> R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 >> FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Stack: >> 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a >> 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 >> 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 >> Call Trace: >> <IRQ> >> [<ffffffff8104a716>] irq_exit+0x86/0x90 >> [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 >> [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 >> <EOI> >> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] >> [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 >> [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] >> [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 >> [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] >> [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] >> [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] >> [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] >> [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 >> [<ffffffff8105ab29>] process_one_work+0x139/0x350 >> [<ffffffff8105b186>] worker_thread+0x126/0x490 >> [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 >> [<ffffffff8105fa64>] kthread+0xc4/0xe0 >> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 >> [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 >> [<ffffffff8105f9a0>] ? 
kthread_create_on_node+0x170/0x170 >> >> I could not explain, or find a code path, which would explain >> a +20 second lockup, but from some instrumentation it was >> apparent the interrupts off proportion of time was between >> 10-25% under heavy load which is quite bad. >> >> By moving the GT interrupt handling to a tasklet in a most >> simple way, the problem above disappears completely. > > Perfect segue into gem_syslatency. I think gem_syslatency is the better > tool to correlate disruptive system behaviour. And then continue on with > gem_latency to demonstrate that is doesn't adversely affect our > performance. Will do. >> Also, gem_latency -n 100 shows 25% better throughput and CPU >> usage, and 14% better latencies. > > Mention the benefits of parallelising dispatch. Hm, actually this should be the same as before I think. > As fairly hit-and-miss as perf testing is on these machines, it is > looking in favour of using tasklets vs the rt kthread. The numbers swing > between 2-10%, but consistently improves in the nop sync latencies. > There's still several hours to go in this run before we cover the > dispatch latenies, but so far reasonable. > > (Hmm, looks like there may be a possible degredation on the single nop > dispatch but an improvement on the continuous nop dispatch.) We can add all the numbers you get to the commit message as well. >> I did not find any gains or regressions with Synmark2 or >> GLbench under light testing. More benchmarking is certainly >> required. >> >> v2: >> * execlists_lock should be taken as spin_lock_bh when >> queuing work from userspace now. (Chris Wilson) >> * uncore.lock must be taken with spin_lock_irq when >> submitting requests since that now runs from either >> softirq or process context. > > There are a couple of execlist_lock usage outside of intel_lrc that may > or may not be useful to convert (low frequency reset / debug paths, so > way off the critical paths, but consistency in locking is invaluable). Oh right, I've missed those. > >> + tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler, >> + (unsigned long)engine); > > I like trying to split lines to cluster arguments if possible. Here I > think intel_lrc_irq_handler pairs with engine, > > tasklet_init(&engine->irq_tasklet, > intel_lrc_irq_handler, (unsigned long)engine); > > *shrug* Yeah it is nicer. >> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h >> index 221a94627aab..29810cba8a8c 100644 >> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h >> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h >> @@ -266,6 +266,7 @@ struct intel_engine_cs { >> } semaphore; >> >> /* Execlists */ >> + struct tasklet_struct irq_tasklet; >> spinlock_t execlist_lock; > > spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */ Will do. > It's looking good, but once this run completes, I'm going to repeat it > just to confirm how stable my numbers are. > > Critical bugfix, improvements, simpler patch than my kthread > implementation, > Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Okay I will respin with the above and we'll see. Unfortunately my test platform just died so there will be a delay. Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
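On the execlist_lock users outside intel_lrc mentioned above, the conversion would be mechanical: any process-context path that takes the lock (the low-frequency reset/debug paths) switches from spin_lock_irq()/spin_lock_irqsave() to spin_lock_bh() so the locking stays consistent with the tasklet. A hypothetical example, reusing the generic my_engine names from the sketches above; the helper and its body are illustrative only, not taken from the driver:

  /* Hypothetical reset/debug-path helper, not i915 code. */
  static void my_engine_drop_queued_requests(struct my_engine *engine)
  {
          struct list_head *pos, *tmp;

          /* was: spin_lock_irq(&engine->queue_lock); */
          spin_lock_bh(&engine->queue_lock);
          list_for_each_safe(pos, tmp, &engine->queue)
                  list_del_init(pos);     /* discard stale submissions */
          spin_unlock_bh(&engine->queue_lock);
  }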
* Re: [RFC v2] drm/i915: Move execlists irq handler to a bottom half 2016-03-24 11:50 ` Tvrtko Ursulin @ 2016-03-24 12:58 ` Tvrtko Ursulin 0 siblings, 0 replies; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-24 12:58 UTC (permalink / raw) To: Chris Wilson, Intel-gfx, Tvrtko Ursulin On 24/03/16 11:50, Tvrtko Ursulin wrote: >>> Also, gem_latency -n 100 shows 25% better throughput and CPU >>> usage, and 14% better latencies. >> >> Mention the benefits of parallelising dispatch. > > Hm, actually this should be the same as before I think. Of course not, silly me. Will add this at the next opportunity then. Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC v2] drm/i915: Move execlists irq handler to a bottom half 2016-03-23 14:57 ` [RFC v2] " Tvrtko Ursulin 2016-03-24 10:56 ` Chris Wilson @ 2016-03-24 15:56 ` Imre Deak 2016-03-24 16:05 ` Chris Wilson 1 sibling, 1 reply; 32+ messages in thread From: Imre Deak @ 2016-03-24 15:56 UTC (permalink / raw) To: Tvrtko Ursulin, Intel-gfx On ke, 2016-03-23 at 14:57 +0000, Tvrtko Ursulin wrote: > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > Doing a lot of work in the interrupt handler introduces huge > latencies to the system as a whole. > > Most dramatic effect can be seen by running an all engine > stress test like igt/gem_exec_nop/all where, when the kernel > config is lean enough, the whole system can be brought into > multi-second periods of complete non-interactivty. That can > look for example like this: > > NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! > [kworker/u8:3:143] > Modules linked in: [redacted for brevity] > CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0- > 160321+ #183 > Hardware name: Intel Corporation Broadwell Client platform/WhiteTip > Mountain 1 > Workqueue: i915 gen6_pm_rps_work [i915] > task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: > ffff8800aae90000 > RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] > __do_softirq+0x72/0x1d0 > RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 > RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 > RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 > RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 > R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 > R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 > FS: 0000000000000000(0000) GS:ffff88014f400000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Stack: > 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a > 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 > 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 > Call Trace: > <IRQ> > [<ffffffff8104a716>] irq_exit+0x86/0x90 > [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 > [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 > <EOI> > [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] > [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 > [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] > [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 > [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] > [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] > [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] > [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] > [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 > [<ffffffff8105ab29>] process_one_work+0x139/0x350 > [<ffffffff8105b186>] worker_thread+0x126/0x490 > [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 > [<ffffffff8105fa64>] kthread+0xc4/0xe0 > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > > I could not explain, or find a code path, which would explain > a +20 second lockup, but from some instrumentation it was > apparent the interrupts off proportion of time was between > 10-25% under heavy load which is quite bad. 
> > By moving the GT interrupt handling to a tasklet in a most > simple way, the problem above disappears completely. > > Also, gem_latency -n 100 shows 25% better throughput and CPU > usage, and 14% better latencies. > > I did not find any gains or regressions with Synmark2 or > GLbench under light testing. More benchmarking is certainly > required. > > v2: > * execlists_lock should be taken as spin_lock_bh when > queuing work from userspace now. (Chris Wilson) > * uncore.lock must be taken with spin_lock_irq when > submitting requests since that now runs from either > softirq or process context. > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> You also have to synchronize against the tasklet now whenever we synchronize against the IRQ, see gen6_disable_rps_interrupts(), gen8_irq_power_well_pre_disable() and intel_runtime_pm_disable_interrupts(). Not saying you should use a threaded IRQ instead, but it does provide for this automatically. --Imre > --- > drivers/gpu/drm/i915/i915_irq.c | 2 +- > drivers/gpu/drm/i915/intel_lrc.c | 24 ++++++++++++++--------- > - > drivers/gpu/drm/i915/intel_lrc.h | 1 - > drivers/gpu/drm/i915/intel_ringbuffer.h | 1 + > 4 files changed, 16 insertions(+), 12 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_irq.c > b/drivers/gpu/drm/i915/i915_irq.c > index 8f3e3309c3ab..e68134347007 100644 > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -1324,7 +1324,7 @@ gen8_cs_irq_handler(struct intel_engine_cs > *engine, u32 iir, int test_shift) > if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) > notify_ring(engine); > if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift)) > - intel_lrc_irq_handler(engine); > + tasklet_schedule(&engine->irq_tasklet); > } > > static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private > *dev_priv, > diff --git a/drivers/gpu/drm/i915/intel_lrc.c > b/drivers/gpu/drm/i915/intel_lrc.c > index 67592f8354d6..b3b62b3cd90d 100644 > --- a/drivers/gpu/drm/i915/intel_lrc.c > +++ b/drivers/gpu/drm/i915/intel_lrc.c > @@ -418,20 +418,18 @@ static void execlists_submit_requests(struct > drm_i915_gem_request *rq0, > { > struct drm_i915_private *dev_priv = rq0->i915; > > - /* BUG_ON(!irqs_disabled()); */ > - > execlists_update_context(rq0); > > if (rq1) > execlists_update_context(rq1); > > - spin_lock(&dev_priv->uncore.lock); > + spin_lock_irq(&dev_priv->uncore.lock); > intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); > > execlists_elsp_write(rq0, rq1); > > intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); > - spin_unlock(&dev_priv->uncore.lock); > + spin_unlock_irq(&dev_priv->uncore.lock); > } > > static void execlists_context_unqueue(struct intel_engine_cs > *engine) > @@ -536,13 +534,14 @@ get_context_status(struct drm_i915_private > *dev_priv, u32 csb_base, > > /** > * intel_lrc_irq_handler() - handle Context Switch interrupts > - * @ring: Engine Command Streamer to handle. > + * @engine: Engine Command Streamer to handle. > * > * Check the unread Context Status Buffers and manage the submission > of new > * contexts to the ELSP accordingly. 
> */ > -void intel_lrc_irq_handler(struct intel_engine_cs *engine) > +void intel_lrc_irq_handler(unsigned long data) > { > + struct intel_engine_cs *engine = (struct intel_engine_cs > *)data; > struct drm_i915_private *dev_priv = engine->dev- > >dev_private; > u32 status_pointer; > unsigned int read_pointer, write_pointer; > @@ -551,7 +550,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs > *engine) > unsigned int csb_read = 0, i; > unsigned int submit_contexts = 0; > > - spin_lock(&dev_priv->uncore.lock); > + spin_lock_irq(&dev_priv->uncore.lock); > intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL); > > status_pointer = > I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine)); > @@ -579,7 +578,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs > *engine) > engine- > >next_context_status_buffer << 8)); > > intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL); > - spin_unlock(&dev_priv->uncore.lock); > + spin_unlock_irq(&dev_priv->uncore.lock); > > spin_lock(&engine->execlist_lock); > > @@ -621,7 +620,7 @@ static void execlists_context_queue(struct > drm_i915_gem_request *request) > > i915_gem_request_reference(request); > > - spin_lock_irq(&engine->execlist_lock); > + spin_lock_bh(&engine->execlist_lock); > > list_for_each_entry(cursor, &engine->execlist_queue, > execlist_link) > if (++num_elements > 2) > @@ -646,7 +645,7 @@ static void execlists_context_queue(struct > drm_i915_gem_request *request) > if (num_elements == 0) > execlists_context_unqueue(engine); > > - spin_unlock_irq(&engine->execlist_lock); > + spin_unlock_bh(&engine->execlist_lock); > } > > static int logical_ring_invalidate_all_caches(struct > drm_i915_gem_request *req) > @@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct > intel_engine_cs *engine) > if (!intel_engine_initialized(engine)) > return; > > + tasklet_kill(&engine->irq_tasklet); > + > dev_priv = engine->dev->dev_private; > > if (engine->buffer) { > @@ -2089,6 +2090,9 @@ logical_ring_init(struct drm_device *dev, > struct intel_engine_cs *engine) > INIT_LIST_HEAD(&engine->execlist_retired_req_list); > spin_lock_init(&engine->execlist_lock); > > + tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler, > + (unsigned long)engine); > + > logical_ring_init_platform_invariants(engine); > > ret = i915_cmd_parser_init_ring(engine); > diff --git a/drivers/gpu/drm/i915/intel_lrc.h > b/drivers/gpu/drm/i915/intel_lrc.h > index 6690d93d603f..efcbd7bf9cc9 100644 > --- a/drivers/gpu/drm/i915/intel_lrc.h > +++ b/drivers/gpu/drm/i915/intel_lrc.h > @@ -123,7 +123,6 @@ int intel_execlists_submission(struct > i915_execbuffer_params *params, > struct drm_i915_gem_execbuffer2 > *args, > struct list_head *vmas); > > -void intel_lrc_irq_handler(struct intel_engine_cs *engine); > void intel_execlists_retire_requests(struct intel_engine_cs > *engine); > > #endif /* _INTEL_LRC_H_ */ > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h > b/drivers/gpu/drm/i915/intel_ringbuffer.h > index 221a94627aab..29810cba8a8c 100644 > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h > @@ -266,6 +266,7 @@ struct intel_engine_cs { > } semaphore; > > /* Execlists */ > + struct tasklet_struct irq_tasklet; > spinlock_t execlist_lock; > struct list_head execlist_queue; > struct list_head execlist_retired_req_list; _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
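The general rule behind this comment: masking or disabling the interrupt source only stops new top-half invocations; a tasklet that is already scheduled or running is unaffected, so any teardown path that must know the handler has finished has to wait for the bottom half as well. As a hedged illustration only, with placeholder names and no claim that this is what i915 itself needs to add, such an ordering could look like:

  #include <linux/interrupt.h>

  /* Illustrative teardown ordering for an interrupt with a tasklet bottom half. */
  static void my_engine_irq_teardown(struct my_engine *engine, unsigned int irq)
  {
          disable_irq(irq);                       /* no further top-half runs */
          tasklet_kill(&engine->irq_tasklet);     /* wait out anything scheduled or running */

          /* If the bottom half only needs to be paused rather than flushed,
           * tasklet_disable()/tasklet_enable() would be the tool instead. */
  }

A threaded IRQ (request_threaded_irq) gets the equivalent waiting from free_irq()/disable_irq(), which is presumably the convenience being referred to here.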
* Re: [RFC v2] drm/i915: Move execlists irq handler to a bottom half 2016-03-24 15:56 ` Imre Deak @ 2016-03-24 16:05 ` Chris Wilson 2016-03-24 16:40 ` Imre Deak 0 siblings, 1 reply; 32+ messages in thread From: Chris Wilson @ 2016-03-24 16:05 UTC (permalink / raw) To: Imre Deak; +Cc: Intel-gfx On Thu, Mar 24, 2016 at 05:56:40PM +0200, Imre Deak wrote: > On ke, 2016-03-23 at 14:57 +0000, Tvrtko Ursulin wrote: > > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > > > Doing a lot of work in the interrupt handler introduces huge > > latencies to the system as a whole. > > > > Most dramatic effect can be seen by running an all engine > > stress test like igt/gem_exec_nop/all where, when the kernel > > config is lean enough, the whole system can be brought into > > multi-second periods of complete non-interactivty. That can > > look for example like this: > > > > NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! > > [kworker/u8:3:143] > > Modules linked in: [redacted for brevity] > > CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0- > > 160321+ #183 > > Hardware name: Intel Corporation Broadwell Client platform/WhiteTip > > Mountain 1 > > Workqueue: i915 gen6_pm_rps_work [i915] > > task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: > > ffff8800aae90000 > > RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] > > __do_softirq+0x72/0x1d0 > > RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 > > RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 > > RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 > > RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 > > R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 > > R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 > > FS: 0000000000000000(0000) GS:ffff88014f400000(0000) > > knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > Stack: > > 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a > > 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 > > 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 > > Call Trace: > > <IRQ> > > [<ffffffff8104a716>] irq_exit+0x86/0x90 > > [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 > > [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 > > <EOI> > > [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] > > [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 > > [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] > > [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 > > [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] > > [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] > > [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] > > [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] > > [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 > > [<ffffffff8105ab29>] process_one_work+0x139/0x350 > > [<ffffffff8105b186>] worker_thread+0x126/0x490 > > [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 > > [<ffffffff8105fa64>] kthread+0xc4/0xe0 > > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > > [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 > > [<ffffffff8105f9a0>] ? 
kthread_create_on_node+0x170/0x170 > > > > I could not explain, or find a code path, which would explain > > a +20 second lockup, but from some instrumentation it was > > apparent the interrupts off proportion of time was between > > 10-25% under heavy load which is quite bad. > > > > By moving the GT interrupt handling to a tasklet in a most > > simple way, the problem above disappears completely. > > > > Also, gem_latency -n 100 shows 25% better throughput and CPU > > usage, and 14% better latencies. > > > > I did not find any gains or regressions with Synmark2 or > > GLbench under light testing. More benchmarking is certainly > > required. > > > > v2: > > * execlists_lock should be taken as spin_lock_bh when > > queuing work from userspace now. (Chris Wilson) > > * uncore.lock must be taken with spin_lock_irq when > > submitting requests since that now runs from either > > softirq or process context. > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > You also have to synchronize against the tasklet now whenever we > synchronize against the IRQ, see gen6_disable_rps_interrupts(), > gen8_irq_power_well_pre_disable() and > intel_runtime_pm_disable_interrupts(). Not saying you should use a > threaded IRQ instead, but it does provide for this automatically. But we don't synchronize against the irq for execlists since this tasklet is guarded by the rpm wakeref (though mark_busy / mark_idle) and we stop it before we finally release the irq. Or have I missed something? -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC v2] drm/i915: Move execlists irq handler to a bottom half 2016-03-24 16:05 ` Chris Wilson @ 2016-03-24 16:40 ` Imre Deak 2016-03-24 19:56 ` Chris Wilson 0 siblings, 1 reply; 32+ messages in thread From: Imre Deak @ 2016-03-24 16:40 UTC (permalink / raw) To: Chris Wilson; +Cc: Intel-gfx On to, 2016-03-24 at 16:05 +0000, Chris Wilson wrote: > On Thu, Mar 24, 2016 at 05:56:40PM +0200, Imre Deak wrote: > > On ke, 2016-03-23 at 14:57 +0000, Tvrtko Ursulin wrote: > > > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > > > > > Doing a lot of work in the interrupt handler introduces huge > > > latencies to the system as a whole. > > > > > > Most dramatic effect can be seen by running an all engine > > > stress test like igt/gem_exec_nop/all where, when the kernel > > > config is lean enough, the whole system can be brought into > > > multi-second periods of complete non-interactivty. That can > > > look for example like this: > > > > > > NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! > > > [kworker/u8:3:143] > > > Modules linked in: [redacted for brevity] > > > CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: > > > G U L 4.5.0- > > > 160321+ #183 > > > Hardware name: Intel Corporation Broadwell Client > > > platform/WhiteTip > > > Mountain 1 > > > Workqueue: i915 gen6_pm_rps_work [i915] > > > task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: > > > ffff8800aae90000 > > > RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] > > > __do_softirq+0x72/0x1d0 > > > RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 > > > RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: > > > 00000000000006e0 > > > RDX: 0000000000000020 RSI: 0000000004208060 RDI: > > > 0000000000215d80 > > > RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: > > > 0000000000000022 > > > R10: 0000000000000004 R11: 00000000ffffffff R12: > > > 000000000000a030 > > > R13: 0000000000000082 R14: ffff8800aa4d0080 R15: > > > 0000000000000082 > > > FS: 0000000000000000(0000) GS:ffff88014f400000(0000) > > > knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: > > > 00000000001406f0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > 0000000000000400 > > > Stack: > > > 042080601b33869f ffff8800aae94000 00000000fffc2678 > > > ffff88010000000a > > > 0000000000000000 000000000000a030 0000000000005302 > > > ffff8800aa4d0080 > > > 0000000000000206 ffff88014f403f90 ffffffff8104a716 > > > ffff88014f403fa8 > > > Call Trace: > > > <IRQ> > > > [<ffffffff8104a716>] irq_exit+0x86/0x90 > > > [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 > > > [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 > > > <EOI> > > > [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] > > > [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 > > > [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] > > > [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 > > > [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] > > > [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] > > > [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] > > > [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] > > > [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 > > > [<ffffffff8105ab29>] process_one_work+0x139/0x350 > > > [<ffffffff8105b186>] worker_thread+0x126/0x490 > > > [<ffffffff8105b060>] ? 
rescuer_thread+0x320/0x320 > > > [<ffffffff8105fa64>] kthread+0xc4/0xe0 > > > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > > > [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 > > > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 > > > > > > I could not explain, or find a code path, which would explain > > > a +20 second lockup, but from some instrumentation it was > > > apparent the interrupts off proportion of time was between > > > 10-25% under heavy load which is quite bad. > > > > > > By moving the GT interrupt handling to a tasklet in a most > > > simple way, the problem above disappears completely. > > > > > > Also, gem_latency -n 100 shows 25% better throughput and CPU > > > usage, and 14% better latencies. > > > > > > I did not find any gains or regressions with Synmark2 or > > > GLbench under light testing. More benchmarking is certainly > > > required. > > > > > > v2: > > > * execlists_lock should be taken as spin_lock_bh when > > > queuing work from userspace now. (Chris Wilson) > > > * uncore.lock must be taken with spin_lock_irq when > > > submitting requests since that now runs from either > > > softirq or process context. > > > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > > > You also have to synchronize against the tasklet now whenever we > > synchronize against the IRQ, see gen6_disable_rps_interrupts(), > > gen8_irq_power_well_pre_disable() and > > intel_runtime_pm_disable_interrupts(). Not saying you should use a > > threaded IRQ instead, but it does provide for this automatically. > > But we don't synchronize against the irq for execlists since this > tasklet is guarded by the rpm wakeref (though mark_busy / mark_idle) > and we stop it before we finally release the irq. Hm yea, I missed that it's only an execlist tasklet and so there shouldn't be any pending tasklet after mark_idle(). Perhaps it would still make sense to assert for this in gen8_logical_ring_put_irq() or somewhere? Similarly there is a tasklet_kill() in intel_logical_ring_cleanup(), but there shouldn't be any pending tasklet there either, so should we add an assert there too? --Imre > Or have I missed something? > -Chris > _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC v2] drm/i915: Move execlists irq handler to a bottom half 2016-03-24 16:40 ` Imre Deak @ 2016-03-24 19:56 ` Chris Wilson 2016-03-24 22:13 ` Imre Deak 0 siblings, 1 reply; 32+ messages in thread From: Chris Wilson @ 2016-03-24 19:56 UTC (permalink / raw) To: Imre Deak; +Cc: Intel-gfx On Thu, Mar 24, 2016 at 06:40:55PM +0200, Imre Deak wrote: > Hm yea, I missed that it's only an execlist tasklet and so there > shouldn't be any pending tasklet after mark_idle(). Perhaps it would > still make sense to assert for this in gen8_logical_ring_put_irq() or > somewhere? Similarly there is a tasklet_kill() in > intel_logical_ring_cleanup(), but there shouldn't be any pending > tasklet there either, so should we add an assert there too? Yes, tasklet_kill() should be a nop. We could if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &tasklet->state))) tasklet_kill(&tasklet); I don't see a particular sensible spot to assert that the engines are off before irq uninstall other than the assertions we have in execlists that irqs are actually enabled when we try to submit, and the battery of WARNs we have for trying to access the hardware whilst !rpm. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC v2] drm/i915: Move execlists irq handler to a bottom half 2016-03-24 19:56 ` Chris Wilson @ 2016-03-24 22:13 ` Imre Deak 0 siblings, 0 replies; 32+ messages in thread From: Imre Deak @ 2016-03-24 22:13 UTC (permalink / raw) To: Chris Wilson; +Cc: Intel-gfx On Thu, 2016-03-24 at 19:56 +0000, Chris Wilson wrote: > On Thu, Mar 24, 2016 at 06:40:55PM +0200, Imre Deak wrote: > > Hm yea, I missed that it's only an execlist tasklet and so there > > shouldn't be any pending tasklet after mark_idle(). Perhaps it > > would > > still make sense to assert for this in gen8_logical_ring_put_irq() > > or > > somewhere? Similarly there is a tasklet_kill() in > > intel_logical_ring_cleanup(), but there shouldn't be any pending > > tasklet there either, so should we add an assert there too? > > Yes, tasklet_kill() should be a nop. We could > > if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &tasklet->state))) > tasklet_kill(&tasklet); > > I don't see a particular sensible spot to assert that the engines are > off before irq uninstall other than the assertions we have in > execlists > that irqs are actually enabled when we try to submit, and the battery > of WARNs we have for trying to access the hardware whilst !rpm. Ok, this was just a hand-wavy idea then; also, this tasklet isn't much different from other work we schedule from the interrupt handler, and we don't have special checks for those either. The above WARN_ON would still be useful for documentation imo. --Imre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
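In self-contained form, the assert-then-kill idea from the exchange above could be written as the sketch below; the helper name is made up, and whether the existing tasklet_kill() site in intel_logical_ring_cleanup() is the right home for it is left open, as in the discussion:

  #include <linux/bitops.h>
  #include <linux/bug.h>
  #include <linux/interrupt.h>

  /* Warn if a bottom half is still pending at cleanup time (it should be a
   * nop by then) and, if so, flush it so teardown can proceed safely. */
  static void my_kill_engine_tasklet(struct tasklet_struct *t)
  {
          if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &t->state)))
                  tasklet_kill(t);
  }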
* ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev2) 2016-03-22 17:30 [RFC] drm/i915: Move execlists irq handler to a bottom half Tvrtko Ursulin ` (2 preceding siblings ...) 2016-03-23 14:57 ` [RFC v2] " Tvrtko Ursulin @ 2016-03-23 15:32 ` Patchwork 2016-03-24 14:03 ` ✗ Fi.CI.BAT: warning for drm/i915: Move execlists irq handler to a bottom half (rev3) Patchwork 2016-04-04 12:33 ` ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev4) Patchwork 5 siblings, 0 replies; 32+ messages in thread From: Patchwork @ 2016-03-23 15:32 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: intel-gfx == Series Details == Series: drm/i915: Move execlists irq handler to a bottom half (rev2) URL : https://patchwork.freedesktop.org/series/4764/ State : failure == Summary == Series 4764v2 drm/i915: Move execlists irq handler to a bottom half http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/2/mbox/ Test kms_flip: Subgroup basic-flip-vs-dpms: pass -> DMESG-WARN (ilk-hp8440p) UNSTABLE Test pm_rpm: Subgroup basic-pci-d3-state: dmesg-warn -> PASS (byt-nuc) Subgroup basic-rte: dmesg-warn -> PASS (bsw-nuc-2) bdw-nuci7 total:192 pass:180 dwarn:0 dfail:0 fail:0 skip:12 bdw-ultra total:192 pass:171 dwarn:0 dfail:0 fail:0 skip:21 bsw-nuc-2 total:192 pass:155 dwarn:0 dfail:0 fail:0 skip:37 byt-nuc total:192 pass:157 dwarn:0 dfail:0 fail:0 skip:35 hsw-brixbox total:192 pass:170 dwarn:0 dfail:0 fail:0 skip:22 hsw-gt2 total:192 pass:175 dwarn:0 dfail:0 fail:0 skip:17 ilk-hp8440p total:192 pass:128 dwarn:1 dfail:0 fail:0 skip:63 ivb-t430s total:192 pass:167 dwarn:0 dfail:0 fail:0 skip:25 skl-i7k-2 total:192 pass:169 dwarn:0 dfail:0 fail:0 skip:23 skl-nuci5 total:192 pass:181 dwarn:0 dfail:0 fail:0 skip:11 snb-x220t total:192 pass:158 dwarn:0 dfail:0 fail:1 skip:33 Results at /archive/results/CI_IGT_test/Patchwork_1691/ 6f788978796a35996bea8795db70f9c4194b70a2 drm-intel-nightly: 2016y-03m-23d-12h-24m-24s UTC integration manifest 6a4cbb9de86d4702e036d7d00255174bec59f736 drm/i915: Move execlists irq handler to a bottom half _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* ✗ Fi.CI.BAT: warning for drm/i915: Move execlists irq handler to a bottom half (rev3) 2016-03-22 17:30 [RFC] drm/i915: Move execlists irq handler to a bottom half Tvrtko Ursulin ` (3 preceding siblings ...) 2016-03-23 15:32 ` ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev2) Patchwork @ 2016-03-24 14:03 ` Patchwork 2016-03-24 15:17 ` Tvrtko Ursulin 2016-04-04 12:33 ` ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev4) Patchwork 5 siblings, 1 reply; 32+ messages in thread From: Patchwork @ 2016-03-24 14:03 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: intel-gfx == Series Details == Series: drm/i915: Move execlists irq handler to a bottom half (rev3) URL : https://patchwork.freedesktop.org/series/4764/ State : warning == Summary == Series 4764v3 drm/i915: Move execlists irq handler to a bottom half http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/3/mbox/ Test gem_exec_suspend: Subgroup basic-s3: dmesg-warn -> PASS (bsw-nuc-2) Test pm_rpm: Subgroup basic-pci-d3-state: dmesg-warn -> PASS (bsw-nuc-2) pass -> DMESG-WARN (byt-nuc) Subgroup basic-rte: dmesg-warn -> PASS (byt-nuc) UNSTABLE bdw-nuci7 total:192 pass:179 dwarn:0 dfail:0 fail:1 skip:12 bdw-ultra total:192 pass:170 dwarn:0 dfail:0 fail:1 skip:21 bsw-nuc-2 total:192 pass:155 dwarn:0 dfail:0 fail:0 skip:37 byt-nuc total:192 pass:156 dwarn:1 dfail:0 fail:0 skip:35 hsw-brixbox total:192 pass:170 dwarn:0 dfail:0 fail:0 skip:22 hsw-gt2 total:192 pass:175 dwarn:0 dfail:0 fail:0 skip:17 ivb-t430s total:192 pass:167 dwarn:0 dfail:0 fail:0 skip:25 skl-i7k-2 total:192 pass:169 dwarn:0 dfail:0 fail:0 skip:23 skl-nuci5 total:192 pass:181 dwarn:0 dfail:0 fail:0 skip:11 snb-dellxps total:192 pass:158 dwarn:0 dfail:0 fail:0 skip:34 snb-x220t total:192 pass:158 dwarn:0 dfail:0 fail:1 skip:33 Results at /archive/results/CI_IGT_test/Patchwork_1707/ 83ec122b900baae1aca2bc11eedc28f2d9ea5060 drm-intel-nightly: 2016y-03m-24d-12h-48m-43s UTC integration manifest b4a4e726b4f10b0782c821bf73c945533ec882e8 drm/i915: Move execlists irq handler to a bottom half _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ✗ Fi.CI.BAT: warning for drm/i915: Move execlists irq handler to a bottom half (rev3) 2016-03-24 14:03 ` ✗ Fi.CI.BAT: warning for drm/i915: Move execlists irq handler to a bottom half (rev3) Patchwork @ 2016-03-24 15:17 ` Tvrtko Ursulin 0 siblings, 0 replies; 32+ messages in thread From: Tvrtko Ursulin @ 2016-03-24 15:17 UTC (permalink / raw) To: intel-gfx On 24/03/16 14:03, Patchwork wrote: > == Series Details == > > Series: drm/i915: Move execlists irq handler to a bottom half (rev3) > URL : https://patchwork.freedesktop.org/series/4764/ > State : warning > > == Summary == > > Series 4764v3 drm/i915: Move execlists irq handler to a bottom half > http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/3/mbox/ > > Test gem_exec_suspend: > Subgroup basic-s3: > dmesg-warn -> PASS (bsw-nuc-2) > Test pm_rpm: > Subgroup basic-pci-d3-state: > dmesg-warn -> PASS (bsw-nuc-2) > pass -> DMESG-WARN (byt-nuc) Unclaimed register prior to suspending on BYT: https://bugs.freedesktop.org/show_bug.cgi?id=94164 > Subgroup basic-rte: > dmesg-warn -> PASS (byt-nuc) UNSTABLE > > bdw-nuci7 total:192 pass:179 dwarn:0 dfail:0 fail:1 skip:12 > bdw-ultra total:192 pass:170 dwarn:0 dfail:0 fail:1 skip:21 > bsw-nuc-2 total:192 pass:155 dwarn:0 dfail:0 fail:0 skip:37 > byt-nuc total:192 pass:156 dwarn:1 dfail:0 fail:0 skip:35 > hsw-brixbox total:192 pass:170 dwarn:0 dfail:0 fail:0 skip:22 > hsw-gt2 total:192 pass:175 dwarn:0 dfail:0 fail:0 skip:17 > ivb-t430s total:192 pass:167 dwarn:0 dfail:0 fail:0 skip:25 > skl-i7k-2 total:192 pass:169 dwarn:0 dfail:0 fail:0 skip:23 > skl-nuci5 total:192 pass:181 dwarn:0 dfail:0 fail:0 skip:11 > snb-dellxps total:192 pass:158 dwarn:0 dfail:0 fail:0 skip:34 > snb-x220t total:192 pass:158 dwarn:0 dfail:0 fail:1 skip:33 > > Results at /archive/results/CI_IGT_test/Patchwork_1707/ > > 83ec122b900baae1aca2bc11eedc28f2d9ea5060 drm-intel-nightly: 2016y-03m-24d-12h-48m-43s UTC integration manifest > b4a4e726b4f10b0782c821bf73c945533ec882e8 drm/i915: Move execlists irq handler to a bottom half Sooo... who dares to merge this? It kind of looks too simple not to result in some fallout. Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev4) 2016-03-22 17:30 [RFC] drm/i915: Move execlists irq handler to a bottom half Tvrtko Ursulin ` (4 preceding siblings ...) 2016-03-24 14:03 ` ✗ Fi.CI.BAT: warning for drm/i915: Move execlists irq handler to a bottom half (rev3) Patchwork @ 2016-04-04 12:33 ` Patchwork 2016-04-04 12:42 ` Tvrtko Ursulin 5 siblings, 1 reply; 32+ messages in thread From: Patchwork @ 2016-04-04 12:33 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: intel-gfx == Series Details == Series: drm/i915: Move execlists irq handler to a bottom half (rev4) URL : https://patchwork.freedesktop.org/series/4764/ State : failure == Summary == Series 4764v4 drm/i915: Move execlists irq handler to a bottom half http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/4/mbox/ Test gem_sync: Subgroup basic-bsd: pass -> DMESG-FAIL (ilk-hp8440p) Test kms_pipe_crc_basic: Subgroup suspend-read-crc-pipe-a: incomplete -> PASS (skl-nuci5) bdw-nuci7 total:196 pass:184 dwarn:0 dfail:0 fail:0 skip:12 bdw-ultra total:196 pass:175 dwarn:0 dfail:0 fail:0 skip:21 bsw-nuc-2 total:196 pass:159 dwarn:0 dfail:0 fail:0 skip:37 byt-nuc total:196 pass:161 dwarn:0 dfail:0 fail:0 skip:35 hsw-brixbox total:196 pass:174 dwarn:0 dfail:0 fail:0 skip:22 hsw-gt2 total:196 pass:179 dwarn:0 dfail:0 fail:0 skip:17 ilk-hp8440p total:196 pass:131 dwarn:0 dfail:1 fail:0 skip:64 ivb-t430s total:196 pass:171 dwarn:0 dfail:0 fail:0 skip:25 skl-i7k-2 total:196 pass:173 dwarn:0 dfail:0 fail:0 skip:23 skl-nuci5 total:105 pass:100 dwarn:0 dfail:0 fail:0 skip:4 snb-dellxps total:196 pass:162 dwarn:0 dfail:0 fail:0 skip:34 snb-x220t total:164 pass:139 dwarn:0 dfail:0 fail:0 skip:25 Results at /archive/results/CI_IGT_test/Patchwork_1786/ 3e353ec38c8fe68e9a243a9388389a8815115451 drm-intel-nightly: 2016y-04m-04d-11h-13m-54s UTC integration manifest 95dc10d4f71a6cf473aa874b0a74036f251aef8c drm/i915: Move execlists irq handler to a bottom half _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev4) 2016-04-04 12:33 ` ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev4) Patchwork @ 2016-04-04 12:42 ` Tvrtko Ursulin 2016-04-04 12:53 ` Chris Wilson 0 siblings, 1 reply; 32+ messages in thread From: Tvrtko Ursulin @ 2016-04-04 12:42 UTC (permalink / raw) To: intel-gfx On 04/04/16 13:33, Patchwork wrote: > == Series Details == > > Series: drm/i915: Move execlists irq handler to a bottom half (rev4) > URL : https://patchwork.freedesktop.org/series/4764/ > State : failure > > == Summary == > > Series 4764v4 drm/i915: Move execlists irq handler to a bottom half > http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/4/mbox/ > > Test gem_sync: > Subgroup basic-bsd: > pass -> DMESG-FAIL (ilk-hp8440p) Unrelated hangcheck timer elapsed on ILK: https://bugs.freedesktop.org/show_bug.cgi?id=94307 > Test kms_pipe_crc_basic: > Subgroup suspend-read-crc-pipe-a: > incomplete -> PASS (skl-nuci5) > > bdw-nuci7 total:196 pass:184 dwarn:0 dfail:0 fail:0 skip:12 > bdw-ultra total:196 pass:175 dwarn:0 dfail:0 fail:0 skip:21 > bsw-nuc-2 total:196 pass:159 dwarn:0 dfail:0 fail:0 skip:37 > byt-nuc total:196 pass:161 dwarn:0 dfail:0 fail:0 skip:35 > hsw-brixbox total:196 pass:174 dwarn:0 dfail:0 fail:0 skip:22 > hsw-gt2 total:196 pass:179 dwarn:0 dfail:0 fail:0 skip:17 > ilk-hp8440p total:196 pass:131 dwarn:0 dfail:1 fail:0 skip:64 > ivb-t430s total:196 pass:171 dwarn:0 dfail:0 fail:0 skip:25 > skl-i7k-2 total:196 pass:173 dwarn:0 dfail:0 fail:0 skip:23 > skl-nuci5 total:105 pass:100 dwarn:0 dfail:0 fail:0 skip:4 > snb-dellxps total:196 pass:162 dwarn:0 dfail:0 fail:0 skip:34 > snb-x220t total:164 pass:139 dwarn:0 dfail:0 fail:0 skip:25 > > Results at /archive/results/CI_IGT_test/Patchwork_1786/ > > 3e353ec38c8fe68e9a243a9388389a8815115451 drm-intel-nightly: 2016y-04m-04d-11h-13m-54s UTC integration manifest > 95dc10d4f71a6cf473aa874b0a74036f251aef8c drm/i915: Move execlists irq handler to a bottom half So cross fingers and merge? Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev4) 2016-04-04 12:42 ` Tvrtko Ursulin @ 2016-04-04 12:53 ` Chris Wilson 2016-04-04 13:14 ` Tvrtko Ursulin 0 siblings, 1 reply; 32+ messages in thread From: Chris Wilson @ 2016-04-04 12:53 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: intel-gfx On Mon, Apr 04, 2016 at 01:42:06PM +0100, Tvrtko Ursulin wrote: > > > On 04/04/16 13:33, Patchwork wrote: > >== Series Details == > > > >Series: drm/i915: Move execlists irq handler to a bottom half (rev4) > >URL : https://patchwork.freedesktop.org/series/4764/ > >State : failure > > > >== Summary == > > > >Series 4764v4 drm/i915: Move execlists irq handler to a bottom half > >http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/4/mbox/ > > > >Test gem_sync: > > Subgroup basic-bsd: > > pass -> DMESG-FAIL (ilk-hp8440p) > > Unrelated hangcheck timer elapsed on ILK: > https://bugs.freedesktop.org/show_bug.cgi?id=94307 > > >Test kms_pipe_crc_basic: > > Subgroup suspend-read-crc-pipe-a: > > incomplete -> PASS (skl-nuci5) > > > >bdw-nuci7 total:196 pass:184 dwarn:0 dfail:0 fail:0 skip:12 > >bdw-ultra total:196 pass:175 dwarn:0 dfail:0 fail:0 skip:21 > >bsw-nuc-2 total:196 pass:159 dwarn:0 dfail:0 fail:0 skip:37 > >byt-nuc total:196 pass:161 dwarn:0 dfail:0 fail:0 skip:35 > >hsw-brixbox total:196 pass:174 dwarn:0 dfail:0 fail:0 skip:22 > >hsw-gt2 total:196 pass:179 dwarn:0 dfail:0 fail:0 skip:17 > >ilk-hp8440p total:196 pass:131 dwarn:0 dfail:1 fail:0 skip:64 > >ivb-t430s total:196 pass:171 dwarn:0 dfail:0 fail:0 skip:25 > >skl-i7k-2 total:196 pass:173 dwarn:0 dfail:0 fail:0 skip:23 > >skl-nuci5 total:105 pass:100 dwarn:0 dfail:0 fail:0 skip:4 > >snb-dellxps total:196 pass:162 dwarn:0 dfail:0 fail:0 skip:34 > >snb-x220t total:164 pass:139 dwarn:0 dfail:0 fail:0 skip:25 > > > >Results at /archive/results/CI_IGT_test/Patchwork_1786/ > > > >3e353ec38c8fe68e9a243a9388389a8815115451 drm-intel-nightly: 2016y-04m-04d-11h-13m-54s UTC integration manifest > >95dc10d4f71a6cf473aa874b0a74036f251aef8c drm/i915: Move execlists irq handler to a bottom half > > So cross fingers and merge? Yes! -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ✗ Fi.CI.BAT: failure for drm/i915: Move execlists irq handler to a bottom half (rev4) 2016-04-04 12:53 ` Chris Wilson @ 2016-04-04 13:14 ` Tvrtko Ursulin 0 siblings, 0 replies; 32+ messages in thread From: Tvrtko Ursulin @ 2016-04-04 13:14 UTC (permalink / raw) To: Chris Wilson, intel-gfx On 04/04/16 13:53, Chris Wilson wrote: > On Mon, Apr 04, 2016 at 01:42:06PM +0100, Tvrtko Ursulin wrote: >> >> >> On 04/04/16 13:33, Patchwork wrote: >>> == Series Details == >>> >>> Series: drm/i915: Move execlists irq handler to a bottom half (rev4) >>> URL : https://patchwork.freedesktop.org/series/4764/ >>> State : failure >>> >>> == Summary == >>> >>> Series 4764v4 drm/i915: Move execlists irq handler to a bottom half >>> http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/4/mbox/ >>> >>> Test gem_sync: >>> Subgroup basic-bsd: >>> pass -> DMESG-FAIL (ilk-hp8440p) >> >> Unrelated hangcheck timer elapsed on ILK: >> https://bugs.freedesktop.org/show_bug.cgi?id=94307 >> >>> Test kms_pipe_crc_basic: >>> Subgroup suspend-read-crc-pipe-a: >>> incomplete -> PASS (skl-nuci5) >>> >>> bdw-nuci7 total:196 pass:184 dwarn:0 dfail:0 fail:0 skip:12 >>> bdw-ultra total:196 pass:175 dwarn:0 dfail:0 fail:0 skip:21 >>> bsw-nuc-2 total:196 pass:159 dwarn:0 dfail:0 fail:0 skip:37 >>> byt-nuc total:196 pass:161 dwarn:0 dfail:0 fail:0 skip:35 >>> hsw-brixbox total:196 pass:174 dwarn:0 dfail:0 fail:0 skip:22 >>> hsw-gt2 total:196 pass:179 dwarn:0 dfail:0 fail:0 skip:17 >>> ilk-hp8440p total:196 pass:131 dwarn:0 dfail:1 fail:0 skip:64 >>> ivb-t430s total:196 pass:171 dwarn:0 dfail:0 fail:0 skip:25 >>> skl-i7k-2 total:196 pass:173 dwarn:0 dfail:0 fail:0 skip:23 >>> skl-nuci5 total:105 pass:100 dwarn:0 dfail:0 fail:0 skip:4 >>> snb-dellxps total:196 pass:162 dwarn:0 dfail:0 fail:0 skip:34 >>> snb-x220t total:164 pass:139 dwarn:0 dfail:0 fail:0 skip:25 >>> >>> Results at /archive/results/CI_IGT_test/Patchwork_1786/ >>> >>> 3e353ec38c8fe68e9a243a9388389a8815115451 drm-intel-nightly: 2016y-04m-04d-11h-13m-54s UTC integration manifest >>> 95dc10d4f71a6cf473aa874b0a74036f251aef8c drm/i915: Move execlists irq handler to a bottom half >> >> So cross fingers and merge? > > Yes! Okay, it's done, we'll see what happens next. :) Regards, Tvrtko _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx ^ permalink raw reply [flat|nested] 32+ messages in thread