> On Tue, Jul 09, 2024 at 06:08:54PM +0200, Nirmoy Das wrote:
>> On 7/9/2024 11:57 AM, Matthew Auld wrote:
>>> Hi,
>>>
>>> On 08/07/2024 05:03, Matthew Brost wrote:
>>>> While debugging [1], an issue was identified in which, if too many GT
>>>> TLB invalidations are issued to the GuC, the GuC can get overwhelmed to
>>>> the point that scheduling of jobs starts to stall. To avoid this, hold
>>>> and coalesce GT TLB invalidations in the KMD if a watermark of pending
>>>> invalidations is passed. A gitlab issue for this has also been opened [2].
>>>>
>>>> Layering issues with GT TLB invalidations are known [3] and needed to
>>>> be fixed first before adding this new feature.
>>>>
>>>> - Patches 1-8 fix the layering.
>>>> - Patches 9-11 add the coalescing feature.
>>>>
>>>> We could merge these two as separate series if needed.
>>>>
>>>> CCing various stakeholders (Farah, Michal, Nirmoy) who have raised GT
>>>> TLB invalidation issues in the past.
>>>
>>> Maybe worth mentioning for [1], we try to process TLB invalidations
>>> directly from the irq, however we also only process the g2h queue
>>> in-order, so if there is something other than a TLB invalidation or fault
>>> earlier in the queue then we do nothing useful from the irq and just
>>> return, that is until the wq can eventually process those earlier items
>>> that couldn't be processed directly from the irq. In the past
>>
>> Seen this recently:
>>
>> <3> [3763.731822] xe 0000:03:00.0: [drm] *ERROR* GT0: g2h outstanding: 611
>> <snip>
>> <6> [3727.857273] [IGT] xe_evict: executing
>> <3> [3730.165480] xe 0000:03:00.0: [drm] *ERROR* TILE0 [GTT] GT0: TLB invalidation time'd out, seqno=26858, recv=2685
>
> Missing the last digit of '2685'?

oops, yes:

>> Which I think fits your description. This series should help but not
>> sure how much.
>
> From arch level if this is a continued problem, perhaps we should ask for
> a dedicated G2H queue for TLB invalidation done responses. It seems like a
> fairly reasonable ask to me as TLB invalidations really shouldn't get
> stuck behind other G2H processing...

Yes, that should work really well. I am currently trying out this
series but haven't managed to reproduce the issue with/without
the series reliably yet.
Regards,
Nirmoy
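
The head-of-line blocking discussed above can be illustrated with a small standalone sketch. It is not the xe driver's G2H handling: the message kinds, `struct g2h_queue`, `irq_safe()` and `irq_fast_path()` are hypothetical names made up for this example, and the only behaviour modelled is the one described in the thread (in-order processing from the irq, stopping at the first message that needs the wq).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message kinds; these are not the real GuC G2H definitions. */
enum msg_kind { MSG_TLB_INVAL_DONE, MSG_PAGE_FAULT, MSG_OTHER };

/* Hypothetical in-order queue: head is the next unprocessed message. */
struct g2h_queue {
	const enum msg_kind *msgs;
	size_t len;
	size_t head;
};

/* Assumption: only TLB invalidation done and fault messages are irq-safe. */
static bool irq_safe(enum msg_kind k)
{
	return k == MSG_TLB_INVAL_DONE || k == MSG_PAGE_FAULT;
}

/*
 * Consume messages strictly in order from irq context and stop at the
 * first message that needs the worker. Returns how many TLB invalidation
 * done messages were handled in this pass.
 */
static size_t irq_fast_path(struct g2h_queue *q)
{
	size_t done = 0;

	while (q->head < q->len && irq_safe(q->msgs[q->head])) {
		if (q->msgs[q->head] == MSG_TLB_INVAL_DONE)
			done++;
		q->head++;
	}
	return done;
}
```

With a shared queue holding { MSG_OTHER, MSG_TLB_INVAL_DONE, MSG_TLB_INVAL_DONE }, the fast path handles nothing until the worker has consumed the first entry. With a queue that carries only TLB invalidation done messages, both are handled directly from the irq, which is the effect a dedicated queue would be expected to have in this simplified model.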
> Matt

>> Regards,
>> Nirmoy

>>> I have seen TLB timeouts where the TLB invalidation is clearly in the g2h
>>> queue (and has been for a while), but is stuck behind something earlier
>>> in the queue that needs the wq, but system is under such a heavy load
>>> that the wq can't be scheduled in a timely manner.

>>>> v2:
>>>>  - Fix CI issues
>>>>  - Clean up some of the series / patch structure
>>>>
>>>> Matt
>>>>
>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/799#note_2449497
>>>> [2] https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2162
>>>> [3] https://patchwork.freedesktop.org/series/133001/
>>>>
>>>> Matthew Brost (11):
>>>>   drm/xe: Add xe_gt_tlb_invalidation_fence_init helper
>>>>   drm/xe: Drop xe_gt_tlb_invalidation_wait
>>>>   drm/xe: s/tlb_invalidation.lock/tlb_invalidation.fence_lock
>>>>   drm/xe: Add tlb_invalidation.seqno_lock
>>>>   drm/xe: Add xe_gt_tlb_invalidation_done_handler
>>>>   drm/xe: Add send tlb invalidation helpers
>>>>   drm/xe: Add xe_guc_tlb_invalidation layer
>>>>   drm/xe: Add multi-client support for GT TLB invalidations
>>>>   drm/xe: Add GT TLB invalidation coalescing
>>>>   drm/xe: Add GT TLB invalidation coalesce tracepoints
>>>>   drm/xe: Add GT TLB invalidation watermark debugfs
>>>>
>>>>  drivers/gpu/drm/xe/Makefile                   |   1 +
>>>>  drivers/gpu/drm/xe/xe_debugfs.c               |  38 ++
>>>>  drivers/gpu/drm/xe/xe_device.c                |   3 +
>>>>  drivers/gpu/drm/xe/xe_device_types.h          |   5 +
>>>>  drivers/gpu/drm/xe/xe_ggtt.c                  |  21 +-
>>>>  drivers/gpu/drm/xe/xe_ggtt_types.h            |   5 +
>>>>  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c   | 641 ++++++++++++------
>>>>  drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h   |  26 +-
>>>>  .../gpu/drm/xe/xe_gt_tlb_invalidation_types.h |  41 ++
>>>>  drivers/gpu/drm/xe/xe_gt_types.h              |  43 +-
>>>>  drivers/gpu/drm/xe/xe_guc_ct.c                |   2 +-
>>>>  drivers/gpu/drm/xe/xe_guc_tlb_invalidation.c  | 145 ++++
>>>>  drivers/gpu/drm/xe/xe_guc_tlb_invalidation.h  |  18 +
>>>>  drivers/gpu/drm/xe/xe_pt.c                    |  33 +-
>>>>  drivers/gpu/drm/xe/xe_trace.h                 |  10 +
>>>>  drivers/gpu/drm/xe/xe_vm.c                    |  45 +-
>>>>  drivers/gpu/drm/xe/xe_vm_types.h              |   3 +
>>>>  17 files changed, 801 insertions(+), 279 deletions(-)
>>>>  create mode 100644 drivers/gpu/drm/xe/xe_guc_tlb_invalidation.c
>>>>  create mode 100644 drivers/gpu/drm/xe/xe_guc_tlb_invalidation.h
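
For readers unfamiliar with the idea, the "hold and coalesce past a watermark" behaviour described in the cover letter can be sketched roughly as below. This is a toy model and not the code from the series: `struct tlb_inval_state`, `tlb_inval_request()` and `tlb_inval_done()` are hypothetical names, and the sketch assumes that a single invalidation issued later can cover every request held in the meantime.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping; field names are not taken from the series. */
struct tlb_inval_state {
	unsigned int watermark;  /* max invalidations in flight before holding */
	unsigned int in_flight;  /* sent to the firmware, not yet acknowledged */
	unsigned int held;       /* requests coalesced while at the watermark */
	unsigned int sent_total; /* invalidations actually issued so far */
};

/*
 * Request an invalidation. Returns true if it was issued immediately,
 * false if it was held for coalescing because the watermark was reached.
 */
static bool tlb_inval_request(struct tlb_inval_state *s)
{
	if (s->in_flight >= s->watermark) {
		s->held++;
		return false;
	}
	s->in_flight++;
	s->sent_total++;
	return true;
}

/*
 * One in-flight invalidation completed. If requests were held, issue a
 * single invalidation on behalf of all of them (assumption: one later
 * invalidation is sufficient to cover every held request).
 */
static void tlb_inval_done(struct tlb_inval_state *s)
{
	if (s->in_flight)
		s->in_flight--;

	if (s->held && s->in_flight < s->watermark) {
		s->held = 0;
		s->in_flight++;
		s->sent_total++;
	}
}
```

In this model, five back-to-back requests with a watermark of two result in two invalidations issued immediately and a third issued once the first completes, so the firmware sees three messages instead of five.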