public inbox for intel-xe@lists.freedesktop.org
From: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
To: Raag Jadav <raag.jadav@intel.com>, <intel-xe@lists.freedesktop.org>
Cc: <matthew.brost@intel.com>, <rodrigo.vivi@intel.com>,
	<thomas.hellstrom@linux.intel.com>, <riana.tauro@intel.com>,
	<michal.wajdeczko@intel.com>, <matthew.d.roper@intel.com>,
	<michal.winiarski@intel.com>, <matthew.auld@intel.com>,
	<maarten@lankhorst.se>, <jani.nikula@intel.com>,
	<lukasz.laguna@intel.com>, <zhanjun.dong@intel.com>,
	<lukas@wunner.de>, <badal.nilawar@intel.com>
Subject: Re: [PATCH v6 8/8] drm/xe/pci: Introduce PCIe FLR
Date: Tue, 28 Apr 2026 16:28:15 -0700
Message-ID: <2de7d34d-6f47-4327-9290-7cebfd47a69d@intel.com>
In-Reply-To: <20260423100017.1051587-9-raag.jadav@intel.com>

<snip>

I haven't gone through the code yet, but I wanted to ask some questions 
regarding the approach first.

> +
> +/**
> + * DOC: PCI Error Handling
> + *
> + * Xe driver registers PCI callbacks which are called by PCI core in case of
> + * bus errors or resets.
> + *
> + * Currently only PCI Function Level Reset (FLR) callbacks are supported. Since
> + * most of the Endpoint Function state is lost on PCIe FLR, the flow is pretty
> + * much similar to system suspend/resume flow with a few notable exceptions.

IMO we need a couple of lines to describe what the impact of FLR is on 
the HW. Something like:

"PCI FLR clears VRAM and resets the state of all the HW units. 
Therefore, the contents of all exec queues and BOs in VRAM are lost and 
the HW needs a full re-init".

> + *
> + * Prepare phase:
> + * - Temporarily wedge the device to prevent userspace access

I'm not convinced that wedging is the correct approach here, because the
expectation from the app's POV is that wedging is permanent, so it
won't try again later. Maybe we can have a separate flr_in_progress flag
and return something like -EBUSY or -EAGAIN while the FLR is in progress?
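
To illustrate the idea, here is a minimal userspace sketch, not driver
code. Everything in it is hypothetical: struct fake_xe_device, the
fake_*() helpers, and the choice of -ECANCELED for the wedged case are
made up for illustration, and the real ioctl entry path may look
different. The only point is that a transient flag lets userspace tell
"retry later" apart from "device is gone".

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for device state; NOT the real struct xe_device. */
struct fake_xe_device {
	atomic_bool wedged;          /* permanent: device declared dead */
	atomic_bool flr_in_progress; /* transient: set from prepare until done */
};

/* Hypothetical check performed on entry to every ioctl. */
static int fake_ioctl_enter(struct fake_xe_device *xe)
{
	if (atomic_load(&xe->wedged))
		return -ECANCELED; /* permanent failure, retrying is pointless */
	if (atomic_load(&xe->flr_in_progress))
		return -EAGAIN;    /* transient failure, caller may retry later */
	return 0;
}

/* Hypothetical hooks called from the FLR prepare/done callbacks. */
static void fake_flr_prepare(struct fake_xe_device *xe)
{
	atomic_store(&xe->flr_in_progress, true);
}

static void fake_flr_done(struct fake_xe_device *xe)
{
	atomic_store(&xe->flr_in_progress, false);
}

/* Walk through one FLR cycle and check what userspace would observe. */
static void fake_flr_demo(void)
{
	struct fake_xe_device xe = { 0 };

	assert(fake_ioctl_enter(&xe) == 0);       /* normal operation */
	fake_flr_prepare(&xe);
	assert(fake_ioctl_enter(&xe) == -EAGAIN); /* blocked, but retryable */
	fake_flr_done(&xe);
	assert(fake_ioctl_enter(&xe) == 0);       /* access restored */
}
```

With something along these lines, the wedged state keeps its "permanent"
meaning and never has to be undone after the reset.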

> + * - Stop accepting new submissions

This is done as part of the above step and it isn't a separate one, right?

> + * - Kill exec queues which signals all fences and frees in-flight jobs
> + * - Skip memory eviction due to untrustworthy VRAM contents

Note that the VRAM contents are not necessarily untrustworthy at this
point, since the FLR hasn't happened yet. However, if the admin is
triggering an FLR it is likely that something is broken (whether memory,
GuC, GT or something else), so we shouldn't try to touch the HW anyway.

> + * - Remove all memory mappings since VRAM contents will be lost

Dumb question, but what happens if a userspace app has an object mapped 
and they try to access it from the CPU after this step?

> + *
> + * Re-initialization phase:
> + * - Recreate kernel bos due to skipped eviction in prepare phase
> + * - Restore kernel queues which were killed in prepare phase
> + * - Reload all uC firmwares
> + * - Bring up GT and unwedge to allow userspace access
> + *
> + * Since VRAM contents are lost, the user is expected to recreate user memory
> + * and reload context.

How is the user expected to realize that they need to re-create their 
BOs? A queue can be killed for different reasons and normally that 
doesn't imply that any associated BO is now invalid.

Daniele

> + *
> + * TODO: Add PCIe error handling callbacks using similar flow.
> + *
> + * Current implementation is only limited to re-initializing GT.
> + * This needs to be extended for a lot of components listed below.
> + *
> + * - Proper re-initialization of GSC and PXP for integrated platforms
> + * - SRIOV cases which need synchronization between PF and VF
> + * - Re-initialization of all child devices of Xe
> + * - User memory handling and MM corner cases
> + * - Display
> + */
> +
>


Thread overview: 19+ messages
2026-04-23 10:00 [PATCH v6 0/8] Introduce Xe PCIe FLR Raag Jadav
2026-04-23 10:00 ` [PATCH v6 1/8] drm/xe/uc_fw: Allow re-initializing firmware Raag Jadav
2026-04-23 10:00 ` [PATCH v6 2/8] drm/xe/guc_submit: Introduce guc_exec_queue_reinit() Raag Jadav
2026-04-23 10:00 ` [PATCH v6 3/8] drm/xe/gt: Introduce FLR helpers Raag Jadav
2026-04-23 10:00 ` [PATCH v6 4/8] drm/xe/bo_evict: Introduce xe_bo_restore_map() Raag Jadav
2026-04-23 10:00 ` [PATCH v6 5/8] drm/xe/exec_queue: Introduce xe_exec_queue_reinit() Raag Jadav
2026-04-23 10:00 ` [PATCH v6 6/8] drm/xe/migrate: Introduce xe_migrate_reinit() Raag Jadav
2026-04-23 10:00 ` [PATCH v6 7/8] drm/xe/pm: Introduce xe_device_suspend/resume() Raag Jadav
2026-04-23 10:00 ` [PATCH v6 8/8] drm/xe/pci: Introduce PCIe FLR Raag Jadav
2026-04-28 23:28   ` Daniele Ceraolo Spurio [this message]
2026-04-29  4:33     ` Raag Jadav
2026-04-29 16:22       ` Rodrigo Vivi
2026-04-29 17:57         ` Daniele Ceraolo Spurio
2026-04-30 20:57           ` Rodrigo Vivi
2026-05-02  7:41             ` Raag Jadav
2026-04-23 10:09 ` ✗ CI.checkpatch: warning for Introduce Xe PCIe FLR (rev6) Patchwork
2026-04-23 10:10 ` ✓ CI.KUnit: success " Patchwork
2026-04-23 11:05 ` ✓ Xe.CI.BAT: " Patchwork
2026-04-23 20:58 ` ✗ Xe.CI.FULL: failure " Patchwork
