From: "Manszewski, Christoph" <christoph.manszewski@intel.com>
To: Michal Wajdeczko <michal.wajdeczko@intel.com>,
Matthew Brost <matthew.brost@intel.com>
Cc: intel-xe@lists.freedesktop.org,
Nareshkumar Gollakoti <naresh.kumar.g@intel.com>,
Maciej Patelczyk <maciej.patelczyk@intel.com>
Subject: Re: [PATCH] drm/xe/pf: Allow to lock/unlock the PF
Date: Wed, 29 Oct 2025 12:02:43 +0100 [thread overview]
Message-ID: <2d802bd2-cbf3-48c1-acf1-b1705b61ec85@intel.com> (raw)
In-Reply-To: <cdae237e-a4d7-4053-8b37-2164a1b2ea31@intel.com>
On 29.10.2025 09:14, Michal Wajdeczko wrote:
>
>
> On 10/29/2025 2:31 AM, Matthew Brost wrote:
>> On Tue, Oct 28, 2025 at 09:05:21PM +0100, Michal Wajdeczko wrote:
>>> Some driver functionalities, like eudebug or ccs-mode, can't
>>> be used when VFs are enabled. Add functions to allow locking
>>> the PF functionality for exclusive usage (either for enabling
>>> VFs or to enable those other features, or simply for testing).
>>> Also add debugfs attributes to explicitly call those
>>> functions if needed.
>>>
>>
>> Hmm, I'm not sure about this. Why not just lock the SR-IOV master mutex
>> in pf_enable_vfs? If the reason is that lockdep blows up — for example,
>> if the master mutex is annotated with __reclaim and pf_enable_vfs
>> allocates memory — then you still have a potential deadlock; you've just
>> silenced lockdep. I'm not certain that's the case, just using it as an
>> example.
>>
>> Given that, I'd lean toward saying no — this really, really looks
>> unsafe. If you'd like, get a second opinion from a locking expert (e.g.,
>> Thomas), but I think this is a no from me.
>
> looks like more background info is needed here
>
> this "lock/unlock" is not to protect any PF structures/data, as for this
> we have master_mutex, but to allow other components, like mentioned
> above eudebug & ccs-mode, to block PF from enabling VFs while that other
> feature is running or making incompatible with VFs changes, see [1] [2]
>
> in this patch, the PF is trying to "lock" itself when enabling VFs.
> if another component (or here debugfs) has already locked the PF, then
> the PF will not enable any VFs, and thus will not break that other feature.
>
> it is expected that other components will follow the same flow: they
> will first call pf_try_lock and then either abort their enabling, if the
> PF is already locked, or call pf_unlock when they are done.
>
> however there is one open issue that we might need to solve: what if there
> are more such VF-incompatible features that would like to run in parallel
>
> with this trivial approach, only eudebug or ccs-mode will be able to run
Good point - eudebug and ccs-mode shouldn't be mutually exclusive.
Regardless, I think that a mechanism which provides absolute exclusivity
to the caller shouldn't be exposed by an individual feature like SR-IOV.
>
> if that's not sufficient, then we can switch to using an rw_semaphore, or
> maybe, if that would be cleaner/safer, use it from the beginning?
>
> then we will have:
>
> components:
> xe_sriov_pf_try_lock --> down_read_trylock
> xe_sriov_pf_unlock --> up_read
>
> PF internals:
> __xe_sriov_pf_try_lock(write) --> down_write_trylock
> __xe_sriov_pf_unlock(write) --> up_write
This looks nice as it only provides SR-IOV vs others exclusion and
*won't* break when there are two or more of those other features
(looking at [1] :$)
>
> [1] https://patchwork.freedesktop.org/patch/667725/?series=152682&rev=1
> [2] https://patchwork.freedesktop.org/patch/681266/?series=154538&rev=6
>
>>
>> Matt
>>
>>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>> Cc: Nareshkumar Gollakoti <naresh.kumar.g@intel.com>
>>> Cc: Christoph Manszewski <christoph.manszewski@intel.com>
>>> Cc: Maciej Patelczyk <maciej.patelczyk@intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_pci_sriov.c | 7 +++++
>>> drivers/gpu/drm/xe/xe_sriov_pf.c | 38 ++++++++++++++++++++++++
>>> drivers/gpu/drm/xe/xe_sriov_pf.h | 4 +++
>>> drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c | 15 ++++++++++
>>> drivers/gpu/drm/xe/xe_sriov_pf_types.h | 3 ++
>>> 5 files changed, 67 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_pci_sriov.c b/drivers/gpu/drm/xe/xe_pci_sriov.c
>>> index 735f51effc7a..e1d34860b064 100644
>>> --- a/drivers/gpu/drm/xe/xe_pci_sriov.c
>>> +++ b/drivers/gpu/drm/xe/xe_pci_sriov.c
>>> @@ -120,6 +120,10 @@ static int pf_enable_vfs(struct xe_device *xe, int num_vfs)
>>> if (err)
>>> goto out;
>>>
>>> + err = xe_sriov_pf_try_lock(xe);
>>> + if (err)
>>> + goto out;
>>> +
>>> /*
>>> * We must hold additional reference to the runtime PM to keep PF in D0
>>> * during VFs lifetime, as our VFs do not implement the PM capability.
>>> @@ -157,6 +161,7 @@ static int pf_enable_vfs(struct xe_device *xe, int num_vfs)
>>> failed:
>>> xe_sriov_pf_unprovision_vfs(xe, num_vfs);
>>> xe_pm_runtime_put(xe);
>>> + xe_sriov_pf_unlock(xe);
>>> out:
>>> xe_sriov_notice(xe, "Failed to enable %u VF%s (%pe)\n",
>>> num_vfs, str_plural(num_vfs), ERR_PTR(err));
>>> @@ -186,6 +191,8 @@ static int pf_disable_vfs(struct xe_device *xe)
>>> /* not needed anymore - see pf_enable_vfs() */
>>> xe_pm_runtime_put(xe);
>>>
>>> + xe_sriov_pf_unlock(xe);
>>> +
>>> xe_sriov_info(xe, "Disabled %u VF%s\n", num_vfs, str_plural(num_vfs));
>>> return 0;
>>> }
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_pf.c b/drivers/gpu/drm/xe/xe_sriov_pf.c
>>> index bc1ab9ee31d9..8cdd25db2cf9 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_pf.c
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_pf.c
>>> @@ -157,6 +157,44 @@ int xe_sriov_pf_wait_ready(struct xe_device *xe)
>>> return 0;
>>> }
>>>
>>> +/**
>>> + * xe_sriov_pf_try_lock() - Try to lock the PF.
>>> + * @xe: the PF &xe_device
>>> + *
>>> + * This function can only be called on PF.
Nit: this comment could be a little bit more descriptive. The name
itself (try_lock) is reminiscent of a typical resource lock, which I
would argue is more commonly associated with concurrency and data
integrity rather than feature state management.
>>> + *
>>> + * Return: 0 on success or a negative error code on failure.
>>> + */
>>> +int xe_sriov_pf_try_lock(struct xe_device *xe)
>>> +{
>>> + guard(mutex)(xe_sriov_pf_master_mutex(xe));
>>> +
>>> + if (xe->sriov.pf.owner) {
>>> + xe_sriov_dbg(xe, "already locked by %ps\n", xe->sriov.pf.owner);
>>> + return -EBUSY;
>>> + }
>>> +
>>> + xe->sriov.pf.owner = __builtin_return_address(0);
>>> + xe_sriov_dbg_verbose(xe, "locked by %ps\n", xe->sriov.pf.owner);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +/**
>>> + * xe_sriov_pf_unlock() - Unlock the PF.
>>> + * @xe: the PF &xe_device
>>> + *
>>> + * This function can only be called on PF.
Same as above.
Regards,
Christoph
>>> + */
>>> +void xe_sriov_pf_unlock(struct xe_device *xe)
>>> +{
>>> + guard(mutex)(xe_sriov_pf_master_mutex(xe));
>>> +
>>> + xe_assert(xe, xe->sriov.pf.owner);
>>> + xe_sriov_dbg_verbose(xe, "unlocked by %ps\n", __builtin_return_address(0));
>>> + xe->sriov.pf.owner = NULL;
>>> +}
>>> +
>>> /**
>>> * xe_sriov_pf_print_vfs_summary - Print SR-IOV PF information.
>>> * @xe: the &xe_device to print info from
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_pf.h b/drivers/gpu/drm/xe/xe_sriov_pf.h
>>> index cba3fde9581f..2261596bb4fe 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_pf.h
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_pf.h
>>> @@ -17,11 +17,15 @@ bool xe_sriov_pf_readiness(struct xe_device *xe);
>>> int xe_sriov_pf_init_early(struct xe_device *xe);
>>> int xe_sriov_pf_init_late(struct xe_device *xe);
>>> int xe_sriov_pf_wait_ready(struct xe_device *xe);
>>> +int xe_sriov_pf_try_lock(struct xe_device *xe);
>>> +void xe_sriov_pf_unlock(struct xe_device *xe);
>>> void xe_sriov_pf_print_vfs_summary(struct xe_device *xe, struct drm_printer *p);
>>> #else
>>> static inline bool xe_sriov_pf_readiness(struct xe_device *xe) { return false; }
>>> static inline int xe_sriov_pf_init_early(struct xe_device *xe) { return 0; }
>>> static inline int xe_sriov_pf_init_late(struct xe_device *xe) { return 0; }
>>> +static inline int xe_sriov_pf_try_lock(struct xe_device *xe) { return 0; }
>>> +static inline void xe_sriov_pf_unlock(struct xe_device *xe) { }
>>> #endif
>>>
>>> #endif
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c b/drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c
>>> index a81aa05c5532..7c011462244d 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c
>>> @@ -96,12 +96,27 @@ static inline int xe_sriov_pf_restore_auto_provisioning(struct xe_device *xe)
>>> return xe_sriov_pf_provision_set_mode(xe, XE_SRIOV_PROVISIONING_MODE_AUTO);
>>> }
>>>
>>> +static inline int xe_sriov_pf_try_lock_pf(struct xe_device *xe)
>>> +{
>>> + return xe_sriov_pf_try_lock(xe);
>>> +}
>>> +
>>> +static inline int xe_sriov_pf_force_unlock_pf(struct xe_device *xe)
>>> +{
>>> + xe_sriov_pf_unlock(xe);
>>> + return 0;
>>> +}
>>> +
>>> DEFINE_SRIOV_ATTRIBUTE(restore_auto_provisioning);
>>> +DEFINE_SRIOV_ATTRIBUTE(try_lock_pf);
>>> +DEFINE_SRIOV_ATTRIBUTE(force_unlock_pf);
>>>
>>> static void pf_populate_root(struct xe_device *xe, struct dentry *dent)
>>> {
>>> debugfs_create_file("restore_auto_provisioning", 0200, dent, xe,
>>> &restore_auto_provisioning_fops);
>>> + debugfs_create_file("try_lock_pf", 0200, dent, xe, &try_lock_pf_fops);
>>> + debugfs_create_file("force_unlock_pf", 0200, dent, xe, &force_unlock_pf_fops);
>>> }
>>>
>>> static int simple_show(struct seq_file *m, void *data)
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_pf_types.h b/drivers/gpu/drm/xe/xe_sriov_pf_types.h
>>> index c753cd59aed2..91da3c979922 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_pf_types.h
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_pf_types.h
>>> @@ -36,6 +36,9 @@ struct xe_device_pf {
>>> /** @master_lock: protects all VFs configurations across GTs */
>>> struct mutex master_lock;
>>>
>>> + /** @owner: the return address of the owner who locked the PF */
>>> + void *owner;
>>> +
>>> /** @provision: device level provisioning data. */
>>> struct xe_sriov_pf_provision provision;
>>>
>>> --
>>> 2.47.1
>>>
>
Thread overview: 8+ messages
2025-10-28 20:05 [PATCH] drm/xe/pf: Allow to lock/unlock the PF Michal Wajdeczko
2025-10-28 22:19 ` ✓ CI.KUnit: success for " Patchwork
2025-10-28 22:57 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-29 1:31 ` [PATCH] " Matthew Brost
2025-10-29 8:14 ` Michal Wajdeczko
2025-10-29 11:02 ` Manszewski, Christoph [this message]
2025-10-29 1:34 ` Matthew Brost
2025-10-29 10:28 ` ✗ Xe.CI.Full: failure for " Patchwork