From: "Manszewski, Christoph" <christoph.manszewski@intel.com>
To: Michal Wajdeczko <michal.wajdeczko@intel.com>,
Matthew Brost <matthew.brost@intel.com>
Cc: intel-xe@lists.freedesktop.org,
Nareshkumar Gollakoti <naresh.kumar.g@intel.com>,
Maciej Patelczyk <maciej.patelczyk@intel.com>
Subject: Re: [PATCH] drm/xe/pf: Allow to lock/unlock the PF
Date: Wed, 29 Oct 2025 12:02:43 +0100 [thread overview]
Message-ID: <2d802bd2-cbf3-48c1-acf1-b1705b61ec85@intel.com> (raw)
In-Reply-To: <cdae237e-a4d7-4053-8b37-2164a1b2ea31@intel.com>
On 29.10.2025 09:14, Michal Wajdeczko wrote:
>
>
> On 10/29/2025 2:31 AM, Matthew Brost wrote:
>> On Tue, Oct 28, 2025 at 09:05:21PM +0100, Michal Wajdeczko wrote:
>>> Some driver functionalities, like eudebug or ccs-mode, can't
>>> be used when VFs are enabled. Add functions to allow locking
>>> the PF functionality for exclusive usage (either for enabling
>>> VFs or to enable those other features, or simply for testing).
>>> Also add debugfs attributes to explicitly call those
>>> functions if needed.
>>>
>>
>> Hmm, I'm not sure about this. Why not just lock the SR-IOV master mutex
>> in pf_enable_vfs? If the reason is that lockdep blows up — for example,
>> if the master mutex is annotated with __reclaim and pf_enable_vfs
>> allocates memory — then you still have a potential deadlock; you've just
>> silenced lockdep. I'm not certain that's the case, just using it as an
>> example.
>>
>> Given that, I'd lean toward saying no — this really, really looks
>> unsafe. If you'd like, get a second opinion from a locking expert (e.g.,
>> Thomas), but I think this is a no from me.
>
> looks like more background info is needed here
>
> this "lock/unlock" is not to protect any PF structures/data, as for this
> we have master_mutex, but to allow other components, like mentioned
> above eudebug & ccs-mode, to block PF from enabling VFs while that other
> feature is running or making incompatible with VFs changes, see [1] [2]
>
> in this patch, the PF is trying to "lock" itself when enabling VFs.
> if another component (or here debugfs) has already locked the PF, then
> the PF will not enable any VFs, and thus will not break that other feature.
>
> it is expected that other components will follow the same flow: they
> will first call pf_try_lock and then either abort their enabling, if the
> PF is already locked, or call pf_unlock when they are done.
>
> however there is one open issue that we might need to solve: what if there
> are more such VF-incompatible features that would like to run in parallel
>
> with this trivial approach, only eudebug or ccs-mode will be able to run
Good point - eudebug and ccs-mode shouldn't be mutually exclusive.
Regardless, I think that a mechanism which provides absolute exclusivity
to the caller shouldn't be exposed by an individual feature like SR-IOV.
>
> if that's not sufficient, then we can switch to using an rw_semaphore, or
> maybe, if that would be cleaner/safer, use it from the beginning?
>
> then we will have:
>
> components:
> xe_sriov_pf_try_lock --> down_read_trylock
> xe_sriov_pf_unlock --> up_read
>
> PF internals:
> __xe_sriov_pf_try_lock(write) --> down_write_trylock
> __xe_sriov_pf_unlock(write) --> up_write
This looks nice as it only provides SR-IOV vs others exclusion and
*won't* break when there are two or more of those other features
(looking at [1] :$)
>
> [1] https://patchwork.freedesktop.org/patch/667725/?series=152682&rev=1
> [2] https://patchwork.freedesktop.org/patch/681266/?series=154538&rev=6
>
>>
>> Matt
>>
>>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>> Cc: Nareshkumar Gollakoti <naresh.kumar.g@intel.com>
>>> Cc: Christoph Manszewski <christoph.manszewski@intel.com>
>>> Cc: Maciej Patelczyk <maciej.patelczyk@intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_pci_sriov.c | 7 +++++
>>> drivers/gpu/drm/xe/xe_sriov_pf.c | 38 ++++++++++++++++++++++++
>>> drivers/gpu/drm/xe/xe_sriov_pf.h | 4 +++
>>> drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c | 15 ++++++++++
>>> drivers/gpu/drm/xe/xe_sriov_pf_types.h | 3 ++
>>> 5 files changed, 67 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_pci_sriov.c b/drivers/gpu/drm/xe/xe_pci_sriov.c
>>> index 735f51effc7a..e1d34860b064 100644
>>> --- a/drivers/gpu/drm/xe/xe_pci_sriov.c
>>> +++ b/drivers/gpu/drm/xe/xe_pci_sriov.c
>>> @@ -120,6 +120,10 @@ static int pf_enable_vfs(struct xe_device *xe, int num_vfs)
>>> if (err)
>>> goto out;
>>>
>>> + err = xe_sriov_pf_try_lock(xe);
>>> + if (err)
>>> + goto out;
>>> +
>>> /*
>>> * We must hold additional reference to the runtime PM to keep PF in D0
>>> * during VFs lifetime, as our VFs do not implement the PM capability.
>>> @@ -157,6 +161,7 @@ static int pf_enable_vfs(struct xe_device *xe, int num_vfs)
>>> failed:
>>> xe_sriov_pf_unprovision_vfs(xe, num_vfs);
>>> xe_pm_runtime_put(xe);
>>> + xe_sriov_pf_unlock(xe);
>>> out:
>>> xe_sriov_notice(xe, "Failed to enable %u VF%s (%pe)\n",
>>> num_vfs, str_plural(num_vfs), ERR_PTR(err));
>>> @@ -186,6 +191,8 @@ static int pf_disable_vfs(struct xe_device *xe)
>>> /* not needed anymore - see pf_enable_vfs() */
>>> xe_pm_runtime_put(xe);
>>>
>>> + xe_sriov_pf_unlock(xe);
>>> +
>>> xe_sriov_info(xe, "Disabled %u VF%s\n", num_vfs, str_plural(num_vfs));
>>> return 0;
>>> }
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_pf.c b/drivers/gpu/drm/xe/xe_sriov_pf.c
>>> index bc1ab9ee31d9..8cdd25db2cf9 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_pf.c
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_pf.c
>>> @@ -157,6 +157,44 @@ int xe_sriov_pf_wait_ready(struct xe_device *xe)
>>> return 0;
>>> }
>>>
>>> +/**
>>> + * xe_sriov_pf_try_lock() - Try to lock the PF.
>>> + * @xe: the PF &xe_device
>>> + *
>>> + * This function can only be called on PF.
Nit: this comment could be a little bit more descriptive. The name
itself (try_lock) is reminiscent of a typical resource lock, which I
would argue is more commonly associated with concurrency and data
integrity rather than feature state management.
>>> + *
>>> + * Return: 0 on success or a negative error code on failure.
>>> + */
>>> +int xe_sriov_pf_try_lock(struct xe_device *xe)
>>> +{
>>> + guard(mutex)(xe_sriov_pf_master_mutex(xe));
>>> +
>>> + if (xe->sriov.pf.owner) {
>>> + xe_sriov_dbg(xe, "already locked by %ps\n", xe->sriov.pf.owner);
>>> + return -EBUSY;
>>> + }
>>> +
>>> + xe->sriov.pf.owner = __builtin_return_address(0);
>>> + xe_sriov_dbg_verbose(xe, "locked by %ps\n", xe->sriov.pf.owner);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +/**
>>> + * xe_sriov_pf_unlock() - Unlock the PF.
>>> + * @xe: the PF &xe_device
>>> + *
>>> + * This function can only be called on PF.
Same as above.
Regards,
Christoph
>>> + */
>>> +void xe_sriov_pf_unlock(struct xe_device *xe)
>>> +{
>>> + guard(mutex)(xe_sriov_pf_master_mutex(xe));
>>> +
>>> + xe_assert(xe, xe->sriov.pf.owner);
>>> + xe_sriov_dbg_verbose(xe, "unlocked by %ps\n", __builtin_return_address(0));
>>> + xe->sriov.pf.owner = NULL;
>>> +}
>>> +
>>> /**
>>> * xe_sriov_pf_print_vfs_summary - Print SR-IOV PF information.
>>> * @xe: the &xe_device to print info from
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_pf.h b/drivers/gpu/drm/xe/xe_sriov_pf.h
>>> index cba3fde9581f..2261596bb4fe 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_pf.h
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_pf.h
>>> @@ -17,11 +17,15 @@ bool xe_sriov_pf_readiness(struct xe_device *xe);
>>> int xe_sriov_pf_init_early(struct xe_device *xe);
>>> int xe_sriov_pf_init_late(struct xe_device *xe);
>>> int xe_sriov_pf_wait_ready(struct xe_device *xe);
>>> +int xe_sriov_pf_try_lock(struct xe_device *xe);
>>> +void xe_sriov_pf_unlock(struct xe_device *xe);
>>> void xe_sriov_pf_print_vfs_summary(struct xe_device *xe, struct drm_printer *p);
>>> #else
>>> static inline bool xe_sriov_pf_readiness(struct xe_device *xe) { return false; }
>>> static inline int xe_sriov_pf_init_early(struct xe_device *xe) { return 0; }
>>> static inline int xe_sriov_pf_init_late(struct xe_device *xe) { return 0; }
>>> +static inline int xe_sriov_pf_try_lock(struct xe_device *xe) { return 0; }
>>> +static inline void xe_sriov_pf_unlock(struct xe_device *xe) { }
>>> #endif
>>>
>>> #endif
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c b/drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c
>>> index a81aa05c5532..7c011462244d 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_pf_debugfs.c
>>> @@ -96,12 +96,27 @@ static inline int xe_sriov_pf_restore_auto_provisioning(struct xe_device *xe)
>>> return xe_sriov_pf_provision_set_mode(xe, XE_SRIOV_PROVISIONING_MODE_AUTO);
>>> }
>>>
>>> +static inline int xe_sriov_pf_try_lock_pf(struct xe_device *xe)
>>> +{
>>> + return xe_sriov_pf_try_lock(xe);
>>> +}
>>> +
>>> +static inline int xe_sriov_pf_force_unlock_pf(struct xe_device *xe)
>>> +{
>>> + xe_sriov_pf_unlock(xe);
>>> + return 0;
>>> +}
>>> +
>>> DEFINE_SRIOV_ATTRIBUTE(restore_auto_provisioning);
>>> +DEFINE_SRIOV_ATTRIBUTE(try_lock_pf);
>>> +DEFINE_SRIOV_ATTRIBUTE(force_unlock_pf);
>>>
>>> static void pf_populate_root(struct xe_device *xe, struct dentry *dent)
>>> {
>>> debugfs_create_file("restore_auto_provisioning", 0200, dent, xe,
>>> &restore_auto_provisioning_fops);
>>> + debugfs_create_file("try_lock_pf", 0200, dent, xe, &try_lock_pf_fops);
>>> + debugfs_create_file("force_unlock_pf", 0200, dent, xe, &force_unlock_pf_fops);
>>> }
>>>
>>> static int simple_show(struct seq_file *m, void *data)
>>> diff --git a/drivers/gpu/drm/xe/xe_sriov_pf_types.h b/drivers/gpu/drm/xe/xe_sriov_pf_types.h
>>> index c753cd59aed2..91da3c979922 100644
>>> --- a/drivers/gpu/drm/xe/xe_sriov_pf_types.h
>>> +++ b/drivers/gpu/drm/xe/xe_sriov_pf_types.h
>>> @@ -36,6 +36,9 @@ struct xe_device_pf {
>>> /** @master_lock: protects all VFs configurations across GTs */
>>> struct mutex master_lock;
>>>
>>> + /** @owner: the return address of the owner who locked the PF */
>>> + void *owner;
>>> +
>>> /** @provision: device level provisioning data. */
>>> struct xe_sriov_pf_provision provision;
>>>
>>> --
>>> 2.47.1
>>>
>
Thread overview: 8+ messages
2025-10-28 20:05 [PATCH] drm/xe/pf: Allow to lock/unlock the PF Michal Wajdeczko
2025-10-28 22:19 ` ✓ CI.KUnit: success for " Patchwork
2025-10-28 22:57 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-29 1:31 ` [PATCH] " Matthew Brost
2025-10-29 8:14 ` Michal Wajdeczko
2025-10-29 11:02 ` Manszewski, Christoph [this message]
2025-10-29 1:34 ` Matthew Brost
2025-10-29 10:28 ` ✗ Xe.CI.Full: failure for " Patchwork