Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Add configfs support for survivability mode
@ 2025-03-07 14:24 Riana Tauro
  0 siblings, 0 replies; 15+ messages in thread
From: Riana Tauro @ 2025-03-07 14:24 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, matthew.d.roper,
	lucas.demarchi

This RFC series proposes to expose attributes via xe configfs
subsystem. survivability mode attribute is added here that allows
the user to configure the device to enter survivability mode.

This is done by

echo "domain:bus:dev.fn" > /sys/bus/pci/drivers/xe/unbind
echo "domain:bus:dev.fn" > sys/kernel/config/xe/survivability_mode
echo "domain:bus:dev.fn" > /sys/bus/pci/drivers/xe/bind

This is an alternative to introducing module param that causes
all the connected and supported cards to enter survivability mode.
Manually entering survivability mode is useful when pcode does not
report failure and for validation

Riana Tauro (2):
  drm/xe: Add configfs to enable survivability mode
  drm/xe: Enable configfs support for survivability mode

 drivers/gpu/drm/xe/Makefile                |   1 +
 drivers/gpu/drm/xe/xe_configfs.c           | 117 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_configfs.h           |  16 +++
 drivers/gpu/drm/xe/xe_module.c             |   5 +
 drivers/gpu/drm/xe/xe_module.h             |   1 +
 drivers/gpu/drm/xe/xe_pci.c                |   8 +-
 drivers/gpu/drm/xe/xe_survivability_mode.c |  21 +++-
 7 files changed, 163 insertions(+), 6 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_configfs.c
 create mode 100644 drivers/gpu/drm/xe/xe_configfs.h

-- 
2.47.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* ✓ CI.Patch_applied: success for Add configfs support for survivability mode (rev2)
  2025-03-27  6:42 [PATCH 0/2] Add configfs support for survivability mode Riana Tauro
@ 2025-03-27  6:25 ` Patchwork
  2025-03-27  6:26 ` ✗ CI.checkpatch: warning " Patchwork
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2025-03-27  6:25 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

== Series Details ==

Series: Add configfs support for survivability mode (rev2)
URL   : https://patchwork.freedesktop.org/series/145999/
State : success

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: c1893793c7d3 drm-tip: 2025y-03m-26d-21h-05m-04s UTC integration manifest
=== git am output follows ===
.git/rebase-apply/patch:201: new blank line at EOF.
+
warning: 1 line adds whitespace errors.
Applying: drm/xe: Add configfs to enable survivability mode
Applying: drm/xe: Enable configfs support for survivability mode



^ permalink raw reply	[flat|nested] 15+ messages in thread

* ✗ CI.checkpatch: warning for Add configfs support for survivability mode (rev2)
  2025-03-27  6:42 [PATCH 0/2] Add configfs support for survivability mode Riana Tauro
  2025-03-27  6:25 ` ✓ CI.Patch_applied: success for Add configfs support for survivability mode (rev2) Patchwork
@ 2025-03-27  6:26 ` Patchwork
  2025-03-27  6:26 ` ✗ CI.KUnit: failure " Patchwork
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2025-03-27  6:26 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

== Series Details ==

Series: Add configfs support for survivability mode (rev2)
URL   : https://patchwork.freedesktop.org/series/145999/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
99e5a866b5e13f134e606a3e29d9508d97826fb3
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit b896028e7e4e4535dc91377e320be14b904a6cf1
Author: Riana Tauro <riana.tauro@intel.com>
Date:   Thu Mar 27 12:12:02 2025 +0530

    drm/xe: Enable configfs support for survivability mode
    
    Enable survivability mode if supported and configfs attribute is set.
    Enabing survivability mode manually is useful in cases where pcode does
    not detect failure, validation and for IFR (in-field-repair).
    
    To set configfs survivability mode attribute for a device
    
    mkdir /config/xe/0000:03:00.0
    echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
    echo 1 > /config/xe/0000:03:00.0/survivability_mode
    echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
    
    The card enters survivability mode if supported
    
    Signed-off-by: Riana Tauro <riana.tauro@intel.com>
+ /mt/dim checkpatch c1893793c7d3868fe083bdab33999178337b5561 drm-intel
89c824cabafb drm/xe: Add configfs to enable survivability mode
-:38: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#38: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 209 lines checked
b896028e7e4e drm/xe: Enable configfs support for survivability mode



^ permalink raw reply	[flat|nested] 15+ messages in thread

* ✗ CI.KUnit: failure for Add configfs support for survivability mode (rev2)
  2025-03-27  6:42 [PATCH 0/2] Add configfs support for survivability mode Riana Tauro
  2025-03-27  6:25 ` ✓ CI.Patch_applied: success for Add configfs support for survivability mode (rev2) Patchwork
  2025-03-27  6:26 ` ✗ CI.checkpatch: warning " Patchwork
@ 2025-03-27  6:26 ` Patchwork
  2025-03-27  6:42 ` [PATCH 1/2] drm/xe: Add configfs to enable survivability mode Riana Tauro
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2025-03-27  6:26 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

== Series Details ==

Series: Add configfs support for survivability mode (rev2)
URL   : https://patchwork.freedesktop.org/series/145999/
State : failure

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:../lib/iomap.c:156:5: warning: no previous prototype for ‘ioread64_lo_hi’ [-Wmissing-prototypes]
  156 | u64 ioread64_lo_hi(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~
../lib/iomap.c:163:5: warning: no previous prototype for ‘ioread64_hi_lo’ [-Wmissing-prototypes]
  163 | u64 ioread64_hi_lo(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~
../lib/iomap.c:170:5: warning: no previous prototype for ‘ioread64be_lo_hi’ [-Wmissing-prototypes]
  170 | u64 ioread64be_lo_hi(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~~~
../lib/iomap.c:178:5: warning: no previous prototype for ‘ioread64be_hi_lo’ [-Wmissing-prototypes]
  178 | u64 ioread64be_hi_lo(const void __iomem *addr)
      |     ^~~~~~~~~~~~~~~~
../lib/iomap.c:264:6: warning: no previous prototype for ‘iowrite64_lo_hi’ [-Wmissing-prototypes]
  264 | void iowrite64_lo_hi(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~
../lib/iomap.c:272:6: warning: no previous prototype for ‘iowrite64_hi_lo’ [-Wmissing-prototypes]
  272 | void iowrite64_hi_lo(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~
../lib/iomap.c:280:6: warning: no previous prototype for ‘iowrite64be_lo_hi’ [-Wmissing-prototypes]
  280 | void iowrite64be_lo_hi(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~~~
../lib/iomap.c:288:6: warning: no previous prototype for ‘iowrite64be_hi_lo’ [-Wmissing-prototypes]
  288 | void iowrite64be_hi_lo(u64 val, void __iomem *addr)
      |      ^~~~~~~~~~~~~~~~~
/usr/bin/ld: drivers/gpu/drm/xe/xe_configfs.o: in function `configfs_find_group':
xe_configfs.c:(.text+0x89): undefined reference to `config_group_find_item'
/usr/bin/ld: drivers/gpu/drm/xe/xe_configfs.o: in function `xe_config_make_device_group':
xe_configfs.c:(.text+0x28c): undefined reference to `config_group_init_type_name'
/usr/bin/ld: drivers/gpu/drm/xe/xe_configfs.o: in function `xe_configfs_get_survivability_mode':
xe_configfs.c:(.text+0x304): undefined reference to `config_item_put'
/usr/bin/ld: drivers/gpu/drm/xe/xe_configfs.o: in function `xe_configfs_clear_survivability_mode':
xe_configfs.c:(.text+0x371): undefined reference to `config_item_put'
/usr/bin/ld: drivers/gpu/drm/xe/xe_configfs.o: in function `xe_configfs_init':
xe_configfs.c:(.init.text+0x6): undefined reference to `config_group_init'
/usr/bin/ld: xe_configfs.c:(.init.text+0x4a): undefined reference to `configfs_register_subsystem'
/usr/bin/ld: drivers/gpu/drm/xe/xe_configfs.o: in function `xe_configfs_exit':
xe_configfs.c:(.exit.text+0x10): undefined reference to `configfs_unregister_subsystem'
collect2: error: ld returned 1 exit status
make[3]: *** [../scripts/Makefile.vmlinux:77: vmlinux] Error 1
make[2]: *** [/kernel/Makefile:1231: vmlinux] Error 2
make[1]: *** [/kernel/Makefile:251: __sub-make] Error 2
make: *** [Makefile:251: __sub-make] Error 2

[06:26:13] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[06:26:17] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 0/2] Add configfs support for survivability mode
@ 2025-03-27  6:42 Riana Tauro
  2025-03-27  6:25 ` ✓ CI.Patch_applied: success for Add configfs support for survivability mode (rev2) Patchwork
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Riana Tauro @ 2025-03-27  6:42 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	matthew.d.roper, aravind.iddamsetty

This series proposes to expose attributes via xe configfs
subsystem. Xe registers a configfs subsystem named 'xe'.
Userspace can then create directories for the devices they
want to configure and set appropriate attributes

This is done by

mount -t configfs none /config
mkdir /config/xe/0000:03:00.0 

echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
echo 1 > sys/kernel/config/xe/0000:03:00.0/survivability_mode
echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind

This is an alternative to introducing module param that causes
all the connected and supported GPU cards to enter survivability mode.
Manually entering survivability mode is useful when pcode does not
report failure, in field repairs and validation

Rev2: use config_groups (Lucas)

Riana Tauro (2):
  drm/xe: Add configfs to enable survivability mode
  drm/xe: Enable configfs support for survivability mode

 drivers/gpu/drm/xe/Makefile                |   1 +
 drivers/gpu/drm/xe/xe_configfs.c           | 236 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_configfs.h           |  16 ++
 drivers/gpu/drm/xe/xe_device.c             |   2 +-
 drivers/gpu/drm/xe/xe_module.c             |   5 +
 drivers/gpu/drm/xe/xe_pci.c                |  19 +-
 drivers/gpu/drm/xe/xe_survivability_mode.c |  39 +++-
 drivers/gpu/drm/xe/xe_survivability_mode.h |   2 +-
 8 files changed, 299 insertions(+), 21 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_configfs.c
 create mode 100644 drivers/gpu/drm/xe/xe_configfs.h

-- 
2.47.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] drm/xe: Add configfs to enable survivability mode
  2025-03-27  6:42 [PATCH 0/2] Add configfs support for survivability mode Riana Tauro
                   ` (2 preceding siblings ...)
  2025-03-27  6:26 ` ✗ CI.KUnit: failure " Patchwork
@ 2025-03-27  6:42 ` Riana Tauro
  2025-04-01  1:03   ` Lucas De Marchi
  2025-03-27  6:42 ` [PATCH 2/2] drm/xe: Enable configfs support for " Riana Tauro
  2025-03-27 14:40 ` [PATCH 0/2] Add " Lucas De Marchi
  5 siblings, 1 reply; 15+ messages in thread
From: Riana Tauro @ 2025-03-27  6:42 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	matthew.d.roper, aravind.iddamsetty

Registers a configfs subsystem called 'xe' to userspace. The user can
use this to modify exposed attributes for a device.

Attribute exposed:

/config/xe/<device>/survivability_mode : Enables survivability mode

The attributes can be modified by creating the device directory under the
configfs directory

mount -t configfs none /config
mkdir /config/xe/0000:03:00.0

echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
echo 1 > /config/xe/0000:03:00.0/survivability_mode
echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/Makefile      |   1 +
 drivers/gpu/drm/xe/xe_configfs.c | 174 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_configfs.h |  11 ++
 drivers/gpu/drm/xe/xe_module.c   |   5 +
 4 files changed, 191 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_configfs.c
 create mode 100644 drivers/gpu/drm/xe/xe_configfs.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index cd464fe26eb8..190176c2e10d 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -28,6 +28,7 @@ $(obj)/generated/%_wa_oob.c $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
 xe-y += xe_bb.o \
 	xe_bo.o \
 	xe_bo_evict.o \
+	xe_configfs.o \
 	xe_devcoredump.o \
 	xe_device.o \
 	xe_device_sysfs.o \
diff --git a/drivers/gpu/drm/xe/xe_configfs.c b/drivers/gpu/drm/xe/xe_configfs.c
new file mode 100644
index 000000000000..59e1bc4c5f76
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_configfs.c
@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include <linux/configfs.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+
+#include "xe_configfs.h"
+#include "xe_module.h"
+
+/**
+ * DOC: XE Configfs Support
+ *
+ * Overview
+ * =========
+ *
+ * Configfs is a filesystem-based manager of kernel objects. XE KMD registers a
+ * configfs subsystem called 'xe' that allows user to configure various attributes
+ * See Documentation/filesystems/configfs.rst for more information about how configfs works.
+ *
+ * Create devices
+ * ===============
+ *
+ * To use configfs exposed by Xe, mount the configfs directory and create the device directory
+ *
+ * $ mount -t configfs none /config
+ * $ mkdir /config/xe/0000:03:00.0
+ *
+ * Configure Attributes
+ * ====================
+ *
+ * /config/xe/<device>/survivability mode : Enables survivability mode if supported by device
+ *
+ * Unbind			: echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
+ * Enable survivability mode	: echo 1 > /config/xe/0000:03:00.0/survivability_mode
+ * Bind				: echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
+ *
+ * The device enters survivability mode if supported
+ *
+ */
+
+struct xe_config_device {
+	struct config_group group;
+
+	/** @survivability_mode: enables survivability mode when set */
+	bool survivability_mode;
+
+	/** @lock: protects attributes */
+	struct mutex lock;
+};
+
+static struct xe_config_device *to_xe_config_device(struct config_item *item)
+{
+	return container_of(to_config_group(item), struct xe_config_device, group);
+}
+
+static ssize_t survivability_mode_show(struct config_item *item, char *page)
+{
+	struct xe_config_device *dev = to_xe_config_device(item);
+
+	return sprintf(page, "%d\n", dev->survivability_mode);
+}
+
+static ssize_t survivability_mode_store(struct config_item *item, const char *page, size_t len)
+{
+	struct xe_config_device *dev = to_xe_config_device(item);
+	bool survivability_mode;
+	int ret;
+
+	ret = kstrtobool(page, &survivability_mode);
+	if (ret)
+		return ret;
+
+	mutex_lock(&dev->lock);
+	dev->survivability_mode = survivability_mode;
+	mutex_unlock(&dev->lock);
+
+	return len;
+}
+
+CONFIGFS_ATTR(, survivability_mode);
+
+static struct configfs_attribute *xe_config_device_attrs[] = {
+	&attr_survivability_mode,
+	NULL,
+};
+
+static void xe_config_device_release(struct config_item *item)
+{
+	struct xe_config_device *dev = to_xe_config_device(item);
+
+	mutex_destroy(&dev->lock);
+	kfree(dev);
+}
+
+static struct configfs_item_operations xe_config_device_ops = {
+	.release	= xe_config_device_release,
+};
+
+static const struct config_item_type xe_config_device_type = {
+	.ct_item_ops	= &xe_config_device_ops,
+	.ct_attrs	= xe_config_device_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct config_group *xe_config_make_device_group(struct config_group *group,
+							const char *name)
+{
+	unsigned int domain, bus, slot, function;
+	struct xe_config_device *dev;
+	struct pci_dev *pdev;
+	int ret;
+
+	ret = sscanf(name, "%04x:%02x:%02x.%x", &domain, &bus, &slot, &function);
+	if (ret != 4)
+		return ERR_PTR(-EINVAL);
+
+	pdev = pci_get_domain_bus_and_slot(domain, bus, PCI_DEVFN(slot, function));
+	if (!pdev)
+		return ERR_PTR(-EINVAL);
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	config_group_init_type_name(&dev->group, name, &xe_config_device_type);
+
+	mutex_init(&dev->lock);
+
+	return &dev->group;
+}
+
+static struct configfs_group_operations xe_config_device_group_ops = {
+	.make_group	= xe_config_make_device_group,
+};
+
+static const struct config_item_type xe_configfs_type = {
+	.ct_group_ops	= &xe_config_device_group_ops,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct configfs_subsystem xe_configfs = {
+	.su_group = {
+		.cg_item = {
+			.ci_namebuf = "xe",
+			.ci_type = &xe_configfs_type,
+		},
+	},
+};
+
+int __init xe_configfs_init(void)
+{
+	struct config_group *root = &xe_configfs.su_group;
+	int ret = 0;
+
+	config_group_init(root);
+	mutex_init(&xe_configfs.su_mutex);
+	ret = configfs_register_subsystem(&xe_configfs);
+	if (ret) {
+		pr_err("Error %d while registering %s subsystem\n",
+		       ret, root->cg_item.ci_namebuf);
+	}
+
+	return ret;
+}
+
+void __exit xe_configfs_exit(void)
+{
+	configfs_unregister_subsystem(&xe_configfs);
+}
+
diff --git a/drivers/gpu/drm/xe/xe_configfs.h b/drivers/gpu/drm/xe/xe_configfs.h
new file mode 100644
index 000000000000..2c30be9a2c7e
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_configfs.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+#ifndef _XE_CONFIGFS_H_
+#define _XE_CONFIGFS_H_
+
+int xe_configfs_init(void);
+void xe_configfs_exit(void);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
index 475acdba2b55..15b3cf22193c 100644
--- a/drivers/gpu/drm/xe/xe_module.c
+++ b/drivers/gpu/drm/xe/xe_module.c
@@ -11,6 +11,7 @@
 #include <drm/drm_module.h>
 
 #include "xe_drv.h"
+#include "xe_configfs.h"
 #include "xe_hw_fence.h"
 #include "xe_pci.h"
 #include "xe_pm.h"
@@ -91,6 +92,10 @@ static const struct init_funcs init_funcs[] = {
 	{
 		.init = xe_check_nomodeset,
 	},
+	{
+		.init = xe_configfs_init,
+		.exit = xe_configfs_exit,
+	},
 	{
 		.init = xe_hw_fence_module_init,
 		.exit = xe_hw_fence_module_exit,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] drm/xe: Enable configfs support for survivability mode
  2025-03-27  6:42 [PATCH 0/2] Add configfs support for survivability mode Riana Tauro
                   ` (3 preceding siblings ...)
  2025-03-27  6:42 ` [PATCH 1/2] drm/xe: Add configfs to enable survivability mode Riana Tauro
@ 2025-03-27  6:42 ` Riana Tauro
  2025-04-01  3:25   ` Lucas De Marchi
  2025-03-27 14:40 ` [PATCH 0/2] Add " Lucas De Marchi
  5 siblings, 1 reply; 15+ messages in thread
From: Riana Tauro @ 2025-03-27  6:42 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	matthew.d.roper, aravind.iddamsetty

Enable survivability mode if supported and configfs attribute is set.
Enabing survivability mode manually is useful in cases where pcode does
not detect failure, validation and for IFR (in-field-repair).

To set configfs survivability mode attribute for a device

mkdir /config/xe/0000:03:00.0
echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
echo 1 > /config/xe/0000:03:00.0/survivability_mode
echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind

The card enters survivability mode if supported

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_configfs.c           | 62 ++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_configfs.h           |  7 ++-
 drivers/gpu/drm/xe/xe_device.c             |  2 +-
 drivers/gpu/drm/xe/xe_pci.c                | 19 ++++---
 drivers/gpu/drm/xe/xe_survivability_mode.c | 39 ++++++++++----
 drivers/gpu/drm/xe/xe_survivability_mode.h |  2 +-
 6 files changed, 109 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_configfs.c b/drivers/gpu/drm/xe/xe_configfs.c
index 59e1bc4c5f76..57234904c061 100644
--- a/drivers/gpu/drm/xe/xe_configfs.c
+++ b/drivers/gpu/drm/xe/xe_configfs.c
@@ -151,6 +151,68 @@ static struct configfs_subsystem xe_configfs = {
 	},
 };
 
+static struct xe_config_device *configfs_find_group(struct pci_dev *pdev)
+{
+	struct config_item *item;
+	char name[64];
+
+	snprintf(name, sizeof(name), "%04x:%02x:%02x.%x", pci_domain_nr(pdev->bus),
+		 pdev->bus->number, PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+	mutex_lock(&xe_configfs.su_mutex);
+	item = config_group_find_item(&xe_configfs.su_group, name);
+	mutex_unlock(&xe_configfs.su_mutex);
+
+	if (!item)
+		return NULL;
+
+	return to_xe_config_device(item);
+}
+
+/**
+ * xe_configfs_get_survivability_mode - get configfs survivability mode attribute
+ * @pdev: pci device
+ *
+ * find the configfs group that belongs to the pci device and return
+ * the survivability mode attribute
+ *
+ * Return: survivability mode if config group is found, false otherwise
+ */
+bool xe_configfs_get_survivability_mode(struct pci_dev *pdev)
+{
+	struct xe_config_device *dev = configfs_find_group(pdev);
+	bool mode;
+
+	if (!dev)
+		return false;
+
+	mode = dev->survivability_mode;
+	config_item_put(&dev->group.cg_item);
+
+	return mode;
+}
+
+/**
+ * xe_configfs_clear_survivability_mode - clear configfs survivability mode attribute
+ * @pdev: pci device
+ *
+ * find the configfs group that belongs to the pci device and clear survivability
+ * mode attribute
+ */
+void xe_configfs_clear_survivability_mode(struct pci_dev *pdev)
+{
+	struct xe_config_device *dev = configfs_find_group(pdev);
+
+	if (!dev)
+		return;
+
+	mutex_lock(&dev->lock);
+	dev->survivability_mode = 0;
+	mutex_unlock(&dev->lock);
+
+	config_item_put(&dev->group.cg_item);
+}
+
 int __init xe_configfs_init(void)
 {
 	struct config_group *root = &xe_configfs.su_group;
diff --git a/drivers/gpu/drm/xe/xe_configfs.h b/drivers/gpu/drm/xe/xe_configfs.h
index 2c30be9a2c7e..6e8c5ccaf202 100644
--- a/drivers/gpu/drm/xe/xe_configfs.h
+++ b/drivers/gpu/drm/xe/xe_configfs.h
@@ -5,7 +5,12 @@
 #ifndef _XE_CONFIGFS_H_
 #define _XE_CONFIGFS_H_
 
+#include <linux/types.h>
+
+struct pci_dev;
+
 int xe_configfs_init(void);
 void xe_configfs_exit(void);
-
+bool xe_configfs_get_survivability_mode(struct pci_dev *pdev);
+void xe_configfs_clear_survivability_mode(struct pci_dev *pdev);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 1ffb7d1f6be6..f4e59577ddc0 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -712,7 +712,7 @@ int xe_device_probe_early(struct xe_device *xe)
 	sriov_update_device_info(xe);
 
 	err = xe_pcode_probe_early(xe);
-	if (err) {
+	if (err || xe_survivability_mode_requested(xe)) {
 		int save_err = err;
 
 		/*
diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index da384acf731f..bcacfa78afd1 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -812,18 +812,17 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		return err;
 
 	err = xe_device_probe_early(xe);
-	if (err) {
-		/*
-		 * In Boot Survivability mode, no drm card is exposed and driver
-		 * is loaded with bare minimum to allow for firmware to be
-		 * flashed through mei. If early probe failed, but it managed to
-		 * enable survivability mode, return success.
-		 */
-		if (xe_survivability_mode_is_enabled(xe))
-			return 0;
+	/*
+	 * In Boot Survivability mode, no drm card is exposed and driver
+	 * is loaded with bare minimum to allow for firmware to be
+	 * flashed through mei. Return success, if survivability mode
+	 * is enabled due to pcode failure or configfs being set
+	 */
+	if (xe_survivability_mode_is_enabled(xe))
+		return 0;
 
+	if (err)
 		return err;
-	}
 
 	err = xe_info_init(xe, desc);
 	if (err)
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
index cb813b337fd3..ed6599c5b85d 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.c
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
@@ -10,6 +10,7 @@
 #include <linux/pci.h>
 #include <linux/sysfs.h>
 
+#include "xe_configfs.h"
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_heci_gsc.h"
@@ -28,8 +29,12 @@
  * This is implemented by loading the driver with bare minimum (no drm card) to allow the firmware
  * to be flashed through mei and collect telemetry. The driver's probe flow is modified
  * such that it enters survivability mode when pcode initialization is incomplete and boot status
- * denotes a failure. The driver then  populates the survivability_mode PCI sysfs indicating
- * survivability mode and provides additional information required for debug
+ * denotes a failure. Survivability mode can also be enabled manually by setting the
+ * survivability_mode attribute of the device in xe configfs. This is useful in cases where pcode
+ * does not detect failure, IFR (in-field-repair) use cases where the repair can be performed for
+ * a single GPU card without impacting the usage of other cards in the same node and for validation.
+ * The driver then populates the survivability_mode PCI sysfs indicating survivability mode and
+ * provides additional information required for debug
  *
  * KMD exposes below admin-only readable sysfs in survivability mode
  *
@@ -42,6 +47,15 @@
  *			      Overflow Information   - Provides history of previous failures
  *			      Auxiliary Information  - Certain failures may have information in
  *						       addition to postcode information
+ *
+ * Enable survivability mode through configfs
+ *
+ * Create device directory	: mkdir /config/xe/0000:03:00.0
+ * Unbind			: echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
+ * Enable survivability mode	: echo 1 > /config/xe/0000:03:00.0/survivability_mode
+ * Bind				: echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
+ *
+ * The card enters survivability mode if supported
  */
 
 static u32 aux_history_offset(u32 reg_value)
@@ -133,6 +147,7 @@ static void xe_survivability_mode_fini(void *arg)
 	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
 	struct device *dev = &pdev->dev;
 
+	xe_configfs_clear_survivability_mode(pdev);
 	sysfs_remove_file(&dev->kobj, &dev_attr_survivability_mode.attr);
 }
 
@@ -186,24 +201,30 @@ bool xe_survivability_mode_is_enabled(struct xe_device *xe)
 	return xe->survivability.mode;
 }
 
-/*
- * survivability_mode_requested - check if it's possible to enable
- * survivability mode and that was requested by firmware
+/**
+ * xe_survivability_mode_requested - check if it's possible to enable survivability
+ *				     mode that was requested by firmware or userspace
+ * @xe: xe device instance
  *
- * This function reads the boot status from Pcode.
+ * This function reads configfs and  boot status from Pcode.
  *
  * Return: true if platform support is available and boot status indicates
- * failure, false otherwise.
+ * failure or if survivability mode is requested, false otherwise.
  */
-static bool survivability_mode_requested(struct xe_device *xe)
+bool xe_survivability_mode_requested(struct xe_device *xe)
 {
 	struct xe_survivability *survivability = &xe->survivability;
 	struct xe_mmio *mmio = xe_root_tile_mmio(xe);
+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
 	u32 data;
 
 	if (!IS_DGFX(xe) || xe->info.platform < XE_BATTLEMAGE || IS_SRIOV_VF(xe))
 		return false;
 
+	/* Enable survivability mode if set via configfs */
+	if (xe_configfs_get_survivability_mode(pdev))
+		return true;
+
 	data = xe_mmio_read32(mmio, PCODE_SCRATCH(0));
 	survivability->boot_status = REG_FIELD_GET(BOOT_STATUS, data);
 
@@ -226,7 +247,7 @@ int xe_survivability_mode_enable(struct xe_device *xe)
 	struct xe_survivability_info *info;
 	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
 
-	if (!survivability_mode_requested(xe))
+	if (!xe_survivability_mode_requested(xe))
 		return 0;
 
 	survivability->size = MAX_SCRATCH_MMIO;
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
index d7e64885570d..70eaf56fb3bb 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.h
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
@@ -12,5 +12,5 @@ struct xe_device;
 
 int xe_survivability_mode_enable(struct xe_device *xe);
 bool xe_survivability_mode_is_enabled(struct xe_device *xe);
-
+bool xe_survivability_mode_requested(struct xe_device *xe);
 #endif /* _XE_SURVIVABILITY_MODE_H_ */
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] Add configfs support for survivability mode
  2025-03-27  6:42 [PATCH 0/2] Add configfs support for survivability mode Riana Tauro
                   ` (4 preceding siblings ...)
  2025-03-27  6:42 ` [PATCH 2/2] drm/xe: Enable configfs support for " Riana Tauro
@ 2025-03-27 14:40 ` Lucas De Marchi
  2025-03-31 20:19   ` Rodrigo Vivi
  5 siblings, 1 reply; 15+ messages in thread
From: Lucas De Marchi @ 2025-03-27 14:40 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, matthew.d.roper,
	aravind.iddamsetty

On Thu, Mar 27, 2025 at 12:12:00PM +0530, Riana Tauro wrote:
>This series proposes to expose attributes via xe configfs
>subsystem. Xe registers a configfs subsystem named 'xe'.
>Userspace can then create directories for the devices they
>want to configure and set appropriate attributes
>
>This is done by
>
>mount -t configfs none /config
>mkdir /config/xe/0000:03:00.0
>

If we need a new version or to document anywhere in our docs, I'd add a
comment here:

# If driver is already bound, unbind it as this configuration
# applies only when probing it

>echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>echo 1 > sys/kernel/config/xe/0000:03:00.0/survivability_mode
>echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>
>This is an alternative to introducing module param that causes
>all the connected and supported GPU cards to enter survivability mode.
>Manually entering survivability mode is useful when pcode does not
>report failure, in field repairs and validation
>
>Rev2: use config_groups (Lucas)

Awesome. I have some other work pending that will make use of
it. I will play with these patches soon.

thanks
Lucas De Marchi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] Add configfs support for survivability mode
  2025-03-27 14:40 ` [PATCH 0/2] Add " Lucas De Marchi
@ 2025-03-31 20:19   ` Rodrigo Vivi
  2025-03-31 21:45     ` Lucas De Marchi
  2025-04-01  5:55     ` Riana Tauro
  0 siblings, 2 replies; 15+ messages in thread
From: Rodrigo Vivi @ 2025-03-31 20:19 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: Riana Tauro, intel-xe, anshuman.gupta, matthew.d.roper,
	aravind.iddamsetty

On Thu, Mar 27, 2025 at 09:40:39AM -0500, Lucas De Marchi wrote:
> On Thu, Mar 27, 2025 at 12:12:00PM +0530, Riana Tauro wrote:
> > This series proposes to expose attributes via xe configfs
> > subsystem. Xe registers a configfs subsystem named 'xe'.
> > Userspace can then create directories for the devices they
> > want to configure and set appropriate attributes
> > 
> > This is done by
> > 
> > mount -t configfs none /config
> > mkdir /config/xe/0000:03:00.0
> > 
> 
> If we need a new version or to document anywhere in our docs, I'd add a
> comment here:
> 
> # If driver is already bound, unbind it as this configuration
> # applies only when probing it
> 
> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
> > echo 1 > sys/kernel/config/xe/0000:03:00.0/survivability_mode
> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
> > 
> > This is an alternative to introducing module param that causes
> > all the connected and supported GPU cards to enter survivability mode.
> > Manually entering survivability mode is useful when pcode does not
> > report failure, in field repairs and validation
> > 
> > Rev2: use config_groups (Lucas)
> 
> Awesome. I have some other work pending that will make use of
> it. I will play with these patches soon.

I really liked this new flow and I was giving it a try here right now.

However it didn't work. It didn't take me to the survivability mode,
but also, I cannot unload the xe after creating this configfs file:

sudo remove /sys/kernel/config/xe/0000\:0*
rm: cannot remove '0000:00:02.0/survivability_mode': Operation not permitted
rm: cannot remove '0000:03:00.0/survivability_mode': Operation not permitted

Tried to unbind and had the same failure.

then with the configfs there we cannot remove the module:
$ sudo rmmod xe
rmmod: ERROR: Module xe is in use


So, it looks we have some stuff to adjust here before we can move further,
but so far things are looking promising indeed

> 
> thanks
> Lucas De Marchi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] Add configfs support for survivability mode
  2025-03-31 20:19   ` Rodrigo Vivi
@ 2025-03-31 21:45     ` Lucas De Marchi
  2025-04-01  6:18       ` Riana Tauro
  2025-04-01  5:55     ` Riana Tauro
  1 sibling, 1 reply; 15+ messages in thread
From: Lucas De Marchi @ 2025-03-31 21:45 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Riana Tauro, intel-xe, anshuman.gupta, matthew.d.roper,
	aravind.iddamsetty

On Mon, Mar 31, 2025 at 04:19:28PM -0400, Rodrigo Vivi wrote:
>On Thu, Mar 27, 2025 at 09:40:39AM -0500, Lucas De Marchi wrote:
>> On Thu, Mar 27, 2025 at 12:12:00PM +0530, Riana Tauro wrote:
>> > This series proposes to expose attributes via xe configfs
>> > subsystem. Xe registers a configfs subsystem named 'xe'.
>> > Userspace can then create directories for the devices they
>> > want to configure and set appropriate attributes
>> >
>> > This is done by
>> >
>> > mount -t configfs none /config
>> > mkdir /config/xe/0000:03:00.0
>> >
>>
>> If we need a new version or to document anywhere in our docs, I'd add a
>> comment here:
>>
>> # If driver is already bound, unbind it as this configuration
>> # applies only when probing it
>>
>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>> > echo 1 > sys/kernel/config/xe/0000:03:00.0/survivability_mode
>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>> >
>> > This is an alternative to introducing module param that causes
>> > all the connected and supported GPU cards to enter survivability mode.
>> > Manually entering survivability mode is useful when pcode does not
>> > report failure, in field repairs and validation
>> >
>> > Rev2: use config_groups (Lucas)
>>
>> Awesome. I have some other work pending that will make use of
>> it. I will play with these patches soon.
>
>I really liked this new flow and I was giving it a try here right now.
>
>However it didn't work. It didn't take me to the survivability mode,
>but also, I cannot unload the xe after creating this configfs file:
>
>sudo remove /sys/kernel/config/xe/0000\:0*
>rm: cannot remove '0000:00:02.0/survivability_mode': Operation not permitted
>rm: cannot remove '0000:03:00.0/survivability_mode': Operation not permitted

humn... testing on a bmg, it works:

	# # first of all, make sure autoprobe doesn't do what we don't
	# # want:
	# echo 0  > /sys/bus/pci/drivers_autoprobe

	# # load module and set the configuration
	# modprobe xe
	# mkdir /sys/kernel/config/xe/0000:03:00.0
	# echo 1 > /sys/kernel/config/xe/0000\:03\:00.0/survivability_mode

	# # bind the driver and check it enters survivability mode
	# echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/bind
	# dmesg | tail -n1
	[ 1994.807063] xe 0000:03:00.0: In Survivability Mode
	# cat  /sys/bus/pci/drivers/xe/0000\:03\:00.0/survivability_mode 
	Capability Info: 0x138320 - 0x2001ae06
	Postcode Info: 0x138324 - 0x0
	Overflow Info: 0x138328 - 0x0
	Auxiliary Info 0: 0x13832c - 0x0

Unbind first and test we can remove the configuration for next bind:

	# echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/unbind
	# tree /sys/kernel/config/xe
	/sys/kernel/config/xe
	└── 0000:03:00.0
	    └── survivability_mode
	# rmdir /sys/kernel/config/xe/0000:03:00.0/
	# tree /sys/kernel/config/xe
	/sys/kernel/config/xe

Remove module:
	# modprobe -r xe


What doesn't work:

	1) Remove the module without unbinding. This is already the case
	2) Remove the module without unbinding first

For (2) it's basicaly: when you create a configfs connection, configfs
increments the module's refcount. You have to remove them first.

For my surprise, it's possible to remove the config after binding -
there's no error. I need to double check if this wouldn't create some
UAF depending on how we use the config, but for survivability purposes,
I don't think it has an issue with that single bool.

Lucas De Marchi

>
>Tried to unbind and had the same failure.
>
>then with the configfs there we cannot remove the module:
>$ sudo rmmod xe
>rmmod: ERROR: Module xe is in use

what's the `lsmod | grep xe` output?

You should always be able to unbind. There's nothing the driver can do
that would block the unbind. After unbind, you should rmdir every dir in
/sys/kernel/config/xe/*. Note that it's not an rm -r since you can't
remove the inner configuration files, only the directory that is the
"connection" between configfs and the driver. You also can't remove the
xe dir (as it's owned by the module, not the connection to the device),
only  the dirs under it.

Lucas De Marchi


>
>
>So, it looks we have some stuff to adjust here before we can move further,
>but so far things are looking promising indeed
>
>>
>> thanks
>> Lucas De Marchi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] drm/xe: Add configfs to enable survivability mode
  2025-03-27  6:42 ` [PATCH 1/2] drm/xe: Add configfs to enable survivability mode Riana Tauro
@ 2025-04-01  1:03   ` Lucas De Marchi
  2025-04-01  8:13     ` Riana Tauro
  0 siblings, 1 reply; 15+ messages in thread
From: Lucas De Marchi @ 2025-04-01  1:03 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, matthew.d.roper,
	aravind.iddamsetty

On Thu, Mar 27, 2025 at 12:12:01PM +0530, Riana Tauro wrote:
>Registers a configfs subsystem called 'xe' to userspace. The user can
>use this to modify exposed attributes for a device.
>
>Attribute exposed:
>
>/config/xe/<device>/survivability_mode : Enables survivability mode
>
>The attributes can be modified by creating the device directory under the
>configfs directory
>
>mount -t configfs none /config
>mkdir /config/xe/0000:03:00.0
>
>echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>echo 1 > /config/xe/0000:03:00.0/survivability_mode
>echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>
>Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>---
> drivers/gpu/drm/xe/Makefile      |   1 +
> drivers/gpu/drm/xe/xe_configfs.c | 174 +++++++++++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_configfs.h |  11 ++
> drivers/gpu/drm/xe/xe_module.c   |   5 +
> 4 files changed, 191 insertions(+)
> create mode 100644 drivers/gpu/drm/xe/xe_configfs.c
> create mode 100644 drivers/gpu/drm/xe/xe_configfs.h
>
>diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>index cd464fe26eb8..190176c2e10d 100644
>--- a/drivers/gpu/drm/xe/Makefile
>+++ b/drivers/gpu/drm/xe/Makefile
>@@ -28,6 +28,7 @@ $(obj)/generated/%_wa_oob.c $(obj)/generated/%_wa_oob.h: $(obj)/xe_gen_wa_oob \
> xe-y += xe_bb.o \
> 	xe_bo.o \
> 	xe_bo_evict.o \
>+	xe_configfs.o \
> 	xe_devcoredump.o \
> 	xe_device.o \
> 	xe_device_sysfs.o \
>diff --git a/drivers/gpu/drm/xe/xe_configfs.c b/drivers/gpu/drm/xe/xe_configfs.c
>new file mode 100644
>index 000000000000..59e1bc4c5f76
>--- /dev/null
>+++ b/drivers/gpu/drm/xe/xe_configfs.c
>@@ -0,0 +1,174 @@
>+// SPDX-License-Identifier: MIT
>+/*
>+ * Copyright © 2025 Intel Corporation
>+ */
>+
>+#include <linux/configfs.h>
>+#include <linux/init.h>
>+#include <linux/module.h>
>+#include <linux/pci.h>
>+
>+#include "xe_configfs.h"
>+#include "xe_module.h"
>+
>+/**
>+ * DOC: XE Configfs Support
>+ *
>+ * Overview
>+ * =========
>+ *
>+ * Configfs is a filesystem-based manager of kernel objects. XE KMD registers a
>+ * configfs subsystem called 'xe' that allows user to configure various attributes
>+ * See Documentation/filesystems/configfs.rst for more information about how configfs works.
>+ *
>+ * Create devices
>+ * ===============
>+ *
>+ * To use configfs exposed by Xe, mount the configfs directory and create the device directory
>+ *
>+ * $ mount -t configfs none /config

nit: once upon a time I think the "standard" was to have a /config. Same
thing as for debugfs. Nowadays I think pretty much every distro settled
on /sys/kernel/{debug,config} for debugfs and configfs. Mine even gets
mounted automatically:

# grep Where /usr/lib/systemd/system/sys-kernel-config.mount
Where=/sys/kernel/config

>+ * $ mkdir /config/xe/0000:03:00.0
>+ *
>+ * Configure Attributes
>+ * ====================
>+ *
>+ * /config/xe/<device>/survivability mode : Enables survivability mode if supported by device

    * <configfs>/xe/<device>/survivability_mode: Enable survivability mode if supported by device


>+ *
>+ * Unbind			: echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind

please, no unbind. The unbind part is part of "the previous time the
driver was attached/bound to the device".

>+ * Enable survivability mode	: echo 1 > /config/xe/0000:03:00.0/survivability_mode
>+ * Bind				: echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind

are these extra spaces to try to align anything?

>+ *
>+ * The device enters survivability mode if supported
>+ *

trailing line.

>+ */
>+
>+struct xe_config_device {
>+	struct config_group group;
>+
>+	/** @survivability_mode: enables survivability mode when set */

I don't think we should kernel-doc this. But if we are, then it should
be proper kernel-doc, including the doc about the struct and not leaving
members undocumented.

>+	bool survivability_mode;
>+
>+	/** @lock: protects attributes */
>+	struct mutex lock;
>+};
>+
>+static struct xe_config_device *to_xe_config_device(struct config_item *item)
>+{
>+	return container_of(to_config_group(item), struct xe_config_device, group);
>+}
>+
>+static ssize_t survivability_mode_show(struct config_item *item, char *page)
>+{
>+	struct xe_config_device *dev = to_xe_config_device(item);
>+
>+	return sprintf(page, "%d\n", dev->survivability_mode);
>+}
>+
>+static ssize_t survivability_mode_store(struct config_item *item, const char *page, size_t len)
>+{
>+	struct xe_config_device *dev = to_xe_config_device(item);
>+	bool survivability_mode;
>+	int ret;
>+
>+	ret = kstrtobool(page, &survivability_mode);
>+	if (ret)
>+		return ret;
>+
>+	mutex_lock(&dev->lock);
>+	dev->survivability_mode = survivability_mode;
>+	mutex_unlock(&dev->lock);
>+
>+	return len;
>+}
>+
>+CONFIGFS_ATTR(, survivability_mode);
>+
>+static struct configfs_attribute *xe_config_device_attrs[] = {
>+	&attr_survivability_mode,
>+	NULL,
>+};
>+
>+static void xe_config_device_release(struct config_item *item)
>+{
>+	struct xe_config_device *dev = to_xe_config_device(item);
>+
>+	mutex_destroy(&dev->lock);
>+	kfree(dev);
>+}
>+
>+static struct configfs_item_operations xe_config_device_ops = {
>+	.release	= xe_config_device_release,
>+};
>+
>+static const struct config_item_type xe_config_device_type = {
>+	.ct_item_ops	= &xe_config_device_ops,
>+	.ct_attrs	= xe_config_device_attrs,
>+	.ct_owner	= THIS_MODULE,
>+};
>+
>+static struct config_group *xe_config_make_device_group(struct config_group *group,
>+							const char *name)
>+{
>+	unsigned int domain, bus, slot, function;
>+	struct xe_config_device *dev;
>+	struct pci_dev *pdev;
>+	int ret;
>+
>+	ret = sscanf(name, "%04x:%02x:%02x.%x", &domain, &bus, &slot, &function);
>+	if (ret != 4)
>+		return ERR_PTR(-EINVAL);
>+
>+	pdev = pci_get_domain_bus_and_slot(domain, bus, PCI_DEVFN(slot, function));
>+	if (!pdev)
>+		return ERR_PTR(-EINVAL);
>+
>+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>+	if (!dev)
>+		return ERR_PTR(-ENOMEM);
>+
>+	config_group_init_type_name(&dev->group, name, &xe_config_device_type);
>+
>+	mutex_init(&dev->lock);
>+
>+	return &dev->group;
>+}
>+
>+static struct configfs_group_operations xe_config_device_group_ops = {
>+	.make_group	= xe_config_make_device_group,
>+};
>+
>+static const struct config_item_type xe_configfs_type = {
>+	.ct_group_ops	= &xe_config_device_group_ops,
>+	.ct_owner	= THIS_MODULE,

Rodrigo, 		   ^

this is what makes the config side take a ref on the module, as it
should.

>+};
>+
>+static struct configfs_subsystem xe_configfs = {
>+	.su_group = {
>+		.cg_item = {
>+			.ci_namebuf = "xe",
>+			.ci_type = &xe_configfs_type,
>+		},
>+	},
>+};
>+
>+int __init xe_configfs_init(void)
>+{
>+	struct config_group *root = &xe_configfs.su_group;
>+	int ret = 0;

pointless init as it's overwritten below.

>+
>+	config_group_init(root);
>+	mutex_init(&xe_configfs.su_mutex);
>+	ret = configfs_register_subsystem(&xe_configfs);
>+	if (ret) {
>+		pr_err("Error %d while registering %s subsystem\n",
>+		       ret, root->cg_item.ci_namebuf);

return ret;

>+	}
>+
>+	return ret;

return 0;


A bunch of nits above, but looking pretty good.

thanks
Lucas De Marchi

>+}
>+
>+void __exit xe_configfs_exit(void)
>+{
>+	configfs_unregister_subsystem(&xe_configfs);
>+}
>+
>diff --git a/drivers/gpu/drm/xe/xe_configfs.h b/drivers/gpu/drm/xe/xe_configfs.h
>new file mode 100644
>index 000000000000..2c30be9a2c7e
>--- /dev/null
>+++ b/drivers/gpu/drm/xe/xe_configfs.h
>@@ -0,0 +1,11 @@
>+/* SPDX-License-Identifier: MIT */
>+/*
>+ * Copyright © 2025 Intel Corporation
>+ */
>+#ifndef _XE_CONFIGFS_H_
>+#define _XE_CONFIGFS_H_
>+
>+int xe_configfs_init(void);
>+void xe_configfs_exit(void);
>+
>+#endif
>diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
>index 475acdba2b55..15b3cf22193c 100644
>--- a/drivers/gpu/drm/xe/xe_module.c
>+++ b/drivers/gpu/drm/xe/xe_module.c
>@@ -11,6 +11,7 @@
> #include <drm/drm_module.h>
>
> #include "xe_drv.h"
>+#include "xe_configfs.h"
> #include "xe_hw_fence.h"
> #include "xe_pci.h"
> #include "xe_pm.h"
>@@ -91,6 +92,10 @@ static const struct init_funcs init_funcs[] = {
> 	{
> 		.init = xe_check_nomodeset,
> 	},
>+	{
>+		.init = xe_configfs_init,
>+		.exit = xe_configfs_exit,
>+	},
> 	{
> 		.init = xe_hw_fence_module_init,
> 		.exit = xe_hw_fence_module_exit,
>-- 
>2.47.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] drm/xe: Enable configfs support for survivability mode
  2025-03-27  6:42 ` [PATCH 2/2] drm/xe: Enable configfs support for " Riana Tauro
@ 2025-04-01  3:25   ` Lucas De Marchi
  0 siblings, 0 replies; 15+ messages in thread
From: Lucas De Marchi @ 2025-04-01  3:25 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, matthew.d.roper,
	aravind.iddamsetty

On Thu, Mar 27, 2025 at 12:12:02PM +0530, Riana Tauro wrote:
>Enable survivability mode if supported and configfs attribute is set.
>Enabing survivability mode manually is useful in cases where pcode does
>not detect failure, validation and for IFR (in-field-repair).
>
>To set configfs survivability mode attribute for a device
>
>mkdir /config/xe/0000:03:00.0
>echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>echo 1 > /config/xe/0000:03:00.0/survivability_mode
>echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>
>The card enters survivability mode if supported

this part is repeated in commit 1 and 2. You could leave the generic
part with survivability_mode as the example in commit 1. Here you just
mentioned setting survivability_mode to 1.

>
>Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>---
> drivers/gpu/drm/xe/xe_configfs.c           | 62 ++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_configfs.h           |  7 ++-
> drivers/gpu/drm/xe/xe_device.c             |  2 +-
> drivers/gpu/drm/xe/xe_pci.c                | 19 ++++---
> drivers/gpu/drm/xe/xe_survivability_mode.c | 39 ++++++++++----
> drivers/gpu/drm/xe/xe_survivability_mode.h |  2 +-
> 6 files changed, 109 insertions(+), 22 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_configfs.c b/drivers/gpu/drm/xe/xe_configfs.c
>index 59e1bc4c5f76..57234904c061 100644
>--- a/drivers/gpu/drm/xe/xe_configfs.c
>+++ b/drivers/gpu/drm/xe/xe_configfs.c
>@@ -151,6 +151,68 @@ static struct configfs_subsystem xe_configfs = {
> 	},
> };
>
>+static struct xe_config_device *configfs_find_group(struct pci_dev *pdev)
>+{
>+	struct config_item *item;
>+	char name[64];
>+
>+	snprintf(name, sizeof(name), "%04x:%02x:%02x.%x", pci_domain_nr(pdev->bus),
>+		 pdev->bus->number, PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
>+
>+	mutex_lock(&xe_configfs.su_mutex);
>+	item = config_group_find_item(&xe_configfs.su_group, name);
>+	mutex_unlock(&xe_configfs.su_mutex);
>+
>+	if (!item)
>+		return NULL;
>+
>+	return to_xe_config_device(item);
>+}
>+
>+/**
>+ * xe_configfs_get_survivability_mode - get configfs survivability mode attribute
>+ * @pdev: pci device
>+ *
>+ * find the configfs group that belongs to the pci device and return
>+ * the survivability mode attribute
>+ *
>+ * Return: survivability mode if config group is found, false otherwise
>+ */
>+bool xe_configfs_get_survivability_mode(struct pci_dev *pdev)
>+{
>+	struct xe_config_device *dev = configfs_find_group(pdev);
>+	bool mode;
>+
>+	if (!dev)
>+		return false;
>+
>+	mode = dev->survivability_mode;
>+	config_item_put(&dev->group.cg_item);
>+
>+	return mode;
>+}
>+
>+/**
>+ * xe_configfs_clear_survivability_mode - clear configfs survivability mode attribute
>+ * @pdev: pci device
>+ *
>+ * find the configfs group that belongs to the pci device and clear survivability
>+ * mode attribute
>+ */
>+void xe_configfs_clear_survivability_mode(struct pci_dev *pdev)
>+{
>+	struct xe_config_device *dev = configfs_find_group(pdev);
>+
>+	if (!dev)
>+		return;
>+
>+	mutex_lock(&dev->lock);
>+	dev->survivability_mode = 0;
>+	mutex_unlock(&dev->lock);
>+
>+	config_item_put(&dev->group.cg_item);
>+}
>+
> int __init xe_configfs_init(void)
> {
> 	struct config_group *root = &xe_configfs.su_group;
>diff --git a/drivers/gpu/drm/xe/xe_configfs.h b/drivers/gpu/drm/xe/xe_configfs.h
>index 2c30be9a2c7e..6e8c5ccaf202 100644
>--- a/drivers/gpu/drm/xe/xe_configfs.h
>+++ b/drivers/gpu/drm/xe/xe_configfs.h
>@@ -5,7 +5,12 @@
> #ifndef _XE_CONFIGFS_H_
> #define _XE_CONFIGFS_H_
>
>+#include <linux/types.h>
>+
>+struct pci_dev;
>+
> int xe_configfs_init(void);
> void xe_configfs_exit(void);
>-
>+bool xe_configfs_get_survivability_mode(struct pci_dev *pdev);
>+void xe_configfs_clear_survivability_mode(struct pci_dev *pdev);
> #endif
>diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>index 1ffb7d1f6be6..f4e59577ddc0 100644
>--- a/drivers/gpu/drm/xe/xe_device.c
>+++ b/drivers/gpu/drm/xe/xe_device.c
>@@ -712,7 +712,7 @@ int xe_device_probe_early(struct xe_device *xe)
> 	sriov_update_device_info(xe);
>
> 	err = xe_pcode_probe_early(xe);
>-	if (err) {
>+	if (err || xe_survivability_mode_requested(xe)) {
> 		int save_err = err;
>
> 		/*
>diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>index da384acf731f..bcacfa78afd1 100644
>--- a/drivers/gpu/drm/xe/xe_pci.c
>+++ b/drivers/gpu/drm/xe/xe_pci.c
>@@ -812,18 +812,17 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> 		return err;
>
> 	err = xe_device_probe_early(xe);
>-	if (err) {
>-		/*
>-		 * In Boot Survivability mode, no drm card is exposed and driver
>-		 * is loaded with bare minimum to allow for firmware to be
>-		 * flashed through mei. If early probe failed, but it managed to
>-		 * enable survivability mode, return success.
>-		 */
>-		if (xe_survivability_mode_is_enabled(xe))
>-			return 0;
>+	/*
>+	 * In Boot Survivability mode, no drm card is exposed and driver
>+	 * is loaded with bare minimum to allow for firmware to be
>+	 * flashed through mei. Return success, if survivability mode
>+	 * is enabled due to pcode failure or configfs being set
>+	 */
>+	if (xe_survivability_mode_is_enabled(xe))
>+		return 0;
>
>+	if (err)
> 		return err;
>-	}
>
> 	err = xe_info_init(xe, desc);
> 	if (err)
>diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>index cb813b337fd3..ed6599c5b85d 100644
>--- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>+++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>@@ -10,6 +10,7 @@
> #include <linux/pci.h>
> #include <linux/sysfs.h>
>
>+#include "xe_configfs.h"
> #include "xe_device.h"
> #include "xe_gt.h"
> #include "xe_heci_gsc.h"
>@@ -28,8 +29,12 @@
>  * This is implemented by loading the driver with bare minimum (no drm card) to allow the firmware
>  * to be flashed through mei and collect telemetry. The driver's probe flow is modified
>  * such that it enters survivability mode when pcode initialization is incomplete and boot status
>- * denotes a failure. The driver then  populates the survivability_mode PCI sysfs indicating
>- * survivability mode and provides additional information required for debug
>+ * denotes a failure. Survivability mode can also be enabled manually by setting the

I think this should start a new paragraph.

>+ * survivability_mode attribute of the device in xe configfs. This is useful in cases where pcode
>+ * does not detect failure, IFR (in-field-repair) use cases where the repair can be performed for
>+ * a single GPU card without impacting the usage of other cards in the same node and for validation.

I'm missing a verb in in this phrase

>+ * The driver then populates the survivability_mode PCI sysfs indicating survivability mode and
>+ * provides additional information required for debug
>  *
>  * KMD exposes below admin-only readable sysfs in survivability mode
>  *
>@@ -42,6 +47,15 @@
>  *			      Overflow Information   - Provides history of previous failures
>  *			      Auxiliary Information  - Certain failures may have information in
>  *						       addition to postcode information
>+ *
>+ * Enable survivability mode through configfs
>+ *
>+ * Create device directory	: mkdir /config/xe/0000:03:00.0
>+ * Unbind			: echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>+ * Enable survivability mode	: echo 1 > /config/xe/0000:03:00.0/survivability_mode
>+ * Bind				: echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind

ditto about alignment and unbind.

I didn't check how the html gets rendered though.

>+ *
>+ * The card enters survivability mode if supported
>  */
>
> static u32 aux_history_offset(u32 reg_value)
>@@ -133,6 +147,7 @@ static void xe_survivability_mode_fini(void *arg)
> 	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> 	struct device *dev = &pdev->dev;
>
>+	xe_configfs_clear_survivability_mode(pdev);
> 	sysfs_remove_file(&dev->kobj, &dev_attr_survivability_mode.attr);
> }
>
>@@ -186,24 +201,30 @@ bool xe_survivability_mode_is_enabled(struct xe_device *xe)
> 	return xe->survivability.mode;
> }
>
>-/*
>- * survivability_mode_requested - check if it's possible to enable
>- * survivability mode and that was requested by firmware
>+/**
>+ * xe_survivability_mode_requested - check if it's possible to enable survivability
>+ *				     mode that was requested by firmware or userspace
>+ * @xe: xe device instance
>  *
>- * This function reads the boot status from Pcode.
>+ * This function reads configfs and  boot status from Pcode.
>  *
>  * Return: true if platform support is available and boot status indicates
>- * failure, false otherwise.
>+ * failure or if survivability mode is requested, false otherwise.
>  */
>-static bool survivability_mode_requested(struct xe_device *xe)
>+bool xe_survivability_mode_requested(struct xe_device *xe)
> {
> 	struct xe_survivability *survivability = &xe->survivability;
> 	struct xe_mmio *mmio = xe_root_tile_mmio(xe);
>+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> 	u32 data;
>
> 	if (!IS_DGFX(xe) || xe->info.platform < XE_BATTLEMAGE || IS_SRIOV_VF(xe))
> 		return false;
>
>+	/* Enable survivability mode if set via configfs */
>+	if (xe_configfs_get_survivability_mode(pdev))
>+		return true;
>+
> 	data = xe_mmio_read32(mmio, PCODE_SCRATCH(0));
> 	survivability->boot_status = REG_FIELD_GET(BOOT_STATUS, data);
>
>@@ -226,7 +247,7 @@ int xe_survivability_mode_enable(struct xe_device *xe)
> 	struct xe_survivability_info *info;
> 	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>
>-	if (!survivability_mode_requested(xe))
>+	if (!xe_survivability_mode_requested(xe))
> 		return 0;
>
> 	survivability->size = MAX_SCRATCH_MMIO;
>diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
>index d7e64885570d..70eaf56fb3bb 100644
>--- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>+++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>@@ -12,5 +12,5 @@ struct xe_device;
>
> int xe_survivability_mode_enable(struct xe_device *xe);
> bool xe_survivability_mode_is_enabled(struct xe_device *xe);
>-
>+bool xe_survivability_mode_requested(struct xe_device *xe);

keep empty line here and probably use _is_requested to be similar to the
enabled case.

Just a few nits.. From my tests today it appears to be working fine on
BMG. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>

thanks
Lucas De Marchi

> #endif /* _XE_SURVIVABILITY_MODE_H_ */
>-- 
>2.47.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] Add configfs support for survivability mode
  2025-03-31 20:19   ` Rodrigo Vivi
  2025-03-31 21:45     ` Lucas De Marchi
@ 2025-04-01  5:55     ` Riana Tauro
  1 sibling, 0 replies; 15+ messages in thread
From: Riana Tauro @ 2025-04-01  5:55 UTC (permalink / raw)
  To: Rodrigo Vivi, Lucas De Marchi
  Cc: intel-xe, anshuman.gupta, matthew.d.roper, aravind.iddamsetty

Hi Rodrigo

On 4/1/2025 1:49 AM, Rodrigo Vivi wrote:
> On Thu, Mar 27, 2025 at 09:40:39AM -0500, Lucas De Marchi wrote:
>> On Thu, Mar 27, 2025 at 12:12:00PM +0530, Riana Tauro wrote:
>>> This series proposes to expose attributes via xe configfs
>>> subsystem. Xe registers a configfs subsystem named 'xe'.
>>> Userspace can then create directories for the devices they
>>> want to configure and set appropriate attributes
>>>
>>> This is done by
>>>
>>> mount -t configfs none /config
>>> mkdir /config/xe/0000:03:00.0
>>>
>>
>> If we need a new version or to document anywhere in our docs, I'd add a
>> comment here:
>>
>> # If driver is already bound, unbind it as this configuration
>> # applies only when probing it
>>
>>> echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>>> echo 1 > sys/kernel/config/xe/0000:03:00.0/survivability_mode
>>> echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>>>
>>> This is an alternative to introducing module param that causes
>>> all the connected and supported GPU cards to enter survivability mode.
>>> Manually entering survivability mode is useful when pcode does not
>>> report failure, in field repairs and validation
>>>
>>> Rev2: use config_groups (Lucas)
>>
>> Awesome. I have some other work pending that will make use of
>> it. I will play with these patches soon.
> 
> I really liked this new flow and I was giving it a try here right now.
> 
> However it didn't work. It didn't take me to the survivability mode,
This works only on BMG and after writing 1 to survivability mode> but 
also, I cannot unload the xe after creating this configfs file:only 
rmdir is supported to remove configfs . rm didn't work>
> sudo remove /sys/kernel/config/xe/0000\:0*
> rm: cannot remove '0000:00:02.0/survivability_mode': Operation not permitted
> rm: cannot remove '0000:03:00.0/survivability_mode': Operation not permitted>
> Tried to unbind and had the same failure.
unbind worked for me. rmmod without removing directories does not work 
as configfs takes reference count of module.

But shouldn't user be responsible to rmdir once created before unloading 
module. I haven't tried without the owner attribute but
according to code the registering subsystem might fail. Will try

Thanks
Riana>
> then with the configfs there we cannot remove the module:
> $ sudo rmmod xe
> rmmod: ERROR: Module xe is in use
> 
> 
> So, it looks we have some stuff to adjust here before we can move further,
> but so far things are looking promising indeed
> 
>>
>> thanks
>> Lucas De Marchi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] Add configfs support for survivability mode
  2025-03-31 21:45     ` Lucas De Marchi
@ 2025-04-01  6:18       ` Riana Tauro
  0 siblings, 0 replies; 15+ messages in thread
From: Riana Tauro @ 2025-04-01  6:18 UTC (permalink / raw)
  To: Lucas De Marchi, Rodrigo Vivi
  Cc: intel-xe, anshuman.gupta, matthew.d.roper, aravind.iddamsetty

Hi Lucas

On 4/1/2025 3:15 AM, Lucas De Marchi wrote:
> On Mon, Mar 31, 2025 at 04:19:28PM -0400, Rodrigo Vivi wrote:
>> On Thu, Mar 27, 2025 at 09:40:39AM -0500, Lucas De Marchi wrote:
>>> On Thu, Mar 27, 2025 at 12:12:00PM +0530, Riana Tauro wrote:
>>> > This series proposes to expose attributes via xe configfs
>>> > subsystem. Xe registers a configfs subsystem named 'xe'.
>>> > Userspace can then create directories for the devices they
>>> > want to configure and set appropriate attributes
>>> >
>>> > This is done by
>>> >
>>> > mount -t configfs none /config
>>> > mkdir /config/xe/0000:03:00.0
>>> >
>>>
>>> If we need a new version or to document anywhere in our docs, I'd add a
>>> comment here:
>>>
>>> # If driver is already bound, unbind it as this configuration
>>> # applies only when probing it
>>>
>>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>>> > echo 1 > sys/kernel/config/xe/0000:03:00.0/survivability_mode
>>> > echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>>> >
>>> > This is an alternative to introducing module param that causes
>>> > all the connected and supported GPU cards to enter survivability mode.
>>> > Manually entering survivability mode is useful when pcode does not
>>> > report failure, in field repairs and validation
>>> >
>>> > Rev2: use config_groups (Lucas)
>>>
>>> Awesome. I have some other work pending that will make use of
>>> it. I will play with these patches soon.
>>
>> I really liked this new flow and I was giving it a try here right now.
>>
>> However it didn't work. It didn't take me to the survivability mode,
>> but also, I cannot unload the xe after creating this configfs file:
>>
>> sudo remove /sys/kernel/config/xe/0000\:0*
>> rm: cannot remove '0000:00:02.0/survivability_mode': Operation not 
>> permitted
>> rm: cannot remove '0000:03:00.0/survivability_mode': Operation not 
>> permitted
> 
> humn... testing on a bmg, it works:
> 
>      # # first of all, make sure autoprobe doesn't do what we don't
>      # # want:
>      # echo 0  > /sys/bus/pci/drivers_autoprobe
> 
>      # # load module and set the configuration
>      # modprobe xe
>      # mkdir /sys/kernel/config/xe/0000:03:00.0
>      # echo 1 > /sys/kernel/config/xe/0000\:03\:00.0/survivability_mode
> 
>      # # bind the driver and check it enters survivability mode
>      # echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/bind
>      # dmesg | tail -n1
>      [ 1994.807063] xe 0000:03:00.0: In Survivability Mode
>      # cat  /sys/bus/pci/drivers/xe/0000\:03\:00.0/survivability_mode 
>      Capability Info: 0x138320 - 0x2001ae06
>      Postcode Info: 0x138324 - 0x0
>      Overflow Info: 0x138328 - 0x0
>      Auxiliary Info 0: 0x13832c - 0x0
> 
> Unbind first and test we can remove the configuration for next bind:
> 
>      # echo 0000:03:00.0 > /sys/module/xe/drivers/pci\:xe/unbind
>      # tree /sys/kernel/config/xe
>      /sys/kernel/config/xe
>      └── 0000:03:00.0
>          └── survivability_mode
>      # rmdir /sys/kernel/config/xe/0000:03:00.0/
>      # tree /sys/kernel/config/xe
>      /sys/kernel/config/xe
> 
> Remove module:
>      # modprobe -r xe
> 
> 
> What doesn't work:
> 
>      1) Remove the module without unbinding. This is already the case
>      2) Remove the module without unbinding first
Do you mean removing directory?>
> For (2) it's basicaly: when you create a configfs connection, configfs
> increments the module's refcount. You have to remove them first.
yeah it takes a refcount and i think subsystem register fails if that's 
not taken. haven't tried >
> For my surprise, it's possible to remove the config after binding -
> there's no error. 
When i was trying, even with config_item reference is taken, it had 
allowed rmdir. drop_item is void and cannot return a error but didn't 
search further as survivability mode didn't require a reference.

Thanks
Riana> I need to double check if this wouldn't create some
> UAF depending on how we use the config, but for survivability purposes,
> I don't think it has an issue with that single bool.
> 
> Lucas De Marchi
> 
>>
>> Tried to unbind and had the same failure.
>>
>> then with the configfs there we cannot remove the module:
>> $ sudo rmmod xe
>> rmmod: ERROR: Module xe is in use
> 
> what's the `lsmod | grep xe` output?
> 
> You should always be able to unbind. There's nothing the driver can do
> that would block the unbind. After unbind, you should rmdir every dir in
> /sys/kernel/config/xe/*. Note that it's not an rm -r since you can't
> remove the inner configuration files, only the directory that is the
> "connection" between configfs and the driver. You also can't remove the
> xe dir (as it's owned by the module, not the connection to the device),
> only  the dirs under it.
> 
> Lucas De Marchi
> 
> 
>>
>>
>> So, it looks we have some stuff to adjust here before we can move 
>> further,
>> but so far things are looking promising indeed
>>
>>>
>>> thanks
>>> Lucas De Marchi


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] drm/xe: Add configfs to enable survivability mode
  2025-04-01  1:03   ` Lucas De Marchi
@ 2025-04-01  8:13     ` Riana Tauro
  0 siblings, 0 replies; 15+ messages in thread
From: Riana Tauro @ 2025-04-01  8:13 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, matthew.d.roper,
	aravind.iddamsetty

Hi Lucas

Thank you for the review comments
On 4/1/2025 6:33 AM, Lucas De Marchi wrote:
> On Thu, Mar 27, 2025 at 12:12:01PM +0530, Riana Tauro wrote:
>> Registers a configfs subsystem called 'xe' to userspace. The user can
>> use this to modify exposed attributes for a device.
>>
>> Attribute exposed:
>>
>> /config/xe/<device>/survivability_mode : Enables survivability mode
>>
>> The attributes can be modified by creating the device directory under the
>> configfs directory
>>
>> mount -t configfs none /config
>> mkdir /config/xe/0000:03:00.0
>>
>> echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
>> echo 1 > /config/xe/0000:03:00.0/survivability_mode
>> echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> drivers/gpu/drm/xe/Makefile      |   1 +
>> drivers/gpu/drm/xe/xe_configfs.c | 174 +++++++++++++++++++++++++++++++
>> drivers/gpu/drm/xe/xe_configfs.h |  11 ++
>> drivers/gpu/drm/xe/xe_module.c   |   5 +
>> 4 files changed, 191 insertions(+)
>> create mode 100644 drivers/gpu/drm/xe/xe_configfs.c
>> create mode 100644 drivers/gpu/drm/xe/xe_configfs.h
>>
>> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>> index cd464fe26eb8..190176c2e10d 100644
>> --- a/drivers/gpu/drm/xe/Makefile
>> +++ b/drivers/gpu/drm/xe/Makefile
>> @@ -28,6 +28,7 @@ $(obj)/generated/%_wa_oob.c $(obj)/generated/ 
>> %_wa_oob.h: $(obj)/xe_gen_wa_oob \
>> xe-y += xe_bb.o \
>>     xe_bo.o \
>>     xe_bo_evict.o \
>> +    xe_configfs.o \
>>     xe_devcoredump.o \
>>     xe_device.o \
>>     xe_device_sysfs.o \
>> diff --git a/drivers/gpu/drm/xe/xe_configfs.c b/drivers/gpu/drm/xe/ 
>> xe_configfs.c
>> new file mode 100644
>> index 000000000000..59e1bc4c5f76
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_configfs.c
>> @@ -0,0 +1,174 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#include <linux/configfs.h>
>> +#include <linux/init.h>
>> +#include <linux/module.h>
>> +#include <linux/pci.h>
>> +
>> +#include "xe_configfs.h"
>> +#include "xe_module.h"
>> +
>> +/**
>> + * DOC: XE Configfs Support
>> + *
>> + * Overview
>> + * =========
>> + *
>> + * Configfs is a filesystem-based manager of kernel objects. XE KMD 
>> registers a
>> + * configfs subsystem called 'xe' that allows user to configure 
>> various attributes
>> + * See Documentation/filesystems/configfs.rst for more information 
>> about how configfs works.
>> + *
>> + * Create devices
>> + * ===============
>> + *
>> + * To use configfs exposed by Xe, mount the configfs directory and 
>> create the device directory
>> + *
>> + * $ mount -t configfs none /config
> 
> nit: once upon a time I think the "standard" was to have a /config. Same
> thing as for debugfs. Nowadays I think pretty much every distro settled
> on /sys/kernel/{debug,config} for debugfs and configfs. Mine even gets
> mounted automatically:
> 
Oh okay. All docs had this so added the same.Yeah i saw the 
/sys/kernel/config being mounted. Ok will change this to generic

> # grep Where /usr/lib/systemd/system/sys-kernel-config.mount
> Where=/sys/kernel/config
> 
>> + * $ mkdir /config/xe/0000:03:00.0
>> + *
>> + * Configure Attributes
>> + * ====================
>> + *
>> + * /config/xe/<device>/survivability mode : Enables survivability 
>> mode if supported by device
> 
>     * <configfs>/xe/<device>/survivability_mode: Enable survivability 
> mode if supported by device
> 
> 
>> + *
>> + * Unbind            : echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/ 
>> unbind
> 
> please, no unbind. The unbind part is part of "the previous time the
> driver was attached/bound to the device".
> 
>> + * Enable survivability mode    : echo 1 > /config/xe/0000:03:00.0/ 
>> survivability_mode
>> + * Bind                : echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/ 
>> bind
> 
> are these extra spaces to try to align anything?

aligns in the code. will generate kernel-doc and check>
>> + *
>> + * The device enters survivability mode if supported
>> + *
> 
> trailing line.
will remove this>
>> + */
>> +
>> +struct xe_config_device {
>> +    struct config_group group;
>> +
>> +    /** @survivability_mode: enables survivability mode when set */
> 
> I don't think we should kernel-doc this. But if we are, then it should
> be proper kernel-doc, including the doc about the struct and not leaving
> members undocumented.
will remove this>
>> +    bool survivability_mode;
>> +
>> +    /** @lock: protects attributes */
>> +    struct mutex lock;
>> +};
>> +
>> +static struct xe_config_device *to_xe_config_device(struct 
>> config_item *item)
>> +{
>> +    return container_of(to_config_group(item), struct 
>> xe_config_device, group);
>> +}
>> +
>> +static ssize_t survivability_mode_show(struct config_item *item, char 
>> *page)
>> +{
>> +    struct xe_config_device *dev = to_xe_config_device(item);
>> +
>> +    return sprintf(page, "%d\n", dev->survivability_mode);
>> +}
>> +
>> +static ssize_t survivability_mode_store(struct config_item *item, 
>> const char *page, size_t len)
>> +{
>> +    struct xe_config_device *dev = to_xe_config_device(item);
>> +    bool survivability_mode;
>> +    int ret;
>> +
>> +    ret = kstrtobool(page, &survivability_mode);
>> +    if (ret)
>> +        return ret;
>> +
>> +    mutex_lock(&dev->lock);
>> +    dev->survivability_mode = survivability_mode;
>> +    mutex_unlock(&dev->lock);
>> +
>> +    return len;
>> +}
>> +
>> +CONFIGFS_ATTR(, survivability_mode);
>> +
>> +static struct configfs_attribute *xe_config_device_attrs[] = {
>> +    &attr_survivability_mode,
>> +    NULL,
>> +};
>> +
>> +static void xe_config_device_release(struct config_item *item)
>> +{
>> +    struct xe_config_device *dev = to_xe_config_device(item);
>> +
>> +    mutex_destroy(&dev->lock);
>> +    kfree(dev);
>> +}
>> +
>> +static struct configfs_item_operations xe_config_device_ops = {
>> +    .release    = xe_config_device_release,
>> +};
>> +
>> +static const struct config_item_type xe_config_device_type = {
>> +    .ct_item_ops    = &xe_config_device_ops,
>> +    .ct_attrs    = xe_config_device_attrs,
>> +    .ct_owner    = THIS_MODULE,
>> +};
>> +
>> +static struct config_group *xe_config_make_device_group(struct 
>> config_group *group,
>> +                            const char *name)
>> +{
>> +    unsigned int domain, bus, slot, function;
>> +    struct xe_config_device *dev;
>> +    struct pci_dev *pdev;
>> +    int ret;
>> +
>> +    ret = sscanf(name, "%04x:%02x:%02x.%x", &domain, &bus, &slot, 
>> &function);
>> +    if (ret != 4)
>> +        return ERR_PTR(-EINVAL);
>> +
>> +    pdev = pci_get_domain_bus_and_slot(domain, bus, PCI_DEVFN(slot, 
>> function));
>> +    if (!pdev)
>> +        return ERR_PTR(-EINVAL);
>> +
>> +    dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>> +    if (!dev)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    config_group_init_type_name(&dev->group, name, 
>> &xe_config_device_type);
>> +
>> +    mutex_init(&dev->lock);
>> +
>> +    return &dev->group;
>> +}
>> +
>> +static struct configfs_group_operations xe_config_device_group_ops = {
>> +    .make_group    = xe_config_make_device_group,
>> +};
>> +
>> +static const struct config_item_type xe_configfs_type = {
>> +    .ct_group_ops    = &xe_config_device_group_ops,
>> +    .ct_owner    = THIS_MODULE,
> 
> Rodrigo,            ^
> 
> this is what makes the config side take a ref on the module, as it
> should.
> 
>> +};
>> +
>> +static struct configfs_subsystem xe_configfs = {
>> +    .su_group = {
>> +        .cg_item = {
>> +            .ci_namebuf = "xe",
>> +            .ci_type = &xe_configfs_type,
>> +        },
>> +    },
>> +};
>> +
>> +int __init xe_configfs_init(void)
>> +{
>> +    struct config_group *root = &xe_configfs.su_group;
>> +    int ret = 0;
> 
> pointless init as it's overwritten below.
> 
>> +
>> +    config_group_init(root);
>> +    mutex_init(&xe_configfs.su_mutex);
>> +    ret = configfs_register_subsystem(&xe_configfs);
>> +    if (ret) {
>> +        pr_err("Error %d while registering %s subsystem\n",
>> +               ret, root->cg_item.ci_namebuf);
> 
> return ret;
> 
>> +    }
>> +
>> +    return ret;
> 
> return 0;
> 
> 
> A bunch of nits above, but looking pretty good.

Will fix the above review comments.

Thank you
Riana>
> thanks
> Lucas De Marchi
> 
>> +}
>> +
>> +void __exit xe_configfs_exit(void)
>> +{
>> +    configfs_unregister_subsystem(&xe_configfs);
>> +}
>> +
>> diff --git a/drivers/gpu/drm/xe/xe_configfs.h b/drivers/gpu/drm/xe/ 
>> xe_configfs.h
>> new file mode 100644
>> index 000000000000..2c30be9a2c7e
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_configfs.h
>> @@ -0,0 +1,11 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +#ifndef _XE_CONFIGFS_H_
>> +#define _XE_CONFIGFS_H_
>> +
>> +int xe_configfs_init(void);
>> +void xe_configfs_exit(void);
>> +
>> +#endif
>> diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/ 
>> xe_module.c
>> index 475acdba2b55..15b3cf22193c 100644
>> --- a/drivers/gpu/drm/xe/xe_module.c
>> +++ b/drivers/gpu/drm/xe/xe_module.c
>> @@ -11,6 +11,7 @@
>> #include <drm/drm_module.h>
>>
>> #include "xe_drv.h"
>> +#include "xe_configfs.h"
>> #include "xe_hw_fence.h"
>> #include "xe_pci.h"
>> #include "xe_pm.h"
>> @@ -91,6 +92,10 @@ static const struct init_funcs init_funcs[] = {
>>     {
>>         .init = xe_check_nomodeset,
>>     },
>> +    {
>> +        .init = xe_configfs_init,
>> +        .exit = xe_configfs_exit,
>> +    },
>>     {
>>         .init = xe_hw_fence_module_init,
>>         .exit = xe_hw_fence_module_exit,
>> -- 
>> 2.47.1
>>


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-04-01  8:14 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-27  6:42 [PATCH 0/2] Add configfs support for survivability mode Riana Tauro
2025-03-27  6:25 ` ✓ CI.Patch_applied: success for Add configfs support for survivability mode (rev2) Patchwork
2025-03-27  6:26 ` ✗ CI.checkpatch: warning " Patchwork
2025-03-27  6:26 ` ✗ CI.KUnit: failure " Patchwork
2025-03-27  6:42 ` [PATCH 1/2] drm/xe: Add configfs to enable survivability mode Riana Tauro
2025-04-01  1:03   ` Lucas De Marchi
2025-04-01  8:13     ` Riana Tauro
2025-03-27  6:42 ` [PATCH 2/2] drm/xe: Enable configfs support for " Riana Tauro
2025-04-01  3:25   ` Lucas De Marchi
2025-03-27 14:40 ` [PATCH 0/2] Add " Lucas De Marchi
2025-03-31 20:19   ` Rodrigo Vivi
2025-03-31 21:45     ` Lucas De Marchi
2025-04-01  6:18       ` Riana Tauro
2025-04-01  5:55     ` Riana Tauro
  -- strict thread matches above, loose matches on Subject: below --
2025-03-07 14:24 Riana Tauro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox