* Re: [PATCH v9 2/9] lib: vsprintf: export simple_strntoull() in a safe prototype
From: Rodrigo Alencar @ 2026-03-30 12:49 UTC
To: Rodrigo Alencar, Andy Shevchenko
Cc: Petr Mladek, rodrigo.alencar, linux-kernel, linux-iio, devicetree,
linux-doc, Jonathan Cameron, David Lechner, Andy Shevchenko,
Lars-Peter Clausen, Michael Hennerich, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Andrew Morton,
Steven Rostedt, Rasmus Villemoes, Sergey Senozhatsky, Shuah Khan
In-Reply-To: <x34d7jz7be4ommjh6efx5mcq5pbpellykwuyrqayr4ske3lywf@wh46mu3anmcz>
On 26/03/27 03:17PM, Rodrigo Alencar wrote:
> On 26/03/27 12:21PM, Andy Shevchenko wrote:
> > On Fri, Mar 27, 2026 at 10:11:56AM +0000, Rodrigo Alencar wrote:
> > > On 26/03/27 11:17AM, Andy Shevchenko wrote:
> > > > On Fri, Mar 27, 2026 at 09:45:17AM +0100, Petr Mladek wrote:
> > > > > On Fri 2026-03-20 16:27:27, Rodrigo Alencar via B4 Relay wrote:
...
> > > > Maybe we want to have kstrtof32() and kstrtof64() for these two cases?
> > > >
> > > > With that we will always consider the fraction part as 32- or 64-bit,
> > > > imply floor() on the fraction for the sake of simplicity and require
> > > > it to be NUL-terminated with possible trailing '\n'.
> > >
> > > I think this is a good idea, but calling it float or fixed point
> > > is a bit confusing, as float often refers to the IEEE 754 standard and
> > > fixed-point types are often expressed in Q-format.
> >
> > Yeah... I lack a better name.
>
> decimals is the name, but they are often represented as:
>
> DECIMAL = INT * 10^X + FRAC
>
> in a single 64-bit number, which would be fine for my end use case.
> However, IIO decimal fixed-point parsing has been out there for quite some time
> and a lot of drivers use it. The interface often relies on breaking parsed values
> into an integer array (for standard attributes, int val and int val2 are expected).
Thinking about this again: in IIO drivers we end up doing something like:
val64 = (u64)val * MICRO + val2;
so drivers often work with a scaled version of the decimal value.
Would it then make sense to have a function that outputs such a value directly?
That would allow more freedom over how the 64 bits are split between the
integer and fractional parts.
As a draft:
static int _kstrtodec64(const char *s, unsigned int scale, u64 *res)
{
	u64 _res = 0, _frac = 0;
	unsigned int rv;

	if (*s != '.') {
		rv = _parse_integer(s, 10, &_res);
		if (rv & KSTRTOX_OVERFLOW)
			return -ERANGE;
		if (rv == 0)
			return -EINVAL;
		s += rv;
	}

	if (*s == '.') {
		s++;
		rv = _parse_integer_limit(s, 10, &_frac, scale);
		if (rv & KSTRTOX_OVERFLOW)
			return -ERANGE;
		if (rv == 0)
			return -EINVAL;
		s += rv;
		if (rv < scale)
			_frac *= int_pow(10, scale - rv);
		while (isdigit(*s)) /* truncate */
			s++;
	}

	if (*s == '\n')
		s++;
	if (*s)
		return -EINVAL;

	if (check_mul_overflow(_res, int_pow(10, scale), &_res) ||
	    check_add_overflow(_res, _frac, &_res))
		return -ERANGE;

	*res = _res;
	return 0;
}

noinline
int kstrtoudec64(const char *s, unsigned int scale, u64 *res)
{
	if (s[0] == '+')
		s++;
	return _kstrtodec64(s, scale, res);
}
EXPORT_SYMBOL(kstrtoudec64);

noinline
int kstrtosdec64(const char *s, unsigned int scale, s64 *res)
{
	u64 tmp;
	int rv;

	if (s[0] == '-') {
		rv = _kstrtodec64(s + 1, scale, &tmp);
		if (rv < 0)
			return rv;
		if ((s64)-tmp > 0)
			return -ERANGE;
		*res = -tmp;
	} else {
		rv = kstrtoudec64(s, scale, &tmp);
		if (rv < 0)
			return rv;
		if ((s64)tmp < 0)
			return -ERANGE;
		*res = tmp;
	}
	return 0;
}
EXPORT_SYMBOL(kstrtosdec64);
E.g., kstrtosdec64() or kstrtoudec64() would parse "3.1415" with scale 3 into 3141.
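For illustration only, the intended semantics (value = int * 10^scale + frac,
with excess fractional digits truncated) can be modelled in plain userspace C.
This is a sketch, not the kernel code: parse_udec64() is a hypothetical name,
and hand-written digit loops stand in for _parse_integer() and the
check_*_overflow() helpers. One deliberate simplification is that this model
always drops excess fractional digits, including when scale is 0.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace model: *res = int_part * 10^scale + frac_part. */
static int parse_udec64(const char *s, unsigned int scale, uint64_t *res)
{
	uint64_t ipart = 0, frac = 0, mult = 1;
	unsigned int i, ndigits = 0;

	assert(s && res);

	/* mult = 10^scale, rejecting scales that do not fit in 64 bits. */
	for (i = 0; i < scale; i++) {
		if (mult > UINT64_MAX / 10)
			return -ERANGE;
		mult *= 10;
	}

	/* Integer part is optional only when the string starts with '.'. */
	if (*s != '.') {
		if (!isdigit((unsigned char)*s))
			return -EINVAL;
		while (isdigit((unsigned char)*s)) {
			unsigned int d = *s - '0';

			if (ipart > (UINT64_MAX - d) / 10)
				return -ERANGE;
			ipart = ipart * 10 + d;
			s++;
		}
	}

	if (*s == '.') {
		s++;
		if (!isdigit((unsigned char)*s))
			return -EINVAL;
		/* Keep at most 'scale' fractional digits, drop the rest. */
		while (isdigit((unsigned char)*s)) {
			if (ndigits < scale) {
				frac = frac * 10 + (*s - '0');
				ndigits++;
			}
			s++;
		}
		/* Pad short fractions: "3.1" with scale 3 gives frac 100. */
		for (; ndigits < scale; ndigits++)
			frac *= 10;
	}

	/* Allow one trailing newline, as sysfs writes usually carry one. */
	if (*s == '\n')
		s++;
	if (*s)
		return -EINVAL;

	if (ipart > (UINT64_MAX - frac) / mult)
		return -ERANGE;
	*res = ipart * mult + frac;
	return 0;
}
```

In this model "3.1415" with scale 3 gives 3141 and "3.1" with scale 3 gives
3100, which is the scaled form drivers currently build by hand from val and val2.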
--
Kind regards,
Rodrigo Alencar
* [PATCH v6 4/4] RISC-V: KVM: add KVM_CAP_RISCV_SET_HGATP_MODE
From: fangyu.yu @ 2026-03-30 12:26 UTC
To: pbonzini, corbet, anup, atish.patra, pjw, palmer, aou, alex,
skhan
Cc: guoren, radim.krcmar, andrew.jones, linux-doc, kvm, kvm-riscv,
linux-riscv, linux-kernel, Fangyu Yu
In-Reply-To: <20260330122601.22140-1-fangyu.yu@linux.alibaba.com>
From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Add a VM capability that allows userspace to select the G-stage page table
format by setting HGATP.MODE on a per-VM basis.
Userspace enables the capability via KVM_ENABLE_CAP, passing the requested
HGATP.MODE in args[0]. The request is rejected with -EINVAL if the mode is
not supported by the host, and with -EBUSY if the VM has already been
committed (e.g. vCPUs have been created or any memslot is populated).
KVM_CHECK_EXTENSION(KVM_CAP_RISCV_SET_HGATP_MODE) returns a bitmask of the
HGATP.MODE formats supported by the host.
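
A minimal sketch of how userspace might consume the bitmask, assuming the RV64
HGATP.MODE encodings from the RISC-V privileged specification (Sv39x4 = 8,
Sv48x4 = 9, Sv57x4 = 10). The helper names are hypothetical. The mask would
come from KVM_CHECK_EXTENSION and the chosen mode would be passed in args[0]
of KVM_ENABLE_CAP; the ioctl calls themselves are left out so the logic stays
self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* HGATP.MODE encodings on RV64 (assumed from the privileged specification). */
enum {
	MODE_SV39X4 = 8,
	MODE_SV48X4 = 9,
	MODE_SV57X4 = 10,
};

/* Hypothetical helper: is 'mode' set in the mask reported by the host? */
static int hgatp_mode_supported(uint32_t mask, unsigned int mode)
{
	return mode < 32 && (mask & (UINT32_C(1) << mode)) != 0;
}

/* Hypothetical helper: highest supported mode not above 'max_mode', or -1. */
static int hgatp_pick_mode(uint32_t mask, unsigned int max_mode)
{
	int mode;

	if (max_mode > MODE_SV57X4)
		max_mode = MODE_SV57X4;

	for (mode = (int)max_mode; mode >= MODE_SV39X4; mode--)
		if (hgatp_mode_supported(mask, (unsigned int)mode))
			return mode;
	return -1;
}
```

A VMM that wants, say, at most Sv48x4 would call hgatp_pick_mode(mask,
MODE_SV48X4) and enable the capability with the result before creating vCPUs
or memslots.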
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Reviewed-by: Andrew Jones <andrew.jones@oss.qualcomm.com>
---
Documentation/virt/kvm/api.rst | 27 +++++++++++++++++++++++++++
arch/riscv/kvm/vm.c | 18 ++++++++++++++++--
include/uapi/linux/kvm.h | 1 +
3 files changed, 44 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 032516783e96..9d7f6958fa81 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8902,6 +8902,33 @@ helpful if user space wants to emulate instructions which are not
This capability can be enabled dynamically even if VCPUs were already
created and are running.
+7.47 KVM_CAP_RISCV_SET_HGATP_MODE
+---------------------------------
+
+:Architectures: riscv
+:Type: VM
+:Parameters: args[0] contains the requested HGATP mode
+:Returns:
+ - 0 on success.
+ - -EINVAL if args[0] is outside the range of HGATP modes supported by the
+ hardware.
+ - -EBUSY if vCPUs have already been created for the VM, or if the VM has any
+ non-empty memslots.
+
+This capability allows userspace to explicitly select the HGATP mode for
+the VM. The selected mode must be supported by both KVM and hardware. This
+capability must be enabled before creating any vCPUs or memslots.
+
+If this capability is not enabled, KVM will select the default HGATP mode
+automatically. The default is the highest HGATP.MODE value supported by
+hardware.
+
+``KVM_CHECK_EXTENSION(KVM_CAP_RISCV_SET_HGATP_MODE)`` returns a bitmask of
+HGATP.MODE values supported by the host. A return value of 0 indicates that
+the capability is not supported. The supported-mode bitmask uses HGATP.MODE
+encodings as defined by the RISC-V privileged specification: for example,
+Sv39x4 corresponds to HGATP.MODE=8, so userspace should test bitmask & BIT(8).
+
8. Other capabilities.
======================
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 4d82a886102c..5e82a3ad3ad0 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -201,6 +201,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VM_GPA_BITS:
r = kvm_riscv_gstage_gpa_bits(kvm->arch.pgd_levels);
break;
+ case KVM_CAP_RISCV_SET_HGATP_MODE:
+ r = kvm_riscv_get_hgatp_mode_mask();
+ break;
default:
r = 0;
break;
@@ -211,12 +214,23 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
{
+ if (cap->flags)
+ return -EINVAL;
+
switch (cap->cap) {
case KVM_CAP_RISCV_MP_STATE_RESET:
- if (cap->flags)
- return -EINVAL;
kvm->arch.mp_state_reset = true;
return 0;
+ case KVM_CAP_RISCV_SET_HGATP_MODE:
+ if (!kvm_riscv_hgatp_mode_is_valid(cap->args[0]))
+ return -EINVAL;
+
+ if (kvm->created_vcpus || !kvm_are_all_memslots_empty(kvm))
+ return -EBUSY;
+#ifdef CONFIG_64BIT
+ kvm->arch.pgd_levels = 3 + cap->args[0] - HGATP_MODE_SV39X4;
+#endif
+ return 0;
default:
return -EINVAL;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 80364d4dbebb..a74a80fd4046 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -989,6 +989,7 @@ struct kvm_enable_cap {
#define KVM_CAP_ARM_SEA_TO_USER 245
#define KVM_CAP_S390_USER_OPEREXEC 246
#define KVM_CAP_S390_KEYOP 247
+#define KVM_CAP_RISCV_SET_HGATP_MODE 248
struct kvm_irq_routing_irqchip {
__u32 irqchip;
--
2.50.1
* [PATCH v6 2/4] RISC-V: KVM: Cache gstage pgd_levels in struct kvm_gstage
From: fangyu.yu @ 2026-03-30 12:25 UTC
To: pbonzini, corbet, anup, atish.patra, pjw, palmer, aou, alex,
skhan
Cc: guoren, radim.krcmar, andrew.jones, linux-doc, kvm, kvm-riscv,
linux-riscv, linux-kernel, Fangyu Yu
In-Reply-To: <20260330122601.22140-1-fangyu.yu@linux.alibaba.com>
From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Gstage page-table helpers frequently chase gstage->kvm->arch to
fetch pgd_levels. This adds noise and repeats the same dereference
chain in hot paths.
Add pgd_levels to struct kvm_gstage and initialize it from kvm->arch
when setting up a gstage instance. Introduce kvm_riscv_gstage_init()
to centralize initialization and switch gstage code to use
gstage->pgd_levels.
Suggested-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
arch/riscv/include/asm/kvm_gstage.h | 10 ++++++
arch/riscv/kvm/gstage.c | 10 +++---
arch/riscv/kvm/mmu.c | 50 ++++++-----------------------
3 files changed, 25 insertions(+), 45 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_gstage.h b/arch/riscv/include/asm/kvm_gstage.h
index 5aa58d1f692a..70d9d483365e 100644
--- a/arch/riscv/include/asm/kvm_gstage.h
+++ b/arch/riscv/include/asm/kvm_gstage.h
@@ -15,6 +15,7 @@ struct kvm_gstage {
#define KVM_GSTAGE_FLAGS_LOCAL BIT(0)
unsigned long vmid;
pgd_t *pgd;
+ unsigned long pgd_levels;
};
struct kvm_gstage_mapping {
@@ -92,4 +93,13 @@ static inline unsigned long kvm_riscv_gstage_mode(unsigned long pgd_levels)
}
}
+static inline void kvm_riscv_gstage_init(struct kvm_gstage *gstage, struct kvm *kvm)
+{
+ gstage->kvm = kvm;
+ gstage->flags = 0;
+ gstage->vmid = READ_ONCE(kvm->arch.vmid.vmid);
+ gstage->pgd = kvm->arch.pgd;
+ gstage->pgd_levels = kvm->arch.pgd_levels;
+}
+
#endif
diff --git a/arch/riscv/kvm/gstage.c b/arch/riscv/kvm/gstage.c
index 4beb9322fe76..7c4c34bc191b 100644
--- a/arch/riscv/kvm/gstage.c
+++ b/arch/riscv/kvm/gstage.c
@@ -26,7 +26,7 @@ static inline unsigned long gstage_pte_index(struct kvm_gstage *gstage,
unsigned long mask;
unsigned long shift = HGATP_PAGE_SHIFT + (kvm_riscv_gstage_index_bits * level);
- if (level == gstage->kvm->arch.pgd_levels - 1)
+ if (level == gstage->pgd_levels - 1)
mask = (PTRS_PER_PTE * (1UL << kvm_riscv_gstage_pgd_xbits)) - 1;
else
mask = PTRS_PER_PTE - 1;
@@ -45,7 +45,7 @@ static int gstage_page_size_to_level(struct kvm_gstage *gstage, unsigned long pa
u32 i;
unsigned long psz = 1UL << 12;
- for (i = 0; i < gstage->kvm->arch.pgd_levels; i++) {
+ for (i = 0; i < gstage->pgd_levels; i++) {
if (page_size == (psz << (i * kvm_riscv_gstage_index_bits))) {
*out_level = i;
return 0;
@@ -58,7 +58,7 @@ static int gstage_page_size_to_level(struct kvm_gstage *gstage, unsigned long pa
static int gstage_level_to_page_order(struct kvm_gstage *gstage, u32 level,
unsigned long *out_pgorder)
{
- if (gstage->kvm->arch.pgd_levels < level)
+ if (gstage->pgd_levels < level)
return -EINVAL;
*out_pgorder = 12 + (level * kvm_riscv_gstage_index_bits);
@@ -83,7 +83,7 @@ bool kvm_riscv_gstage_get_leaf(struct kvm_gstage *gstage, gpa_t addr,
pte_t **ptepp, u32 *ptep_level)
{
pte_t *ptep;
- u32 current_level = gstage->kvm->arch.pgd_levels - 1;
+ u32 current_level = gstage->pgd_levels - 1;
*ptep_level = current_level;
ptep = (pte_t *)gstage->pgd;
@@ -127,7 +127,7 @@ int kvm_riscv_gstage_set_pte(struct kvm_gstage *gstage,
struct kvm_mmu_memory_cache *pcache,
const struct kvm_gstage_mapping *map)
{
- u32 current_level = gstage->kvm->arch.pgd_levels - 1;
+ u32 current_level = gstage->pgd_levels - 1;
pte_t *next_ptep = (pte_t *)gstage->pgd;
pte_t *ptep = &next_ptep[gstage_pte_index(gstage, map->addr, current_level)];
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index fbcdd75cb9af..2d3def024270 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -24,10 +24,7 @@ static void mmu_wp_memory_region(struct kvm *kvm, int slot)
phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
struct kvm_gstage gstage;
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
spin_lock(&kvm->mmu_lock);
kvm_riscv_gstage_wp_range(&gstage, start, end);
@@ -49,10 +46,7 @@ int kvm_riscv_mmu_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
struct kvm_gstage_mapping map;
struct kvm_gstage gstage;
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
pfn = __phys_to_pfn(hpa);
@@ -89,10 +83,7 @@ void kvm_riscv_mmu_iounmap(struct kvm *kvm, gpa_t gpa, unsigned long size)
{
struct kvm_gstage gstage;
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
spin_lock(&kvm->mmu_lock);
kvm_riscv_gstage_unmap_range(&gstage, gpa, size, false);
@@ -109,10 +100,7 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
struct kvm_gstage gstage;
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
kvm_riscv_gstage_wp_range(&gstage, start, end);
}
@@ -141,10 +129,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
phys_addr_t size = slot->npages << PAGE_SHIFT;
struct kvm_gstage gstage;
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
spin_lock(&kvm->mmu_lock);
kvm_riscv_gstage_unmap_range(&gstage, gpa, size, false);
@@ -250,10 +235,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
if (!kvm->arch.pgd)
return false;
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
mmu_locked = spin_trylock(&kvm->mmu_lock);
kvm_riscv_gstage_unmap_range(&gstage, range->start << PAGE_SHIFT,
(range->end - range->start) << PAGE_SHIFT,
@@ -275,10 +257,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
if (!kvm_riscv_gstage_get_leaf(&gstage, range->start << PAGE_SHIFT,
&ptep, &ptep_level))
return false;
@@ -298,10 +277,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
if (!kvm_riscv_gstage_get_leaf(&gstage, range->start << PAGE_SHIFT,
&ptep, &ptep_level))
return false;
@@ -463,10 +439,7 @@ int kvm_riscv_mmu_map(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
struct kvm_gstage gstage;
struct page *page;
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
/* Setup initial state of output mapping */
memset(out_map, 0, sizeof(*out_map));
@@ -587,10 +560,7 @@ void kvm_riscv_mmu_free_pgd(struct kvm *kvm)
spin_lock(&kvm->mmu_lock);
if (kvm->arch.pgd) {
- gstage.kvm = kvm;
- gstage.flags = 0;
- gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
- gstage.pgd = kvm->arch.pgd;
+ kvm_riscv_gstage_init(&gstage, kvm);
kvm_riscv_gstage_unmap_range(&gstage, 0UL,
kvm_riscv_gstage_gpa_size(kvm->arch.pgd_levels), false);
pgd = READ_ONCE(kvm->arch.pgd);
--
2.50.1
* [PATCH v6 1/4] RISC-V: KVM: Support runtime configuration for per-VM's HGATP mode
From: fangyu.yu @ 2026-03-30 12:25 UTC
To: pbonzini, corbet, anup, atish.patra, pjw, palmer, aou, alex,
skhan
Cc: guoren, radim.krcmar, andrew.jones, linux-doc, kvm, kvm-riscv,
linux-riscv, linux-kernel, Fangyu Yu
In-Reply-To: <20260330122601.22140-1-fangyu.yu@linux.alibaba.com>
From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Introduce one per-VM architecture-specific field to support runtime
configuration of the G-stage page table format:
- kvm->arch.pgd_levels: the corresponding number of page table levels
for the selected mode.
This field replaces the previous global variables
kvm_riscv_gstage_mode and kvm_riscv_gstage_pgd_levels, enabling different
virtual machines to independently select their G-stage page table format
instead of being forced to share the maximum mode detected by the kernel
at boot time.
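
As a worked example of what pgd_levels determines, the sketch below repeats
the arithmetic of kvm_riscv_gstage_gpa_bits() in plain C for the 64-bit case.
The constants (page shift of 12, 9 index bits per level, 2 extra root-level
bits) are written out as assumptions for RV64, and the function names are
illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_BITS	12	/* assumed value of HGATP_PAGE_SHIFT */
#define INDEX_BITS	9	/* assumed per-level index width on RV64 */
#define PGD_XBITS	2	/* extra root-level bits of the "x4" formats */

/* Guest physical address width for a given number of page table levels. */
static unsigned int gstage_gpa_bits(unsigned int pgd_levels)
{
	return PAGE_SHIFT_BITS + pgd_levels * INDEX_BITS + PGD_XBITS;
}

/* Size in bytes of the guest physical address space. */
static uint64_t gstage_gpa_size(unsigned int pgd_levels)
{
	return UINT64_C(1) << gstage_gpa_bits(pgd_levels);
}
```

Under these assumptions 3, 4 and 5 levels (Sv39x4, Sv48x4, Sv57x4) give 41,
50 and 59 GPA bits, so two VMs with different pgd_levels see different
address-space limits.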
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Reviewed-by: Andrew Jones <andrew.jones@oss.qualcomm.com>
---
arch/riscv/include/asm/kvm_gstage.h | 37 ++++++++++++----
arch/riscv/include/asm/kvm_host.h | 1 +
arch/riscv/kvm/gstage.c | 65 ++++++++++++++---------------
arch/riscv/kvm/main.c | 12 +++---
arch/riscv/kvm/mmu.c | 20 +++++----
arch/riscv/kvm/vm.c | 2 +-
arch/riscv/kvm/vmid.c | 3 +-
7 files changed, 83 insertions(+), 57 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_gstage.h b/arch/riscv/include/asm/kvm_gstage.h
index 595e2183173e..5aa58d1f692a 100644
--- a/arch/riscv/include/asm/kvm_gstage.h
+++ b/arch/riscv/include/asm/kvm_gstage.h
@@ -29,16 +29,22 @@ struct kvm_gstage_mapping {
#define kvm_riscv_gstage_index_bits 10
#endif
-extern unsigned long kvm_riscv_gstage_mode;
-extern unsigned long kvm_riscv_gstage_pgd_levels;
+extern unsigned long kvm_riscv_gstage_max_pgd_levels;
#define kvm_riscv_gstage_pgd_xbits 2
#define kvm_riscv_gstage_pgd_size (1UL << (HGATP_PAGE_SHIFT + kvm_riscv_gstage_pgd_xbits))
-#define kvm_riscv_gstage_gpa_bits (HGATP_PAGE_SHIFT + \
- (kvm_riscv_gstage_pgd_levels * \
- kvm_riscv_gstage_index_bits) + \
- kvm_riscv_gstage_pgd_xbits)
-#define kvm_riscv_gstage_gpa_size ((gpa_t)(1ULL << kvm_riscv_gstage_gpa_bits))
+
+static inline unsigned long kvm_riscv_gstage_gpa_bits(unsigned long pgd_levels)
+{
+ return (HGATP_PAGE_SHIFT +
+ pgd_levels * kvm_riscv_gstage_index_bits +
+ kvm_riscv_gstage_pgd_xbits);
+}
+
+static inline gpa_t kvm_riscv_gstage_gpa_size(unsigned long pgd_levels)
+{
+ return BIT_ULL(kvm_riscv_gstage_gpa_bits(pgd_levels));
+}
bool kvm_riscv_gstage_get_leaf(struct kvm_gstage *gstage, gpa_t addr,
pte_t **ptepp, u32 *ptep_level);
@@ -69,4 +75,21 @@ void kvm_riscv_gstage_wp_range(struct kvm_gstage *gstage, gpa_t start, gpa_t end
void kvm_riscv_gstage_mode_detect(void);
+static inline unsigned long kvm_riscv_gstage_mode(unsigned long pgd_levels)
+{
+ switch (pgd_levels) {
+ case 2:
+ return HGATP_MODE_SV32X4;
+ case 3:
+ return HGATP_MODE_SV39X4;
+ case 4:
+ return HGATP_MODE_SV48X4;
+ case 5:
+ return HGATP_MODE_SV57X4;
+ default:
+ WARN_ON_ONCE(1);
+ return HGATP_MODE_OFF;
+ }
+}
+
#endif
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 24585304c02b..478f699e9dec 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -94,6 +94,7 @@ struct kvm_arch {
/* G-stage page table */
pgd_t *pgd;
phys_addr_t pgd_phys;
+ unsigned long pgd_levels;
/* Guest Timer */
struct kvm_guest_timer timer;
diff --git a/arch/riscv/kvm/gstage.c b/arch/riscv/kvm/gstage.c
index b67d60d722c2..4beb9322fe76 100644
--- a/arch/riscv/kvm/gstage.c
+++ b/arch/riscv/kvm/gstage.c
@@ -12,22 +12,21 @@
#include <asm/kvm_gstage.h>
#ifdef CONFIG_64BIT
-unsigned long kvm_riscv_gstage_mode __ro_after_init = HGATP_MODE_SV39X4;
-unsigned long kvm_riscv_gstage_pgd_levels __ro_after_init = 3;
+unsigned long kvm_riscv_gstage_max_pgd_levels __ro_after_init = 3;
#else
-unsigned long kvm_riscv_gstage_mode __ro_after_init = HGATP_MODE_SV32X4;
-unsigned long kvm_riscv_gstage_pgd_levels __ro_after_init = 2;
+unsigned long kvm_riscv_gstage_max_pgd_levels __ro_after_init = 2;
#endif
#define gstage_pte_leaf(__ptep) \
(pte_val(*(__ptep)) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC))
-static inline unsigned long gstage_pte_index(gpa_t addr, u32 level)
+static inline unsigned long gstage_pte_index(struct kvm_gstage *gstage,
+ gpa_t addr, u32 level)
{
unsigned long mask;
unsigned long shift = HGATP_PAGE_SHIFT + (kvm_riscv_gstage_index_bits * level);
- if (level == (kvm_riscv_gstage_pgd_levels - 1))
+ if (level == gstage->kvm->arch.pgd_levels - 1)
mask = (PTRS_PER_PTE * (1UL << kvm_riscv_gstage_pgd_xbits)) - 1;
else
mask = PTRS_PER_PTE - 1;
@@ -40,12 +39,13 @@ static inline unsigned long gstage_pte_page_vaddr(pte_t pte)
return (unsigned long)pfn_to_virt(__page_val_to_pfn(pte_val(pte)));
}
-static int gstage_page_size_to_level(unsigned long page_size, u32 *out_level)
+static int gstage_page_size_to_level(struct kvm_gstage *gstage, unsigned long page_size,
+ u32 *out_level)
{
u32 i;
unsigned long psz = 1UL << 12;
- for (i = 0; i < kvm_riscv_gstage_pgd_levels; i++) {
+ for (i = 0; i < gstage->kvm->arch.pgd_levels; i++) {
if (page_size == (psz << (i * kvm_riscv_gstage_index_bits))) {
*out_level = i;
return 0;
@@ -55,21 +55,23 @@ static int gstage_page_size_to_level(unsigned long page_size, u32 *out_level)
return -EINVAL;
}
-static int gstage_level_to_page_order(u32 level, unsigned long *out_pgorder)
+static int gstage_level_to_page_order(struct kvm_gstage *gstage, u32 level,
+ unsigned long *out_pgorder)
{
- if (kvm_riscv_gstage_pgd_levels < level)
+ if (gstage->kvm->arch.pgd_levels < level)
return -EINVAL;
*out_pgorder = 12 + (level * kvm_riscv_gstage_index_bits);
return 0;
}
-static int gstage_level_to_page_size(u32 level, unsigned long *out_pgsize)
+static int gstage_level_to_page_size(struct kvm_gstage *gstage, u32 level,
+ unsigned long *out_pgsize)
{
int rc;
unsigned long page_order = PAGE_SHIFT;
- rc = gstage_level_to_page_order(level, &page_order);
+ rc = gstage_level_to_page_order(gstage, level, &page_order);
if (rc)
return rc;
@@ -81,11 +83,11 @@ bool kvm_riscv_gstage_get_leaf(struct kvm_gstage *gstage, gpa_t addr,
pte_t **ptepp, u32 *ptep_level)
{
pte_t *ptep;
- u32 current_level = kvm_riscv_gstage_pgd_levels - 1;
+ u32 current_level = gstage->kvm->arch.pgd_levels - 1;
*ptep_level = current_level;
ptep = (pte_t *)gstage->pgd;
- ptep = &ptep[gstage_pte_index(addr, current_level)];
+ ptep = &ptep[gstage_pte_index(gstage, addr, current_level)];
while (ptep && pte_val(ptep_get(ptep))) {
if (gstage_pte_leaf(ptep)) {
*ptep_level = current_level;
@@ -97,7 +99,7 @@ bool kvm_riscv_gstage_get_leaf(struct kvm_gstage *gstage, gpa_t addr,
current_level--;
*ptep_level = current_level;
ptep = (pte_t *)gstage_pte_page_vaddr(ptep_get(ptep));
- ptep = &ptep[gstage_pte_index(addr, current_level)];
+ ptep = &ptep[gstage_pte_index(gstage, addr, current_level)];
} else {
ptep = NULL;
}
@@ -110,7 +112,7 @@ static void gstage_tlb_flush(struct kvm_gstage *gstage, u32 level, gpa_t addr)
{
unsigned long order = PAGE_SHIFT;
- if (gstage_level_to_page_order(level, &order))
+ if (gstage_level_to_page_order(gstage, level, &order))
return;
addr &= ~(BIT(order) - 1);
@@ -125,9 +127,9 @@ int kvm_riscv_gstage_set_pte(struct kvm_gstage *gstage,
struct kvm_mmu_memory_cache *pcache,
const struct kvm_gstage_mapping *map)
{
- u32 current_level = kvm_riscv_gstage_pgd_levels - 1;
+ u32 current_level = gstage->kvm->arch.pgd_levels - 1;
pte_t *next_ptep = (pte_t *)gstage->pgd;
- pte_t *ptep = &next_ptep[gstage_pte_index(map->addr, current_level)];
+ pte_t *ptep = &next_ptep[gstage_pte_index(gstage, map->addr, current_level)];
if (current_level < map->level)
return -EINVAL;
@@ -151,7 +153,7 @@ int kvm_riscv_gstage_set_pte(struct kvm_gstage *gstage,
}
current_level--;
- ptep = &next_ptep[gstage_pte_index(map->addr, current_level)];
+ ptep = &next_ptep[gstage_pte_index(gstage, map->addr, current_level)];
}
if (pte_val(*ptep) != pte_val(map->pte)) {
@@ -175,7 +177,7 @@ int kvm_riscv_gstage_map_page(struct kvm_gstage *gstage,
out_map->addr = gpa;
out_map->level = 0;
- ret = gstage_page_size_to_level(page_size, &out_map->level);
+ ret = gstage_page_size_to_level(gstage, page_size, &out_map->level);
if (ret)
return ret;
@@ -217,7 +219,7 @@ void kvm_riscv_gstage_op_pte(struct kvm_gstage *gstage, gpa_t addr,
u32 next_ptep_level;
unsigned long next_page_size, page_size;
- ret = gstage_level_to_page_size(ptep_level, &page_size);
+ ret = gstage_level_to_page_size(gstage, ptep_level, &page_size);
if (ret)
return;
@@ -229,7 +231,7 @@ void kvm_riscv_gstage_op_pte(struct kvm_gstage *gstage, gpa_t addr,
if (ptep_level && !gstage_pte_leaf(ptep)) {
next_ptep = (pte_t *)gstage_pte_page_vaddr(ptep_get(ptep));
next_ptep_level = ptep_level - 1;
- ret = gstage_level_to_page_size(next_ptep_level, &next_page_size);
+ ret = gstage_level_to_page_size(gstage, next_ptep_level, &next_page_size);
if (ret)
return;
@@ -263,7 +265,7 @@ void kvm_riscv_gstage_unmap_range(struct kvm_gstage *gstage,
while (addr < end) {
found_leaf = kvm_riscv_gstage_get_leaf(gstage, addr, &ptep, &ptep_level);
- ret = gstage_level_to_page_size(ptep_level, &page_size);
+ ret = gstage_level_to_page_size(gstage, ptep_level, &page_size);
if (ret)
break;
@@ -297,7 +299,7 @@ void kvm_riscv_gstage_wp_range(struct kvm_gstage *gstage, gpa_t start, gpa_t end
while (addr < end) {
found_leaf = kvm_riscv_gstage_get_leaf(gstage, addr, &ptep, &ptep_level);
- ret = gstage_level_to_page_size(ptep_level, &page_size);
+ ret = gstage_level_to_page_size(gstage, ptep_level, &page_size);
if (ret)
break;
@@ -319,39 +321,34 @@ void __init kvm_riscv_gstage_mode_detect(void)
/* Try Sv57x4 G-stage mode */
csr_write(CSR_HGATP, HGATP_MODE_SV57X4 << HGATP_MODE_SHIFT);
if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV57X4) {
- kvm_riscv_gstage_mode = HGATP_MODE_SV57X4;
- kvm_riscv_gstage_pgd_levels = 5;
+ kvm_riscv_gstage_max_pgd_levels = 5;
goto done;
}
/* Try Sv48x4 G-stage mode */
csr_write(CSR_HGATP, HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT);
if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV48X4) {
- kvm_riscv_gstage_mode = HGATP_MODE_SV48X4;
- kvm_riscv_gstage_pgd_levels = 4;
+ kvm_riscv_gstage_max_pgd_levels = 4;
goto done;
}
/* Try Sv39x4 G-stage mode */
csr_write(CSR_HGATP, HGATP_MODE_SV39X4 << HGATP_MODE_SHIFT);
if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV39X4) {
- kvm_riscv_gstage_mode = HGATP_MODE_SV39X4;
- kvm_riscv_gstage_pgd_levels = 3;
+ kvm_riscv_gstage_max_pgd_levels = 3;
goto done;
}
#else /* CONFIG_32BIT */
/* Try Sv32x4 G-stage mode */
csr_write(CSR_HGATP, HGATP_MODE_SV32X4 << HGATP_MODE_SHIFT);
if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV32X4) {
- kvm_riscv_gstage_mode = HGATP_MODE_SV32X4;
- kvm_riscv_gstage_pgd_levels = 2;
+ kvm_riscv_gstage_max_pgd_levels = 2;
goto done;
}
#endif
/* KVM depends on !HGATP_MODE_OFF */
- kvm_riscv_gstage_mode = HGATP_MODE_OFF;
- kvm_riscv_gstage_pgd_levels = 0;
+ kvm_riscv_gstage_max_pgd_levels = 0;
done:
csr_write(CSR_HGATP, 0);
diff --git a/arch/riscv/kvm/main.c b/arch/riscv/kvm/main.c
index 0f3fe3986fc0..90ee0a032b9a 100644
--- a/arch/riscv/kvm/main.c
+++ b/arch/riscv/kvm/main.c
@@ -105,17 +105,17 @@ static int __init riscv_kvm_init(void)
return rc;
kvm_riscv_gstage_mode_detect();
- switch (kvm_riscv_gstage_mode) {
- case HGATP_MODE_SV32X4:
+ switch (kvm_riscv_gstage_max_pgd_levels) {
+ case 2:
str = "Sv32x4";
break;
- case HGATP_MODE_SV39X4:
+ case 3:
str = "Sv39x4";
break;
- case HGATP_MODE_SV48X4:
+ case 4:
str = "Sv48x4";
break;
- case HGATP_MODE_SV57X4:
+ case 5:
str = "Sv57x4";
break;
default:
@@ -164,7 +164,7 @@ static int __init riscv_kvm_init(void)
(rc) ? slist : "no features");
}
- kvm_info("using %s G-stage page table format\n", str);
+ kvm_info("highest G-stage page table mode is %s\n", str);
kvm_info("VMID %ld bits available\n", kvm_riscv_gstage_vmid_bits());
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 088d33ba90ed..fbcdd75cb9af 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -67,7 +67,7 @@ int kvm_riscv_mmu_ioremap(struct kvm *kvm, gpa_t gpa, phys_addr_t hpa,
if (!writable)
map.pte = pte_wrprotect(map.pte);
- ret = kvm_mmu_topup_memory_cache(&pcache, kvm_riscv_gstage_pgd_levels);
+ ret = kvm_mmu_topup_memory_cache(&pcache, kvm->arch.pgd_levels);
if (ret)
goto out;
@@ -186,7 +186,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
* space addressable by the KVM guest GPA space.
*/
if ((new->base_gfn + new->npages) >=
- (kvm_riscv_gstage_gpa_size >> PAGE_SHIFT))
+ kvm_riscv_gstage_gpa_size(kvm->arch.pgd_levels) >> PAGE_SHIFT)
return -EFAULT;
hva = new->userspace_addr;
@@ -472,7 +472,7 @@ int kvm_riscv_mmu_map(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
memset(out_map, 0, sizeof(*out_map));
/* We need minimum second+third level pages */
- ret = kvm_mmu_topup_memory_cache(pcache, kvm_riscv_gstage_pgd_levels);
+ ret = kvm_mmu_topup_memory_cache(pcache, kvm->arch.pgd_levels);
if (ret) {
kvm_err("Failed to topup G-stage cache\n");
return ret;
@@ -575,6 +575,7 @@ int kvm_riscv_mmu_alloc_pgd(struct kvm *kvm)
return -ENOMEM;
kvm->arch.pgd = page_to_virt(pgd_page);
kvm->arch.pgd_phys = page_to_phys(pgd_page);
+ kvm->arch.pgd_levels = kvm_riscv_gstage_max_pgd_levels;
return 0;
}
@@ -590,10 +591,12 @@ void kvm_riscv_mmu_free_pgd(struct kvm *kvm)
gstage.flags = 0;
gstage.vmid = READ_ONCE(kvm->arch.vmid.vmid);
gstage.pgd = kvm->arch.pgd;
- kvm_riscv_gstage_unmap_range(&gstage, 0UL, kvm_riscv_gstage_gpa_size, false);
+ kvm_riscv_gstage_unmap_range(&gstage, 0UL,
+ kvm_riscv_gstage_gpa_size(kvm->arch.pgd_levels), false);
pgd = READ_ONCE(kvm->arch.pgd);
kvm->arch.pgd = NULL;
kvm->arch.pgd_phys = 0;
+ kvm->arch.pgd_levels = 0;
}
spin_unlock(&kvm->mmu_lock);
@@ -603,11 +606,12 @@ void kvm_riscv_mmu_free_pgd(struct kvm *kvm)
void kvm_riscv_mmu_update_hgatp(struct kvm_vcpu *vcpu)
{
- unsigned long hgatp = kvm_riscv_gstage_mode << HGATP_MODE_SHIFT;
- struct kvm_arch *k = &vcpu->kvm->arch;
+ struct kvm_arch *ka = &vcpu->kvm->arch;
+ unsigned long hgatp = kvm_riscv_gstage_mode(ka->pgd_levels)
+ << HGATP_MODE_SHIFT;
- hgatp |= (READ_ONCE(k->vmid.vmid) << HGATP_VMID_SHIFT) & HGATP_VMID;
- hgatp |= (k->pgd_phys >> PAGE_SHIFT) & HGATP_PPN;
+ hgatp |= (READ_ONCE(ka->vmid.vmid) << HGATP_VMID_SHIFT) & HGATP_VMID;
+ hgatp |= (ka->pgd_phys >> PAGE_SHIFT) & HGATP_PPN;
ncsr_write(CSR_HGATP, hgatp);
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 13c63ae1a78b..4d82a886102c 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -199,7 +199,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = KVM_USER_MEM_SLOTS;
break;
case KVM_CAP_VM_GPA_BITS:
- r = kvm_riscv_gstage_gpa_bits;
+ r = kvm_riscv_gstage_gpa_bits(kvm->arch.pgd_levels);
break;
default:
r = 0;
diff --git a/arch/riscv/kvm/vmid.c b/arch/riscv/kvm/vmid.c
index cf34d448289d..c15bdb1dd8be 100644
--- a/arch/riscv/kvm/vmid.c
+++ b/arch/riscv/kvm/vmid.c
@@ -26,7 +26,8 @@ static DEFINE_SPINLOCK(vmid_lock);
void __init kvm_riscv_gstage_vmid_detect(void)
{
/* Figure-out number of VMID bits in HW */
- csr_write(CSR_HGATP, (kvm_riscv_gstage_mode << HGATP_MODE_SHIFT) | HGATP_VMID);
+ csr_write(CSR_HGATP, (kvm_riscv_gstage_mode(kvm_riscv_gstage_max_pgd_levels) <<
+ HGATP_MODE_SHIFT) | HGATP_VMID);
vmid_bits = csr_read(CSR_HGATP);
vmid_bits = (vmid_bits & HGATP_VMID) >> HGATP_VMID_SHIFT;
vmid_bits = fls_long(vmid_bits);
--
2.50.1
* [PATCH v6 3/4] RISC-V: KVM: Detect and expose supported HGATP G-stage modes
From: fangyu.yu @ 2026-03-30 12:26 UTC
To: pbonzini, corbet, anup, atish.patra, pjw, palmer, aou, alex,
skhan
Cc: guoren, radim.krcmar, andrew.jones, linux-doc, kvm, kvm-riscv,
linux-riscv, linux-kernel, Fangyu Yu
In-Reply-To: <20260330122601.22140-1-fangyu.yu@linux.alibaba.com>
From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Extend kvm_riscv_gstage_mode_detect() to probe all HGATP.MODE values
supported by the host and record them in a bitmask. Keep tracking the
maximum supported G-stage page table level for existing internal users.
Also provide lightweight helpers to retrieve the supported-mode bitmask
and validate a requested HGATP.MODE against it.
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Reviewed-by: Andrew Jones <andrew.jones@oss.qualcomm.com>
---
arch/riscv/include/asm/kvm_gstage.h | 11 ++++++++
arch/riscv/kvm/gstage.c | 43 +++++++++++++++--------------
2 files changed, 34 insertions(+), 20 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_gstage.h b/arch/riscv/include/asm/kvm_gstage.h
index 70d9d483365e..bbf8f45c6563 100644
--- a/arch/riscv/include/asm/kvm_gstage.h
+++ b/arch/riscv/include/asm/kvm_gstage.h
@@ -31,6 +31,7 @@ struct kvm_gstage_mapping {
#endif
extern unsigned long kvm_riscv_gstage_max_pgd_levels;
+extern u32 kvm_riscv_gstage_supported_mode_mask;
#define kvm_riscv_gstage_pgd_xbits 2
#define kvm_riscv_gstage_pgd_size (1UL << (HGATP_PAGE_SHIFT + kvm_riscv_gstage_pgd_xbits))
@@ -102,4 +103,14 @@ static inline void kvm_riscv_gstage_init(struct kvm_gstage *gstage, struct kvm *
gstage->pgd_levels = kvm->arch.pgd_levels;
}
+static inline u32 kvm_riscv_get_hgatp_mode_mask(void)
+{
+ return kvm_riscv_gstage_supported_mode_mask;
+}
+
+static inline bool kvm_riscv_hgatp_mode_is_valid(unsigned long mode)
+{
+ return kvm_riscv_gstage_supported_mode_mask & BIT(mode);
+}
+
#endif
diff --git a/arch/riscv/kvm/gstage.c b/arch/riscv/kvm/gstage.c
index 7c4c34bc191b..459041255c14 100644
--- a/arch/riscv/kvm/gstage.c
+++ b/arch/riscv/kvm/gstage.c
@@ -16,6 +16,8 @@ unsigned long kvm_riscv_gstage_max_pgd_levels __ro_after_init = 3;
#else
unsigned long kvm_riscv_gstage_max_pgd_levels __ro_after_init = 2;
#endif
+/* Bitmask of supported HGATP.MODE encodings (BIT(HGATP_MODE_*)). */
+u32 kvm_riscv_gstage_supported_mode_mask __ro_after_init;
#define gstage_pte_leaf(__ptep) \
(pte_val(*(__ptep)) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC))
@@ -315,42 +317,43 @@ void kvm_riscv_gstage_wp_range(struct kvm_gstage *gstage, gpa_t start, gpa_t end
}
}
+static bool __init kvm_riscv_hgatp_mode_supported(unsigned long mode)
+{
+ csr_write(CSR_HGATP, mode << HGATP_MODE_SHIFT);
+ return ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == mode);
+}
+
void __init kvm_riscv_gstage_mode_detect(void)
{
+ kvm_riscv_gstage_supported_mode_mask = 0;
+ kvm_riscv_gstage_max_pgd_levels = 0;
+
#ifdef CONFIG_64BIT
- /* Try Sv57x4 G-stage mode */
- csr_write(CSR_HGATP, HGATP_MODE_SV57X4 << HGATP_MODE_SHIFT);
- if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV57X4) {
- kvm_riscv_gstage_max_pgd_levels = 5;
- goto done;
+ /* Try Sv39x4 G-stage mode */
+ if (kvm_riscv_hgatp_mode_supported(HGATP_MODE_SV39X4)) {
+ kvm_riscv_gstage_supported_mode_mask |= BIT(HGATP_MODE_SV39X4);
+ kvm_riscv_gstage_max_pgd_levels = 3;
}
/* Try Sv48x4 G-stage mode */
- csr_write(CSR_HGATP, HGATP_MODE_SV48X4 << HGATP_MODE_SHIFT);
- if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV48X4) {
+ if (kvm_riscv_hgatp_mode_supported(HGATP_MODE_SV48X4)) {
+ kvm_riscv_gstage_supported_mode_mask |= BIT(HGATP_MODE_SV48X4);
kvm_riscv_gstage_max_pgd_levels = 4;
- goto done;
}
- /* Try Sv39x4 G-stage mode */
- csr_write(CSR_HGATP, HGATP_MODE_SV39X4 << HGATP_MODE_SHIFT);
- if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV39X4) {
- kvm_riscv_gstage_max_pgd_levels = 3;
- goto done;
+ /* Try Sv57x4 G-stage mode */
+ if (kvm_riscv_hgatp_mode_supported(HGATP_MODE_SV57X4)) {
+ kvm_riscv_gstage_supported_mode_mask |= BIT(HGATP_MODE_SV57X4);
+ kvm_riscv_gstage_max_pgd_levels = 5;
}
#else /* CONFIG_32BIT */
/* Try Sv32x4 G-stage mode */
- csr_write(CSR_HGATP, HGATP_MODE_SV32X4 << HGATP_MODE_SHIFT);
- if ((csr_read(CSR_HGATP) >> HGATP_MODE_SHIFT) == HGATP_MODE_SV32X4) {
+ if (kvm_riscv_hgatp_mode_supported(HGATP_MODE_SV32X4)) {
+ kvm_riscv_gstage_supported_mode_mask |= BIT(HGATP_MODE_SV32X4);
kvm_riscv_gstage_max_pgd_levels = 2;
- goto done;
}
#endif
- /* KVM depends on !HGATP_MODE_OFF */
- kvm_riscv_gstage_max_pgd_levels = 0;
-
-done:
csr_write(CSR_HGATP, 0);
kvm_riscv_local_hfence_gvma_all();
}
--
2.50.1
^ permalink raw reply related
* [PATCH v6 0/4] Support runtime configuration for per-VM's HGATP mode
From: fangyu.yu @ 2026-03-30 12:25 UTC (permalink / raw)
To: pbonzini, corbet, anup, atish.patra, pjw, palmer, aou, alex,
skhan
Cc: guoren, radim.krcmar, andrew.jones, linux-doc, kvm, kvm-riscv,
linux-riscv, linux-kernel, Fangyu Yu
From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Currently, RISC-V KVM hardcodes the G-stage page table format (HGATP mode)
to the maximum mode detected at boot time (e.g., SV57x4 if supported).
However, such a wide GPA is often unnecessary, just as a host sometimes
doesn't need sv57.
This series introduces per-VM configurability of the G-stage mode via a new
KVM capability: KVM_CAP_RISCV_SET_HGATP_MODE. User-space can now explicitly
request a specific HGATP mode (SV39x4, SV48x4, SV57x4 or SV32x4) during
VM creation.
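As a rough illustration of the supported-mode bitmask described above, here is a
minimal sketch in plain C, assuming the architectural hgatp.MODE encodings are
used directly as bit indices. The helper names are hypothetical and are not the
ones from the series; the range check before the shift is an addition made only
for this standalone example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural hgatp.MODE encodings (RISC-V privileged specification). */
enum {
	HGATP_MODE_OFF    = 0,
	HGATP_MODE_SV32X4 = 1,
	HGATP_MODE_SV39X4 = 8,
	HGATP_MODE_SV48X4 = 9,
	HGATP_MODE_SV57X4 = 10,
};

/* Hypothetical helper: is @mode present in the host's supported-mode mask? */
static bool hgatp_mode_is_valid(uint32_t supported_mask, unsigned long mode)
{
	/* Reject out-of-range values so the shift below is always defined. */
	if (mode >= 32)
		return false;
	return (supported_mask & (UINT32_C(1) << mode)) != 0;
}

/* Hypothetical helper: page table levels implied by a mode, 0 if unknown. */
static unsigned long hgatp_mode_to_pgd_levels(unsigned long mode)
{
	switch (mode) {
	case HGATP_MODE_SV32X4: return 2;
	case HGATP_MODE_SV39X4: return 3;
	case HGATP_MODE_SV48X4: return 4;
	case HGATP_MODE_SV57X4: return 5;
	default:                return 0;
	}
}
```

On a host that only probes Sv39x4 and Sv48x4 successfully, the mask would have
bits 8 and 9 set, so a request for Sv57x4 (bit 10) would be rejected.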
---
Changes in v6 (Anup's suggestions):
- Reworked kvm_riscv_gstage_gpa_bits() and kvm_riscv_gstage_gpa_size() to
take "unsigned long pgd_levels" instead of "struct kvm_arch *".
- Moved kvm_riscv_gstage_mode() helper from kvm_host.h to kvm_gstage.h.
- Renamed kvm->arch.kvm_riscv_gstage_pgd_levels to kvm->arch.pgd_levels.
- Added pgd_levels to struct kvm_gstage to avoid repeated
gstage->kvm->arch pointer chasing.
- Link to v5:
https://lore.kernel.org/linux-riscv/20260204134507.33912-1-fangyu.yu@linux.alibaba.com/
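To make the relation between pgd_levels and the guest physical address width
concrete, here is a small standalone sketch. It assumes 4 KiB pages, 2 extra
root-level index bits (the "x4" in the mode names), and 9 index bits per level
on RV64 or 10 on RV32. The function name and constants are illustrative only
and are not the helpers from the series.

```c
/* Assumed constants: 4 KiB pages and 2 extra root-level index bits. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PGD_XBITS  2

/*
 * Hypothetical helper: GPA width for a G-stage table with @pgd_levels
 * levels, each resolving @index_bits bits (9 on RV64, 10 on RV32).
 */
static unsigned long gstage_gpa_bits(unsigned long pgd_levels,
				     unsigned long index_bits)
{
	return SKETCH_PAGE_SHIFT + pgd_levels * index_bits + SKETCH_PGD_XBITS;
}
```

Under these assumptions Sv39x4 gives 41 bits, Sv48x4 gives 50, Sv57x4 gives 59
and Sv32x4 gives 34, which is the kind of value KVM_CAP_VM_GPA_BITS would
report for a VM using that mode.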
---
Changes in v5:
- Use architectural HGATP.MODE encodings as the bit index for the supported-mode
bitmap and for the VM-mode selection UAPI; no new UAPI mode/bit defines are
introduced (per Radim).
- Allow KVM_CAP_RISCV_SET_HGATP_MODE on RV32 as well (per Drew).
- Link to v4:
https://lore.kernel.org/linux-riscv/20260202140716.34323-1-fangyu.yu@linux.alibaba.com/
---
Changes in v4:
- Extend kvm_riscv_gstage_mode_detect() to probe all HGATP.MODE values
supported by the host and record them in a bitmask.
- Treat unexpected pgd_levels in kvm_riscv_gstage_mode() as an internal error
(e.g. WARN_ON_ONCE()) (per Radim).
- Move kvm_riscv_gstage_gpa_bits() and kvm_riscv_gstage_gpa_size() to header
as static inline helpers (per Radim).
- Drop gstage_mode_user_initialized and remove the kvm_debug() message from
KVM_CAP_RISCV_SET_HGATP_MODE (per Radim).
- Link to v3:
https://lore.kernel.org/linux-riscv/20260125150450.27068-1-fangyu.yu@linux.alibaba.com/
---
Changes in v3:
- Reworked the patch formatting (per Drew).
- Dropped kvm->arch.kvm_riscv_gstage_mode and derive HGATP.MODE from
kvm_riscv_gstage_pgd_levels via a helper, avoiding redundant per-VM state (per Drew).
- Removed kvm_riscv_gstage_max_mode and keep only kvm_riscv_gstage_max_pgd_levels
for host capability detection (per Drew).
- Fixed other initialization and return value issues (per Drew).
- Enforce that KVM_CAP_RISCV_SET_HGATP_MODE can only be enabled before any vCPUs
are created by rejecting the ioctl once kvm->created_vcpus is non-zero (per Radim).
- Add a memslot safety check and reject the capability unless
kvm_are_all_memslots_empty(kvm) is true, ensuring the G-stage format is not
changed after any memslots have been installed (per Radim).
- Link to v2:
https://lore.kernel.org/linux-riscv/20260105143232.76715-1-fangyu.yu@linux.alibaba.com/
Fangyu Yu (4):
RISC-V: KVM: Support runtime configuration for per-VM's HGATP mode
RISC-V: KVM: Cache gstage pgd_levels in struct kvm_gstage
RISC-V: KVM: Detect and expose supported HGATP G-stage modes
RISC-V: KVM: add KVM_CAP_RISCV_SET_HGATP_MODE
Documentation/virt/kvm/api.rst | 27 ++++++++
arch/riscv/include/asm/kvm_gstage.h | 58 ++++++++++++++--
arch/riscv/include/asm/kvm_host.h | 1 +
arch/riscv/kvm/gstage.c | 102 ++++++++++++++--------------
arch/riscv/kvm/main.c | 12 ++--
arch/riscv/kvm/mmu.c | 70 ++++++-------------
arch/riscv/kvm/vm.c | 20 +++++-
arch/riscv/kvm/vmid.c | 3 +-
include/uapi/linux/kvm.h | 1 +
9 files changed, 178 insertions(+), 116 deletions(-)
--
2.50.1
^ permalink raw reply
* Re: [PATCH v10 2/2] hwmon: add support for MCP998X
From: Victor.Duicu @ 2026-03-30 12:01 UTC (permalink / raw)
To: linux
Cc: corbet, linux-hwmon, devicetree, robh, linux-kernel, krzk+dt,
linux-doc, conor+dt, Marius.Cristea
In-Reply-To: <ccda48d0-3b10-4c3c-a632-6f70b54436fb@roeck-us.net>
Hi Guenter,
...
> > + }
> > +
> > + switch (type) {
> > + case hwmon_temp:
> > + switch (attr) {
> > + case hwmon_temp_input:
> > + /* Block reading from addresses 0x00->0x09 is
> > not allowed. */
> > + ret = regmap_read(priv->regmap,
> > MCP9982_HIGH_BYTE_ADDR(channel), &reg_high);
> > + if (ret)
> > + return ret;
> > +
> > + ret = regmap_read(priv->regmap,
> > MCP9982_HIGH_BYTE_ADDR(channel) + 1,
> > + &reg_low);
> > + if (ret)
> > + return ret;
>
> Reading the 11-bit temperature value involves two separate 8-bit
> register reads.
> If the chip updates the temperature between these two reads, the
> resulting value
> may be torn. While some chips latch the low byte upon reading the
> high byte,
> the driver does not explicitly rely on or document this behavior, and
> it's safer
> to use regmap_bulk_read if supported, or at least ensure the correct
> order and
> atomicity if possible.
>
> Note: Maybe the low temperature is latched, but there is no
> indication in the
> datasheet that this would be the case. Even if it is, the code above
> is
> inefficient.
The low temperature register is latched. In the documentation at
page 32 it is described that when reading the high byte register,
the value from the low byte register is copied into a 'shadow'
register. In this way it is guaranteed that when we read the low byte,
it will correspond to the high byte.
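To illustrate why the latch makes two single-byte reads safe, here is a small
standalone model in C. It is only a sketch: the struct and function names are
hypothetical, and the layout of the 11-bit value (8 bits from the high byte
followed by the upper 3 bits of the low byte) is an assumption made for the
example, not something taken from the driver.

```c
#include <stdint.h>

/* Hypothetical model of a chip that latches the low byte on a high-byte read. */
struct temp_chip {
	uint8_t high;       /* live high byte */
	uint8_t low;        /* live low byte */
	uint8_t low_shadow; /* copy of the low byte taken when high is read */
};

static uint8_t chip_read_high(struct temp_chip *c)
{
	c->low_shadow = c->low; /* latch: freeze the matching low byte */
	return c->high;
}

static uint8_t chip_read_low(const struct temp_chip *c)
{
	return c->low_shadow; /* always pairs with the last high-byte read */
}

/* Assumed layout: high byte, then the top 3 bits of the low byte. */
static uint16_t temp_raw11(uint8_t high, uint8_t low)
{
	return (uint16_t)((high << 3) | (low >> 5));
}
```

In this model, even if a new conversion changes both live bytes between the two
reads, the low byte returned still belongs to the high byte that was read
first, so the combined value is not torn. The order matters: the high byte has
to be read before the low byte.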
Regarding the bulk read, the chip has a number of design quirks and
because of that different commands are supported only on some
particular memory regions.
According to the documentation (page 26), the only areas of memory that
support SMBus block read are 80h->89h (temperature memory block) and
90h->97h (status memory block). In order to block read the temperatures,
the area of memory targeted has to be the temperature memory block. In
this context the read operation uses the SMBus protocol, and the first value
returned will be the number of addresses that can be read (in our
particular case a max value of 10 bytes).
In v8 of the driver
https://lore.kernel.org/all/20251120071248.3767-1-victor.duicu@microchip.com/
the temperature values were read with regmap_bulk_read(). In that
version, regmap_bulk_read() was also used to read the temperature
limits, without returning count (this is an undocumented feature of the
chip and because of that we could assume it is not supported).
In order to avoid this behaviour and avoid mixing the SMBus and I2C
protocols, all block readings were removed.
In the hopes of bypassing a long chain of replies, I tested the
behaviour of the chip with different read instructions.
regmap_bulk_read(), when applied to the temperature memory block
(80h->89h), returns count and the high and low bytes. When it is applied
to the 00h->09h memory, it uses I2C. It returns one temperature byte,
but all other bytes are returned as 0xFF. The chip behaves as if
it is at the last register location in the temperature block while the
host continues to ACK (behaviour described at page 26).
If we set use_single_read in regmap_config and apply regmap_bulk_read()
to the 00h->09h register area the high and low temperature bytes are
read successfully without count.
regmap_multi_reg_read() reads a number of registers one by one. When
applied to the 00h->09h area, I2C is used and it returns only the high
and low temperature bytes. When applied to the temperature memory block
(80h->89h), because it is not a bulk function, it returns the count till
the end of the temperature memory block (aka SMBus count).
i2c_smbus_read_block_data(), when applied to the temperature block
(80h->89h), returns the count, the driver replies with a NACK and the
communication is stopped. In our case, the board we are using to test
the driver has an AT91 adapter and supports
I2C_FUNC_SMBUS_READ_BLOCK_DATA. It seems that the I2C driver for AT91
does not modify the buffer length of the message, leaving it at 1.
i2c_smbus_read_i2c_block_data(), when applied to the temperature block
(80h->89h), returns count and the temperature values.
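For the case where the reply from the temperature block starts with the count
byte, skipping it could look roughly like the sketch below. This is plain C
with a hypothetical helper name. It assumes only what is described above, that
the first byte returned is the SMBus count and the data bytes follow it, and
it is not code from the driver.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: @reply holds a raw block-read reply whose first byte
 * is the SMBus count. Copy the payload to @out and return its length, or -1
 * if the count does not fit in what was received or in @out.
 */
static int smbus_block_payload(const uint8_t *reply, size_t reply_len,
			       uint8_t *out, size_t out_len)
{
	size_t count;

	if (reply_len < 1)
		return -1;

	count = reply[0];
	if (count > reply_len - 1 || count > out_len)
		return -1;

	for (size_t i = 0; i < count; i++)
		out[i] = reply[1 + i];

	return (int)count;
}
```

With such a helper the caller would see only the high/low temperature bytes,
whichever read function ends up being used underneath.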
If you are of the opinion that block reading the temperatures is worth
introducing (even in case we need to skip count) then I can add it, but
we should come to an agreement on which function to use.
Please let me know your thoughts.
Kind regards,
Victor
^ permalink raw reply
* Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events
From: Leonardo Bras @ 2026-03-30 11:31 UTC (permalink / raw)
To: Tian Zheng
Cc: Leonardo Bras, maz, oupton, catalin.marinas, corbet, pbonzini,
will, yuzenghui, wangzhou1, liuyonglong, Jonathan.Cameron,
yezhenyu2, linuxarm, joey.gouly, kvmarm, kvm, linux-arm-kernel,
linux-doc, linux-kernel, skhan, suzuki.poulose
In-Reply-To: <4e800c1e-25db-4aa2-b100-63434973de93@huawei.com>
On Sat, Mar 28, 2026 at 02:05:25PM +0800, Tian Zheng wrote:
>
> On 3/27/2026 11:00 PM, Leonardo Bras wrote:
> > On Fri, Mar 27, 2026 at 03:35:29PM +0800, Tian Zheng wrote:
> > > On 3/26/2026 2:05 AM, Leonardo Bras wrote:
> > > > Hello Tian,
> > > >
> > > > I am currently working on HACDBS enablement(which will be rebased on top of
> > > > this patchset) and due to the fact HACDBS and HDBSS are kind of
> > > > complementary I will sometimes come with some questions for issues I have
> > > > faced myself on that part. :)
> > > >
> > > > (see below)
> > >
> > > Of course! Happy to exchange ideas and learn together.
> > :)
> >
> > >
> > > > On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:
> > > > > From: eillon <yezhenyu2@huawei.com>
> > > > >
> > > > > HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
> > > > > migration. This feature is only supported in VHE mode.
> > > > >
> > > > > > Initially, S2 PTEs don't contain the DBM attribute. During migration,
> > > > > write faults are handled by user_mem_abort, which relaxes permissions
> > > > > and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
> > > > > writes no longer trap, as the hardware automatically transitions the page
> > > > > from writable-clean to writable-dirty.
> > > > >
> > > > > KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
> > > > > enabled, the hardware observes the clean->dirty transition and records
> > > > > the corresponding page into the HDBSS buffer.
> > > > >
> > > > > During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
> > > > > that check_vcpu_requests flushes the HDBSS buffer and propagates the
> > > > > accumulated dirty information into the userspace-visible dirty bitmap.
> > > > >
> > > > > Add fault handling for HDBSS including buffer full, external abort, and
> > > > > general protection fault (GPF).
> > > > >
> > > > > Signed-off-by: eillon <yezhenyu2@huawei.com>
> > > > > Signed-off-by: Tian Zheng <zhengtian10@huawei.com>
> > > > > ---
> > > > > arch/arm64/include/asm/esr.h | 5 ++
> > > > > arch/arm64/include/asm/kvm_host.h | 17 +++++
> > > > > arch/arm64/include/asm/kvm_mmu.h | 1 +
> > > > > arch/arm64/include/asm/sysreg.h | 11 ++++
> > > > > arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
> > > > > arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
> > > > > arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
> > > > > arch/arm64/kvm/reset.c | 3 +
> > > > > 8 files changed, 228 insertions(+)
> > > > >
> > > > > diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> > > > > index 81c17320a588..2e6b679b5908 100644
> > > > > --- a/arch/arm64/include/asm/esr.h
> > > > > +++ b/arch/arm64/include/asm/esr.h
> > > > > @@ -437,6 +437,11 @@
> > > > > #ifndef __ASSEMBLER__
> > > > > #include <asm/types.h>
> > > > >
> > > > > +static inline bool esr_iss2_is_hdbssf(unsigned long esr)
> > > > > +{
> > > > > + return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
> > > > > +}
> > > > > +
> > > > > static inline unsigned long esr_brk_comment(unsigned long esr)
> > > > > {
> > > > > return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
> > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > > > index 5d5a3bbdb95e..57ee6b53e061 100644
> > > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > > @@ -55,12 +55,17 @@
> > > > > #define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
> > > > > #define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
> > > > > #define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
> > > > > +#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)
> > > > >
> > > > > #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
> > > > > KVM_DIRTY_LOG_INITIALLY_SET)
> > > > >
> > > > > #define KVM_HAVE_MMU_RWLOCK
> > > > >
> > > > > +/* HDBSS entry field definitions */
> > > > > +#define HDBSS_ENTRY_VALID BIT(0)
> > > > > +#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
> > > > > +
> > > > > /*
> > > > > * Mode of operation configurable with kvm-arm.mode early param.
> > > > > * See Documentation/admin-guide/kernel-parameters.txt for more information.
> > > > > @@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
> > > > > u32 __attribute_const__ kvm_target_cpu(void);
> > > > > void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> > > > > void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
> > > > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);
> > > > >
> > > > > struct kvm_hyp_memcache {
> > > > > phys_addr_t head;
> > > > > @@ -405,6 +411,8 @@ struct kvm_arch {
> > > > > * the associated pKVM instance in the hypervisor.
> > > > > */
> > > > > struct kvm_protected_vm pkvm;
> > > > > +
> > > > > + bool enable_hdbss;
> > > > > };
> > > > >
> > > > > struct kvm_vcpu_fault_info {
> > > > > @@ -816,6 +824,12 @@ struct vcpu_reset_state {
> > > > > bool reset;
> > > > > };
> > > > >
> > > > > +struct vcpu_hdbss_state {
> > > > > + phys_addr_t base_phys;
> > > > > + u32 size;
> > > > > + u32 next_index;
> > > > > +};
> > > > > +
> > > > > struct vncr_tlb;
> > > > >
> > > > > struct kvm_vcpu_arch {
> > > > > @@ -920,6 +934,9 @@ struct kvm_vcpu_arch {
> > > > >
> > > > > /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
> > > > > struct vncr_tlb *vncr_tlb;
> > > > > +
> > > > > + /* HDBSS registers info */
> > > > > + struct vcpu_hdbss_state hdbss;
> > > > > };
> > > > >
> > > > > /*
> > > > > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > > > > index d968aca0461a..3fea8cfe8869 100644
> > > > > --- a/arch/arm64/include/asm/kvm_mmu.h
> > > > > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > > > @@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > > >
> > > > > int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > > > > int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
> > > > > +void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);
> > > > >
> > > > > phys_addr_t kvm_mmu_get_httbr(void);
> > > > > phys_addr_t kvm_get_idmap_vector(void);
> > > > > diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> > > > > index f4436ecc630c..d11f4d0dd4e7 100644
> > > > > --- a/arch/arm64/include/asm/sysreg.h
> > > > > +++ b/arch/arm64/include/asm/sysreg.h
> > > > > @@ -1039,6 +1039,17 @@
> > > > >
> > > > > #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
> > > > > GCS_CAP_VALID_TOKEN)
> > > > > +
> > > > > +/*
> > > > > + * Definitions for the HDBSS feature
> > > > > + */
> > > > > +#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
> > > > > +
> > > > > +#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
> > > > > + FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
> > > > > +
> > > > > +#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
> > > > > +
> > > > > /*
> > > > > * Definitions for GICv5 instructions]
> > > > > */
> > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > > index 29f0326f7e00..d64da05e25c4 100644
> > > > > --- a/arch/arm64/kvm/arm.c
> > > > > +++ b/arch/arm64/kvm/arm.c
> > > > > @@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> > > > > return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
> > > > > }
> > > > >
> > > > > +void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + struct page *hdbss_pg;
> > > > > +
> > > > > + hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
> > > > > + if (hdbss_pg)
> > > > > + __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
> > > > > +
> > > > > + vcpu->arch.hdbss.size = 0;
> > > > > +}
> > > > > +
> > > > > +static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
> > > > > + struct kvm_enable_cap *cap)
> > > > > +{
> > > > > + unsigned long i;
> > > > > + struct kvm_vcpu *vcpu;
> > > > > + struct page *hdbss_pg = NULL;
> > > > > + __u64 size = cap->args[0];
> > > > > + bool enable = cap->args[1] ? true : false;
> > > > > +
> > > > > + if (!system_supports_hdbss())
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (size > HDBSS_MAX_SIZE)
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
> > > > > + return 0;
> > > > > +
> > > > > + if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (!enable) { /* Turn it off */
> > > > > + kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
> > > > > +
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > > + /* Kick vcpus to flush hdbss buffer. */
> > > > > + kvm_vcpu_kick(vcpu);
> > > > > +
> > > > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > > > + }
> > > > > +
> > > > > + kvm->arch.enable_hdbss = false;
> > > > > +
> > > > > + return 0;
> > > > > + }
> > > > > +
> > > > > + /* Turn it on */
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > > + hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
> > > > > + if (!hdbss_pg)
> > > > > + goto error_alloc;
> > > > > +
> > > > > + vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
> > > > > + .base_phys = page_to_phys(hdbss_pg),
> > > > > + .size = size,
> > > > > + .next_index = 0,
> > > > > + };
> > > > > + }
> > > > > +
> > > > > + kvm->arch.enable_hdbss = true;
> > > > > + kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
> > > > > +
> > > > > + /*
> > > > > + * We should kick vcpus out of guest mode here to load new
> > > > > + * vtcr value to vtcr_el2 register when re-enter guest mode.
> > > > > + */
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm)
> > > > > + kvm_vcpu_kick(vcpu);
> > > > > +
> > > > > + return 0;
> > > > > +
> > > > > +error_alloc:
> > > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > > + if (vcpu->arch.hdbss.base_phys)
> > > > > + kvm_arm_vcpu_free_hdbss(vcpu);
> > > > > + }
> > > > > +
> > > > > + return -ENOMEM;
> > > > > +}
> > > > > +
> > > > > int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > > > struct kvm_enable_cap *cap)
> > > > > {
> > > > > @@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > > > r = 0;
> > > > > set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > > > > break;
> > > > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > > > + mutex_lock(&kvm->lock);
> > > > > + r = kvm_cap_arm_enable_hdbss(kvm, cap);
> > > > > + mutex_unlock(&kvm->lock);
> > > > > + break;
> > > > > default:
> > > > > break;
> > > > > }
> > > > > @@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > > > r = kvm_supports_cacheable_pfnmap();
> > > > > break;
> > > > >
> > > > > + case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
> > > > > + r = system_supports_hdbss();
> > > > > + break;
> > > > > default:
> > > > > r = 0;
> > > > > }
> > > > > @@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > > > if (kvm_dirty_ring_check_request(vcpu))
> > > > > return 0;
> > > > >
> > > > > + if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
> > > > > + kvm_flush_hdbss_buffer(vcpu);
> > > > > +
> > > > > check_nested_vcpu_requests(vcpu);
> > > > > }
> > > > >
> > > > > @@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
> > > > >
> > > > > void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
> > > > > {
> > > > > + /*
> > > > > + * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
> > > > > + * before reporting dirty_bitmap to userspace. Send a request with
> > > > > + * KVM_REQUEST_WAIT to flush buffer synchronously.
> > > > > + */
> > > > > + struct kvm_vcpu *vcpu;
> > > > > +
> > > > > + if (!kvm->arch.enable_hdbss)
> > > > > + return;
> > > > >
> > > > > + kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
> > > > > }
> > > > >
> > > > > static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> > > > > diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > index 9db3f11a4754..600cbc4f8ae9 100644
> > > > > --- a/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> > > > > @@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
> > > > > local_irq_restore(flags);
> > > > > }
> > > > >
> > > > > +static void __load_hdbss(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > + struct kvm *kvm = vcpu->kvm;
> > > > > + u64 br_el2, prod_el2;
> > > > > +
> > > > > + if (!kvm->arch.enable_hdbss)
> > > > > + return;
> > > > > +
> > > > > + br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
> > > > > + prod_el2 = vcpu->arch.hdbss.next_index;
> > > > > +
> > > > > + write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
> > > > > + write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
> > > > > +
> > > > > + isb();
> > > > > +}
> > > > > +
> > > > I see in the code below you trust that the tracking will happen with
> > > > PAGE_SIZE granularity (you track with PAGE_SHIFT).
> > > >
> > > > That may be a problem when we have guest memory backed by hugepages or
> > > > transparent huge pages.
> > > >
> > > > When we are using HDBSS, there is no fault happening, so we have no way of
> > > > doing on-demand block splitting, so we need to make use of eager block
> > > > splitting, _before_ we start to track anything, or else we may have
> > > > different-sized pages in the HDBSS buffer, which is harder to deal with.
> > > >
> > > > Suggestion: do the eager splitting before we enable HDBSS.
> > > >
> > > > For this to happen, we have to enable the EAGER_SPLIT_CHUNK_SIZE
> > > > capability, which can only be enabled when all memslots are empty.
> > > >
> > > > I suggest doing that at kvm_init_stage2_mmu(), and checking if HDBSS is
> > > > in which case we set mmu->split_page_chunk_size to PAGESIZE.
> > > >
> > > > I will send a patch you can put before this one to make sure it works :)
> > > >
> > > > Thanks!
> > > > Leo
> > > Hi Leo,
> > >
> > > > Thanks for the helpful suggestion. I had previously traced the
> > > > hugepage-splitting path during live migration and found that when
> > > > migration starts, enabling dirty logging triggers the splitting path.
> > > > I also tested HDBSS with traditional hugepages and haven't observed
> > > > any issues yet.
> > > >
> > > > However, your concern is valid: there may be cases not covered,
> > > > especially when the VMM uses transparent hugepages. I'll integrate
> > > > your patch into the next version and run some tests.
> > > >
> > > For reference, here's the path I traced:
> > >
> > > ```
> > >
> > > - userspace, e.g., QEMU
> > >
> > > kvm_log_start
> > > +-> kvm_section_update_flags
> > > +-> kvm_slot_update_flags
> > > |
> > > > | // For each memory region, QEMU issues a KVM_SET_USER_MEMORY_REGION ioctl.
> > > > | // Before issuing it, flags are updated to include KVM_MEM_LOG_DIRTY_PAGES.
> > > > +-> kvm_mem_flags
> > > > +-> kvm_set_user_memory_region // ioctl that enables dirty logging on the memslot
> > >
> > > - KVM
> > >
> > > KVM_SET_USER_MEMORY_REGION
> > > +-> kvm_vm_ioctl_set_memory_region
> > > +-> kvm_set_memory_region / __kvm_set_memory_region
> > > +-> kvm_set_memslot
> > > +-> kvm_commit_memory_region
> > > +-> kvm_arch_commit_memory_region
> > > +-> kvm_mmu_split_memory_region
> > > > // Splits Stage-2 hugepages/contiguous mappings into 4KB PTEs.
> > Right, except in the case where we have dirty_log_manual_protect and
> > init_set, where it returns before splitting pages:
> >
> > ```
> > if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> > return;
> > ```
> >
> > IIUC, that's desired to avoid holding the lock for a long time while it
> > cleans every page in the beginning, and instead do it on a per-dirty-page
> > basis. I guess it may benefit guests with very few dirty pages, as it
> > does not have to split/dirty everything at the start.
> > (It's a pain for my HACDBS routines, though)
> >
> > > +-> kvm_mmu_split_huge_pages
> > Other important point here:
> > You can see in this function it skips splitting if chunk_size == 0.
> > This value is set by a capability that configures EAGER_SPLIT, meaning
> > splitting before the guest has write faults, which is nice as the
> > write-fault is faster.
> >
> > Two points in this capability:
> > - It's optional, if it's not set, only on-demand splitting (on fault) will
> > happen, and since HDBSS removes the write-fault, we have no splitting
> > - It can be set to any valid block size, not only 4K, nor PAGE_SIZE, it can
> > be set to PMD_SIZE, PUD_SIZE, and so on, which will depend on the
> > PAGE_SIZE the kernel was compiled to.
> > That's only some points to keep in mind :)
> >
> > if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> > return;
> >
> > > +-> kvm_pgtable_stage2_split
> > >
> > > ```
> > >
> > > Thanks again for the detailed explanation and for sending the patch.
> > >
> > Thank you for the collaboration on this!
> > Leo
>
>
> Thanks for the detailed explanation, very helpful. My earlier tests missed
> cases like lazy splitting and manual-protect mode, and your patch addresses
> them perfectly.
>
> I'll adopt it in the next version and test the corner cases you mentioned.
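For reference, the block sizes mentioned in the eager-split discussion above
follow from the page size. A minimal standalone sketch, assuming 8-byte
descriptors so that each table level resolves PAGE_SHIFT - 3 bits (the function
name is hypothetical, this is not kernel code):

```c
#include <stdint.h>

/*
 * Size in bytes of a mapping that sits @levels_above_pte levels above the
 * last level, for a granule of 2^page_shift bytes with 8-byte descriptors:
 * 0 gives PAGE_SIZE, 1 the PMD-level block, 2 the PUD-level block.
 */
static uint64_t block_size(unsigned int page_shift,
			   unsigned int levels_above_pte)
{
	unsigned int bits_per_level = page_shift - 3;

	return UINT64_C(1) << (page_shift + bits_per_level * levels_above_pte);
}
```

With 4K pages this gives 2M and 1G blocks, while 16K and 64K pages give 32M and
512M PMD-level blocks, so a chunk size that is a valid block size on one
configuration is not necessarily one on another.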
Awesome, thanks!
Leo
^ permalink raw reply
* [PATCH v8 12/12] rv: Add nomiss deadline monitor
From: Gabriele Monaco @ 2026-03-30 11:10 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
Gabriele Monaco, Jonathan Corbet, Masami Hiramatsu,
linux-trace-kernel, linux-doc
Cc: Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>
Add the deadline monitors collection to validate the deadline scheduler,
both for deadline tasks and servers.
The currently implemented monitors are:
* nomiss:
validate dl entities run to completion before their deadline
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
Notes:
V8:
* Warn if kallsyms lookup fails
* Use u8 instead of uint8_t
* Drop throttle monitor, will submit separately
Documentation/trace/rv/index.rst | 1 +
Documentation/trace/rv/monitor_deadline.rst | 84 +++++
kernel/trace/rv/Kconfig | 4 +
kernel/trace/rv/Makefile | 2 +
kernel/trace/rv/monitors/deadline/Kconfig | 10 +
kernel/trace/rv/monitors/deadline/deadline.c | 44 +++
kernel/trace/rv/monitors/deadline/deadline.h | 202 ++++++++++++
kernel/trace/rv/monitors/nomiss/Kconfig | 15 +
kernel/trace/rv/monitors/nomiss/nomiss.c | 293 ++++++++++++++++++
kernel/trace/rv/monitors/nomiss/nomiss.h | 123 ++++++++
.../trace/rv/monitors/nomiss/nomiss_trace.h | 19 ++
kernel/trace/rv/rv_trace.h | 1 +
tools/verification/models/deadline/nomiss.dot | 41 +++
13 files changed, 839 insertions(+)
create mode 100644 Documentation/trace/rv/monitor_deadline.rst
create mode 100644 kernel/trace/rv/monitors/deadline/Kconfig
create mode 100644 kernel/trace/rv/monitors/deadline/deadline.c
create mode 100644 kernel/trace/rv/monitors/deadline/deadline.h
create mode 100644 kernel/trace/rv/monitors/nomiss/Kconfig
create mode 100644 kernel/trace/rv/monitors/nomiss/nomiss.c
create mode 100644 kernel/trace/rv/monitors/nomiss/nomiss.h
create mode 100644 kernel/trace/rv/monitors/nomiss/nomiss_trace.h
create mode 100644 tools/verification/models/deadline/nomiss.dot
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index bf9962f49959..29769f06bb0f 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -17,3 +17,4 @@ Runtime Verification
monitor_sched.rst
monitor_rtapp.rst
monitor_stall.rst
+ monitor_deadline.rst
diff --git a/Documentation/trace/rv/monitor_deadline.rst b/Documentation/trace/rv/monitor_deadline.rst
new file mode 100644
index 000000000000..84506ed1e293
--- /dev/null
+++ b/Documentation/trace/rv/monitor_deadline.rst
@@ -0,0 +1,84 @@
+Deadline monitors
+=================
+
+- Name: deadline
+- Type: container for multiple monitors
+- Author: Gabriele Monaco <gmonaco@redhat.com>
+
+Description
+-----------
+
+The deadline monitor is a set of specifications to describe the deadline
+scheduler behaviour. It includes monitors per scheduling entity (deadline tasks
+and servers) that work independently to verify different specifications the
+deadline scheduler should follow.
+
+Specifications
+--------------
+
+Monitor nomiss
+~~~~~~~~~~~~~~
+
+The nomiss monitor ensures dl entities get to run *and* run to completion
+before their deadline, although deferrable servers may not run. An entity is
+considered done if ``throttled``, either because it yielded or used up its
+runtime, or when it voluntarily starts ``sleeping``.
+The monitor includes a user configurable deadline threshold. If the total
+utilisation of deadline tasks is larger than 1, they are only guaranteed
+bounded tardiness. See Documentation/scheduler/sched-deadline.rst for more
+details. The threshold (module parameter ``nomiss.deadline_thresh``) can be
+configured to keep the monitor from failing within the tardiness acceptable in
+the system. Since ``dl_throttle`` is a valid outcome for the entity to be done,
+the minimum tardiness needs to be 1 tick to account for the throttle delay,
+unless the ``HRTICK_DL`` scheduler feature is active.
+
+Servers also have an intermediate ``idle`` state, entered from ready or running
+as soon as no runnable task is available; no timing constraint is applied
+there. A server goes to sleep by stopping. There is no wakeup equivalent, as
+the order of a server starting and replenishing is not defined, hence a
+server can run from sleeping without being ready::
+
+ |
+ sched_wakeup v
+ dl_replenish;reset(clk) -- #=========================#
+ | H H dl_replenish;reset(clk)
+ +-----------> H H <--------------------+
+ H H |
+ +- dl_server_stop ---- H ready H |
+ | +-----------------> H clk < DEADLINE_NS() H dl_throttle; |
+ | | H H is_defer == 1 |
+ | | sched_switch_in - H H -----------------+ |
+ | | | #=========================# | |
+ | | | | ^ | |
+ | | | dl_server_idle dl_replenish;reset(clk) | |
+ | | | v | | |
+ | | | +--------------+ | |
+ | | | +------ | | | |
+ | | | dl_server_idle | | dl_throttle | |
+ | | | | | idle | -----------------+ | |
+ | | | +-----> | | | | |
+ | | | | | | | |
+ | | | | | | | |
+ +--+--+---+--- dl_server_stop -- +--------------+ | | |
+ | | | | | ^ | | |
+ | | | | sched_switch_in dl_server_idle | | |
+ | | | | v | | | |
+ | | | | +---------- +---------------------+ | | |
+ | | | | sched_switch_in | | | | |
+ | | | | sched_wakeup | | | | |
+ | | | | dl_replenish; | running | -------+ | | |
+ | | | | reset(clk) | clk < DEADLINE_NS() | | | | |
+ | | | | +---------> | | dl_throttle | | |
+ | | | +----------------> | | | | | |
+ | | | +---------------------+ | | | |
+ | | sched_wakeup ^ sched_switch_suspend | | | |
+ v v dl_replenish;reset(clk) | dl_server_stop | | | |
+ +--------------+ | | v v v |
+ | | - sched_switch_in + | +---------------+
+ | | <---------------------+ dl_throttle +-- | |
+ | sleeping | sched_wakeup | | throttled |
+ | | -- dl_server_stop dl_server_idle +-> | |
+ | | dl_server_idle sched_switch_suspend +---------------+
+ +--------------+ <---------+ ^
+ | |
+ +------ dl_throttle;is_constr_dl == 1 || is_defer == 1 ------+
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index 720fbe4935f8..3884b14df375 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -79,6 +79,10 @@ source "kernel/trace/rv/monitors/sleep/Kconfig"
# Add new rtapp monitors here
source "kernel/trace/rv/monitors/stall/Kconfig"
+source "kernel/trace/rv/monitors/deadline/Kconfig"
+source "kernel/trace/rv/monitors/nomiss/Kconfig"
+# Add new deadline monitors here
+
# Add new monitors here
config RV_REACTORS
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index 51c95e2d2da6..94498da35b37 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -18,6 +18,8 @@ obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
obj-$(CONFIG_RV_MON_STALL) += monitors/stall/stall.o
+obj-$(CONFIG_RV_MON_DEADLINE) += monitors/deadline/deadline.o
+obj-$(CONFIG_RV_MON_NOMISS) += monitors/nomiss/nomiss.o
# Add new monitors here
obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
diff --git a/kernel/trace/rv/monitors/deadline/Kconfig b/kernel/trace/rv/monitors/deadline/Kconfig
new file mode 100644
index 000000000000..38804a6ad91d
--- /dev/null
+++ b/kernel/trace/rv/monitors/deadline/Kconfig
@@ -0,0 +1,10 @@
+config RV_MON_DEADLINE
+ depends on RV
+ bool "deadline monitor"
+ help
+ Collection of monitors to check the deadline scheduler and server
+ behave according to specifications. Enable this to enable all
+ scheduler specifications supported by the current kernel.
+
+ For further information, see:
+ Documentation/trace/rv/monitor_deadline.rst
diff --git a/kernel/trace/rv/monitors/deadline/deadline.c b/kernel/trace/rv/monitors/deadline/deadline.c
new file mode 100644
index 000000000000..d566d4542ebf
--- /dev/null
+++ b/kernel/trace/rv/monitors/deadline/deadline.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/rv.h>
+#include <linux/kallsyms.h>
+
+#define MODULE_NAME "deadline"
+
+#include "deadline.h"
+
+struct rv_monitor rv_deadline = {
+ .name = "deadline",
+ .description = "container for several deadline scheduler specifications.",
+ .enable = NULL,
+ .disable = NULL,
+ .reset = NULL,
+ .enabled = 0,
+};
+
+/* Used by other monitors */
+struct sched_class *rv_ext_sched_class;
+
+static int __init register_deadline(void)
+{
+ if (IS_ENABLED(CONFIG_SCHED_CLASS_EXT)) {
+ rv_ext_sched_class = (void *)kallsyms_lookup_name("ext_sched_class");
+ if (!rv_ext_sched_class)
+ pr_warn("rv: Missing ext_sched_class, monitors may not work.\n");
+ }
+ return rv_register_monitor(&rv_deadline, NULL);
+}
+
+static void __exit unregister_deadline(void)
+{
+ rv_unregister_monitor(&rv_deadline);
+}
+
+module_init(register_deadline);
+module_exit(unregister_deadline);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Gabriele Monaco <gmonaco@redhat.com>");
+MODULE_DESCRIPTION("deadline: container for several deadline scheduler specifications.");
diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
new file mode 100644
index 000000000000..0bbfd2543329
--- /dev/null
+++ b/kernel/trace/rv/monitors/deadline/deadline.h
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/kernel.h>
+#include <linux/uaccess.h>
+#include <linux/sched/deadline.h>
+#include <asm/syscall.h>
+#include <uapi/linux/sched/types.h>
+#include <trace/events/sched.h>
+
+/*
+ * Dummy values if not available
+ */
+#ifndef __NR_sched_setscheduler
+#define __NR_sched_setscheduler -__COUNTER__
+#endif
+#ifndef __NR_sched_setattr
+#define __NR_sched_setattr -__COUNTER__
+#endif
+
+extern struct rv_monitor rv_deadline;
+/* Initialised when registering the deadline container */
+extern struct sched_class *rv_ext_sched_class;
+
+/*
+ * If both have dummy values, the syscalls are not supported and we don't even
+ * need to register the handler.
+ */
+static inline bool should_skip_syscall_handle(void)
+{
+ return __NR_sched_setattr < 0 && __NR_sched_setscheduler < 0;
+}
+
+/*
+ * is_supported_type - return true if @type is supported by the deadline monitors
+ */
+static inline bool is_supported_type(u8 type)
+{
+ return type == DL_TASK || type == DL_SERVER_FAIR || type == DL_SERVER_EXT;
+}
+
+/*
+ * is_server_type - return true if @type is a supported server
+ */
+static inline bool is_server_type(u8 type)
+{
+ return is_supported_type(type) && type != DL_TASK;
+}
+
+/*
+ * Use negative numbers for the server.
+ * Currently only one fair server per CPU, may change in the future.
+ */
+#define fair_server_id(cpu) (-cpu)
+#define ext_server_id(cpu) (-cpu - num_possible_cpus())
+#define NO_SERVER_ID (-2 * num_possible_cpus())
+/*
+ * Get a unique id used for dl entities
+ *
+ * The cpu is not required for tasks as the pid is used there, if this function
+ * is called on a dl_se that for sure corresponds to a task, DL_TASK can be
+ * used in place of cpu.
+ * We need the cpu for servers as it is provided in the tracepoint and we
+ * cannot easily retrieve it from the dl_se (requires the struct rq definition).
+ */
+static inline int get_entity_id(struct sched_dl_entity *dl_se, int cpu, u8 type)
+{
+ if (dl_server(dl_se) && type != DL_TASK) {
+ if (type == DL_SERVER_FAIR)
+ return fair_server_id(cpu);
+ if (type == DL_SERVER_EXT)
+ return ext_server_id(cpu);
+ return NO_SERVER_ID;
+ }
+ return dl_task_of(dl_se)->pid;
+}
+
+static inline bool task_is_scx_enabled(struct task_struct *tsk)
+{
+ return IS_ENABLED(CONFIG_SCHED_CLASS_EXT) &&
+ tsk->sched_class == rv_ext_sched_class;
+}
+
+/* Expand id and target as arguments for da functions */
+#define EXPAND_ID(dl_se, cpu, type) get_entity_id(dl_se, cpu, type), dl_se
+#define EXPAND_ID_TASK(tsk) get_entity_id(&tsk->dl, task_cpu(tsk), DL_TASK), &tsk->dl
+
+static inline u8 get_server_type(struct task_struct *tsk)
+{
+ if (tsk->policy == SCHED_NORMAL || tsk->policy == SCHED_EXT ||
+ tsk->policy == SCHED_BATCH || tsk->policy == SCHED_IDLE)
+ return task_is_scx_enabled(tsk) ? DL_SERVER_EXT : DL_SERVER_FAIR;
+ return DL_OTHER;
+}
+
+static inline int extract_params(struct pt_regs *regs, long id, pid_t *pid_out)
+{
+ size_t size = offsetofend(struct sched_attr, sched_flags);
+ struct sched_attr __user *uattr, attr;
+ int new_policy = -1, ret;
+ unsigned long args[6];
+
+ switch (id) {
+ case __NR_sched_setscheduler:
+ syscall_get_arguments(current, regs, args);
+ *pid_out = args[0];
+ new_policy = args[1];
+ break;
+ case __NR_sched_setattr:
+ syscall_get_arguments(current, regs, args);
+ *pid_out = args[0];
+ uattr = (struct sched_attr __user *)args[1];
+ /*
+ * Just copy up to sched_flags, we are not interested after that
+ */
+ ret = copy_struct_from_user(&attr, size, uattr, size);
+ if (ret)
+ return ret;
+ if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
+ return -EINVAL;
+ new_policy = attr.sched_policy;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return new_policy & ~SCHED_RESET_ON_FORK;
+}
+
+/* Helper functions requiring DA/HA utilities */
+#ifdef RV_MON_TYPE
+
+/*
+ * get_server - get the server of type @type associated to a task
+ *
+ * If the task is a boosted task, the server is available in the task_struct,
+ * otherwise grab the dl entity saved for the CPU where the task is enqueued.
+ * This function assumes the task is enqueued somewhere.
+ */
+static inline struct sched_dl_entity *get_server(struct task_struct *tsk, u8 type)
+{
+ if (tsk->dl_server && get_server_type(tsk) == type)
+ return tsk->dl_server;
+ if (type == DL_SERVER_FAIR)
+ return da_get_target_by_id(fair_server_id(task_cpu(tsk)));
+ if (type == DL_SERVER_EXT)
+ return da_get_target_by_id(ext_server_id(task_cpu(tsk)));
+ return NULL;
+}
+
+/*
+ * Initialise monitors for all tasks and pre-allocate the storage for servers.
+ * This is necessary since we don't have access to the servers here and
+ * allocation can cause deadlocks from their tracepoints. We can only fill
+ * pre-initialised storage from there.
+ */
+static inline int init_storage(bool skip_tasks)
+{
+ struct task_struct *g, *p;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ if (!da_create_empty_storage(fair_server_id(cpu)))
+ goto fail;
+ if (IS_ENABLED(CONFIG_SCHED_CLASS_EXT) &&
+ !da_create_empty_storage(ext_server_id(cpu)))
+ goto fail;
+ }
+
+ if (skip_tasks)
+ return 0;
+
+ read_lock(&tasklist_lock);
+ for_each_process_thread(g, p) {
+ if (p->policy == SCHED_DEADLINE) {
+ if (!da_create_storage(EXPAND_ID_TASK(p), NULL)) {
+ read_unlock(&tasklist_lock);
+ goto fail;
+ }
+ }
+ }
+ read_unlock(&tasklist_lock);
+ return 0;
+
+fail:
+ da_monitor_destroy();
+ return -ENOMEM;
+}
+
+static void __maybe_unused handle_newtask(void *data, struct task_struct *task, u64 flags)
+{
+ /* Might be superfluous as tasks are not started with this policy. */
+ if (task->policy == SCHED_DEADLINE)
+ da_create_storage(EXPAND_ID_TASK(task), NULL);
+}
+
+static void __maybe_unused handle_exit(void *data, struct task_struct *p, bool group_dead)
+{
+ if (p->policy == SCHED_DEADLINE)
+ da_destroy_storage(get_entity_id(&p->dl, DL_TASK, DL_TASK));
+}
+
+#endif
diff --git a/kernel/trace/rv/monitors/nomiss/Kconfig b/kernel/trace/rv/monitors/nomiss/Kconfig
new file mode 100644
index 000000000000..e1886c3a0dd9
--- /dev/null
+++ b/kernel/trace/rv/monitors/nomiss/Kconfig
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+config RV_MON_NOMISS
+ depends on RV
+ depends on HAVE_SYSCALL_TRACEPOINTS
+ depends on RV_MON_DEADLINE
+ default y
+ select HA_MON_EVENTS_ID
+ bool "nomiss monitor"
+ help
+ Monitor to ensure dl entities run to completion before their deadline.
+ This monitor is part of the deadline monitors collection.
+
+ For further information, see:
+ Documentation/trace/rv/monitor_deadline.rst
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
new file mode 100644
index 000000000000..31f90f3638d8
--- /dev/null
+++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
@@ -0,0 +1,293 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ftrace.h>
+#include <linux/tracepoint.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/rv.h>
+#include <rv/instrumentation.h>
+
+#define MODULE_NAME "nomiss"
+
+#include <uapi/linux/sched/types.h>
+#include <trace/events/syscalls.h>
+#include <trace/events/sched.h>
+#include <trace/events/task.h>
+#include <rv_trace.h>
+
+#define RV_MON_TYPE RV_MON_PER_OBJ
+#define HA_TIMER_TYPE HA_TIMER_WHEEL
+/* The start condition is on sched_switch, it's dangerous to allocate there */
+#define DA_SKIP_AUTO_ALLOC
+typedef struct sched_dl_entity *monitor_target;
+#include "nomiss.h"
+#include <rv/ha_monitor.h>
+#include <monitors/deadline/deadline.h>
+
+/*
+ * User configurable deadline threshold. If the total utilisation of deadline
+ * tasks is larger than 1, they are only guaranteed bounded tardiness. See
+ * Documentation/scheduler/sched-deadline.rst for more details.
+ * The minimum tardiness without sched_feat(HRTICK_DL) is 1 tick to accommodate
+ * the throttle being enforced on the next tick.
+ */
+static u64 deadline_thresh = TICK_NSEC;
+module_param(deadline_thresh, ullong, 0644);
+#define DEADLINE_NS(ha_mon) (ha_get_target(ha_mon)->dl_deadline + deadline_thresh)
+
+static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_nomiss env, u64 time_ns)
+{
+ if (env == clk_nomiss)
+ return ha_get_clk_ns(ha_mon, env, time_ns);
+ else if (env == is_constr_dl_nomiss)
+ return !dl_is_implicit(ha_get_target(ha_mon));
+ else if (env == is_defer_nomiss)
+ return ha_get_target(ha_mon)->dl_defer;
+ return ENV_INVALID_VALUE;
+}
+
+static void ha_reset_env(struct ha_monitor *ha_mon, enum envs_nomiss env, u64 time_ns)
+{
+ if (env == clk_nomiss)
+ ha_reset_clk_ns(ha_mon, env, time_ns);
+}
+
+static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (curr_state == ready_nomiss)
+ return ha_check_invariant_ns(ha_mon, clk_nomiss, time_ns);
+ else if (curr_state == running_nomiss)
+ return ha_check_invariant_ns(ha_mon, clk_nomiss, time_ns);
+ return true;
+}
+
+static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (curr_state == next_state)
+ return;
+ if (curr_state == ready_nomiss)
+ ha_inv_to_guard(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
+ else if (curr_state == running_nomiss)
+ ha_inv_to_guard(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
+}
+
+static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ bool res = true;
+
+ if (curr_state == ready_nomiss && event == dl_replenish_nomiss)
+ ha_reset_env(ha_mon, clk_nomiss, time_ns);
+ else if (curr_state == ready_nomiss && event == dl_throttle_nomiss)
+ res = ha_get_env(ha_mon, is_defer_nomiss, time_ns) == 1ull;
+ else if (curr_state == idle_nomiss && event == dl_replenish_nomiss)
+ ha_reset_env(ha_mon, clk_nomiss, time_ns);
+ else if (curr_state == running_nomiss && event == dl_replenish_nomiss)
+ ha_reset_env(ha_mon, clk_nomiss, time_ns);
+ else if (curr_state == sleeping_nomiss && event == dl_replenish_nomiss)
+ ha_reset_env(ha_mon, clk_nomiss, time_ns);
+ else if (curr_state == sleeping_nomiss && event == dl_throttle_nomiss)
+ res = ha_get_env(ha_mon, is_constr_dl_nomiss, time_ns) == 1ull ||
+ ha_get_env(ha_mon, is_defer_nomiss, time_ns) == 1ull;
+ else if (curr_state == throttled_nomiss && event == dl_replenish_nomiss)
+ ha_reset_env(ha_mon, clk_nomiss, time_ns);
+ return res;
+}
+
+static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (next_state == curr_state && event != dl_replenish_nomiss)
+ return;
+ if (next_state == ready_nomiss)
+ ha_start_timer_ns(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
+ else if (next_state == running_nomiss)
+ ha_start_timer_ns(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
+ else if (curr_state == ready_nomiss)
+ ha_cancel_timer(ha_mon);
+ else if (curr_state == running_nomiss)
+ ha_cancel_timer(ha_mon);
+}
+
+static bool ha_verify_constraint(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (!ha_verify_invariants(ha_mon, curr_state, event, next_state, time_ns))
+ return false;
+
+ ha_convert_inv_guard(ha_mon, curr_state, event, next_state, time_ns);
+
+ if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
+ return false;
+
+ ha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);
+
+ return true;
+}
+
+static void handle_dl_replenish(void *data, struct sched_dl_entity *dl_se,
+ int cpu, u8 type)
+{
+ if (is_supported_type(type))
+ da_handle_event(EXPAND_ID(dl_se, cpu, type), dl_replenish_nomiss);
+}
+
+static void handle_dl_throttle(void *data, struct sched_dl_entity *dl_se,
+ int cpu, u8 type)
+{
+ if (is_supported_type(type))
+ da_handle_event(EXPAND_ID(dl_se, cpu, type), dl_throttle_nomiss);
+}
+
+static void handle_dl_server_stop(void *data, struct sched_dl_entity *dl_se,
+ int cpu, u8 type)
+{
+ /*
+ * This isn't the standard use of da_handle_start_run_event since this
+ * event does not occur only from the initial state.
+ * It is fine to use here because it always leads to a known state and
+ * the fact we "pretend" the transition starts from the initial state
+ * has no side effect.
+ */
+ if (is_supported_type(type))
+ da_handle_start_run_event(EXPAND_ID(dl_se, cpu, type), dl_server_stop_nomiss);
+}
+
+static inline void handle_server_switch(struct task_struct *next, int cpu, u8 type)
+{
+ struct sched_dl_entity *dl_se = get_server(next, type);
+
+ if (dl_se && is_idle_task(next))
+ da_handle_event(EXPAND_ID(dl_se, cpu, type), dl_server_idle_nomiss);
+}
+
+static void handle_sched_switch(void *data, bool preempt,
+ struct task_struct *prev,
+ struct task_struct *next,
+ unsigned int prev_state)
+{
+ int cpu = task_cpu(next);
+
+ if (prev_state != TASK_RUNNING && !preempt && prev->policy == SCHED_DEADLINE)
+ da_handle_event(EXPAND_ID_TASK(prev), sched_switch_suspend_nomiss);
+ if (next->policy == SCHED_DEADLINE)
+ da_handle_start_run_event(EXPAND_ID_TASK(next), sched_switch_in_nomiss);
+
+ /*
+ * The server is available in next only if the next task is boosted,
+ * otherwise we need to retrieve it.
+ * Here the server continues in the state running/armed until actually
+ * stopped, this works since we continue expecting a throttle.
+ */
+ if (next->dl_server)
+ da_handle_start_event(EXPAND_ID(next->dl_server, cpu,
+ get_server_type(next)),
+ sched_switch_in_nomiss);
+ else {
+ handle_server_switch(next, cpu, DL_SERVER_FAIR);
+ if (IS_ENABLED(CONFIG_SCHED_CLASS_EXT))
+ handle_server_switch(next, cpu, DL_SERVER_EXT);
+ }
+}
+
+static void handle_sys_enter(void *data, struct pt_regs *regs, long id)
+{
+ struct task_struct *p;
+ int new_policy = -1;
+ pid_t pid = 0;
+
+ new_policy = extract_params(regs, id, &pid);
+ if (new_policy < 0)
+ return;
+ guard(rcu)();
+ p = pid ? find_task_by_vpid(pid) : current;
+ if (unlikely(!p) || new_policy == p->policy)
+ return;
+
+ if (p->policy == SCHED_DEADLINE)
+ da_reset(EXPAND_ID_TASK(p));
+ else if (new_policy == SCHED_DEADLINE)
+ da_create_or_get(EXPAND_ID_TASK(p));
+}
+
+static void handle_sched_wakeup(void *data, struct task_struct *tsk)
+{
+ if (tsk->policy == SCHED_DEADLINE)
+ da_handle_event(EXPAND_ID_TASK(tsk), sched_wakeup_nomiss);
+}
+
+static int enable_nomiss(void)
+{
+ int retval;
+
+ retval = da_monitor_init();
+ if (retval)
+ return retval;
+
+ retval = init_storage(false);
+ if (retval)
+ return retval;
+ rv_attach_trace_probe("nomiss", sched_dl_replenish_tp, handle_dl_replenish);
+ rv_attach_trace_probe("nomiss", sched_dl_throttle_tp, handle_dl_throttle);
+ rv_attach_trace_probe("nomiss", sched_dl_server_stop_tp, handle_dl_server_stop);
+ rv_attach_trace_probe("nomiss", sched_switch, handle_sched_switch);
+ rv_attach_trace_probe("nomiss", sched_wakeup, handle_sched_wakeup);
+ if (!should_skip_syscall_handle())
+ rv_attach_trace_probe("nomiss", sys_enter, handle_sys_enter);
+ rv_attach_trace_probe("nomiss", task_newtask, handle_newtask);
+ rv_attach_trace_probe("nomiss", sched_process_exit, handle_exit);
+
+ return 0;
+}
+
+static void disable_nomiss(void)
+{
+ rv_this.enabled = 0;
+
+ /* Those are RCU writers, detach earlier hoping to close a bit faster */
+ rv_detach_trace_probe("nomiss", task_newtask, handle_newtask);
+ rv_detach_trace_probe("nomiss", sched_process_exit, handle_exit);
+ if (!should_skip_syscall_handle())
+ rv_detach_trace_probe("nomiss", sys_enter, handle_sys_enter);
+
+ rv_detach_trace_probe("nomiss", sched_dl_replenish_tp, handle_dl_replenish);
+ rv_detach_trace_probe("nomiss", sched_dl_throttle_tp, handle_dl_throttle);
+ rv_detach_trace_probe("nomiss", sched_dl_server_stop_tp, handle_dl_server_stop);
+ rv_detach_trace_probe("nomiss", sched_switch, handle_sched_switch);
+ rv_detach_trace_probe("nomiss", sched_wakeup, handle_sched_wakeup);
+
+ da_monitor_destroy();
+}
+
+static struct rv_monitor rv_this = {
+ .name = "nomiss",
+ .description = "dl entities run to completion before their deadline.",
+ .enable = enable_nomiss,
+ .disable = disable_nomiss,
+ .reset = da_monitor_reset_all,
+ .enabled = 0,
+};
+
+static int __init register_nomiss(void)
+{
+ return rv_register_monitor(&rv_this, &rv_deadline);
+}
+
+static void __exit unregister_nomiss(void)
+{
+ rv_unregister_monitor(&rv_this);
+}
+
+module_init(register_nomiss);
+module_exit(unregister_nomiss);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Gabriele Monaco <gmonaco@redhat.com>");
+MODULE_DESCRIPTION("nomiss: dl entities run to completion before their deadline.");
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.h b/kernel/trace/rv/monitors/nomiss/nomiss.h
new file mode 100644
index 000000000000..3d1b436194d7
--- /dev/null
+++ b/kernel/trace/rv/monitors/nomiss/nomiss.h
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Automatically generated C representation of nomiss automaton
+ * For further information about this format, see kernel documentation:
+ * Documentation/trace/rv/deterministic_automata.rst
+ */
+
+#define MONITOR_NAME nomiss
+
+enum states_nomiss {
+ ready_nomiss,
+ idle_nomiss,
+ running_nomiss,
+ sleeping_nomiss,
+ throttled_nomiss,
+ state_max_nomiss,
+};
+
+#define INVALID_STATE state_max_nomiss
+
+enum events_nomiss {
+ dl_replenish_nomiss,
+ dl_server_idle_nomiss,
+ dl_server_stop_nomiss,
+ dl_throttle_nomiss,
+ sched_switch_in_nomiss,
+ sched_switch_suspend_nomiss,
+ sched_wakeup_nomiss,
+ event_max_nomiss,
+};
+
+enum envs_nomiss {
+ clk_nomiss,
+ is_constr_dl_nomiss,
+ is_defer_nomiss,
+ env_max_nomiss,
+ env_max_stored_nomiss = is_constr_dl_nomiss,
+};
+
+_Static_assert(env_max_stored_nomiss <= MAX_HA_ENV_LEN, "Not enough slots");
+#define HA_CLK_NS
+
+struct automaton_nomiss {
+ char *state_names[state_max_nomiss];
+ char *event_names[event_max_nomiss];
+ char *env_names[env_max_nomiss];
+ unsigned char function[state_max_nomiss][event_max_nomiss];
+ unsigned char initial_state;
+ bool final_states[state_max_nomiss];
+};
+
+static const struct automaton_nomiss automaton_nomiss = {
+ .state_names = {
+ "ready",
+ "idle",
+ "running",
+ "sleeping",
+ "throttled",
+ },
+ .event_names = {
+ "dl_replenish",
+ "dl_server_idle",
+ "dl_server_stop",
+ "dl_throttle",
+ "sched_switch_in",
+ "sched_switch_suspend",
+ "sched_wakeup",
+ },
+ .env_names = {
+ "clk",
+ "is_constr_dl",
+ "is_defer",
+ },
+ .function = {
+ {
+ ready_nomiss,
+ idle_nomiss,
+ sleeping_nomiss,
+ throttled_nomiss,
+ running_nomiss,
+ INVALID_STATE,
+ ready_nomiss,
+ },
+ {
+ ready_nomiss,
+ idle_nomiss,
+ sleeping_nomiss,
+ throttled_nomiss,
+ running_nomiss,
+ INVALID_STATE,
+ INVALID_STATE,
+ },
+ {
+ running_nomiss,
+ idle_nomiss,
+ sleeping_nomiss,
+ throttled_nomiss,
+ running_nomiss,
+ sleeping_nomiss,
+ running_nomiss,
+ },
+ {
+ ready_nomiss,
+ sleeping_nomiss,
+ sleeping_nomiss,
+ throttled_nomiss,
+ running_nomiss,
+ INVALID_STATE,
+ ready_nomiss,
+ },
+ {
+ ready_nomiss,
+ throttled_nomiss,
+ INVALID_STATE,
+ throttled_nomiss,
+ INVALID_STATE,
+ throttled_nomiss,
+ throttled_nomiss,
+ },
+ },
+ .initial_state = ready_nomiss,
+ .final_states = { 1, 0, 0, 0, 0 },
+};
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss_trace.h b/kernel/trace/rv/monitors/nomiss/nomiss_trace.h
new file mode 100644
index 000000000000..42e7efaca4e7
--- /dev/null
+++ b/kernel/trace/rv/monitors/nomiss/nomiss_trace.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Snippet to be included in rv_trace.h
+ */
+
+#ifdef CONFIG_RV_MON_NOMISS
+DEFINE_EVENT(event_da_monitor_id, event_nomiss,
+ TP_PROTO(int id, char *state, char *event, char *next_state, bool final_state),
+ TP_ARGS(id, state, event, next_state, final_state));
+
+DEFINE_EVENT(error_da_monitor_id, error_nomiss,
+ TP_PROTO(int id, char *state, char *event),
+ TP_ARGS(id, state, event));
+
+DEFINE_EVENT(error_env_da_monitor_id, error_env_nomiss,
+ TP_PROTO(int id, char *state, char *event, char *env),
+ TP_ARGS(id, state, event, env));
+#endif /* CONFIG_RV_MON_NOMISS */
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 9e8072d863a2..9622c269789c 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -188,6 +188,7 @@ DECLARE_EVENT_CLASS(error_env_da_monitor_id,
);
#include <monitors/stall/stall_trace.h>
+#include <monitors/nomiss/nomiss_trace.h>
// Add new monitors based on CONFIG_HA_MON_EVENTS_ID here
#endif
diff --git a/tools/verification/models/deadline/nomiss.dot b/tools/verification/models/deadline/nomiss.dot
new file mode 100644
index 000000000000..fd1ea6bf2509
--- /dev/null
+++ b/tools/verification/models/deadline/nomiss.dot
@@ -0,0 +1,41 @@
+digraph state_automaton {
+ center = true;
+ size = "7,11";
+ {node [shape = circle] "idle"};
+ {node [shape = plaintext, style=invis, label=""] "__init_ready"};
+ {node [shape = doublecircle] "ready"};
+ {node [shape = circle] "ready"};
+ {node [shape = circle] "running"};
+ {node [shape = circle] "sleeping"};
+ {node [shape = circle] "throttled"};
+ "__init_ready" -> "ready";
+ "idle" [label = "idle"];
+ "idle" -> "idle" [ label = "dl_server_idle" ];
+ "idle" -> "ready" [ label = "dl_replenish;reset(clk)" ];
+ "idle" -> "running" [ label = "sched_switch_in" ];
+ "idle" -> "sleeping" [ label = "dl_server_stop" ];
+ "idle" -> "throttled" [ label = "dl_throttle" ];
+ "ready" [label = "ready\nclk < DEADLINE_NS()", color = green3];
+ "ready" -> "idle" [ label = "dl_server_idle" ];
+ "ready" -> "ready" [ label = "sched_wakeup\ndl_replenish;reset(clk)" ];
+ "ready" -> "running" [ label = "sched_switch_in" ];
+ "ready" -> "sleeping" [ label = "dl_server_stop" ];
+ "ready" -> "throttled" [ label = "dl_throttle;is_defer == 1" ];
+ "running" [label = "running\nclk < DEADLINE_NS()"];
+ "running" -> "idle" [ label = "dl_server_idle" ];
+ "running" -> "running" [ label = "dl_replenish;reset(clk)\nsched_switch_in\nsched_wakeup" ];
+ "running" -> "sleeping" [ label = "sched_switch_suspend\ndl_server_stop" ];
+ "running" -> "throttled" [ label = "dl_throttle" ];
+ "sleeping" [label = "sleeping"];
+ "sleeping" -> "ready" [ label = "sched_wakeup\ndl_replenish;reset(clk)" ];
+ "sleeping" -> "running" [ label = "sched_switch_in" ];
+ "sleeping" -> "sleeping" [ label = "dl_server_stop\ndl_server_idle" ];
+ "sleeping" -> "throttled" [ label = "dl_throttle;is_constr_dl == 1 || is_defer == 1" ];
+ "throttled" [label = "throttled"];
+ "throttled" -> "ready" [ label = "dl_replenish;reset(clk)" ];
+ "throttled" -> "throttled" [ label = "sched_switch_suspend\nsched_wakeup\ndl_server_idle\ndl_throttle" ];
+ { rank = min ;
+ "__init_ready";
+ "ready";
+ }
+}
--
2.53.0
^ permalink raw reply related
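
The transition table in nomiss.h from the patch above, together with the
``clk < DEADLINE_NS()`` invariant of the ready and running states, can be
sketched as a small user-space state machine. This is a minimal sketch under
stated assumptions, not the kernel implementation: struct sketch_mon and
sketch_step() are hypothetical names, the is_defer and is_constr_dl guards are
left out, and the invariant is checked only when an event arrives, whereas the
patch also arms a timer.

```c
#include <stdbool.h>
#include <stdint.h>

enum state { READY, IDLE, RUNNING, SLEEPING, THROTTLED, STATE_MAX };
enum event { DL_REPLENISH, DL_SERVER_IDLE, DL_SERVER_STOP, DL_THROTTLE,
	     SCHED_SWITCH_IN, SCHED_SWITCH_SUSPEND, SCHED_WAKEUP, EVENT_MAX };

#define INVALID STATE_MAX

/* Same rows (states) and columns (events) as the function table in nomiss.h. */
static const unsigned char next_state[STATE_MAX][EVENT_MAX] = {
	[READY]     = { READY, IDLE, SLEEPING, THROTTLED, RUNNING, INVALID, READY },
	[IDLE]      = { READY, IDLE, SLEEPING, THROTTLED, RUNNING, INVALID, INVALID },
	[RUNNING]   = { RUNNING, IDLE, SLEEPING, THROTTLED, RUNNING, SLEEPING, RUNNING },
	[SLEEPING]  = { READY, SLEEPING, SLEEPING, THROTTLED, RUNNING, INVALID, READY },
	[THROTTLED] = { READY, THROTTLED, INVALID, THROTTLED, INVALID, THROTTLED, THROTTLED },
};

/* Hypothetical per-entity monitor state. */
struct sketch_mon {
	enum state state;
	uint64_t clk_reset_ns;	/* time of the last reset(clk) */
	uint64_t deadline_ns;	/* dl_deadline plus the configured threshold */
};

/* Returns false on a violation (invalid transition or missed deadline). */
static bool sketch_step(struct sketch_mon *m, enum event ev, uint64_t now_ns)
{
	enum state cur = m->state;
	unsigned char next = next_state[cur][ev];

	if (next == INVALID)
		return false;
	/* Invariant of ready and running: clk < DEADLINE_NS(). */
	if ((cur == READY || cur == RUNNING) &&
	    now_ns - m->clk_reset_ns >= m->deadline_ns)
		return false;
	/* Every dl_replenish transition in the model resets the clock. */
	if (ev == DL_REPLENISH)
		m->clk_reset_ns = now_ns;
	m->state = (enum state)next;
	return true;
}
```

An entity that is switched in and throttled before its deadline passes the
check. One that is still ready or running once the clock reaches the deadline
plus the threshold is reported as a violation on its next event.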
* [PATCH v8 07/12] rv: Convert the opid monitor to a hybrid automaton
From: Gabriele Monaco @ 2026-03-30 11:10 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
Gabriele Monaco, Jonathan Corbet, Masami Hiramatsu,
linux-trace-kernel, linux-doc
Cc: Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>
The opid monitor validates that wakeup and need_resched events only
occur with interrupts and preemption disabled by following the
preemptirq tracepoints.
As reported in [1], those tracepoints might be inaccurate in some
situations (e.g. NMIs).
Since the monitor doesn't validate other ordering properties, remove the
dependency on preemptirq tracepoints and convert the monitor to a hybrid
automaton to validate the constraint during event handling.
This makes the monitor more robust by also removing the workaround for
interrupts missing the preemption tracepoints, which was working on
PREEMPT_RT only, and allows the monitor to be built on kernels without
the preemptirq tracepoints.
[1] - https://lore.kernel.org/lkml/20250625120823.60600-1-gmonaco@redhat.com
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
Documentation/trace/rv/monitor_sched.rst | 62 +++---------
kernel/trace/rv/monitors/opid/Kconfig | 11 +-
kernel/trace/rv/monitors/opid/opid.c | 111 +++++++--------------
kernel/trace/rv/monitors/opid/opid.h | 86 ++++------------
kernel/trace/rv/monitors/opid/opid_trace.h | 4 +
kernel/trace/rv/rv_trace.h | 2 +-
tools/verification/models/sched/opid.dot | 36 ++-----
7 files changed, 82 insertions(+), 230 deletions(-)
diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
index 3f8381ad9ec7..0b96d6e147c6 100644
--- a/Documentation/trace/rv/monitor_sched.rst
+++ b/Documentation/trace/rv/monitor_sched.rst
@@ -346,55 +346,21 @@ Monitor opid
The operations with preemption and irq disabled (opid) monitor ensures
operations like ``wakeup`` and ``need_resched`` occur with interrupts and
-preemption disabled or during interrupt context, in such case preemption may
-not be disabled explicitly.
+preemption disabled.
``need_resched`` can be set by some RCU internals functions, in which case it
-doesn't match a task wakeup and might occur with only interrupts disabled::
-
- | sched_need_resched
- | sched_waking
- | irq_entry
- | +--------------------+
- v v |
- +------------------------------------------------------+
- +----------- | disabled | <+
- | +------------------------------------------------------+ |
- | | ^ |
- | | preempt_disable sched_need_resched |
- | preempt_enable | +--------------------+ |
- | v | v | |
- | +------------------------------------------------------+ |
- | | irq_disabled | |
- | +------------------------------------------------------+ |
- | | | ^ |
- | irq_entry irq_entry | | |
- | sched_need_resched v | irq_disable |
- | sched_waking +--------------+ | | |
- | +----- | | irq_enable | |
- | | | in_irq | | | |
- | +----> | | | | |
- | +--------------+ | | irq_disable
- | | | | |
- | irq_enable | irq_enable | | |
- | v v | |
- | #======================================================# |
- | H enabled H |
- | #======================================================# |
- | | ^ ^ preempt_enable | |
- | preempt_disable preempt_enable +--------------------+ |
- | v | |
- | +------------------+ | |
- +----------> | preempt_disabled | -+ |
- +------------------+ |
- | |
- +-------------------------------------------------------+
-
-This monitor is designed to work on ``PREEMPT_RT`` kernels, the special case of
-events occurring in interrupt context is a shortcut to identify valid scenarios
-where the preemption tracepoints might not be visible, during interrupts
-preemption is always disabled. On non- ``PREEMPT_RT`` kernels, the interrupts
-might invoke a softirq to set ``need_resched`` and wake up a task. This is
-another special case that is currently not supported by the monitor.
+doesn't match a task wakeup and might occur with only interrupts disabled.
+The interrupt and preemption status are validated by the hybrid automaton
+constraints when processing the events::
+
+ |
+ |
+ v
+ #=========# sched_need_resched;irq_off == 1
+ H H sched_waking;irq_off == 1 && preempt_off == 1
+ H any H ------------------------------------------------+
+ H H |
+ H H <-----------------------------------------------+
+ #=========#
References
----------
diff --git a/kernel/trace/rv/monitors/opid/Kconfig b/kernel/trace/rv/monitors/opid/Kconfig
index 561d32da572b..6d02e239b684 100644
--- a/kernel/trace/rv/monitors/opid/Kconfig
+++ b/kernel/trace/rv/monitors/opid/Kconfig
@@ -2,18 +2,13 @@
#
config RV_MON_OPID
depends on RV
- depends on TRACE_IRQFLAGS
- depends on TRACE_PREEMPT_TOGGLE
depends on RV_MON_SCHED
- default y if PREEMPT_RT
- select DA_MON_EVENTS_IMPLICIT
+ default y
+ select HA_MON_EVENTS_IMPLICIT
bool "opid monitor"
help
Monitor to ensure operations like wakeup and need resched occur with
- interrupts and preemption disabled or during IRQs, where preemption
- may not be disabled explicitly.
-
- This monitor is unstable on !PREEMPT_RT, say N unless you are testing it.
+ interrupts and preemption disabled.
For further information, see:
Documentation/trace/rv/monitor_sched.rst
diff --git a/kernel/trace/rv/monitors/opid/opid.c b/kernel/trace/rv/monitors/opid/opid.c
index 25a40e90fa40..4594c7c46601 100644
--- a/kernel/trace/rv/monitors/opid/opid.c
+++ b/kernel/trace/rv/monitors/opid/opid.c
@@ -10,94 +10,63 @@
#define MODULE_NAME "opid"
#include <trace/events/sched.h>
-#include <trace/events/irq.h>
-#include <trace/events/preemptirq.h>
#include <rv_trace.h>
#include <monitors/sched/sched.h>
#define RV_MON_TYPE RV_MON_PER_CPU
#include "opid.h"
-#include <rv/da_monitor.h>
+#include <rv/ha_monitor.h>
-#ifdef CONFIG_X86_LOCAL_APIC
-#include <asm/trace/irq_vectors.h>
-
-static void handle_vector_irq_entry(void *data, int vector)
+static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_opid env, u64 time_ns)
{
- da_handle_event(irq_entry_opid);
-}
-
-static void attach_vector_irq(void)
-{
- rv_attach_trace_probe("opid", local_timer_entry, handle_vector_irq_entry);
- if (IS_ENABLED(CONFIG_IRQ_WORK))
- rv_attach_trace_probe("opid", irq_work_entry, handle_vector_irq_entry);
- if (IS_ENABLED(CONFIG_SMP)) {
- rv_attach_trace_probe("opid", reschedule_entry, handle_vector_irq_entry);
- rv_attach_trace_probe("opid", call_function_entry, handle_vector_irq_entry);
- rv_attach_trace_probe("opid", call_function_single_entry, handle_vector_irq_entry);
+ if (env == irq_off_opid)
+ return irqs_disabled();
+ else if (env == preempt_off_opid) {
+ /*
+ * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
+ * preemption (adding one to the preempt_count). Since we are
+ * interested in the preempt_count at the time the tracepoint was
+ * hit, we consider 1 as still enabled.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPTION))
+ return (preempt_count() & PREEMPT_MASK) > 1;
+ return true;
}
+ return ENV_INVALID_VALUE;
}
-static void detach_vector_irq(void)
+static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
{
- rv_detach_trace_probe("opid", local_timer_entry, handle_vector_irq_entry);
- if (IS_ENABLED(CONFIG_IRQ_WORK))
- rv_detach_trace_probe("opid", irq_work_entry, handle_vector_irq_entry);
- if (IS_ENABLED(CONFIG_SMP)) {
- rv_detach_trace_probe("opid", reschedule_entry, handle_vector_irq_entry);
- rv_detach_trace_probe("opid", call_function_entry, handle_vector_irq_entry);
- rv_detach_trace_probe("opid", call_function_single_entry, handle_vector_irq_entry);
- }
+ bool res = true;
+
+ if (curr_state == any_opid && event == sched_need_resched_opid)
+ res = ha_get_env(ha_mon, irq_off_opid, time_ns) == 1ull;
+ else if (curr_state == any_opid && event == sched_waking_opid)
+ res = ha_get_env(ha_mon, irq_off_opid, time_ns) == 1ull &&
+ ha_get_env(ha_mon, preempt_off_opid, time_ns) == 1ull;
+ return res;
}
-#else
-/* We assume irq_entry tracepoints are sufficient on other architectures */
-static void attach_vector_irq(void) { }
-static void detach_vector_irq(void) { }
-#endif
-
-static void handle_irq_disable(void *data, unsigned long ip, unsigned long parent_ip)
+static bool ha_verify_constraint(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
{
- da_handle_event(irq_disable_opid);
-}
+ if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
+ return false;
-static void handle_irq_enable(void *data, unsigned long ip, unsigned long parent_ip)
-{
- da_handle_event(irq_enable_opid);
-}
-
-static void handle_irq_entry(void *data, int irq, struct irqaction *action)
-{
- da_handle_event(irq_entry_opid);
-}
-
-static void handle_preempt_disable(void *data, unsigned long ip, unsigned long parent_ip)
-{
- da_handle_event(preempt_disable_opid);
-}
-
-static void handle_preempt_enable(void *data, unsigned long ip, unsigned long parent_ip)
-{
- da_handle_event(preempt_enable_opid);
+ return true;
}
static void handle_sched_need_resched(void *data, struct task_struct *tsk, int cpu, int tif)
{
- /* The monitor's intitial state is not in_irq */
- if (this_cpu_read(hardirq_context))
- da_handle_event(sched_need_resched_opid);
- else
- da_handle_start_event(sched_need_resched_opid);
+ da_handle_start_run_event(sched_need_resched_opid);
}
static void handle_sched_waking(void *data, struct task_struct *p)
{
- /* The monitor's intitial state is not in_irq */
- if (this_cpu_read(hardirq_context))
- da_handle_event(sched_waking_opid);
- else
- da_handle_start_event(sched_waking_opid);
+ da_handle_start_run_event(sched_waking_opid);
}
static int enable_opid(void)
@@ -108,14 +77,8 @@ static int enable_opid(void)
if (retval)
return retval;
- rv_attach_trace_probe("opid", irq_disable, handle_irq_disable);
- rv_attach_trace_probe("opid", irq_enable, handle_irq_enable);
- rv_attach_trace_probe("opid", irq_handler_entry, handle_irq_entry);
- rv_attach_trace_probe("opid", preempt_disable, handle_preempt_disable);
- rv_attach_trace_probe("opid", preempt_enable, handle_preempt_enable);
rv_attach_trace_probe("opid", sched_set_need_resched_tp, handle_sched_need_resched);
rv_attach_trace_probe("opid", sched_waking, handle_sched_waking);
- attach_vector_irq();
return 0;
}
@@ -124,14 +87,8 @@ static void disable_opid(void)
{
rv_this.enabled = 0;
- rv_detach_trace_probe("opid", irq_disable, handle_irq_disable);
- rv_detach_trace_probe("opid", irq_enable, handle_irq_enable);
- rv_detach_trace_probe("opid", irq_handler_entry, handle_irq_entry);
- rv_detach_trace_probe("opid", preempt_disable, handle_preempt_disable);
- rv_detach_trace_probe("opid", preempt_enable, handle_preempt_enable);
rv_detach_trace_probe("opid", sched_set_need_resched_tp, handle_sched_need_resched);
rv_detach_trace_probe("opid", sched_waking, handle_sched_waking);
- detach_vector_irq();
da_monitor_destroy();
}
diff --git a/kernel/trace/rv/monitors/opid/opid.h b/kernel/trace/rv/monitors/opid/opid.h
index 092992514970..fb0aa4c28aa6 100644
--- a/kernel/trace/rv/monitors/opid/opid.h
+++ b/kernel/trace/rv/monitors/opid/opid.h
@@ -8,30 +8,31 @@
#define MONITOR_NAME opid
enum states_opid {
- disabled_opid,
- enabled_opid,
- in_irq_opid,
- irq_disabled_opid,
- preempt_disabled_opid,
+ any_opid,
state_max_opid,
};
#define INVALID_STATE state_max_opid
enum events_opid {
- irq_disable_opid,
- irq_enable_opid,
- irq_entry_opid,
- preempt_disable_opid,
- preempt_enable_opid,
sched_need_resched_opid,
sched_waking_opid,
event_max_opid,
};
+enum envs_opid {
+ irq_off_opid,
+ preempt_off_opid,
+ env_max_opid,
+ env_max_stored_opid = irq_off_opid,
+};
+
+_Static_assert(env_max_stored_opid <= MAX_HA_ENV_LEN, "Not enough slots");
+
struct automaton_opid {
char *state_names[state_max_opid];
char *event_names[event_max_opid];
+ char *env_names[env_max_opid];
unsigned char function[state_max_opid][event_max_opid];
unsigned char initial_state;
bool final_states[state_max_opid];
@@ -39,68 +40,19 @@ struct automaton_opid {
static const struct automaton_opid automaton_opid = {
.state_names = {
- "disabled",
- "enabled",
- "in_irq",
- "irq_disabled",
- "preempt_disabled",
+ "any",
},
.event_names = {
- "irq_disable",
- "irq_enable",
- "irq_entry",
- "preempt_disable",
- "preempt_enable",
"sched_need_resched",
"sched_waking",
},
+ .env_names = {
+ "irq_off",
+ "preempt_off",
+ },
.function = {
- {
- INVALID_STATE,
- preempt_disabled_opid,
- disabled_opid,
- INVALID_STATE,
- irq_disabled_opid,
- disabled_opid,
- disabled_opid,
- },
- {
- irq_disabled_opid,
- INVALID_STATE,
- INVALID_STATE,
- preempt_disabled_opid,
- enabled_opid,
- INVALID_STATE,
- INVALID_STATE,
- },
- {
- INVALID_STATE,
- enabled_opid,
- in_irq_opid,
- INVALID_STATE,
- INVALID_STATE,
- in_irq_opid,
- in_irq_opid,
- },
- {
- INVALID_STATE,
- enabled_opid,
- in_irq_opid,
- disabled_opid,
- INVALID_STATE,
- irq_disabled_opid,
- INVALID_STATE,
- },
- {
- disabled_opid,
- INVALID_STATE,
- INVALID_STATE,
- INVALID_STATE,
- enabled_opid,
- INVALID_STATE,
- INVALID_STATE,
- },
+ { any_opid, any_opid },
},
- .initial_state = disabled_opid,
- .final_states = { 0, 1, 0, 0, 0 },
+ .initial_state = any_opid,
+ .final_states = { 1 },
};
diff --git a/kernel/trace/rv/monitors/opid/opid_trace.h b/kernel/trace/rv/monitors/opid/opid_trace.h
index 3df6ff955c30..b04005b64208 100644
--- a/kernel/trace/rv/monitors/opid/opid_trace.h
+++ b/kernel/trace/rv/monitors/opid/opid_trace.h
@@ -12,4 +12,8 @@ DEFINE_EVENT(event_da_monitor, event_opid,
DEFINE_EVENT(error_da_monitor, error_opid,
TP_PROTO(char *state, char *event),
TP_ARGS(state, event));
+
+DEFINE_EVENT(error_env_da_monitor, error_env_opid,
+ TP_PROTO(char *state, char *event, char *env),
+ TP_ARGS(state, event, env));
#endif /* CONFIG_RV_MON_OPID */
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 1661f8fe4a88..9e8072d863a2 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -62,7 +62,6 @@ DECLARE_EVENT_CLASS(error_da_monitor,
#include <monitors/scpd/scpd_trace.h>
#include <monitors/snep/snep_trace.h>
#include <monitors/sts/sts_trace.h>
-#include <monitors/opid/opid_trace.h>
// Add new monitors based on CONFIG_DA_MON_EVENTS_IMPLICIT here
#ifdef CONFIG_HA_MON_EVENTS_IMPLICIT
@@ -91,6 +90,7 @@ DECLARE_EVENT_CLASS(error_env_da_monitor,
__get_str(env))
);
+#include <monitors/opid/opid_trace.h>
// Add new monitors based on CONFIG_HA_MON_EVENTS_IMPLICIT here
#endif
diff --git a/tools/verification/models/sched/opid.dot b/tools/verification/models/sched/opid.dot
index 840052f6952b..511051fce430 100644
--- a/tools/verification/models/sched/opid.dot
+++ b/tools/verification/models/sched/opid.dot
@@ -1,35 +1,13 @@
digraph state_automaton {
center = true;
size = "7,11";
- {node [shape = plaintext, style=invis, label=""] "__init_disabled"};
- {node [shape = circle] "disabled"};
- {node [shape = doublecircle] "enabled"};
- {node [shape = circle] "enabled"};
- {node [shape = circle] "in_irq"};
- {node [shape = circle] "irq_disabled"};
- {node [shape = circle] "preempt_disabled"};
- "__init_disabled" -> "disabled";
- "disabled" [label = "disabled"];
- "disabled" -> "disabled" [ label = "sched_need_resched\nsched_waking\nirq_entry" ];
- "disabled" -> "irq_disabled" [ label = "preempt_enable" ];
- "disabled" -> "preempt_disabled" [ label = "irq_enable" ];
- "enabled" [label = "enabled", color = green3];
- "enabled" -> "enabled" [ label = "preempt_enable" ];
- "enabled" -> "irq_disabled" [ label = "irq_disable" ];
- "enabled" -> "preempt_disabled" [ label = "preempt_disable" ];
- "in_irq" [label = "in_irq"];
- "in_irq" -> "enabled" [ label = "irq_enable" ];
- "in_irq" -> "in_irq" [ label = "sched_need_resched\nsched_waking\nirq_entry" ];
- "irq_disabled" [label = "irq_disabled"];
- "irq_disabled" -> "disabled" [ label = "preempt_disable" ];
- "irq_disabled" -> "enabled" [ label = "irq_enable" ];
- "irq_disabled" -> "in_irq" [ label = "irq_entry" ];
- "irq_disabled" -> "irq_disabled" [ label = "sched_need_resched" ];
- "preempt_disabled" [label = "preempt_disabled"];
- "preempt_disabled" -> "disabled" [ label = "irq_disable" ];
- "preempt_disabled" -> "enabled" [ label = "preempt_enable" ];
+ {node [shape = plaintext, style=invis, label=""] "__init_any"};
+ {node [shape = doublecircle] "any"};
+ "__init_any" -> "any";
+ "any" [label = "any", color = green3];
+ "any" -> "any" [ label = "sched_need_resched;irq_off == 1\nsched_waking;irq_off == 1 && preempt_off == 1" ];
{ rank = min ;
- "__init_disabled";
- "disabled";
+ "__init_any";
+ "any";
}
}
--
2.53.0
* [PATCH v8 06/12] rv: Add sample hybrid monitor stall
From: Gabriele Monaco @ 2026-03-30 11:10 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
Jonathan Corbet, Gabriele Monaco, Masami Hiramatsu, linux-doc,
linux-trace-kernel
Cc: Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>
Add a sample monitor to showcase hybrid/timed automata.
The stall monitor identifies tasks stalled for longer than a threshold
and reacts when that happens.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
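As a rough illustration of the model this monitor implements, the
automaton and its clock can be sketched in userspace C. This is a
simplified sketch, not the monitor code: struct task_mon, step_sketch()
and the explicit "now" argument are hypothetical, and the invariant is
only checked when an event arrives, whereas the real monitor also arms a
timer so that a stall is reported even if no further event occurs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum state_sketch { DEQUEUED, ENQUEUED, RUNNING, STATE_MAX };
enum event_sketch { SWITCH_IN, SWITCH_PREEMPT, SWITCH_WAIT, WAKEUP, EVENT_MAX };
#define INVALID STATE_MAX

/* Transition function, same layout as the table in stall.h. */
static const unsigned char next_state[STATE_MAX][EVENT_MAX] = {
	[DEQUEUED] = { INVALID, INVALID,  INVALID,  ENQUEUED },
	[ENQUEUED] = { RUNNING, INVALID,  INVALID,  ENQUEUED },
	[RUNNING]  = { RUNNING, ENQUEUED, DEQUEUED, RUNNING  },
};

/* Hypothetical per-task monitor storage. */
struct task_mon {
	enum state_sketch state;
	uint64_t clk_reset_at;	/* time of the last reset(clk), in ticks */
};

/* Returns false on a violation, true if the event was accepted. */
static bool step_sketch(struct task_mon *m, enum event_sketch ev,
			uint64_t now, uint64_t threshold)
{
	assert(m->state < STATE_MAX && ev < EVENT_MAX);
	unsigned char next = next_state[m->state][ev];

	if (next == INVALID)
		return false;
	/* Invariant of "enqueued": clk < threshold. */
	if (m->state == ENQUEUED && now - m->clk_reset_at >= threshold)
		return false;
	/* reset(clk) on the two transitions entering "enqueued" from elsewhere. */
	if ((m->state == DEQUEUED && ev == WAKEUP) ||
	    (m->state == RUNNING && ev == SWITCH_PREEMPT))
		m->clk_reset_at = now;
	m->state = next;
	return true;
}
```

A task that is woken at tick 10 and switched in at tick 500 is accepted
with a threshold of 1000; one that is preempted at tick 600 and only
switched in again at tick 1600 violates the invariant.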
Documentation/tools/rv/index.rst | 1 +
Documentation/tools/rv/rv-mon-stall.rst | 44 ++++++
Documentation/trace/rv/index.rst | 1 +
Documentation/trace/rv/monitor_stall.rst | 43 ++++++
kernel/trace/rv/Kconfig | 1 +
kernel/trace/rv/Makefile | 1 +
kernel/trace/rv/monitors/stall/Kconfig | 13 ++
kernel/trace/rv/monitors/stall/stall.c | 150 +++++++++++++++++++
kernel/trace/rv/monitors/stall/stall.h | 81 ++++++++++
kernel/trace/rv/monitors/stall/stall_trace.h | 19 +++
kernel/trace/rv/rv_trace.h | 1 +
tools/verification/models/stall.dot | 22 +++
12 files changed, 377 insertions(+)
create mode 100644 Documentation/tools/rv/rv-mon-stall.rst
create mode 100644 Documentation/trace/rv/monitor_stall.rst
create mode 100644 kernel/trace/rv/monitors/stall/Kconfig
create mode 100644 kernel/trace/rv/monitors/stall/stall.c
create mode 100644 kernel/trace/rv/monitors/stall/stall.h
create mode 100644 kernel/trace/rv/monitors/stall/stall_trace.h
create mode 100644 tools/verification/models/stall.dot
diff --git a/Documentation/tools/rv/index.rst b/Documentation/tools/rv/index.rst
index fd42b0017d07..2aaa01c9fe48 100644
--- a/Documentation/tools/rv/index.rst
+++ b/Documentation/tools/rv/index.rst
@@ -16,3 +16,4 @@ Runtime verification (rv) tool
rv-mon-wip
rv-mon-wwnr
rv-mon-sched
+ rv-mon-stall
diff --git a/Documentation/tools/rv/rv-mon-stall.rst b/Documentation/tools/rv/rv-mon-stall.rst
new file mode 100644
index 000000000000..c79d7c2e4dd4
--- /dev/null
+++ b/Documentation/tools/rv/rv-mon-stall.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+rv-mon-stall
+============
+--------------------
+Stalled task monitor
+--------------------
+
+:Manual section: 1
+
+SYNOPSIS
+========
+
+**rv mon stall** [*OPTIONS*]
+
+DESCRIPTION
+===========
+
+The stalled task (**stall**) monitor is a sample per-task timed monitor that
+checks if tasks are scheduled within a defined threshold after they are ready.
+
+See kernel documentation for further information about this monitor:
+<https://docs.kernel.org/trace/rv/monitor_stall.html>
+
+OPTIONS
+=======
+
+.. include:: common_ikm.rst
+
+SEE ALSO
+========
+
+**rv**\(1), **rv-mon**\(1)
+
+Linux kernel *RV* documentation:
+<https://www.kernel.org/doc/html/latest/trace/rv/index.html>
+
+AUTHOR
+======
+
+Written by Gabriele Monaco <gmonaco@redhat.com>
+
+.. include:: common_appendix.rst
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index ad298784bda2..bf9962f49959 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -16,3 +16,4 @@ Runtime Verification
monitor_wwnr.rst
monitor_sched.rst
monitor_rtapp.rst
+ monitor_stall.rst
diff --git a/Documentation/trace/rv/monitor_stall.rst b/Documentation/trace/rv/monitor_stall.rst
new file mode 100644
index 000000000000..d29e820b2433
--- /dev/null
+++ b/Documentation/trace/rv/monitor_stall.rst
@@ -0,0 +1,43 @@
+Monitor stall
+=============
+
+- Name: stall - stalled task monitor
+- Type: per-task hybrid automaton
+- Author: Gabriele Monaco <gmonaco@redhat.com>
+
+Description
+-----------
+
+The stalled task (stall) monitor is a sample per-task timed monitor that checks
+if tasks are scheduled within a defined threshold after they are ready::
+
+ |
+ |
+ v
+ #==========================#
+ +-----------------> H dequeued H
+ | #==========================#
+ | |
+ sched_switch_wait | sched_wakeup;reset(clk)
+ | v
+ | +--------------------------+ <+
+ | | enqueued | | sched_wakeup
+ | | clk < threshold_jiffies | -+
+ | +--------------------------+
+ | | ^
+ | sched_switch_in sched_switch_preempt;reset(clk)
+ | v |
+ | +--------------------------+
+ +------------------ | running |
+ +--------------------------+
+ ^ sched_switch_in |
+ | sched_wakeup |
+ +----------------------+
+
+The threshold can be configured as a parameter by either booting with the
+``stall.threshold_jiffies=<new value>`` argument or writing a new value to
+``/sys/module/stall/parameters/threshold_jiffies``.
+
+Specification
+-------------
+Graphviz Dot file in tools/verification/models/stall.dot
diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig
index 4ad392dfc57f..720fbe4935f8 100644
--- a/kernel/trace/rv/Kconfig
+++ b/kernel/trace/rv/Kconfig
@@ -78,6 +78,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig"
source "kernel/trace/rv/monitors/sleep/Kconfig"
# Add new rtapp monitors here
+source "kernel/trace/rv/monitors/stall/Kconfig"
# Add new monitors here
config RV_REACTORS
diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
index 750e4ad6fa0f..51c95e2d2da6 100644
--- a/kernel/trace/rv/Makefile
+++ b/kernel/trace/rv/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o
obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o
obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o
obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o
+obj-$(CONFIG_RV_MON_STALL) += monitors/stall/stall.o
# Add new monitors here
obj-$(CONFIG_RV_REACTORS) += rv_reactors.o
obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o
diff --git a/kernel/trace/rv/monitors/stall/Kconfig b/kernel/trace/rv/monitors/stall/Kconfig
new file mode 100644
index 000000000000..6f846b642544
--- /dev/null
+++ b/kernel/trace/rv/monitors/stall/Kconfig
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+config RV_MON_STALL
+ depends on RV
+ select HA_MON_EVENTS_ID
+ bool "stall monitor"
+ help
+ Enable the stall sample monitor that illustrates the usage of hybrid
+ automata monitors. It can be used to identify tasks stalled for
+ longer than a threshold.
+
+ For further information, see:
+ Documentation/trace/rv/monitor_stall.rst
diff --git a/kernel/trace/rv/monitors/stall/stall.c b/kernel/trace/rv/monitors/stall/stall.c
new file mode 100644
index 000000000000..9ccfda6b0e73
--- /dev/null
+++ b/kernel/trace/rv/monitors/stall/stall.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ftrace.h>
+#include <linux/tracepoint.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/rv.h>
+#include <rv/instrumentation.h>
+
+#define MODULE_NAME "stall"
+
+#include <trace/events/sched.h>
+#include <rv_trace.h>
+
+#define RV_MON_TYPE RV_MON_PER_TASK
+#define HA_TIMER_TYPE HA_TIMER_WHEEL
+#include "stall.h"
+#include <rv/ha_monitor.h>
+
+static u64 threshold_jiffies = 1000;
+module_param(threshold_jiffies, ullong, 0644);
+
+static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_stall env, u64 time_ns)
+{
+ if (env == clk_stall)
+ return ha_get_clk_jiffy(ha_mon, env);
+ return ENV_INVALID_VALUE;
+}
+
+static void ha_reset_env(struct ha_monitor *ha_mon, enum envs_stall env, u64 time_ns)
+{
+ if (env == clk_stall)
+ ha_reset_clk_jiffy(ha_mon, env);
+}
+
+static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (curr_state == enqueued_stall)
+ return ha_check_invariant_jiffy(ha_mon, clk_stall, time_ns);
+ return true;
+}
+
+static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ bool res = true;
+
+ if (curr_state == dequeued_stall && event == sched_wakeup_stall)
+ ha_reset_env(ha_mon, clk_stall, time_ns);
+ else if (curr_state == running_stall && event == sched_switch_preempt_stall)
+ ha_reset_env(ha_mon, clk_stall, time_ns);
+ return res;
+}
+
+static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (next_state == curr_state)
+ return;
+ if (next_state == enqueued_stall)
+ ha_start_timer_jiffy(ha_mon, clk_stall, threshold_jiffies, time_ns);
+ else if (curr_state == enqueued_stall)
+ ha_cancel_timer(ha_mon);
+}
+
+static bool ha_verify_constraint(struct ha_monitor *ha_mon,
+ enum states curr_state, enum events event,
+ enum states next_state, u64 time_ns)
+{
+ if (!ha_verify_invariants(ha_mon, curr_state, event, next_state, time_ns))
+ return false;
+
+ if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
+ return false;
+
+ ha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);
+
+ return true;
+}
+
+static void handle_sched_switch(void *data, bool preempt,
+ struct task_struct *prev,
+ struct task_struct *next,
+ unsigned int prev_state)
+{
+ if (!preempt && prev_state != TASK_RUNNING)
+ da_handle_start_event(prev, sched_switch_wait_stall);
+ else
+ da_handle_event(prev, sched_switch_preempt_stall);
+ da_handle_event(next, sched_switch_in_stall);
+}
+
+static void handle_sched_wakeup(void *data, struct task_struct *p)
+{
+ da_handle_event(p, sched_wakeup_stall);
+}
+
+static int enable_stall(void)
+{
+ int retval;
+
+ retval = da_monitor_init();
+ if (retval)
+ return retval;
+
+ rv_attach_trace_probe("stall", sched_switch, handle_sched_switch);
+ rv_attach_trace_probe("stall", sched_wakeup, handle_sched_wakeup);
+
+ return 0;
+}
+
+static void disable_stall(void)
+{
+ rv_this.enabled = 0;
+
+ rv_detach_trace_probe("stall", sched_switch, handle_sched_switch);
+ rv_detach_trace_probe("stall", sched_wakeup, handle_sched_wakeup);
+
+ da_monitor_destroy();
+}
+
+static struct rv_monitor rv_this = {
+ .name = "stall",
+ .description = "identify tasks stalled for longer than a threshold.",
+ .enable = enable_stall,
+ .disable = disable_stall,
+ .reset = da_monitor_reset_all,
+ .enabled = 0,
+};
+
+static int __init register_stall(void)
+{
+ return rv_register_monitor(&rv_this, NULL);
+}
+
+static void __exit unregister_stall(void)
+{
+ rv_unregister_monitor(&rv_this);
+}
+
+module_init(register_stall);
+module_exit(unregister_stall);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Gabriele Monaco <gmonaco@redhat.com>");
+MODULE_DESCRIPTION("stall: identify tasks stalled for longer than a threshold.");
diff --git a/kernel/trace/rv/monitors/stall/stall.h b/kernel/trace/rv/monitors/stall/stall.h
new file mode 100644
index 000000000000..638520cb1082
--- /dev/null
+++ b/kernel/trace/rv/monitors/stall/stall.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Automatically generated C representation of stall automaton
+ * For further information about this format, see kernel documentation:
+ * Documentation/trace/rv/deterministic_automata.rst
+ */
+
+#define MONITOR_NAME stall
+
+enum states_stall {
+ dequeued_stall,
+ enqueued_stall,
+ running_stall,
+ state_max_stall,
+};
+
+#define INVALID_STATE state_max_stall
+
+enum events_stall {
+ sched_switch_in_stall,
+ sched_switch_preempt_stall,
+ sched_switch_wait_stall,
+ sched_wakeup_stall,
+ event_max_stall,
+};
+
+enum envs_stall {
+ clk_stall,
+ env_max_stall,
+ env_max_stored_stall = env_max_stall,
+};
+
+_Static_assert(env_max_stored_stall <= MAX_HA_ENV_LEN, "Not enough slots");
+
+struct automaton_stall {
+ char *state_names[state_max_stall];
+ char *event_names[event_max_stall];
+ char *env_names[env_max_stall];
+ unsigned char function[state_max_stall][event_max_stall];
+ unsigned char initial_state;
+ bool final_states[state_max_stall];
+};
+
+static const struct automaton_stall automaton_stall = {
+ .state_names = {
+ "dequeued",
+ "enqueued",
+ "running",
+ },
+ .event_names = {
+ "sched_switch_in",
+ "sched_switch_preempt",
+ "sched_switch_wait",
+ "sched_wakeup",
+ },
+ .env_names = {
+ "clk",
+ },
+ .function = {
+ {
+ INVALID_STATE,
+ INVALID_STATE,
+ INVALID_STATE,
+ enqueued_stall,
+ },
+ {
+ running_stall,
+ INVALID_STATE,
+ INVALID_STATE,
+ enqueued_stall,
+ },
+ {
+ running_stall,
+ enqueued_stall,
+ dequeued_stall,
+ running_stall,
+ },
+ },
+ .initial_state = dequeued_stall,
+ .final_states = { 1, 0, 0 },
+};
diff --git a/kernel/trace/rv/monitors/stall/stall_trace.h b/kernel/trace/rv/monitors/stall/stall_trace.h
new file mode 100644
index 000000000000..6a7cc1b1d040
--- /dev/null
+++ b/kernel/trace/rv/monitors/stall/stall_trace.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Snippet to be included in rv_trace.h
+ */
+
+#ifdef CONFIG_RV_MON_STALL
+DEFINE_EVENT(event_da_monitor_id, event_stall,
+ TP_PROTO(int id, char *state, char *event, char *next_state, bool final_state),
+ TP_ARGS(id, state, event, next_state, final_state));
+
+DEFINE_EVENT(error_da_monitor_id, error_stall,
+ TP_PROTO(int id, char *state, char *event),
+ TP_ARGS(id, state, event));
+
+DEFINE_EVENT(error_env_da_monitor_id, error_env_stall,
+ TP_PROTO(int id, char *state, char *event, char *env),
+ TP_ARGS(id, state, event, env));
+#endif /* CONFIG_RV_MON_STALL */
diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h
index 7c598967bc0e..1661f8fe4a88 100644
--- a/kernel/trace/rv/rv_trace.h
+++ b/kernel/trace/rv/rv_trace.h
@@ -187,6 +187,7 @@ DECLARE_EVENT_CLASS(error_env_da_monitor_id,
__get_str(env))
);
+#include <monitors/stall/stall_trace.h>
// Add new monitors based on CONFIG_HA_MON_EVENTS_ID here
#endif
diff --git a/tools/verification/models/stall.dot b/tools/verification/models/stall.dot
new file mode 100644
index 000000000000..50077d1dff74
--- /dev/null
+++ b/tools/verification/models/stall.dot
@@ -0,0 +1,22 @@
+digraph state_automaton {
+ center = true;
+ size = "7,11";
+ {node [shape = circle] "enqueued"};
+ {node [shape = plaintext, style=invis, label=""] "__init_dequeued"};
+ {node [shape = doublecircle] "dequeued"};
+ {node [shape = circle] "running"};
+ "__init_dequeued" -> "dequeued";
+ "enqueued" [label = "enqueued\nclk < threshold_jiffies"];
+ "running" [label = "running"];
+ "dequeued" [label = "dequeued", color = green3];
+ "running" -> "running" [ label = "sched_switch_in\nsched_wakeup" ];
+ "enqueued" -> "enqueued" [ label = "sched_wakeup" ];
+ "enqueued" -> "running" [ label = "sched_switch_in" ];
+ "running" -> "dequeued" [ label = "sched_switch_wait" ];
+ "dequeued" -> "enqueued" [ label = "sched_wakeup;reset(clk)" ];
+ "running" -> "enqueued" [ label = "sched_switch_preempt;reset(clk)" ];
+ { rank = min ;
+ "__init_dequeued";
+ "dequeued";
+ }
+}
--
2.53.0
* [PATCH v8 05/12] Documentation/rv: Add documentation about hybrid automata
From: Gabriele Monaco @ 2026-03-30 11:10 UTC (permalink / raw)
To: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
Gabriele Monaco, Jonathan Corbet, linux-trace-kernel, linux-doc
Cc: Juri Lelli, Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260330111010.153663-1-gmonaco@redhat.com>
Describe the theory and implementation of hybrid automata in the
dedicated page hybrid_automata.rst.
Include a section on how to integrate a hybrid automaton in
monitor_synthesis.rst.
Also remove a stray $ in deterministic_automata.rst.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
Notes:
V8:
* Minor improvement in docs
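
For reviewers, the guard grammar documented in hybrid_automata.rst
(g = v < c | v > c | v <= c | v >= c | v == c | v != c | g && g | true)
can be read as a conjunction of atomic comparisons. The following
userspace C sketch shows one possible way to evaluate such a guard; it is
an assumption made for illustration only, and struct atom, eval_atom() and
eval_guard() are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum cmp_op { OP_LT, OP_GT, OP_LE, OP_GE, OP_EQ, OP_NE };

/* One atomic constraint "v op c": var indexes the valuation array. */
struct atom {
	size_t var;
	enum cmp_op op;
	uint64_t c;
};

static bool eval_atom(const struct atom *a, const uint64_t *vals, size_t nvars)
{
	assert(a->var < nvars);
	uint64_t v = vals[a->var];

	switch (a->op) {
	case OP_LT: return v < a->c;
	case OP_GT: return v > a->c;
	case OP_LE: return v <= a->c;
	case OP_GE: return v >= a->c;
	case OP_EQ: return v == a->c;
	case OP_NE: return v != a->c;
	}
	return false;
}

/*
 * A guard is the conjunction (g && g) of n atoms; the empty conjunction
 * corresponds to the constant "true" of the grammar.
 */
static bool eval_guard(const struct atom *atoms, size_t n,
		       const uint64_t *vals, size_t nvars)
{
	for (size_t i = 0; i < n; i++)
		if (!eval_atom(&atoms[i], vals, nvars))
			return false;
	return true;
}
```

With a valuation where variable 0 is 1 and variable 1 is 0, the guard
"v0 == 1" holds while "v0 == 1 && v1 == 1" does not.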
.../trace/rv/deterministic_automata.rst | 2 +-
Documentation/trace/rv/hybrid_automata.rst | 341 ++++++++++++++++++
Documentation/trace/rv/index.rst | 1 +
Documentation/trace/rv/monitor_synthesis.rst | 117 +++++-
4 files changed, 458 insertions(+), 3 deletions(-)
create mode 100644 Documentation/trace/rv/hybrid_automata.rst
diff --git a/Documentation/trace/rv/deterministic_automata.rst b/Documentation/trace/rv/deterministic_automata.rst
index d0638f95a455..7a1c2b20ec72 100644
--- a/Documentation/trace/rv/deterministic_automata.rst
+++ b/Documentation/trace/rv/deterministic_automata.rst
@@ -11,7 +11,7 @@ where:
- *E* is the finite set of events;
- x\ :subscript:`0` is the initial state;
- X\ :subscript:`m` (subset of *X*) is the set of marked (or final) states.
-- *f* : *X* x *E* -> *X* $ is the transition function. It defines the state
+- *f* : *X* x *E* -> *X* is the transition function. It defines the state
transition in the occurrence of an event from *E* in the state *X*. In the
special case of deterministic automata, the occurrence of the event in *E*
in a state in *X* has a deterministic next state from *X*.
diff --git a/Documentation/trace/rv/hybrid_automata.rst b/Documentation/trace/rv/hybrid_automata.rst
new file mode 100644
index 000000000000..f20d489f18c1
--- /dev/null
+++ b/Documentation/trace/rv/hybrid_automata.rst
@@ -0,0 +1,341 @@
+Hybrid Automata
+===============
+
+Hybrid automata are an extension of deterministic automata. There are several
+definitions of hybrid automata in the literature. The adaptation implemented
+here is formally denoted by G and defined as a 7-tuple:
+
+ *G* = { *X*, *E*, *V*, *f*, x\ :subscript:`0`, X\ :subscript:`m`, *i* }
+
+- *X* is the set of states;
+- *E* is the finite set of events;
+- *V* is the finite set of environment variables;
+- x\ :subscript:`0` is the initial state;
+- X\ :subscript:`m` (subset of *X*) is the set of marked (or final) states.
+- *f* : *X* x *E* x *C(V)* -> *X* is the transition function.
+ It defines the state transition in the occurrence of an event from *E* in the
+ state *X*. Unlike deterministic automata, the transition function also
+ includes guards from the set of all possible constraints (defined as *C(V)*).
+ Guards can be true or false with the valuation of *V* when the event occurs,
+ and the transition is possible only when constraints are true. Similarly to
+ deterministic automata, the occurrence of the event in *E* in a state in *X*
+ has a deterministic next state from *X*, if the guard is true.
+- *i* : *X* -> *C'(V)* is the invariant assignment function. It assigns a
+ constraint to each state in *X*, and every state in *X* must be left
+ before its invariant becomes false. We can omit the representation of
+ invariants whose value is true regardless of the valuation of *V*.
+
+The set of all possible constraints *C(V)* is defined according to the
+following grammar:
+
+ g = v < c | v > c | v <= c | v >= c | v == c | v != c | g && g | true
+
+With v a variable in *V* and c a numerical value.
+
+We define the special case of hybrid automata whose variables grow with uniform
+rates as timed automata. In this case, the variables are called clocks.
+As the name implies, timed automata can be used to describe real time.
+Additionally, clocks support another type of guard which always evaluates to true:
+
+ reset(v)
+
+The reset constraint is used to set the value of a clock to 0.
+
+The set of invariant constraints *C'(V)* is a subset of *C(V)* including only
+constraints of the form:
+
+ g = v < c | true
+
+This simplifies the implementation as a clock expiration is a necessary and
+sufficient condition for the violation of invariants while still allowing more
+complex constraints to be specified as guards.
+
+It is important to note that any hybrid automaton is a valid deterministic
+automaton with additional guards and invariants. Those can only further
+constrain what transitions are valid but it is not possible to define
+transition functions starting from the same state in *X* and the same event in
+*E* but ending up in different states in *X* based on the valuation of *V*.
+
+Examples
+--------
+
+Wip as hybrid automaton
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The 'wip' (wakeup in preemptive) example introduced as a deterministic automaton
+can also be described as:
+
+- *X* = { ``any_thread_running`` }
+- *E* = { ``sched_waking`` }
+- *V* = { ``preemptive`` }
+- x\ :subscript:`0` = ``any_thread_running``
+- X\ :subscript:`m` = {``any_thread_running``}
+- *f* =
+ - *f*\ (``any_thread_running``, ``sched_waking``, ``preemptive==0``) = ``any_thread_running``
+- *i* =
+ - *i*\ (``any_thread_running``) = ``true``
+
+Which can be represented graphically as::
+
+ |
+ |
+ v
+ #====================# sched_waking;preemptive==0
+ H H ------------------------------+
+ H any_thread_running H |
+ H H <-----------------------------+
+ #====================#
+
+In this example, by using the preemptive state of the system as an environment
+variable, we can assert this constraint on ``sched_waking`` without requiring
+preemption events (as we would in a deterministic automaton), which can be
+useful in case those events are not available or not reliable on the system.
+
+Since all the invariants in *i* are true, we can omit them from the representation.
+
+Stall model with guards (iteration 1)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As a sample timed automaton we can define 'stall' as:
+
+- *X* = { ``dequeued``, ``enqueued``, ``running``}
+- *E* = { ``enqueue``, ``dequeue``, ``switch_in``}
+- *V* = { ``clk`` }
+- x\ :subscript:`0` = ``dequeued``
+- X\ :subscript:`m` = {``dequeued``}
+- *f* =
+ - *f*\ (``enqueued``, ``switch_in``, ``clk < threshold``) = ``running``
+ - *f*\ (``running``, ``dequeue``) = ``dequeued``
+ - *f*\ (``dequeued``, ``enqueue``, ``reset(clk)``) = ``enqueued``
+- *i* = *omitted as all true*
+
+Graphically represented as::
+
+ |
+ |
+ v
+ #============================#
+ H dequeued H <+
+ #============================# |
+ | |
+ | enqueue; reset(clk) |
+ v |
+ +----------------------------+ |
+ | enqueued | | dequeue
+ +----------------------------+ |
+ | |
+ | switch_in; clk < threshold |
+ v |
+ +----------------------------+ |
+ | running | -+
+ +----------------------------+
+
+This model imposes that the time between when a task is enqueued (it becomes
+runnable) and when the task gets to run must be lower than a certain threshold.
+A failure in this model means that the task is starving.
+One problem in using guards on the edges in this case is that the model will
+not report a failure until the ``switch_in`` event occurs. This means that,
+according to the model, it is valid for the task never to run.
+
+Stall model with invariants (iteration 2)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The first iteration isn't exactly what was intended. We can change the model to:
+
+- *X* = { ``dequeued``, ``enqueued``, ``running``}
+- *E* = { ``enqueue``, ``dequeue``, ``switch_in``}
+- *V* = { ``clk`` }
+- x\ :subscript:`0` = ``dequeued``
+- X\ :subscript:`m` = {``dequeued``}
+- *f* =
+ - *f*\ (``enqueued``, ``switch_in``) = ``running``
+ - *f*\ (``running``, ``dequeue``) = ``dequeued``
+ - *f*\ (``dequeued``, ``enqueue``, ``reset(clk)``) = ``enqueued``
+- *i* =
+ - *i*\ (``enqueued``) = ``clk < threshold``
+
+Graphically::
+
+ |
+ |
+ v
+ #=========================#
+ H dequeued H <+
+ #=========================# |
+ | |
+ | enqueue; reset(clk) |
+ v |
+ +-------------------------+ |
+ | enqueued | |
+ | clk < threshold | | dequeue
+ +-------------------------+ |
+ | |
+ | switch_in |
+ v |
+ +-------------------------+ |
+ | running | -+
+ +-------------------------+
+
+In this case, we moved the guard to the ``enqueued`` state as an invariant.
+This means we not only forbid the occurrence of ``switch_in`` when ``clk`` is
+past the threshold but also mark the model as invalid if we are *still* in
+``enqueued`` after the threshold. This model is effectively in an invalid state
+as soon as a task is starving, rather than when the starving task finally runs.
+
+Hybrid Automaton in C
+---------------------
+
+The definition of hybrid automata in C is heavily based on the deterministic
+automata one. Specifically, we add the set of environment variables and the
+constraints (both guards on transitions and invariants on states) as follows.
+This is a combination of both iterations of the stall example::
+
+ /* enum representation of X (set of states) to be used as index */
+ enum states {
+ dequeued,
+ enqueued,
+ running,
+ state_max,
+ };
+
+ #define INVALID_STATE state_max
+
+ /* enum representation of E (set of events) to be used as index */
+ enum events {
+ dequeue,
+ enqueue,
+ switch_in,
+ event_max,
+ };
+
+ /* enum representation of V (set of environment variables) to be used as index */
+ enum envs {
+ clk,
+ env_max,
+ env_max_stored = env_max,
+ };
+
+ struct automaton {
+ char *state_names[state_max]; // X: the set of states
+ char *event_names[event_max]; // E: the finite set of events
+ char *env_names[env_max]; // V: the finite set of env vars
+ unsigned char function[state_max][event_max]; // f: transition function
+ unsigned char initial_state; // x_0: the initial state
+ bool final_states[state_max]; // X_m: the set of marked states
+ };
+
+ struct automaton aut = {
+ .state_names = {
+ "dequeued",
+ "enqueued",
+ "running",
+ },
+ .event_names = {
+ "dequeue",
+ "enqueue",
+ "switch_in",
+ },
+ .env_names = {
+ "clk",
+ },
+ .function = {
+ { INVALID_STATE, enqueued, INVALID_STATE },
+ { INVALID_STATE, INVALID_STATE, running },
+ { dequeued, INVALID_STATE, INVALID_STATE },
+ },
+ .initial_state = dequeued,
+ .final_states = { 1, 0, 0 },
+ };
+
+ static bool verify_constraint(enum states curr_state, enum events event,
+ enum states next_state)
+ {
+ bool res = true;
+
+ /* Validate guards as part of f */
+ if (curr_state == enqueued && event == switch_in)
+ res = get_env(clk) < threshold;
+ else if (curr_state == dequeued && event == enqueue)
+ reset_env(clk);
+
+ /* Validate invariants in i */
+ if (next_state == curr_state || !res)
+ return res;
+ if (next_state == enqueued)
+ ha_start_timer_jiffy(ha_mon, clk, threshold_jiffies);
+ else if (curr_state == enqueued)
+ res = !ha_cancel_timer(ha_mon);
+ return res;
+ }
+
+The function ``verify_constraint``, shown here in simplified form, checks guards,
+performs resets and starts timers to validate invariants according to the
+specification. These cannot easily be represented in the automaton struct.
+Due to the complex nature of environment variables, the user needs to provide
+functions to get and reset environment variables that are not common clocks
+(e.g. clocks with ns or jiffy granularity).
+Since invariants are only defined as clock expirations (e.g. *clk <
+threshold*), reaching the expiration of a timer armed when entering the state
+is in fact a failure in the model and triggers a reaction. Leaving the state
+stops the timer.
+
+It is important to note that timers implemented with hrtimers introduce
+overhead. If the monitor has several instances (e.g. all tasks), this can become
+an issue. The impact can be decreased using the timer wheel (``HA_TIMER_TYPE``
+set to ``HA_TIMER_WHEEL``). This lowers the responsiveness of the timer without
+damaging the accuracy of the model, since the invariant condition is checked
+before disabling the timer in case the callback is late.
+Alternatively, if the monitor is guaranteed to *eventually* leave the state and
+the incurred delay to wait for the next event is acceptable, guards can be used
+in place of invariants, as seen in the stall example.
+
+Graphviz .dot format
+--------------------
+
+The Graphviz representation of hybrid automata is also an extension of the
+deterministic automata one. Specifically, guards can be provided in the event
+name separated by ``;``::
+
+ "state_start" -> "state_dest" [ label = "sched_waking;preemptible==0;reset(clk)" ];
+
+Invariants can be specified in the state label (not the node name!) separated by ``\n``::
+
+ "enqueued" [label = "enqueued\nclk < threshold_jiffies"];
+
+Constraints can be specified as valid C comparisons and allow spaces. The first
+element of the comparison must be the clock, while the second is a numerical or
+parametrised value. Guards allow comparisons to be combined with boolean
+operations (``&&`` and ``||``). Resets must be separated from other constraints.
+
+This is the full example of the last version of the 'stall' model in DOT::
+
+ digraph state_automaton {
+ {node [shape = circle] "enqueued"};
+ {node [shape = plaintext, style=invis, label=""] "__init_dequeued"};
+ {node [shape = doublecircle] "dequeued"};
+ {node [shape = circle] "running"};
+ "__init_dequeued" -> "dequeued";
+ "enqueued" [label = "enqueued\nclk < threshold_jiffies"];
+ "running" [label = "running"];
+ "dequeued" [label = "dequeued"];
+ "enqueued" -> "running" [ label = "switch_in" ];
+ "running" -> "dequeued" [ label = "dequeue" ];
+ "dequeued" -> "enqueued" [ label = "enqueue;reset(clk)" ];
+ { rank = min ;
+ "__init_dequeued";
+ "dequeued";
+ }
+ }
+
+References
+----------
+
+One book covering model checking and timed automata is::
+
+ Christel Baier and Joost-Pieter Katoen: Principles of Model Checking,
+ The MIT Press, 2008.
+
+Hybrid automata are described in detail in::
+
+ Thomas Henzinger: The theory of hybrid automata,
+ Proceedings 11th Annual IEEE Symposium on Logic in Computer Science, 1996.
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index a2812ac5cfeb..ad298784bda2 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -9,6 +9,7 @@ Runtime Verification
runtime-verification.rst
deterministic_automata.rst
linear_temporal_logic.rst
+ hybrid_automata.rst
monitor_synthesis.rst
da_monitor_instrumentation.rst
monitor_wip.rst
diff --git a/Documentation/trace/rv/monitor_synthesis.rst b/Documentation/trace/rv/monitor_synthesis.rst
index cc5f97977a29..2c1b5a0ae154 100644
--- a/Documentation/trace/rv/monitor_synthesis.rst
+++ b/Documentation/trace/rv/monitor_synthesis.rst
@@ -18,8 +18,8 @@ functions that glue the monitor to the system reference model, and the
trace output as a reaction to event parsing and exceptions, as depicted
below::
- Linux +----- RV Monitor ----------------------------------+ Formal
- Realm | | Realm
+ Linux +---- RV Monitor ----------------------------------+ Formal
+ Realm | | Realm
+-------------------+ +----------------+ +-----------------+
| Linux kernel | | Monitor | | Reference |
| Tracing | -> | Instance(s) | <- | Model |
@@ -45,6 +45,7 @@ creating monitors. The header files are:
* rv/da_monitor.h for deterministic automaton monitor.
* rv/ltl_monitor.h for linear temporal logic monitor.
+ * rv/ha_monitor.h for hybrid automaton monitor.
rvgen
-----
@@ -252,6 +253,118 @@ the task, the monitor may need some time to start validating tasks which have
been running before the monitor is enabled. Therefore, it is recommended to
start the tasks of interest after enabling the monitor.
+rv/ha_monitor.h
++++++++++++++++
+
+The implementation of hybrid automaton monitors derives directly from the
+deterministic automaton one. Despite using a different header
+(``ha_monitor.h``) the functions to handle events are the same (e.g.
+``da_handle_event``).
+
+Additionally, the `rvgen` tool populates skeletons for
+``ha_verify_constraint``, ``ha_get_env`` and ``ha_reset_env`` based on the
+monitor specification in the monitor source file.
+
+``ha_verify_constraint`` is typically usable as generated by `rvgen`:
+
+* standard constraints on edges are turned into the form::
+
+ res = ha_get_env(ha_mon, ENV) < VALUE;
+
+* reset constraints are turned into the form::
+
+ ha_reset_env(ha_mon, ENV);
+
+* constraints on the state are implemented using timers
+
+ - armed before entering the state
+
+ - cancelled while entering any other state
+
+ - untouched if the state does not change as a result of the event
+
+ - checked if the timer expired but the callback did not run
+
+ - available implementations are `HA_TIMER_HRTIMER` and `HA_TIMER_WHEEL`
+
+ - hrtimers are more precise but may have higher overhead
+
+ - select by defining `HA_TIMER_TYPE` before including the header::
+
+ #define HA_TIMER_TYPE HA_TIMER_HRTIMER
+
+Constraint values can be specified in different forms:
+
+* literal value (with optional unit). E.g.::
+
+ preemptive == 0
+ clk < 100ns
+ threshold <= 10j
+
+* constant value (uppercase string). E.g.::
+
+ clk < MAX_NS
+
+* parameter (lowercase string). E.g.::
+
+ clk <= threshold_jiffies
+
+* macro (uppercase string with parentheses). E.g.::
+
+ clk < MAX_NS()
+
+* function (lowercase string with parentheses). E.g.::
+
+ clk <= threshold_jiffies()
+
+In all cases, `rvgen` will try to understand the type of the environment
+variable from the name or unit. For instance, constants or parameters
+terminating with ``_NS`` or ``_jiffies`` are intended as clocks with ns and jiffy
+granularity, respectively. Literals with measure unit `j` are jiffies and if a
+time unit is specified (`ns` to `s`), `rvgen` will convert the value to `ns`.
+
+Constants need to be defined by the user (but despite the name, they don't
+necessarily need to be defined as constants). Parameters get converted to
+module parameters and the user needs to provide a default value.
+Functions and macros are also defined by the user. By default they receive
+the ``ha_monitor`` as an argument. A common usage would be to get the required value
+from the target, e.g. the task in per-task monitors, using the helper
+``ha_get_target(ha_mon)``.
+
+If `rvgen` determines that the variable is a clock, it provides the getter and
+resetter based on the unit. Otherwise, the user needs to provide an appropriate
+definition.
+Typically non-clock environment variables are not reset. In such cases only the
+getter skeleton will be present in the file generated by `rvgen`.
+For instance, the getter for preemptive can be filled as::
+
+ static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs env)
+ {
+ if (env == preemptible)
+ return preempt_count() == 0;
+ return ENV_INVALID_VALUE;
+ }
+
+The function is supplied the ``ha_mon`` parameter in case some storage is
+required (as it is for clocks), but environment variables without reset do not
+require storage and can ignore that argument.
+The number of environment variables requiring storage is limited by
+``MAX_HA_ENV_LEN``; this limit does not apply to other variables.
+
+Finally, constraints on states are only valid for clocks and only if the
+constraint is of the form `clk < N`. This is because such constraints are
+implemented with the expiration of a timer.
+Typically the clock variables are reset just before arming the timer, but this
+doesn't have to be the case and the available functions take care of it.
+It is the responsibility of per-task monitors to make sure no timer is left
+running when the task exits.
+
+By default the generator implements timers with hrtimers (setting
+``HA_TIMER_TYPE`` to ``HA_TIMER_HRTIMER``). This gives better responsiveness
+but higher overhead. The timer wheel (``HA_TIMER_WHEEL``) is a good alternative
+for monitors with several instances (e.g. per-task) that achieves lower
+overhead with increased latency, yet without compromising precision.
+
Final remarks
-------------
--
2.53.0
^ permalink raw reply related
* [PATCH v2] Documentation/kernel-parameters: fix architecture alignment for pt, nopt, and nobypass
From: lirongqing @ 2026-03-30 10:59 UTC (permalink / raw)
To: Jonathan Corbet, Andrew Morton, Borislav Petkov, Randy Dunlap,
linux-doc, linux-kernel
Cc: Li RongQing, Shuah Khan, Peter Zijlstra, Feng Tang, Pawan Gupta,
Dapeng Mi, Kees Cook, Marco Elver, Paul E . McKenney, Askar Safin,
Bjorn Helgaas, Sohil Mehta
From: Li RongQing <lirongqing@baidu.com>
Commit ab0e7f20768a ("Documentation: Merge x86-specific boot options doc
into kernel-parameters.txt") introduced a formatting regression where
architecture tags were placed on separate lines with broken indentation.
This caused the 'nopt' [X86] parameter to appear as if it belonged to
the [PPC/POWERNV] section.
Furthermore, since the main 'iommu=' parameter heading already specifies
it is for [X86, EARLY], the subsequent standalone [X86] tags for 'pt',
'nopt', and the AMD GART options are redundant and clutter the
documentation.
Clean up the formatting by removing these redundant tags and properly
attributing the 'nobypass' option to [PPC/POWERNV].
Fixes: ab0e7f20768a ("Documentation: Merge x86-specific boot options doc into kernel-parameters.txt")
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Feng Tang <feng.tang@linux.alibaba.com>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Askar Safin <safinaskar@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sohil Mehta <sohil.mehta@intel.com>
---
Documentation/admin-guide/kernel-parameters.txt | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 03a5506..5253c23 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2615,15 +2615,11 @@ Kernel parameters
Intel machines). This can be used to prevent the usage
of an available hardware IOMMU.
- [X86]
pt
- [X86]
nopt
- [PPC/POWERNV]
- nobypass
+ nobypass [PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.
- [X86]
AMD Gart HW IOMMU-specific options:
<size>
--
2.9.4
^ permalink raw reply related
* [PATCH net-next v2 3/3] dpll: zl3073x: implement frequency monitoring
From: Ivan Vecera @ 2026-03-30 10:55 UTC (permalink / raw)
To: netdev
Cc: Vadim Fedorenko, Arkadiusz Kubalewski, Jiri Pirko,
Jonathan Corbet, Shuah Khan, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Donald Hunter,
Prathosh Satish, Petr Oros, linux-doc, linux-kernel
In-Reply-To: <20260330105505.715099-1-ivecera@redhat.com>
Extract common measurement latch logic from zl3073x_ref_ffo_update()
into a new zl3073x_ref_freq_meas_latch() helper and add
zl3073x_ref_freq_meas_update() that uses it to latch and read absolute
input reference frequencies in Hz.
Add meas_freq field to struct zl3073x_ref and the corresponding
zl3073x_ref_meas_freq_get() accessor. The measured frequencies are
updated periodically alongside the existing FFO measurements.
Add freq_monitor boolean to struct zl3073x_dpll and implement the
freq_monitor_set/get device callbacks to enable/disable frequency
monitoring via the DPLL netlink interface.
Implement measured_freq_get pin callback for input pins that returns the
measured input frequency in Hz.
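
For readers skimming the series, the change-detection step described above can be sketched in plain C. This is only an illustration under stated assumptions: `demo_ref`, `demo_pin` and `demo_measured_freq_check()` are hypothetical stand-ins and not the driver's types or functions. The idea is that the per-pin cache is compared against the last latched value, and nothing is reported while monitoring is disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-reference state (not the driver's). */
struct demo_ref {
	uint32_t meas_freq;	/* last latched frequency in Hz */
};

/* Hypothetical stand-in for the per-pin cache (not the driver's). */
struct demo_pin {
	uint32_t measured_freq;	/* value last reported to user space */
};

/*
 * Return true when the latched frequency differs from the cached one and
 * refresh the cache. Nothing is reported while monitoring is disabled.
 */
static bool demo_measured_freq_check(struct demo_pin *pin,
				     const struct demo_ref *ref,
				     bool freq_monitor)
{
	assert(pin && ref);

	if (!freq_monitor)
		return false;

	if (pin->measured_freq == ref->meas_freq)
		return false;

	pin->measured_freq = ref->meas_freq;
	return true;
}
```

Keeping the cached value in the pin means a notification is raised once per change rather than on every periodic pass.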
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
---
drivers/dpll/zl3073x/core.c | 88 +++++++++++++++++++++++++++++++------
drivers/dpll/zl3073x/dpll.c | 88 ++++++++++++++++++++++++++++++++++++-
drivers/dpll/zl3073x/dpll.h | 2 +
drivers/dpll/zl3073x/ref.h | 14 ++++++
4 files changed, 178 insertions(+), 14 deletions(-)
diff --git a/drivers/dpll/zl3073x/core.c b/drivers/dpll/zl3073x/core.c
index 6363002d48d46..320c199637efa 100644
--- a/drivers/dpll/zl3073x/core.c
+++ b/drivers/dpll/zl3073x/core.c
@@ -632,22 +632,21 @@ int zl3073x_ref_phase_offsets_update(struct zl3073x_dev *zldev, int channel)
}
/**
- * zl3073x_ref_ffo_update - update reference fractional frequency offsets
+ * zl3073x_ref_freq_meas_latch - latch reference frequency measurements
* @zldev: pointer to zl3073x_dev structure
+ * @type: measurement type (ZL_REF_FREQ_MEAS_CTRL_*)
*
- * The function asks device to update fractional frequency offsets latch
- * registers the latest measured values, reads and stores them into
+ * The function waits for the previous measurement to finish, selects all
+ * references and requests a new measurement of the given type.
*
* Return: 0 on success, <0 on error
*/
static int
-zl3073x_ref_ffo_update(struct zl3073x_dev *zldev)
+zl3073x_ref_freq_meas_latch(struct zl3073x_dev *zldev, u8 type)
{
- int i, rc;
+ int rc;
- /* Per datasheet we have to wait for 'ref_freq_meas_ctrl' to be zero
- * to ensure that the measured data are coherent.
- */
+ /* Wait for previous measurement to finish */
rc = zl3073x_poll_zero_u8(zldev, ZL_REG_REF_FREQ_MEAS_CTRL,
ZL_REF_FREQ_MEAS_CTRL);
if (rc)
@@ -663,15 +662,64 @@ zl3073x_ref_ffo_update(struct zl3073x_dev *zldev)
if (rc)
return rc;
- /* Request frequency offset measurement */
- rc = zl3073x_write_u8(zldev, ZL_REG_REF_FREQ_MEAS_CTRL,
- ZL_REF_FREQ_MEAS_CTRL_REF_FREQ_OFF);
+ /* Request measurement */
+ rc = zl3073x_write_u8(zldev, ZL_REG_REF_FREQ_MEAS_CTRL, type);
if (rc)
return rc;
/* Wait for finish */
- rc = zl3073x_poll_zero_u8(zldev, ZL_REG_REF_FREQ_MEAS_CTRL,
- ZL_REF_FREQ_MEAS_CTRL);
+ return zl3073x_poll_zero_u8(zldev, ZL_REG_REF_FREQ_MEAS_CTRL,
+ ZL_REF_FREQ_MEAS_CTRL);
+}
+
+/**
+ * zl3073x_ref_freq_meas_update - update measured input reference frequencies
+ * @zldev: pointer to zl3073x_dev structure
+ *
+ * The function asks device to latch measured input reference frequencies
+ * and stores the results in the ref state.
+ *
+ * Return: 0 on success, <0 on error
+ */
+static int
+zl3073x_ref_freq_meas_update(struct zl3073x_dev *zldev)
+{
+ int i, rc;
+
+ rc = zl3073x_ref_freq_meas_latch(zldev, ZL_REF_FREQ_MEAS_CTRL_REF_FREQ);
+ if (rc)
+ return rc;
+
+ /* Read measured frequencies in Hz (unsigned 32-bit, LSB = 1 Hz) */
+ for (i = 0; i < ZL3073X_NUM_REFS; i++) {
+ u32 value;
+
+ rc = zl3073x_read_u32(zldev, ZL_REG_REF_FREQ(i), &value);
+ if (rc)
+ return rc;
+
+ zldev->ref[i].meas_freq = value;
+ }
+
+ return 0;
+}
+
+/**
+ * zl3073x_ref_ffo_update - update reference fractional frequency offsets
+ * @zldev: pointer to zl3073x_dev structure
+ *
+ * The function asks the device to latch the latest measured fractional
+ * frequency offsets, then reads them and stores them in the ref state.
+ *
+ * Return: 0 on success, <0 on error
+ */
+static int
+zl3073x_ref_ffo_update(struct zl3073x_dev *zldev)
+{
+ int i, rc;
+
+ rc = zl3073x_ref_freq_meas_latch(zldev,
+ ZL_REF_FREQ_MEAS_CTRL_REF_FREQ_OFF);
if (rc)
return rc;
@@ -714,6 +762,20 @@ zl3073x_dev_periodic_work(struct kthread_work *work)
dev_warn(zldev->dev, "Failed to update phase offsets: %pe\n",
ERR_PTR(rc));
+ /* Update measured input reference frequencies if any DPLL has
+ * frequency monitoring enabled.
+ */
+ list_for_each_entry(zldpll, &zldev->dplls, list) {
+ if (zldpll->freq_monitor) {
+ rc = zl3073x_ref_freq_meas_update(zldev);
+ if (rc)
+ dev_warn(zldev->dev,
+ "Failed to update measured frequencies: %pe\n",
+ ERR_PTR(rc));
+ break;
+ }
+ }
+
/* Update references' fractional frequency offsets */
rc = zl3073x_ref_ffo_update(zldev);
if (rc)
diff --git a/drivers/dpll/zl3073x/dpll.c b/drivers/dpll/zl3073x/dpll.c
index a29f606318f6d..c44bfecf2c265 100644
--- a/drivers/dpll/zl3073x/dpll.c
+++ b/drivers/dpll/zl3073x/dpll.c
@@ -39,6 +39,7 @@
* @pin_state: last saved pin state
* @phase_offset: last saved pin phase offset
* @freq_offset: last saved fractional frequency offset
+ * @measured_freq: last saved measured frequency
*/
struct zl3073x_dpll_pin {
struct list_head list;
@@ -54,6 +55,7 @@ struct zl3073x_dpll_pin {
enum dpll_pin_state pin_state;
s64 phase_offset;
s64 freq_offset;
+ u32 measured_freq;
};
/*
@@ -202,6 +204,20 @@ zl3073x_dpll_input_pin_ffo_get(const struct dpll_pin *dpll_pin, void *pin_priv,
return 0;
}
+static int
+zl3073x_dpll_input_pin_measured_freq_get(const struct dpll_pin *dpll_pin,
+ void *pin_priv,
+ const struct dpll_device *dpll,
+ void *dpll_priv, u64 *measured_freq,
+ struct netlink_ext_ack *extack)
+{
+ struct zl3073x_dpll_pin *pin = pin_priv;
+
+ *measured_freq = pin->measured_freq;
+
+ return 0;
+}
+
static int
zl3073x_dpll_input_pin_frequency_get(const struct dpll_pin *dpll_pin,
void *pin_priv,
@@ -1116,6 +1132,35 @@ zl3073x_dpll_phase_offset_monitor_set(const struct dpll_device *dpll,
return 0;
}
+static int
+zl3073x_dpll_freq_monitor_get(const struct dpll_device *dpll,
+ void *dpll_priv,
+ enum dpll_feature_state *state,
+ struct netlink_ext_ack *extack)
+{
+ struct zl3073x_dpll *zldpll = dpll_priv;
+
+ if (zldpll->freq_monitor)
+ *state = DPLL_FEATURE_STATE_ENABLE;
+ else
+ *state = DPLL_FEATURE_STATE_DISABLE;
+
+ return 0;
+}
+
+static int
+zl3073x_dpll_freq_monitor_set(const struct dpll_device *dpll,
+ void *dpll_priv,
+ enum dpll_feature_state state,
+ struct netlink_ext_ack *extack)
+{
+ struct zl3073x_dpll *zldpll = dpll_priv;
+
+ zldpll->freq_monitor = (state == DPLL_FEATURE_STATE_ENABLE);
+
+ return 0;
+}
+
static const struct dpll_pin_ops zl3073x_dpll_input_pin_ops = {
.direction_get = zl3073x_dpll_pin_direction_get,
.esync_get = zl3073x_dpll_input_pin_esync_get,
@@ -1123,6 +1168,7 @@ static const struct dpll_pin_ops zl3073x_dpll_input_pin_ops = {
.ffo_get = zl3073x_dpll_input_pin_ffo_get,
.frequency_get = zl3073x_dpll_input_pin_frequency_get,
.frequency_set = zl3073x_dpll_input_pin_frequency_set,
+ .measured_freq_get = zl3073x_dpll_input_pin_measured_freq_get,
.phase_offset_get = zl3073x_dpll_input_pin_phase_offset_get,
.phase_adjust_get = zl3073x_dpll_input_pin_phase_adjust_get,
.phase_adjust_set = zl3073x_dpll_input_pin_phase_adjust_set,
@@ -1151,6 +1197,8 @@ static const struct dpll_device_ops zl3073x_dpll_device_ops = {
.phase_offset_avg_factor_set = zl3073x_dpll_phase_offset_avg_factor_set,
.phase_offset_monitor_get = zl3073x_dpll_phase_offset_monitor_get,
.phase_offset_monitor_set = zl3073x_dpll_phase_offset_monitor_set,
+ .freq_monitor_get = zl3073x_dpll_freq_monitor_get,
+ .freq_monitor_set = zl3073x_dpll_freq_monitor_set,
.supported_modes_get = zl3073x_dpll_supported_modes_get,
};
@@ -1593,6 +1641,39 @@ zl3073x_dpll_pin_ffo_check(struct zl3073x_dpll_pin *pin)
return false;
}
+/**
+ * zl3073x_dpll_pin_measured_freq_check - check for pin measured frequency change
+ * @pin: pin to check
+ *
+ * Check for the given pin's measured frequency change.
+ *
+ * Return: true on measured frequency change, false otherwise
+ */
+static bool
+zl3073x_dpll_pin_measured_freq_check(struct zl3073x_dpll_pin *pin)
+{
+ struct zl3073x_dpll *zldpll = pin->dpll;
+ struct zl3073x_dev *zldev = zldpll->dev;
+ const struct zl3073x_ref *ref;
+ u8 ref_id;
+
+ if (!zldpll->freq_monitor)
+ return false;
+
+ ref_id = zl3073x_input_pin_ref_get(pin->id);
+ ref = zl3073x_ref_state_get(zldev, ref_id);
+
+ if (pin->measured_freq != ref->meas_freq) {
+ dev_dbg(zldev->dev, "%s measured freq changed: %u -> %u\n",
+ pin->label, pin->measured_freq, ref->meas_freq);
+ pin->measured_freq = ref->meas_freq;
+
+ return true;
+ }
+
+ return false;
+}
+
/**
* zl3073x_dpll_changes_check - check for changes and send notifications
* @zldpll: pointer to zl3073x_dpll structure
@@ -1677,13 +1758,18 @@ zl3073x_dpll_changes_check(struct zl3073x_dpll *zldpll)
pin_changed = true;
}
- /* Check for phase offset and ffo change once per second */
+ /* Check for phase offset, ffo, and measured freq change
+ * once per second.
+ */
if (zldpll->check_count % 2 == 0) {
if (zl3073x_dpll_pin_phase_offset_check(pin))
pin_changed = true;
if (zl3073x_dpll_pin_ffo_check(pin))
pin_changed = true;
+
+ if (zl3073x_dpll_pin_measured_freq_check(pin))
+ pin_changed = true;
}
if (pin_changed)
diff --git a/drivers/dpll/zl3073x/dpll.h b/drivers/dpll/zl3073x/dpll.h
index 115ee4f67e7ab..434c32a7db123 100644
--- a/drivers/dpll/zl3073x/dpll.h
+++ b/drivers/dpll/zl3073x/dpll.h
@@ -15,6 +15,7 @@
* @id: DPLL index
* @check_count: periodic check counter
* @phase_monitor: is phase offset monitor enabled
+ * @freq_monitor: is frequency monitor enabled
* @ops: DPLL device operations for this instance
* @dpll_dev: pointer to registered DPLL device
* @tracker: tracking object for the acquired reference
@@ -28,6 +29,7 @@ struct zl3073x_dpll {
u8 id;
u8 check_count;
bool phase_monitor;
+ bool freq_monitor;
struct dpll_device_ops ops;
struct dpll_device *dpll_dev;
dpll_tracker tracker;
diff --git a/drivers/dpll/zl3073x/ref.h b/drivers/dpll/zl3073x/ref.h
index 06d8d4d97ea26..be16be20dbc7e 100644
--- a/drivers/dpll/zl3073x/ref.h
+++ b/drivers/dpll/zl3073x/ref.h
@@ -23,6 +23,7 @@ struct zl3073x_dev;
* @sync_ctrl: reference sync control
* @config: reference config
* @ffo: current fractional frequency offset
+ * @meas_freq: measured input frequency in Hz
* @mon_status: reference monitor status
*/
struct zl3073x_ref {
@@ -40,6 +41,7 @@ struct zl3073x_ref {
);
struct_group(stat, /* Status */
s64 ffo;
+ u32 meas_freq;
u8 mon_status;
);
};
@@ -68,6 +70,18 @@ zl3073x_ref_ffo_get(const struct zl3073x_ref *ref)
return ref->ffo;
}
+/**
+ * zl3073x_ref_meas_freq_get - get measured input frequency
+ * @ref: pointer to ref state
+ *
+ * Return: measured input frequency in Hz
+ */
+static inline u32
+zl3073x_ref_meas_freq_get(const struct zl3073x_ref *ref)
+{
+ return ref->meas_freq;
+}
+
/**
* zl3073x_ref_freq_get - get given input reference frequency
* @ref: pointer to ref state
--
2.52.0
^ permalink raw reply related
* [PATCH net-next v2 2/3] dpll: add frequency monitoring callback ops
From: Ivan Vecera @ 2026-03-30 10:55 UTC (permalink / raw)
To: netdev
Cc: Vadim Fedorenko, Arkadiusz Kubalewski, Jiri Pirko,
Jonathan Corbet, Shuah Khan, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Donald Hunter,
Prathosh Satish, Petr Oros, linux-doc, linux-kernel
In-Reply-To: <20260330105505.715099-1-ivecera@redhat.com>
Add new callback operations for a dpll device:
- freq_monitor_get(..) - to obtain current state of frequency monitor
feature from dpll device,
- freq_monitor_set(..) - to allow feature configuration.
Add new callback operation for a dpll pin:
- measured_freq_get(..) - to obtain the measured frequency in Hz.
Obtain the feature state value using the get callback and provide it to
the user if the device driver implements callbacks. The measured_freq_get
pin callback is only invoked when the frequency monitor is enabled.
Execute the set callback upon user requests.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
---
drivers/dpll/dpll_netlink.c | 90 +++++++++++++++++++++++++++++++++++++
include/linux/dpll.h | 10 +++++
2 files changed, 100 insertions(+)
diff --git a/drivers/dpll/dpll_netlink.c b/drivers/dpll/dpll_netlink.c
index 83cbd64abf5a4..abddcd4de28b6 100644
--- a/drivers/dpll/dpll_netlink.c
+++ b/drivers/dpll/dpll_netlink.c
@@ -175,6 +175,26 @@ dpll_msg_add_phase_offset_monitor(struct sk_buff *msg, struct dpll_device *dpll,
return 0;
}
+static int
+dpll_msg_add_freq_monitor(struct sk_buff *msg, struct dpll_device *dpll,
+ struct netlink_ext_ack *extack)
+{
+ const struct dpll_device_ops *ops = dpll_device_ops(dpll);
+ enum dpll_feature_state state;
+ int ret;
+
+ if (ops->freq_monitor_set && ops->freq_monitor_get) {
+ ret = ops->freq_monitor_get(dpll, dpll_priv(dpll),
+ &state, extack);
+ if (ret)
+ return ret;
+ if (nla_put_u32(msg, DPLL_A_FREQUENCY_MONITOR, state))
+ return -EMSGSIZE;
+ }
+
+ return 0;
+}
+
static int
dpll_msg_add_phase_offset_avg_factor(struct sk_buff *msg,
struct dpll_device *dpll,
@@ -400,6 +420,38 @@ static int dpll_msg_add_ffo(struct sk_buff *msg, struct dpll_pin *pin,
ffo);
}
+static int dpll_msg_add_measured_freq(struct sk_buff *msg, struct dpll_pin *pin,
+ struct dpll_pin_ref *ref,
+ struct netlink_ext_ack *extack)
+{
+ const struct dpll_device_ops *dev_ops = dpll_device_ops(ref->dpll);
+ const struct dpll_pin_ops *ops = dpll_pin_ops(ref);
+ struct dpll_device *dpll = ref->dpll;
+ enum dpll_feature_state state;
+ u64 measured_freq;
+ int ret;
+
+ if (!ops->measured_freq_get)
+ return 0;
+ if (dev_ops->freq_monitor_get) {
+ ret = dev_ops->freq_monitor_get(dpll, dpll_priv(dpll),
+ &state, extack);
+ if (ret)
+ return ret;
+ if (state == DPLL_FEATURE_STATE_DISABLE)
+ return 0;
+ }
+ ret = ops->measured_freq_get(pin, dpll_pin_on_dpll_priv(dpll, pin),
+ dpll, dpll_priv(dpll), &measured_freq, extack);
+ if (ret)
+ return ret;
+ if (nla_put_64bit(msg, DPLL_A_PIN_MEASURED_FREQUENCY,
+ sizeof(measured_freq), &measured_freq, DPLL_A_PIN_PAD))
+ return -EMSGSIZE;
+
+ return 0;
+}
+
static int
dpll_msg_add_pin_freq(struct sk_buff *msg, struct dpll_pin *pin,
struct dpll_pin_ref *ref, struct netlink_ext_ack *extack)
@@ -670,6 +722,9 @@ dpll_cmd_pin_get_one(struct sk_buff *msg, struct dpll_pin *pin,
if (ret)
return ret;
ret = dpll_msg_add_ffo(msg, pin, ref, extack);
+ if (ret)
+ return ret;
+ ret = dpll_msg_add_measured_freq(msg, pin, ref, extack);
if (ret)
return ret;
ret = dpll_msg_add_pin_esync(msg, pin, ref, extack);
@@ -722,6 +777,9 @@ dpll_device_get_one(struct dpll_device *dpll, struct sk_buff *msg,
if (ret)
return ret;
ret = dpll_msg_add_phase_offset_avg_factor(msg, dpll, extack);
+ if (ret)
+ return ret;
+ ret = dpll_msg_add_freq_monitor(msg, dpll, extack);
if (ret)
return ret;
@@ -948,6 +1006,32 @@ dpll_phase_offset_avg_factor_set(struct dpll_device *dpll, struct nlattr *a,
extack);
}
+static int
+dpll_freq_monitor_set(struct dpll_device *dpll, struct nlattr *a,
+ struct netlink_ext_ack *extack)
+{
+ const struct dpll_device_ops *ops = dpll_device_ops(dpll);
+ enum dpll_feature_state state = nla_get_u32(a), old_state;
+ int ret;
+
+ if (!(ops->freq_monitor_set && ops->freq_monitor_get)) {
+ NL_SET_ERR_MSG_ATTR(extack, a,
+ "dpll device not capable of frequency monitor");
+ return -EOPNOTSUPP;
+ }
+ ret = ops->freq_monitor_get(dpll, dpll_priv(dpll), &old_state,
+ extack);
+ if (ret) {
+ NL_SET_ERR_MSG(extack,
+ "unable to get current state of frequency monitor");
+ return ret;
+ }
+ if (state == old_state)
+ return 0;
+
+ return ops->freq_monitor_set(dpll, dpll_priv(dpll), state, extack);
+}
+
static int
dpll_pin_freq_set(struct dpll_pin *pin, struct nlattr *a,
struct netlink_ext_ack *extack)
@@ -1878,6 +1962,12 @@ dpll_set_from_nlattr(struct dpll_device *dpll, struct genl_info *info)
if (ret)
return ret;
break;
+ case DPLL_A_FREQUENCY_MONITOR:
+ ret = dpll_freq_monitor_set(dpll, a,
+ info->extack);
+ if (ret)
+ return ret;
+ break;
}
}
diff --git a/include/linux/dpll.h b/include/linux/dpll.h
index 2ce295b46b8cd..b7277a8b484d2 100644
--- a/include/linux/dpll.h
+++ b/include/linux/dpll.h
@@ -52,6 +52,12 @@ struct dpll_device_ops {
int (*phase_offset_avg_factor_get)(const struct dpll_device *dpll,
void *dpll_priv, u32 *factor,
struct netlink_ext_ack *extack);
+ int (*freq_monitor_set)(const struct dpll_device *dpll, void *dpll_priv,
+ enum dpll_feature_state state,
+ struct netlink_ext_ack *extack);
+ int (*freq_monitor_get)(const struct dpll_device *dpll, void *dpll_priv,
+ enum dpll_feature_state *state,
+ struct netlink_ext_ack *extack);
};
struct dpll_pin_ops {
@@ -110,6 +116,10 @@ struct dpll_pin_ops {
int (*ffo_get)(const struct dpll_pin *pin, void *pin_priv,
const struct dpll_device *dpll, void *dpll_priv,
s64 *ffo, struct netlink_ext_ack *extack);
+ int (*measured_freq_get)(const struct dpll_pin *pin, void *pin_priv,
+ const struct dpll_device *dpll,
+ void *dpll_priv, u64 *measured_freq,
+ struct netlink_ext_ack *extack);
int (*esync_set)(const struct dpll_pin *pin, void *pin_priv,
const struct dpll_device *dpll, void *dpll_priv,
u64 freq, struct netlink_ext_ack *extack);
--
2.52.0
^ permalink raw reply related
* [PATCH net-next v2 1/3] dpll: add frequency monitoring to netlink spec
From: Ivan Vecera @ 2026-03-30 10:55 UTC (permalink / raw)
To: netdev
Cc: Vadim Fedorenko, Arkadiusz Kubalewski, Jiri Pirko,
Jonathan Corbet, Shuah Khan, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Donald Hunter,
Prathosh Satish, Petr Oros, linux-doc, linux-kernel
In-Reply-To: <20260330105505.715099-1-ivecera@redhat.com>
Add DPLL_A_FREQUENCY_MONITOR device attribute to allow control over
the frequency monitor feature. The attribute uses the existing
dpll_feature_state enum (enable/disable) and is present in both
device-get reply and device-set request.
Add DPLL_A_PIN_MEASURED_FREQUENCY pin attribute to expose the measured
input frequency in Hz. The attribute is present in the pin-get reply.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
---
Documentation/driver-api/dpll.rst | 18 ++++++++++++++++++
Documentation/netlink/specs/dpll.yaml | 17 +++++++++++++++++
drivers/dpll/dpll_nl.c | 5 +++--
include/uapi/linux/dpll.h | 2 ++
4 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/Documentation/driver-api/dpll.rst b/Documentation/driver-api/dpll.rst
index 83118c728ed90..f38e7012e8c0a 100644
--- a/Documentation/driver-api/dpll.rst
+++ b/Documentation/driver-api/dpll.rst
@@ -250,6 +250,22 @@ in the ``DPLL_A_PIN_PHASE_OFFSET`` attribute.
``DPLL_A_PHASE_OFFSET_MONITOR`` attr state of a feature
=============================== ========================
+Frequency monitor
+=================
+
+Some DPLL devices may offer the capability to measure the actual
+frequency of all available input pins. The attribute and current feature state
+shall be included in the response message of the ``DPLL_CMD_DEVICE_GET``
+command for supported DPLL devices. In such cases, users can also control
+the feature using the ``DPLL_CMD_DEVICE_SET`` command by setting the
+``enum dpll_feature_state`` values for the attribute.
+Once enabled the measured input frequency for each input pin shall be
+returned in the ``DPLL_A_PIN_MEASURED_FREQUENCY`` attribute.
+
+ =============================== ========================
+ ``DPLL_A_FREQUENCY_MONITOR`` attr state of a feature
+ =============================== ========================
+
Embedded SYNC
=============
@@ -411,6 +427,8 @@ according to attribute purpose.
``DPLL_A_PIN_STATE`` attr state of pin on the parent
pin
``DPLL_A_PIN_CAPABILITIES`` attr bitmask of pin capabilities
+ ``DPLL_A_PIN_MEASURED_FREQUENCY`` attr measured frequency of
+ an input pin in Hz
==================================== ==================================
==================================== =================================
diff --git a/Documentation/netlink/specs/dpll.yaml b/Documentation/netlink/specs/dpll.yaml
index 3dd48a32f7837..80292b0f2b488 100644
--- a/Documentation/netlink/specs/dpll.yaml
+++ b/Documentation/netlink/specs/dpll.yaml
@@ -319,6 +319,13 @@ attribute-sets:
name: phase-offset-avg-factor
type: u32
doc: Averaging factor applied to calculation of reported phase offset.
+ -
+ name: frequency-monitor
+ type: u32
+ enum: feature-state
+ doc: Receive or request state of frequency monitor feature.
+ If enabled, dpll device shall measure all currently available
+ inputs for their actual input frequency.
-
name: pin
enum-name: dpll_a_pin
@@ -456,6 +463,13 @@ attribute-sets:
Value is in PPT (parts per trillion, 10^-12).
Note: This attribute provides higher resolution than the standard
fractional-frequency-offset (which is in PPM).
+ -
+ name: measured-frequency
+ type: u64
+ doc: |
+ The measured frequency of the input pin in Hz.
+ This is the actual frequency being received on the pin,
+ as measured by the dpll device hardware.
-
name: pin-parent-device
@@ -544,6 +558,7 @@ operations:
- type
- phase-offset-monitor
- phase-offset-avg-factor
+ - frequency-monitor
dump:
reply: *dev-attrs
@@ -563,6 +578,7 @@ operations:
- mode
- phase-offset-monitor
- phase-offset-avg-factor
+ - frequency-monitor
-
name: device-create-ntf
doc: Notification about device appearing
@@ -643,6 +659,7 @@ operations:
- esync-frequency-supported
- esync-pulse
- reference-sync
+ - measured-frequency
dump:
request:
diff --git a/drivers/dpll/dpll_nl.c b/drivers/dpll/dpll_nl.c
index a2b22d4921142..1e652340a5d73 100644
--- a/drivers/dpll/dpll_nl.c
+++ b/drivers/dpll/dpll_nl.c
@@ -43,11 +43,12 @@ static const struct nla_policy dpll_device_get_nl_policy[DPLL_A_ID + 1] = {
};
/* DPLL_CMD_DEVICE_SET - do */
-static const struct nla_policy dpll_device_set_nl_policy[DPLL_A_PHASE_OFFSET_AVG_FACTOR + 1] = {
+static const struct nla_policy dpll_device_set_nl_policy[DPLL_A_FREQUENCY_MONITOR + 1] = {
[DPLL_A_ID] = { .type = NLA_U32, },
[DPLL_A_MODE] = NLA_POLICY_RANGE(NLA_U32, 1, 2),
[DPLL_A_PHASE_OFFSET_MONITOR] = NLA_POLICY_MAX(NLA_U32, 1),
[DPLL_A_PHASE_OFFSET_AVG_FACTOR] = { .type = NLA_U32, },
+ [DPLL_A_FREQUENCY_MONITOR] = NLA_POLICY_MAX(NLA_U32, 1),
};
/* DPLL_CMD_PIN_ID_GET - do */
@@ -115,7 +116,7 @@ static const struct genl_split_ops dpll_nl_ops[] = {
.doit = dpll_nl_device_set_doit,
.post_doit = dpll_post_doit,
.policy = dpll_device_set_nl_policy,
- .maxattr = DPLL_A_PHASE_OFFSET_AVG_FACTOR,
+ .maxattr = DPLL_A_FREQUENCY_MONITOR,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
{
diff --git a/include/uapi/linux/dpll.h b/include/uapi/linux/dpll.h
index de0005f28e5c5..2e75d547716ea 100644
--- a/include/uapi/linux/dpll.h
+++ b/include/uapi/linux/dpll.h
@@ -218,6 +218,7 @@ enum dpll_a {
DPLL_A_CLOCK_QUALITY_LEVEL,
DPLL_A_PHASE_OFFSET_MONITOR,
DPLL_A_PHASE_OFFSET_AVG_FACTOR,
+ DPLL_A_FREQUENCY_MONITOR,
__DPLL_A_MAX,
DPLL_A_MAX = (__DPLL_A_MAX - 1)
@@ -254,6 +255,7 @@ enum dpll_a_pin {
DPLL_A_PIN_REFERENCE_SYNC,
DPLL_A_PIN_PHASE_ADJUST_GRAN,
DPLL_A_PIN_FRACTIONAL_FREQUENCY_OFFSET_PPT,
+ DPLL_A_PIN_MEASURED_FREQUENCY,
__DPLL_A_PIN_MAX,
DPLL_A_PIN_MAX = (__DPLL_A_PIN_MAX - 1)
--
2.52.0
^ permalink raw reply related
* [PATCH net-next v2 0/3] dpll: add frequency monitoring feature
From: Ivan Vecera @ 2026-03-30 10:55 UTC (permalink / raw)
To: netdev
Cc: Vadim Fedorenko, Arkadiusz Kubalewski, Jiri Pirko,
Jonathan Corbet, Shuah Khan, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Donald Hunter,
Prathosh Satish, Petr Oros, linux-doc, linux-kernel
This series adds support for monitoring the measured input frequency
of DPLL input pins via the DPLL netlink interface.
Some DPLL devices can measure the actual frequency being received on
input pins. The approach mirrors the existing phase-offset-monitor
feature: a device-level attribute (DPLL_A_FREQUENCY_MONITOR) enables
or disables monitoring, and a per-pin attribute
(DPLL_A_PIN_MEASURED_FREQUENCY) exposes the measured frequency in Hz
when monitoring is enabled.
Patch 1 adds the new attributes to the DPLL netlink spec (dpll.yaml),
regenerates the auto-generated UAPI header and netlink policy, and
updates Documentation/driver-api/dpll.rst.
Patch 2 adds the callback operations (freq_monitor_get/set for
devices, measured_freq_get for pins) and the corresponding netlink
GET/SET handlers in the DPLL core. The core only invokes
measured_freq_get when the frequency monitor is enabled on the parent
device.
Patch 3 implements the feature in the ZL3073x driver by extracting
a common measurement latch helper from the existing FFO update path,
adding a frequency measurement function, and wiring up the new
callbacks.
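The gating described for patch 2 can be modelled in plain userspace C. This is only a sketch under stated assumptions: the `model_*` types and functions are hypothetical stand-ins, not the DPLL core API, and the `present` flag stands for "the attribute would be added to the pin-get reply".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real DPLL API. */
enum model_feature_state {
	MODEL_FEATURE_STATE_DISABLE,
	MODEL_FEATURE_STATE_ENABLE,
};

struct model_dev {
	enum model_feature_state freq_monitor;	/* device-level switch */
};

struct model_pin {
	uint64_t measured_hz;	/* value the hardware would report */
	unsigned int reads;	/* counts pin callback invocations */
};

/* Models the per-pin measured_freq_get callback. */
static int model_measured_freq_get(struct model_pin *pin, uint64_t *hz)
{
	pin->reads++;
	*hz = pin->measured_hz;
	return 0;
}

/*
 * Returns 0 on success. *present tells whether the measured frequency
 * would be reported. A disabled monitor is not an error: the attribute
 * is skipped and the pin callback is never invoked.
 */
static int model_add_measured_freq(const struct model_dev *dev,
				   struct model_pin *pin,
				   uint64_t *hz, bool *present)
{
	int ret;

	*present = false;
	if (dev->freq_monitor == MODEL_FEATURE_STATE_DISABLE)
		return 0;

	ret = model_measured_freq_get(pin, hz);
	if (ret)
		return ret;

	*present = true;
	return 0;
}
```

With the monitor disabled the call succeeds, leaves `present` false and does not touch the pin callback; once enabled, the same call reports the pin's frequency.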
Changes v1 -> v2:
- Renamed actual-frequency to measured-frequency (Vadim)
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Ivan Vecera (3):
dpll: add frequency monitoring to netlink spec
dpll: add frequency monitoring callback ops
dpll: zl3073x: implement frequency monitoring
Documentation/driver-api/dpll.rst | 18 ++++++
Documentation/netlink/specs/dpll.yaml | 17 +++++
drivers/dpll/dpll_netlink.c | 90 +++++++++++++++++++++++++++
drivers/dpll/dpll_nl.c | 5 +-
drivers/dpll/zl3073x/core.c | 88 ++++++++++++++++++++++----
drivers/dpll/zl3073x/dpll.c | 88 +++++++++++++++++++++++++-
drivers/dpll/zl3073x/dpll.h | 2 +
drivers/dpll/zl3073x/ref.h | 14 +++++
include/linux/dpll.h | 10 +++
include/uapi/linux/dpll.h | 2 +
10 files changed, 318 insertions(+), 16 deletions(-)
--
2.52.0
^ permalink raw reply
* Re: [PATCH net-next 2/3] dpll: add actual frequency monitoring callback ops
From: Vadim Fedorenko @ 2026-03-30 10:27 UTC (permalink / raw)
To: Ivan Vecera, netdev
Cc: Arkadiusz Kubalewski, Jiri Pirko, Jonathan Corbet, Shuah Khan,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Donald Hunter, Prathosh Satish, Petr Oros,
linux-doc, linux-kernel
In-Reply-To: <a341fd52-3682-473f-8116-a8323bf846ee@redhat.com>
On 30/03/2026 09:52, Ivan Vecera wrote:
> On 3/26/26 6:48 PM, Ivan Vecera wrote:
>> On 3/26/26 12:21 PM, Vadim Fedorenko wrote:
>>> On 25/03/2026 19:39, Ivan Vecera wrote:
>>>
>>>> +static int dpll_msg_add_actual_freq(struct sk_buff *msg, struct
>>>> dpll_pin *pin,
>>>> + struct dpll_pin_ref *ref,
>>>> + struct netlink_ext_ack *extack)
>>>> +{
>>>> + const struct dpll_device_ops *dev_ops = dpll_device_ops(ref-
>>>> >dpll);
>>>> + const struct dpll_pin_ops *ops = dpll_pin_ops(ref);
>>>> + struct dpll_device *dpll = ref->dpll;
>>>> + enum dpll_feature_state state;
>>>> + u64 actual_freq;
>>>> + int ret;
>>>> +
>>>> + if (!ops->actual_freq_get)
>>>> + return 0;
>>>> + if (dev_ops->freq_monitor_get) {
>>>> + ret = dev_ops->freq_monitor_get(dpll, dpll_priv(dpll),
>>>> + &state, extack);
>>>> + if (ret)
>>>> + return ret;
>>>> + if (state == DPLL_FEATURE_STATE_DISABLE)
>>>> + return 0;
>>>
>>> I think we have to signal back to user that frequency monitoring is
>>> disabled via extack.
>>
>> Hi Vadim,
>>
>> This would break pin-get operation... Do or dump pin-get operation would
>> fail with this extack message.
>>
>> Here we can check if the freq-monitoring is enabled and conditionally
>> call actual_freq_get() or measured_freq_get()
Let's do it this way; we don't want to waste even a small amount of
resources on a useless call.
>>
>> -or-
>>
>> Call this callback unconditionally and check for return code and if a
>> driver returns e.g. -ENODATA then skip nla_put_64bit() but return
>> success.
>>
>> WDYT?
>
> Vadim?
Sorry, was AFK this Friday/weekend
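For comparison, the second option quoted above (call the callback unconditionally and treat -ENODATA as "nothing to report") might look roughly like the userspace sketch below. All names here are hypothetical and only model the control flow; this is not the variant that was chosen.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback type: returns -ENODATA when nothing is measured. */
typedef int (*model_freq_get_fn)(void *priv, uint64_t *hz);

/*
 * Always calls the driver callback. -ENODATA means "skip the attribute
 * but keep the pin-get reply successful"; any other error is propagated.
 */
static int model_add_measured_freq_enodata(model_freq_get_fn get, void *priv,
					   uint64_t *hz, bool *present)
{
	int ret;

	*present = false;
	if (!get)
		return 0;

	ret = get(priv, hz);
	if (ret == -ENODATA)
		return 0;
	if (ret)
		return ret;

	*present = true;
	return 0;
}

/* Example driver callbacks for the two interesting cases. */
static int model_get_no_data(void *priv, uint64_t *hz)
{
	(void)priv;
	(void)hz;
	return -ENODATA;
}

static int model_get_25mhz(void *priv, uint64_t *hz)
{
	(void)priv;
	*hz = 25000000ULL;
	return 0;
}
```

The difference from the conditional approach is that the driver callback always runs, even when monitoring is off.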
^ permalink raw reply
* Re: [PATCH] iommu/amd: add amd_iommu=relax_unity option for VFIO passthrough
From: Vasant Hegde @ 2026-03-30 10:02 UTC (permalink / raw)
To: Christos Longros, Joerg Roedel
Cc: Suravee Suthikulpanit, Will Deacon, Robin Murphy, Jonathan Corbet,
Shuah Khan, iommu, linux-doc, linux-kernel
In-Reply-To: <20260328213228.12084-1-chris.longros@gmail.com>
Hi Christos,
On 3/29/2026 3:02 AM, Christos Longros wrote:
> On some AMD motherboards (Gigabyte B650 Gaming X AX V2, X870E and
> others), VFIO passthrough of any PCI device fails with:
>
> "Firmware has requested this device have a 1:1 IOMMU mapping,
> rejecting configuring the device without a 1:1 mapping."
>
> These boards' IVRS tables include IVMD type 0x22 (range) entries
> spanning wide device ranges (e.g. devid 0x0000 to 0x0FFF, covering
> PCI buses 0-15). The entries exist for platform devices like IOAPIC
> and HPET, but they get applied to nearly every IOMMU group on the
> system. Since commit a48ce36e2786 ("iommu: Prevent RESV_DIRECT
> devices from blocking domains"), any device with IOMMU_RESV_DIRECT
> regions has require_direct=1 set, which prevents VFIO from claiming
> DMA ownership.
I don't have a client system handy to verify. Do you have an ACPI dump?
I want to see which IVMD flags are set.
>
> No PCI device can be passed through on affected boards -- not just
> the platform devices that need the identity mappings, but also
> endpoint devices like network adapters and GPUs.
>
> Intel handles a similar firmware over-specification with
> device_rmrr_is_relaxable(), which marks certain RMRR entries as
> IOMMU_RESV_DIRECT_RELAXABLE so VFIO can claim them. AMD has no
> equivalent.
Maybe we can do something similar for PCI devices instead of adding a command-line option?
-Vasant
^ permalink raw reply
* Re: [PATCH] checkpatch: allow correctly handle full files on stdin
From: Vlastimil Babka @ 2026-03-30 9:40 UTC (permalink / raw)
To: Joe Perches, Dmitry Torokhov
Cc: Dwaipayan Ray, Lukas Bulwahn, Andy Whitcroft, Jonathan Corbet,
Shuah Khan, workflows, linux-doc, linux-kernel
In-Reply-To: <f4dcaecb682c4eaa271abfee27c7cc8f6fbf7d1d.camel@perches.com>
On 3/27/26 00:19, Joe Perches wrote:
> On Thu, 2026-03-26 at 16:04 -0700, Dmitry Torokhov wrote:
>> In all seriousness, if you will not make use of this mode it's fine. But
>> it allows keeping the source cleaner as one makes edits, so why not
>> enable this?
>
> Unnecessary complication.
Are you maintaining a tool that you want to be useful to others, or to only
do strictly what you personally think is necessary?
I don't understand your objections to this rather straightforward patch, it
doesn't look like a maintenance burden to me?
^ permalink raw reply
* RE: [PATCH v5 3/4] iio: adc: ad4691: add triggered buffer support
From: Sabau, Radu bogdan @ 2026-03-30 9:01 UTC (permalink / raw)
To: Nuno Sá, Lars-Peter Clausen, Hennerich, Michael,
Jonathan Cameron, David Lechner, Sa, Nuno, Andy Shevchenko,
Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Uwe Kleine-König, Liam Girdwood, Mark Brown, Linus Walleij,
Bartosz Golaszewski, Philipp Zabel, Jonathan Corbet, Shuah Khan
Cc: linux-iio@vger.kernel.org, devicetree@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pwm@vger.kernel.org,
linux-gpio@vger.kernel.org, linux-doc@vger.kernel.org
In-Reply-To: <324e16aba4029ec4679f7b141c5f58e8929a0da3.camel@gmail.com>
> -----Original Message-----
> From: Nuno Sá <noname.nuno@gmail.com>
> Sent: Saturday, March 28, 2026 11:44 AM
> To: Sabau, Radu bogdan <Radu.Sabau@analog.com>; Lars-Peter Clausen
...
);
> >
> > - return 0;
> > + if (st->manual_mode)
> > + return 0;
> > +
> > + for (gp_num = 0; gp_num < ARRAY_SIZE(ad4691_gp_names);
> gp_num++) {
> > + if (fwnode_irq_get_byname(dev_fwnode(dev),
> > + ad4691_gp_names[gp_num]) > 0)
>
> Don't love this line break. I'm also a bit puzzled. How does the above differs
> from
> the trigger code? I guess this should be the same GP pin?
>
>
Hi Nuno,
You are right - both loops scan ad4691_gp_names[] independently and must
agree by construction, which is fragile. While gpio_setup() needs to happen,
there is no reason for it to live in ad4691_config(), since setup_triggered_buffer
is only called for the non-offload path and already iterates gp_names[] to find
the IRQ. So we can call gpio_setup right there and drop the lookup from
ad4691_config() entirely, and no need for those early returns either.
Thanks,
Radu
^ permalink raw reply
* Re: [PATCH v8 2/3] hwmon: ltc4283: Add support for the LTC4283 Swap Controller
From: Nuno Sá @ 2026-03-30 9:28 UTC (permalink / raw)
To: Nuno Sá
Cc: linux-gpio, linux-hwmon, devicetree, linux-doc, Guenter Roeck,
Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
Linus Walleij, Bartosz Golaszewski
In-Reply-To: <20260327-ltc4283-support-v8-2-471de255d728@analog.com>
Hi Guenter, regarding the AI review, I think most of the points were
discussed in previous revisions, but two of them are valid.
On Fri, Mar 27, 2026 at 05:26:15PM +0000, Nuno Sá wrote:
> Support the LTC4283 Hot Swap Controller. The device features programmable
> current limit with foldback and independently adjustable inrush current to
> optimize the MOSFET safe operating area (SOA). The SOA timer limits MOSFET
> temperature rise for reliable protection against overstresses.
>
> An I2C interface and onboard ADC allow monitoring of board current,
> voltage, power, energy, and fault status.
>
> Signed-off-by: Nuno Sá <nuno.sa@analog.com>
> ---
> Documentation/hwmon/index.rst | 1 +
> Documentation/hwmon/ltc4283.rst | 266 ++++++
> MAINTAINERS | 1 +
> drivers/hwmon/Kconfig | 12 +
> drivers/hwmon/Makefile | 1 +
> drivers/hwmon/ltc4283.c | 1796 +++++++++++++++++++++++++++++++++++++++
> 6 files changed, 2077 insertions(+)
>
...
> +static int ltc4283_read_in_alarm(struct ltc4283_hwmon *st, u32 channel,
> + bool max_alm, long *val)
> +{
> + if (channel == LTC4283_VPWR)
> + return ltc4283_read_alarm(st, LTC4283_ADC_ALM_LOG_1,
> + BIT(2 + max_alm), val);
> +
> + if (channel >= LTC4283_CHAN_ADI_1 && channel <= LTC4283_CHAN_ADI_4) {
> + u32 bit = (channel - LTC4283_CHAN_ADI_1) * 2;
> + /*
> + * Lower channels go to higher bits. We also want to go +1 down
> + * in the min_alarm case.
> + */
> + return ltc4283_read_alarm(st, LTC4283_ADC_ALM_LOG_2,
> + BIT(7 - bit - !max_alm), val);
> + }
> +
> + if (channel >= LTC4283_CHAN_ADIO_1 && channel <= LTC4283_CHAN_ADIO_4) {
> + u32 bit = (channel - LTC4283_CHAN_ADIO_1) * 2;
> +
> + return ltc4283_read_alarm(st, LTC4283_ADC_ALM_LOG_3,
> + BIT(7 - bit - !max_alm), val);
> + }
> +
> + if (channel >= LTC4283_CHAN_ADIN12 && channel <= LTC4283_CHAN_ADIN34) {
> + u32 bit = (channel - LTC4283_CHAN_ADIN12) * 2;
> +
> + return ltc4283_read_alarm(st, LTC4283_ADC_ALM_LOG_5,
> + BIT(7 - bit - !max_alm), val);
> + }
"Will this condition handle the ADIO12 and ADIO34 differential channels?
It looks like channels 14 and 15 fall through to the default return intended
for the DRAIN channel. Since reading the alarm implicitly clears the register
bits, could reading these ADIO alarms unintentionally clear actual DRAIN
alarms? Should the upper bound be LTC4283_CHAN_ADIO34?"
Good catch and should be:
- if (channel >= LTC4283_CHAN_ADIN12 && channel <= LTC4283_CHAN_ADIN34) {
+ if (channel >= LTC4283_CHAN_ADIN12 && channel <= LTC4283_CHAN_ADIO34) {
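The bit arithmetic used by each four-channel group above can be checked in isolation. The helper below is a userspace sketch, not driver code: `model_alarm_mask` and its `offset` parameter (zero-based position of the channel within its group) are hypothetical names, and `BIT()` is redefined locally.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1U << (n))

/*
 * Each channel owns two adjacent alarm bits. Lower channels sit in the
 * higher bits, and within a pair the max alarm is the upper bit while
 * the min alarm is one position below it.
 */
static uint32_t model_alarm_mask(uint32_t offset, bool max_alm)
{
	uint32_t bit = offset * 2;

	return BIT(7 - bit - !max_alm);
}
```

For offsets 0..3 this yields the pairs (bit 7, bit 6), (5, 4), (3, 2) and (1, 0) for (max, min). A channel that is left out of the range check never reaches this mapping and instead falls through to the default return at the end of the function.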
> +
> + if (channel == LTC4283_CHAN_DRNS)
> + return ltc4283_read_alarm(st, LTC4283_ADC_ALM_LOG_4,
> + BIT(6 + max_alm), val);
> +
> + return ltc4283_read_alarm(st, LTC4283_ADC_ALM_LOG_4, BIT(4 + max_alm),
> + val);
> +}
...
> +
> +static int ltc4283_probe(struct i2c_client *client)
> +{
> + struct device *dev = &client->dev, *hwmon;
> + struct auxiliary_device *adev;
> + struct ltc4283_hwmon *st;
> + int ret;
> +
> + st = devm_kzalloc(dev, sizeof(*st), GFP_KERNEL);
> + if (!st)
> + return -ENOMEM;
> +
> + if (!i2c_check_functionality(client->adapter,
> + I2C_FUNC_SMBUS_BYTE_DATA |
> + I2C_FUNC_SMBUS_WORD_DATA |
> + I2C_FUNC_SMBUS_READ_I2C_BLOCK))
> + return -EOPNOTSUPP;
> +
> + st->client = client;
> + st->map = devm_regmap_init(dev, &ltc4283_regmap_bus, client,
> + &ltc4283_regmap_config);
> + if (IS_ERR(st->map))
> + return dev_err_probe(dev, PTR_ERR(st->map),
> + "Failed to create regmap\n");
> +
> + ret = ltc4283_setup(st, dev);
> + if (ret)
> + return ret;
> +
> + hwmon = devm_hwmon_device_register_with_info(dev, "ltc4283", st,
> + &ltc4283_chip_info, NULL);
> +
> + if (IS_ERR(hwmon))
> + return PTR_ERR(hwmon);
> +
> + ltc4283_debugfs_init(st, client);
> +
> + if (!st->gpio_mask)
> + return 0;
> +
> + adev = devm_auxiliary_device_create(dev, "gpio", &st->gpio_mask);
> + if (!adev)
> + return dev_err_probe(dev, -ENODEV, "Failed to add GPIO device\n");
"Does this allow multiple LTC4283 chips to probe successfully?
Without allocating a unique ID per I2C instance, it seems the first probed
chip takes the generic name. If a second chip is present, it might attempt
to register with the exact same name, resulting in a failure in device_add()
and aborting the probe."
This one also looks valid, and I suspect a quick look would find more
"offenders" of the same kind. I would propose:
- adev = devm_auxiliary_device_create(dev, "gpio", &st->gpio_mask);
+ adev = __devm_auxiliary_device_create(dev, KBUILD_MODNAME, "gpio",
+ &st->gpio_mask, client->addr);
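To illustrate why the id matters, here is a small userspace model of the name clash. It assumes auxiliary device names are formed as "<modname>.<devname>.<id>"; the `model_*` helpers are hypothetical and only build and compare strings.

```c
#include <stdio.h>
#include <string.h>

/* Builds "<modname>.<devname>.<id>", the naming scheme assumed here. */
static void model_aux_name(char *buf, size_t len, const char *modname,
			   const char *devname, int id)
{
	snprintf(buf, len, "%s.%s.%d", modname, devname, id);
}

/* Returns 1 when two instances would end up with the same device name. */
static int model_aux_names_collide(const char *modname, const char *devname,
				   int id_a, int id_b)
{
	char a[64], b[64];

	model_aux_name(a, sizeof(a), modname, devname, id_a);
	model_aux_name(b, sizeof(b), modname, devname, id_b);
	return strcmp(a, b) == 0;
}
```

Two chips that both register with id 0 collide, while using distinct I2C addresses as ids keeps the names apart. In this model, chips sharing the same address on different adapters would still produce the same name, so that case may need a closer look.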
If there's nothing else and you agree with the above, is this something
you can tweak while applying or should I spin a new version?
Thanks!
- Nuno Sá
^ permalink raw reply
* Re: [PATCH v11 10/11] arm64: kexec: Add support for crashkernel CMA reservation
From: Breno Leitao @ 2026-03-30 9:13 UTC (permalink / raw)
To: Jinjie Ruan
Cc: corbet, skhan, catalin.marinas, will, chenhuacai, kernel, maddy,
mpe, npiggin, chleroy, pjw, palmer, aou, alex, tglx, mingo, bp,
dave.hansen, hpa, robh, saravanak, akpm, bhe, vgoyal, dyoung,
rdunlap, peterz, feng.tang, pawan.kumar.gupta, dapeng1.mi, kees,
elver, paulmck, lirongqing, rppt, ardb, cfsworks, osandov, jbohac,
tangyouling, sourabhjain, ritesh.list, eajames, songshuaishuai,
kevin.brodsky, vishal.moola, junhui.liu, coxu, fuqiang.wang,
liaoyuanhong, guoren, chenjiahao16, hbathini, takahiro.akashi,
james.morse, lizhengyu3, x86, linux-doc, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
devicetree, kexec
In-Reply-To: <20260328074013.3589544-11-ruanjinjie@huawei.com>
On Sat, Mar 28, 2026 at 03:40:12PM +0800, Jinjie Ruan wrote:
> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the
> crashkernel= command line option") and commit ab475510e042 ("kdump:
> implement reserve_crashkernel_cma") added CMA support for kdump
> crashkernel reservation.
>
> Crash kernel memory reservation wastes production resources if too
> large, risks kdump failure if too small, and faces allocation difficulties
> on fragmented systems due to contiguous block constraints. The new
> CMA-based crashkernel reservation scheme splits the "large fixed
> reservation" into a "small fixed region + large CMA dynamic region": the
> CMA memory is available to userspace during normal operation to avoid
> waste, and is reclaimed for kdump upon crash—saving memory while
> improving reliability.
>
> So extend crashkernel CMA reservation support to arm64. The following
> changes are made to enable CMA reservation:
>
> - Parse and obtain the CMA reservation size along with other crashkernel
> parameters.
> - Call reserve_crashkernel_cma() to allocate the CMA region for kdump.
> - Include the CMA-reserved ranges for kdump kernel to use.
> - Exclude the CMA-reserved ranges from the crash kernel memory to
> prevent them from being exported through /proc/vmcore, which is already
> done in the crash core.
>
> Update kernel-parameters.txt to document CMA support for crashkernel on
> arm64 architecture.
>
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
> Acked-by: Rob Herring (Arm) <robh@kernel.org>
> Acked-by: Baoquan He <bhe@redhat.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Acked-by: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Tested-by: Breno Leitao <leitao@debian.org>
^ permalink raw reply
* Re: [PATCH net-next 2/3] dpll: add actual frequency monitoring callback ops
From: Ivan Vecera @ 2026-03-30 8:52 UTC (permalink / raw)
To: Vadim Fedorenko, netdev
Cc: Arkadiusz Kubalewski, Jiri Pirko, Jonathan Corbet, Shuah Khan,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Donald Hunter, Prathosh Satish, Petr Oros,
linux-doc, linux-kernel
In-Reply-To: <e4372a58-13ba-4581-af4e-28d0c880c7bc@redhat.com>
On 3/26/26 6:48 PM, Ivan Vecera wrote:
> On 3/26/26 12:21 PM, Vadim Fedorenko wrote:
>> On 25/03/2026 19:39, Ivan Vecera wrote:
>>
>>> +static int dpll_msg_add_actual_freq(struct sk_buff *msg, struct
>>> dpll_pin *pin,
>>> + struct dpll_pin_ref *ref,
>>> + struct netlink_ext_ack *extack)
>>> +{
>>> + const struct dpll_device_ops *dev_ops = dpll_device_ops(ref->dpll);
>>> + const struct dpll_pin_ops *ops = dpll_pin_ops(ref);
>>> + struct dpll_device *dpll = ref->dpll;
>>> + enum dpll_feature_state state;
>>> + u64 actual_freq;
>>> + int ret;
>>> +
>>> + if (!ops->actual_freq_get)
>>> + return 0;
>>> + if (dev_ops->freq_monitor_get) {
>>> + ret = dev_ops->freq_monitor_get(dpll, dpll_priv(dpll),
>>> + &state, extack);
>>> + if (ret)
>>> + return ret;
>>> + if (state == DPLL_FEATURE_STATE_DISABLE)
>>> + return 0;
>>
>> I think we have to signal back to user that frequency monitoring is
>> disabled via extack.
>
> Hi Vadim,
>
> This would break pin-get operation... Do or dump pin-get operation would
> fail with this extack message.
>
> Here we can check if the freq-monitoring is enabled and conditionally
> call actual_freq_get() or measured_freq_get()
>
> -or-
>
> Call this callback unconditionally and check for return code and if a
> driver returns e.g. -ENODATA then skip nla_put_64bit() but return
> success.
>
> WDYT?
Vadim?
^ permalink raw reply
* Re: [PATCH v2] Docs: iio: ad7191 Correct clock configuration
From: Geert Uytterhoeven @ 2026-03-30 8:09 UTC (permalink / raw)
To: Andy Shevchenko
Cc: Ammar Mustafa, Alisa-Dariana Roman, Jonathan Cameron,
David Lechner, Nuno Sá, Andy Shevchenko, Jonathan Corbet,
Shuah Khan, linux-iio, linux-doc, linux-kernel
In-Reply-To: <aaLIhgJjrNlp3oTy@ashevche-desk.local>
On Sat, 28 Feb 2026 at 11:51, Andy Shevchenko
<andriy.shevchenko@intel.com> wrote:
> On Fri, Feb 27, 2026 at 02:08:33PM -0500, Ammar Mustafa wrote:
> > Correct the ad7191 documentation to match the datasheet:
> > - Fix inverted CLKSEL pin logic: device uses external clock when pin is
> > inactive, and internal CMOS/crystal when high.
>
> high --> active
Thanks for your patch, which is now commit d2a4ec19d2a2e54c ("Docs:
iio: ad7191 Correct clock configuration") in char-misc-next and
iio/togreg.
That commit message still says "inactive" and "high", thus adding to
the confusion.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply