Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH v3 0/5] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures
From: Alison Schofield @ 2026-06-02  1:51 UTC (permalink / raw)
  To: Li Chen
  Cc: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	virtualization, nvdimm, linux-kernel
In-Reply-To: <20260226025712.2236279-1-me@linux.beauty>

On Thu, Feb 26, 2026 at 10:57:05AM +0800, Li Chen wrote:
> Hi,
> 
> The virtio-pmem flush path uses a virtqueue cookie/token to carry a
> per-request context through completion. Under broken virtqueue / notify
> failure conditions, the submitter can return and free the request object
> while the host/backend may still complete the published request. The IRQ
> completion handler then dereferences freed memory when waking waiters,
> which is reported by KASAN as a slab-use-after-free and may manifest as
> lock corruption (e.g. "BUG: spinlock already unlocked") without KASAN.
> 
> In addition, the flush path has two wait sites: one for virtqueue
> descriptor availability (-ENOSPC from virtqueue_add_sgs()) and one for
> request completion. If the virtqueue becomes broken, forward progress is
> no longer guaranteed and these waiters may sleep indefinitely unless the
> driver converges the failure and wakes all wait sites.
> 
> This series addresses both issues:
> 
> 1/5 nvdimm: virtio_pmem: always wake -ENOSPC waiters
> Wake one -ENOSPC waiter for each reclaimed used buffer, decoupled from
> token completion.
> 
> 2/5 nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
> Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
> wq_buf_avail).
> 
> 3/5 nvdimm: virtio_pmem: refcount requests for token lifetime
> Refcount request objects so the token lifetime spans the window where it
> is reachable through the virtqueue until completion/drain drops the
> virtqueue reference.
> 
> 4/5 nvdimm: virtio_pmem: converge broken virtqueue to -EIO
> Track a device-level broken state to converge broken/notify failures to
> -EIO: wake all waiters and drain/detach outstanding requests to complete
> them with an error, and fail-fast new requests.
> 
> 5/5 nvdimm: virtio_pmem: drain requests in freeze
> Drain outstanding requests in freeze() before tearing down virtqueues so
> waiters do not sleep indefinitely.
> 
> Testing was done on QEMU x86_64 with a virtio-pmem device exported as
> /dev/pmem0, formatted with ext4 (-O fast_commit), mounted with DAX, and
> stressed with fsync-heavy workloads.
> 
> Thanks,
> Li Chen

Hi Li Chen,

Today I took a look at this set, noting that it's been sitting idle 
in our nvdimm backlog for a while. I'm not able to apply it. Can you
post a new rev that applies to 7.1-rc6 ?

Thanks,
Alison


^ permalink raw reply

* Re: [PATCH v4 46/47] x86/kvmclock: Plumb in AP-online and BSP-resume to kvmlock, for documentation
From: David Woodhouse @ 2026-06-01 22:09 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150833.715042-1-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 543 bytes --]

On Fri, 29 May 2026 08:08:33 -0700, Sean Christopherson wrote:
> Invoke kvmclock_cpu_action() with AP_ONLINE and BSP_RESUME, even though
> kvmclock doesn't need to do anything in either case, so that the asymmetry
> of kvmclock is a detail buried in kvmclock, and to explicitly document
> that doing nothing during those phases is intentional and correct.
>
> For all intents and purposes, no functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 30/47] x86/xen/time: NOP-ify x86_platform's sched_clock save/restore hooks
From: David Woodhouse @ 2026-06-01 22:09 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150741.714145-1-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 403 bytes --]

On Fri, 29 May 2026 08:07:41 -0700, Sean Christopherson wrote:
> NOP-ify the x86_platform sched_clock save/restore hooks when setting up
> Xen's PV clock to make it somewhat obvious the hooks aren't used when
> running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 31/47] x86/vmware: NOP-ify save/restore hooks when using VMware's sched_clock
From: David Woodhouse @ 2026-06-01 22:09 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150753.714296-1-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 628 bytes --]

On Fri, 29 May 2026 08:07:52 -0700, Sean Christopherson wrote:
> NOP-ify the sched_clock save/restore hooks when using VMware's version of
> sched_clock.  This will allow extending paravirt_set_sched_clock() to set
> the save/restore hooks, without having to simultaneously change the
> behavior of VMware guests.
>
> Note, it's not at all obvious that it's safe/correct for VMware guests to
> do nothing on suspend/resume, but that's a pre-existing problem.  Leave it
> for a VMware expert to sort out.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 18/47] x86/kvm: Get local APIC bus frequency from PV CPUID Timing Info
From: David Woodhouse @ 2026-06-01 22:06 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx, sashiko-reviews
In-Reply-To: <ahnZ8tuHFVFTeofW@google.com>

[-- Attachment #1: Type: text/plain, Size: 1734 bytes --]

On Fri, 29 May 2026 11:24:50 -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 4fe9c69bf40b3..c1139182121d5 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > 
> > [ ... ]
> > 
> > > @@ -1039,6 +1040,13 @@ static void __init kvm_init_platform(void)
> > >                     x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
> > >                     x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
> > >             }
> > > +
> > > +#ifdef CONFIG_X86_LOCAL_APIC
> > > +           /* The leaf also includes the local APIC bus/timer frequency.*/
> > > +           apic_khz = cpuid_ebx(timing_info_leaf);
> > > +           if (apic_khz)
> > > +                  lapic_timer_period = apic_khz * 1000 / HZ;
> > 
> > [Severity: High]
> > Could the intermediate multiplication of apic_khz * 1000 overflow here?
> > 
> > Since apic_khz is a u32, the multiplication is evaluated in 32-bit unsigned
> > arithmetic. If the hypervisor reports an APIC timer frequency greater than
> > ~4.29 GHz (apic_khz > 4294967), this calculation will silently overflow.
> 
> Hmm, easy enough to use mul_u64_u32_div() (I think that's the write helper for
> this?).

Yep.

> But this problem pre-exits in almost every other path that sets lapic_timer_period.
> So while I tried to avoid doing yet more tangentially related cleanup, it seems
> like adding a helper to set lapic_timer_period is the way to go.  That would also
> allow making lapic_timer_period local to arch/x86/kernel/apic/apic.c.
> 
> *sigh*

Yay, more patches!

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 17/47] x86/kvm: Mark TSC as reliable when it's constant and nonstop
From: David Woodhouse @ 2026-06-01 22:02 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx, sashiko-reviews
In-Reply-To: <ahnhnjvfIblFxTFX@google.com>

[-- Attachment #1: Type: text/plain, Size: 2028 bytes --]

On Fri, 29 May 2026 11:57:34 -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 909d3e5e5bcd5..4fe9c69bf40b3 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > [ ... ]
> > > @@ -1040,7 +1041,20 @@ static void __init kvm_init_platform(void)
> > [ ... ]
> > > -	kvmclock_init();
> > > +        /*
> > > +         * If the TSC counts at a constant frequency across P/T states, counts
> > > +         * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > > +         * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
> > > +         * exists purely to honor the TSC being marked unstable via command
> > > +         * line, any runtime detection of an unstable will happen after this.
> > > +         */
> > > +	tsc_is_reliable = boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > > +			  boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > > +			  !check_tsc_unstable();
> > 
> > [Severity: High]
> > Does this evaluate check_tsc_unstable() too early to catch the command line
> > parameter?
> 
> Huh, it does indeed.
> 
> > It looks like kvm_init_platform() is called from setup_arch(), but the
> > tsc=unstable kernel parameter is parsed via __setup() later during
> > parse_args() in start_kernel().
> > 
> > If check_tsc_unstable() evaluates to 0 here because the parameter hasn't
> > been parsed yet, wouldn't it incorrectly force X86_FEATURE_TSC_RELIABLE
> > and set prefer_tsc to true?
> 
> Yep, but this is a pre-existing problem that goes all the way back to the original
> commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if invariant TSC is exposed").
> 
> We could try to fix that, but I'm _very_ strongly inclined to add (yet another)
> patch to simply drop the check_tsc_unstable() since it has always been dead code.

Yeah, kill it with fire.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 14/47] x86/kvmclock: Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz()
From: David Woodhouse @ 2026-06-01 21:53 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-15-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 400 bytes --]

On Fri, 29 May 2026 07:44:01 -0700, Sean Christopherson wrote:
> Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz() in anticipation of
> adding support for getting TSC info from PV CPUID, i.e. in a KVM specific
> way, but without non-kvmclock.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 12/47] x86/tsc: Rename pit_hpet_ptimer_calibrate_cpu() => native_calibrate_cpu_late()
From: David Woodhouse @ 2026-06-01 21:52 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-13-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 339 bytes --]

On Fri, 29 May 2026 07:43:59 -0700, Sean Christopherson wrote:
> Rename the late CPU calibration routine so that its relationship to the
> early routine is more obvious and intuitive.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 13/47] x86/tsc: Fold native_calibrate_cpu() into recalibrate_cpu_khz()
From: David Woodhouse @ 2026-06-01 21:52 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-14-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 503 bytes --]

On Fri, 29 May 2026 07:44:00 -0700, Sean Christopherson wrote:
> Fold the guts of native_calibrate_cpu() into its sole remaining caller,
> recalibrate_cpu_khz() to eliminate the extra SMP=n #ifdef, and so that it's
> more obvious that directly invoking the early vs. late calibration routines
> in determine_cpu_tsc_frequencies() is intentional.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 11/47] x86/tsc: Kill off x86_platform_ops.calibrate_{cpu,tsc}() hooks
From: David Woodhouse @ 2026-06-01 21:51 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-12-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 634 bytes --]

On Fri, 29 May 2026 07:43:58 -0700, Sean Christopherson wrote:
> Now that getting the CPU and/or TSC frequencies from the hypervisor uses
> dedicated hooks, drop x86_platform_ops.calibrate_{cpu,tsc}() and instead
> directly invoke the correct helper at each phase of (re)calibration.  In
> addition to eliminating unnecessary code, this makes it a bit more obvious
> when the "late" path invokes pit_hpet_ptimer_calibrate_cpu() instead of
> x86_platform_ops.calibrate_cpu().
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 8/47] x86/tsc: Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
From: David Woodhouse @ 2026-06-01 21:49 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-9-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 740 bytes --]

On Fri, 29 May 2026 07:43:55 -0700, Sean Christopherson wrote:
> Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
> instead of overriding seemingly generic platform hooks, and explicitly
> priotize hypervisor-provided frequencies over native methods, but do NOT
> clobber the frequency obtained from trusted firmware.  While shuffling the
> hooks around is arguably "six of one, half dozen of the other", scoping
> them to x86_hyper_init makes their purpose more obvious, and allows for
> explicitly defining the priority of sources (as is done here).
>
> Cc: David Woodhouse <dwmw2@infradead.org>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 1/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: David Woodhouse @ 2026-06-01 21:46 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-2-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 473 bytes --]

On Fri, 29 May 2026 07:43:48 -0700, Sean Christopherson wrote:
> Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
> frequency.  In practice, this is likely one big nop, as re-calibration is
> used only for SMP=n kernels, and only for hardware that is 20+ years old,
> i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH 01/11] params: bound array element output to the caller's page buffer
From: Matthew Wilcox @ 2026-06-01 20:23 UTC (permalink / raw)
  To: David Laight
  Cc: Kees Cook, Luis Chamberlain, Pengpeng Hou, stable, Petr Pavlu,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Rafael J. Wysocki, Len Brown, Corey Minyard, Gabriel Somlo,
	Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, David Airlie, Simona Vetter, Bart Van Assche,
	Jason Gunthorpe, Leon Romanovsky, Laurent Pinchart, Hans de Goede,
	Mauro Carvalho Chehab, Bjorn Helgaas, Hannes Reinecke,
	James E.J. Bottomley, Martin K. Petersen, Daniel Lezcano,
	Zhang Rui, Lukasz Luba, Greg Kroah-Hartman, Jiri Slaby,
	Alan Stern, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Jason Baron, Jim Cromie, Tiwei Bie, Benjamin Berg,
	Ilpo Järvinen, David E. Box, Maciej W. Rozycki,
	Srinivas Pandruvada, Peter Zijlstra, Heiko Carstens,
	Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <20260521174631.71a06440@pumpkin>

On Thu, May 21, 2026 at 05:46:31PM +0100, David Laight wrote:
> On Thu, 21 May 2026 06:33:14 -0700
> Kees Cook <kees@kernel.org> wrote:
> > Collect each element into a temporary PAGE_SIZE buffer first and then
> > copy only the remaining space into the caller's page buffer.
> 
> Should this be using a 4k buffer on all architectures?
> Initially perhaps just using a different name for the constant until
> all the associated PAGE_SIZE limits have been removed.

If we're acually going to think about this, even 4KiB is too big.
An 80x25 terminal is 2000 bytes (assuming no utf8), so 4KiB is two
entire screenfuls.  Limiting to 2048 would seem reasonable to me.

^ permalink raw reply

* Re: [PATCH 00/11] Convert moduleparams to seq_buf
From: Kees Cook @ 2026-06-01 19:59 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: Luis Chamberlain, Pengpeng Hou, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Rafael J. Wysocki, Len Brown, Corey Minyard,
	Gabriel Somlo, Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, David Airlie, Simona Vetter,
	Bart Van Assche, Jason Gunthorpe, Leon Romanovsky,
	Laurent Pinchart, Hans de Goede, Mauro Carvalho Chehab,
	Bjorn Helgaas, Hannes Reinecke, James E.J. Bottomley,
	Martin K. Petersen, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Greg Kroah-Hartman, Jiri Slaby, Alan Stern, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Baron, Jim Cromie, Tiwei Bie,
	Benjamin Berg, Ilpo Järvinen, David E. Box,
	Maciej W. Rozycki, Srinivas Pandruvada, Peter Zijlstra,
	Heiko Carstens, Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <88c5ca1d-eeda-4023-bc7a-397b92780db9@suse.com>

On Tue, May 26, 2026 at 08:53:06AM +0200, Petr Pavlu wrote:
> On 5/21/26 3:33 PM, Kees Cook wrote:
> > Hi,
> > 
> > I tried to trim the CC list here, but it's still pretty huge...
> > 
> > We've had a long-standing issue with "write to a string pointer" callbacks
> > that don't bounds check the destination (and for which the bounds is
> > also not part of the callback prototype, even if it is "known" to be
> > PAGE_SIZE, which sysfs_emit() depends on). Both moduleparams and sysfs
> > use this pattern. As a first step, and to test the migration method,
> > migrate moduleparams first.
> > 
> > There are 2 "mechanical" treewide patches that are handled by Coccinelle:
> > - treewide: Convert struct kernel_param_ops initializers to DEFINE_KERNEL_PARAM_OPS
> > - treewide: Convert custom kernel_param_ops .get callbacks to seq_buf via cocci
> > 
> > The last treewide patch is manual, and may need to be broken up into
> > per-subsystem patches, though I'd prefer to avoid this, as it would
> > extend the migration from 1 relase to at least 2 releases. (1 to
> > release the migration infrastructure, then 1 release to collect all the
> > subsystem changes, and possibly 1 more release to remove the migration
> > infrastructure.)
> > 
> > Thoughts, questions?
> 
> This looks reasonable to me. I added a few minor comments on the patches
> but they already look solid.

Thanks for the review! I'll get a v2 prepared with your notes addressed. :)

-Kees

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH 2/7] drm/vblank: timer: Fix timestamp calculation
From: Thomas Zimmermann @ 2026-06-01 17:30 UTC (permalink / raw)
  To: Michel Dänzer, simona, louis.chauvet, ville.syrjala,
	jani.nikula, mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization
In-Reply-To: <c048fbcb-d318-414f-805f-18816cfa86f3@mailbox.org>

Hi

Am 01.06.26 um 18:24 schrieb Michel Dänzer:
> On 6/1/26 16:08, Thomas Zimmermann wrote:
>> In drm_crtc_vblank_get_vblank_timeout(), return the timestamp of the
>> first visible scanline after the last vblank timeout. This is what the
>> caller expects.
>>
>> A vblank phase starts with a vblank timeout. At this point the display
>> is blanked for several scanlines. Afterwards the display is unblanked
>> until the next vblank timeout occurs. The display content is only visible
>> during that second part.
>>
>> The current implementation of drm_crtc_vblank_get_vblank_timeout()
>> returns the timestamp of the last vblank timeout that started the current
>> vblank phase. But the display only unblanks after 20 to 30 percent of
>> the overall frame duration. The returned timestamp is therefore too early.
>>
>> The next vblank timeout is already known when calculating the returned
>> timestamp. Instead of subtracting the duration of a full frame from the
>> value, only subtract the duration of the active, visible part. The result
>> is the timestamp of the first visible scanline, as expected by the caller.
>>
>> This bug was not introduced by the generic vblank timer. It appears that
>> the get_vblank_timeout logic has always been buggy since it was first
>> added in commit 3a0709928b17 ("drm/vkms: Add vblank events simulated by
>> hrtimers").
>>
>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
>> ---
>>   drivers/gpu/drm/drm_vblank.c | 32 +++++++++++++++++++++++++-------
>>   1 file changed, 25 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
>> index 96d70c3d4522..d52df247d04e 100644
>> --- a/drivers/gpu/drm/drm_vblank.c
>> +++ b/drivers/gpu/drm/drm_vblank.c
>> [...]
>> @@ -2312,17 +2321,26 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
>>   		*vblank_time = READ_ONCE(vtimer->timer.node.expires);
>>   	} while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
>>   
>> -	if (drm_WARN_ON(crtc->dev, !ktime_compare(*vblank_time, cur_time)))
>> +	if (drm_WARN_ON(dev, !ktime_compare(*vblank_time, cur_time)))
>>   		return false; /* Already expired */
>>   
>> +	framedur_ns = vblank->framedur_ns;
>> +
>>   	/*
>> -	 * To prevent races we roll the hrtimer forward before we do any
>> -	 * interrupt processing - this is how real hw works (the interrupt
>> -	 * is only generated after all the vblank registers are updated)
>> -	 * and what the vblank core expects. Therefore we need to always
>> -	 * correct the timestamp by one frame.
>> +	 * To prevent races we rolled the hrtimer forward before we did any
>> +	 * timeout processing - this is how real hw works (the interrupt is
>> +	 * only generated after all the vblank registers are updated) and what
>> +	 * the vblank core expects.
>> +	 *
>> +	 * Therefore we always need to correct the timestamp. The returned
>> +	 * time should be the time of the first active scanline after the
>> +	 * previous vblank. Hence subtract the active phase's duration from
>> +	 * the next expiration time.
>>   	 */
>> -	*vblank_time = ktime_sub(*vblank_time, vtimer->interval);
>> +	if (drm_WARN_ON(dev, !mode->crtc_vtotal))
>> +		return false;
>> +	activedur_ns = div_s64(framedur_ns * mode->crtc_vdisplay, mode->crtc_vtotal);
>> +	*vblank_time = ktime_sub_ns(*vblank_time, activedur_ns);
> Normally the timestamp returned by drm_crtc_vblank_count_and_time is supposed to correspond to the end of vertical blank / start of active, in which case the new code here looks wrong.
>
> Also, while the current time is inside an active area, it's supposed to return the timestamp corresponding to the start of the current active area, not the next one.

The initial value of *vblank_time is when the vblank timer fires _next_ 
and the display blanks. Subtracting the length of the active period 
should give the time of the first active scanline within the current 
vblank phase.

Isn't that exactly what you describe?

Best regards
Thomas

>
>

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply

* Re: [PATCH 2/7] drm/vblank: timer: Fix timestamp calculation
From: Michel Dänzer @ 2026-06-01 16:24 UTC (permalink / raw)
  To: Thomas Zimmermann, simona, louis.chauvet, ville.syrjala,
	jani.nikula, mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization
In-Reply-To: <20260601141922.91498-3-tzimmermann@suse.de>

On 6/1/26 16:08, Thomas Zimmermann wrote:
> In drm_crtc_vblank_get_vblank_timeout(), return the timestamp of the
> first visible scanline after the last vblank timeout. This is what the
> caller expects.
> 
> A vblank phase starts with a vblank timeout. At this point the display
> is blanked for several scanlines. Afterwards the display is unblanked
> until the next vblank timeout occurs. The display content is only visible
> during that second part.
> 
> The current implementation of drm_crtc_vblank_get_vblank_timeout()
> returns the timestamp of the last vblank timeout that started the current
> vblank phase. But the display only unblanks after 20 to 30 percent of
> the overall frame duration. The returned timestamp is therefore too early.
> 
> The next vblank timeout is already known when calculating the returned
> timestamp. Instead of subtracting the duration of a full frame from the
> value, only subtract the duration of the active, visible part. The result
> is the timestamp of the first visible scanline, as expected by the caller.
> 
> This bug was not introduced by the generic vblank timer. It appears that
> the get_vblank_timeout logic has always been buggy since it was first
> added in commit 3a0709928b17 ("drm/vkms: Add vblank events simulated by
> hrtimers").
> 
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> ---
>  drivers/gpu/drm/drm_vblank.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
> index 96d70c3d4522..d52df247d04e 100644
> --- a/drivers/gpu/drm/drm_vblank.c
> +++ b/drivers/gpu/drm/drm_vblank.c
> [...]
> @@ -2312,17 +2321,26 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
>  		*vblank_time = READ_ONCE(vtimer->timer.node.expires);
>  	} while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
>  
> -	if (drm_WARN_ON(crtc->dev, !ktime_compare(*vblank_time, cur_time)))
> +	if (drm_WARN_ON(dev, !ktime_compare(*vblank_time, cur_time)))
>  		return false; /* Already expired */
>  
> +	framedur_ns = vblank->framedur_ns;
> +
>  	/*
> -	 * To prevent races we roll the hrtimer forward before we do any
> -	 * interrupt processing - this is how real hw works (the interrupt
> -	 * is only generated after all the vblank registers are updated)
> -	 * and what the vblank core expects. Therefore we need to always
> -	 * correct the timestamp by one frame.
> +	 * To prevent races we rolled the hrtimer forward before we did any
> +	 * timeout processing - this is how real hw works (the interrupt is
> +	 * only generated after all the vblank registers are updated) and what
> +	 * the vblank core expects.
> +	 *
> +	 * Therefore we always need to correct the timestamp. The returned
> +	 * time should be the time of the first active scanline after the
> +	 * previous vblank. Hence subtract the active phase's duration from
> +	 * the next expiration time.
>  	 */
> -	*vblank_time = ktime_sub(*vblank_time, vtimer->interval);
> +	if (drm_WARN_ON(dev, !mode->crtc_vtotal))
> +		return false;
> +	activedur_ns = div_s64(framedur_ns * mode->crtc_vdisplay, mode->crtc_vtotal);
> +	*vblank_time = ktime_sub_ns(*vblank_time, activedur_ns);

Normally the timestamp returned by drm_crtc_vblank_count_and_time is supposed to correspond to the end of vertical blank / start of active, in which case the new code here looks wrong.

Also, while the current time is inside an active area, it's supposed to return the timestamp corresponding to the start of the current active area, not the next one.


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast

^ permalink raw reply

* Re: [PATCH 0/5] x86/xen: Get rid of Xen private lazy MMU mode tracking
From: David Hildenbrand (Arm) @ 2026-06-01 15:08 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, x86, linux-trace-kernel, linux-mm,
	virtualization
  Cc: Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, xen-devel, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list
In-Reply-To: <20260526150514.129330-1-jgross@suse.com>

On 5/26/26 17:05, Juergen Gross wrote:
> With generic lazy MMU mode tracking being available, there is no real
> need for having Xen PV specific code to track lazy MMU mode, too.
> 
> This makes it possible to drop two paravirt hooks.

Nice!

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH] virtio_console: clean up port name sysfs attribute
From: Greg KH @ 2026-06-01 14:30 UTC (permalink / raw)
  To: Goncalo Gomes; +Cc: amit, arnd, virtualization, linux-kernel, Goncalo Gomes
In-Reply-To: <20260601142508.3740862-1-gomesgoncalo+linuxkernel@gmail.com>

On Mon, Jun 01, 2026 at 03:25:08PM +0100, Goncalo Gomes wrote:
> Fix issues flagged by checkpatch.pl:

This is only needed for new code, or stuff in drivers/staging/  Stick to
there unless you know the maintainer will take such changes.

> Replace `sprintf` with `sysfs_emit` in the `name_show` callback.
> sysfs.rst states that `show` methods should only use `sysfs_emit`
> when formatting output for user space.
> 
> Rename `show_port_name` to `name_show` to follow the naming convention
> for sysfs attribute callbacks, and replace the open-coded DEVICE_ATTR()
> with DEVICE_ATTR_RO(name) which encodes both the mode and the expected
> function name.
> 
> Also fix a missing blank line after a declaration in free_buf().

You should have broken this up into multiple patches anyway :(

thanks,

greg k-h

^ permalink raw reply

* [PATCH] virtio_console: clean up port name sysfs attribute
From: Goncalo Gomes @ 2026-06-01 14:25 UTC (permalink / raw)
  To: amit; +Cc: arnd, gregkh, virtualization, linux-kernel, Goncalo Gomes

Fix issues flagged by checkpatch.pl:

Replace `sprintf` with `sysfs_emit` in the `name_show` callback.
sysfs.rst states that `show` methods should only use `sysfs_emit`
when formatting output for user space.

Rename `show_port_name` to `name_show` to follow the naming convention
for sysfs attribute callbacks, and replace the open-coded DEVICE_ATTR()
with DEVICE_ATTR_RO(name) which encodes both the mode and the expected
function name.

Also fix a missing blank line after a declaration in free_buf().

Signed-off-by: Goncalo Gomes <gomesgoncalo+linuxkernel@gmail.com>
---
 drivers/char/virtio_console.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 9a33217c68d9..56bf0f1b8a00 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -352,6 +352,7 @@ static void free_buf(struct port_buffer *buf, bool can_sleep)
 
 	for (i = 0; i < buf->sgpages; i++) {
 		struct page *page = sg_page(&buf->sg[i]);
+
 		if (!page)
 			break;
 		put_page(page);
@@ -1237,17 +1238,17 @@ static int init_port_console(struct port *port)
 	return 0;
 }
 
-static ssize_t show_port_name(struct device *dev,
-			      struct device_attribute *attr, char *buffer)
+static ssize_t name_show(struct device *dev,
+			 struct device_attribute *attr, char *buffer)
 {
 	struct port *port;
 
 	port = dev_get_drvdata(dev);
 
-	return sprintf(buffer, "%s\n", port->name);
+	return sysfs_emit(buffer, "%s\n", port->name);
 }
 
-static DEVICE_ATTR(name, S_IRUGO, show_port_name, NULL);
+static DEVICE_ATTR_RO(name);
 
 static struct attribute *port_sysfs_entries[] = {
 	&dev_attr_name.attr,
-- 
2.54.0


^ permalink raw reply related

* [PATCH 7/7] drm/vblank: timer: Avoid reading the vblank time unnecessarily
From: Thomas Zimmermann @ 2026-06-01 14:08 UTC (permalink / raw)
  To: simona, michel.daenzer, louis.chauvet, ville.syrjala, jani.nikula,
	mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization, Thomas Zimmermann
In-Reply-To: <20260601141922.91498-1-tzimmermann@suse.de>

In drm_crtc_vblank_get_vblank_timeout(), there's a loop to read
a consistent vblank count and time. Only read the time once per
iteration and avoid costly locking and an atomic read.

Return an error after 10 retries. This indicates that the vblank
counter is broken or being updated way too often.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 75e2183be0ab..05f28e27cbff 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -2316,7 +2316,8 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 
 	if (READ_ONCE(vblank->enabled)) {
 		ktime_t cur_time;
-		u64 cur_count;
+		u64 lst_count, cur_count;
+		unsigned int retries = 10;
 
 		/*
 		 * A concurrent vblank timeout could update the expires field before
@@ -2324,10 +2325,15 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 		 * expiry time to the new vblank time; deducing the timer had already
 		 * expired. Reread until we get consistent values from both fields.
 		 */
+		cur_count = drm_crtc_vblank_count(crtc);
 		do {
-			cur_count = drm_crtc_vblank_count_and_time(crtc, &cur_time);
+			lst_count = cur_count;
 			*vblank_time = READ_ONCE(vtimer->timer.node.expires);
-		} while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
+			cur_count = drm_crtc_vblank_count_and_time(crtc, &cur_time);
+		} while (cur_count != lst_count && retries--);
+
+		if (drm_WARN_ON(dev, cur_count != lst_count))
+			return false; /* broken vblank counter */
 
 		if (drm_WARN_ON(dev, !ktime_after(*vblank_time, cur_time)))
 			return false; /* already expired */
-- 
2.54.0


^ permalink raw reply related

* [PATCH 5/7] drm/vblank: timer: Estimate vblank timeout if timer is disabled
From: Thomas Zimmermann @ 2026-06-01 14:08 UTC (permalink / raw)
  To: simona, michel.daenzer, louis.chauvet, ville.syrjala, jani.nikula,
	mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization, Thomas Zimmermann
In-Reply-To: <20260601141922.91498-1-tzimmermann@suse.de>

Estimate the next vblank timeout from the duration of a frame in
the currently programmed display mode. Timeouts are aligned to
frame duration, so we can round up to the next alignment.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index cecaef98aa52..b5d2fb741b2d 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -2333,6 +2333,17 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 			return false; /* already expired */
 
 		framedur_ns = vblank->framedur_ns;
+	} else if (mode->crtc_clock) {
+		u64 framesize = mode->crtc_htotal * mode->crtc_vtotal;
+
+		/*
+		 * With the vblank timer being disabled, we don't have an
+		 * expiry time. As the timeouts are aligned to the display
+		 * mode's clock, we can estimate when the expiry time would
+		 * have been.
+		 */
+		framedur_ns = div_u64(framesize * 1000000llu, mode->crtc_clock);
+		*vblank_time = roundup(ktime_get_ns(), framedur_ns);
 	} else {
 		return false;
 	}
-- 
2.54.0


^ permalink raw reply related

* [PATCH 6/7] drm/vblank: timer: Verify that expiry time is in the future
From: Thomas Zimmermann @ 2026-06-01 14:08 UTC (permalink / raw)
  To: simona, michel.daenzer, louis.chauvet, ville.syrjala, jani.nikula,
	mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization, Thomas Zimmermann
In-Reply-To: <20260601141922.91498-1-tzimmermann@suse.de>

The timer expiry must be later than the current vblank timestamp. By
testing with !ktime_compare(), the expiry time could also be before
the vblank timestamp. Use ktime_after() to verify that the timer expires
in the future.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index b5d2fb741b2d..75e2183be0ab 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -2329,7 +2329,7 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 			*vblank_time = READ_ONCE(vtimer->timer.node.expires);
 		} while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
 
-		if (drm_WARN_ON(dev, !ktime_compare(*vblank_time, cur_time)))
+		if (drm_WARN_ON(dev, !ktime_after(*vblank_time, cur_time)))
 			return false; /* already expired */
 
 		framedur_ns = vblank->framedur_ns;
-- 
2.54.0


^ permalink raw reply related

* [PATCH 3/7] drm/vblank: timer: Use absolute timer since boot
From: Thomas Zimmermann @ 2026-06-01 14:08 UTC (permalink / raw)
  To: simona, michel.daenzer, louis.chauvet, ville.syrjala, jani.nikula,
	mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization, Thomas Zimmermann
In-Reply-To: <20260601141922.91498-1-tzimmermann@suse.de>

Replace HRTIMER_MODE_REL with HRTIMER_MODE_ABS. Align the vblank
timeouts at multiples of the current mode's frame duration. Use
CLOCK_BOOTTIME to avoid clock gaps from suspends. Allows the timer
code to easily estimate future timeouts even while the timer is
disabled.

Also add a separate error message for cases where the timeout handler
tries to forward an unexpired timer. This would indicate a problem in
how the timer is being set up.

Suggested-by: Michel Dänzer <michel.daenzer@mailbox.org>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index d52df247d04e..03b07e3c2598 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -2192,7 +2192,9 @@ static enum hrtimer_restart drm_vblank_timer_function(struct hrtimer *timer)
 		return HRTIMER_NORESTART;
 
 	ret_overrun = hrtimer_forward_now(&vtimer->timer, interval);
-	if (ret_overrun != 1)
+	if (!ret_overrun)
+		drm_dbg_vbl(dev, "vblank timer underrun\n");
+	else if (ret_overrun != 1)
 		drm_dbg_vbl(dev, "vblank timer overrun\n");
 
 	if (crtc_funcs->handle_vblank_timeout)
@@ -2221,6 +2223,7 @@ int drm_crtc_vblank_start_timer(struct drm_crtc *crtc)
 	struct drm_vblank_crtc *vblank = drm_crtc_vblank_crtc(crtc);
 	struct drm_vblank_crtc_timer *vtimer = &vblank->vblank_timer;
 	unsigned long flags;
+	s64 vblank_time_ns;
 
 	if (!vtimer->crtc) {
 		/*
@@ -2229,7 +2232,7 @@ int drm_crtc_vblank_start_timer(struct drm_crtc *crtc)
 		vtimer->crtc = crtc;
 		spin_lock_init(&vtimer->interval_lock);
 		hrtimer_setup(&vtimer->timer, drm_vblank_timer_function,
-			      CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+			      CLOCK_BOOTTIME, HRTIMER_MODE_ABS);
 	} else {
 		/*
 		 * Timer should not be active. If it is, wait for the
@@ -2245,7 +2248,13 @@ int drm_crtc_vblank_start_timer(struct drm_crtc *crtc)
 	vtimer->interval = ns_to_ktime(vblank->framedur_ns);
 	spin_unlock_irqrestore(&vtimer->interval_lock, flags);
 
-	hrtimer_start(&vtimer->timer, vtimer->interval, HRTIMER_MODE_REL);
+	/*
+	 * Always align the vblank timeout to the frame duration. Allows
+	 * for estimating the next vblank even if the hrtimer has been
+	 * disabled.
+	 */
+	vblank_time_ns = roundup(ktime_get_ns(), vblank->framedur_ns);
+	hrtimer_start(&vtimer->timer, ns_to_ktime(vblank_time_ns), HRTIMER_MODE_ABS);
 
 	return 0;
 }
-- 
2.54.0


^ permalink raw reply related

* [PATCH 4/7] drm/vblank: timer: Reorganize get_vblank_timeout
From: Thomas Zimmermann @ 2026-06-01 14:08 UTC (permalink / raw)
  To: simona, michel.daenzer, louis.chauvet, ville.syrjala, jani.nikula,
	mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization, Thomas Zimmermann
In-Reply-To: <20260601141922.91498-1-tzimmermann@suse.de>

Handle vblank->enabled in a separate branch before handling the
opposite case. Prepares the code for estimating the vblank timeout
while vblanking is disabled. No functional changes.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 03b07e3c2598..cecaef98aa52 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -2306,8 +2306,6 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 	struct drm_vblank_crtc *vblank = drm_crtc_vblank_crtc(crtc);
 	struct drm_vblank_crtc_timer *vtimer = &vblank->vblank_timer;
 	const struct drm_display_mode *mode;
-	u64 cur_count;
-	ktime_t cur_time;
 	s64 framedur_ns;
 	s64 activedur_ns;
 
@@ -2316,24 +2314,28 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 	else
 		mode = &crtc->hwmode;
 
-	if (!READ_ONCE(vblank->enabled))
-		return false;
+	if (READ_ONCE(vblank->enabled)) {
+		ktime_t cur_time;
+		u64 cur_count;
 
-	/*
-	 * A concurrent vblank timeout could update the expires field before
-	 * we compare it with the vblank time. Hence we'd compare the old
-	 * expiry time to the new vblank time; deducing the timer had already
-	 * expired. Reread until we get consistent values from both fields.
-	 */
-	do {
-		cur_count = drm_crtc_vblank_count_and_time(crtc, &cur_time);
-		*vblank_time = READ_ONCE(vtimer->timer.node.expires);
-	} while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
+		/*
+		 * A concurrent vblank timeout could update the expires field before
+		 * we compare it with the vblank time. Hence we'd compare the old
+		 * expiry time to the new vblank time; deducing the timer had already
+		 * expired. Reread until we get consistent values from both fields.
+		 */
+		do {
+			cur_count = drm_crtc_vblank_count_and_time(crtc, &cur_time);
+			*vblank_time = READ_ONCE(vtimer->timer.node.expires);
+		} while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
 
-	if (drm_WARN_ON(dev, !ktime_compare(*vblank_time, cur_time)))
-		return false; /* Already expired */
+		if (drm_WARN_ON(dev, !ktime_compare(*vblank_time, cur_time)))
+			return false; /* already expired */
 
-	framedur_ns = vblank->framedur_ns;
+		framedur_ns = vblank->framedur_ns;
+	} else {
+		return false;
+	}
 
 	/*
 	 * To prevent races we rolled the hrtimer forward before we did any
-- 
2.54.0


^ permalink raw reply related

* [PATCH 2/7] drm/vblank: timer: Fix timestamp calculation
From: Thomas Zimmermann @ 2026-06-01 14:08 UTC (permalink / raw)
  To: simona, michel.daenzer, louis.chauvet, ville.syrjala, jani.nikula,
	mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization, Thomas Zimmermann
In-Reply-To: <20260601141922.91498-1-tzimmermann@suse.de>

In drm_crtc_vblank_get_vblank_timeout(), return the timestamp of the
first visible scanline after the last vblank timeout. This is what the
caller expects.

A vblank phase starts with a vblank timeout. At this point the display
is blanked for several scanlines. Afterwards the display is unblanked
until the next vblank timeout occurs. The display content is only visible
during that second part.

The current implementation of drm_crtc_vblank_get_vblank_timeout()
returns the timestamp of the last vblank timeout that started the current
vblank phase. But the display only unblanks after 20 to 30 percent of
the overall frame duration. The returned timestamp is therefore too early.

The next vblank timeout is already known when calculating the returned
timestamp. Instead of subtracting the duration of a full frame from the
value, only subtract the duration of the active, visible part. The result
is the timestamp of the first visible scanline, as expected by the caller.

This bug was not introduced by the generic vblank timer. It appears that
the get_vblank_timeout logic has always been buggy since it was first
added in commit 3a0709928b17 ("drm/vkms: Add vblank events simulated by
hrtimers").

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 96d70c3d4522..d52df247d04e 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -2293,10 +2293,19 @@ EXPORT_SYMBOL(drm_crtc_vblank_cancel_timer);
  */
 bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_time)
 {
+	struct drm_device *dev = crtc->dev;
 	struct drm_vblank_crtc *vblank = drm_crtc_vblank_crtc(crtc);
 	struct drm_vblank_crtc_timer *vtimer = &vblank->vblank_timer;
+	const struct drm_display_mode *mode;
 	u64 cur_count;
 	ktime_t cur_time;
+	s64 framedur_ns;
+	s64 activedur_ns;
+
+	if (drm_drv_uses_atomic_modeset(dev))
+		mode = &vblank->hwmode;
+	else
+		mode = &crtc->hwmode;
 
 	if (!READ_ONCE(vblank->enabled))
 		return false;
@@ -2312,17 +2321,26 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 		*vblank_time = READ_ONCE(vtimer->timer.node.expires);
 	} while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
 
-	if (drm_WARN_ON(crtc->dev, !ktime_compare(*vblank_time, cur_time)))
+	if (drm_WARN_ON(dev, !ktime_compare(*vblank_time, cur_time)))
 		return false; /* Already expired */
 
+	framedur_ns = vblank->framedur_ns;
+
 	/*
-	 * To prevent races we roll the hrtimer forward before we do any
-	 * interrupt processing - this is how real hw works (the interrupt
-	 * is only generated after all the vblank registers are updated)
-	 * and what the vblank core expects. Therefore we need to always
-	 * correct the timestamp by one frame.
+	 * To prevent races we rolled the hrtimer forward before we did any
+	 * timeout processing - this is how real hw works (the interrupt is
+	 * only generated after all the vblank registers are updated) and what
+	 * the vblank core expects.
+	 *
+	 * Therefore we always need to correct the timestamp. The returned
+	 * time should be the time of the first active scanline after the
+	 * previous vblank. Hence subtract the active phase's duration from
+	 * the next expiration time.
 	 */
-	*vblank_time = ktime_sub(*vblank_time, vtimer->interval);
+	if (drm_WARN_ON(dev, !mode->crtc_vtotal))
+		return false;
+	activedur_ns = div_s64(framedur_ns * mode->crtc_vdisplay, mode->crtc_vtotal);
+	*vblank_time = ktime_sub_ns(*vblank_time, activedur_ns);
 
 	return true;
 }
-- 
2.54.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox