public inbox for kvm@vger.kernel.org
* [kvm-unit-tests PATCH 0/2] x86/apic: fix false test_apic_change_mode failures on stalled vCPUs
@ 2026-04-28 13:35 Igor Mammedov
  2026-04-28 13:35 ` [kvm-unit-tests PATCH 1/2] x86/apic: separate reporting from actual measurements Igor Mammedov
  2026-04-28 13:35 ` [kvm-unit-tests PATCH 2/2] x86/apic: add retry logic to test_apic_change_mode Igor Mammedov
  0 siblings, 2 replies; 3+ messages in thread
From: Igor Mammedov @ 2026-04-28 13:35 UTC
  To: kvm; +Cc: pbonzini

test_apic_change_mode sporadically fails in CI on both Intel and AMD
hosts with errors like:
  "FAIL: TMCCT should have a non-zero value"
  "FAIL: TMCCT should be reset to the initial-count"
  "FAIL: TMCCT should not be reset to TMICT value"

The root cause is that the APIC timer runs at wall-clock time under KVM,
while the default tmict=0x999999 gives only a ~10ms period (at a 1ns bus
cycle).

A vCPU stall that covers a sufficiently large portion of the TMICT period
leads to false failures (possible causes: host preemption, cross-socket
migration, heavy CPU contention). It is basically not possible to reliably
sample timer values while the timer is running.
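
To make the numbers concrete, here is a small sketch of the period
arithmetic. The function names are made up for illustration, and the
divide-by-1 setting is an assumption; only tmict=0x999999 and the 1ns bus
cycle come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Timer period in nanoseconds: initial count x divider x bus cycle length.
 * The divider value is an assumption, it is not stated in this cover letter.
 */
uint64_t apic_timer_period_ns(uint32_t tmict, uint32_t divider,
			      uint64_t bus_cycle_ns)
{
	return (uint64_t)tmict * divider * bus_cycle_ns;
}

/*
 * True when a vCPU stall is at least as long as the timer period, i.e. the
 * timer has expired (one-shot) or wrapped (periodic) before the guest gets
 * to sample TMCCT.
 */
bool stall_outlasts_period(uint64_t stall_ns, uint64_t period_ns)
{
	return stall_ns >= period_ns;
}
```

With tmict=0x999999, a divider of 1 and a 1ns bus cycle this gives
10066329ns, so a stall of roughly 10ms or more is enough to invalidate
a sample.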

This series adds retry logic with increasing timer periods (10ms, 60ms,
700ms) so that transient vCPU stalls don't cause false failures, while
real bugs still get caught. (Most false failures are handled by the 60ms
timer; 700ms covers one pathological case observed in a week of testing.)
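
The retry idea can be sketched roughly as follows. This is not the code
from patch 2/2: all names here (retry_period_ns, measure_with_retry,
simulated_measure, broken_measure) are hypothetical, and the measurement
is simulated so the sketch can run outside a guest. The periods assume a
1ns bus cycle and divide-by-1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical retry periods in ns: 10ms, 60ms, 700ms. */
static const uint64_t retry_period_ns[] = {
	10000000ULL, 60000000ULL, 700000000ULL
};
#define NR_RETRIES (sizeof(retry_period_ns) / sizeof(retry_period_ns[0]))

/* One measurement attempt; returns true when the sampled values look sane. */
typedef bool (*measure_fn)(uint64_t period_ns, void *ctx);

/*
 * Run measure() with increasing periods and stop at the first success.
 * Returns the number of attempts used (1..NR_RETRIES), or 0 if every
 * attempt failed, in which case the caller reports a real FAIL.
 */
size_t measure_with_retry(measure_fn measure, void *ctx)
{
	for (size_t i = 0; i < NR_RETRIES; i++) {
		if (measure(retry_period_ns[i], ctx))
			return i + 1;
	}
	return 0;
}

/*
 * Simulated attempt: ctx points to a stall length in ns. The sample is
 * usable only if the stall is shorter than the timer period.
 */
bool simulated_measure(uint64_t period_ns, void *ctx)
{
	uint64_t stall_ns = *(uint64_t *)ctx;

	return stall_ns < period_ns;
}

/* Simulated real bug: the measurement is wrong whatever the period is. */
bool broken_measure(uint64_t period_ns, void *ctx)
{
	(void)period_ns;
	(void)ctx;
	return false;
}
```

In this model a 30ms stall fails the 10ms attempt and passes on the 60ms
one, while a timer that is actually broken fails all three attempts and is
still reported.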

Reproducer (requires 2+ NUMA nodes):

  stress-ng --cpu 128 --timer 32 --hrtimers 32 --quiet &
  sleep 2
  while true; do
      /usr/libexec/qemu-kvm --no-reboot -nodefaults \
          -global kvm-pit.lost_tick_policy=discard \
          -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 \
          -display none -serial stdio -device pci-testdev \
          -machine q35 -kernel x86/apic.flat \
          -smp 1 -cpu qemu64,+x2apic,+tsc-deadline \
          >> apic_race.log 2>&1 &
      QEMU_PID=$!
      while kill -0 $QEMU_PID 2>/dev/null; do
          taskset -p -c 0 $QEMU_PID 2>/dev/null
          sleep 0.001
          taskset -p -c 1 $QEMU_PID 2>/dev/null
          sleep 0.001
      done
      wait $QEMU_PID 2>/dev/null
  done

The patches reduce the failure rate from ~4% (8 FAILs / 216 PASSes in
2 minutes) to 0 FAILs over thousands of runs.

Igor Mammedov (2):
  x86/apic: separate reporting from actual measurements
  x86/apic: add retry logic to test_apic_change_mode

 x86/apic.c | 119 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 72 insertions(+), 47 deletions(-)

-- 
2.47.3

