All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: lkp@lists.01.org
Subject: Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail
Date: Tue, 16 Aug 2022 22:23:01 +0000	[thread overview]
Message-ID: <YvwYxeE4vc/Srbil@google.com> (raw)
In-Reply-To: <9bfeae26-b4b1-eedb-6cbd-b4f9f1e1cc55@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4480 bytes --]

On Tue, Aug 16, 2022, Gavin Shan wrote:
> Hi Sean,
> 
> On 8/16/22 3:02 PM, Gavin Shan wrote:
> > On 8/16/22 7:42 AM, Sean Christopherson wrote:
> > > On Mon, Aug 15, 2022, kernel test robot wrote:
> > > > commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > 
> > > ...
> > > 
> > > > # selftests: kvm: rseq_test
> > > > # ==== Test Assertion Failure ====
> > > > #   rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2)
> > > > #   pid=49599 tid=49599 errno=4 - Interrupted system call
> > > > #      1    0x000000000040265d: main at rseq_test.c:278
> > > > #      2    0x00007fe44eed07fc: ?? ??:0
> > > > #      3    0x00000000004026d9: _start at ??:?
> > > > #   Only performed 23174 KVM_RUNs, task stalled too much?
> > > > #
> > > > not ok 56 selftests: kvm: rseq_test # exit=254
> > > 
> > > ...
> > > 
> > > > # Automatically generated file; DO NOT EDIT.
> > > > # Linux/x86_64 5.19.0-rc6 Kernel Configuration
> > > > #
> > > > CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0"
> > > > CONFIG_CC_IS_GCC=y
> > > > CONFIG_GCC_VERSION=110300
> > > > CONFIG_CLANG_VERSION=0
> > > > CONFIG_AS_IS_GNU=y
> > > > CONFIG_AS_VERSION=23800
> > > > CONFIG_LD_IS_BFD=y
> > > > CONFIG_LD_VERSION=23800
> > > > CONFIG_LLD_VERSION=0
> > > 
> > > Assuming 23800 == 2.38, this is a known issue.
> > > 
> > > https://lore.kernel.org/all/20220810104114.6838-1-gshan(a)redhat.com
> > > 
> > 
> > It's probably different story this time.

Doh, if I had bothered to actually look at the error message...

> > The assert is triggered because of the following instructions. I would
> > guess the reason is vcpu thread has been running on CPU where we has high
> > CPU load. In this case, the vcpu thread can't be run in time. More
> > specific, the vcpu thread can't be run in the 1 - 10us time window, which
> > is specified by the migration worker (thread).
> > 
> >      TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2),
> >                  "Only performed %d KVM_RUNs, task stalled too much?\n", i);
> > 
> > I think we need to improve the handshake mechanism between the vcpu thread
> > and migration worker. In current implementation, the handshake is done through
> > the atomic counter. The mechanism is simple enough, but vcpu thread can miss
> > the aforementioned time window. Another issue is the test case much more time
> > than expected to finish.

There's not really an expected time to finish.  The original purpose of the test
is to trigger a kernel race condition, so it's a balance between letting the test
run long enough to have some confidence that the kernel is bug free, and not running
so long that it wastes time.

> > Sean, if you think it's reasonable, I can figure out something to do:
> > 
> > - Reuse the atomic counter for a full synchronization between these two
> >    threads. Something like below:
> > 
> >    #define RSEQ_TEST_STATE_RUN_VCPU       0     // vcpu_run()
> >    #define RSEQ_TEST_STATE_MIGRATE        1     // sched_setaffinity()
> >    #define RSEQ_TEST_STATE_CHECK          2     // Check rseq.cpu_id and get_cpu()
> > 
> >    The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after RSEQ_TEST_STATE_RUN_VCPU

Again, because one of the primary goals is to ensure the kernel is race free, the
test should avoid full synchronization.

> > 
> > - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With this,
> >    less time is needed to finish the test case.
> > 
> 
> I'm able to recreate the issue on my local arm64 system.
> 
> - From the source code, the iteration count is changed from 100000 to 1000
> - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other CPUs
>   are cleared from @possible_mask
> - Run some CPU bound task on CPU#0 and CPU#1
>   # while true; do taskset -c 0 ./a; done
>   # while true; do taskset -c 1 ./a; done
> - Run 'rseq_test' and hit the issue

At this point, this isn't a test bug.  The test is right to complain that it didn't
provide the coverage it's supposed to provide.

If the bot failure is a one-off, my preference is to leave things as-is for now.
If the failure is an ongoing issue, then we probably need to understand why the
bot is failing.

WARNING: multiple messages have this Message-ID (diff)
From: Sean Christopherson <seanjc@google.com>
To: Gavin Shan <gshan@redhat.com>
Cc: kernel test robot <oliver.sang@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Oliver Upton <oliver.upton@linux.dev>,
	Andrew Jones <andrew.jones@linux.dev>,
	LKML <linux-kernel@vger.kernel.org>,
	kvm@vger.kernel.org, lkp@lists.01.org, lkp@intel.com,
	xudong.hao@intel.com, regressions@lists.linux.dev
Subject: Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail
Date: Tue, 16 Aug 2022 22:23:01 +0000	[thread overview]
Message-ID: <YvwYxeE4vc/Srbil@google.com> (raw)
In-Reply-To: <9bfeae26-b4b1-eedb-6cbd-b4f9f1e1cc55@redhat.com>

On Tue, Aug 16, 2022, Gavin Shan wrote:
> Hi Sean,
> 
> On 8/16/22 3:02 PM, Gavin Shan wrote:
> > On 8/16/22 7:42 AM, Sean Christopherson wrote:
> > > On Mon, Aug 15, 2022, kernel test robot wrote:
> > > > commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > 
> > > ...
> > > 
> > > > # selftests: kvm: rseq_test
> > > > # ==== Test Assertion Failure ====
> > > > #   rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2)
> > > > #   pid=49599 tid=49599 errno=4 - Interrupted system call
> > > > #      1    0x000000000040265d: main at rseq_test.c:278
> > > > #      2    0x00007fe44eed07fc: ?? ??:0
> > > > #      3    0x00000000004026d9: _start at ??:?
> > > > #   Only performed 23174 KVM_RUNs, task stalled too much?
> > > > #
> > > > not ok 56 selftests: kvm: rseq_test # exit=254
> > > 
> > > ...
> > > 
> > > > # Automatically generated file; DO NOT EDIT.
> > > > # Linux/x86_64 5.19.0-rc6 Kernel Configuration
> > > > #
> > > > CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0"
> > > > CONFIG_CC_IS_GCC=y
> > > > CONFIG_GCC_VERSION=110300
> > > > CONFIG_CLANG_VERSION=0
> > > > CONFIG_AS_IS_GNU=y
> > > > CONFIG_AS_VERSION=23800
> > > > CONFIG_LD_IS_BFD=y
> > > > CONFIG_LD_VERSION=23800
> > > > CONFIG_LLD_VERSION=0
> > > 
> > > Assuming 23800 == 2.38, this is a known issue.
> > > 
> > > https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com
> > > 
> > 
> > It's probably different story this time.

Doh, if I had bothered to actually look at the error message...

> > The assert is triggered because of the following instructions. I would
> > guess the reason is vcpu thread has been running on CPU where we has high
> > CPU load. In this case, the vcpu thread can't be run in time. More
> > specific, the vcpu thread can't be run in the 1 - 10us time window, which
> > is specified by the migration worker (thread).
> > 
> >      TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2),
> >                  "Only performed %d KVM_RUNs, task stalled too much?\n", i);
> > 
> > I think we need to improve the handshake mechanism between the vcpu thread
> > and migration worker. In current implementation, the handshake is done through
> > the atomic counter. The mechanism is simple enough, but vcpu thread can miss
> > the aforementioned time window. Another issue is the test case much more time
> > than expected to finish.

There's not really an expected time to finish.  The original purpose of the test
is to trigger a kernel race condition, so it's a balance between letting the test
run long enough to have some confidence that the kernel is bug free, and not running
so long that it wastes time.

> > Sean, if you think it's reasonable, I can figure out something to do:
> > 
> > - Reuse the atomic counter for a full synchronization between these two
> >    threads. Something like below:
> > 
> >    #define RSEQ_TEST_STATE_RUN_VCPU       0     // vcpu_run()
> >    #define RSEQ_TEST_STATE_MIGRATE        1     // sched_setaffinity()
> >    #define RSEQ_TEST_STATE_CHECK          2     // Check rseq.cpu_id and get_cpu()
> > 
> >    The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after RSEQ_TEST_STATE_RUN_VCPU

Again, because one of the primary goals is to ensure the kernel is race free, the
test should avoid full synchronization.

> > 
> > - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With this,
> >    less time is needed to finish the test case.
> > 
> 
> I'm able to recreate the issue on my local arm64 system.
> 
> - From the source code, the iteration count is changed from 100000 to 1000
> - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other CPUs
>   are cleared from @possible_mask
> - Run some CPU bound task on CPU#0 and CPU#1
>   # while true; do taskset -c 0 ./a; done
>   # while true; do taskset -c 1 ./a; done
> - Run 'rseq_test' and hit the issue

At this point, this isn't a test bug.  The test is right to complain that it didn't
provide the coverage it's supposed to provide.

If the bot failure is a one-off, my preference is to leave things as-is for now.
If the failure is an ongoing issue, then we probably need to understand why the
bot is failing.

  reply	other threads:[~2022-08-16 22:23 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-15  7:50 [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail kernel test robot
2022-08-15  7:50 ` kernel test robot
2022-08-15 21:42 ` Sean Christopherson
2022-08-15 21:42   ` Sean Christopherson
2022-08-16  5:02   ` Gavin Shan
2022-08-16  5:02     ` Gavin Shan
2022-08-16 10:02     ` Gavin Shan
2022-08-16 10:02       ` Gavin Shan
2022-08-16 22:23       ` Sean Christopherson [this message]
2022-08-16 22:23         ` Sean Christopherson
2022-08-17  0:06         ` Gavin Shan
2022-08-17  0:06           ` Gavin Shan
2022-09-05 10:47           ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail #forregzbot Thorsten Leemhuis
2022-09-05 10:47             ` Thorsten Leemhuis
2022-11-01  8:32       ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail Robert Hoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YvwYxeE4vc/Srbil@google.com \
    --to=seanjc@google.com \
    --cc=lkp@lists.01.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.