* Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail [not found] <Yvn60W/JpPO8URLY@xsang-OptiPlex-9020> @ 2022-08-15 21:42 ` Sean Christopherson 2022-08-16 5:02 ` Gavin Shan 0 siblings, 1 reply; 7+ messages in thread From: Sean Christopherson @ 2022-08-15 21:42 UTC (permalink / raw) To: kernel test robot Cc: Gavin Shan, Paolo Bonzini, Oliver Upton, Andrew Jones, LKML, kvm, lkp, lkp, xudong.hao, regressions On Mon, Aug 15, 2022, kernel test robot wrote: > commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test") > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master ... > # selftests: kvm: rseq_test > # ==== Test Assertion Failure ==== > # rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2) > # pid=49599 tid=49599 errno=4 - Interrupted system call > # 1 0x000000000040265d: main at rseq_test.c:278 > # 2 0x00007fe44eed07fc: ?? ??:0 > # 3 0x00000000004026d9: _start at ??:? > # Only performed 23174 KVM_RUNs, task stalled too much? > # > not ok 56 selftests: kvm: rseq_test # exit=254 ... > # Automatically generated file; DO NOT EDIT. > # Linux/x86_64 5.19.0-rc6 Kernel Configuration > # > CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0" > CONFIG_CC_IS_GCC=y > CONFIG_GCC_VERSION=110300 > CONFIG_CLANG_VERSION=0 > CONFIG_AS_IS_GNU=y > CONFIG_AS_VERSION=23800 > CONFIG_LD_IS_BFD=y > CONFIG_LD_VERSION=23800 > CONFIG_LLD_VERSION=0 Assuming 23800 == 2.38, this is a known issue. https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail 2022-08-15 21:42 ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail Sean Christopherson @ 2022-08-16 5:02 ` Gavin Shan 2022-08-16 10:02 ` Gavin Shan 0 siblings, 1 reply; 7+ messages in thread From: Gavin Shan @ 2022-08-16 5:02 UTC (permalink / raw) To: Sean Christopherson, kernel test robot Cc: Paolo Bonzini, Oliver Upton, Andrew Jones, LKML, kvm, lkp, lkp, xudong.hao, regressions Hi Sean, On 8/16/22 7:42 AM, Sean Christopherson wrote: > On Mon, Aug 15, 2022, kernel test robot wrote: >> commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test") >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master > > ... > >> # selftests: kvm: rseq_test >> # ==== Test Assertion Failure ==== >> # rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2) >> # pid=49599 tid=49599 errno=4 - Interrupted system call >> # 1 0x000000000040265d: main at rseq_test.c:278 >> # 2 0x00007fe44eed07fc: ?? ??:0 >> # 3 0x00000000004026d9: _start at ??:? >> # Only performed 23174 KVM_RUNs, task stalled too much? >> # >> not ok 56 selftests: kvm: rseq_test # exit=254 > > ... > >> # Automatically generated file; DO NOT EDIT. >> # Linux/x86_64 5.19.0-rc6 Kernel Configuration >> # >> CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0" >> CONFIG_CC_IS_GCC=y >> CONFIG_GCC_VERSION=110300 >> CONFIG_CLANG_VERSION=0 >> CONFIG_AS_IS_GNU=y >> CONFIG_AS_VERSION=23800 >> CONFIG_LD_IS_BFD=y >> CONFIG_LD_VERSION=23800 >> CONFIG_LLD_VERSION=0 > > Assuming 23800 == 2.38, this is a known issue. > > https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com > It's probably different story this time. The assert is triggered because of the following instructions. I would guess the reason is vcpu thread has been running on CPU where we has high CPU load. In this case, the vcpu thread can't be run in time. More specific, the vcpu thread can't be run in the 1 - 10us time window, which is specified by the migration worker (thread). TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2), "Only performed %d KVM_RUNs, task stalled too much?\n", i); I think we need to improve the handshake mechanism between the vcpu thread and migration worker. In current implementation, the handshake is done through the atomic counter. The mechanism is simple enough, but vcpu thread can miss the aforementioned time window. Another issue is the test case much more time than expected to finish. Sean, if you think it's reasonable, I can figure out something to do: - Reuse the atomic counter for a full synchronization between these two threads. Something like below: #define RSEQ_TEST_STATE_RUN_VCPU 0 // vcpu_run() #define RSEQ_TEST_STATE_MIGRATE 1 // sched_setaffinity() #define RSEQ_TEST_STATE_CHECK 2 // Check rseq.cpu_id and get_cpu() The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after RSEQ_TEST_STATE_RUN_VCPU - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With this, less time is needed to finish the test case. Thanks, Gavin ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail 2022-08-16 5:02 ` Gavin Shan @ 2022-08-16 10:02 ` Gavin Shan 2022-08-16 22:23 ` Sean Christopherson 2022-11-01 8:32 ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail Robert Hoo 0 siblings, 2 replies; 7+ messages in thread From: Gavin Shan @ 2022-08-16 10:02 UTC (permalink / raw) To: Sean Christopherson, kernel test robot Cc: Paolo Bonzini, Oliver Upton, Andrew Jones, LKML, kvm, lkp, lkp, xudong.hao, regressions Hi Sean, On 8/16/22 3:02 PM, Gavin Shan wrote: > On 8/16/22 7:42 AM, Sean Christopherson wrote: >> On Mon, Aug 15, 2022, kernel test robot wrote: >>> commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test") >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master >> >> ... >> >>> # selftests: kvm: rseq_test >>> # ==== Test Assertion Failure ==== >>> # rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2) >>> # pid=49599 tid=49599 errno=4 - Interrupted system call >>> # 1 0x000000000040265d: main at rseq_test.c:278 >>> # 2 0x00007fe44eed07fc: ?? ??:0 >>> # 3 0x00000000004026d9: _start at ??:? >>> # Only performed 23174 KVM_RUNs, task stalled too much? >>> # >>> not ok 56 selftests: kvm: rseq_test # exit=254 >> >> ... >> >>> # Automatically generated file; DO NOT EDIT. >>> # Linux/x86_64 5.19.0-rc6 Kernel Configuration >>> # >>> CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0" >>> CONFIG_CC_IS_GCC=y >>> CONFIG_GCC_VERSION=110300 >>> CONFIG_CLANG_VERSION=0 >>> CONFIG_AS_IS_GNU=y >>> CONFIG_AS_VERSION=23800 >>> CONFIG_LD_IS_BFD=y >>> CONFIG_LD_VERSION=23800 >>> CONFIG_LLD_VERSION=0 >> >> Assuming 23800 == 2.38, this is a known issue. >> >> https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com >> > > It's probably different story this time. The assert is triggered because > of the following instructions. I would guess the reason is vcpu thread > has been running on CPU where we has high CPU load. In this case, the > vcpu thread can't be run in time. More specific, the vcpu thread can't > be run in the 1 - 10us time window, which is specified by the migration > worker (thread). > > TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2), > "Only performed %d KVM_RUNs, task stalled too much?\n", i); > > I think we need to improve the handshake mechanism between the vcpu thread > and migration worker. In current implementation, the handshake is done through > the atomic counter. The mechanism is simple enough, but vcpu thread can miss > the aforementioned time window. Another issue is the test case much more time > than expected to finish. > > Sean, if you think it's reasonable, I can figure out something to do: > > - Reuse the atomic counter for a full synchronization between these two > threads. Something like below: > > #define RSEQ_TEST_STATE_RUN_VCPU 0 // vcpu_run() > #define RSEQ_TEST_STATE_MIGRATE 1 // sched_setaffinity() > #define RSEQ_TEST_STATE_CHECK 2 // Check rseq.cpu_id and get_cpu() > > The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after RSEQ_TEST_STATE_RUN_VCPU > > - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With this, > less time is needed to finish the test case. > I'm able to recreate the issue on my local arm64 system. - From the source code, the iteration count is changed from 100000 to 1000 - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other CPUs are cleared from @possible_mask - Run some CPU bound task on CPU#0 and CPU#1 # while true; do taskset -c 0 ./a; done # while true; do taskset -c 1 ./a; done - Run 'rseq_test' and hit the issue # ./rseq_test calc_min_max_cpu: nproc=224 calc_min_max_cpu: min_cpu=0, max_cpu=1 main: Required tests: 1000 Succeeds: 1 ==== Test Assertion Failure ==== rseq_test.c:280: i > (NR_TASK_MIGRATIONS / 2) pid=9624 tid=9624 errno=4 - Interrupted system call 1 0x0000000000401cf3: main at rseq_test.c:280 2 0x0000ffffbc64679b: ?? ??:0 3 0x0000ffffbc64686b: ?? ??:0 4 0x0000000000401def: _start at ??:? Only performed 1 KVM_RUNs, task stalled too much? Thanks, Gavin ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail 2022-08-16 10:02 ` Gavin Shan @ 2022-08-16 22:23 ` Sean Christopherson 2022-08-17 0:06 ` Gavin Shan 2022-11-01 8:32 ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail Robert Hoo 1 sibling, 1 reply; 7+ messages in thread From: Sean Christopherson @ 2022-08-16 22:23 UTC (permalink / raw) To: Gavin Shan Cc: kernel test robot, Paolo Bonzini, Oliver Upton, Andrew Jones, LKML, kvm, lkp, lkp, xudong.hao, regressions On Tue, Aug 16, 2022, Gavin Shan wrote: > Hi Sean, > > On 8/16/22 3:02 PM, Gavin Shan wrote: > > On 8/16/22 7:42 AM, Sean Christopherson wrote: > > > On Mon, Aug 15, 2022, kernel test robot wrote: > > > > commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test") > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master > > > > > > ... > > > > > > > # selftests: kvm: rseq_test > > > > # ==== Test Assertion Failure ==== > > > > # rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2) > > > > # pid=49599 tid=49599 errno=4 - Interrupted system call > > > > # 1 0x000000000040265d: main at rseq_test.c:278 > > > > # 2 0x00007fe44eed07fc: ?? ??:0 > > > > # 3 0x00000000004026d9: _start at ??:? > > > > # Only performed 23174 KVM_RUNs, task stalled too much? > > > > # > > > > not ok 56 selftests: kvm: rseq_test # exit=254 > > > > > > ... > > > > > > > # Automatically generated file; DO NOT EDIT. > > > > # Linux/x86_64 5.19.0-rc6 Kernel Configuration > > > > # > > > > CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0" > > > > CONFIG_CC_IS_GCC=y > > > > CONFIG_GCC_VERSION=110300 > > > > CONFIG_CLANG_VERSION=0 > > > > CONFIG_AS_IS_GNU=y > > > > CONFIG_AS_VERSION=23800 > > > > CONFIG_LD_IS_BFD=y > > > > CONFIG_LD_VERSION=23800 > > > > CONFIG_LLD_VERSION=0 > > > > > > Assuming 23800 == 2.38, this is a known issue. > > > > > > https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com > > > > > > > It's probably different story this time. Doh, if I had bothered to actually look at the error message... > > The assert is triggered because of the following instructions. I would > > guess the reason is vcpu thread has been running on CPU where we has high > > CPU load. In this case, the vcpu thread can't be run in time. More > > specific, the vcpu thread can't be run in the 1 - 10us time window, which > > is specified by the migration worker (thread). > > > > TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2), > > "Only performed %d KVM_RUNs, task stalled too much?\n", i); > > > > I think we need to improve the handshake mechanism between the vcpu thread > > and migration worker. In current implementation, the handshake is done through > > the atomic counter. The mechanism is simple enough, but vcpu thread can miss > > the aforementioned time window. Another issue is the test case much more time > > than expected to finish. There's not really an expected time to finish. The original purpose of the test is to trigger a kernel race condition, so it's a balance between letting the test run long enough to have some confidence that the kernel is bug free, and not running so long that it wastes time. > > Sean, if you think it's reasonable, I can figure out something to do: > > > > - Reuse the atomic counter for a full synchronization between these two > > threads. Something like below: > > > > #define RSEQ_TEST_STATE_RUN_VCPU 0 // vcpu_run() > > #define RSEQ_TEST_STATE_MIGRATE 1 // sched_setaffinity() > > #define RSEQ_TEST_STATE_CHECK 2 // Check rseq.cpu_id and get_cpu() > > > > The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after RSEQ_TEST_STATE_RUN_VCPU Again, because one of the primary goals is to ensure the kernel is race free, the test should avoid full synchronization. > > > > - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With this, > > less time is needed to finish the test case. > > > > I'm able to recreate the issue on my local arm64 system. > > - From the source code, the iteration count is changed from 100000 to 1000 > - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other CPUs > are cleared from @possible_mask > - Run some CPU bound task on CPU#0 and CPU#1 > # while true; do taskset -c 0 ./a; done > # while true; do taskset -c 1 ./a; done > - Run 'rseq_test' and hit the issue At this point, this isn't a test bug. The test is right to complain that it didn't provide the coverage it's supposed to provide. If the bot failure is a one-off, my preference is to leave things as-is for now. If the failure is an ongoing issue, then we probably need to understand why the bot is failing. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail 2022-08-16 22:23 ` Sean Christopherson @ 2022-08-17 0:06 ` Gavin Shan 2022-09-05 10:47 ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail #forregzbot Thorsten Leemhuis 0 siblings, 1 reply; 7+ messages in thread From: Gavin Shan @ 2022-08-17 0:06 UTC (permalink / raw) To: Sean Christopherson Cc: kernel test robot, Paolo Bonzini, Oliver Upton, Andrew Jones, LKML, kvm, lkp, lkp, xudong.hao, regressions Hi Sean, On 8/17/22 8:23 AM, Sean Christopherson wrote: > On Tue, Aug 16, 2022, Gavin Shan wrote: >> On 8/16/22 3:02 PM, Gavin Shan wrote: >>> On 8/16/22 7:42 AM, Sean Christopherson wrote: >>>> On Mon, Aug 15, 2022, kernel test robot wrote: >>>>> commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test") >>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master >>>> >>>> ... >>>> >>>>> # selftests: kvm: rseq_test >>>>> # ==== Test Assertion Failure ==== >>>>> # rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2) >>>>> # pid=49599 tid=49599 errno=4 - Interrupted system call >>>>> # 1 0x000000000040265d: main at rseq_test.c:278 >>>>> # 2 0x00007fe44eed07fc: ?? ??:0 >>>>> # 3 0x00000000004026d9: _start at ??:? >>>>> # Only performed 23174 KVM_RUNs, task stalled too much? >>>>> # >>>>> not ok 56 selftests: kvm: rseq_test # exit=254 >>>> >>>> ... >>>> >>>>> # Automatically generated file; DO NOT EDIT. >>>>> # Linux/x86_64 5.19.0-rc6 Kernel Configuration >>>>> # >>>>> CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0" >>>>> CONFIG_CC_IS_GCC=y >>>>> CONFIG_GCC_VERSION=110300 >>>>> CONFIG_CLANG_VERSION=0 >>>>> CONFIG_AS_IS_GNU=y >>>>> CONFIG_AS_VERSION=23800 >>>>> CONFIG_LD_IS_BFD=y >>>>> CONFIG_LD_VERSION=23800 >>>>> CONFIG_LLD_VERSION=0 >>>> >>>> Assuming 23800 == 2.38, this is a known issue. >>>> >>>> https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com >>>> >>> >>> It's probably different story this time. > > Doh, if I had bothered to actually look at the error message... > Ok :) >>> The assert is triggered because of the following instructions. I would >>> guess the reason is vcpu thread has been running on CPU where we has high >>> CPU load. In this case, the vcpu thread can't be run in time. More >>> specific, the vcpu thread can't be run in the 1 - 10us time window, which >>> is specified by the migration worker (thread). >>> >>> TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2), >>> "Only performed %d KVM_RUNs, task stalled too much?\n", i); >>> >>> I think we need to improve the handshake mechanism between the vcpu thread >>> and migration worker. In current implementation, the handshake is done through >>> the atomic counter. The mechanism is simple enough, but vcpu thread can miss >>> the aforementioned time window. Another issue is the test case much more time >>> than expected to finish. > > There's not really an expected time to finish. The original purpose of the test > is to trigger a kernel race condition, so it's a balance between letting the test > run long enough to have some confidence that the kernel is bug free, and not running > so long that it wastes time. > Yeah, I was thinking of it. It's why I'm not 100% sure for my proposal, to have full synchronization. >>> Sean, if you think it's reasonable, I can figure out something to do: >>> >>> - Reuse the atomic counter for a full synchronization between these two >>> threads. Something like below: >>> >>> #define RSEQ_TEST_STATE_RUN_VCPU 0 // vcpu_run() >>> #define RSEQ_TEST_STATE_MIGRATE 1 // sched_setaffinity() >>> #define RSEQ_TEST_STATE_CHECK 2 // Check rseq.cpu_id and get_cpu() >>> >>> The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after RSEQ_TEST_STATE_RUN_VCPU > > Again, because one of the primary goals is to ensure the kernel is race free, the > test should avoid full synchronization. > Ok. >>> >>> - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With this, >>> less time is needed to finish the test case. >>> >> >> I'm able to recreate the issue on my local arm64 system. >> >> - From the source code, the iteration count is changed from 100000 to 1000 >> - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other CPUs >> are cleared from @possible_mask >> - Run some CPU bound task on CPU#0 and CPU#1 >> # while true; do taskset -c 0 ./a; done >> # while true; do taskset -c 1 ./a; done >> - Run 'rseq_test' and hit the issue > > At this point, this isn't a test bug. The test is right to complain that it didn't > provide the coverage it's supposed to provide. > > If the bot failure is a one-off, my preference is to leave things as-is for now. > If the failure is an ongoing issue, then we probably need to understand why the > bot is failing. > Yeah, the system for the coverage was likely having high CPU loads, which is similar to my (simulated) environment. I usually have my system being idle when running the coverage test cases. I didn't hit this specific failure before. Lets leave it as of being. We can improve if needed in future :) Thanks, Gavin ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail #forregzbot 2022-08-17 0:06 ` Gavin Shan @ 2022-09-05 10:47 ` Thorsten Leemhuis 0 siblings, 0 replies; 7+ messages in thread From: Thorsten Leemhuis @ 2022-09-05 10:47 UTC (permalink / raw) To: regressions; +Cc: LKML, kvm, lkp TWIMC: this mail is primarily send for documentation purposes and for regzbot, my Linux kernel regression tracking bot. These mails usually contain '#forregzbot' in the subject, to make them easy to spot and filter. On 17.08.22 02:06, Gavin Shan wrote: > > Yeah, the system for the coverage was likely having high CPU loads, > which is similar > to my (simulated) environment. I usually have my system being idle when > running the > coverage test cases. I didn't hit this specific failure before. > > Lets leave it as of being. We can improve if needed in future :) In that case: #regzbot invalid: 0-day report that apparently can be ignoreD Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) P.S.: As the Linux kernel's regression tracker I deal with a lot of reports and sometimes miss something important when writing mails like this. If that's the case here, don't hesitate to tell me in a public reply, it's in everyone's interest to set the public record straight. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail 2022-08-16 10:02 ` Gavin Shan 2022-08-16 22:23 ` Sean Christopherson @ 2022-11-01 8:32 ` Robert Hoo 1 sibling, 0 replies; 7+ messages in thread From: Robert Hoo @ 2022-11-01 8:32 UTC (permalink / raw) To: Gavin Shan, Sean Christopherson, kernel test robot Cc: Paolo Bonzini, Oliver Upton, Andrew Jones, LKML, kvm, lkp, lkp, xudong.hao, regressions On Tue, 2022-08-16 at 20:02 +1000, Gavin Shan wrote: > Hi Sean, > > On 8/16/22 3:02 PM, Gavin Shan wrote: > > On 8/16/22 7:42 AM, Sean Christopherson wrote: > > > On Mon, Aug 15, 2022, kernel test robot wrote: > > > > commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: > > > > selftests: Fix target thread to be migrated in rseq_test") > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git > > > > master > > > > > > ... > > > > > > > # selftests: kvm: rseq_test > > > > # ==== Test Assertion Failure ==== > > > > # rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2) > > > > # pid=49599 tid=49599 errno=4 - Interrupted system call > > > > # 1 0x000000000040265d: main at rseq_test.c:278 > > > > # 2 0x00007fe44eed07fc: ?? ??:0 > > > > # 3 0x00000000004026d9: _start at ??:? > > > > # Only performed 23174 KVM_RUNs, task stalled too much? > > > > # > > > > not ok 56 selftests: kvm: rseq_test # exit=254 > > > > > > ... > > > > > > > # Automatically generated file; DO NOT EDIT. > > > > # Linux/x86_64 5.19.0-rc6 Kernel Configuration > > > > # > > > > CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0" > > > > CONFIG_CC_IS_GCC=y > > > > CONFIG_GCC_VERSION=110300 > > > > CONFIG_CLANG_VERSION=0 > > > > CONFIG_AS_IS_GNU=y > > > > CONFIG_AS_VERSION=23800 > > > > CONFIG_LD_IS_BFD=y > > > > CONFIG_LD_VERSION=23800 > > > > CONFIG_LLD_VERSION=0 > > > > > > Assuming 23800 == 2.38, this is a known issue. > > > > > > https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com > > > > > > > It's probably different story this time. The assert is triggered > > because > > of the following instructions. I would guess the reason is vcpu > > thread > > has been running on CPU where we has high CPU load. In this case, > > the > > vcpu thread can't be run in time. More specific, the vcpu thread > > can't > > be run in the 1 - 10us time window, which is specified by the > > migration > > worker (thread). > > > > TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2), > > "Only performed %d KVM_RUNs, task stalled too > > much?\n", i); > > > > I think we need to improve the handshake mechanism between the vcpu > > thread > > and migration worker. In current implementation, the handshake is > > done through > > the atomic counter. The mechanism is simple enough, but vcpu thread > > can miss > > the aforementioned time window. Another issue is the test case much > > more time > > than expected to finish. > > > > Sean, if you think it's reasonable, I can figure out something to > > do: > > > > - Reuse the atomic counter for a full synchronization between these > > two > > threads. Something like below: > > > > #define RSEQ_TEST_STATE_RUN_VCPU 0 // vcpu_run() > > #define RSEQ_TEST_STATE_MIGRATE 1 // > > sched_setaffinity() > > #define RSEQ_TEST_STATE_CHECK 2 // Check > > rseq.cpu_id and get_cpu() > > > > The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after > > RSEQ_TEST_STATE_RUN_VCPU > > > > - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). > > With this, > > less time is needed to finish the test case. > > > > I'm able to recreate the issue on my local arm64 system. > > - From the source code, the iteration count is changed from 100000 to > 1000 > - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other > CPUs > are cleared from @possible_mask > - Run some CPU bound task on CPU#0 and CPU#1 > # while true; do taskset -c 0 ./a; done > # while true; do taskset -c 1 ./a; done > - Run 'rseq_test' and hit the issue > # ./rseq_test > calc_min_max_cpu: nproc=224 > calc_min_max_cpu: min_cpu=0, max_cpu=1 > main: Required tests: 1000 Succeeds: 1 > ==== Test Assertion Failure ==== > rseq_test.c:280: i > (NR_TASK_MIGRATIONS / 2) > pid=9624 tid=9624 errno=4 - Interrupted system call > 1 0x0000000000401cf3: main at rseq_test.c:280 > 2 0x0000ffffbc64679b: ?? ??:0 > 3 0x0000ffffbc64686b: ?? ??:0 > 4 0x0000000000401def: _start at ??:? > Only performed 1 KVM_RUNs, task stalled too much? > I met this as well, and can reproduce this easily on my Cascade-Lake machine, without heavy workload; before I find this mail. I'm wondering if it's because vcpu_run() has become much more time consuming than this test case was defined? BTW, if we substitute sys_getcpu(&cpu) with vdso_getcpu(), can reduce the reproduction odds. > Thanks, > Gavin > > ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2022-11-01 8:32 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <Yvn60W/JpPO8URLY@xsang-OptiPlex-9020>
2022-08-15 21:42 ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail Sean Christopherson
2022-08-16 5:02 ` Gavin Shan
2022-08-16 10:02 ` Gavin Shan
2022-08-16 22:23 ` Sean Christopherson
2022-08-17 0:06 ` Gavin Shan
2022-09-05 10:47 ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail #forregzbot Thorsten Leemhuis
2022-11-01 8:32 ` [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail Robert Hoo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox