* [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
@ 2020-01-08 21:38 osstest service owner
2020-01-08 23:14 ` Julien Grall
0 siblings, 1 reply; 7+ messages in thread
From: osstest service owner @ 2020-01-08 21:38 UTC (permalink / raw)
To: xen-devel, osstest-admin
flight 145796 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/145796/
Failures :-/ but no regressions.
Tests which are failing intermittently (not blocking):
test-amd64-amd64-xl-rtds 15 guest-saverestore fail in 145773 pass in 145796
test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 guest-start/debianhvm.repeat fail in 145773 pass in 145796
test-armhf-armhf-xl-rtds 12 guest-start fail in 145773 pass in 145796
test-armhf-armhf-xl 7 xen-boot fail pass in 145773
Tests which did not succeed, but are not blocking:
test-armhf-armhf-xl 13 migrate-support-check fail in 145773 never pass
test-armhf-armhf-xl 14 saverestore-support-check fail in 145773 never pass
test-amd64-amd64-xl-rtds 18 guest-localmigrate/x10 fail like 145725
test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop fail like 145725
test-armhf-armhf-libvirt 14 saverestore-support-check fail like 145725
test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop fail like 145725
test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop fail like 145725
test-armhf-armhf-xl-rtds 16 guest-start/debian.repeat fail like 145725
test-armhf-armhf-libvirt-raw 13 saverestore-support-check fail like 145725
test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop fail like 145725
test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop fail like 145725
test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop fail like 145725
test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop fail like 145725
test-amd64-i386-xl-pvshim 12 guest-start fail never pass
test-arm64-arm64-xl-seattle 13 migrate-support-check fail never pass
test-arm64-arm64-xl-seattle 14 saverestore-support-check fail never pass
test-amd64-amd64-libvirt-xsm 13 migrate-support-check fail never pass
test-amd64-i386-libvirt 13 migrate-support-check fail never pass
test-amd64-i386-libvirt-xsm 13 migrate-support-check fail never pass
test-amd64-amd64-libvirt 13 migrate-support-check fail never pass
test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2 fail never pass
test-arm64-arm64-xl 13 migrate-support-check fail never pass
test-arm64-arm64-xl 14 saverestore-support-check fail never pass
test-arm64-arm64-xl-xsm 13 migrate-support-check fail never pass
test-arm64-arm64-xl-xsm 14 saverestore-support-check fail never pass
test-arm64-arm64-xl-credit2 13 migrate-support-check fail never pass
test-arm64-arm64-xl-credit1 13 migrate-support-check fail never pass
test-arm64-arm64-xl-thunderx 13 migrate-support-check fail never pass
test-arm64-arm64-xl-credit1 14 saverestore-support-check fail never pass
test-arm64-arm64-xl-credit2 14 saverestore-support-check fail never pass
test-arm64-arm64-xl-thunderx 14 saverestore-support-check fail never pass
test-amd64-amd64-libvirt-vhd 12 migrate-support-check fail never pass
test-armhf-armhf-xl-arndale 13 migrate-support-check fail never pass
test-armhf-armhf-xl-arndale 14 saverestore-support-check fail never pass
test-armhf-armhf-xl-multivcpu 13 migrate-support-check fail never pass
test-armhf-armhf-xl-multivcpu 14 saverestore-support-check fail never pass
test-armhf-armhf-xl-cubietruck 13 migrate-support-check fail never pass
test-armhf-armhf-xl-cubietruck 14 saverestore-support-check fail never pass
test-armhf-armhf-xl-credit2 13 migrate-support-check fail never pass
test-armhf-armhf-xl-credit2 14 saverestore-support-check fail never pass
test-armhf-armhf-xl-credit1 13 migrate-support-check fail never pass
test-armhf-armhf-xl-credit1 14 saverestore-support-check fail never pass
test-armhf-armhf-libvirt 13 migrate-support-check fail never pass
test-armhf-armhf-xl-rtds 13 migrate-support-check fail never pass
test-armhf-armhf-xl-rtds 14 saverestore-support-check fail never pass
test-arm64-arm64-libvirt-xsm 13 migrate-support-check fail never pass
test-arm64-arm64-libvirt-xsm 14 saverestore-support-check fail never pass
test-armhf-armhf-libvirt-raw 12 migrate-support-check fail never pass
test-armhf-armhf-xl-vhd 12 migrate-support-check fail never pass
test-armhf-armhf-xl-vhd 13 saverestore-support-check fail never pass
test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop fail never pass
version targeted for testing:
xen 4dde27b6e0a0b0dcb8fdfc7580fbd9c976aa103f
baseline version:
xen 0dd92688080202adcc43dcb3486d4143110a66d5
Last test of basis 145725 2020-01-07 08:02:53 Z 1 days
Failing since 145749 2020-01-07 17:36:48 Z 1 days 3 attempts
Testing same since 145773 2020-01-08 02:49:59 Z 0 days 2 attempts
------------------------------------------------------------
People who touched revisions under test:
Andrew Cooper <andrew.cooper3@citrix.com>
Hongyan Xia <hongyxia@amazon.com>
Ian Jackson <Ian.Jackson@citrix.com>
Jan Beulich <jbeulich@suse.com>
Julien Grall <julien@xen.org>
Sergey Dyasli <sergey.dyasli@citrix.com>
Wei Liu <wei.liu2@citrix.com>
Wei Liu <wl@xen.org>
jobs:
build-amd64-xsm pass
build-arm64-xsm pass
build-i386-xsm pass
build-amd64-xtf pass
build-amd64 pass
build-arm64 pass
build-armhf pass
build-i386 pass
build-amd64-libvirt pass
build-arm64-libvirt pass
build-armhf-libvirt pass
build-i386-libvirt pass
build-amd64-prev pass
build-i386-prev pass
build-amd64-pvops pass
build-arm64-pvops pass
build-armhf-pvops pass
build-i386-pvops pass
test-xtf-amd64-amd64-1 pass
test-xtf-amd64-amd64-2 pass
test-xtf-amd64-amd64-3 pass
test-xtf-amd64-amd64-4 pass
test-xtf-amd64-amd64-5 pass
test-amd64-amd64-xl pass
test-arm64-arm64-xl pass
test-armhf-armhf-xl fail
test-amd64-i386-xl pass
test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm pass
test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm pass
test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm pass
test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm pass
test-amd64-amd64-xl-qemut-debianhvm-i386-xsm pass
test-amd64-i386-xl-qemut-debianhvm-i386-xsm pass
test-amd64-amd64-xl-qemuu-debianhvm-i386-xsm pass
test-amd64-i386-xl-qemuu-debianhvm-i386-xsm pass
test-amd64-amd64-libvirt-xsm pass
test-arm64-arm64-libvirt-xsm pass
test-amd64-i386-libvirt-xsm pass
test-amd64-amd64-xl-xsm pass
test-arm64-arm64-xl-xsm pass
test-amd64-i386-xl-xsm pass
test-amd64-amd64-qemuu-nested-amd fail
test-amd64-amd64-xl-pvhv2-amd pass
test-amd64-i386-qemut-rhel6hvm-amd pass
test-amd64-i386-qemuu-rhel6hvm-amd pass
test-amd64-amd64-xl-qemut-debianhvm-amd64 pass
test-amd64-i386-xl-qemut-debianhvm-amd64 pass
test-amd64-amd64-xl-qemuu-debianhvm-amd64 pass
test-amd64-i386-xl-qemuu-debianhvm-amd64 pass
test-amd64-i386-freebsd10-amd64 pass
test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
test-amd64-i386-xl-qemuu-ovmf-amd64 pass
test-amd64-amd64-xl-qemut-win7-amd64 fail
test-amd64-i386-xl-qemut-win7-amd64 fail
test-amd64-amd64-xl-qemuu-win7-amd64 fail
test-amd64-i386-xl-qemuu-win7-amd64 fail
test-amd64-amd64-xl-qemut-ws16-amd64 fail
test-amd64-i386-xl-qemut-ws16-amd64 fail
test-amd64-amd64-xl-qemuu-ws16-amd64 fail
test-amd64-i386-xl-qemuu-ws16-amd64 fail
test-armhf-armhf-xl-arndale pass
test-amd64-amd64-xl-credit1 pass
test-arm64-arm64-xl-credit1 pass
test-armhf-armhf-xl-credit1 pass
test-amd64-amd64-xl-credit2 pass
test-arm64-arm64-xl-credit2 pass
test-armhf-armhf-xl-credit2 pass
test-armhf-armhf-xl-cubietruck pass
test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict pass
test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict pass
test-amd64-amd64-examine pass
test-arm64-arm64-examine pass
test-armhf-armhf-examine pass
test-amd64-i386-examine pass
test-amd64-i386-freebsd10-i386 pass
test-amd64-amd64-qemuu-nested-intel pass
test-amd64-amd64-xl-pvhv2-intel pass
test-amd64-i386-qemut-rhel6hvm-intel pass
test-amd64-i386-qemuu-rhel6hvm-intel pass
test-amd64-amd64-libvirt pass
test-armhf-armhf-libvirt pass
test-amd64-i386-libvirt pass
test-amd64-amd64-livepatch pass
test-amd64-i386-livepatch pass
test-amd64-amd64-migrupgrade pass
test-amd64-i386-migrupgrade pass
test-amd64-amd64-xl-multivcpu pass
test-armhf-armhf-xl-multivcpu pass
test-amd64-amd64-pair pass
test-amd64-i386-pair pass
test-amd64-amd64-libvirt-pair pass
test-amd64-i386-libvirt-pair pass
test-amd64-amd64-amd64-pvgrub pass
test-amd64-amd64-i386-pvgrub pass
test-amd64-amd64-xl-pvshim pass
test-amd64-i386-xl-pvshim fail
test-amd64-amd64-pygrub pass
test-amd64-amd64-xl-qcow2 pass
test-armhf-armhf-libvirt-raw pass
test-amd64-i386-xl-raw pass
test-amd64-amd64-xl-rtds fail
test-armhf-armhf-xl-rtds fail
test-arm64-arm64-xl-seattle pass
test-amd64-amd64-xl-qemuu-debianhvm-amd64-shadow pass
test-amd64-i386-xl-qemuu-debianhvm-amd64-shadow pass
test-amd64-amd64-xl-shadow pass
test-amd64-i386-xl-shadow pass
test-arm64-arm64-xl-thunderx pass
test-amd64-amd64-libvirt-vhd pass
test-armhf-armhf-xl-vhd pass
------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images
Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs
Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master
Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary
Pushing revision :
To xenbits.xen.org:/home/xen/git/xen.git
0dd9268808..4dde27b6e0 4dde27b6e0a0b0dcb8fdfc7580fbd9c976aa103f -> master
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
2020-01-08 21:38 [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED osstest service owner
@ 2020-01-08 23:14 ` Julien Grall
2020-01-10 18:24 ` Julien Grall
0 siblings, 1 reply; 7+ messages in thread
From: Julien Grall @ 2020-01-08 23:14 UTC (permalink / raw)
To: osstest service owner, Dario Faggioli, George Dunlap,
Jürgen Groß, Stefano Stabellini
Cc: xen-devel
On Wed, 8 Jan 2020 at 21:40, osstest service owner
<osstest-admin@xenproject.org> wrote:
>
> flight 145796 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/145796/
>
> Failures :-/ but no regressions.
>
> Tests which are failing intermittently (not blocking):
> test-amd64-amd64-xl-rtds 15 guest-saverestore fail in 145773 pass in 145796
> test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 guest-start/debianhvm.repeat fail in 145773 pass in 145796
> test-armhf-armhf-xl-rtds 12 guest-start fail in 145773 pass in 145796
It looks like this test has been failing for a while (although not reliably).
I looked at a few flights, and the cause seems to be the same:
Jan 8 15:02:14.700784 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
Jan 8 15:02:26.715030 (XEN) ----[ Xen-4.14-unstable arm32 debug=y Not tainted ]----
Jan 8 15:02:26.720756 (XEN) CPU: 1
Jan 8 15:02:26.722158 (XEN) PC: 0023a750 common/sched_rt.c#replq_insert+0x7c/0xcc
Jan 8 15:02:26.727851 (XEN) CPSR: 200300da MODE:Hypervisor
Jan 8 15:02:26.731334 (XEN) R0: 002a51a4 R1: 400614a0 R2: 3d64b900 R3: 40061338
Jan 8 15:02:26.736830 (XEN) R4: 400614a0 R5: 002a51a4 R6: 3cf1cbf0 R7: 000001cb
Jan 8 15:02:26.742600 (XEN) R8: 4003d1b0 R9: 400614a8 R10:4003d1b0 R11:400ffe54 R12:400ffde4
Jan 8 15:02:26.749119 (XEN) HYP: SP: 400ffe2c LR: 0023b6e8
Jan 8 15:02:26.752296 (XEN)
Jan 8 15:02:26.753036 (XEN) VTCR_EL2: 80003558
Jan 8 15:02:26.755479 (XEN) VTTBR_EL2: 00020000bbff4000
Jan 8 15:02:26.758757 (XEN)
Jan 8 15:02:26.759366 (XEN) SCTLR_EL2: 30cd187f
Jan 8 15:02:26.761755 (XEN) HCR_EL2: 0078663f
Jan 8 15:02:26.764250 (XEN) TTBR0_EL2: 00000000bc029000
Jan 8 15:02:26.767364 (XEN)
Jan 8 15:02:26.767980 (XEN) ESR_EL2: 00000000
Jan 8 15:02:26.770485 (XEN) HPFAR_EL2: 00030010
Jan 8 15:02:26.772795 (XEN) HDFAR: e0800f00
Jan 8 15:02:26.775272 (XEN) HIFAR: c0605744
Jan 8 15:02:26.777748 (XEN)
Jan 8 15:02:26.778505 (XEN) Xen stack trace from sp=400ffe2c:
Jan 8 15:02:26.781910 (XEN) 00000000 3cf1cbf0 400614a0 002a51a4 3cf1cbf0 000001cb 4003d1b0 6003005a
Jan 8 15:02:26.788991 (XEN) 400613f8 400ffe7c 0023b6e8 002f9300 4004c000 400613f8 3cf1cbf0 000001cb
Jan 8 15:02:26.796093 (XEN) 4003d1b0 6003005a 400613f8 400ffeac 00242988 4004c000 002425ac 40058000
Jan 8 15:02:26.803237 (XEN) 4004c000 4004f000 10f45000 10f45008 4004b080 40058000 60030013 400ffebc
Jan 8 15:02:26.810360 (XEN) 00209984 00000002 4004f000 400ffedc 0020eddc 0020caf8 db097cd4 00000020
Jan 8 15:02:26.817504 (XEN) c13afbec 00000000 db15fd68 400ffee4 0020c9dc 400fff34 0020d5e8 4004e000
Jan 8 15:02:26.824615 (XEN) 00000000 400fff44 400fff44 00000002 00000000 4004e8fa 4004e8f4 400fff1c
Jan 8 15:02:26.831737 (XEN) 400fff1c 6003005a 0020caf8 400fff58 00000020 c13afbec 00000000 db15fd68
Jan 8 15:02:26.838798 (XEN) 60030013 400fff54 0026c150 c1204d08 c13afbec 00000000 00000000 00000000
Jan 8 15:02:26.845877 (XEN) 00000002 400fff58 002753b0 00000009 db097cd4 db173008 00000002 c1204d08
Jan 8 15:02:26.852986 (XEN) 00000000 00000002 c13afbec 00000000 db15fd68 60030013 db15fd3c 00000020
Jan 8 15:02:26.860044 (XEN) ffffffff b6cdccb3 c0107ed0 a0030093 4a000ea1 be951568 c136edc0 c010d3a0
Jan 8 15:02:26.867171 (XEN) db097cd0 c056c7f8 c136edcc c010d720 c136edd8 c010d7e0 00000000 00000000
Jan 8 15:02:26.874526 (XEN) 00000000 00000000 00000000 c136ede4 c136ede4 00030030 60070193 80030093
Jan 8 15:02:26.881450 (XEN) 60030193 00000000 00000000 00000000 00000001
Jan 8 15:02:26.886519 (XEN) Xen call trace:
Jan 8 15:02:26.888168 (XEN) [<0023a750>] common/sched_rt.c#replq_insert+0x7c/0xcc (PC)
Jan 8 15:02:26.894240 (XEN) [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274 (LR)
Jan 8 15:02:26.900246 (XEN) [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274
Jan 8 15:02:26.905775 (XEN) [<00242988>] vcpu_wake+0x1e4/0x688
Jan 8 15:02:26.909743 (XEN) [<00209984>] domain_unpause+0x64/0x84
Jan 8 15:02:26.913956 (XEN) [<0020eddc>] common/event_fifo.c#evtchn_fifo_unmask+0xd8/0xf0
Jan 8 15:02:26.920167 (XEN) [<0020c9dc>] evtchn_unmask+0x7c/0xc0
Jan 8 15:02:26.924173 (XEN) [<0020d5e8>] do_event_channel_op+0xaf0/0xdac
Jan 8 15:02:26.928922 (XEN) [<0026c150>] do_trap_guest_sync+0x350/0x4d0
Jan 8 15:02:26.933647 (XEN) [<002753b0>] entry.o#return_from_trap+0/0x4
Jan 8 15:02:26.938299 (XEN)
Jan 8 15:02:26.939039 (XEN)
Jan 8 15:02:26.939668 (XEN) ****************************************
Jan 8 15:02:26.943794 (XEN) Panic on CPU 1:
Jan 8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
Jan 8 15:02:26.951492 (XEN) ****************************************
I believe the domain_unpause() call comes from guest_clear_bit(). This
would mean the atomic operation didn't succeed without pausing the
domain. This makes sense as, per the log:
CPU1: Guest atomics will try 1 times before pausing the domain
I am under the impression that the crash could be reproduced with just:
domain_pause_nosync(current->domain);
domain_unpause(current->domain);
Any insights into what's wrong? I am happy to try to reproduce it tomorrow morning.
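The retry-then-pause behaviour that the log line describes can be sketched as a toy model in C. This is only an illustration under assumptions, not the Xen implementation: `struct toy_domain`, `toy_try_clear_bit()`, `toy_guest_clear_bit()` and the `contention` counter are all made-up names, and the pause/unpause calls are modelled as plain counter updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a domain; NOT the real struct domain. */
struct toy_domain {
    int pause_count;          /* > 0 while the (toy) domain is paused */
    unsigned int slow_paths;  /* how often the pause fallback was taken */
};

/*
 * Hypothetical bounded primitive: a single attempt at clearing a bit.
 * *contention says how many upcoming attempts should fail, standing in
 * for an atomic update that keeps losing against the guest.
 */
static bool toy_try_clear_bit(unsigned int nr, uint32_t *word,
                              unsigned int *contention)
{
    assert(nr < 32);
    if (*contention > 0) {
        (*contention)--;
        return false;
    }
    *word &= ~(UINT32_C(1) << nr);
    return true;
}

/* Try max_try times; if all attempts fail, pause, update, unpause. */
static void toy_guest_clear_bit(struct toy_domain *d, unsigned int nr,
                                uint32_t *word, unsigned int max_try,
                                unsigned int *contention)
{
    assert(nr < 32);
    for (unsigned int i = 0; i < max_try; i++)
        if (toy_try_clear_bit(nr, word, contention))
            return;                   /* fast path: no pause needed */

    d->pause_count++;                 /* models domain_pause_nosync(d) */
    d->slow_paths++;
    *word &= ~(UINT32_C(1) << nr);    /* guest cannot interfere now */
    d->pause_count--;                 /* models domain_unpause(d) */
}
```

With `max_try` set to 1, as in the log line above, a single failed attempt is enough to take the pause/unpause path, which is why the call trace can end up in domain_unpause().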
Cheers,
* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
2020-01-08 23:14 ` Julien Grall
@ 2020-01-10 18:24 ` Julien Grall
2020-01-10 23:30 ` Julien Grall
2020-01-22 3:40 ` Dario Faggioli
0 siblings, 2 replies; 7+ messages in thread
From: Julien Grall @ 2020-01-10 18:24 UTC (permalink / raw)
To: Julien Grall, osstest service owner, Dario Faggioli,
George Dunlap, Jürgen Groß, Stefano Stabellini
Cc: xen-devel
Hi all,
On 08/01/2020 23:14, Julien Grall wrote:
> On Wed, 8 Jan 2020 at 21:40, osstest service owner
> <osstest-admin@xenproject.org> wrote:
>>
>> flight 145796 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/145796/
>>
>> Failures :-/ but no regressions.
>>
>> Tests which are failing intermittently (not blocking):
>> test-amd64-amd64-xl-rtds 15 guest-saverestore fail in 145773 pass in 145796
>> test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 guest-start/debianhvm.repeat fail in 145773 pass in 145796
>> test-armhf-armhf-xl-rtds 12 guest-start fail in 145773 pass in 145796
>
> It looks like this test has been failing for a while (although not reliably).
> I looked at a few flights, the cause seems to be the same:
>
> [...]
>
> I believe the domain_unpause() is coming from guest_clear_bit(). This
> would mean the atomics didn't succeed without pausing the domain. This
> makes sense as, per the log:
>
> CPU1: Guest atomics will try 1 times before pausing the domain
>
> I am under the impression that the crash could be reproduced with just:
>
> domain_pause_nosync(current->domain);
> domain_unpause(current->domain);
>
> Any insights what's wrong? I am happy to try to reproduce it tomorrow morning.
So I managed to reproduce it on Arm by hacking the hypercall path to call:
domain_pause_nosync(current->domain);
domain_unpause(current->domain);
With a debug build and a 2-vCPU dom0, the crash happens within a few
seconds. When the unit is not scheduled, rt_unit_wake() expects the unit
to be in none of the queues.
The interaction is as follows:
CPU0                                  | CPU1
                                      |
do_domain_pause()                     |
 -> atomic_inc(&d->pause_count)       |
 -> vcpu_sleep_nosync(vCPU A)         | schedule()
                                      |  -> Lock
                                      |  -> rt_schedule()
                                      |     -> snext = runq_pick(...)
                                      |        /* returns unit A (aka vCPU A) */
                                      |        /* Unit is not runnable */
                                      |     -> Remove from the q
                                      |  [....]
                                      |  -> Unlock
    -> Lock                           |
    -> rt_unit_sleep()                |
       /* Unit not scheduled */       |
       /* Nothing to do */            |
Note that on Arm, each vCPU has its own scheduling unit.
When schedule() grabs the lock first (as shown above), the unit is only
removed from the run queue (runQ/depletedQ). However, when
vcpu_sleep_nosync() grabs the lock first and the unit is not scheduled,
rt_unit_sleep() removes the unit from both queues (runQ/depletedQ and
replenishQ). In the first case the unit is therefore left on the
replenishQ, which trips the assertion in replq_insert() on the next
wake-up.
So I think we want schedule() to remove the unit from both queues if it
is not runnable. Any opinions?
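To make the ordering concrete, here is a toy model in C of the interleaving described above. It is a sketch under assumptions, not code from sched_rt.c: queue membership is reduced to two booleans, and every `toy_` name is hypothetical. The `also_replq` flag selects between the behaviour described above (run queue only) and the proposed one (both queues).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an RTDS unit; queue membership is just two flags. */
struct toy_rt_unit {
    bool runnable;   /* false once pause_count has been raised */
    bool scheduled;  /* currently running on a pCPU */
    bool on_runq;    /* stands in for runQ/depletedQ membership */
    bool on_replq;   /* stands in for replenishQ membership */
};

/* CPU1: the scheduler picked a unit that turns out not to be runnable. */
static void toy_schedule_drop(struct toy_rt_unit *u, bool also_replq)
{
    assert(u->on_runq && !u->scheduled);  /* it was picked from the runq */
    if (u->runnable)
        return;
    u->on_runq = false;
    if (also_replq)
        u->on_replq = false;              /* the proposed extra step */
}

/* CPU0: sleep path, modelled on the description in this thread. */
static void toy_unit_sleep(struct toy_rt_unit *u)
{
    if (u->scheduled)
        return;                /* real code would kick the pCPU instead */
    if (u->on_runq) {
        u->on_runq = false;    /* leave both queues */
        u->on_replq = false;
    }
    /* else: unit not queued, nothing to do */
}

/* Returns false where the real code would hit the replq assertion. */
static bool toy_unit_wake(struct toy_rt_unit *u)
{
    u->runnable = true;
    if (u->scheduled || u->on_runq)
        return true;           /* already known to the scheduler */
    if (u->on_replq)
        return false;          /* models ASSERT(!unit_on_replq(svc)) */
    u->on_replq = true;
    u->on_runq = true;
    return true;
}

/* Replays the interleaving where schedule() takes the lock first. */
static bool toy_pause_unpause_race(bool also_replq)
{
    struct toy_rt_unit a = { .runnable = true, .scheduled = false,
                             .on_runq = true, .on_replq = true };

    a.runnable = false;                 /* CPU0: pause_count incremented */
    toy_schedule_drop(&a, also_replq);  /* CPU1: wins the lock */
    toy_unit_sleep(&a);                 /* CPU0: finds nothing to do */
    return toy_unit_wake(&a);           /* CPU0: domain_unpause() */
}
```

In this model, `toy_pause_unpause_race(false)` reports the assertion case and `toy_pause_unpause_race(true)` does not, which matches the reasoning above but says nothing about whether removing from both queues is the right fix in the real scheduler.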
Cheers,
--
Julien Grall
* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
2020-01-10 18:24 ` Julien Grall
@ 2020-01-10 23:30 ` Julien Grall
2020-01-22 3:40 ` Dario Faggioli
1 sibling, 0 replies; 7+ messages in thread
From: Julien Grall @ 2020-01-10 23:30 UTC (permalink / raw)
To: Julien Grall, osstest service owner, Dario Faggioli,
George Dunlap, Jürgen Groß, Stefano Stabellini,
xumengpanda
Cc: xen-devel
(+ Meng)
Hi,
Sorry I forgot to cc the RTDS scheduler maintainer.
On 10/01/2020 18:24, Julien Grall wrote:
> Hi all,
>
> On 08/01/2020 23:14, Julien Grall wrote:
>> On Wed, 8 Jan 2020 at 21:40, osstest service owner
>> <osstest-admin@xenproject.org> wrote:
>>>
>>> flight 145796 xen-unstable real [real]
>>> http://logs.test-lab.xenproject.org/osstest/logs/145796/
>>>
>>> Failures :-/ but no regressions.
>>>
>>> Tests which are failing intermittently (not blocking):
>>> test-amd64-amd64-xl-rtds 15 guest-saverestore fail in 145773
>>> pass in 145796
>>> test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16
>>> guest-start/debianhvm.repeat fail in 145773 pass in 145796
>>> test-armhf-armhf-xl-rtds 12 guest-start fail in 145773
>>> pass in 145796
>>
>> It looks like this test has been failing for a while (although not
>> reliably).
>> I looked at a few flights, the cause seems to be the same:
>>
>> [...]
>>
>> I believe the domain_unpause() is coming from guest_clear_bit(). This
>> would mean the atomics didn't succeed without pausing the domain. This
>> makes sense as, per the log:
>>
>> CPU1: Guest atomics will try 1 times before pausing the domain
>>
>> I am under the impression that the crash could be reproduced with just:
>>
>> domain_pause_nosync(current->domain);
>> domain_unpause(current->domain);
>>
>> Any insights what's wrong? I am happy to try to reproduce it tomorrow
>> morning.
>
> So I managed to reproduce it on Arm by hacking the hypercall path to call:
>
> domain_pause_nosync(current->domain);
> domain_unpause(current->domain);
>
> With a debug build and with a 2 vCPU dom0 the crash happen in a few
> seconds. When the unit is not scheduled, rt_unit_wake() expects the unit
> to be in none of the queues.
>
> The interaction is as following:
>
> CPU0                                  | CPU1
>                                       |
> do_domain_pause()                     |
>  -> atomic_inc(&d->pause_count)       |
>  -> vcpu_sleep_nosync(vCPU A)         | schedule()
>                                       |  -> Lock
>                                       |  -> rt_schedule()
>                                       |     -> snext = runq_pick(...)
>                                       |        /* returns unit A (aka vCPU A) */
>                                       |        /* Unit is not runnable */
>                                       |     -> Remove from the q
>                                       |  [....]
>                                       |  -> Unlock
>     -> Lock                           |
>     -> rt_unit_sleep()                |
>        /* Unit not scheduled */       |
>        /* Nothing to do */            |
>
> Note that on Arm, each vCPU has its own scheduling unit.
>
> When schedule() grab the lock first (as shown above), the unit will only
> be removed from the Q. However, when vcpu_sleep_nosync() grab the lock
> first and the unit was not scheduled, rt_unit_sleep() will remove the
> unit from two queues (runQ/depleteQ and replenishQ).
>
> So I think we want schedule() to remove the unit from the 2 queues if it
> is not runnable. Any opinions?
>
> Cheers,
>
--
Julien Grall
* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
2020-01-10 18:24 ` Julien Grall
2020-01-10 23:30 ` Julien Grall
@ 2020-01-22 3:40 ` Dario Faggioli
2020-02-02 12:57 ` Julien Grall
1 sibling, 1 reply; 7+ messages in thread
From: Dario Faggioli @ 2020-01-22 3:40 UTC (permalink / raw)
To: Julien Grall, Julien Grall, osstest service owner, George Dunlap,
Jürgen Groß, Stefano Stabellini
Cc: xen-devel
On Fri, 2020-01-10 at 18:24 +0000, Julien Grall wrote:
> Hi all,
>
Hi Julien,
I was looking at this, and I have a couple of questions...
> On 08/01/2020 23:14, Julien Grall wrote:
> > On Wed, 8 Jan 2020 at 21:40, osstest service owner
> > <osstest-admin@xenproject.org> wrote:
> > ****************************************
> > Jan 8 15:02:26.943794 (XEN) Panic on CPU 1:
> > Jan 8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed
> > at
> > sched_rt.c:586
> > Jan 8 15:02:26.951492 (XEN)
> > ****************************************
> >
> So I managed to reproduce it on Arm by hacking the hypercall path to
> call:
>
> domain_pause_nosync(current->domain);
> domain_unpause(current->domain);
>
> With a debug build and with a 2 vCPU dom0 the crash happen in a few
> seconds. When the unit is not scheduled, rt_unit_wake() expects the
> unit
> to be in none of the queues.
>
> The interaction is as following:
>
> CPU0                                  | CPU1
>                                       |
> do_domain_pause()                     |
>  -> atomic_inc(&d->pause_count)       |
>  -> vcpu_sleep_nosync(vCPU A)         | schedule()
>                                       |  -> Lock
>                                       |  -> rt_schedule()
>                                       |     -> snext = runq_pick(...)
>                                       |        /* returns unit A (aka vCPU A) */
>                                       |        /* Unit is not runnable */
>                                       |     -> Remove from the q
>                                       |  [....]
>                                       |  -> Unlock
>     -> Lock                           |
>     -> rt_unit_sleep()                |
>        /* Unit not scheduled */       |
>        /* Nothing to do */            |
>
Thanks a lot for the analysis. As said above, just a few questions, to
be sure I'm understanding properly what is happening.
You have a 2-vCPU dom0; how many vCPUs from other domains are there?
Or do you only have those 2 dom0 vCPUs, and you are actually pausing
dom0?
In general, what is running (I mean which vcpu) on CPU0, when the
domain_pause() happens? And what is running on CPU1 when schedule()
happens?
If you just have the 2 dom0's vCPUs around (and we call them vCPU A and
vCPU B), the only case for which I can imagine runq_pick() returning A
on CPU1 would be if CPU0 would be running vCPU B (and invoked the
hypercall from it) and CPU1 was idle... is this the case?
> When schedule() grabs the lock first (as shown above), the unit will
> only be removed from the Q. However, when vcpu_sleep_nosync() grabs the
> lock first and the unit was not scheduled, rt_unit_sleep() will remove
> the unit from two queues (runQ/depleteQ and replenishQ).
>
> So I think we want schedule() to remove the unit from the 2 queues if
> it is not runnable. Any opinions?
>
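For reference, this is roughly how I read the proposal above, as a toy C sketch. The names (struct toy_unit, toy_dequeue_all(), toy_schedule_pick_fixed()) are hypothetical and only model queue membership with flags; this is not a patch against sched_rt.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-unit state: only queue membership and runnability. */
struct toy_unit {
    bool on_runq;   /* queued on runQ/depletedQ */
    bool on_replq;  /* queued on the replenishment queue */
    bool runnable;  /* false once the domain has been paused */
};

/* Drop the unit from every queue it may be on. */
static void toy_dequeue_all(struct toy_unit *u)
{
    u->on_runq = false;
    u->on_replq = false;
}

/* Scheduler-side handling of a picked unit under the proposal.
 * Returns true if the unit can run; false means the caller falls back
 * to idle, after the unit has been removed from both queues. */
static bool toy_schedule_pick_fixed(struct toy_unit *u)
{
    if (u->runnable)
        return true;
    toy_dequeue_all(u);
    return false;
}
```

With this variant, a non-runnable unit ends up on neither queue regardless of which side takes the lock first.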
Mmm... that may work, but I'm not sure.
In fact, I'm starting to think that patch 7c7b407e777 "xen/sched:
introduce unit_runnable_state()", which added the 'q_remove(snext)' in
rt_schedule() might not be correct.
In fact, if runq_pick() returns a vCPU which is in the runqueue, but is
not runnable (e.g., because we're racing with do_domain_pause(), which
already set pause_count), it's not rt_schedule()'s job to dequeue it from
anything.
We probably should just ignore it and pick another vCPU, if any (and
idle otherwise). Then, after we release the lock, it will be
rt_unit_sleep(), called by do_domain_pause() in this case, that will
finish the job of properly dequeueing it...
Another strange thing is that, as the code looks right now, runq_pick()
returns the first unit in the runq (i.e., the one with the earliest
deadline), without checking whether it is runnable. Then, in
rt_schedule(), if the unit is not runnable, we (only partially, as you
figured out) dequeue it, and use idle instead, as our candidate for
being the next scheduled unit... But what if there were other
*runnable* units in the runqueue?
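As a rough illustration of what I have in mind, here is a toy C sketch of a pick function that skips non-runnable units instead of dequeueing them. It assumes the run queue is an array already sorted by ascending deadline; struct toy_unit and toy_runq_pick() are made-up names, and the real runq_pick() does more than this (CPU affinity is ignored here, for example).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical run queue entry: deadline plus a runnable flag. */
struct toy_unit {
    unsigned long deadline;
    bool runnable;
};

/* Return the first runnable unit in deadline order, or NULL when there is
 * none (meaning: schedule idle). Non-runnable units are skipped and left
 * where they are, so that the sleep path can dequeue them later. */
static const struct toy_unit *toy_runq_pick(const struct toy_unit *runq,
                                            size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (runq[i].runnable)
            return &runq[i];
    }
    return NULL;
}
```

The point is that a paused unit at the head of the queue no longer forces idle to be chosen when a later unit could run.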
Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
2020-01-22 3:40 ` Dario Faggioli
@ 2020-02-02 12:57 ` Julien Grall
2020-02-02 13:15 ` Dario Faggioli
0 siblings, 1 reply; 7+ messages in thread
From: Julien Grall @ 2020-02-02 12:57 UTC (permalink / raw)
To: Dario Faggioli, Julien Grall, osstest service owner,
George Dunlap, Jürgen Groß, Stefano Stabellini
Cc: xen-devel, xumengpanda
Hi Dario,
Apologies for the late answer.
On 22/01/2020 03:40, Dario Faggioli wrote:
> On Fri, 2020-01-10 at 18:24 +0000, Julien Grall wrote:
>> Hi all,
>>
> Hi Julien,
>
> I was looking at this, and I have a couple of questions...
>
>> On 08/01/2020 23:14, Julien Grall wrote:
>>> On Wed, 8 Jan 2020 at 21:40, osstest service owner
>>> <osstest-admin@xenproject.org> wrote:
>>> ****************************************
>>> Jan 8 15:02:26.943794 (XEN) Panic on CPU 1:
>>> Jan 8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
>>> Jan 8 15:02:26.951492 (XEN)
>>> ****************************************
>>>
>> So I managed to reproduce it on Arm by hacking the hypercall path to
>> call:
>>
>> domain_pause_nosync(current->domain);
>> domain_unpause(current->domain);
>>
>> With a debug build and with a 2 vCPU dom0 the crash happens in a few
>> seconds. When the unit is not scheduled, rt_unit_wake() expects the
>> unit to be in none of the queues.
>>
>> The interaction is as follows:
>>
>>     CPU0                              | CPU1
>>                                       |
>> do_domain_pause()                     |
>>  -> atomic_inc(&d->pause_count)       |
>>  -> vcpu_sleep_nosync(vCPU A)         | schedule()
>>                                       |  -> Lock
>>                                       |  -> rt_schedule()
>>                                       |     -> snext = runq_pick(...)
>>                                       |        /* return unit A (aka vCPU A) */
>>                                       |     /* Unit is not runnable */
>>                                       |     -> Remove from the q
>>                                       |  [....]
>>                                       |  -> Lock
>>     -> Lock                           |
>>     -> rt_unit_sleep()                |
>>        /* Unit not scheduled */       |
>>        /* Nothing to do */            |
>>
> Thanks a lot for the analysis. As said above, just a few questions, to
> be sure I'm understanding properly what is happening.
>
> You have a 2 vCPUs dom0, and how many other vCPUs from other domains?
> Or do you only have those 2 dom0 vCPUs and you are actually pausing
> dom0?
Only dom0 with 2 vCPUs is running. On every hypercall, it will try to
pause/unpause itself. This is to roughly match the behavior of the Arm
guest atomic helpers.
>
> In general, what is running (I mean which vcpu) on CPU0, when the
> domain_pause() happens? And what is running on CPU1 when schedule()
> happens?
>
> If you just have the 2 dom0's vCPUs around (and we call them vCPU A and
> vCPU B), the only case for which I can imagine runq_pick() returning A
> on CPU1 would be if CPU0 would be running vCPU B (and invoked the
> hypercall from it) and CPU1 was idle... is this the case?
This is indeed the case. The schedule() on CPU1 has happened because
vCPU A was woken up (e.g. an interrupt was received and injected to the
vCPU).
>
>> When schedule() grabs the lock first (as shown above), the unit will
>> only be removed from the Q. However, when vcpu_sleep_nosync() grabs the
>> lock first and the unit was not scheduled, rt_unit_sleep() will remove
>> the unit from two queues (runQ/depleteQ and replenishQ).
>>
>> So I think we want schedule() to remove the unit from the 2 queues if
>> it is not runnable. Any opinions?
>>
> Mmm... that may work, but I'm not sure.
>
> In fact, I'm starting to think that patch 7c7b407e777 "xen/sched:
> introduce unit_runnable_state()", which added the 'q_remove(snext)' in
> rt_schedule() might not be correct.
I have tested Xen before this commit and didn't manage to reproduce the
crash. As soon as I add the commit, it crashes quite quickly.
>
> In fact, if runq_pick() returns a vCPU which is in the runqueue, but is
> not runnable (e.g., because we're racing with do_domain_pause(), which
> already set pause_count), it's not rt_schedule()'s job to dequeue it from
> anything.
>
> We probably should just ignore it and pick another vCPU, if any (and
> idle otherwise). Then, after we release the lock, it will be
> rt_unit_sleep(), called by do_domain_pause() in this case, that will
> finish the job of properly dequeueing it...
>
> Another strange thing is that, as the code looks right now, runq_pick()
> returns the first unit in the runq (i.e., the one with the earliest
> deadline), without checking whether it is runnable. Then, in
> rt_schedule(), if the unit is not runnable, we (only partially, as you
> figured out) dequeue it, and use idle instead, as our candidate for
> being the next scheduled unit... But what if there were other
> *runnable* units in the runqueue?
My knowledge of the scheduler is quite limited. Maybe Meng would be able
to answer this question?
Cheers,
--
Julien Grall
* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
2020-02-02 12:57 ` Julien Grall
@ 2020-02-02 13:15 ` Dario Faggioli
0 siblings, 0 replies; 7+ messages in thread
From: Dario Faggioli @ 2020-02-02 13:15 UTC (permalink / raw)
To: Julien Grall, Julien Grall, osstest service owner, George Dunlap,
Jürgen Groß, Stefano Stabellini
Cc: xen-devel, xumengpanda
On Sun, 2020-02-02 at 12:57 +0000, Julien Grall wrote:
> Hi Dario,
>
Hi,
> Apologies for the late answer.
>
No problem, I also have not had any more time to look into this yet.
> On 22/01/2020 03:40, Dario Faggioli wrote:
> > On Fri, 2020-01-10 at 18:24 +0000, Julien Grall wrote:
> > >
> > You have a 2 vCPUs dom0, and how many other vCPUs from other domains?
> > Or do you only have those 2 dom0 vCPUs and you are actually pausing
> > dom0?
>
> Only dom0 with 2 vCPUs is running. On every hypercall, it will try to
> pause/unpause itself.
>
Ok, that was my understanding, but I wasn't 100% sure. Thanks for
confirming.
> This is to roughly match the behavior of the Arm
> guest atomic helpers.
>
Yep, makes sense.
> > If you just have the 2 dom0's vCPUs around (and we call them vCPU A and
> > vCPU B), the only case for which I can imagine runq_pick() returning A
> > on CPU1 would be if CPU0 would be running vCPU B (and invoked the
> > hypercall from it) and CPU1 was idle... is this the case?
>
> This is indeed the case. The schedule() on CPU1 has happened because
> vCPU A was woken up (e.g. an interrupt was received and injected to the
> vCPU).
>
Right.
> > In fact, I'm starting to think that patch 7c7b407e777 "xen/sched:
> > introduce unit_runnable_state()", which added the 'q_remove(snext)' in
> > rt_schedule() might not be correct.
>
> I have tested Xen before this commit and didn't manage to reproduce the
> crash. As soon as I add the commit, it crashes quite quickly.
>
Ok, thanks for checking this as well. That's very useful.
> > In fact, if runq_pick() returns a vCPU which is in the runqueue, but is
> > not runnable (e.g., because we're racing with do_domain_pause(), which
> > already set pause_count), it's not rt_schedule()'s job to dequeue it from
> > anything.
> >
> > We probably should just ignore it and pick another vCPU, if any (and
> > idle otherwise). Then, after we release the lock, it will be
> > rt_unit_sleep(), called by do_domain_pause() in this case, that will
> > finish the job of properly dequeueing it...
> >
> > Another strange thing is that, as the code looks right now, runq_pick()
> > returns the first unit in the runq (i.e., the one with the earliest
> > deadline), without checking whether it is runnable. Then, in
> > rt_schedule(), if the unit is not runnable, we (only partially, as you
> > figured out) dequeue it, and use idle instead, as our candidate for
> > being the next scheduled unit... But what if there were other
> > *runnable* units in the runqueue?
>
> My knowledge of the scheduler is quite limited. Maybe Meng would be able
> to answer this question?
>
Yes, indeed, here I was pretty much thinking out loud, and trying to
trigger comments from Meng.
Anyway, I'll see about putting together a quick test patch that
implements what I described (next week), and let's see if it works.
Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Thread overview: 7+ messages
2020-01-08 21:38 [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED osstest service owner
2020-01-08 23:14 ` Julien Grall
2020-01-10 18:24 ` Julien Grall
2020-01-10 23:30 ` Julien Grall
2020-01-22 3:40 ` Dario Faggioli
2020-02-02 12:57 ` Julien Grall
2020-02-02 13:15 ` Dario Faggioli