* [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
@ 2020-01-08 21:38 osstest service owner
  2020-01-08 23:14 ` Julien Grall
  0 siblings, 1 reply; 7+ messages in thread
From: osstest service owner @ 2020-01-08 21:38 UTC
  To: xen-devel, osstest-admin

flight 145796 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/145796/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-rtds    15 guest-saverestore fail in 145773 pass in 145796
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 guest-start/debianhvm.repeat fail in 145773 pass in 145796
 test-armhf-armhf-xl-rtds     12 guest-start      fail in 145773 pass in 145796
 test-armhf-armhf-xl           7 xen-boot                   fail pass in 145773

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl         13 migrate-support-check fail in 145773 never pass
 test-armhf-armhf-xl     14 saverestore-support-check fail in 145773 never pass
 test-amd64-amd64-xl-rtds     18 guest-localmigrate/x10       fail  like 145725
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop            fail like 145725
 test-armhf-armhf-libvirt     14 saverestore-support-check    fail  like 145725
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop            fail like 145725
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop             fail like 145725
 test-armhf-armhf-xl-rtds     16 guest-start/debian.repeat    fail  like 145725
 test-armhf-armhf-libvirt-raw 13 saverestore-support-check    fail  like 145725
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop             fail like 145725
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop            fail like 145725
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop            fail like 145725
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop             fail like 145725
 test-amd64-i386-xl-pvshim    12 guest-start                  fail   never pass
 test-arm64-arm64-xl-seattle  13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-seattle  14 saverestore-support-check    fail   never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-check        fail   never pass
 test-amd64-i386-libvirt      13 migrate-support-check        fail   never pass
 test-amd64-i386-libvirt-xsm  13 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt     13 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2  fail never pass
 test-arm64-arm64-xl          13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl          14 saverestore-support-check    fail   never pass
 test-arm64-arm64-xl-xsm      13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-xsm      14 saverestore-support-check    fail   never pass
 test-arm64-arm64-xl-credit2  13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-credit1  13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-thunderx 13 migrate-support-check        fail   never pass
 test-arm64-arm64-xl-credit1  14 saverestore-support-check    fail   never pass
 test-arm64-arm64-xl-credit2  14 saverestore-support-check    fail   never pass
 test-arm64-arm64-xl-thunderx 14 saverestore-support-check    fail   never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-arndale  13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-arndale  14 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-check        fail  never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-check    fail  never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-check        fail never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-check    fail never pass
 test-armhf-armhf-xl-credit2  13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-credit2  14 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-credit1  13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-credit1  14 saverestore-support-check    fail   never pass
 test-armhf-armhf-libvirt     13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-rtds     13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-rtds     14 saverestore-support-check    fail   never pass
 test-arm64-arm64-libvirt-xsm 13 migrate-support-check        fail   never pass
 test-arm64-arm64-libvirt-xsm 14 saverestore-support-check    fail   never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-vhd      12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-vhd      13 saverestore-support-check    fail   never pass
 test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop              fail never pass

version targeted for testing:
 xen                  4dde27b6e0a0b0dcb8fdfc7580fbd9c976aa103f
baseline version:
 xen                  0dd92688080202adcc43dcb3486d4143110a66d5

Last test of basis   145725  2020-01-07 08:02:53 Z    1 days
Failing since        145749  2020-01-07 17:36:48 Z    1 days    3 attempts
Testing same since   145773  2020-01-08 02:49:59 Z    0 days    2 attempts

------------------------------------------------------------
People who touched revisions under test:
  Andrew Cooper <andrew.cooper3@citrix.com>
  Hongyan Xia <hongyxia@amazon.com>
  Ian Jackson <Ian.Jackson@citrix.com>
  Jan Beulich <jbeulich@suse.com>
  Julien Grall <julien@xen.org>
  Sergey Dyasli <sergey.dyasli@citrix.com>
  Wei Liu <wei.liu2@citrix.com>
  Wei Liu <wl@xen.org>

jobs:
 build-amd64-xsm                                              pass    
 build-arm64-xsm                                              pass    
 build-i386-xsm                                               pass    
 build-amd64-xtf                                              pass    
 build-amd64                                                  pass    
 build-arm64                                                  pass    
 build-armhf                                                  pass    
 build-i386                                                   pass    
 build-amd64-libvirt                                          pass    
 build-arm64-libvirt                                          pass    
 build-armhf-libvirt                                          pass    
 build-i386-libvirt                                           pass    
 build-amd64-prev                                             pass    
 build-i386-prev                                              pass    
 build-amd64-pvops                                            pass    
 build-arm64-pvops                                            pass    
 build-armhf-pvops                                            pass    
 build-i386-pvops                                             pass    
 test-xtf-amd64-amd64-1                                       pass    
 test-xtf-amd64-amd64-2                                       pass    
 test-xtf-amd64-amd64-3                                       pass    
 test-xtf-amd64-amd64-4                                       pass    
 test-xtf-amd64-amd64-5                                       pass    
 test-amd64-amd64-xl                                          pass    
 test-arm64-arm64-xl                                          pass    
 test-armhf-armhf-xl                                          fail    
 test-amd64-i386-xl                                           pass    
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm           pass    
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm            pass    
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm        pass    
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm         pass    
 test-amd64-amd64-xl-qemut-debianhvm-i386-xsm                 pass    
 test-amd64-i386-xl-qemut-debianhvm-i386-xsm                  pass    
 test-amd64-amd64-xl-qemuu-debianhvm-i386-xsm                 pass    
 test-amd64-i386-xl-qemuu-debianhvm-i386-xsm                  pass    
 test-amd64-amd64-libvirt-xsm                                 pass    
 test-arm64-arm64-libvirt-xsm                                 pass    
 test-amd64-i386-libvirt-xsm                                  pass    
 test-amd64-amd64-xl-xsm                                      pass    
 test-arm64-arm64-xl-xsm                                      pass    
 test-amd64-i386-xl-xsm                                       pass    
 test-amd64-amd64-qemuu-nested-amd                            fail    
 test-amd64-amd64-xl-pvhv2-amd                                pass    
 test-amd64-i386-qemut-rhel6hvm-amd                           pass    
 test-amd64-i386-qemuu-rhel6hvm-amd                           pass    
 test-amd64-amd64-xl-qemut-debianhvm-amd64                    pass    
 test-amd64-i386-xl-qemut-debianhvm-amd64                     pass    
 test-amd64-amd64-xl-qemuu-debianhvm-amd64                    pass    
 test-amd64-i386-xl-qemuu-debianhvm-amd64                     pass    
 test-amd64-i386-freebsd10-amd64                              pass    
 test-amd64-amd64-xl-qemuu-ovmf-amd64                         pass    
 test-amd64-i386-xl-qemuu-ovmf-amd64                          pass    
 test-amd64-amd64-xl-qemut-win7-amd64                         fail    
 test-amd64-i386-xl-qemut-win7-amd64                          fail    
 test-amd64-amd64-xl-qemuu-win7-amd64                         fail    
 test-amd64-i386-xl-qemuu-win7-amd64                          fail    
 test-amd64-amd64-xl-qemut-ws16-amd64                         fail    
 test-amd64-i386-xl-qemut-ws16-amd64                          fail    
 test-amd64-amd64-xl-qemuu-ws16-amd64                         fail    
 test-amd64-i386-xl-qemuu-ws16-amd64                          fail    
 test-armhf-armhf-xl-arndale                                  pass    
 test-amd64-amd64-xl-credit1                                  pass    
 test-arm64-arm64-xl-credit1                                  pass    
 test-armhf-armhf-xl-credit1                                  pass    
 test-amd64-amd64-xl-credit2                                  pass    
 test-arm64-arm64-xl-credit2                                  pass    
 test-armhf-armhf-xl-credit2                                  pass    
 test-armhf-armhf-xl-cubietruck                               pass    
 test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict        pass    
 test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict         pass    
 test-amd64-amd64-examine                                     pass    
 test-arm64-arm64-examine                                     pass    
 test-armhf-armhf-examine                                     pass    
 test-amd64-i386-examine                                      pass    
 test-amd64-i386-freebsd10-i386                               pass    
 test-amd64-amd64-qemuu-nested-intel                          pass    
 test-amd64-amd64-xl-pvhv2-intel                              pass    
 test-amd64-i386-qemut-rhel6hvm-intel                         pass    
 test-amd64-i386-qemuu-rhel6hvm-intel                         pass    
 test-amd64-amd64-libvirt                                     pass    
 test-armhf-armhf-libvirt                                     pass    
 test-amd64-i386-libvirt                                      pass    
 test-amd64-amd64-livepatch                                   pass    
 test-amd64-i386-livepatch                                    pass    
 test-amd64-amd64-migrupgrade                                 pass    
 test-amd64-i386-migrupgrade                                  pass    
 test-amd64-amd64-xl-multivcpu                                pass    
 test-armhf-armhf-xl-multivcpu                                pass    
 test-amd64-amd64-pair                                        pass    
 test-amd64-i386-pair                                         pass    
 test-amd64-amd64-libvirt-pair                                pass    
 test-amd64-i386-libvirt-pair                                 pass    
 test-amd64-amd64-amd64-pvgrub                                pass    
 test-amd64-amd64-i386-pvgrub                                 pass    
 test-amd64-amd64-xl-pvshim                                   pass    
 test-amd64-i386-xl-pvshim                                    fail    
 test-amd64-amd64-pygrub                                      pass    
 test-amd64-amd64-xl-qcow2                                    pass    
 test-armhf-armhf-libvirt-raw                                 pass    
 test-amd64-i386-xl-raw                                       pass    
 test-amd64-amd64-xl-rtds                                     fail    
 test-armhf-armhf-xl-rtds                                     fail    
 test-arm64-arm64-xl-seattle                                  pass    
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-shadow             pass    
 test-amd64-i386-xl-qemuu-debianhvm-amd64-shadow              pass    
 test-amd64-amd64-xl-shadow                                   pass    
 test-amd64-i386-xl-shadow                                    pass    
 test-arm64-arm64-xl-thunderx                                 pass    
 test-amd64-amd64-libvirt-vhd                                 pass    
 test-armhf-armhf-xl-vhd                                      pass    


------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
    http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
    http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   0dd9268808..4dde27b6e0  4dde27b6e0a0b0dcb8fdfc7580fbd9c976aa103f -> master

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
  2020-01-08 21:38 [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED osstest service owner
@ 2020-01-08 23:14 ` Julien Grall
  2020-01-10 18:24   ` Julien Grall
  0 siblings, 1 reply; 7+ messages in thread
From: Julien Grall @ 2020-01-08 23:14 UTC
  To: osstest service owner, Dario Faggioli, George Dunlap,
	Jürgen Groß, Stefano Stabellini
  Cc: xen-devel

On Wed, 8 Jan 2020 at 21:40, osstest service owner
<osstest-admin@xenproject.org> wrote:
>
> flight 145796 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/145796/
>
> Failures :-/ but no regressions.
>
> Tests which are failing intermittently (not blocking):
>  test-amd64-amd64-xl-rtds    15 guest-saverestore fail in 145773 pass in 145796
>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 guest-start/debianhvm.repeat fail in 145773 pass in 145796
>  test-armhf-armhf-xl-rtds     12 guest-start      fail in 145773 pass in 145796

It looks like this test has been failing for a while (although not reliably).
I looked at a few flights, and the cause seems to be the same each time:

Jan  8 15:02:14.700784 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
Jan  8 15:02:26.715030 (XEN) ----[ Xen-4.14-unstable  arm32  debug=y Not tainted ]----
Jan  8 15:02:26.720756 (XEN) CPU:    1
Jan  8 15:02:26.722158 (XEN) PC:     0023a750 common/sched_rt.c#replq_insert+0x7c/0xcc
Jan  8 15:02:26.727851 (XEN) CPSR:   200300da MODE:Hypervisor
Jan  8 15:02:26.731334 (XEN)      R0: 002a51a4 R1: 400614a0 R2: 3d64b900 R3: 40061338
Jan  8 15:02:26.736830 (XEN)      R4: 400614a0 R5: 002a51a4 R6: 3cf1cbf0 R7: 000001cb
Jan  8 15:02:26.742600 (XEN)      R8: 4003d1b0 R9: 400614a8 R10:4003d1b0 R11:400ffe54 R12:400ffde4
Jan  8 15:02:26.749119 (XEN) HYP: SP: 400ffe2c LR: 0023b6e8
Jan  8 15:02:26.752296 (XEN)
Jan  8 15:02:26.753036 (XEN)   VTCR_EL2: 80003558
Jan  8 15:02:26.755479 (XEN)  VTTBR_EL2: 00020000bbff4000
Jan  8 15:02:26.758757 (XEN)
Jan  8 15:02:26.759366 (XEN)  SCTLR_EL2: 30cd187f
Jan  8 15:02:26.761755 (XEN)    HCR_EL2: 0078663f
Jan  8 15:02:26.764250 (XEN)  TTBR0_EL2: 00000000bc029000
Jan  8 15:02:26.767364 (XEN)
Jan  8 15:02:26.767980 (XEN)    ESR_EL2: 00000000
Jan  8 15:02:26.770485 (XEN)  HPFAR_EL2: 00030010
Jan  8 15:02:26.772795 (XEN)      HDFAR: e0800f00
Jan  8 15:02:26.775272 (XEN)      HIFAR: c0605744
Jan  8 15:02:26.777748 (XEN)
Jan  8 15:02:26.778505 (XEN) Xen stack trace from sp=400ffe2c:
Jan  8 15:02:26.781910 (XEN)    00000000 3cf1cbf0 400614a0 002a51a4 3cf1cbf0 000001cb 4003d1b0 6003005a
Jan  8 15:02:26.788991 (XEN)    400613f8 400ffe7c 0023b6e8 002f9300 4004c000 400613f8 3cf1cbf0 000001cb
Jan  8 15:02:26.796093 (XEN)    4003d1b0 6003005a 400613f8 400ffeac 00242988 4004c000 002425ac 40058000
Jan  8 15:02:26.803237 (XEN)    4004c000 4004f000 10f45000 10f45008 4004b080 40058000 60030013 400ffebc
Jan  8 15:02:26.810360 (XEN)    00209984 00000002 4004f000 400ffedc 0020eddc 0020caf8 db097cd4 00000020
Jan  8 15:02:26.817504 (XEN)    c13afbec 00000000 db15fd68 400ffee4 0020c9dc 400fff34 0020d5e8 4004e000
Jan  8 15:02:26.824615 (XEN)    00000000 400fff44 400fff44 00000002 00000000 4004e8fa 4004e8f4 400fff1c
Jan  8 15:02:26.831737 (XEN)    400fff1c 6003005a 0020caf8 400fff58 00000020 c13afbec 00000000 db15fd68
Jan  8 15:02:26.838798 (XEN)    60030013 400fff54 0026c150 c1204d08 c13afbec 00000000 00000000 00000000
Jan  8 15:02:26.845877 (XEN)    00000002 400fff58 002753b0 00000009 db097cd4 db173008 00000002 c1204d08
Jan  8 15:02:26.852986 (XEN)    00000000 00000002 c13afbec 00000000 db15fd68 60030013 db15fd3c 00000020
Jan  8 15:02:26.860044 (XEN)    ffffffff b6cdccb3 c0107ed0 a0030093 4a000ea1 be951568 c136edc0 c010d3a0
Jan  8 15:02:26.867171 (XEN)    db097cd0 c056c7f8 c136edcc c010d720 c136edd8 c010d7e0 00000000 00000000
Jan  8 15:02:26.874526 (XEN)    00000000 00000000 00000000 c136ede4 c136ede4 00030030 60070193 80030093
Jan  8 15:02:26.881450 (XEN)    60030193 00000000 00000000 00000000 00000001
Jan  8 15:02:26.886519 (XEN) Xen call trace:
Jan  8 15:02:26.888168 (XEN)    [<0023a750>] common/sched_rt.c#replq_insert+0x7c/0xcc (PC)
Jan  8 15:02:26.894240 (XEN)    [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274 (LR)
Jan  8 15:02:26.900246 (XEN)    [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274
Jan  8 15:02:26.905775 (XEN)    [<00242988>] vcpu_wake+0x1e4/0x688
Jan  8 15:02:26.909743 (XEN)    [<00209984>] domain_unpause+0x64/0x84
Jan  8 15:02:26.913956 (XEN)    [<0020eddc>] common/event_fifo.c#evtchn_fifo_unmask+0xd8/0xf0
Jan  8 15:02:26.920167 (XEN)    [<0020c9dc>] evtchn_unmask+0x7c/0xc0
Jan  8 15:02:26.924173 (XEN)    [<0020d5e8>] do_event_channel_op+0xaf0/0xdac
Jan  8 15:02:26.928922 (XEN)    [<0026c150>] do_trap_guest_sync+0x350/0x4d0
Jan  8 15:02:26.933647 (XEN)    [<002753b0>] entry.o#return_from_trap+0/0x4
Jan  8 15:02:26.938299 (XEN)
Jan  8 15:02:26.939039 (XEN)
Jan  8 15:02:26.939668 (XEN) ****************************************
Jan  8 15:02:26.943794 (XEN) Panic on CPU 1:
Jan  8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
Jan  8 15:02:26.951492 (XEN) ****************************************

I believe the domain_unpause() call comes from guest_clear_bit(). This
would mean the atomic operation did not succeed within its retry limit,
so the domain had to be paused. This makes sense as, per the log:

 CPU1: Guest atomics will try 1 times before pausing the domain
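
A minimal sketch of that fallback, written as a standalone C model rather
than the actual Xen helpers: try the atomic operation a bounded number of
times, then pause the domain and do the update with the guest stopped.
clear_bit_bounded(), guest_clear_bit_model() and the contended counter are
hypothetical names made up for illustration; domain_pause_nosync() and
domain_unpause() are reduced here to a bare pause counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the hypervisor's domain; only the pause count is modelled. */
struct domain {
    int pause_count;
};

static void domain_pause_nosync(struct domain *d)
{
    d->pause_count++;
}

static void domain_unpause(struct domain *d)
{
    assert(d->pause_count > 0);
    d->pause_count--;
}

/*
 * Hypothetical bounded clear: each attempt fails while *contended is
 * non-zero, which stands in for the guest hammering the same word.
 * Returns true if the bit was cleared within max_try attempts.
 */
static bool clear_bit_bounded(unsigned int nr, unsigned long *word,
                              unsigned int max_try, unsigned int *contended)
{
    for (unsigned int i = 0; i < max_try; i++) {
        if (*contended) {
            (*contended)--;
            continue;
        }
        *word &= ~(1UL << nr);
        return true;
    }
    return false;
}

/* Returns true if the domain had to be paused to make progress. */
static bool guest_clear_bit_model(struct domain *d, unsigned int nr,
                                  unsigned long *word, unsigned int max_try,
                                  unsigned int contended)
{
    if (clear_bit_bounded(nr, word, max_try, &contended))
        return false;

    /* Fallback: stop the guest, do the update, let it run again. */
    domain_pause_nosync(d);
    *word &= ~(1UL << nr);
    domain_unpause(d);
    return true;
}
```

In this model, with max_try = 1 as in the log line above, a single failed
attempt is enough to take the pause/unpause path.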

I am under the impression that the crash could be reproduced with just:

domain_pause_nosync(current->domain);
domain_unpause(current->domain);

Any insight into what's wrong? I am happy to try to reproduce it tomorrow morning.

Cheers,


* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
  2020-01-08 23:14 ` Julien Grall
@ 2020-01-10 18:24   ` Julien Grall
  2020-01-10 23:30     ` Julien Grall
  2020-01-22  3:40     ` Dario Faggioli
  0 siblings, 2 replies; 7+ messages in thread
From: Julien Grall @ 2020-01-10 18:24 UTC
  To: Julien Grall, osstest service owner, Dario Faggioli,
	George Dunlap, Jürgen Groß, Stefano Stabellini
  Cc: xen-devel

Hi all,

On 08/01/2020 23:14, Julien Grall wrote:
> On Wed, 8 Jan 2020 at 21:40, osstest service owner
> <osstest-admin@xenproject.org> wrote:
>>
>> flight 145796 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/145796/
>>
>> Failures :-/ but no regressions.
>>
>> Tests which are failing intermittently (not blocking):
>>   test-amd64-amd64-xl-rtds    15 guest-saverestore fail in 145773 pass in 145796
>>   test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 guest-start/debianhvm.repeat fail in 145773 pass in 145796
>>   test-armhf-armhf-xl-rtds     12 guest-start      fail in 145773 pass in 145796
> 
> It looks like this test has been failing for a while (although not reliably).
> I looked at  a few flights, the cause seems to be the same:
> 
> Jan  8 15:02:14.700784 (XEN) Assertion '!unit_on_replq(svc)' failed at
> sched_rt.c:586
> Jan  8 15:02:26.715030 (XEN) ----[ Xen-4.14-unstable  arm32  debug=y
> Not tainted ]----
> Jan  8 15:02:26.720756 (XEN) CPU:    1
> Jan  8 15:02:26.722158 (XEN) PC:     0023a750
> common/sched_rt.c#replq_insert+0x7c/0xcc
> Jan  8 15:02:26.727851 (XEN) CPSR:   200300da MODE:Hypervisor
> Jan  8 15:02:26.731334 (XEN)      R0: 002a51a4 R1: 400614a0 R2:
> 3d64b900 R3: 40061338
> Jan  8 15:02:26.736830 (XEN)      R4: 400614a0 R5: 002a51a4 R6:
> 3cf1cbf0 R7: 000001cb
> Jan  8 15:02:26.742600 (XEN)      R8: 4003d1b0 R9: 400614a8
> R10:4003d1b0 R11:400ffe54 R12:400ffde4
> Jan  8 15:02:26.749119 (XEN) HYP: SP: 400ffe2c LR: 0023b6e8
> Jan  8 15:02:26.752296 (XEN)
> Jan  8 15:02:26.753036 (XEN)   VTCR_EL2: 80003558
> Jan  8 15:02:26.755479 (XEN)  VTTBR_EL2: 00020000bbff4000
> Jan  8 15:02:26.758757 (XEN)
> Jan  8 15:02:26.759366 (XEN)  SCTLR_EL2: 30cd187f
> Jan  8 15:02:26.761755 (XEN)    HCR_EL2: 0078663f
> Jan  8 15:02:26.764250 (XEN)  TTBR0_EL2: 00000000bc029000
> Jan  8 15:02:26.767364 (XEN)
> Jan  8 15:02:26.767980 (XEN)    ESR_EL2: 00000000
> Jan  8 15:02:26.770485 (XEN)  HPFAR_EL2: 00030010
> Jan  8 15:02:26.772795 (XEN)      HDFAR: e0800f00
> Jan  8 15:02:26.775272 (XEN)      HIFAR: c0605744
> Jan  8 15:02:26.777748 (XEN)
> Jan  8 15:02:26.778505 (XEN) Xen stack trace from sp=400ffe2c:
> Jan  8 15:02:26.781910 (XEN)    00000000 3cf1cbf0 400614a0 002a51a4
> 3cf1cbf0 000001cb 4003d1b0 6003005a
> Jan  8 15:02:26.788991 (XEN)    400613f8 400ffe7c 0023b6e8 002f9300
> 4004c000 400613f8 3cf1cbf0 000001cb
> Jan  8 15:02:26.796093 (XEN)    4003d1b0 6003005a 400613f8 400ffeac
> 00242988 4004c000 002425ac 40058000
> Jan  8 15:02:26.803237 (XEN)    4004c000 4004f000 10f45000 10f45008
> 4004b080 40058000 60030013 400ffebc
> Jan  8 15:02:26.810360 (XEN)    00209984 00000002 4004f000 400ffedc
> 0020eddc 0020caf8 db097cd4 00000020
> Jan  8 15:02:26.817504 (XEN)    c13afbec 00000000 db15fd68 400ffee4
> 0020c9dc 400fff34 0020d5e8 4004e000
> Jan  8 15:02:26.824615 (XEN)    00000000 400fff44 400fff44 00000002
> 00000000 4004e8fa 4004e8f4 400fff1c
> Jan  8 15:02:26.831737 (XEN)    400fff1c 6003005a 0020caf8 400fff58
> 00000020 c13afbec 00000000 db15fd68
> Jan  8 15:02:26.838798 (XEN)    60030013 400fff54 0026c150 c1204d08
> c13afbec 00000000 00000000 00000000
> Jan  8 15:02:26.845877 (XEN)    00000002 400fff58 002753b0 00000009
> db097cd4 db173008 00000002 c1204d08
> Jan  8 15:02:26.852986 (XEN)    00000000 00000002 c13afbec 00000000
> db15fd68 60030013 db15fd3c 00000020
> Jan  8 15:02:26.860044 (XEN)    ffffffff b6cdccb3 c0107ed0 a0030093
> 4a000ea1 be951568 c136edc0 c010d3a0
> Jan  8 15:02:26.867171 (XEN)    db097cd0 c056c7f8 c136edcc c010d720
> c136edd8 c010d7e0 00000000 00000000
> Jan  8 15:02:26.874526 (XEN)    00000000 00000000 00000000 c136ede4
> c136ede4 00030030 60070193 80030093
> Jan  8 15:02:26.881450 (XEN)    60030193 00000000 00000000 00000000 00000001
> Jan  8 15:02:26.886519 (XEN) Xen call trace:
> Jan  8 15:02:26.888168 (XEN)    [<0023a750>]
> common/sched_rt.c#replq_insert+0x7c/0xcc (PC)
> Jan  8 15:02:26.894240 (XEN)    [<0023b6e8>]
> common/sched_rt.c#rt_unit_wake+0xf4/0x274 (LR)
> Jan  8 15:02:26.900246 (XEN)    [<0023b6e8>]
> common/sched_rt.c#rt_unit_wake+0xf4/0x274
> Jan  8 15:02:26.905775 (XEN)    [<00242988>] vcpu_wake+0x1e4/0x688
> Jan  8 15:02:26.909743 (XEN)    [<00209984>] domain_unpause+0x64/0x84
> Jan  8 15:02:26.913956 (XEN)    [<0020eddc>]
> common/event_fifo.c#evtchn_fifo_unmask+0xd8/0xf0
> Jan  8 15:02:26.920167 (XEN)    [<0020c9dc>] evtchn_unmask+0x7c/0xc0
> Jan  8 15:02:26.924173 (XEN)    [<0020d5e8>] do_event_channel_op+0xaf0/0xdac
> Jan  8 15:02:26.928922 (XEN)    [<0026c150>] do_trap_guest_sync+0x350/0x4d0
> Jan  8 15:02:26.933647 (XEN)    [<002753b0>] entry.o#return_from_trap+0/0x4
> Jan  8 15:02:26.938299 (XEN)
> Jan  8 15:02:26.939039 (XEN)
> Jan  8 15:02:26.939668 (XEN) ****************************************
> Jan  8 15:02:26.943794 (XEN) Panic on CPU 1:
> Jan  8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at
> sched_rt.c:586
> Jan  8 15:02:26.951492 (XEN) ****************************************
> 
> I believe the domain_unpause() is coming from guest_clear_bit(). This
> would mean the atomics didn't succeed without pausing the domain. This
> makes sense as, per the log:
> 
>   CPU1: Guest atomics will try 1 times before pausing the domain
> 
> I am under the impression that the crash could be reproduced with just:
> 
> domain_pause_nosync(current);
> domain_unpause(current);
> 
> Any insights what's wrong? I am happy to try to reproduce it tomorrow morning.

So I managed to reproduce it on Arm by hacking the hypercall path to call:

domain_pause_nosync(current->domain);
domain_unpause(current->domain);

With a debug build and a 2-vCPU dom0, the crash happens within a few
seconds. When the unit is not scheduled, rt_unit_wake() expects the unit
to be on none of the queues.

The interaction is as follows:

CPU0                              | CPU1
                                  |
do_domain_pause()                 |
  -> atomic_inc(&d->pause_count)  |
  -> vcpu_sleep_nosync(vCPU A)    | schedule()
                                  |   -> Lock
                                  |   -> rt_schedule()
                                  |      -> snext = runq_pick(...)
                                  |         /* returns unit A (aka vCPU A) */
                                  |         /* Unit is not runnable */
                                  |      -> Remove from the q
                                  |   [....]
                                  |   -> Unlock
    -> Lock                       |
    -> rt_unit_sleep()            |
       /* Unit not scheduled */   |
       /* Nothing to do */        |

Note that on Arm, each vCPU has its own scheduling unit.

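When schedule() grabs the lock first (as shown above), the unit is
removed only from the runQ/depletedQ. However, when vcpu_sleep_nosync()
grabs the lock first and the unit was not scheduled, rt_unit_sleep()
removes the unit from both queues (runQ/depletedQ and replenishQ). In
the first case the unit is therefore still on the replenishQ when it is
woken, which trips the assertion in replq_insert().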

So I think we want schedule() to remove the unit from both queues if it
is not runnable. Any opinions?
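
The two orderings can be modelled in a few lines of standalone C. This is
a toy sketch, not the sched_rt.c code: struct unit, its flags and the
model_*() helpers are made up for illustration, and the sleep path's
behaviour (dequeue from both queues only while the unit is still on the
runQ/depletedQ) is an assumption taken from the "Nothing to do" step in
the diagram. The remove_from_both flag stands for the change suggested
above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one RTDS scheduling unit; field names are illustrative only. */
struct unit {
    bool runnable;   /* false once the domain's pause count is raised */
    bool scheduled;  /* currently running on a pCPU */
    bool on_runq;    /* queued on the runQ/depletedQ */
    bool on_replq;   /* queued on the replenishment queue */
};

/* schedule() picking a unit that turns out not to be runnable. */
static void model_schedule_pick(struct unit *u, bool remove_from_both)
{
    if (!u->runnable) {
        u->on_runq = false;
        if (remove_from_both)
            u->on_replq = false;
    }
}

/* Sleep path for a unit that is not scheduled (assumed behaviour). */
static void model_unit_sleep(struct unit *u)
{
    if (u->scheduled)
        return;              /* a running unit is not modelled here */
    if (u->on_runq) {
        u->on_runq = false;
        u->on_replq = false;
    }
    /* Otherwise: nothing to do, as in the diagram. */
}

/* Wake path: returns false where the real code would hit the assertion. */
static bool model_unit_wake(struct unit *u)
{
    u->runnable = true;
    if (u->scheduled || u->on_runq)
        return true;
    if (u->on_replq)
        return false;        /* '!unit_on_replq(svc)' would fail here */
    u->on_replq = true;
    u->on_runq = true;
    return true;
}

/* Pause, race schedule() against the sleep path, then unpause. */
static bool run_scenario(bool schedule_first, bool remove_from_both)
{
    struct unit u = { .runnable = true, .scheduled = false,
                      .on_runq = true, .on_replq = true };

    u.runnable = false;      /* atomic_inc(&d->pause_count) */
    if (schedule_first) {
        model_schedule_pick(&u, remove_from_both);
        model_unit_sleep(&u);
    } else {
        model_unit_sleep(&u);
        model_schedule_pick(&u, remove_from_both);
    }
    return model_unit_wake(&u);
}
```

In this model, only run_scenario(true, false), i.e. schedule() taking the
lock first without the extra removal, ends with model_unit_wake()
returning false; the other three combinations wake the unit cleanly.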

Cheers,

-- 
Julien Grall


* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
  2020-01-10 18:24   ` Julien Grall
@ 2020-01-10 23:30     ` Julien Grall
  2020-01-22  3:40     ` Dario Faggioli
  1 sibling, 0 replies; 7+ messages in thread
From: Julien Grall @ 2020-01-10 23:30 UTC
  To: Julien Grall, osstest service owner, Dario Faggioli,
	George Dunlap, Jürgen Groß, Stefano Stabellini,
	xumengpanda
  Cc: xen-devel

(+ Meng)

Hi,

Sorry, I forgot to CC the RTDS scheduler maintainer.

On 10/01/2020 18:24, Julien Grall wrote:
> Hi all,
> 
> On 08/01/2020 23:14, Julien Grall wrote:
>> On Wed, 8 Jan 2020 at 21:40, osstest service owner
>> <osstest-admin@xenproject.org> wrote:
>>>
>>> flight 145796 xen-unstable real [real]
>>> http://logs.test-lab.xenproject.org/osstest/logs/145796/
>>>
>>> Failures :-/ but no regressions.
>>>
>>> Tests which are failing intermittently (not blocking):
>>>   test-amd64-amd64-xl-rtds    15 guest-saverestore fail in 145773 
>>> pass in 145796
>>>   test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 
>>> guest-start/debianhvm.repeat fail in 145773 pass in 145796
>>>   test-armhf-armhf-xl-rtds     12 guest-start      fail in 145773 
>>> pass in 145796
>>
>> It looks like this test has been failing for a while (although not 
>> reliably).
>> I looked at  a few flights, the cause seems to be the same:
>>
>> Jan  8 15:02:14.700784 (XEN) Assertion '!unit_on_replq(svc)' failed at
>> sched_rt.c:586
>> Jan  8 15:02:26.715030 (XEN) ----[ Xen-4.14-unstable  arm32  debug=y
>> Not tainted ]----
>> Jan  8 15:02:26.720756 (XEN) CPU:    1
>> Jan  8 15:02:26.722158 (XEN) PC:     0023a750
>> common/sched_rt.c#replq_insert+0x7c/0xcc
>> Jan  8 15:02:26.727851 (XEN) CPSR:   200300da MODE:Hypervisor
>> Jan  8 15:02:26.731334 (XEN)      R0: 002a51a4 R1: 400614a0 R2:
>> 3d64b900 R3: 40061338
>> Jan  8 15:02:26.736830 (XEN)      R4: 400614a0 R5: 002a51a4 R6:
>> 3cf1cbf0 R7: 000001cb
>> Jan  8 15:02:26.742600 (XEN)      R8: 4003d1b0 R9: 400614a8
>> R10:4003d1b0 R11:400ffe54 R12:400ffde4
>> Jan  8 15:02:26.749119 (XEN) HYP: SP: 400ffe2c LR: 0023b6e8
>> Jan  8 15:02:26.752296 (XEN)
>> Jan  8 15:02:26.753036 (XEN)   VTCR_EL2: 80003558
>> Jan  8 15:02:26.755479 (XEN)  VTTBR_EL2: 00020000bbff4000
>> Jan  8 15:02:26.758757 (XEN)
>> Jan  8 15:02:26.759366 (XEN)  SCTLR_EL2: 30cd187f
>> Jan  8 15:02:26.761755 (XEN)    HCR_EL2: 0078663f
>> Jan  8 15:02:26.764250 (XEN)  TTBR0_EL2: 00000000bc029000
>> Jan  8 15:02:26.767364 (XEN)
>> Jan  8 15:02:26.767980 (XEN)    ESR_EL2: 00000000
>> Jan  8 15:02:26.770485 (XEN)  HPFAR_EL2: 00030010
>> Jan  8 15:02:26.772795 (XEN)      HDFAR: e0800f00
>> Jan  8 15:02:26.775272 (XEN)      HIFAR: c0605744
>> Jan  8 15:02:26.777748 (XEN)
>> Jan  8 15:02:26.778505 (XEN) Xen stack trace from sp=400ffe2c:
>> Jan  8 15:02:26.781910 (XEN)    00000000 3cf1cbf0 400614a0 002a51a4 3cf1cbf0 000001cb 4003d1b0 6003005a
>> Jan  8 15:02:26.788991 (XEN)    400613f8 400ffe7c 0023b6e8 002f9300 4004c000 400613f8 3cf1cbf0 000001cb
>> Jan  8 15:02:26.796093 (XEN)    4003d1b0 6003005a 400613f8 400ffeac 00242988 4004c000 002425ac 40058000
>> Jan  8 15:02:26.803237 (XEN)    4004c000 4004f000 10f45000 10f45008 4004b080 40058000 60030013 400ffebc
>> Jan  8 15:02:26.810360 (XEN)    00209984 00000002 4004f000 400ffedc 0020eddc 0020caf8 db097cd4 00000020
>> Jan  8 15:02:26.817504 (XEN)    c13afbec 00000000 db15fd68 400ffee4 0020c9dc 400fff34 0020d5e8 4004e000
>> Jan  8 15:02:26.824615 (XEN)    00000000 400fff44 400fff44 00000002 00000000 4004e8fa 4004e8f4 400fff1c
>> Jan  8 15:02:26.831737 (XEN)    400fff1c 6003005a 0020caf8 400fff58 00000020 c13afbec 00000000 db15fd68
>> Jan  8 15:02:26.838798 (XEN)    60030013 400fff54 0026c150 c1204d08 c13afbec 00000000 00000000 00000000
>> Jan  8 15:02:26.845877 (XEN)    00000002 400fff58 002753b0 00000009 db097cd4 db173008 00000002 c1204d08
>> Jan  8 15:02:26.852986 (XEN)    00000000 00000002 c13afbec 00000000 db15fd68 60030013 db15fd3c 00000020
>> Jan  8 15:02:26.860044 (XEN)    ffffffff b6cdccb3 c0107ed0 a0030093 4a000ea1 be951568 c136edc0 c010d3a0
>> Jan  8 15:02:26.867171 (XEN)    db097cd0 c056c7f8 c136edcc c010d720 c136edd8 c010d7e0 00000000 00000000
>> Jan  8 15:02:26.874526 (XEN)    00000000 00000000 00000000 c136ede4 c136ede4 00030030 60070193 80030093
>> Jan  8 15:02:26.881450 (XEN)    60030193 00000000 00000000 00000000 00000001
>> Jan  8 15:02:26.886519 (XEN) Xen call trace:
>> Jan  8 15:02:26.888168 (XEN)    [<0023a750>] common/sched_rt.c#replq_insert+0x7c/0xcc (PC)
>> Jan  8 15:02:26.894240 (XEN)    [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274 (LR)
>> Jan  8 15:02:26.900246 (XEN)    [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274
>> Jan  8 15:02:26.905775 (XEN)    [<00242988>] vcpu_wake+0x1e4/0x688
>> Jan  8 15:02:26.909743 (XEN)    [<00209984>] domain_unpause+0x64/0x84
>> Jan  8 15:02:26.913956 (XEN)    [<0020eddc>] common/event_fifo.c#evtchn_fifo_unmask+0xd8/0xf0
>> Jan  8 15:02:26.920167 (XEN)    [<0020c9dc>] evtchn_unmask+0x7c/0xc0
>> Jan  8 15:02:26.924173 (XEN)    [<0020d5e8>] do_event_channel_op+0xaf0/0xdac
>> Jan  8 15:02:26.928922 (XEN)    [<0026c150>] do_trap_guest_sync+0x350/0x4d0
>> Jan  8 15:02:26.933647 (XEN)    [<002753b0>] entry.o#return_from_trap+0/0x4
>> Jan  8 15:02:26.938299 (XEN)
>> Jan  8 15:02:26.939039 (XEN)
>> Jan  8 15:02:26.939668 (XEN) ****************************************
>> Jan  8 15:02:26.943794 (XEN) Panic on CPU 1:
>> Jan  8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
>> Jan  8 15:02:26.951492 (XEN) ****************************************
>>
>> I believe the domain_unpause() is coming from guest_clear_bit(). This
>> would mean the atomics didn't succeed without pausing the domain. This
>> makes sense as, per the log:
>>
>>   CPU1: Guest atomics will try 1 times before pausing the domain
>>
>> I am under the impression that the crash could be reproduced with just:
>>
>> domain_pause_nosync(current);
>> domain_unpause(current);
>>
>> Any insights what's wrong? I am happy to try to reproduce it tomorrow 
>> morning.
> 
> So I managed to reproduce it on Arm by hacking the hypercall path to call:
> 
> domain_pause_nosync(current->domain);
> domain_unpause(current->domain);
> 
> With a debug build and a 2 vCPU dom0, the crash happens within a few
> seconds. When the unit is not scheduled, rt_unit_wake() expects the unit
> to be in none of the queues.
>
> The interaction is as follows:
> 
> CPU0                                | CPU1
>                                     |
> do_domain_pause()                   |
>   -> atomic_inc(&d->pause_count)    |
>   -> vcpu_sleep_nosync(vCPU A)      | schedule()
>                                     |   -> Lock
>                                     |   -> rt_schedule()
>                                     |      -> snext = runq_pick(...)
>                                     |         /* returns unit A (aka vCPU A) */
>                                     |         /* Unit is not runnable */
>                                     |      -> Remove from the q
>                                     |   [....]
>                                     |   -> Unlock
>     -> Lock                         |
>     -> rt_unit_sleep()              |
>        /* Unit not scheduled */     |
>        /* Nothing to do */          |
> 
> Note that on Arm, each vCPU has its own scheduling unit.
> 
> When schedule() grabs the lock first (as shown above), the unit will only
> be removed from the runQ/depleteQ. However, when vcpu_sleep_nosync() grabs
> the lock first and the unit was not scheduled, rt_unit_sleep() will remove
> the unit from two queues (runQ/depleteQ and replenishQ).
> 
> So I think we want schedule() to remove the unit from the 2 queues if it 
> is not runnable. Any opinions?
> 
> Cheers,
> 

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
  2020-01-10 18:24   ` Julien Grall
  2020-01-10 23:30     ` Julien Grall
@ 2020-01-22  3:40     ` Dario Faggioli
  2020-02-02 12:57       ` Julien Grall
  1 sibling, 1 reply; 7+ messages in thread
From: Dario Faggioli @ 2020-01-22  3:40 UTC (permalink / raw)
  To: Julien Grall, Julien Grall, osstest service owner, George Dunlap,
	Jürgen Groß, Stefano Stabellini
  Cc: xen-devel


On Fri, 2020-01-10 at 18:24 +0000, Julien Grall wrote:
> Hi all,
> 
Hi Julien,

I was looking at this, and I have a couple of questions...

> On 08/01/2020 23:14, Julien Grall wrote:
> > On Wed, 8 Jan 2020 at 21:40, osstest service owner
> > <osstest-admin@xenproject.org> wrote:
> > ****************************************
> > Jan  8 15:02:26.943794 (XEN) Panic on CPU 1:
> > Jan  8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
> > Jan  8 15:02:26.951492 (XEN) ****************************************
> > 
> So I managed to reproduce it on Arm by hacking the hypercall path to
> call:
> 
> domain_pause_nosync(current->domain);
> domain_unpause(current->domain);
> 
> With a debug build and a 2 vCPU dom0, the crash happens within a few
> seconds. When the unit is not scheduled, rt_unit_wake() expects the
> unit to be in none of the queues.
> 
> The interaction is as follows:
> 
> CPU0                                | CPU1
>                                     |
> do_domain_pause()                   |
>   -> atomic_inc(&d->pause_count)    |
>   -> vcpu_sleep_nosync(vCPU A)      | schedule()
>                                     |   -> Lock
>                                     |   -> rt_schedule()
>                                     |      -> snext = runq_pick(...)
>                                     |         /* returns unit A (aka vCPU A) */
>                                     |         /* Unit is not runnable */
>                                     |      -> Remove from the q
>                                     |   [....]
>                                     |   -> Unlock
>     -> Lock                         |
>     -> rt_unit_sleep()              |
>        /* Unit not scheduled */     |
>        /* Nothing to do */          |
> 
Thanks a lot for the analysis. As said above, just a few questions, to
be sure I'm understanding properly what is happening.

You have a 2 vCPUs dom0, and how many other vCPUs from other domains?
Or do you only have those 2 dom0 vCPUs and you are actually pausing
dom0?

In general, what is running (I mean which vcpu) on CPU0, when the
domain_pause() happens? And what is running on CPU1 when schedule()
happens?

If you just have the 2 dom0 vCPUs around (and we call them vCPU A and
vCPU B), the only case in which I can imagine runq_pick() returning A
on CPU1 is if CPU0 was running vCPU B (and invoked the
hypercall from it) and CPU1 was idle... is this the case?

> When schedule() grabs the lock first (as shown above), the unit will
> only be removed from the runQ/depleteQ. However, when
> vcpu_sleep_nosync() grabs the lock first and the unit was not
> scheduled, rt_unit_sleep() will remove the unit from two queues
> (runQ/depleteQ and replenishQ).
> 
> So I think we want schedule() to remove the unit from the 2 queues if
> it is not runnable. Any opinions?
> 
Mmm... that may work, but I'm not sure.

In fact, I'm starting to think that patch 7c7b407e777 "xen/sched:
introduce unit_runnable_state()", which added the 'q_remove(snext)' in
rt_schedule() might not be correct.

In fact, if runq_pick() returns a vCPU which is in the runqueue, but is
not runnable (e.g., because we're racing with do_domain_pause(), which
already set pause_count), it's not rt_schedule()'s job to dequeue it from
anything.

We probably should just ignore it and pick another vCPU, if any (and
idle otherwise). Then, after we release the lock, it will be
rt_unit_sleep(), called by do_domain_pause() in this case, that will
finish the job of properly dequeueing it...

Another strange thing is that, as the code looks right now, runq_pick()
returns the first unit in the runq (i.e., the one with the earliest
deadline), without checking whether it is runnable. Then, in
rt_schedule(), if the unit is not runnable, we (only partially, as you
figured out) dequeue it, and use idle instead, as our candidate for
being the next scheduled unit... But what if there were other
*runnable* units in the runqueue?
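
To make the "ignore it and pick another one" idea concrete, here is a
minimal, self-contained sketch. It is a toy model and not the actual
sched_rt.c code: struct demo_unit, its runnable flag and
pick_first_runnable() are made-up names, and the singly linked list is
only an assumption standing in for the real runq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a scheduling unit (hypothetical, not Xen's rt_unit). */
struct demo_unit {
    int id;
    unsigned long deadline;   /* the toy runq is sorted by this, earliest first */
    bool runnable;            /* false e.g. while the domain is paused */
    struct demo_unit *next;   /* next unit on the toy runq */
};

/*
 * Return the earliest-deadline unit that is runnable, or NULL if there is
 * none (the caller would then fall back to idle). Non-runnable units are
 * skipped but deliberately left on the queue: dequeueing them is left to
 * the sleep path.
 */
static struct demo_unit *pick_first_runnable(struct demo_unit *runq_head)
{
    struct demo_unit *u;

    for ( u = runq_head; u != NULL; u = u->next )
        if ( u->runnable )
            return u;

    return NULL;
}
```

With a queue A -> B -> C where only A is not runnable, this returns B and
leaves A where it was, which is the behaviour described above.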

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
  2020-01-22  3:40     ` Dario Faggioli
@ 2020-02-02 12:57       ` Julien Grall
  2020-02-02 13:15         ` Dario Faggioli
  0 siblings, 1 reply; 7+ messages in thread
From: Julien Grall @ 2020-02-02 12:57 UTC (permalink / raw)
  To: Dario Faggioli, Julien Grall, osstest service owner,
	George Dunlap, Jürgen Groß, Stefano Stabellini
  Cc: xen-devel, xumengpanda

Hi Dario,

Apologies for the late answer.

On 22/01/2020 03:40, Dario Faggioli wrote:
> On Fri, 2020-01-10 at 18:24 +0000, Julien Grall wrote:
>> Hi all,
>>
> Hi Julien,
> 
> I was looking at this, and I have a couple of questions...
> 
>> On 08/01/2020 23:14, Julien Grall wrote:
>>> On Wed, 8 Jan 2020 at 21:40, osstest service owner
>>> <osstest-admin@xenproject.org> wrote:
>>> ****************************************
>>> Jan  8 15:02:26.943794 (XEN) Panic on CPU 1:
>>> Jan  8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
>>> Jan  8 15:02:26.951492 (XEN) ****************************************
>>>
>> So I managed to reproduce it on Arm by hacking the hypercall path to
>> call:
>>
>> domain_pause_nosync(current->domain);
>> domain_unpause(current->domain);
>>
>> With a debug build and a 2 vCPU dom0, the crash happens within a few
>> seconds. When the unit is not scheduled, rt_unit_wake() expects the
>> unit to be in none of the queues.
>>
>> The interaction is as follows:
>>
>> CPU0                                | CPU1
>>                                     |
>> do_domain_pause()                   |
>>   -> atomic_inc(&d->pause_count)    |
>>   -> vcpu_sleep_nosync(vCPU A)      | schedule()
>>                                     |   -> Lock
>>                                     |   -> rt_schedule()
>>                                     |      -> snext = runq_pick(...)
>>                                     |         /* returns unit A (aka vCPU A) */
>>                                     |         /* Unit is not runnable */
>>                                     |      -> Remove from the q
>>                                     |   [....]
>>                                     |   -> Unlock
>>     -> Lock                         |
>>     -> rt_unit_sleep()              |
>>        /* Unit not scheduled */     |
>>        /* Nothing to do */          |
>>
> Thanks a lot for the analysis. As said above, just a few questions, to
> be sure I'm understanding properly what is happening.
> 
> You have a 2 vCPUs dom0, and how many other vCPUs from other domains?
> Or do you only have those 2 dom0 vCPUs and you are actually pausing
> dom0?

Only dom0 with 2 vCPUs is running. On every hypercall, it will try to 
pause/unpause itself. This is to roughly match the behavior of the Arm 
guest atomic helpers.
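
As a rough illustration of that "try a bounded number of times, then
pause the domain" pattern, a sketch using C11 atomics could look like the
following. The names (toy_pause(), toy_unpause(), toy_guest_clear_bit())
are hypothetical and the fallback path is simplified; this is not the
Xen implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for domain_pause_nosync()/domain_unpause(). */
static int toy_pause_calls;
static void toy_pause(void)   { toy_pause_calls++; }
static void toy_unpause(void) { }

/* Try at most max_try times; return true if the bit was cleared. */
static bool toy_try_clear_bit(atomic_uint *word, unsigned int bit,
                              unsigned int max_try)
{
    unsigned int old = atomic_load(word);

    while ( max_try-- )
    {
        /* On failure, 'old' is refreshed with the current value. */
        if ( atomic_compare_exchange_weak(word, &old, old & ~(1u << bit)) )
            return true;
    }

    return false;
}

static void toy_guest_clear_bit(atomic_uint *word, unsigned int bit,
                                unsigned int max_try)
{
    if ( toy_try_clear_bit(word, bit, max_try) )
        return;

    /* Bounded attempts exhausted: pause, do the update, unpause. */
    toy_pause();
    atomic_fetch_and(word, ~(1u << bit));
    toy_unpause();
}
```

With max_try set to 1, any failed attempt goes straight to the
pause/unpause path, which is the path the hack above exercises on every
hypercall.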

> 
> In general, what is running (I mean which vcpu) on CPU0, when the
> domain_pause() happens? And what is running on CPU1 when schedule()
> happens?
> 
> If you just have the 2 dom0's vCPUs around (and we call them vCPU A and
> vCPU B), the only case for which I can imagine runq_pick() returning A
> on CPU1 would be if CPU0 would be running vCPU B (and invoked the
> hypercall from it) and CPU1 was idle... is this the case?

This is indeed the case. The schedule() on CPU1 happened because
vCPU A was woken up (e.g. an interrupt was received and injected into the
vCPU).

> 
>> When schedule() grabs the lock first (as shown above), the unit will
>> only be removed from the runQ/depleteQ. However, when
>> vcpu_sleep_nosync() grabs the lock first and the unit was not
>> scheduled, rt_unit_sleep() will remove the unit from two queues
>> (runQ/depleteQ and replenishQ).
>>
>> So I think we want schedule() to remove the unit from the 2 queues if
>> it is not runnable. Any opinions?
>>
> Mmm... that may work, but I'm not sure.
> 
> In fact, I'm starting to think that patch 7c7b407e777 "xen/sched:
> introduce unit_runnable_state()", which added the 'q_remove(snext)' in
> rt_schedule() might not be correct.

I have tested Xen before this commit and didn't manage to reproduce the
crash. With the commit applied, it crashes quite quickly.
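
For what it's worth, the inconsistency between the two paths can be
modelled in a few lines of C. This is only a toy, with queue membership
reduced to two flags; struct queued_unit and the toy_*() helpers are made
up for illustration and are not Xen functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: membership of the two queues as flags (hypothetical). */
struct queued_unit {
    bool on_runq;    /* on runQ/depleteQ */
    bool on_replq;   /* on replenishQ */
};

/* Sleep path for a unit that is not scheduled: leave both queues. */
static void toy_sleep(struct queued_unit *u)
{
    u->on_runq = false;
    u->on_replq = false;
}

/* Schedule path for a non-runnable unit: only the runq is touched. */
static void toy_schedule_drop(struct queued_unit *u)
{
    u->on_runq = false;
}

/*
 * Wake path, modelled on the failing assertion '!unit_on_replq(svc)':
 * returns false where the real code would hit the assertion, otherwise
 * queues the unit on both queues and returns true.
 */
static bool toy_wake(struct queued_unit *u)
{
    if ( u->on_replq )
        return false;

    u->on_replq = true;
    u->on_runq = true;
    return true;
}
```

A unit that went through toy_sleep() can be woken again, while one that
only went through toy_schedule_drop() is still on the replenishQ and
trips the check in toy_wake().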

> 
> In fact, if runq_pick() returns a vCPU which is in the runqueue, but is
> not runnable (e.g., because we're racing with do_domain_pause(), which
> already set pause_count), it's not rt_schedule()'s job to dequeue it from
> anything.
> 
> We probably should just ignore it and pick another vCPU, if any (and
> idle otherwise). Then, after we release the lock, it will be
> rt_unit_sleep(), called by do_domain_pause() in this case, that will
> finish the job of properly dequeueing it...
> 
> Another strange thing is that, as the code looks right now, runq_pick()
> returns the first unit in the runq (i.e., the one with the earliest
> deadline), without checking whether it is runnable. Then, in
> rt_schedule(), if the unit is not runnable, we (only partially, as you
> figured out) dequeue it, and use idle instead, as our candidate for
> being the next scheduled unit... But what if there were other
> *runnable* units in the runqueue?

My knowledge of the scheduler is quite limited. Maybe Meng would be able
to answer this question?

Cheers,

-- 
Julien Grall


* Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
  2020-02-02 12:57       ` Julien Grall
@ 2020-02-02 13:15         ` Dario Faggioli
  0 siblings, 0 replies; 7+ messages in thread
From: Dario Faggioli @ 2020-02-02 13:15 UTC (permalink / raw)
  To: Julien Grall, Julien Grall, osstest service owner, George Dunlap,
	Jürgen Groß, Stefano Stabellini
  Cc: xen-devel, xumengpanda


On Sun, 2020-02-02 at 12:57 +0000, Julien Grall wrote:
> Hi Dario,
> 
Hi,

> Apologies for the late answer.
> 
No problem, I haven't had any more time to look into this yet either.

> On 22/01/2020 03:40, Dario Faggioli wrote:
> > On Fri, 2020-01-10 at 18:24 +0000, Julien Grall wrote:
> > > 
> > You have a 2 vCPUs dom0, and how many other vCPUs from other
> > domains?
> > Or do you only have those 2 dom0 vCPUs and you are actually pausing
> > dom0?
> 
> Only dom0 with 2 vCPUs is running. On every hypercall, it will try
> to 
> pause/unpause itself. 
>
Ok, that was my understanding, but I wasn't 100% sure. Thanks for
confirming.

> This is to roughly match the behavior of the Arm 
> guest atomic helpers.
> 
Yep, makes sense.

> > If you just have the 2 dom0's vCPUs around (and we call them vCPU A
> > and
> > vCPU B), the only case for which I can imagine runq_pick()
> > returning A
> > on CPU1 would be if CPU0 would be running vCPU B (and invoked the
> > hypercall from it) and CPU1 was idle... is this the case?
> 
> This is indeed the case. The schedule() on CPU1 happened because
> vCPU A was woken up (e.g. an interrupt was received and injected into
> the vCPU).
> 
Right.

> > In fact, I'm starting to think that patch 7c7b407e777 "xen/sched:
> > introduce unit_runnable_state()", which added the 'q_remove(snext)'
> > in
> > rt_schedule() might not be correct.
> 
> I have tested Xen before this commit and didn't manage to reproduce
> the crash. With the commit applied, it crashes quite quickly.
> 
Ok, thanks for checking this as well. That's very useful.

> > In fact, if runq_pick() returns a vCPU which is in the runqueue,
> > but is not runnable (e.g., because we're racing with
> > do_domain_pause(), which already set pause_count), it's not
> > rt_schedule()'s job to dequeue it from anything.
> > 
> > We probably should just ignore it and pick another vCPU, if any
> > (and idle otherwise). Then, after we release the lock, it will be
> > rt_unit_sleep(), called by do_domain_pause() in this case, that
> > will finish the job of properly dequeueing it...
> > 
> > Another strange thing is that, as the code looks right now,
> > runq_pick()
> > returns the first unit in the runq (i.e., the one with the earliest
> > deadline), without checking whether it is runnable. Then, in
> > rt_schedule(), if the unit is not runnable, we (only partially, as
> > you
> > figured out) dequeue it, and use idle instead, as our candidate for
> > being the next scheduled unit... But what if there were other
> > *runnable* units in the runqueue?
> 
> My knowledge of the scheduler is quite limited. Maybe Meng would be
> able to answer this question?
> 
Yes, indeed, here I was pretty much thinking out loud, and trying to
trigger comments from Meng.

Anyway, I'll see about putting together a quick test patch that
implements what I described (next week), and let's see if it works.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


Thread overview: 7+ messages
2020-01-08 21:38 [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED osstest service owner
2020-01-08 23:14 ` Julien Grall
2020-01-10 18:24   ` Julien Grall
2020-01-10 23:30     ` Julien Grall
2020-01-22  3:40     ` Dario Faggioli
2020-02-02 12:57       ` Julien Grall
2020-02-02 13:15         ` Dario Faggioli
