[xen-unstable test] 6374: regressions

* [xen-unstable test] 6374: regressions - FAIL
@ 2011-03-11 16:20 xen.org
  2011-03-11 17:51 ` Ian Jackson
  0 siblings, 1 reply; 9+ messages in thread
From: xen.org @ 2011-03-11 16:20 UTC
  To: xen-devel; +Cc: ian.jackson

flight 6374 xen-unstable real [real]
http://www.chiark.greenend.org.uk/~xensrcts/logs/6374/

Regressions :-(

Tests which did not succeed and are blocking:
 test-amd64-i386-pv            5 xen-boot                   fail REGR. vs. 6369

Tests which did not succeed, but are not blocking,
including regressions (tests previously passed) regarded as allowable:
 test-amd64-amd64-win         16 leak-check/check             fail   never pass
 test-amd64-amd64-xl-win      13 guest-stop                   fail   never pass
 test-amd64-i386-rhel6hvm-amd  8 guest-saverestore            fail   never pass
 test-amd64-i386-rhel6hvm-intel  8 guest-saverestore            fail never pass
 test-amd64-i386-win-vcpus1   16 leak-check/check             fail   never pass
 test-amd64-i386-win          16 leak-check/check             fail   never pass
 test-amd64-i386-xl-credit2    9 guest-start                  fail    like 6367
 test-amd64-i386-xl-win-vcpus1 13 guest-stop                   fail  never pass
 test-amd64-xcpkern-i386-rhel6hvm-amd  8 guest-saverestore      fail never pass
 test-amd64-xcpkern-i386-rhel6hvm-intel  8 guest-saverestore    fail never pass
 test-amd64-xcpkern-i386-win  16 leak-check/check             fail   never pass
 test-amd64-xcpkern-i386-xl-credit2 11 guest-localmigrate        fail like 6369
 test-amd64-xcpkern-i386-xl-win 13 guest-stop                   fail never pass
 test-i386-i386-win           16 leak-check/check             fail   never pass
 test-i386-i386-xl-win        13 guest-stop                   fail   never pass
 test-i386-xcpkern-i386-win   16 leak-check/check             fail   never pass

version targeted for testing:
 xen                  22cc047eb146
baseline version:
 xen                  6fa299ad15c8

------------------------------------------------------------
People who touched revisions under test:
  Ian Campbell <ian.campbell@citrix.com>
  Ian Jackson <ian.jackson@eu.citrix.com>
  Jan Beulich <jbeulich@novell.com>
  Jim Fehlig <jfehlig@novell.com>
  Liu, Jinsong <jinsong.liu@intel.com>
  Stefano Stabellini <stefano.stabellini@eu.citrix.com>
  Wei Gang <gang.wei@intel.com>
------------------------------------------------------------

jobs:
 build-i386-xcpkern                                           pass     
 build-amd64                                                  pass     
 build-i386                                                   pass     
 build-amd64-oldkern                                          pass     
 build-i386-oldkern                                           pass     
 build-amd64-pvops                                            pass     
 build-i386-pvops                                             pass     
 test-amd64-amd64-xl                                          pass     
 test-amd64-i386-xl                                           pass     
 test-i386-i386-xl                                            pass     
 test-amd64-xcpkern-i386-xl                                   pass     
 test-i386-xcpkern-i386-xl                                    pass     
 test-amd64-i386-rhel6hvm-amd                                 fail     
 test-amd64-xcpkern-i386-rhel6hvm-amd                         fail     
 test-amd64-i386-xl-credit2                                   fail     
 test-amd64-xcpkern-i386-xl-credit2                           fail     
 test-amd64-i386-rhel6hvm-intel                               fail     
 test-amd64-xcpkern-i386-rhel6hvm-intel                       fail     
 test-amd64-i386-xl-multivcpu                                 pass     
 test-amd64-xcpkern-i386-xl-multivcpu                         pass     
 test-amd64-amd64-pair                                        pass     
 test-amd64-i386-pair                                         pass     
 test-i386-i386-pair                                          pass     
 test-amd64-xcpkern-i386-pair                                 pass     
 test-i386-xcpkern-i386-pair                                  pass     
 test-amd64-amd64-pv                                          pass     
 test-amd64-i386-pv                                           fail     
 test-i386-i386-pv                                            pass     
 test-amd64-xcpkern-i386-pv                                   pass     
 test-i386-xcpkern-i386-pv                                    pass     
 test-amd64-i386-win-vcpus1                                   fail     
 test-amd64-i386-xl-win-vcpus1                                fail     
 test-amd64-amd64-win                                         fail     
 test-amd64-i386-win                                          fail     
 test-i386-i386-win                                           fail     
 test-amd64-xcpkern-i386-win                                  fail     
 test-i386-xcpkern-i386-win                                   fail     
 test-amd64-amd64-xl-win                                      fail     
 test-i386-i386-xl-win                                        fail     
 test-amd64-xcpkern-i386-xl-win                               fail     


------------------------------------------------------------
sg-report-flight on woking.cam.xci-test.com
logs: /home/xc_osstest/logs
images: /home/xc_osstest/images

Logs, config files, etc. are available at
    http://www.chiark.greenend.org.uk/~xensrcts/logs

Test harness code can be found at
    http://xenbits.xensource.com/gitweb?p=osstest.git;a=summary


Not pushing.

------------------------------------------------------------
changeset:   23020:22cc047eb146
tag:         tip
user:        Liu, Jinsong <jinsong.liu@intel.com>
date:        Thu Mar 10 18:35:32 2011 +0000
    
    x86: Fix cpuidle bug
    
    Before invoking C3, bus master disable / flush cache should be the
    last step; After resume from C3, bus master enable should be the first
    step;
    
    Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
    Acked-by: Wei Gang <gang.wei@intel.com>
    
    
changeset:   23019:c8947c24536a
user:        Ian Campbell <ian.campbell@citrix.com>
date:        Thu Mar 10 18:21:42 2011 +0000
    
    libxl: do not rely on guest to respond when forcing pci device removal
    
    This is consistent with the expected semantics of a forced device
    removal and also avoids a delay when destroying an HVM domain which
    either does not support hot unplug (does not respond to SCI) or has
    crashed.
    
    Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
    Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
    
    
changeset:   23018:a46101334ee2
user:        Jim Fehlig <jfehlig@novell.com>
date:        Thu Mar 10 18:17:16 2011 +0000
    
    libxl: Call setsid(2) before exec'ing device model
    
    While doing development on libvirt libxenlight driver I noticed
    that terminating a libxenlight client causes any qemu-dm
    processes that were indirectly created by the client to also
    terminate.  Calling setsid(2) before exec'ing qemu-dm resolves
    the issue.
    
    Signed-off-by: Jim Fehlig <jfehlig@novell.com>
    Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
    Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
    
    
changeset:   23017:b16644e446ef
user:        Stefano Stabellini <stefano.stabellini@eu.citrix.com>
date:        Thu Mar 10 18:11:31 2011 +0000
    
    update README
    
    update README: we are missing few compile time dependencies and a link
    to the pvops kernel page on the wiki.
    
    Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
    Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
    
    
changeset:   23016:6fa299ad15c8
user:        Jan Beulich <jbeulich@novell.com>
date:        Wed Mar 09 17:25:44 2011 +0000
    
    x86: remove pre-686 CPU support bits
    
    ... as Xen doesn't run on such CPUs anyway. Clearly these bits were
    particularly odd to have on x86-64.
    
    Signed-off-by: Jan Beulich <jbeulich@novell.com>
    
    
(qemu changes not included)

* Re: [xen-unstable test] 6374: regressions - FAIL
  2011-03-11 16:20 [xen-unstable test] 6374: regressions - FAIL xen.org
@ 2011-03-11 17:51 ` Ian Jackson
  2011-03-14 10:02   ` Tim Deegan
  2011-03-14 10:33   ` Jan Beulich
  0 siblings, 2 replies; 9+ messages in thread
From: Ian Jackson @ 2011-03-11 17:51 UTC
  To: xen-devel

xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
> flight 6374 xen-unstable real [real]
> Tests which did not succeed and are blocking:
>  test-amd64-i386-pv            5 xen-boot               fail REGR. vs. 6369

Xen crash in scheduler (non-credit2).

Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre  x86_64  debug=y  Not tainted ]----
Mar 11 13:46:57.931763 (XEN) CPU:    1
Mar 11 13:46:57.931784 (XEN) RIP:    e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047   CONTEXT: hypervisor
Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0   rbx: ffff8301a7fafc78   rcx: 0000000000000002
Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0   rsi: 0000000000000080   rdi: ffff8301a7fafc78
Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8   rsp: ffff8301a7fafc00   r8:  0000000000000002
Mar 11 13:46:57.966770 (XEN) r9:  0000ffff0000ffff   r10: 00ff00ff00ff00ff   r11: 0f0f0f0f0f0f0f0f
Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68   r13: 0000000000000001   r14: 0000000000000001
Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0   cr0: 000000008005003b   cr4: 00000000000006f0
Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000   cr2: 00000000c45e5770
Mar 11 13:46:57.987800 (XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0000   cs: e008
Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
Mar 11 13:46:57.998802 (XEN)    ffff82c480119557 00007cfe580503c7 ffff82c4802d1ac0 ffff82c4802d0cc0
Mar 11 13:46:58.010781 (XEN)    ffff82c4802d1ad0 ffff82c4802d1ad0 ffff82c4802d1ad0 ffff82c4802d1ad0
Mar 11 13:46:58.019765 (XEN)    01ff8301a7fb3048 0000000000000800 0000000000000000 0000000000000100
Mar 11 13:46:58.019798 (XEN)    0000000000000000 0000000000000f02 0000000000000000 0000000000000f00
Mar 11 13:46:58.031777 (XEN)    0000000000000000 ffff8301a7fb3048 ffff8301a7fb3048 ffff8301a7eac048
Mar 11 13:46:58.039906 (XEN)    0000000000000001 ffff82c4802d0cc0 0000000000000000 ffff8301a7fafcc8
Mar 11 13:46:58.039930 (XEN)    ffff82c480119582 ffff8301a7fafd28 ffff82c480122c8d 0000000100000001
Mar 11 13:46:58.051781 (XEN)    ffff82c4802d0cc0 ffff82c4802d0cc0 ffff8300d7cdc000 0000000000000206
Mar 11 13:46:58.063769 (XEN)    ffff8300d7cdc000 0000000000000001 ffff8300d7cdc000 000000018b4d75e5
Mar 11 13:46:58.063807 (XEN)    ffff8301a7fb3040 ffff8301a7fafd48 ffff82c480122e24 ed543b2d00000000
Mar 11 13:46:58.075781 (XEN)    ffff8300d7afc000 ffff8301a7fafe38 ffff82c480157f17 ffff82c480123dd4
Mar 11 13:46:58.087771 (XEN)    ffff82c4802d0cc8 ffff8301a7fafe38 ffff82c480118c8a ffff82c4802d0cc0
Mar 11 13:46:58.098761 (XEN)    ffff82c4802d0cc0 ffff82c4802d0cc0 ffff82c4802d0cc0 ffff82c4802d0cc0
Mar 11 13:46:58.098797 (XEN)    000000018b4d75e5 ffff8301a7fafe68 00000001a7e80e70 ffff8301a7ffa400
Mar 11 13:46:58.110773 (XEN)    ffff8301a7ffaee8 ffff8301a7fafdf8 ffc08301a7ffaf90 0000000000000086
Mar 11 13:46:58.119760 (XEN)    ffff8301a7fafdf8 ffff82c480123b91 0000000000000001 0000000000000000
Mar 11 13:46:58.119794 (XEN)    0000000000000000 ffff8301a7fafe38 ffff8300d7afc000 ffff8300d7cdc000
Mar 11 13:46:58.134790 (XEN)    0000000000000003 000000018b4d75e5 ffff8301a7fb3040 ffff8301a7fafeb8
Mar 11 13:46:58.139763 (XEN)    ffff82c4801226b4 ffff8301a7fafe68 000000018b4d75e5 ffff8301a7fb3100
Mar 11 13:46:58.139804 (XEN)    ffff8300d7cdc060 ffff8300d7afc000 ffffffffffffffff ffff8301a7faff00
Mar 11 13:46:58.154777 (XEN) Xen call trace:
Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a
Mar 11 13:46:58.198817 (XEN)    
Mar 11 13:46:58.198828 (XEN) 
Mar 11 13:46:58.198839 (XEN) ****************************************
Mar 11 13:46:58.207765 (XEN) Panic on CPU 1:
Mar 11 13:46:58.207787 (XEN) FATAL TRAP: vector = 2 (nmi)
Mar 11 13:46:58.207813 (XEN) [error_code=0000] , IN INTERRUPT CONTEXT
Mar 11 13:46:58.218761 (XEN) ****************************************
Mar 11 13:46:58.218788 (XEN) 
Mar 11 13:46:58.218802 (XEN) Reboot in five seconds...

* Re: [xen-unstable test] 6374: regressions - FAIL
  2011-03-11 17:51 ` Ian Jackson
@ 2011-03-14 10:02   ` Tim Deegan
  2011-03-14 10:39     ` Jan Beulich
  2011-03-14 10:33   ` Jan Beulich
  1 sibling, 1 reply; 9+ messages in thread
From: Tim Deegan @ 2011-03-14 10:02 UTC
  To: Ian Jackson; +Cc: George Dunlap, xen-devel@lists.xensource.com

At 17:51 +0000 on 11 Mar (1299865912), Ian Jackson wrote:
> Mar 11 13:46:58.154777 (XEN) Xen call trace:
> Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
> Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
> Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
> Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
> Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
> Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
> Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
> Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a

I think this hang comes because although this code:

            cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
            if ( commit )
               CSCHED_PCPU(nxt)->idle_bias = cpu;
            cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));

removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
have been in cpus in the first place, and none of its siblings are
either since nxt might not be its sibling.
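
A minimal sketch of the termination condition under discussion, using a plain integer bitmask instead of Xen's cpumask_t. The names mask_t, mask_andnot and pass_makes_progress are hypothetical and exist only for illustration; the sketch shows the abstract condition, not whether the real code can reach it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for cpumask_t: bit n set means CPU n is in the mask. */
typedef uint32_t mask_t;

/* Toy stand-in for cpus_andnot(): returns a with every bit of b cleared. */
static mask_t mask_andnot(mask_t a, mask_t b)
{
    return a & ~b;
}

/*
 * True if clearing `removed` from `cpus` drops at least one bit, i.e. a
 * "loop while cpus is not empty" moves towards termination on this pass.
 */
static bool pass_makes_progress(mask_t cpus, mask_t removed)
{
    return mask_andnot(cpus, removed) != cpus;
}
```

For example, with cpus = 0x3 (CPUs 0 and 1) and a removed set of 0xC (CPUs 2 and 3), pass_makes_progress() returns false, because the two masks do not intersect. Whenever the removed set contains a CPU that is still in cpus, it returns true.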

Possible fix:

diff -r b9a5d116102d xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Thu Mar 10 13:06:52 2011 +0000
+++ b/xen/common/sched_credit.c	Mon Mar 14 09:25:07 2011 +0000
@@ -533,7 +533,7 @@ _csched_cpu_pick(const struct scheduler 
             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
             if ( commit )
                CSCHED_PCPU(nxt)->idle_bias = cpu;
-            cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
+            cpus_andnot(cpus, cpus, nxt_idlers);
         }
         else
         {

which guarantees that nxt will be removed from cpus, though I suspect
this means that we might not pick the best HT pair in a particular core.
Scheduler code is twisty and hurts my brain so I'd like George's
opinion before checking anything in.

Cheers,

Tim.

P.S. the patch above is a one-liner for clarity: a better fix would be:

diff -r b9a5d116102d xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Thu Mar 10 13:06:52 2011 +0000
+++ b/xen/common/sched_credit.c	Mon Mar 14 09:26:11 2011 +0000
@@ -533,12 +533,8 @@ _csched_cpu_pick(const struct scheduler 
             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
             if ( commit )
                CSCHED_PCPU(nxt)->idle_bias = cpu;
-            cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
         }
-        else
-        {
-            cpus_andnot(cpus, cpus, nxt_idlers);
-        }
+        cpus_andnot(cpus, cpus, nxt_idlers);
     }
 
     return cpu;



-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

* Re: [xen-unstable test] 6374: regressions - FAIL
  2011-03-11 17:51 ` Ian Jackson
  2011-03-14 10:02   ` Tim Deegan
@ 2011-03-14 10:33   ` Jan Beulich
  2011-03-14 14:40     ` Juergen Gross
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2011-03-14 10:33 UTC
  To: Ian Jackson, Juergen Gross; +Cc: xen-devel

>>> On 11.03.11 at 18:51, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
>> flight 6374 xen-unstable real [real]
>> Tests which did not succeed and are blocking:
>>  test-amd64-i386-pv            5 xen-boot               fail REGR. vs. 6369
> 
> Xen crash in scheduler (non-credit2).
> 
> Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
> Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre  x86_64  debug=y  Not tainted ]----
> Mar 11 13:46:57.931763 (XEN) CPU:    1
> Mar 11 13:46:57.931784 (XEN) RIP:    e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
> Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047   CONTEXT: hypervisor
> Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0   rbx: ffff8301a7fafc78   rcx: 0000000000000002
> Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0   rsi: 0000000000000080   rdi: ffff8301a7fafc78
> Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8   rsp: ffff8301a7fafc00   r8:  0000000000000002
> Mar 11 13:46:57.966770 (XEN) r9:  0000ffff0000ffff   r10: 00ff00ff00ff00ff   r11: 0f0f0f0f0f0f0f0f
> Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68   r13: 0000000000000001   r14: 0000000000000001
> Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0   cr0: 000000008005003b   cr4: 00000000000006f0
> Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000   cr2: 00000000c45e5770
> Mar 11 13:46:57.987800 (XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0000   cs: e008
> Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
>...
> Mar 11 13:46:58.154777 (XEN) Xen call trace:
> Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
> Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
> Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
> Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
> Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
> Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
> Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
> Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a

I suppose that's a result of 22957:c5c4688d5654 - as I understand it
exiting the loop is only possible if two consecutive invocations of
pick_cpu return the same result. This, however, is precisely what the
pCPU's idle_bias is supposed to prevent on hyper-threaded/multi-core
systems (so that it's not always the same entity that gets selected).

But even beyond that particular aspect, relying on any form of
"stability" of the returned value isn't correct.
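
A minimal sketch of why a "retry until two consecutive picks agree" loop may never exit when the picker deliberately rotates its choice. Everything here is a hypothetical toy model (mask_t, next_cpu_after, struct picker, pick_cpu, picks_until_stable); it is not the code from either changeset, and the retry bound exists only so that the demonstration halts.

```c
#include <stdint.h>

/* Toy stand-in for cpumask_t: bit n set means CPU n is idle. */
typedef uint32_t mask_t;

/*
 * Toy analogue of cycle_cpu(): the next set bit in `mask` strictly after
 * `start`, wrapping around. Returns -1 if the mask is empty.
 */
static int next_cpu_after(int start, mask_t mask)
{
    for (int i = 1; i <= 32; i++) {
        int cpu = (start + i) % 32;
        if (mask & ((mask_t)1 << cpu))
            return cpu;
    }
    return -1;
}

struct picker {
    int idle_bias; /* last CPU handed out; the next pick starts after it */
};

/* Pick a CPU and commit the bias, so the next pick rotates onwards. */
static int pick_cpu(struct picker *p, mask_t idlers)
{
    int cpu = next_cpu_after(p->idle_bias, idlers);
    p->idle_bias = cpu;
    return cpu;
}

/*
 * Retry until two consecutive picks agree. Returns the number of picks
 * made, or -1 if no two consecutive picks agreed within max_tries.
 */
static int picks_until_stable(struct picker *p, mask_t idlers, int max_tries)
{
    int prev = pick_cpu(p, idlers);
    for (int n = 2; n <= max_tries; n++) {
        int cur = pick_cpu(p, idlers);
        if (cur == prev)
            return n;
        prev = cur;
    }
    return -1;
}
```

In this model, two idle CPUs make the picks alternate, so picks_until_stable() gives up after max_tries; with a single idle CPU the second pick already matches the first.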

Plus running pick_cpu repeatedly without actually using its result
is wrong wrt idle_bias updating too - that's why
csched_vcpu_acct() calls _csched_cpu_pick() with the commit
argument set to false (which will result in a subsequent call -
through pick_cpu - with the argument set to true to be likely
to return the same value, but there's no correctness dependency
on that). So 22948:2d35823a86e7 already wasn't really correct
in putting a loop around pick_cpu.

It's also not clear to me what the surrounding
if ( old_lock == per_cpu(schedule_data, old_cpu).schedule_lock )
is supposed to filter, as the lock pointer gets set only when a
CPU gets brought up.

As I don't really understand what is being attempted here,
I also can't really suggest a possible fix other than reverting both
offending changesets.

Jan

* Re: [xen-unstable test] 6374: regressions - FAIL
  2011-03-14 10:02   ` Tim Deegan
@ 2011-03-14 10:39     ` Jan Beulich
  2011-03-14 10:52       ` Tim Deegan
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2011-03-14 10:39 UTC
  To: Tim Deegan, Ian Jackson; +Cc: George Dunlap, xen-devel@lists.xensource.com

>>> On 14.03.11 at 11:02, Tim Deegan <Tim.Deegan@citrix.com> wrote:
> At 17:51 +0000 on 11 Mar (1299865912), Ian Jackson wrote:
>> Mar 11 13:46:58.154777 (XEN) Xen call trace:
>> Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
>> Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
>> Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
>> Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
>> Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
>> Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
>> Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a
> 
> I think this hang comes because although this code:
> 
>             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>             if ( commit )
>                CSCHED_PCPU(nxt)->idle_bias = cpu;
>             cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
> 
> removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
> have been in cpus in the first place, and none of its siblings are
> either since nxt might not be its sibling.

I had originally spent quite a while verifying that the loop this is in
can't be infinite (i.e. there's always going to be at least one bit
removed from "cpus"), and did so again during the last half hour
or so. I'm certain (hardened also by the CPU masks we see on the
stack) that it's not this function itself that's looping infinitely, but
rather its caller (see my other reply sent just a few minutes ago).

> Possible fix:
> 
> diff -r b9a5d116102d xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c	Thu Mar 10 13:06:52 2011 +0000
> +++ b/xen/common/sched_credit.c	Mon Mar 14 09:25:07 2011 +0000
> @@ -533,7 +533,7 @@ _csched_cpu_pick(const struct scheduler 
>              cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>              if ( commit )
>                 CSCHED_PCPU(nxt)->idle_bias = cpu;
> -            cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
> +            cpus_andnot(cpus, cpus, nxt_idlers);
>          }
>          else
>          {
> 
> which guarantees that nxt will be removed from cpus, though I suspect
> this means that we might not pick the best HT pair in a particular core.
> Scheduler code is twisty and hurts my brain so I'd like George's
> opinion before checking anything in.

No - that was done in precisely the opposite direction to get
better symmetry of load across all CPUs. With what you propose,
idle_bias would become meaningless.

Jan

* Re: [xen-unstable test] 6374: regressions - FAIL
  2011-03-14 10:39     ` Jan Beulich
@ 2011-03-14 10:52       ` Tim Deegan
  2011-03-14 16:08         ` Jan Beulich
  0 siblings, 1 reply; 9+ messages in thread
From: Tim Deegan @ 2011-03-14 10:52 UTC
  To: Jan Beulich
  Cc: George Dunlap, xen-devel@lists.xensource.com,
	Ian Jackson

At 10:39 +0000 on 14 Mar (1300099174), Jan Beulich wrote:
> > I think this hang comes because although this code:
> > 
> >             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
> >             if ( commit )
> >                CSCHED_PCPU(nxt)->idle_bias = cpu;
> >             cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
> > 
> > removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
> > have been in cpus in the first place, and none of its siblings are
> > either since nxt might not be its sibling.
> 
> I had originally spent quite a while to verify that the loop this is in
> can't be infinite (i.e. there's going to be always at least one bit
> removed from "cpus"), and did so again during the last half hour
> or so.

I'm pretty sure there are possible passes through this loop that don't
remove any cpus, though I haven't constructed the full history that gets
you there.  But the cpupool patches you suggest in your other email look
like much stronger candidates for this hang.

> > which guarantees that nxt will be removed from cpus, though I suspect
> > this means that we might not pick the best HT pair in a particular core.
> > Scheduler code is twisty and hurts my brain so I'd like George's
> > opinion before checking anything in.
> 
> No - that was precisely done the opposite direction to get
> better symmetry of load across all CPUs. With what you propose,
> idle_bias would become meaningless.

I don't see why it would.  As I said, having picked a core we
might not iterate to pick the best cpu within that core, but the
round-robining effect is still there.  And even if not I figured a
hypervisor crash is worse than a suboptimal scheduling decision. :)

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

* Re: [xen-unstable test] 6374: regressions - FAIL
  2011-03-14 10:33   ` Jan Beulich
@ 2011-03-14 14:40     ` Juergen Gross
  0 siblings, 0 replies; 9+ messages in thread
From: Juergen Gross @ 2011-03-14 14:40 UTC
  To: Jan Beulich; +Cc: xen-devel, Ian Jackson

On 03/14/11 11:33, Jan Beulich wrote:
>>>> On 11.03.11 at 18:51, Ian Jackson<Ian.Jackson@eu.citrix.com>  wrote:
>> xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
>>> flight 6374 xen-unstable real [real]
>>> Tests which did not succeed and are blocking:
>>>   test-amd64-i386-pv            5 xen-boot               fail REGR. vs. 6369
>>
>> Xen crash in scheduler (non-credit2).
>>
>> Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
>> Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre  x86_64  debug=y  Not tainted ]----
>> Mar 11 13:46:57.931763 (XEN) CPU:    1
>> Mar 11 13:46:57.931784 (XEN) RIP:    e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047   CONTEXT: hypervisor
>> Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0   rbx: ffff8301a7fafc78   rcx: 0000000000000002
>> Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0   rsi: 0000000000000080   rdi: ffff8301a7fafc78
>> Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8   rsp: ffff8301a7fafc00   r8:  0000000000000002
>> Mar 11 13:46:57.966770 (XEN) r9:  0000ffff0000ffff   r10: 00ff00ff00ff00ff   r11: 0f0f0f0f0f0f0f0f
>> Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68   r13: 0000000000000001   r14: 0000000000000001
>> Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0   cr0: 000000008005003b   cr4: 00000000000006f0
>> Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000   cr2: 00000000c45e5770
>> Mar 11 13:46:57.987800 (XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0000   cs: e008
>> Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
>> ...
>> Mar 11 13:46:58.154777 (XEN) Xen call trace:
>> Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
>> Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
>> Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
>> Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
>> Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
>> Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
>> Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a
>
> I suppose that's a result of 22957:c5c4688d5654 - as I understand it
> exiting the loop is only possible if two consecutive invocations of
> pick_cpu return the same result. This, however, is precisely what the
> pCPU's idle_bias is supposed to prevent on hyper-threaded/multi-core
> systems (so that it's not always the same entity that gets selected).
>
> But even beyond that particular aspect, relying on any form of
> "stability" of the returned value isn't correct.
>
> Plus running pick_cpu repeatedly without actually using its result
> is wrong wrt to idle_bias updating too - that's why
> cached_vcpu_acct() calls _csched_cpu_pick() with the commit
> argument set to false (which will result in a subsequent call -
> through pick_cpu - with the argument set to true to be likely
> to return the same value, but there's no correctness dependency
> on that). So 22948:2d35823a86e7 already wasn't really correct
> in putting a loop around pick_cpu.
>
> It's also not clear to me what the surrounding
> if ( old_lock == per_cpu(schedule_data, old_cpu).schedule_lock )
> is supposed to filter, as the lock pointer gets set only when a
> CPU gets brought up.

Yeah, but the vcpu can change cpus while we don't hold the lock.
This means old_cpu can change between selecting the lock and actually
taking it...
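
A minimal sketch of the "take the lock, then re-check" pattern that such a comparison supports, written with C11 atomics rather than Xen's spinlocks. The names cpu_lock, struct vcpu, lock_vcpu_cpu and unlock_cpu are hypothetical, as is the fixed NR_CPUS of 4; this is an illustration under those assumptions, not the hypervisor's code.

```c
#include <stdatomic.h>

#define NR_CPUS 4

/* One toy spinlock per CPU; a set flag means the lock is held. */
static atomic_flag cpu_lock[NR_CPUS] = {
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

struct vcpu {
    _Atomic int cpu; /* CPU the vcpu currently belongs to; may change */
};

/*
 * Lock the per-CPU lock of the CPU the vcpu is on. The vcpu may move
 * between reading v->cpu and acquiring the lock, so re-check afterwards
 * and retry if the lock taken belongs to a stale CPU.
 * Returns the CPU whose lock is now held.
 */
static int lock_vcpu_cpu(struct vcpu *v)
{
    for (;;) {
        int cpu = atomic_load(&v->cpu);

        while (atomic_flag_test_and_set(&cpu_lock[cpu]))
            ; /* spin until the lock is free */

        if (atomic_load(&v->cpu) == cpu)
            return cpu; /* still on the same CPU: the right lock is held */

        atomic_flag_clear(&cpu_lock[cpu]); /* vcpu moved: drop and retry */
    }
}

/* Release the lock obtained from lock_vcpu_cpu(). */
static void unlock_cpu(int cpu)
{
    atomic_flag_clear(&cpu_lock[cpu]);
}
```

A caller would pair the two, e.g. `int cpu = lock_vcpu_cpu(v); ... unlock_cpu(cpu);`, and treat v->cpu as stable only while the lock is held.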

> As I don't really understand what is being tried to achieve here,
> I also can't really suggest a possible fix other than reverting both
> offending changesets.

I'll send a patch as a suggestion :-)


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

* Re: [xen-unstable test] 6374: regressions - FAIL
  2011-03-14 10:52       ` Tim Deegan
@ 2011-03-14 16:08         ` Jan Beulich
  2011-03-14 16:17           ` Tim Deegan
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2011-03-14 16:08 UTC
  To: Tim Deegan
  Cc: George Dunlap, Ian Jackson, xen-devel@lists.xensource.com

>>> On 14.03.11 at 11:52, Tim Deegan <Tim.Deegan@citrix.com> wrote:
> At 10:39 +0000 on 14 Mar (1300099174), Jan Beulich wrote:
>> > I think this hang comes because although this code:
>> > 
>> >             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>> >             if ( commit )
>> >                CSCHED_PCPU(nxt)->idle_bias = cpu;
>> >             cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
>> > 
>> > removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
>> > have been in cpus in the first place, and none of its siblings are
>> > either since nxt might not be its sibling.
>> 
>> I had originally spent quite a while to verify that the loop this is in
>> can't be infinite (i.e. there's going to be always at least one bit
>> removed from "cpus"), and did so again during the last half hour
>> or so.
> 
> I'm pretty sure there are possible passes through this loop that don't
> remove any cpus, though I haven't constructed the full history that gets
> you there.

Actually, while I don't think that this can happen, something else is
definitely broken here: The logic can select a CPU that's not in the
vCPU's affinity mask. How I managed not to notice this when I
originally put this change together I can't tell. I'll send a patch in
a moment, and I think after that patch it's also easier to see that
each iteration will remove at least one bit.

>> > which guarantees that nxt will be removed from cpus, though I suspect
>> > this means that we might not pick the best HT pair in a particular core.
>> > Scheduler code is twisty and hurts my brain so I'd like George's
>> > opinion before checking anything in.
>> 
>> No - that was precisely done the opposite direction to get
>> better symmetry of load across all CPUs. With what you propose,
>> idle_bias would become meaningless.
> 
> I don't think see why it would.  As I said, having picked a core we
> might not iterate to pick the best cpu within that core, but the
> round-robining effect is still there.  And even if not I figured a
> hypervisor crash is worse than a suboptimal scheduling decision. :)

Sure. Just that this code has been there for quite a long time, and
it would be really strange to only now see it start producing hangs
(which apparently aren't that difficult to reproduce - iirc a similar
one was sent around by Ian a few days earlier).

Jan

* Re: [xen-unstable test] 6374: regressions - FAIL
  2011-03-14 16:08         ` Jan Beulich
@ 2011-03-14 16:17           ` Tim Deegan
  0 siblings, 0 replies; 9+ messages in thread
From: Tim Deegan @ 2011-03-14 16:17 UTC
  To: Jan Beulich; +Cc: George Dunlap, xen-devel@lists.xensource.com, Ian Jackson

At 16:08 +0000 on 14 Mar (1300118917), Jan Beulich wrote:
> Actually, while I don't think that this can happen, something else is
> definitely broken here: The logic can select a CPU that's not in the
> vCPU's affinity mask. How I managed to not note this when I
> originally put this change together I can't tell. I'll send a patch in
> a moment, and I think after that patch it's also easier to see that
> each iteration will remove at least one bit.

Yes, as long as the cpu selected has to be in "cpus", the loop is
definitely safe.  

> Sure. Just that this code has been there for quite a long time, and
> it would be really strange to only now see it start producing hangs
> (which apparently aren't that difficult to reproduce - iirc a similar
> one was sent around by Ian a few days earlier).

Agreed; the other branch of this thread is clearly where this particular
hang is coming from.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
