[timer/ticks related] dom0 hang during boot on large 1TB system


* [timer/ticks related] dom0 hang during boot on large 1TB system
@ 2009-12-18  4:36 Mukesh Rathor
  2009-12-18  7:02 ` Keir Fraser
  0 siblings, 1 reply; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-18  4:36 UTC (permalink / raw)
  To: Xen-devel@lists.xensource.com; +Cc: Kurt Hackel, Dan Magenheimer

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

Hi,

I finally solved a hang that I'd been working on, seen during dom0 boot
on a 1TB box running xen 3.4.0. The hang comes from:

calibrate_delay_direct():
....
        for (i = 0; i < MAX_DIRECT_CALIBRATION_RETRIES; i++) {
                pre_start = 0;
                start_jiffies = jiffies;
                while (jiffies <= (start_jiffies + tick_divider)) {
                        pre_start = start;
                        read_current_timer(&start);
                }
                read_current_timer(&post_start);
...


start_jiffies is set to : INITIAL_JIFFIES == 0xfffedb08

Now a timer interrupt comes in and, finding delta to be rather
huge (thanks to the page scrubbing of 1TB in xen), makes jiffies
wrap around. This causes a hang in the loop that would only resolve
after, say, several days.

delta: 940b7d68a4, jiffies:00009f8b
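
To make the wrap concrete, here is a minimal user-space sketch (not the
kernel code). It assumes a 32-bit jiffies counter and a tick_divider of 1;
the helper names are hypothetical, only the two hex values come from the
report above.

```c
#include <assert.h>
#include <stdint.h>

/* Values from the report; the 32-bit counter width is an assumption. */
#define START_JIFFIES   UINT32_C(0xfffedb08) /* INITIAL_JIFFIES as reported */
#define WRAPPED_JIFFIES UINT32_C(0x00009f8b) /* jiffies seen after the first tick */

/* Same test as the calibration loop: keep spinning while this is true. */
static int loop_still_spinning(uint32_t jiffies, uint32_t start_jiffies,
                               uint32_t tick_divider)
{
    uint32_t limit = start_jiffies + tick_divider; /* wraps modulo 2^32 */
    return jiffies <= limit;
}

/* Number of further ticks needed before the plain comparison turns false. */
static uint32_t ticks_until_exit(uint32_t jiffies, uint32_t start_jiffies,
                                 uint32_t tick_divider)
{
    uint32_t limit = start_jiffies + tick_divider;
    assert(jiffies <= limit); /* only meaningful while the loop is spinning */
    return limit - jiffies + 1;
}
```

With the reported values, loop_still_spinning(WRAPPED_JIFFIES,
START_JIFFIES, 1) is true, and ticks_until_exit() gives 0xfffe3b7f, i.e.
nearly a full trip around the 32-bit counter before the loop can exit.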


I came up with a fix (is there a reason it doesn't use 64-bit values?):

             while (jiffies <= (start_jiffies + tick_divider)) {
                   pre_start = start;
                   read_current_timer(&start);
+                  if (jiffies < start_jiffies)  /* jiffies wrapped */
+                          start_jiffies = jiffies;
             }


The other fix I thought of was to change INITIAL_JIFFIES to something 
sooner.


Would appreciate any help, I don't understand xen time management well.

thanks,
Mukesh


PS: I'm attaching output of 'xm debug-key t'.


[-- Attachment #2: skew.out --]
[-- Type: application/octet-stream, Size: 16359 bytes --]

(XEN) Synced cycles skew: max=128755 avg=128755 samples=1 current=128755
(XEN) Synced stime skew: max=63373ns avg=61935ns samples=2 current=63373ns
(XEN) Synced cycles skew: max=129906 avg=129330 samples=2 current=129906
(XEN) Synced stime skew: max=63373ns avg=60583ns samples=3 current=57881ns
(XEN) Synced cycles skew: max=133764 avg=130808 samples=3 current=133764
(XEN) Synced stime skew: max=63373ns avg=60431ns samples=4 current=59976ns
(XEN) Synced cycles skew: max=133764 avg=130397 samples=4 current=129165
(XEN) Synced stime skew: max=64212ns avg=61187ns samples=5 current=64212ns
(XEN) Synced cycles skew: max=133764 avg=129990 samples=5 current=128363
(XEN) Synced stime skew: max=77171ns avg=63851ns samples=6 current=77171ns
(XEN) Synced cycles skew: max=133764 avg=129874 samples=6 current=129293
(XEN) Synced stime skew: max=80539ns avg=66235ns samples=7 current=80539ns
(XEN) Synced cycles skew: max=133764 avg=129796 samples=7 current=129329
(XEN) Synced stime skew: max=86713ns avg=68795ns samples=8 current=86713ns
(XEN) Synced cycles skew: max=133764 avg=128902 samples=8 current=122645
(XEN) Synced stime skew: max=86713ns avg=67529ns samples=9 current=57401ns
(XEN) Synced cycles skew: max=133764 avg=128793 samples=9 current=127920
(XEN) Synced stime skew: max=86713ns avg=68076ns samples=10 current=73000ns
(XEN) Synced cycles skew: max=133764 avg=128779 samples=10 current=128656
(XEN) Synced stime skew: max=86713ns avg=67276ns samples=11 current=59282ns
(XEN) Synced cycles skew: max=133764 avg=128856 samples=11 current=129622
(XEN) Synced stime skew: max=86713ns avg=67577ns samples=12 current=70889ns
(XEN) Synced cycles skew: max=133764 avg=128868 samples=12 current=129006
(XEN) Synced stime skew: max=86713ns avg=67438ns samples=13 current=65768ns
(XEN) Synced cycles skew: max=133764 avg=128891 samples=13 current=129166
(XEN) Synced stime skew: max=86713ns avg=68475ns samples=14 current=81951ns
(XEN) Synced cycles skew: max=133764 avg=128910 samples=14 current=129159
(XEN) Synced stime skew: max=86713ns avg=68557ns samples=15 current=69716ns
(XEN) Synced cycles skew: max=140030 avg=129651 samples=15 current=140030
(XEN) Synced stime skew: max=86713ns avg=68187ns samples=16 current=62623ns
(XEN) Synced cycles skew: max=140030 avg=130028 samples=16 current=135669
(XEN) Synced stime skew: max=86713ns avg=67589ns samples=17 current=58036ns
(XEN) Synced cycles skew: max=140030 avg=130010 samples=17 current=129726
(XEN) Synced stime skew: max=86713ns avg=67628ns samples=18 current=68276ns
(XEN) Synced cycles skew: max=140702 avg=130604 samples=18 current=140702
(XEN) Synced stime skew: max=86713ns avg=67436ns samples=19 current=63995ns
(XEN) Synced cycles skew: max=140702 avg=130851 samples=19 current=135296
(XEN) Synced stime skew: max=86713ns avg=66869ns samples=20 current=56100ns
(XEN) Synced cycles skew: max=140702 avg=130728 samples=20 current=128388
(XEN) Synced stime skew: max=86713ns avg=67756ns samples=21 current=85495ns
(XEN) Synced cycles skew: max=141906 avg=131260 samples=21 current=141906
(XEN) Synced stime skew: max=86713ns avg=68000ns samples=22 current=73117ns
(XEN) Synced cycles skew: max=141906 avg=131701 samples=22 current=140977
(XEN) Synced stime skew: max=86713ns avg=66884ns samples=23 current=42339ns
(XEN) Synced cycles skew: max=141906 avg=131575 samples=23 current=128793
(XEN) Synced stime skew: max=86713ns avg=66076ns samples=24 current=47488ns
(XEN) Synced cycles skew: max=141906 avg=131472 samples=24 current=129115
(XEN) Synced stime skew: max=86713ns avg=65573ns samples=25 current=53510ns
(XEN) Synced cycles skew: max=141906 avg=131363 samples=25 current=128730
(XEN) Synced stime skew: max=86713ns avg=65448ns samples=26 current=62315ns
(XEN) Synced cycles skew: max=141906 avg=131535 samples=26 current=135841
(XEN) Synced stime skew: max=86713ns avg=66054ns samples=27 current=81795ns
(XEN) Synced cycles skew: max=141906 avg=131689 samples=27 current=135698
(XEN) Synced stime skew: max=89642ns avg=66896ns samples=28 current=89642ns
(XEN) Synced cycles skew: max=141906 avg=131823 samples=28 current=135425
(XEN) Synced stime skew: max=114074ns avg=68523ns samples=29 current=114074ns
(XEN) Synced cycles skew: max=141906 avg=131940 samples=29 current=135243
(XEN) Synced stime skew: max=114074ns avg=68299ns samples=30 current=61804ns
(XEN) Synced cycles skew: max=141906 avg=131864 samples=30 current=129658
(XEN) Synced stime skew: max=114074ns avg=67987ns samples=31 current=58635ns
(XEN) Synced cycles skew: max=141906 avg=131777 samples=31 current=129171
(XEN) Synced stime skew: max=114074ns avg=67455ns samples=32 current=50977ns
(XEN) Synced cycles skew: max=141906 avg=131705 samples=32 current=129449
(XEN) Synced stime skew: max=114074ns avg=67072ns samples=33 current=54801ns
(XEN) Synced cycles skew: max=141906 avg=131814 samples=33 current=135323
(XEN) Synced stime skew: max=114074ns avg=67185ns samples=34 current=70930ns
(XEN) Synced cycles skew: max=147335 avg=132271 samples=34 current=147335
(XEN) Synced stime skew: max=114074ns avg=67510ns samples=35 current=78543ns
(XEN) Synced cycles skew: max=147335 avg=132369 samples=35 current=135705
(XEN) Synced stime skew: max=114074ns avg=67665ns samples=36 current=73100ns
(XEN) Synced cycles skew: max=147335 avg=132436 samples=36 current=134786
(XEN) Synced stime skew: max=114074ns avg=66970ns samples=37 current=41961ns
(XEN) Synced cycles skew: max=147335 avg=132326 samples=37 current=128380
(XEN) Synced stime skew: max=114074ns avg=67081ns samples=38 current=71167ns
(XEN) Synced cycles skew: max=147335 avg=132433 samples=38 current=136364
(XEN) Synced stime skew: max=114074ns avg=66538ns samples=39 current=45891ns
(XEN) Synced cycles skew: max=147335 avg=132376 samples=39 current=130216
(XEN) Synced stime skew: max=114074ns avg=66474ns samples=40 current=63994ns
(XEN) Synced cycles skew: max=147335 avg=132464 samples=40 current=135885
(XEN) Synced stime skew: max=114074ns avg=66308ns samples=41 current=59661ns
(XEN) Synced cycles skew: max=147335 avg=132522 samples=41 current=134858
(XEN) Synced stime skew: max=114074ns avg=66824ns samples=42 current=87998ns
(XEN) Synced cycles skew: max=147335 avg=132719 samples=42 current=140812
(XEN) Synced stime skew: max=114074ns avg=67367ns samples=43 current=90148ns
(XEN) Synced cycles skew: max=147335 avg=132625 samples=43 current=128661
(XEN) Synced stime skew: max=114074ns avg=67279ns samples=44 current=63503ns
(XEN) Synced cycles skew: max=147335 avg=132533 samples=44 current=128561
(XEN) Synced stime skew: max=114074ns avg=67156ns samples=45 current=61761ns
(XEN) Synced cycles skew: max=147335 avg=132447 samples=45 current=128703
(XEN) Synced stime skew: max=114074ns avg=67049ns samples=46 current=62220ns
(XEN) Synced cycles skew: max=147335 avg=132532 samples=46 current=136336
(XEN) Synced stime skew: max=114074ns avg=66937ns samples=47 current=61780ns
(XEN) Synced cycles skew: max=147335 avg=132458 samples=47 current=129036
(XEN) Synced stime skew: max=114074ns avg=66972ns samples=48 current=68615ns
(XEN) Synced cycles skew: max=147335 avg=132399 samples=48 current=129646
(XEN) Synced stime skew: max=114074ns avg=67036ns samples=49 current=70121ns
(XEN) Synced cycles skew: max=147335 avg=132618 samples=49 current=143138
(XEN) Synced stime skew: max=114074ns avg=67235ns samples=50 current=76987ns
(XEN) Synced cycles skew: max=147335 avg=132554 samples=50 current=129434
(XEN) Synced stime skew: max=114074ns avg=67088ns samples=51 current=59725ns
(XEN) Synced cycles skew: max=147335 avg=132502 samples=51 current=129876
(XEN) Synced stime skew: max=114074ns avg=66717ns samples=52 current=47798ns
(XEN) Synced cycles skew: max=147335 avg=132458 samples=52 current=130224
(XEN) Synced stime skew: max=114074ns avg=66424ns samples=53 current=51182ns
(XEN) Synced cycles skew: max=147335 avg=132391 samples=53 current=128909
(XEN) Synced stime skew: max=114074ns avg=66357ns samples=54 current=62804ns
(XEN) Synced cycles skew: max=147335 avg=132544 samples=54 current=140644
(XEN) Synced stime skew: max=114074ns avg=66269ns samples=55 current=61534ns
(XEN) Synced cycles skew: max=147335 avg=132460 samples=55 current=127940
(XEN) Synced stime skew: max=114074ns avg=66123ns samples=56 current=58106ns
(XEN) Synced cycles skew: max=147335 avg=132387 samples=56 current=128369
(XEN) Synced stime skew: max=114074ns avg=66275ns samples=57 current=74757ns
(XEN) Synced cycles skew: max=147335 avg=132333 samples=57 current=129325
(XEN) Synced stime skew: max=114074ns avg=66295ns samples=58 current=67448ns
(XEN) Synced cycles skew: max=147335 avg=132288 samples=58 current=129711
(XEN) Synced stime skew: max=114074ns avg=66062ns samples=59 current=52570ns
(XEN) Synced cycles skew: max=147335 avg=132360 samples=59 current=136536
(XEN) Synced stime skew: max=114074ns avg=65885ns samples=60 current=55437ns
(XEN) Synced cycles skew: max=147335 avg=132320 samples=60 current=129941
(XEN) Synced stime skew: max=114074ns avg=65619ns samples=61 current=49685ns
(XEN) Synced cycles skew: max=147335 avg=132374 samples=61 current=135631
(XEN) Synced stime skew: max=114074ns avg=65465ns samples=62 current=56050ns
(XEN) Synced cycles skew: max=147335 avg=132302 samples=62 current=127925
(XEN) Synced stime skew: max=114074ns avg=65952ns samples=63 current=96125ns
(XEN) Synced cycles skew: max=147335 avg=132378 samples=63 current=137096
(XEN) Synced stime skew: max=114074ns avg=65938ns samples=64 current=65091ns
(XEN) Synced cycles skew: max=147335 avg=132336 samples=64 current=129665
(XEN) Synced stime skew: max=114074ns avg=66090ns samples=65 current=75796ns
(XEN) Synced cycles skew: max=147335 avg=132391 samples=65 current=135939
(XEN) Synced stime skew: max=114074ns avg=65911ns samples=66 current=54300ns
(XEN) Synced cycles skew: max=147335 avg=132321 samples=66 current=127769
(XEN) Synced stime skew: max=114074ns avg=65943ns samples=67 current=68057ns
(XEN) Synced cycles skew: max=147335 avg=132461 samples=67 current=141686
(XEN) Synced stime skew: max=114074ns avg=66035ns samples=68 current=72154ns
(XEN) Synced cycles skew: max=148713 avg=132700 samples=68 current=148713
(XEN) Synced stime skew: max=114074ns avg=65906ns samples=69 current=57182ns
(XEN) Synced cycles skew: max=148713 avg=132655 samples=69 current=129605
(XEN) Synced stime skew: max=114074ns avg=66096ns samples=70 current=79181ns
(XEN) Synced cycles skew: max=148713 avg=132600 samples=70 current=128797
(XEN) Synced stime skew: max=114074ns avg=66121ns samples=71 current=67886ns
(XEN) Synced cycles skew: max=148713 avg=132452 samples=71 current=122099
(XEN) Synced stime skew: max=114074ns avg=66355ns samples=72 current=82956ns
(XEN) Synced cycles skew: max=148713 avg=132486 samples=72 current=134861
(XEN) Synced stime skew: max=114074ns avg=66301ns samples=73 current=62392ns
(XEN) Synced cycles skew: max=148713 avg=132448 samples=73 current=129702
(XEN) Synced stime skew: max=114074ns avg=66148ns samples=74 current=54990ns
(XEN) Synced cycles skew: max=148713 avg=132396 samples=74 current=128615
(XEN) Synced stime skew: max=114074ns avg=65986ns samples=75 current=54038ns
(XEN) Synced cycles skew: max=148713 avg=132365 samples=75 current=130113
(XEN) Synced stime skew: max=114074ns avg=66055ns samples=76 current=71208ns
(XEN) Synced cycles skew: max=148713 avg=132322 samples=76 current=129063
(XEN) Synced stime skew: max=114074ns avg=66353ns samples=77 current=89023ns
(XEN) Synced cycles skew: max=148713 avg=132374 samples=77 current=136354
(XEN) Synced stime skew: max=114074ns avg=66410ns samples=78 current=70776ns
(XEN) Synced cycles skew: max=148713 avg=132525 samples=78 current=144157
(XEN) Synced stime skew: max=114074ns avg=66671ns samples=79 current=87005ns
(XEN) Synced cycles skew: max=148713 avg=132549 samples=79 current=134402
(XEN) Synced stime skew: max=114074ns avg=66643ns samples=80 current=64455ns
(XEN) Synced cycles skew: max=148713 avg=132594 samples=80 current=136110
(XEN) Synced stime skew: max=114074ns avg=66541ns samples=81 current=58358ns
(XEN) Synced cycles skew: max=148713 avg=132682 samples=81 current=139775
(XEN) Synced stime skew: max=114074ns avg=66389ns samples=82 current=54126ns
(XEN) Synced cycles skew: max=148713 avg=132722 samples=82 current=135982
(XEN) Synced stime skew: max=114074ns avg=66359ns samples=83 current=63867ns
(XEN) Synced cycles skew: max=148713 avg=132679 samples=83 current=129097
(XEN) Synced stime skew: max=114074ns avg=66396ns samples=84 current=69508ns
(XEN) Synced cycles skew: max=148713 avg=132623 samples=84 current=128036
(XEN) Synced stime skew: max=114074ns avg=66701ns samples=85 current=92294ns
(XEN) Synced cycles skew: max=148713 avg=132700 samples=85 current=139133
(XEN) Synced stime skew: max=114074ns avg=67039ns samples=86 current=95791ns
(XEN) Synced cycles skew: max=148713 avg=132666 samples=86 current=129790
(XEN) Synced stime skew: max=114074ns avg=66894ns samples=87 current=54424ns
(XEN) Synced cycles skew: max=148713 avg=132722 samples=87 current=137499
(XEN) Synced stime skew: max=114074ns avg=66797ns samples=88 current=58290ns
(XEN) Synced cycles skew: max=148713 avg=132756 samples=88 current=135705
(XEN) Synced stime skew: max=114074ns avg=66711ns samples=89 current=59225ns
(XEN) Synced cycles skew: max=148713 avg=132773 samples=89 current=134259
(XEN) Synced stime skew: max=114074ns avg=66694ns samples=90 current=65127ns
(XEN) Synced cycles skew: max=148713 avg=132945 samples=90 current=148336
(XEN) Synced stime skew: max=114074ns avg=66703ns samples=91 current=67514ns
(XEN) Synced cycles skew: max=148713 avg=133027 samples=91 current=140348
(XEN) Synced stime skew: max=114074ns avg=66860ns samples=92 current=81205ns
(XEN) Synced cycles skew: max=148713 avg=132980 samples=92 current=128738
(XEN) Synced stime skew: max=114074ns avg=67012ns samples=93 current=80923ns
(XEN) Synced cycles skew: max=148713 avg=133064 samples=93 current=140746
(XEN) Synced stime skew: max=114074ns avg=66937ns samples=94 current=59997ns
(XEN) Synced cycles skew: max=148713 avg=133017 samples=94 current=128660
(XEN) Synced stime skew: max=114074ns avg=66893ns samples=95 current=62789ns
(XEN) Synced cycles skew: max=148713 avg=133148 samples=95 current=145500
(XEN) Synced stime skew: max=114074ns avg=66788ns samples=96 current=56783ns
(XEN) Synced cycles skew: max=148713 avg=133169 samples=96 current=135154
(XEN) Synced stime skew: max=114074ns avg=66842ns samples=97 current=72061ns
(XEN) Synced cycles skew: max=148713 avg=133311 samples=97 current=146910
(XEN) Synced stime skew: max=114074ns avg=66872ns samples=98 current=69710ns
(XEN) Synced cycles skew: max=148713 avg=133417 samples=98 current=143715
(XEN) Synced stime skew: max=114074ns avg=67128ns samples=99 current=92284ns
(XEN) Synced cycles skew: max=148713 avg=133452 samples=99 current=136916
(XEN) Synced stime skew: max=114074ns avg=67447ns samples=100 current=98950ns
(XEN) Synced cycles skew: max=148713 avg=133445 samples=100 current=132776
(XEN) Synced stime skew: max=114074ns avg=67331ns samples=101 current=55763ns
(XEN) Synced cycles skew: max=148713 avg=133451 samples=101 current=133971
(XEN) Synced stime skew: max=114074ns avg=67234ns samples=102 current=57429ns
(XEN) Synced cycles skew: max=148713 avg=133413 samples=102 current=129592
(XEN) Synced stime skew: max=114074ns avg=67117ns samples=103 current=55250ns
(XEN) Synced cycles skew: max=148713 avg=133437 samples=103 current=135877
(XEN) Synced stime skew: max=114074ns avg=67104ns samples=104 current=65668ns
(XEN) Synced cycles skew: max=148713 avg=133391 samples=104 current=128627
(XEN) Synced stime skew: max=114074ns avg=67143ns samples=105 current=71277ns
(XEN) Synced cycles skew: max=149119 avg=133540 samples=105 current=149119
(XEN) Synced stime skew: max=114074ns avg=67119ns samples=106 current=64521ns
(XEN) Synced cycles skew: max=149119 avg=133654 samples=106 current=145602
(XEN) Synced stime skew: max=114074ns avg=67119ns samples=107 current=67195ns
(XEN) Synced cycles skew: max=149119 avg=133606 samples=107 current=128547
(XEN) Synced stime skew: max=114074ns avg=67200ns samples=108 current=75854ns
(XEN) Synced cycles skew: max=149119 avg=133678 samples=108 current=141399
(XEN) Synced stime skew: max=114074ns avg=67132ns samples=109 current=59729ns
(XEN) Synced cycles skew: max=149119 avg=133750 samples=109 current=141474

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-18  4:36 [timer/ticks related] dom0 hang during boot on large 1TB system Mukesh Rathor
@ 2009-12-18  7:02 ` Keir Fraser
  2009-12-18  8:42   ` Jan Beulich
                     ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Keir Fraser @ 2009-12-18  7:02 UTC (permalink / raw)
  To: Mukesh Rathor, Xen-devel@lists.xensource.com; +Cc: Kurt Hackel, Dan Magenheimer

On 18/12/2009 04:36, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:

> The other fix I thought of was to change INITIAL_JIFFIES to something
> sooner.
> 
> Would appreciate any help, I don't understand xen time management well.

This isn't really Xen time code, but unchanged Linux time code. I don't know
which tree you quoted the code from -- 2.6.18 has similar but not identical.
Anyway, I suggest try using the jiffy-comparison macros from
<linux/jiffies.h>: time_before(), time_after(), etc. These are designed to
work even when jiffies wraps. Feel free to send patch(es) for that, if you
test that out and it works okay.
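
A minimal user-space sketch of how such a wrap-safe comparison behaves,
assuming a 32-bit counter. wrap_safe_after() is a hypothetical stand-in
that mirrors the signed-difference idea behind time_after(); it is not the
kernel macro itself, and calibration_tick_elapsed() is likewise only an
illustration of how the loop condition could be inverted.

```c
#include <assert.h>
#include <stdint.h>

/*
 * True when a is later than b, judged by the sign of the difference.
 * Valid while the two values are less than 2^31 ticks apart. The cast of
 * an out-of-range value to int32_t is implementation-defined in C, but
 * behaves as two's-complement on mainstream compilers.
 */
static int wrap_safe_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Loop exit test: has "now" moved past start + tick_divider? */
static int calibration_tick_elapsed(uint32_t now, uint32_t start,
                                    uint32_t tick_divider)
{
    return wrap_safe_after(now, start + tick_divider);
}

/* Checks using the values from the original report (tick_divider assumed 1). */
static void check_reported_values(void)
{
    /* jiffies jumped across the wrap: the loop may now exit. */
    assert(calibration_tick_elapsed(UINT32_C(0x00009f8b), UINT32_C(0xfffedb08), 1));
    /* No tick yet: keep spinning. */
    assert(!calibration_tick_elapsed(UINT32_C(0xfffedb08), UINT32_C(0xfffedb08), 1));
}
```

The calibration loop would then spin while the "elapsed" test is false,
rather than comparing the raw counter values with <=.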

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-18  7:02 ` Keir Fraser
@ 2009-12-18  8:42   ` Jan Beulich
  2009-12-18  9:13     ` Keir Fraser
  2009-12-18 19:25   ` Mukesh Rathor
  2009-12-19  4:43   ` Mukesh Rathor
  2 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2009-12-18  8:42 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Kurt Hackel, Dan Magenheimer, Xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 18.12.09 08:02 >>>
>On 18/12/2009 04:36, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:
>
>> The other fix I thought of was to change INITIAL_JIFFIES to something
>> sooner.
>> 
>> Would appreciate any help, I don't understand xen time management well.
>
>This isn't really Xen time code, but unchanged Linux time code. I don't know
>which tree you quoted the code from -- 2.6.18 has similar but not identical.
>Anyway, I suggest try using the jiffy-comparison macros from
><linux/jiffies.h>: time_before(), time_after(), etc. These are designed to
>work even when jiffies wraps. Feel free to send patch(es) for that, if you
>test that out and it works okay.

But regardless of that - shouldn't the page scrubbing really be a
background operation these days, and as such be (relatively)
performance neutral to the booting of Dom0?

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-18  8:42   ` Jan Beulich
@ 2009-12-18  9:13     ` Keir Fraser
  2009-12-18 16:35       ` Dan Magenheimer
  2009-12-18 19:28       ` Mukesh Rathor
  0 siblings, 2 replies; 51+ messages in thread
From: Keir Fraser @ 2009-12-18  9:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Kurt Hackel, Dan Magenheimer, Xen-devel@lists.xensource.com

On 18/12/2009 08:42, "Jan Beulich" <JBeulich@novell.com> wrote:

>> This isn't really Xen time code, but unchanged Linux time code. I don't know
>> which tree you quoted the code from -- 2.6.18 has similar but not identical.
>> Anyway, I suggest try using the jiffy-comparison macros from
>> <linux/jiffies.h>: time_before(), time_after(), etc. These are designed to
>> work even when jiffies wraps. Feel free to send patch(es) for that, if you
>> test that out and it works okay.
> 
> But regardless of that - shouldn't the page scrubbing really be a
> background operation these days, and as such be (relatively)
> performance neutral to the booting of Dom0?

We synchronously scrub free memory before starting dom0, and then
subsequently scrub memory only for dying domains. So I don't know what
scrubbing would be going on during dom0's boot-time calibrations, on any
version of Xen, actually.

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-18  9:13     ` Keir Fraser
@ 2009-12-18 16:35       ` Dan Magenheimer
  2009-12-18 17:15         ` Keir Fraser
  2009-12-18 19:28       ` Mukesh Rathor
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2009-12-18 16:35 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: kurt.hackel, Xen-devel

> So I don't know what
> scrubbing would be going on during dom0's boot-time 
> calibrations, on any
> version of Xen, actually.

Wasn't the async page scrubbing removed post 3.4.0?
(I think Mukesh's bug was seen on 3.4.0.)  I see
c/s 19886 in July 2009 is "Remove page-scrub lists
and async scrubbing"... if that patch were not
applied, would Mukesh's observed bug make more sense?

Thanks,
Dan

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Friday, December 18, 2009 2:14 AM
> To: Jan Beulich
> Cc: Xen-devel@lists.xensource.com; Dan Magenheimer; Kurt 
> Hackel; Mukesh
> Rathor
> Subject: Re: [Xen-devel] [timer/ticks related] dom0 hang 
> during boot on
> large 1TB system
> 
> 
> On 18/12/2009 08:42, "Jan Beulich" <JBeulich@novell.com> wrote:
> 
> >> This isn't really Xen time code, but unchanged Linux time 
> code. I don't know
> >> which tree you quoted the code from -- 2.6.18 has similar 
> but not identical.
> >> Anyway, I suggest try using the jiffy-comparison macros from
> >> <linux/jiffies.h>: time_before(), time_after(), etc. These 
> are designed to
> >> work even when jiffies wraps. Feel free to send patch(es) 
> for that, if you
> >> test that out and it works okay.
> > 
> > But regardless of that - shouldn't the page scrubbing really be a
> > background operation these days, and as such be (relatively)
> > performance neutral to the booting of Dom0?
> 
> We synchronously scrub free memory before starting dom0, and then
> subsequently scrub memory only for dying domains. So I don't know what
> scrubbing would be going on during dom0's boot-time 
> calibrations, on any
> version of Xen, actually.
> 
>  -- Keir
> 
> 
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-18 16:35       ` Dan Magenheimer
@ 2009-12-18 17:15         ` Keir Fraser
  0 siblings, 0 replies; 51+ messages in thread
From: Keir Fraser @ 2009-12-18 17:15 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: kurt.hackel@oracle.com, Xen-devel@lists.xensource.com

On 18/12/2009 16:35, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> So I don't know what
>> scrubbing would be going on during dom0's boot-time
>> calibrations, on any
>> version of Xen, actually.
> 
> Wasn't the async page scrubbing removed post 3.4.0?
> (I think Mukesh's bug was seen on 3.4.0.)  I see
> c/s 19886 in July 2009 is "Remove page-scrub lists
> and async scrubbing"... if that patch were not
> applied, would Mukesh's observed bug make more sense?

Async page scrubbing was for scrubbing pages of dying domains. No domains
are dying while dom0 is still booting.

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-18  9:13     ` Keir Fraser
  2009-12-18 16:35       ` Dan Magenheimer
@ 2009-12-18 19:28       ` Mukesh Rathor
  1 sibling, 0 replies; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-18 19:28 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Kurt Hackel, Dan Magenheimer, Xen-devel@lists.xensource.com,
	Jan Beulich

On Fri, 18 Dec 2009 09:13:32 +0000
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> On 18/12/2009 08:42, "Jan Beulich" <JBeulich@novell.com> wrote:
> 
> >> This isn't really Xen time code, but unchanged Linux time code. I
> >> don't know which tree you quoted the code from -- 2.6.18 has
> >> similar but not identical. Anyway, I suggest try using the
> >> jiffy-comparison macros from <linux/jiffies.h>: time_before(),
> >> time_after(), etc. These are designed to work even when jiffies
> >> wraps. Feel free to send patch(es) for that, if you test that out
> >> and it works okay.
> > 
> > But regardless of that - shouldn't the page scrubbing really be a
> > background operation these days, and as such be (relatively)
> > performance neutral to the booting of Dom0?
> 
> We synchronously scrub free memory before starting dom0, and then
> subsequently scrub memory only for dying domains. So I don't know what
> scrubbing would be going on during dom0's boot-time calibrations, on
> any version of Xen, actually.
> 
>  -- Keir
> 

Scrubbing itself is not the bug. It just makes the timing right to expose
the bug. The system boots fine with less memory. The hyp does:
       create dom0, page scrub, unpause dom0.

With a lot of memory to scrub, it appears delta in dom0's timer_interrupt()
gets large enough that jiffies wraps.

Hope that makes sense.

thanks,
Mukesh

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-18  7:02 ` Keir Fraser
  2009-12-18  8:42   ` Jan Beulich
@ 2009-12-18 19:25   ` Mukesh Rathor
  2009-12-19  4:43   ` Mukesh Rathor
  2 siblings, 0 replies; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-18 19:25 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Hackel, Kurt, Xen-devel@lists.xensource.com, Dan Magenheimer

On Fri, 18 Dec 2009 07:02:55 +0000
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> On 18/12/2009 04:36, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:
> 
> > The other fix I thought of was to change INITIAL_JIFFIES to
> > something sooner.
> > 
> > Would appreciate any help, I don't understand xen time management
> > well.
> 
> This isn't really Xen time code, but unchanged Linux time code. I
> don't know which tree you quoted the code from -- 2.6.18 has similar
> but not identical. Anyway, I suggest try using the jiffy-comparison
> macros from <linux/jiffies.h>: time_before(), time_after(), etc.
> These are designed to work even when jiffies wraps. Feel free to send
> patch(es) for that, if you test that out and it works okay.
> 
>  -- Keir
> 

It's from the unstable version 2.6.18 tree from
 http://xenbits.xensource.com/linux-2.6.18-xen.hg

file init/calibrate.c,  function calibrate_delay_direct(). I see the code 
exactly the same as I mentioned.

Anyway, I'm testing out the patch, trying to reproduce the hang and make
sure the fix works.

thanks,
Mukesh

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-18  7:02 ` Keir Fraser
  2009-12-18  8:42   ` Jan Beulich
  2009-12-18 19:25   ` Mukesh Rathor
@ 2009-12-19  4:43   ` Mukesh Rathor
  2009-12-21  9:55     ` Jan Beulich
                       ` (2 more replies)
  2 siblings, 3 replies; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-19  4:43 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Hackel, Kurt, Xen-devel@lists.xensource.com, Dan Magenheimer,
	jeremy

[-- Attachment #1: Type: text/plain, Size: 2966 bytes --]

On Fri, 18 Dec 2009 07:02:55 +0000
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> On 18/12/2009 04:36, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:
> 
> > The other fix I thought of was to change INITIAL_JIFFIES to
> > something sooner.
> > 
> > Would appreciate any help, I don't understand xen time management
> > well.
> 
> This isn't really Xen time code, but unchanged Linux time code. I
> don't know which tree you quoted the code from -- 2.6.18 has similar
> but not identical. Anyway, I suggest try using the jiffy-comparison
> macros from <linux/jiffies.h>: time_before(), time_after(), etc.
> These are designed to work even when jiffies wraps. Feel free to send
> patch(es) for that, if you test that out and it works okay.
> 
>  -- Keir
> 

Ok, I came up with the following patch. Jeremy, can you please take a
look also, and comment on my fix since I noticed you've got the same 
issue in your tree. Here's a summary for your benefit:

init/calibrate.c :  calibrate_delay_direct():

                start_jiffies = get_jiffies_64();
                while (get_jiffies_64() <= (start_jiffies + tick_divider)) {
                        pre_start = start;
                        read_current_timer(&start);
                }


If the first ever timer interrupt comes after start_jiffies is set, dom0 boot
may hang if delta in timer_interrupt() is so huge that it causes jiffies
to wrap. Delta appears to be very large when memory is more than 512GB on
certain boxes, causing the wrap-around.

Why is delta in dom0's timer_interrupt() related to the memory on the system?
Because the hyp creates dom0, then scrubs pages, then unpauses the vcpu. So it
appears that a lot of page scrubbing results in a huge delta on the first tick.
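
As a rough user-space illustration of why a 64-bit counter avoids the
hang (a sketch only, not the patch itself): the starting value below
assumes the 64-bit counter begins at the reported INITIAL_JIFFIES
zero-extended, tick_divider is assumed to be 1, and the helper names are
hypothetical. The 0x1c483 tick jump is the distance from 0xfffedb08 to
0x00009f8b in 32-bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed start: INITIAL_JIFFIES from the report, zero-extended to 64 bits. */
#define START_JIFFIES_64 UINT64_C(0x00000000fffedb08)

/* Same loop condition as before, but on 64-bit values. */
static int loop_still_spinning_64(uint64_t now, uint64_t start,
                                  uint64_t tick_divider)
{
    return now <= start + tick_divider;
}

/* One huge first tick, as implied by the values in the original report. */
static void check_large_first_tick(void)
{
    uint64_t now = START_JIFFIES_64 + UINT64_C(0x1c483);

    /* The low 32 bits match the wrapped value that was observed... */
    assert((uint32_t)now == UINT32_C(0x00009f8b));
    /* ...but the full 64-bit value has simply moved forward. */
    assert(now == UINT64_C(0x100009f8b));
    /* So the plain <= comparison now lets the loop exit. */
    assert(!loop_still_spinning_64(now, START_JIFFIES_64, 1));
}
```

The design trade-off versus a wrap-safe 32-bit comparison is that reading
the 64-bit counter on a 32-bit kernel is not a single load, which is why
the patch goes through get_jiffies_64() instead of reading a variable
directly.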

thanks,
Mukesh


Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>

diff --git a/init/calibrate.c b/init/calibrate.c
index 06066a6..14f62c8 100644
--- a/init/calibrate.c
+++ b/init/calibrate.c
@@ -32,7 +32,7 @@ static unsigned long __devinit calibrate_delay_direct(void)
 {
 	unsigned long pre_start, start, post_start;
 	unsigned long pre_end, end, post_end;
-	unsigned long start_jiffies;
+	u64 start_jiffies;
 	unsigned long tsc_rate_min, tsc_rate_max;
 	unsigned long good_tsc_sum = 0;
 	unsigned long good_tsc_count = 0;
@@ -64,8 +64,8 @@ static unsigned long __devinit calibrate_delay_direct(void)
 	for (i = 0; i < MAX_DIRECT_CALIBRATION_RETRIES; i++) {
 		pre_start = 0;
 		read_current_timer(&start);
-		start_jiffies = jiffies;
-		while (jiffies <= (start_jiffies + tick_divider)) {
+		start_jiffies = get_jiffies_64();
+		while (get_jiffies_64() <= (start_jiffies + tick_divider)) {
 			pre_start = start;
 			read_current_timer(&start);
 		}
@@ -73,7 +73,7 @@ static unsigned long __devinit calibrate_delay_direct(void)
 
 		pre_end = 0;
 		end = post_start;
-		while (jiffies <=
+		while (get_jiffies_64() <=
 		       (start_jiffies + tick_divider * (1 + delay_calibration_ticks))) {
 			pre_end = end;
 			read_current_timer(&end);

[-- Attachment #2: diff.out --]
[-- Type: application/octet-stream, Size: 1236 bytes --]

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-19  4:43   ` Mukesh Rathor
@ 2009-12-21  9:55     ` Jan Beulich
  2009-12-21 18:20       ` Dan Magenheimer
  2009-12-21 10:44     ` Keir Fraser
  2009-12-21 19:17     ` Steve Ofsthun
  2 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2009-12-21  9:55 UTC (permalink / raw)
  To: Mukesh Rathor
  Cc: jeremy, Xen-devel@lists.xensource.com, Kurt Hackel, Dan Magenheimer,
	Keir Fraser

>>> Mukesh Rathor <mukesh.rathor@oracle.com> 19.12.09 05:43 >>>
>if first ever timer interrupt comes after start_jiffies is set, dom0 boot 
>may hang if delta in timer_interrupt() is so huge that it causes jiffies 
>to wrap. It appears delta is very large when memory is more than 512GB on
>certain boxes causing wrap around.
>
>why is delta in dom0->timer_interrupt() related to memory on system? 
>Because hyp creates dom0, then page scrubs, then unpauses vcpu. so it
>appears lot of page scrubbing results in huge delta on first tick.

Based on prior analysis of similar problems, I'm not convinced this is the
right solution: Kernel code should not need changing here. Instead, I'd
recommend trying to insert a call to process_pending_timers() every so
many pages scrubbed (just like is e.g. being done in the P2M/M2P table
population code).

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21  9:55     ` Jan Beulich
@ 2009-12-21 18:20       ` Dan Magenheimer
  2009-12-21 19:07         ` Keir Fraser
  2009-12-22  8:51         ` Jan Beulich
  0 siblings, 2 replies; 51+ messages in thread
From: Dan Magenheimer @ 2009-12-21 18:20 UTC (permalink / raw)
  To: Jan Beulich, mukesh.rathor; +Cc: kurt.hackel, jeremy, Xen-devel, Keir Fraser

> From: Jan Beulich [mailto:JBeulich@novell.com]
> 
> >>> Mukesh Rathor <mukesh.rathor@oracle.com> 19.12.09 05:43 >>>
> >if first ever timer interrupt comes after start_jiffies is 
> set, dom0 boot 
> >may hang if delta in timer_interrupt() is so huge that it 
> causes jiffies 
> >to wrap. It appears delta is very large when memory is more 
> than 512GB on
> >certain boxes causing wrap around.
> >
> >why is delta in dom0->timer_interrupt() related to memory on system? 
> >Because hyp creates dom0, then page scrubs, then unpauses vcpu. so it
> >appears lot of page scrubbing results in huge delta on first tick.
> 
> Based on prior analysis of similar problems, I'm not 
> convinced this is the
> right solution: Kernel code should not need changing here. 
> Instead, I'd
> recommend trying to insert a call to process_pending_timers() every so
> many pages scrubbed (just like is e.g. being done in the P2M/M2P table
> population code).

Mukesh has dug into this a lot deeper than I, but I think
process_pending_timers() is irrelevant here.  When dom0
is constructed, its data space is initialized in memory
and jiffies has been initialized in the data section with
a fixed value of -300 * HZ.  At this point, dom0 lives in
memory but has not executed a single instruction, so is
not capable of receiving any interrupts.  I *think* Xen
also initializes a clocksource (pvclock?) here.

Then scrub_heap_pages() occurs which eats up a lot of time.

THEN dom0 is started and receives a timer interrupt and,
I guess, the clocksource code updates jiffies based on
the time elapsed and, since jiffies is unsigned, it
wraps around.

So (admitting I don't understand this fully), I think the
problem is that the kernel has hardcoded into it that it's
impossible for 300 seconds to expire between the time it
is put in memory and the time the first interrupt occurs.
That seems like a kernel bug to me, maybe in the pvclock
code, but still in the kernel.

Not to say the problem can't or shouldn't be fixed in Xen.
Keir, would bad things happen if construct_dom0 is done after
scrub_heap_pages()?  Other than some time wastage because
dom0's memory would get scrubbed just before it gets
overwritten (which is admittedly a much bigger problem
when dom0_mem is not specified in the Xen boot line
on a machine with ginormous memory).

Thanks,
Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 18:20       ` Dan Magenheimer
@ 2009-12-21 19:07         ` Keir Fraser
  2009-12-21 19:52           ` Mukesh Rathor
  2009-12-22  8:51         ` Jan Beulich
  1 sibling, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2009-12-21 19:07 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com

On 21/12/2009 18:20, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Not to say the problem can't or shouldn't be fixed in Xen.
> Keir, would bad things happen if construct_dom0 is done after
> scrub_heap_pages()?  Other than some time wastage because
> dom0's memory would get scrubbed just before it gets
> overwritten (which is admittedly a much bigger problem
> when dom0_mem is not specified in the Xen boot line
> on a machine with ginormous memory).

The problem is more likely that Xen system time started ticking some time
earlier during boot process. I doubt it is to do with ordering of
construct_dom0 versus boot-time scrubbing.

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 19:07         ` Keir Fraser
@ 2009-12-21 19:52           ` Mukesh Rathor
  2009-12-21 19:55             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-21 19:52 UTC (permalink / raw)
  To: Keir Fraser
  Cc: kurt.hackel@oracle.com, Dan Magenheimer,
	Xen-devel@lists.xensource.com, Jeremy Fitzhardinge, Jan Beulich

On Mon, 21 Dec 2009 19:07:39 +0000
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> On 21/12/2009 18:20, "Dan Magenheimer" <dan.magenheimer@oracle.com>
> wrote:
> 
> > Not to say the problem can't or shouldn't be fixed in Xen.
> > Keir, would bad things happen if construct_dom0 is done after
> > scrub_heap_pages()?  Other than some time wastage because
> > dom0's memory would get scrubbed just before it gets
> > overwritten (which is admittedly a much bigger problem
> > when dom0_mem is not specified in the Xen boot line
> > on a machine with ginormous memory).
> 
> The problem is more likely that Xen system time started ticking some
> time earlier during boot process. I doubt it is to do with ordering of
> construct_dom0 versus boot-time scrubbing.
> 
>  -- Keir
> 

The problem is exactly how Dan described it. 'delta' for first interrupt
in dom0->timer_interrupt() goes up proportionately with amount of memory
on system. On this box, it appears more than 600GB causes delta to be 
large enough to wrap jiffies.

 1TB delta: 940b7d68a4
32GB delta: 02ae56eadb

xen->send_guest_vcpu_virq() ----> dom0->handle_IRQ() -> timer_interrupt()

timer_interrupt will call do_timer delta/NS_PER_TICK number of times.
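
As a rough check of the two delta values above, the sketch below converts
them to ticks. It assumes that delta is in nanoseconds, that HZ is 250
(inferred from INITIAL_JIFFIES == 0xfffedb08, which equals -300 * 250 as a
32-bit value), and that jiffies is 32 bits wide. The helper names are made
up for illustration and are not kernel functions.

```c
#include <stdint.h>

/* Assumptions, not taken from the kernel source quoted in this thread. */
#define HZ                 250u                    /* inferred from 0xfffedb08 */
#define NS_PER_TICK        (1000000000u / HZ)      /* 4,000,000 ns per tick */
#define INITIAL_JIFFIES_32 ((uint32_t)(-300 * (int32_t)HZ))

/* Number of whole ticks the interrupt handler would account for. */
static uint64_t delta_to_ticks(uint64_t delta_ns)
{
    return delta_ns / NS_PER_TICK;
}

/* Nonzero if that many ticks push a 32-bit jiffies past 0xffffffff. */
static int jiffies_would_wrap(uint64_t delta_ns)
{
    return delta_to_ticks(delta_ns) >= (uint64_t)300 * HZ;
}

/* Value of a 32-bit jiffies after all ticks have been applied at once. */
static uint32_t jiffies_after(uint64_t delta_ns)
{
    return (uint32_t)(INITIAL_JIFFIES_32 + delta_to_ticks(delta_ns));
}
```

Under these assumptions the 1TB delta is about 636 seconds (158961 ticks),
which is past the 300-second headroom, while the 32GB delta is about 11.5
seconds (2878 ticks) and is not. The result of jiffies_after() need not
match the jiffies value printed in the original report, since the thread
does not say when that value was sampled.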

Linux initializes jiffies to -5 minutes so that problems caused by jiffies
wrapping show up early. But as Dan said, on bare metal
calibrate_delay_direct() starts running right away and is guaranteed to
finish in less than 5 minutes. We could keep that assumption true by moving
the page scrub before xen->construct_dom0(), in which case the first timer
interrupt in dom0 would come in a lot sooner, or just fix the loop to
account for the wrap. Since jiffies is just the lower 32 bits of
jiffies_64, and get_jiffies_64() is provided for reading the 64-bit
version, I simply use that.
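
The difference between comparing the 32-bit view and the 64-bit counter
can be sketched in user space as follows. This is a model, not the kernel
code: it assumes a 32-bit dom0 in which the 64-bit counter starts at the
zero-extended value 0xfffedb08, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed starting value of the 64-bit counter (zero-extended). */
#define INITIAL_JIFFIES_64 0x00000000fffedb08ull

/* The 32-bit "jiffies" view is the low half of the 64-bit counter. */
static uint32_t low32(uint64_t j64)
{
    return (uint32_t)j64;
}

/* Loop condition evaluated on the 32-bit view, as in the unpatched loop. */
static int keep_spinning_32(uint64_t now64, uint64_t start64, uint32_t tick_divider)
{
    return low32(now64) <= (uint32_t)(low32(start64) + tick_divider);
}

/* Loop condition evaluated on the full 64-bit counter, as in the patch. */
static int keep_spinning_64(uint64_t now64, uint64_t start64, uint32_t tick_divider)
{
    return now64 <= start64 + tick_divider;
}
```

With start64 = INITIAL_JIFFIES_64 and a jump of more than 75000 ticks, the
32-bit condition stays true because the low half has wrapped to a small
number, while the 64-bit condition turns false as intended.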

Thanks,
Mukesh

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 19:52           ` Mukesh Rathor
@ 2009-12-21 19:55             ` Jeremy Fitzhardinge
  2009-12-21 22:47               ` Mukesh Rathor
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2009-12-21 19:55 UTC (permalink / raw)
  To: Mukesh Rathor
  Cc: kurt.hackel@oracle.com, Dan Magenheimer,
	Xen-devel@lists.xensource.com, Keir Fraser, Jan Beulich

On 12/21/2009 11:52 AM, Mukesh Rathor wrote:
> On Mon, 21 Dec 2009 19:07:39 +0000
> Keir Fraser<keir.fraser@eu.citrix.com>  wrote:
>
>    
>> On 21/12/2009 18:20, "Dan Magenheimer"<dan.magenheimer@oracle.com>
>> wrote:
>>
>>      
>>> Not to say the problem can't or shouldn't be fixed in Xen.
>>> Keir, would bad things happen if construct_dom0 is done after
>>> scrub_heap_pages()?  Other than some time wastage because
>>> dom0's memory would get scrubbed just before it gets
>>> overwritten (which is admittedly a much bigger problem
>>> when dom0_mem is not specified in the Xen boot line
>>> on a machine with ginormous memory).
>>>        
>> The problem is more likely that Xen system time started ticking some
>> time earlier during boot process. I doubt it is to do with ordering of
>> construct_dom0 versus boot-time scrubbing.
>>
>>   -- Keir
>>
>>      
> The problem is exactly how Dan described it. 'delta' for first interrupt
> in dom0->timer_interrupt() goes up proportionately with amount of memory
> on system. On this box, it appears more than 600GB causes delta to be
> large enough to wrap jiffies.
>
>   1TB delta: 940b7d68a4
> 32GB delta: 02ae56eadb
>
> xen->send_guest_vcpu_virq() ---->  dom0->handle_IRQ() ->  timer_interrupt()
>
> timer_interrupt will call do_timer delta/NS_PER_TICK number of times.
>    

How is it computing that delta?

Anyway, I'm not at all sure this will apply to a pvops dom0 kernel as it 
does timekeeping quite differently from 2.6.18-xen.

     J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 19:55             ` Jeremy Fitzhardinge
@ 2009-12-21 22:47               ` Mukesh Rathor
  2009-12-21 23:13                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-21 22:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kurt.hackel@oracle.com, Dan Magenheimer,
	Xen-devel@lists.xensource.com, Keir Fraser, Jan Beulich

On Mon, 21 Dec 2009 11:55:09 -0800
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> On 12/21/2009 11:52 AM, Mukesh Rathor wrote:
> > On Mon, 21 Dec 2009 19:07:39 +0000
> > Keir Fraser<keir.fraser@eu.citrix.com>  wrote:
> >
> >    
> >> On 21/12/2009 18:20, "Dan Magenheimer"<dan.magenheimer@oracle.com>
> >> wrote:
> >>
> >>      
> >>> Not to say the problem can't or shouldn't be fixed in Xen.
> >>> Keir, would bad things happen if construct_dom0 is done after
> >>> scrub_heap_pages()?  Other than some time wastage because
> >>> dom0's memory would get scrubbed just before it gets
> >>> overwritten (which is admittedly a much bigger problem
> >>> when dom0_mem is not specified in the Xen boot line
> >>> on a machine with ginormous memory).
> >>>        
> >> The problem is more likely that Xen system time started ticking
> >> some time earlier during boot process. I doubt it is to do with
> >> ordering of construct_dom0 versus boot-time scrubbing.
> >>
> >>   -- Keir
> >>
> >>      
> > The problem is exactly how Dan described it. 'delta' for first
> > interrupt in dom0->timer_interrupt() goes up proportionately with
> > amount of memory on system. On this box, it appears more than 600GB
> > causes delta to be large enough to wrap jiffies.
> >
> >   1TB delta: 940b7d68a4
> > 32GB delta: 02ae56eadb
> >
> > xen->send_guest_vcpu_virq() ---->  dom0->handle_IRQ() ->
> > timer_interrupt()
> >
> > timer_interrupt will call do_timer delta/NS_PER_TICK number of
> > times. 
> 
> How is it computing that delta?
> 
> Anyway, I'm not at all sure this will apply to a pvops dom0 kernel as
> it does timekeeping quite differently from 2.6.18-xen.
> 
>      J

delta comes from:
 
timer_interrupt() in time-xen.c:
...
        do {
                get_time_values_from_xen(cpu);

                /* Obtain a consistent snapshot of elapsed wallclock cycles. */
      --->      delta = delta_cpu =
                        shadow->system_timestamp + get_nsec_offset(shadow);
      --->      delta     -= processed_system_time;
                delta_cpu -= per_cpu(processed_system_time, cpu);

                /*
                 * Obtain a consistent snapshot of stolen/blocked cycles. We
                 * can use state_entry_time to detect if we get preempted here.
                 */
                do {
                        sched_time = runstate->state_entry_time;
                        barrier();
                        stolen = runstate->time[RUNSTATE_runnable] +
                                runstate->time[RUNSTATE_offline] -
                                per_cpu(processed_stolen_time, cpu);
                        blocked = runstate->time[RUNSTATE_blocked] -
                                per_cpu(processed_blocked_time, cpu);
                        barrier();
                } while (sched_time != runstate->state_entry_time);
        } while (!time_values_up_to_date(cpu));
...


At first glance, I don't understand the above algorithm. Since you have
the same code, I assumed you could also compute delta to be a large
value when dom0 starts, in which case you may observe a dom0 hang.


thanks,
Mukesh

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 22:47               ` Mukesh Rathor
@ 2009-12-21 23:13                 ` Jeremy Fitzhardinge
  2009-12-21 23:57                   ` Dan Magenheimer
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2009-12-21 23:13 UTC (permalink / raw)
  To: Mukesh Rathor
  Cc: kurt.hackel@oracle.com, Dan Magenheimer,
	Xen-devel@lists.xensource.com, Keir Fraser, Jan Beulich

On 12/21/2009 02:47 PM, Mukesh Rathor wrote:
> delta comes from:
>
> timer_interrupt() in time-xen.c:
> ...
>          do {
>                  get_time_values_from_xen(cpu);
>
>                  /* Obtain a consistent snapshot of elapsed wallclock cycles. */
>        --->       delta = delta_cpu =
>                          shadow->system_timestamp + get_nsec_offset(shadow);
>        --->       delta     -= processed_system_time;
>                  delta_cpu -= per_cpu(processed_system_time, cpu);
>
>                  /*
>                   * Obtain a consistent snapshot of stolen/blocked cycles. We
>                   * can use state_entry_time to detect if we get preempted here.
>                   */
>                  do {
>                          sched_time = runstate->state_entry_time;
>                          barrier();
>                          stolen = runstate->time[RUNSTATE_runnable] +
>                                  runstate->time[RUNSTATE_offline] -
>                                  per_cpu(processed_stolen_time, cpu);
>                          blocked = runstate->time[RUNSTATE_blocked] -
>                                  per_cpu(processed_blocked_time, cpu);
>                          barrier();
>                  } while (sched_time != runstate->state_entry_time);
>          } while (!time_values_up_to_date(cpu));
> ...
>
>
> At first glance, i don't understand the above algorithm. Since you've
> the same code, I assumed you could also compute delta to be a large
> value when dom0 starts, in which case you may observe dom0 hang.
>    

There's some code in the pvops kernel which looks vaguely like that, but 
it has nothing to do with timer interrupts.  Could you be more specific 
about what you're referring to?

     J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 23:13                 ` Jeremy Fitzhardinge
@ 2009-12-21 23:57                   ` Dan Magenheimer
  2009-12-22  4:31                     ` Mukesh Rathor
  0 siblings, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2009-12-21 23:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, mukesh.rathor
  Cc: kurt.hackel, Xen-devel, Keir Fraser, Jan Beulich

> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
>
> There's some code in the pvops kernel which looks vaguely 
> like that, but 
> it has nothing to do with timer interrupts.  Could you be 
> more specific 
> about what you're referring to?

I spent some time rooting through the 2.6.32 code and
ended up with my head spinning.  I think the bottom line
is if there is code that may cause jiffies to increment
by a large amount from a single "tick" delivered by
Xen, it's likely the same problem can occur in 2.6.32
dom0 when running on Xen.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 23:57                   ` Dan Magenheimer
@ 2009-12-22  4:31                     ` Mukesh Rathor
  0 siblings, 0 replies; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-22  4:31 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Xen-devel, kurt.hackel, Jan Beulich,
	Keir Fraser

On Mon, 21 Dec 2009 15:57:33 -0800 (PST)
Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

> > From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> >
> > There's some code in the pvops kernel which looks vaguely 
> > like that, but 
> > it has nothing to do with timer interrupts.  Could you be 
> > more specific 
> > about what you're referring to?
> 
> I spent some time rooting through the 2.6.32 code and
> ended up with my head spinning.  I think the bottom line
> is if there is code that may cause jiffies to increment
> by a large amount from a single "tick" delivered by
> Xen, it's likely the same problem can occur in 2.6.32
> dom0 when running on Xen.

Right. Your calibrate_delay_direct() is the same, so in there:

                start_jiffies = jiffies;
                ... timer interrupt....
                while (jiffies <= (start_jiffies + tick_divider)) {
                        pre_start = start;
                        read_current_timer(&start);
                }


If a timer tick comes in after start_jiffies is set and, upon returning
from the timer interrupt, the while loop finds that jiffies has wrapped,
it will hang.
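
To make the hang concrete, here is a small user-space model of the loop
condition. It assumes a 32-bit jiffies and a tick_divider of 1, and the
two helper functions are hypothetical names used only for this sketch.

```c
#include <stdint.h>

/* Assumption: 32-bit dom0, so jiffies is modelled as uint32_t. */
#define INITIAL_JIFFIES_32 0xfffedb08u  /* value quoted earlier in the thread */

/* The plain "<=" loop condition, evaluated on 32-bit values. */
static int naive_keep_spinning(uint32_t now, uint32_t start, uint32_t tick_divider)
{
    return now <= (uint32_t)(start + tick_divider);
}

/* Ticks that must still elapse before the plain condition turns false. */
static uint32_t ticks_until_exit(uint32_t now, uint32_t start, uint32_t tick_divider)
{
    uint32_t limit = (uint32_t)(start + tick_divider);

    if (!naive_keep_spinning(now, start, tick_divider))
        return 0u;
    return (limit - now) + 1u;
}
```

In the normal case (now == start) the loop needs two more ticks to exit.
Once jiffies has wrapped to a small value such as 0x00009f8b, the loop
needs almost a full 2^32 ticks, which is why it looks like a hang.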

I was looking at the wrong "jeremy's pvops tree", but now that I am
looking at the correct one, I see that your timer_interrupt() is pretty
different. So if you believe that you could also increment jiffies by more
than one in timer_interrupt(), you should consider my new patch when I
submit it. I'm testing it right now.

thanks,
mukesh

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 18:20       ` Dan Magenheimer
  2009-12-21 19:07         ` Keir Fraser
@ 2009-12-22  8:51         ` Jan Beulich
  2009-12-22 10:20           ` Keir Fraser
  1 sibling, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2009-12-22  8:51 UTC (permalink / raw)
  To: Dan Magenheimer, mukesh.rathor
  Cc: kurt.hackel, jeremy, Xen-devel, Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 21.12.09 19:20 >>>
> From: Jan Beulich [mailto:JBeulich@novell.com] 
>> Based on prior analysis of similar problems, I'm not 
>> convinced this is the
>> right solution: Kernel code should not need changing here. 
>> Instead, I'd
>> recommend trying to insert a call to process_pending_timers() every so
>> many pages scrubbed (just like is e.g. being done in the P2M/M2P table
>> population code).
>
>Mukesh has dug into this a lot deeper than I, but I think
>process_pending_timers() is irrelevant here.  When dom0

Why would this be any different than a lot of time being consumed
populating large p2m/m2p tables? All this happens when Dom0 already
exists, but isn't running yet.

>is constructed, its data space is initialized in memory
>and jiffies has been initialized in the data section with
>a fixed value of -300 * HZ.  At this point, dom0 lives in
>memory but has not executed a single instruction, so is
>not capable of receiving any interrupts.  I *think* Xen
>also initializes a clocksource (pvclock?) here.

... and updates it each time local_time_calibration() is run, which is
the missing piece (process_pending_timers() causes
time_calibration() to run as needed, in turn causing
TIME_CALIBRATE_SOFTIRQ to be raised as needed [and run the
latest immediately before Dom0 gets passed control], in turn
causing local_time_calibration() to run, updating dom0:vcpu0's
system time).

>Then scrub_heap_pages() occurs which eats up a lot of time.

... and confuses Xen's own time keeping (because, depending on
the platform timer used and its wrap-around interval, a wrap may
be missed if process_pending_timers() isn't being executed
frequently enough).
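
The general failure mode described here (a narrow hardware counter whose
wraps are only noticed when software polls it) can be sketched as below.
This is a generic model and not Xen's implementation; the struct, the
function names and the 24-bit width used in the usage note are all
assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical software extension of a counter that is only `bits` wide. */
struct ext_counter {
    uint32_t mask;      /* (1 << bits) - 1 */
    uint32_t last_raw;  /* raw reading seen at the previous poll */
    uint64_t total;     /* accumulated ticks since ext_init() */
};

static void ext_init(struct ext_counter *c, unsigned bits, uint32_t raw)
{
    c->mask = (bits >= 32) ? 0xffffffffu : ((1u << bits) - 1u);
    c->last_raw = raw & c->mask;
    c->total = 0;
}

/*
 * Fold a fresh raw reading into the 64-bit total. The result is correct
 * only if less than one full wrap elapsed since the previous poll;
 * otherwise every missed wrap is silently lost.
 */
static uint64_t ext_poll(struct ext_counter *c, uint32_t raw)
{
    raw &= c->mask;
    c->total += (raw - c->last_raw) & c->mask;
    c->last_raw = raw;
    return c->total;
}
```

With a 24-bit counter, polling every 2^23 ticks accounts for all of the
elapsed time, whereas a single poll after 3 * 2^23 ticks reports only 2^23
because one whole wrap went unobserved.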

But from the other mail regarding this subject I conclude that this
suggestion wasn't even tried, despite me knowing that it fixed
similar problems on 1TB systems. And be assured, I spent hours (if
not days) analyzing the problem until I finally understood that this
is entirely unrelated to the kernel.

>THEN dom0 is started and receives a timer interrupt and,
>I guess, the clocksource code updates jiffies based on
>the time elapsed and, since jiffies is unsigned, it
>wraps around.
>
>So (admitting I don't understand this fully), I think the
>problem is that the kernel has hardcoded into it that it's
>impossible for 300 seconds to expire between the time it
>is put in memory and the time the first interrupt occurs.
>That seems like a kernel bug to me, maybe in the pvclock
>code, but still in the kernel.

No, the time the kernel gets put in memory doesn't matter at all.
Counting starts when the kernel starts initializing its time
subsystem, and with timer interrupts being disabled initially I
can't even see how multiple of them could pile up.

>Not to say the problem can't or shouldn't be fixed in Xen.
>Keir, would bad things happen if construct_dom0 is done after
>scrub_heap_pages()?  Other than some time wastage because
>dom0's memory would get scrubbed just before it gets
>overwritten (which is admittedly a much bigger problem
>when dom0_mem is not specified in the Xen boot line
>on a machine with ginormous memory).

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22  8:51         ` Jan Beulich
@ 2009-12-22 10:20           ` Keir Fraser
  2009-12-22 11:10             ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2009-12-22 10:20 UTC (permalink / raw)
  To: Jan Beulich, Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com

On 22/12/2009 08:51, "Jan Beulich" <JBeulich@novell.com> wrote:

>> Then scrub_heap_pages() occurs which eats up a lot of time.
> 
> ... and confuses Xen's own time keeping (because, depending on
> the platform timer used and its wrap-around interval, a wrap may
> be missed if process_pending_timers() isn't being executed
> frequently enough).

process_pending_timers() has been called on every iteration of the scrub
loop for as long as I can remember. I believe it was even you who added it.

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 10:20           ` Keir Fraser
@ 2009-12-22 11:10             ` Jan Beulich
  2009-12-22 13:35               ` Keir Fraser
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2009-12-22 11:10 UTC (permalink / raw)
  To: Keir Fraser, Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.12.09 11:20 >>>
>On 22/12/2009 08:51, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>>> Then scrub_heap_pages() occurs which eats up a lot of time.
>> 
>> ... and confuses Xen's own time keeping (because, depending on
>> the platform timer used and its wrap-around interval, a wrap may
>> be missed if process_pending_timers() isn't being executed
>> frequently enough).
>
>Process_pending_timers() has been called on every iteration of the scrub
>loop for as long as I can remember. I believe it was even you who added it.

Should I have overlooked it? Indeed, I did (I looked at the end of the
loop, while it's sitting at the beginning). I'm really sorry for the noise
then.

Nevertheless I remain convinced that the problem ought not to be fixed
by a kernel change (and even less by one that modifies Xen-unspecific
code). Any patch to this effect, unless I should be convinced otherwise,
has my explicit up-front NAK (in case this counts for anything).

And then it should be possible to simulate the problem quite easily on
a system with much less memory, by slowing down the scrub loop
artificially. If I find time before the holiday break I'll try to do that and
see if I can convince myself otherwise (as per above).

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 11:10             ` Jan Beulich
@ 2009-12-22 13:35               ` Keir Fraser
  2009-12-22 14:17                 ` Jan Beulich
  2009-12-22 16:33                 ` Jan Beulich
  0 siblings, 2 replies; 51+ messages in thread
From: Keir Fraser @ 2009-12-22 13:35 UTC (permalink / raw)
  To: Jan Beulich, Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com

On 22/12/2009 11:10, "Jan Beulich" <JBeulich@novell.com> wrote:

> Nevertheless I remain convinced that the problem ought not to be fixed
> by a kernel change (and even less by one that modifies Xen-unspecific
> code). Any patch to this effect, unless I should be convinced otherwise,
> has my explicit up front NAK (in case this counts anything).

Well, I must say the kernel patch looked quite sensible to me. If no other
reason than reinforcing the fact that jiffy values should always be compared
using the provided macros. But I'm happy to have a hypervisor patch as well,
if we can work out what it should be. I'm still unclear on the reason why
slow page scrubbing causes this problem - Oracle's explanation hasn't
convinced me yet.
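
For reference, the idea behind the jiffy comparison macros can be modelled
in user space like this. The functions below are simplified 32-bit
stand-ins written for this note, not the definitions from
<linux/jiffies.h>, and they rely on the usual two's-complement behaviour
when converting to int32_t.

```c
#include <stdint.h>

/*
 * Nonzero if a is later than b on a 32-bit circle. Meaningful while the
 * two values are less than 2^31 ticks apart.
 */
static int time_after32(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(b - a) < 0;
}

/* Nonzero if a is at or before b, using the same signed-difference trick. */
static int time_before_eq32(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(b - a) >= 0;
}
```

A loop written as "while (time_before_eq32(now, start + tick_divider))"
exits once now has moved past the limit, even if now wrapped through zero
on the way, as long as the jump is smaller than 2^31 ticks.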

> And then it should be possible to simulate the problem quite easily on
> a system with much less memory, by slowing down the scrub loop
> artificially. If I find time before the holiday break I'll try to do that and
> see if I can convince myself otherwise (as per above).

That would be helpful, thanks. I'm particularly intrigued by how this could
be seen for dom0 but not be a similar or worse issue for domU.

 -- Keir

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 13:35               ` Keir Fraser
@ 2009-12-22 14:17                 ` Jan Beulich
  2009-12-22 14:23                   ` Jan Beulich
  2009-12-22 16:33                 ` Jan Beulich
  1 sibling, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2009-12-22 14:17 UTC (permalink / raw)
  To: Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com, Keir Fraser

>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.12.09 14:35 >>>
>On 22/12/2009 11:10, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>> Nevertheless I remain convinced that the problem ought not to be fixed
>> by a kernel change (and even less by one that modifies Xen-unspecific
>> code). Any patch to this effect, unless I should be convinced otherwise,
>> has my explicit up front NAK (in case this counts anything).
>
>Well, I must say the kernel patch looked quite sensible to me. If no other
>reason than reinforcing the fact that jiffy values should always be compared
>using the provided macros. But I'm happy to have a hypervisor patch as well,
>if we can work out what it should be. I'm still unclear on the reason why
>slow page scrubbing causes this problem - Oracle's explanation hasn't
>convinced me yet.

There's another thing that seems inconsistent with this report: jiffies
itself as well as all the arithmetic in calibrate_delay_direct() is using
"unsigned long", so is being done 64-bit on x86-64 (which we're
talking about here). Hence I can see even less how an overflow could
have happened, or how using explicit 64-bit types (or get_jiffies_64())
here can help.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 14:17                 ` Jan Beulich
@ 2009-12-22 14:23                   ` Jan Beulich
  2009-12-22 15:19                     ` Keir Fraser
  2009-12-22 15:30                     ` Dan Magenheimer
  0 siblings, 2 replies; 51+ messages in thread
From: Jan Beulich @ 2009-12-22 14:23 UTC (permalink / raw)
  To: Jan Beulich, Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com, Keir Fraser

>>> "Jan Beulich" <JBeulich@novell.com> 22.12.09 15:17 >>>
>There's another thing that seems inconsistent with this report: jiffies
>itself as well as all the arithmetic in calibrate_delay_direct() is using
>"unsigned long", so is being done 64-bit on x86-64 (which we're
>talking about here). Hence I can see even less how an overflow could
>have happened, or how using explicit 64-bit types (or get_jiffies_64())
>here can help.

Oh, or are we talking about 32-bit Dom0 on 64-bit Xen here? I don't
recall this having been mentioned anywhere, but maybe I just
overlooked it.

Jan


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 14:23                   ` Jan Beulich
@ 2009-12-22 15:19                     ` Keir Fraser
  2009-12-22 15:30                     ` Dan Magenheimer
  1 sibling, 0 replies; 51+ messages in thread
From: Keir Fraser @ 2009-12-22 15:19 UTC (permalink / raw)
  To: Jan Beulich, Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com

On 22/12/2009 14:23, "Jan Beulich" <JBeulich@novell.com> wrote:

>>>> "Jan Beulich" <JBeulich@novell.com> 22.12.09 15:17 >>>
>> There's another thing that seems inconsistent with this report: jiffies
>> itself as well as all the arithmetic in calibrate_delay_direct() is using
>> "unsigned long", so is being done 64-bit on x86-64 (which we're
>> talking about here). Hence I can see even less how an overflow could
>> have happened, or how using explicit 64-bit types (or get_jiffies_64())
>> here can help.
> 
> Oh, or are we talking about 32-bit Dom0 on 64-bit Xen here? I don't
> recall this having been mentioned anywhere, but maybe I just
> overlooked it.

I'd assumed this must be the case. As you say, the issue couldn't happen as
described on 64-bit Linux.

 -- Keir


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 14:23                   ` Jan Beulich
  2009-12-22 15:19                     ` Keir Fraser
@ 2009-12-22 15:30                     ` Dan Magenheimer
  2009-12-22 15:36                       ` Jan Beulich
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2009-12-22 15:30 UTC (permalink / raw)
  To: Jan Beulich, mukesh.rathor
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-devel, Keir Fraser

> >>> "Jan Beulich" <JBeulich@novell.com> 22.12.09 15:17 >>>
> >There's another thing that seems inconsistent with this 
> report: jiffies
> >itself as well as all the arithmetic in 
> calibrate_delay_direct() is using
> >"unsigned long", so is being done 64-bit on x86-64 (which we're
> >talking about here). Hence I can see even less how an overflow could
> >have happened, or how using explicit 64-bit types (or 
> get_jiffies_64())
> >here can help.
> 
> Oh, or are we talking about 32-bit Dom0 on 64-bit Xen here? I don't
> recall this having been mentioned anywhere, but maybe I just
> overlooked it.

Mukesh's work has been on a 32-bit dom0.


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 15:30                     ` Dan Magenheimer
@ 2009-12-22 15:36                       ` Jan Beulich
  2009-12-22 16:05                         ` Dan Magenheimer
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2009-12-22 15:36 UTC (permalink / raw)
  To: Dan Magenheimer, mukesh.rathor
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-devel, Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.12.09 16:30 >>>
>Mukesh's work has been on a 32-bit dom0.

Which seems quite odd a combination - 1TB of memory, but a 32-bit
Dom0 -, which is why initially I didn't even consider the possibility. I'm
afraid you're asking for more trouble with this.

Jan


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 15:36                       ` Jan Beulich
@ 2009-12-22 16:05                         ` Dan Magenheimer
  2009-12-22 17:02                           ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Dan Magenheimer @ 2009-12-22 16:05 UTC (permalink / raw)
  To: Jan Beulich, mukesh.rathor
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-devel, Keir Fraser

> From: Jan Beulich [mailto:JBeulich@novell.com]
> 
> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.12.09 16:30 >>>
> >Mukesh's work has been on a 32-bit dom0.
> 
> Which seems quite odd a combination - 1TB of memory, but a 32-bit
> Dom0 -, which is why initially I didn't even consider the 
> possibility. I'm
> afraid you're asking for more trouble with this.

Indeed.  Oracle expects to move to a 64-bit dom0 in a "future"
release, but we have to make the 32-bit dom0 work until then.

Our default configuration specifies dom0_mem=xxx which I would
assume would eliminate most or all of the "trouble" you are
referring to, but, Mukesh, could you confirm what dom0_mem
is set to?


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 16:05                         ` Dan Magenheimer
@ 2009-12-22 17:02                           ` Jan Beulich
  2009-12-22 18:03                             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2009-12-22 17:02 UTC (permalink / raw)
  To: Dan Magenheimer, mukesh.rathor
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-devel, Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.12.09 17:05 >>>
>> From: Jan Beulich [mailto:JBeulich@novell.com] 
>> 
>> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.12.09 16:30 >>>
>> >Mukesh's work has been on a 32-bit dom0.
>> 
>> Which seems quite odd a combination - 1TB of memory, but a 32-bit
>> Dom0 -, which is why initially I didn't even consider the 
>> possibility. I'm
>> afraid you're asking for more trouble with this.
>
>Indeed.  Oracle expects to move to a 64-bit dom0 in a "future"
>release, but we have to make the 32-bit dom0 work until then.
>
>Our default configuration specifies dom0_mem=xxx which I would
>assume would eliminate most or all of the "trouble" you are
>referring to, but, Mukesh, could you confirm what dom0_mem
>is set to?

No, that won't help. I'm referring to things like Dom0 accesses to the
M2P table it sees, which covers nowhere near all of memory. I can't
say whether that can go without problems, but without looking closely
at it I don't think you can assume this would work. Likewise I would
suspect tools issues (if you use the tools from xen-unstable et al),
though I have no precise pointer right now to specific issues.

Jan


* Re: [timer/ticks related] dom0 hang during boot on  large 1TB system
  2009-12-22 17:02                           ` Jan Beulich
@ 2009-12-22 18:03                             ` Jeremy Fitzhardinge
  2010-01-04  8:23                               ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2009-12-22 18:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: kurt.hackel, Dan Magenheimer, Xen-devel, Keir Fraser

On 12/22/2009 09:02 AM, Jan Beulich wrote:
> No, that won't help. I'm referring to things like Dom0 accesses to the
> M2P table it sees, which doesn't cover even nearly all memory. I can't
> say whether that can go without problem, but without closely looking
> at it I don't think you can assume this would work. Likewise I would
> suspect tools issues (if you use the tools from xen-unstable et al),
> though I have no precise pointer right now at specific issues.
>    

32-bit dom0 is the standard use model for Citrix product, and I think 
people tend to run it even with xen-unstable.  It's a fairly well-tested 
combination.

     J


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 18:03                             ` Jeremy Fitzhardinge
@ 2010-01-04  8:23                               ` Jan Beulich
  2010-01-04 22:07                                 ` Dan Magenheimer
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2010-01-04  8:23 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: kurt.hackel, Dan Magenheimer, Xen-devel, Keir Fraser

>>> Jeremy Fitzhardinge <jeremy@goop.org> 22.12.09 19:03 >>>
>On 12/22/2009 09:02 AM, Jan Beulich wrote:
>> No, that won't help. I'm referring to things like Dom0 accesses to the
>> M2P table it sees, which doesn't cover even nearly all memory. I can't
>> say whether that can go without problem, but without closely looking
>> at it I don't think you can assume this would work. Likewise I would
>> suspect tools issues (if you use the tools from xen-unstable et al),
>> though I have no precise pointer right now at specific issues.
>>    
>
>32-bit dom0 is the standard use model for Citrix product, and I think 
>people tend to run it even with xen-unstable.  It's a fairly well-tested 
>combination.

But very unlikely with 1TB of memory, don't you think?

Jan


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2010-01-04  8:23                               ` Jan Beulich
@ 2010-01-04 22:07                                 ` Dan Magenheimer
  2010-01-04 22:21                                   ` Ian Campbell
  2010-01-05  8:33                                   ` Jan Beulich
  0 siblings, 2 replies; 51+ messages in thread
From: Dan Magenheimer @ 2010-01-04 22:07 UTC (permalink / raw)
  To: Jan Beulich, Jeremy Fitzhardinge; +Cc: kurt.hackel, Xen-devel, Keir Fraser

> From: Jan Beulich [mailto:JBeulich@novell.com]
> 
> >>> Jeremy Fitzhardinge <jeremy@goop.org> 22.12.09 19:03 >>>
> >On 12/22/2009 09:02 AM, Jan Beulich wrote:
> >> No, that won't help. I'm referring to things like Dom0 
> accesses to the
> >> M2P table it sees, which doesn't cover even nearly all 
> memory. I can't
> >> say whether that can go without problem, but without 
> closely looking
> >> at it I don't think you can assume this would work. 
> Likewise I would
> >> suspect tools issues (if you use the tools from 
> xen-unstable et al),
> >> though I have no precise pointer right now at specific issues.
> >>    
> >
> >32-bit dom0 is the standard use model for Citrix product, 
> and I think 
> >people tend to run it even with xen-unstable.  It's a fairly 
> well-tested 
> >combination.
> 
> But very unlikely with 1TB of memory, don't you think?
> 
> Jan

Only because machines with 1TB are rare/unlikely.  I can't
speak for the Citrix product but there is NO supported
configuration (yet) of the Oracle VM product with a 64-bit dom0.
In other words, if a customer is using a released Oracle VM
product and the machine on which they are running it has 1TB
of physical memory, they ARE using a 32-bit dom0.

However, Oracle VM always specifies a dom0_mem= Xen boot parameter
(which is always much smaller than 1TB).

If there are known issues with 1TB of memory in this
configuration, we'd like to understand them.  If 1TB with
32-bit dom0 is rife with hidden unresolvable problems,
we'd like to make a clear support statement as to what the
physical memory limit is.

Thanks,
Dan


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2010-01-04 22:07                                 ` Dan Magenheimer
@ 2010-01-04 22:21                                   ` Ian Campbell
  2010-01-05  8:33                                   ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Ian Campbell @ 2010-01-04 22:21 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com, Keir Fraser, Jan Beulich

On Mon, 2010-01-04 at 22:07 +0000, Dan Magenheimer wrote: 
> > From: Jan Beulich [mailto:JBeulich@novell.com]
> > 
> > >>> Jeremy Fitzhardinge <jeremy@goop.org> 22.12.09 19:03 >>>
> > >On 12/22/2009 09:02 AM, Jan Beulich wrote:
> > >> No, that won't help. I'm referring to things like Dom0 
> > accesses to the
> > >> M2P table it sees, which doesn't cover even nearly all 
> > memory. I can't
> > >> say whether that can go without problem, but without 
> > closely looking
> > >> at it I don't think you can assume this would work. 
> > Likewise I would
> > >> suspect tools issues (if you use the tools from 
> > xen-unstable et al),
> > >> though I have no precise pointer right now at specific issues.
> > >>    
> > >
> > >32-bit dom0 is the standard use model for Citrix product, 
> > and I think 
> > >people tend to run it even with xen-unstable.  It's a fairly 
> > well-tested 
> > >combination.
> > 
> > But very unlikely with 1TB of memory, don't you think?
> > 
> > Jan
> 
> Only because machines with 1TB are rare/unlikely.  I can't
> speak for the Citrix product but there is NO supported
> configuration (yet) of the Oracle VM product with a 64-bit dom0.
> In other words, if a customer is using a released Oracle VM
> product and the machine on which they are running it has 1TB
> of physical memory, they ARE using a 32-bit dom0.
> 
> However, Oracle VM always specifies a dom0_mem= Xen boot parameter
> (which is always much smaller than 1TB).

So do XenServer and XCP, FWIW.

Ian.


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2010-01-04 22:07                                 ` Dan Magenheimer
  2010-01-04 22:21                                   ` Ian Campbell
@ 2010-01-05  8:33                                   ` Jan Beulich
  2010-01-05 15:46                                     ` Dan Magenheimer
  1 sibling, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2010-01-05  8:33 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-devel, Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 04.01.10 23:07 >>>
>If there are known issues with 1TB of memory in this
>configuration, we'd like to understand them.  If 1TB with
>32-bit dom0 is rife with hidden unresolvable problems,
>we'd like to make a clear support statement as to what the
>physical memory limit is.

I can't say there are known problems, but I'm convinced not everything
can work properly above the boundary of 168G. Nevertheless it is quite
possible that most or all of the normal (not error handling) code paths
work well. Page table walks e.g. during exceptions or kexec would be
problem candidates. And while my knowledge of the tools is rather
limited, libxc also has - iirc - several hard coded assumptions that might
not hold.

What is clear though is that you also depend on the memory distribution
across the (physical) address space: Contiguous (apart from the below-
4G hole) memory will likely pose few problems, but sparse memory
crossing the 44-bit boundary can't work in any case (since MFNs are
represented as 32-bit quantities in 32-bit Dom0).

Jan


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2010-01-05  8:33                                   ` Jan Beulich
@ 2010-01-05 15:46                                     ` Dan Magenheimer
  2010-01-05 15:54                                       ` Ian Campbell
  2010-01-05 16:08                                       ` Jan Beulich
  0 siblings, 2 replies; 51+ messages in thread
From: Dan Magenheimer @ 2010-01-05 15:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-devel, Keir Fraser

> What is clear though is that you also depend on the memory 
> distribution
> across the (physical) address space: Contiguous (apart from the below-
> 4G hole) memory will likely represent little problems, but 
> sparse memory
> crossing the 44-bit boundary can't work in any case (since MFNs are
> represented as 32-bit quantities in 32-bit Dom0).

Urk.  Yes, I had forgotten about the sparse problem.

> I can't say there are known problems, but I'm convinced not everything
> can work properly above the boundary of 168G. Nevertheless it is quite
> possible that most or all of the normal (not error handling) 
> code paths
> work well. Page table walks e.g. during exceptions or kexec would be
> problem candidates. And while my knowledge of the tools is rather
> limited, libxc also has - iirc - several hard coded 
> assumptions that might not hold.

What is special about 168GB?  Or is that a typo?  (And if it
is supposed to be 128GB, what is special about 128GB?)

Thanks,
Dan


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2010-01-05 15:46                                     ` Dan Magenheimer
@ 2010-01-05 15:54                                       ` Ian Campbell
  2010-01-05 16:08                                       ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Ian Campbell @ 2010-01-05 15:54 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com, Keir Fraser, Jan Beulich

On Tue, 2010-01-05 at 15:46 +0000, Dan Magenheimer wrote:
> > What is clear though is that you also depend on the memory 
> > distribution
> > across the (physical) address space: Contiguous (apart from the below-
> > 4G hole) memory will likely represent little problems, but 
> > sparse memory
> > crossing the 44-bit boundary can't work in any case (since MFNs are
> > represented as 32-bit quantities in 32-bit Dom0).
> 
> Urk.  Yes, I had forgotten about the sparse problem.
> 
> > I can't say there are known problems, but I'm convinced not everything
> > can work properly above the boundary of 168G. Nevertheless it is quite
> > possible that most or all of the normal (not error handling) 
> > code paths
> > work well. Page table walks e.g. during exceptions or kexec would be
> > problem candidates. And while my knowledge of the tools is rather
> > limited, libxc also has - iirc - several hard coded 
> > assumptions that might not hold.
> 
> What is special about 168GB?  Or is that a typo?  (And if it
> is supposed to be 128GB, what is special about 128GB?)

It's the size of M2P you can fit into the hypervisor hole of a PAE guest
running on a 64-bit hypervisor. Since the hypervisor no longer needs to
reside in there, it is bigger than with a PAE guest on a PAE hypervisor.

The size of the hypervisor hole is runtime settable for many guests, but
I'm not sure that is plumbed through in the tools, so who knows how well
it works. Increasing the size of the hypervisor hole eats into kernel
low memory though, so you would be trading off maximum per-guest RAM
against maximum host RAM to some degree.

Ian.


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2010-01-05 15:46                                     ` Dan Magenheimer
  2010-01-05 15:54                                       ` Ian Campbell
@ 2010-01-05 16:08                                       ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2010-01-05 16:08 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-devel, Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 05.01.10 16:46 >>>
>What is special about 168GB?  Or is that a typo?  (And if it
>is supposed to be 128GB, what is special about 128GB?)

No, it's not a typo. The maximum hole Xen can reserve for itself is 168M,
and this is what allows it to accommodate the M2P table for 168G.

Jan


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 13:35               ` Keir Fraser
  2009-12-22 14:17                 ` Jan Beulich
@ 2009-12-22 16:33                 ` Jan Beulich
  2009-12-22 16:42                   ` Jan Beulich
  1 sibling, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2009-12-22 16:33 UTC (permalink / raw)
  To: Keir Fraser, Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.12.09 14:35 >>>
>> And then it should be possible to simulate the problem quite easily on
>> a system with much less memory, by slowing down the scrub loop
>> artificially. If I find time before the holiday break I'll try to do that and
>> see if I can convince myself otherwise (as per above).
>
>That would be helpful, thanks. I'm particularly intrigued by how this could
>be seen for dom0 but not be a similar or worse issue for domU.

Simulating the issue indeed went without problems. What I'm seeing is
a Xen problem, though (as expected): Right around when scrubbing
starts there is a run through time_calibration(). During and after the
scrub, however, none happen until Dom0 has proceeded quite a bit into its
initialization (namely, past its delay loop calibration). Only then do
regular (1 second interval) time_calibration() invocations resume.

One other thing that looks irregular at first glance is that the mentioned
very first run through time_calibration() does not seem to result in
running local_time_calibration() on CPU0. One invocation (apparently
independent of time_calibration()) happens right before Dom0 starts
executing.

Jan

(XEN) *** LOADING DOMAIN 0 ***
(XEN)  Xen  kernel: 64-bit, lsb, compat32
(XEN)  Dom0 kernel: 32-bit, PAE, lsb, paddr 0x2000 -> 0x496000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000000229000000->000000022a000000 (1936909 pages to be allocated)
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: 00000000c0002000->00000000c0496000
(XEN)  Init. ramdisk: 00000000c0496000->00000000c0759193
(XEN)  Phys-Mach map: 00000000c075a000->00000000c0ec1834
(XEN)  Start info:    00000000c0ec2000->00000000c0ec24b4
(XEN)  Page tables:   00000000c0ec3000->00000000c0ed1000
(XEN)  Boot stack:    00000000c0ed1000->00000000c0ed2000
(XEN)  TOTAL:         00000000c0000000->00000000c1000000
(XEN)  ENTRY ADDRESS: 00000000c0002000
(XEN) Dom0 has maximum 8 VCPUs
(XEN) tc
(XEN) ltc@4[32767:4]
(XEN) ltc@3[32767:3]
(XEN) ltc@7[32767:7]
(XEN) ltc@6[32767:6]
(XEN) Scrubbing Free RAM: ltc@2[32767:2]
(XEN) ltc@5[32767:5]
(XEN) ltc@1[32767:1]
(XEN) .....done.
(XEN) Xen trace buffers: disabled
(XEN) Std. Loglevel: All
(XEN) Guest Loglevel: All
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input -> Xen (type 'CTRL-q' three times to switch input to DOM0)
(XEN) Freed 160kB init memory.
(XEN) ltc@0[0:0]


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 16:33                 ` Jan Beulich
@ 2009-12-22 16:42                   ` Jan Beulich
  2009-12-22 17:27                     ` Dan Magenheimer
                                       ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Jan Beulich @ 2009-12-22 16:42 UTC (permalink / raw)
  To: Keir Fraser, Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com

>>> "Jan Beulich" <JBeulich@novell.com> 22.12.09 17:33 >>>
>One other thing that looks irregular at first glance is that the mentioned
>very first run through time_calibration() does not seem to result in
>running local_time_calibration() on CPU0. One invocation (apparently
>independent of time_calibration()) happens right before Dom0 starts
>executing.

And that's of course the problem: CPU0's TIME_CALIBRATE_SOFTIRQ can't
get serviced until entry to Dom0, but CPU0 is responsible for re-arming
calibration_timer. Hence there's a gap of calibrations, resulting in an
excessive delta observed during the first timer interrupt in Dom0.

Jan


* RE: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 16:42                   ` Jan Beulich
@ 2009-12-22 17:27                     ` Dan Magenheimer
  2009-12-22 17:48                     ` Keir Fraser
  2009-12-22 18:42                     ` Keir Fraser
  2 siblings, 0 replies; 51+ messages in thread
From: Dan Magenheimer @ 2009-12-22 17:27 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser, mukesh.rathor
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-devel

So, checking my understanding: the underlying problem is that
shadow->tsc_timestamp has essentially stopped while the hardware TSC has
continued moving forward?  Thus in timer_interrupt() (in time-xen.c)
shadow->system_timestamp will be stale, so get_nsec_offset()
returns a large number, resulting in a large delta, which in turn
causes jiffies to be incremented by a large amount.  If the interrupt
happens by coincidence in the middle of the first while loop in
calibrate_delay_direct() (in init/calibrate.c), and the large jiffies
increment happens to be enough to wrap, the while loop will run for weeks.

If this is right, I'm still not clear on how it can be fixed
in Xen.

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: Tuesday, December 22, 2009 9:43 AM
> To: Keir Fraser; Dan Magenheimer; Mukesh Rathor
> Cc: Jeremy Fitzhardinge; Xen-devel@lists.xensource.com; Kurt Hackel
> Subject: Re: [Xen-devel] [timer/ticks related] dom0 hang 
> during boot on
> large 1TB system
> 
> 
> >>> "Jan Beulich" <JBeulich@novell.com> 22.12.09 17:33 >>>
> >One other thing that looks irregular at first glance is that the mentioned
> >very first run through time_calibration() does not seem to result in
> >running local_time_calibration() on CPU0. One invocation (apparently
> >independent of time_calibration()) happens right before Dom0 starts
> >executing.
> 
> And that's of course the problem: CPU0's TIME_CALIBRATE_SOFTIRQ can't
> get serviced until entry to Dom0, but CPU0 is responsible for 
> re-arming
> calibration_timer. Hence there's a gap of calibrations, 
> resulting in an
> excessive delta observed during the first timer interrupt in Dom0.
> 
> Jan
> 
>


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 16:42                   ` Jan Beulich
  2009-12-22 17:27                     ` Dan Magenheimer
@ 2009-12-22 17:48                     ` Keir Fraser
  2009-12-22 18:42                     ` Keir Fraser
  2 siblings, 0 replies; 51+ messages in thread
From: Keir Fraser @ 2009-12-22 17:48 UTC (permalink / raw)
  To: Jan Beulich, Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com

On 22/12/2009 16:42, "Jan Beulich" <JBeulich@novell.com> wrote:

>>>> "Jan Beulich" <JBeulich@novell.com> 22.12.09 17:33 >>>
>> One other thing that looks irregular at first glance is that the mentioned
>> very first run through time_calibration() does not seem to result in
>> running local_time_calibration() on CPU0. One invocation (apparently
>> independent of time_calibration()) happens right before Dom0 starts
>> executing.
> 
> And that's of course the problem: CPU0's TIME_CALIBRATE_SOFTIRQ can't
> get serviced until entry to Dom0, but CPU0 is responsible for re-arming
> calibration_timer. Hence there's a gap of calibrations, resulting in an
> excessive delta observed during the first timer interrupt in Dom0.

Arbitrarily delaying softirq work is probably inherently fragile. All we
have to defer is SCHEDULE_SOFTIRQ as that can preempt the current context.
So I will look into making a patch that changes process_pending_timers() to
process_pending_softirqs().

 Thanks,
 Keir


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 16:42                   ` Jan Beulich
  2009-12-22 17:27                     ` Dan Magenheimer
  2009-12-22 17:48                     ` Keir Fraser
@ 2009-12-22 18:42                     ` Keir Fraser
  2009-12-22 23:00                       ` Mukesh Rathor
  2 siblings, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2009-12-22 18:42 UTC (permalink / raw)
  To: Jan Beulich, Dan Magenheimer, mukesh.rathor@oracle.com
  Cc: kurt.hackel@oracle.com, Jeremy Fitzhardinge,
	Xen-devel@lists.xensource.com

On 22/12/2009 16:42, "Jan Beulich" <JBeulich@novell.com> wrote:

>>>> "Jan Beulich" <JBeulich@novell.com> 22.12.09 17:33 >>>
>> One other thing that looks irregular at first glance is that the mentioned
>> very first run through time_calibration() does not seem to result in
>> running local_time_calibration() on CPU0. One invocation (apparently
>> independent of time_calibration()) happens right before Dom0 starts
>> executing.
> 
> And that's of course the problem: CPU0's TIME_CALIBRATE_SOFTIRQ can't
> get serviced until entry to Dom0, but CPU0 is responsible for re-arming
> calibration_timer. Hence there's a gap of calibrations, resulting in an
> excessive delta observed during the first timer interrupt in Dom0.

Please give xen-unstable:20714 a look. If that fixes the apparent problems I
think it is also a good candidate for backport to 3.4 branch.

 -- Keir


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22 18:42                     ` Keir Fraser
@ 2009-12-22 23:00                       ` Mukesh Rathor
  0 siblings, 0 replies; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-22 23:00 UTC (permalink / raw)
  To: Keir Fraser
  Cc: kurt.hackel@oracle.com, Dan Magenheimer,
	Xen-devel@lists.xensource.com, Jeremy Fitzhardinge, Jan Beulich

On Tue, 22 Dec 2009 18:42:01 +0000
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> On 22/12/2009 16:42, "Jan Beulich" <JBeulich@novell.com> wrote:
> 
> >>>> "Jan Beulich" <JBeulich@novell.com> 22.12.09 17:33 >>>
> >> One other thing that looks irregular at first glance is that the mentioned
> >> very first run through time_calibration() does not seem to result
> >> in running local_time_calibration() on CPU0. One invocation
> >> (apparently independent of time_calibration()) happens right
> >> before Dom0 starts executing.
> > 
> > And that's of course the problem: CPU0's TIME_CALIBRATE_SOFTIRQ
> > can't get serviced until entry to Dom0, but CPU0 is responsible for
> > re-arming calibration_timer. Hence there's a gap of calibrations,
> > resulting in an excessive delta observed during the first timer
> > interrupt in Dom0.
> 
> Please give xen-unstable:20714 a look. If that fixes the apparent
> problems I think it is also a good candidate for backport to 3.4
> branch.
> 
>  -- Keir
> 

Yup, that fixed it. Jiffies now jumps from 0xfffedb08 to 0xfffedb56
as opposed to 0x0000102c or similar.  BTW, my test was limited to
just booting the box. 

I'm glad it got resolved before I leave on holidays for a while.
So many thanks to all.

 Mukesh


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-19  4:43   ` Mukesh Rathor
  2009-12-21  9:55     ` Jan Beulich
@ 2009-12-21 10:44     ` Keir Fraser
  2009-12-21 23:40       ` Mukesh Rathor
  2009-12-21 19:17     ` Steve Ofsthun
  2 siblings, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2009-12-21 10:44 UTC (permalink / raw)
  To: Mukesh Rathor
  Cc: Kurt Hackel, Dan Magenheimer, Xen-devel@lists.xensource.com,
	Jeremy Fitzhardinge

On 19/12/2009 04:43, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:

> Ok, I came up with the following patch. Jeremy, can you please take a
> look also, and comment on my fix since I noticed you've got the same
> issue in your tree. Here's a summary for your benefit:

This patch doesn't apply to http://xenbits.xensource.com/linux-2.6.18-xen.hg
by the way. The code is different there. So I'm dropping this patch as I
have nowhere to put it.

 -- Keir


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 10:44     ` Keir Fraser
@ 2009-12-21 23:40       ` Mukesh Rathor
  2009-12-22  7:35         ` Keir Fraser
  0 siblings, 1 reply; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-21 23:40 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Hackel, Kurt, Xen-devel@lists.xensource.com, Dan Magenheimer,
	Jeremy Fitzhardinge

On Mon, 21 Dec 2009 10:44:51 +0000
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> On 19/12/2009 04:43, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:
> 
> > Ok, I came up with the following patch. Jeremy, can you please take
> > a look also, and comment on my fix since I noticed you've got the
> > same issue in your tree. Here's a summary for your benefit:
> 
> This patch doesn't apply to
> http://xenbits.xensource.com/linux-2.6.18-xen.hg by the way. The code
> is different there. So I'm dropping this patch as I have nowhere to
> put it.
> 
>  -- Keir
> 

Actually, INITIAL_JIFFIES appears to be buggy on 64bit linux:

#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

The cast to unsigned int keeps it at 0xfffedb08 instead of
0xfffffffffffedb08, which is what the intention is: that jiffies should
wrap within a few minutes. So, if they fix it in Linux in the future, my
patch will still have the same problem.

Ok, I'll come up with another patch.

thanks,
Mukesh


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 23:40       ` Mukesh Rathor
@ 2009-12-22  7:35         ` Keir Fraser
  0 siblings, 0 replies; 51+ messages in thread
From: Keir Fraser @ 2009-12-22  7:35 UTC (permalink / raw)
  To: Mukesh Rathor
  Cc: Kurt Hackel, Dan Magenheimer, Xen-devel@lists.xensource.com,
	Jeremy Fitzhardinge

On 21/12/2009 23:40, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:

> Actually, INITIAL_JIFFIES appears to be buggy on 64bit linux:
> 
> #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
> 
> The cast to unsigned int keeps it at 0xfffedb08 instead of
> 0xfffffffffffedb08, which is what the intention is: that jiffies should
> wrap within a few minutes. So, if they fix it in Linux in the future, my
> patch will still have the same problem.

Actually the cast to unsigned int is deliberate. They want jiffies to wrap
32 bits soon after boot, but it should pretty much never wrap 64 bits.

 -- Keir
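
The effect of the cast can be checked in user space. This is a minimal sketch under stated assumptions: HZ=250 (inferred from 0xfffedb08 being 2^32 - 300*250), uint64_t standing in for a 64-bit unsigned long, and hypothetical helper names that do not exist in the kernel.

```c
#include <assert.h>   /* only for the checks in the usage note */
#include <stdint.h>

/* Assumption: HZ = 250, inferred from 0xfffedb08 == 2^32 - 300*250. */
#define HZ 250

/* Models (unsigned long)(unsigned int)(-300*HZ) on a 64-bit kernel:
 * the value is truncated to 32 bits first, then zero-extended. */
static uint64_t initial_jiffies_with_cast(void)
{
	return (uint64_t)(uint32_t)(-300 * HZ);
}

/* What the value would be without the inner cast: sign-extended. */
static uint64_t initial_jiffies_without_cast(void)
{
	return (uint64_t)(int64_t)(-300 * HZ);
}

/* Ticks left until the low 32 bits of a jiffies value wrap to zero. */
static uint64_t ticks_until_32bit_wrap(uint64_t j)
{
	return (UINT64_C(1) << 32) - (j & UINT64_C(0xffffffff));
}
```

With the cast, the initial value is 0xfffedb08, so the low 32 bits wrap after 75000 ticks (300 s at the assumed HZ) while the full 64-bit counter stays far from its own wrap. Without the cast the value would be 0xfffffffffffedb08, and the 64-bit counter itself would wrap at the same moment.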


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-19  4:43   ` Mukesh Rathor
  2009-12-21  9:55     ` Jan Beulich
  2009-12-21 10:44     ` Keir Fraser
@ 2009-12-21 19:17     ` Steve Ofsthun
  2009-12-22  4:00       ` Mukesh Rathor
  2 siblings, 1 reply; 51+ messages in thread
From: Steve Ofsthun @ 2009-12-21 19:17 UTC (permalink / raw)
  To: Mukesh Rathor
  Cc: Dan Magenheimer, Xen-devel@lists.xensource.com, Hackel, jeremy,
	Keir Fraser, Kurt

Mukesh Rathor wrote:
> On Fri, 18 Dec 2009 07:02:55 +0000
> Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> 
>> On 18/12/2009 04:36, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:
>>
>>> The other fix I thought of was to change INITIAL_JIFFIES to
>>> something sooner.
>>>
>>> Would appreciate any help, I don't understand xen time management
>>> well.
>> This isn't really Xen time code, but unchanged Linux time code. I
>> don't know which tree you quoted the code from -- 2.6.18 has similar
>> but not identical. Anyway, I suggest try using the jiffy-comparison
>> macros from <linux/jiffies.h>: time_before(), time_after(), etc.
>> These are designed to work even when jiffies wraps. Feel free to send
>> patch(es) for that, if you test that out and it works okay.
>>
>>  -- Keir
>>
> 
> Ok, I came up with the following patch. Jeremy, can you please take a
> look also, and comment on my fix since I noticed you've got the same 
> issue in your tree. Here's a summary for your benefit:
> 
> init/calibrate.c :  calibrate_delay_direct():
> 
>                 start_jiffies = get_jiffies_64();
>                 while (get_jiffies_64() <= (start_jiffies + tick_divider)) {
>                         pre_start = start;
>                         read_current_timer(&start);
>                 }
> 

Linux time code explicitly forces jiffies (32-bit) to wrap soon after
boot to prevent other kernel code from making assumptions about jiffies
wrap. In your case, I'm guessing that the scrubbing delay is delaying
(queuing up) enough timer interrupts that jiffies wraps earlier in the
boot path than expected.

As Keir suggests, the correct solution is probably to use the time_before/after macros appropriately.

The proposed code avoids the problem by accessing jiffies_64 instead.

> If the first ever timer interrupt comes after start_jiffies is set, dom0
> boot may hang if delta in timer_interrupt() is so huge that it causes
> jiffies to wrap. It appears delta is very large when memory is more than
> 512GB on certain boxes, causing wraparound.
> 
> Why is delta in dom0->timer_interrupt() related to memory on the system?
> Because the hypervisor creates dom0, then scrubs pages, then unpauses
> the vcpu. So it appears that a lot of page scrubbing results in a huge
> delta on the first tick.

The problem here may be that timers are running in the domain while the vcpu is not.

Steve
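
The difference between the original loop condition and a wrap-safe one can be modelled in user space. This is only a sketch: jiffies is modelled as a 32-bit value, the function names are hypothetical, and tick_divider = 1 in the example below is an assumption. The wrap-safe test follows the signed-difference idea behind time_before_eq() in <linux/jiffies.h>.

```c
#include <assert.h>   /* only for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* Original condition: a plain unsigned compare against the deadline. */
static bool keep_spinning_naive(uint32_t jiffies, uint32_t start,
				uint32_t tick_divider)
{
	return jiffies <= (uint32_t)(start + tick_divider);
}

/* Wrap-safe condition: look at the sign of the 32-bit difference,
 * as time_before_eq(jiffies, deadline) would. The conversion to
 * int32_t is implementation-defined in C but behaves as two's
 * complement on mainstream compilers. */
static bool keep_spinning_wrapsafe(uint32_t jiffies, uint32_t start,
				   uint32_t tick_divider)
{
	uint32_t deadline = (uint32_t)(start + tick_divider);

	return (int32_t)(uint32_t)(jiffies - deadline) <= 0;
}
```

With start = 0xfffedb08 and the post-wrap value jiffies = 0x00009f8b reported at the top of the thread, the naive condition stays true (the loop keeps spinning until jiffies climbs back past the deadline), while the wrap-safe condition is false and the loop exits.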


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-21 19:17     ` Steve Ofsthun
@ 2009-12-22  4:00       ` Mukesh Rathor
  2009-12-22  4:18         ` Mukesh Rathor
  2009-12-22  7:59         ` Keir Fraser
  0 siblings, 2 replies; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-22  4:00 UTC (permalink / raw)
  To: Steve Ofsthun
  Cc: Magenheimer, Xen-devel@lists.xensource.com, Hackel, Dan, jeremy,
	Keir Fraser, Kurt

On Mon, 21 Dec 2009 14:17:57 -0500
Steve Ofsthun <steve.ofsthun@oracle.com> wrote:

> As Keir suggests, the correct solution is probably to use the
> time_before/after macros appropriately.
> 
> The proposed code avoids the problem by accessing jiffies_64 instead.

can't use time_after/before as they do signed comparisons.
  time_after(a,b): ((long)(b) - (long)(a) < 0)

thus, time_after(0xFFFEDB09, 0xFFFEDB08) will return true as will
time_after(0x1020, 0xFFFEDB08) as they are both after 0xFFFEDB08.

For wrapping, unsigned comparison must be done, which is also the jiffies
data type.


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22  4:00       ` Mukesh Rathor
@ 2009-12-22  4:18         ` Mukesh Rathor
  2009-12-22  7:59         ` Keir Fraser
  1 sibling, 0 replies; 51+ messages in thread
From: Mukesh Rathor @ 2009-12-22  4:18 UTC (permalink / raw)
  To: Mukesh Rathor
  Cc: Dan Magenheimer, Xen-devel@lists.xensource.com, Hackel, jeremy,
	Keir Fraser, Kurt, Steve Ofsthun

On Mon, 21 Dec 2009 20:00:25 -0800
Mukesh Rathor <mukesh.rathor@oracle.com> wrote:

> On Mon, 21 Dec 2009 14:17:57 -0500
> Steve Ofsthun <steve.ofsthun@oracle.com> wrote:
> 
> > As Keir suggests, the correct solution is probably to use the
> > time_before/after macros appropriately.
> > 
> > The proposed code avoids the problem by accessing jiffies_64
> > instead.
> 
> can't use time_after/before as they do signed comparisons.
>   time_after(a,b): ((long)(b) - (long)(a) < 0)
> 
> thus, time_after(0xFFFEDB09, 0xFFFEDB08) will return true as will
> time_after(0x1020, 0xFFFEDB08) as they are both after 0xFFFEDB08.
> 
> For wrapping, unsigned comparison must be done, which is also the
> jiffies data type.
> 

Actually, my bad. It can't be used in an if statement to check for
wrapping, but I can use it in the while loop here, as the loop seems to
only care whether jiffies has gone up.


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22  4:00       ` Mukesh Rathor
  2009-12-22  4:18         ` Mukesh Rathor
@ 2009-12-22  7:59         ` Keir Fraser
  2009-12-22  8:05           ` Keir Fraser
  1 sibling, 1 reply; 51+ messages in thread
From: Keir Fraser @ 2009-12-22  7:59 UTC (permalink / raw)
  To: Mukesh Rathor, Steve Ofsthun
  Cc: Hackel, Dan Magenheimer, Xen-devel@lists.xensource.com,
	Jeremy Fitzhardinge, Kurt@acsinet11.oracle.com

I'll try and make this *really* clear...

On 22/12/2009 04:00, "Mukesh Rathor" <mukesh.rathor@oracle.com> wrote:

> can't use time_after/before as they do signed comparisons.
>   time_after(a,b): ((long)(b) - (long)(a) < 0)

The whole point is to do signed comparison. This gives you +/-
(BITS_PER_LONG-1) bits to reliably compare: with 32-bit Linux that means
jiffy values which do not differ by more than +/- 2^31 can be reliably
compared, regardless of wrapping. Bear in mind that even at HZ=1000, it'll
take 3.5 *weeks* for jiffies to increase by 2^31.
 
> thus, time_after(0xFFFEDB09, 0xFFFEDB08) will return true as will
> time_after(0x1020, 0xFFFEDB08) as they are both after 0xFFFEDB08.

Well yeah: anything in the ranges a=0xFFFEDB09-0xFFFFFFFF and
a=0x0-0x7FFEDB08 will return true for time_after(a,0xFFFEDB08). That's how a
signed 32-bit comparison works. The assumption here is that 0x1020 is
derived from jiffies_64=0x100001020: in general the assumption is that the
arguments to time_after() were taken within seconds/minutes/hours of each
other, not days/weeks. Which precludes a jiffies_64 difference greater
than 0x7FFFFFFF, which is what would invalidate use of time_after().

> For wrapping, unsigned comparison must be done, which is also the jiffies
> data type.

If you do 32-bit unsigned comparisons, that is broken by jiffies wrapping,
the fixing of which was the whole point of the comparison macros.

 -- Keir
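
The comparison window described above can be checked with a 32-bit user-space model. This is a sketch, not the kernel macro itself: model_time_after() and days_to_advance_2_31() are hypothetical names, and the model uses fixed 32-bit types in place of long on 32-bit Linux.

```c
#include <assert.h>   /* only for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* 32-bit model of time_after(a, b): true when a is later than b.
 * The subtraction is done unsigned (modulo 2^32) and the result is
 * then read as signed, so the answer depends only on the distance
 * between a and b, not on where the counter wrapped. The conversion
 * to int32_t is implementation-defined in C but behaves as two's
 * complement on mainstream compilers. */
static bool model_time_after(uint32_t a, uint32_t b)
{
	return (int32_t)(uint32_t)(b - a) < 0;
}

/* Whole days needed for jiffies to advance by 2^31 ticks. */
static uint32_t days_to_advance_2_31(uint32_t hz)
{
	return (UINT32_C(1) << 31) / hz / 86400;
}
```

For b = 0xFFFEDB08, the model returns true for a = 0xFFFEDB09 and for the post-wrap a = 0x1020, and stops returning true once a is more than 2^31 ticks ahead of b. days_to_advance_2_31(1000) gives 24 whole days, in line with the "3.5 weeks" figure.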


* Re: [timer/ticks related] dom0 hang during boot on large 1TB system
  2009-12-22  7:59         ` Keir Fraser
@ 2009-12-22  8:05           ` Keir Fraser
  0 siblings, 0 replies; 51+ messages in thread
From: Keir Fraser @ 2009-12-22  8:05 UTC (permalink / raw)
  To: Mukesh Rathor, Steve Ofsthun
  Cc: Hackel, Dan Magenheimer, Xen-devel@lists.xensource.com,
	Jeremy Fitzhardinge, Kurt@acsinet11.oracle.com

On 22/12/2009 07:59, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> For wrapping, unsigned comparison must be done, which is also the jiffies
>> data type.
> 
> If you do 32-bit unsigned comparisons, that is broken by jiffies wrapping,
> the fixing of which was the whole point of the comparison macros.

I'm talking about '(ulong)b<(ulong)a' here of course. '(ulong)b-(ulong)a<0'
would always be false, which is even less useful.

 -- Keir

