From mboxrd@z Thu Jan 1 00:00:00 1970
From: Olivier Hanesse
Subject: Re: Xen 4 TSC problems
Date: Mon, 28 Feb 2011 16:54:29 +0100
Message-ID:
References: <68c41dd8-9195-41b0-83d7-9242b8eff809@default>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============0467043827=="
Return-path:
In-Reply-To:
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Dan Magenheimer
Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com, Keir Fraser, Jan Beulich, Keir Fraser, Xen Users, Mark Adams
List-Id: xen-devel@lists.xenproject.org

--===============0467043827==
Content-Type: multipart/alternative; boundary=90e6ba6e80fef3c66d049d59b13d

--90e6ba6e80fef3c66d049d59b13d
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Yes this is what I mean.
I am glad to hear that it isn't a bad sign :)

I thought it was a bad sign, because on systems with a "reliable TSC" this
counter is always 0.

2011/2/28 Dan Magenheimer <dan.magenheimer@oracle.com>

> Hi Olivier –
>
> By “warp=3000+ in debug message” do you mean the Xen boot message “TSC has
> constant rate..., warp = NNNN”?
>
> If so, this is a very different “warp” measured in cycles, not in seconds,
> so 3000 is more like a microsecond, not an hour, and this is normal (not a
> bad sign).
>
> Dan
>
> *From:* Olivier Hanesse [mailto:olivier.hanesse@gmail.com]
> *Sent:* Monday, February 28, 2011 8:23 AM
> *To:* Dan Magenheimer
> *Cc:* Jeremy Fitzhardinge; Keir Fraser; Jan Beulich; Mark Adams;
> xen-devel@lists.xensource.com; Xen Users; Keir Fraser
>
> *Subject:* Re: [Xen-devel] Xen 4 TSC problems
>
> Keir :
>
> Yes, it is "under progress".
>
> To make this change, I had to reboot every server, so it is taking time
> (production server :()
>
> So I was hoping to find a quick method to mitigate this issue on domUs
> while rebooting servers.
>
> As this bug has happened once or twice per server since October, I can't say
> right now that changing the platform timer to PIT fixed it. I have to wait
> (I hope forever!) for this bug to happen again on a 'patched' server ...
>
> But even with clocksource=pit, I am seeing some warp=3000+ in debug message
> ? I guess it is not a good sign, is it ?
>
> Jan : I was hoping to find a way to make the domU clocksource more
> "independent" like with xen3.2.
>
> 2011/2/28 Dan Magenheimer <dan.magenheimer@oracle.com>
>
> Hi Olivier –
>
> It is the Xen clocksource that you want to try to change, not the dom0
> clocksource.  To do this, you need to specify “clocksource=pit” on the Xen
> boot line (and reboot), not the dom0 boot line.
>
> I believe Mark Adams played with tsc_mode to see if it solved his
> (similar? identical?) problem last year, and it didn’t make any difference.
>
> Please try booting Xen with “clocksource=pit” and ensure that “Platform
> timer is 1.19MHz PIT” appears in the Xen boot messages.  If the 50min jump
> does not appear again, it would point to a problem in the hpet, either
> hardware or software.
>
> Thanks,
> Dan
>
> *From:* Olivier Hanesse [mailto:olivier.hanesse@gmail.com]
> *Sent:* Monday, February 28, 2011 7:37 AM
> *To:* Jeremy Fitzhardinge
> *Cc:* Dan Magenheimer; Keir Fraser; Jan Beulich; Mark Adams;
> xen-devel@lists.xensource.com; Xen Users; Keir Fraser
>
> *Subject:* Re: [Xen-devel] Xen 4 TSC problems
>
> Hello,
>
> It happened again twice this weekend.
>
> What about setting "tsc_mode=2" for my vms ? Should this mode prevent this
> bug (coming from a bad emulated tsc due to firmware issue ? is it possible
> ?) from affecting time in domUs ?
>
> Setting clocksource=pit makes 'tsc' available in
> "/sys/devices/system/clocksource/clocksource0/available_clocksource"
> (otherwise only xen is available, is it normal ? ).
>
> Should I bypass xen clocksource and use tsc as a clocksource for dom0/domU
> ? or will it be worse ?
>
> Regards
>
> Olivier
>
> 2011/2/24 Jeremy Fitzhardinge <jeremy@goop.org>
>
> On 02/24/2011 09:43 AM, Dan Magenheimer wrote:
> > Just a wild guess, but this in Olivier's posted output:
> >
> > (XEN) Platform timer appears to have unexpectedly wrapped 10 or more times.
> >
> > and the fact that a 32-bit HPET wrap is ~300 seconds and, with the
> > "10 or more times", 10 * 300 seconds is 3000 seconds, might be a clue
> > (or a complete red herring, but I thought it worth mentioning).
> >
> > Mark and Olivier, it would be interesting to know if you are
> > using the same processor/system.
>
> It definitely seems like some kind of problem on the host system rather
> than anything in the guests themselves.  If the platform timer is
> misbehaving, then Xen could be completely screwing up the pvclock
> calibration which it then passes to guests.
>
> Could it be one of those "platform clock stops in certain power states"
> problems?
>
>    J
>
> >> -----Original Message-----
> >> From: Keir Fraser [mailto:keir.xen@gmail.com]
> >> Sent: Thursday, February 24, 2011 7:52 AM
> >> To: Olivier Hanesse; Jan Beulich
> >> Cc: Mark Adams; Jeremy Fitzhardinge; xen-devel@lists.xensource.com; Xen
> >> Users; Dan Magenheimer; Keir Fraser
> >> Subject: Re: [Xen-devel] Xen 4 TSC problems
> >>
> >> On 24/02/2011 14:20, "Olivier Hanesse" <olivier.hanesse@gmail.com>
> >> wrote:
> >>
> >>> Both dom0 and domUs are affected by this "jump".
> >>>
> >>> I expect to see something like "TSC marked as reliable, warp = 0".
> >>> I got this on newer hardware with same config/distros.
> >> It depends on the CPU itself, older CPUs do not have the super-stable TSC
> >> features. But that should never cause a massive 3000s time jump.
> >>
> >>> Is there a way to measure if it is a TSC warp ? to point out a cpu
> >>> tsc issue ?
> >>
> >> The TSC warps or out-of-sync issues that we could reasonably expect
> >> would be on the order of microseconds. A 3000s warp is something else
> >> entirely. Xen is very confused and/or some TSC or platform timer has
> >> jumped a long way (indicating a hardware/firmware issue).
> >>
> >>  -- Keir
> >>
> >>> 2011/2/24 Jan Beulich <JBeulich@novell.com>
> >>>
> >>>>>>> On 24.02.11 at 12:57, Olivier Hanesse <olivier.hanesse@gmail.com> wrote:
> >>>>> I tried to turn off cstates with max_cstate=0 without success
> >>>>> (still "not reliable").
> >>>>>
> >>>>> With cpuidle=0, I also got :
> >>>>>
> >>>>> (XEN) TSC has constant rate, deep Cstates possible, so not reliable,
> >>>>> warp=3022 (count=1)
> >>>> This message by itself isn't telling much I believe.
> >>>>
> >>>>> xm info | grep command
> >>>>> xen_commandline        : dom0_mem=512M cpuidle=0 loglvl=all guest_loglvl=all
> >>>>> dom0_max_vcpus=1 dom0_vcpus_pin console=vga,com1 com1=19200,8n1
> >>>>>
> >>>>> Keir :
> >>>>>
> >>>>> Using clocksource=pit :
> >>>>>
> >>>>> (XEN) Platform timer is 1.193MHz PIT
> >>>>>
> >>>>> I also got :
> >>>>>
> >>>>> (XEN) TSC has constant rate, deep Cstates possible, so not reliable,
> >>>>> warp=3262 (count=2)
> >>>> The question is whether any of this eliminates the time jumps seen
> >>>> by your DomU-s (from your past mails I wasn't actually sure whether
> >>>> Dom0 also experienced this problem, albeit it would be odd if it didn't).
> >>>>
> >>>> Jan
> >>>>
> >>>
> >>
>

--90e6ba6e80fef3c66d049d59b13d
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Yes this is what I mean.
--90e6ba6e80fef3c66d049d59b13d--

--===============0467043827==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

--===============0467043827==--
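
The two time scales discussed in the thread (a "warp" of about 3000 cycles versus a jump of about 3000 seconds) can be checked with a little arithmetic. The following is only a sketch: the 14.31818 MHz HPET rate and the 3.0 GHz TSC rate are assumptions (common values, not figures reported in the thread), and the function names are made up for illustration.

```python
# Assumed values, NOT taken from the thread: a 32-bit HPET main counter
# running at the common 14.31818 MHz rate, and a 3.0 GHz TSC.
HPET_HZ = 14_318_180
TSC_HZ = 3_000_000_000


def counter_wrap_seconds(bits: int, hz: int) -> float:
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << bits) / hz


def cycles_to_microseconds(cycles: int, hz: int) -> float:
    """Convert a cycle count into microseconds at the given clock rate."""
    return cycles / hz * 1e6


# One wrap of a 32-bit counter at the assumed HPET rate.
hpet_wrap = counter_wrap_seconds(32, HPET_HZ)
# "wrapped 10 or more times" from the Xen log, expressed in minutes.
ten_wraps_minutes = 10 * hpet_wrap / 60
# warp=3022 is the value from the quoted Xen boot message.
warp_us = cycles_to_microseconds(3022, TSC_HZ)

print(f"one 32-bit HPET wrap: {hpet_wrap:.1f} s")
print(f"ten wraps:            {ten_wraps_minutes:.1f} min")
print(f"warp=3022 cycles:     {warp_us:.2f} us")
```

Under these assumptions one HPET wrap is close to 300 seconds and ten wraps come to about 50 minutes, matching the size of the observed jump, while a warp of 3022 cycles is about one microsecond.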
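
To see which clocksource a dom0 or domU kernel is actually using, the sysfs directory quoted in the thread can be read directly. This is a minimal sketch, assuming a Linux guest that exposes both available_clocksource (mentioned in the thread) and its sibling file current_clocksource; the helper name read_clocksources is hypothetical.

```python
from pathlib import Path

# Directory taken from the path quoted in the thread.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = SYSFS_DIR) -> tuple[str, list[str]]:
    """Return (current clocksource, list of available clocksources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    # Only attempt the read where the sysfs directory exists (Linux).
    if SYSFS_DIR.is_dir():
        current, available = read_clocksources()
        print(f"current:   {current}")
        print(f"available: {', '.join(available)}")
    else:
        print("clocksource sysfs directory not found on this system")
```

Running this inside a guest before and after changing the Xen boot line would show whether 'tsc' appears next to 'xen' in the available list, as described in the thread.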