From mboxrd@z Thu Jan  1 00:00:00 1970
From: Don Slutz <dslutz@verizon.com>
Subject: Re: [PATCH 1/1] xentrace: Add TRC_HW_VCHIP
Date: Fri, 28 Mar 2014 09:18:57 -0400
Message-ID: <533576C1.1070203@terremark.com>
References: <1396005951-29160-1-git-send-email-dslutz@verizon.com>
	<53356EF402000078000035D7@nat28.tlf.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <53356EF402000078000035D7@nat28.tlf.novell.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Jan Beulich <JBeulich@suse.com>
Cc: Ian Campbell <ian.campbell@citrix.com>, Stefano Stabellini <stefano.stabellini@eu.citrix.com>, George Dunlap <george.dunlap@eu.citrix.com>, Ian Jackson <ian.jackson@eu.citrix.com>, Don Slutz <dslutz@verizon.com>, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org


On 03/28/14 07:45, Jan Beulich wrote:
>>>> On 28.03.14 at 12:25, <dslutz@verizon.com> wrote:
>> This add a set of trace events that track the setup of various
>> virtual chips related to timers in domU.
>>
>> This set is hpet, pit (i8253, i8254), rtc (MC146818), apic (lapic),
>> and pic (i8259).  The pmtimer is not traced since it does not have a
>> changeable rate.
> But you're not saying anything about why this would be useful
> (considering that it wasn't needed before), and hence don't
> provide a reason for taking this change.

Thank you for asking.  I am assuming from this that some
patches (like this one) should have this.

This is an area that I am very weak on.  No simple statement
comes to mind.  So here is the story of how this patch came
about.

Months ago, 1 server for about 2 to 3 days would after 1st boot
have the 1 domU hang (1 out of 10 times) with the 1st interesting
message on the domU console of:

..MP-BIOS bug: 8254 timer not connected to IO-APIC

Since I know that this message was added to deal with certain
bad motherboards and that xen does not have this issue, I
started looking into this.

I considered add code like:

         HVM_DBG_LOG(DBG_LEVEL_VLAPIC_TIMER, "value[0x%016"PRIx64"]", 
value);

but was not sure this would not change the timer enough to
stop the bug from happening.  So I added this patch.

I then spent a lot of time trying to reproduce this issue.  I was
not able to, nor was the person and server that reported it was
able to.  This was under various configurations:

1) No change.

2) debug=y xen build

3) debug=n + patch

4) debug=y + patch

Using a few of the trace files and the source code of the domU's
kernel, I was able to determine that the hpet.c code was involved.

Using this knowledge, I made a patch to xen to simulate various
values of "diff" (tn_cmp - cur_tick).  With this debug code I was able
to generate the hang on demand.  This work is what caused me to
post the patch:

hpet: Act more like real hardware
http://lists.xen.org/archives/html/xen-devel/2014-02/msg02408.html

Which I now know to not be complete.  More testing after that time
has shown that 'diff > 0' will also cause this report if diff is large
enough.  Armed with this I went back to a few saved traces that I
had an was able to determine that the first interval in the calls to
create_periodic_time() (i.e. diff) had a very high variance.  I no
longer have the actual data, but my memory was that the
hpet_tick_to_ns(h, diff) values ranged from 23,696ns to 955,456ns.

More looking into linux in this area and learning about hpet hardware
and specification leads me to fact that this should not be happening.

I am still working on the set of changes to the hpet.c code to fix the
set of bugs that I think are there.


So I only know that this patch did provide very useful data to me. I
would think that it would be a help to developers in the future.

I think would could write a complex analysis program of the current
trace data and infer what some of the trace data was.  This to me is
a lot harder.  Since this is a new independent selectable trace a
developer can deside to include of exclude these.

This is long (and not a good reason for taking this change) but I
hope it helps.

    -Don Slutz


> Jan
>