From mboxrd@z Thu Jan 1 00:00:00 1970 From: Don Slutz Subject: Re: [PATCH 1/1] xentrace: Add TRC_HW_VCHIP Date: Fri, 28 Mar 2014 09:18:57 -0400 Message-ID: <533576C1.1070203@terremark.com> References: <1396005951-29160-1-git-send-email-dslutz@verizon.com> <53356EF402000078000035D7@nat28.tlf.novell.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <53356EF402000078000035D7@nat28.tlf.novell.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Jan Beulich Cc: Ian Campbell , Stefano Stabellini , George Dunlap , Ian Jackson , Don Slutz , xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org On 03/28/14 07:45, Jan Beulich wrote: >>>> On 28.03.14 at 12:25, wrote: >> This add a set of trace events that track the setup of various >> virtual chips related to timers in domU. >> >> This set is hpet, pit (i8253, i8254), rtc (MC146818), apic (lapic), >> and pic (i8259). The pmtimer is not traced since it does not have a >> changeable rate. > But you're not saying anything about why this would be useful > (considering that it wasn't needed before), and hence don't > provide a reason for taking this change. Thank you for asking. I am assuming from this that some patches (like this one) should have this. This is an area that I am very weak on. No simple statement comes to mind. So here is the story of how this patch came about. Months ago, 1 server for about 2 to 3 days would after 1st boot have the 1 domU hang (1 out of 10 times) with the 1st interesting message on the domU console of: ..MP-BIOS bug: 8254 timer not connected to IO-APIC Since I know that this message was added to deal with certain bad motherboards and that xen does not have this issue, I started looking into this. I considered add code like: HVM_DBG_LOG(DBG_LEVEL_VLAPIC_TIMER, "value[0x%016"PRIx64"]", value); but was not sure this would not change the timer enough to stop the bug from happening. So I added this patch. I then spent a lot of time trying to reproduce this issue. I was not able to, nor was the person and server that reported it was able to. This was under various configurations: 1) No change. 2) debug=y xen build 3) debug=n + patch 4) debug=y + patch Using a few of the trace files and the source code of the domU's kernel, I was able to determine that the hpet.c code was involved. Using this knowledge, I made a patch to xen to simulate various values of "diff" (tn_cmp - cur_tick). With this debug code I was able to generate the hang on demand. This work is what caused me to post the patch: hpet: Act more like real hardware http://lists.xen.org/archives/html/xen-devel/2014-02/msg02408.html Which I now know to not be complete. More testing after that time has shown that 'diff > 0' will also cause this report if diff is large enough. Armed with this I went back to a few saved traces that I had an was able to determine that the first interval in the calls to create_periodic_time() (i.e. diff) had a very high variance. I no longer have the actual data, but my memory was that the hpet_tick_to_ns(h, diff) values ranged from 23,696ns to 955,456ns. More looking into linux in this area and learning about hpet hardware and specification leads me to fact that this should not be happening. I am still working on the set of changes to the hpet.c code to fix the set of bugs that I think are there. So I only know that this patch did provide very useful data to me. I would think that it would be a help to developers in the future. I think would could write a complex analysis program of the current trace data and infer what some of the trace data was. This to me is a lot harder. Since this is a new independent selectable trace a developer can deside to include of exclude these. This is long (and not a good reason for taking this change) but I hope it helps. -Don Slutz > Jan >