From mboxrd@z Thu Jan 1 00:00:00 1970 From: John Sigler Subject: Re: Pin-pointing the root of unusual application latencies Date: Fri, 13 Jul 2007 10:32:37 +0200 Message-ID: <469738A5.2040306@free.fr> References: <469600F7.3060603@free.fr> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE To: linux-rt-users@vger.kernel.org, oprofile-list@lists.sourceforge.net Return-path: Received: from smtp4-g19.free.fr ([212.27.42.30]:57759 "EHLO smtp4-g19.free.fr" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756535AbXGMIck (ORCPT ); Fri, 13 Jul 2007 04:32:40 -0400 In-Reply-To: <469600F7.3060603@free.fr> Sender: linux-rt-users-owner@vger.kernel.org List-Id: linux-rt-users.vger.kernel.org John Sigler wrote: > [ Note: cross-posted to linux-rt-users and oprofile-list ] >=20 > I'm running kernel 2.6.20.7-rt8 on x86 (1267-MHz P3) with the > following debug options: >=20 > CONFIG_TRACE_IRQFLAGS_SUPPORT=3Dy > # CONFIG_PRINTK_TIME is not set > # CONFIG_ENABLE_MUST_CHECK is not set > # CONFIG_MAGIC_SYSRQ is not set > # CONFIG_UNUSED_SYMBOLS is not set > # CONFIG_DEBUG_FS is not set > # CONFIG_HEADERS_CHECK is not set > # CONFIG_DEBUG_KERNEL is not set > CONFIG_LOG_BUF_SHIFT=3D14 > CONFIG_STACKTRACE=3Dy > CONFIG_EVENT_TRACE=3Dy > CONFIG_FUNCTION_TRACE=3Dy > CONFIG_WAKEUP_TIMING=3Dy > CONFIG_LATENCY_TRACE=3Dy > # CONFIG_CRITICAL_PREEMPT_TIMING is not set > # CONFIG_CRITICAL_IRQSOFF_TIMING is not set > CONFIG_WAKEUP_LATENCY_HIST=3Dy > CONFIG_LATENCY_TIMING=3Dy > CONFIG_LATENCY_HIST=3Dy > CONFIG_MCOUNT=3Dy > CONFIG_DEBUG_BUGVERBOSE=3Dy > CONFIG_FRAME_POINTER=3Dy > CONFIG_UNWIND_INFO=3Dy > CONFIG_STACK_UNWIND=3Dy > CONFIG_EARLY_PRINTK=3Dy > CONFIG_X86_FIND_SMP_CONFIG=3Dy > CONFIG_X86_MPPARSE=3Dy > CONFIG_DOUBLEFAULT=3Dy >=20 > http://linux.kernel.free.fr/latency/config-2.6.20.7-rt8-adlink-latenc= y >=20 > Here's my situation: >=20 > A SCHED_RR process with priority 80 is performing blocking reads of=20 > 1316-byte chunks from a PCI device. The data comes in at ~38 Mbit/s. >=20 > Therefore, in a perfect world, my process would wake up periodically=20 > every 1316*8 / 38e6 =3D 277 =B5s to handle the latest chunk. >=20 > However, periodically, my process blocks up to 400 =B5s instead of 27= 7 =B5s. > Naturally, the next iteration, my process blocks less than 277 =B5s (= the=20 > PCI board has 1316 bytes ready every 277 =B5s regardless). >=20 > I've captured a histogram of read latencies. > It is expressed in units of 512 ns. > e.g. there were 23,126,460 occurences of LAT=3D278 =B5s (543 * 512 ns= ) > http://linux.kernel.free.fr/latency/histogram.txt > http://linux.kernel.free.fr/latency/read_latency.png >=20 > The system is otherwise completely idle. In other words, 100% of the = CPU=20 > is dedicated to the single task of waiting for an interrupt request,=20 > servicing it, and pushing the data to user land. >=20 > I used OProfile to frequently sample the instruction pointer while my= =20 > application ran all night long. >=20 > http://linux.kernel.free.fr/latency/oprofile_report_all_night.txt >=20 > CPU: PIII, speed 1266.7 MHz (estimated) > Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with= a=20 > unit mask of 0x00 (No unit mask) count 60000 > samples % symbol name > 958075714 90.6470 default_idle > 25637397 2.4256 mask_and_ack_8259A > 10697948 1.0122 __copy_to_user_ll > 9345347 0.8842 oprofiled > 9246368 0.8748 Dta1xxIsr > 6732131 0.6370 enable_8259A_irq > 5153650 0.4876 __schedule > 1873157 0.1772 irq_entries_start > 1729897 0.1637 increment_tail > 1565354 0.1481 clock_gettime > 1429908 0.1353 Dta1xxDmaTransfer > 1428107 0.1351 ioread16 > 1252095 0.1185 system_call > 1081768 0.1023 try_to_wake_up > 983687 0.0931 cpu_idle >=20 > I thought default_idle called HLT via safe_halt(). Since I'm only=20 > counting UNHALTED clock cycles, I'm surprised to see so many samples=20 > taken in default_idle. Does someone understand that? >=20 > Is it possible that one of these functions periodically delays the=20 > delivery of data to my SCHED_RR application? >=20 > Here's a /proc/latency_trace dump. What is there to understand? >=20 > # cat /proc/latency_trace > preemption latency trace v1.1.5 on 2.6.20.7-rt8 > -------------------------------------------------------------------- > latency: 26 us, #2/2, CPU#0 | (M:rt VP:0, KP:0, SP:1 HP:1) > ----------------- > | task: softirq-timer/0-4 (uid:0 nice:0 policy:1 rt_prio:50) > ----------------- >=20 > _------=3D> CPU# > / _-----=3D> irqs-off > | / _----=3D> need-resched > || / _---=3D> hardirq/softirq > ||| / _--=3D> preempt-depth > |||| / > ||||| delay > cmd pid ||||| time | caller > \ / ||||| \ | / > <...>-4 0D..1 26us : trace_stop_sched_switched=20 > (__sched_text_start) >=20 >=20 > vim:ft=3Dhelp Shouldn't there be many lines in this output? Do I need an additional kernel configuration option? > Is there a different diagnostic I can run to tell why data delivery > to my app is periodically delayed 100 =B5s? I'm open to all suggestions. I'm running out of ideas. > # cat /proc/latency_hist/wakeup_latency/CPU0 > #Minimum latency: 0 microseconds. > #Average latency: 0 microseconds. > #Maximum latency: 26 microseconds. > #Total samples: 14925698 > #There are 0 samples greater or equal than 10240 microseconds > #usecs samples > 0 3682745 > 1 9487737 > 2 1748372 > 3 1192 > 4 3689 > 5 50 > 6 14 > 7 14 > 8 9 > 9 46 > 10 995 > 11 799 > 12 6 > 13 8 > 14 4 > 15 6 > 16 3 > 17 0 > 18 1 > 19 0 > 20 0 > 21 0 > 22 3 > 23 1 > 24 0 > 25 1 > 26 3 >=20 > I would think that I might have a clearer picture if I could request = the=20 > last N instruction pointer samples from OProfile, to figure out what = the=20 > CPU has been doing for the past several hundred =B5s. I'd like to hear what OProfile devs think of this approach. I seems far from trivial to implement? Regards.