* [Linux-ia64] Single bundle loops on Itanium-2 sometimes run slow
@ 2003-03-14 18:51 Jack Steiner
2003-03-14 19:46 ` David Mosberger
2003-03-14 20:26 ` Mallick, Asit K
0 siblings, 2 replies; 3+ messages in thread
From: Jack Steiner @ 2003-03-14 18:51 UTC (permalink / raw)
To: linux-ia64
SUMMARY:
It appears that if a single-bundle misaligned loop is interrupted (timer, etc),
then when the loop is resumed, it sometimes runs at about 1/3 the normal rate until
the loop completes. (I don't mean for just 1 or a few iterations of the loop!!)
4000000000000eb0: [MIB] adds r8=1,r32
4000000000000eb6: adds r32=-1,r32
4000000000000ebc: br.cloop.sptk.few 4000000000000eb0 <compute+0x60>;;
If "noise" is present, pfmon shows significant lost cycles due to BE_LOST_BW_DUE_TO_FE_BR_ILOCK.
A 15 second sample shows:
BE_LOST_BW_DUE_TO_FE_BR_ILOCK = 423 for good run
BE_LOST_BW_DUE_TO_FE_BR_ILOCK = 105935651 for noisy run
Changing the test so that the code is at different addresses affects the likelihood
that noise will occur. Sometimes noise occurs on only a subset of the cpus. Sometimes
booting a different kernel changes the likelihood of seeing noise.
Has anyone seen this behavior, or does anyone have an explanation?
-----------------------------------------------------------------------------------------
DETAILS:
We have a test that attempts to measure the amount of system "noise". "Noise" is
time spent away from the user application for servicing interrupts, timers, etc.
The test was run on an Itanium-2 (900 MHz). The system is idle & most daemons have been
killed.
The relevant parts of the test are:
- a compute loop
- a loop that uses a hardware clock to measure the
actual amount of time it took to execute the compute loop
Here is the code:
long
compute(long work)
{
        long i, dum = 0;

        for (i = COMPUTE_K * work; i; i--)
                dum = i + 1;
        return dum;
}
...
slave(int id) {
        ...
        while (go) {
                compute(compute_us);
                getclk(clk);                    // uses chipset clock
                delta = (clk - clkprev) / SCALE;
                if (delta >= BUCKETS-1)
                        stats[id][BUCKETS-1]++; // overflow bucket
                else
                        stats[id][delta]++;     // histogram of compute-loop times
                clkprev = clk;
        }
}
Sometimes the test runs fine & does not show noise. The following shows the test output with
a compute loop of 5 usec (other compute times behave as expected). The test should
spin for 5 usec in the compute loop, then read the clock & update
a histogram that shows how much time was spent in the compute loop. Here is a typical
output of the program:
0 0
1 0
2 0
3 0
4 0
5 1952417
6 534859
7 3
8 13618
9 1725
10 8
11 0
12 0
13 1
14 0
15 0
There is a spike at 5-6 usec. This is the time that it takes to execute the
compute loop. There is a secondary spike at 8-9 usec. If the compute loop is
interrupted by a timer tick, the time spent in the kernel processing the timer interrupt
must be added to the time required to execute the compute loop. The test shown above ran
for 15 sec wallclock time. The timer interrupt occurs once every 1/1024 sec, or about 1
interrupt every millisecond, so you should expect ~15000 timer ticks during the test period. This
agrees with what was measured.
Sometimes the test shows significant noise:
0 0
1 0
2 0
3 0
4 0
5 1913480
6 559966
7 0
8 1309
9 1166
10 1038
11 1078
12 1376
13 1377
14 1551
15 1480
16 1228
...
29 400
30 0
The secondary spike at 8-9 usec has been smeared/leveled out; it now extends to ~30 usec.
The total number of samples in the 8-30 usec bins is still ~15000.
Another interesting data point is that the range of the smeared/leveled out data points
depends on the amount of time spent in the compute loop. Running the test with different
times for the spinloop shows:
- 5 usec compute loop -> timer tick is smeared by ~30 usec
- 100 usec compute loop -> timer tick is smeared by ~330 usec
- 500 usec compute loop -> timer tick is smeared by ~1600 usec
KEY OBSERVATION:
Note that the smear extends to roughly 3X the size of the compute loop.
Now let's look at the code for the compute loop. The inner loop that is generated for the
compute loop is:
4000000000000eb0: [MIB] adds r8=1,r32
4000000000000eb6: adds r32=-1,r32
4000000000000ebc: br.cloop.sptk.few 4000000000000eb0 <compute+0x60>;;
This is a 1-bundle loop that is misaligned. Itanium-2 optimization guides call for
a minimum of 2 bundles. Furthermore, branch target addresses are supposed to be aligned on 32-byte
boundaries for best performance. The code shown above branches to an address that does not have
this alignment.
KEY OBSERVATION:
If I add code to the program to cause the compute loop to be 32-byte
aligned OR if I change the compute loop to be a 2-bundle loop, the noise disappears!!!
I have not been able to see ANY noise with a properly aligned compute() loop.
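For reference, here is a minimal sketch of the 2-bundle workaround in C. This is my
illustration, not code from the test: `compute2` is a hypothetical name, and whether
the compiler actually emits a multi-bundle loop depends on the compiler and flags.

```c
/* Sketch of the 2-bundle workaround: widen the loop body so it no longer
 * fits in a single MIB bundle.  For the alignment variant, something like
 * "gcc -O2 -falign-loops=32" asks the compiler to pad loop tops to a
 * 32-byte boundary instead. */
long
compute2(long work)
{
        long i, dum = 0;

        for (i = 2 * work; i > 0; i -= 2) {
                dum += i;       /* two dependent updates encourage the  */
                dum ^= i - 1;   /* compiler to emit a multi-bundle loop */
        }
        return dum;
}
```

Whether this actually splits the loop across bundles should be verified by
disassembling the generated code, as was done above.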
THEORY (yet to be proven):
My working theory at this point is that if the compute loop is interrupted by a
timer tick, then when the loop resumes execution, it runs at 1/3 of its normal rate
until the loop completes. This has the effect of lengthening the compute loop
by 3X the time remaining in the loop when the interrupt hit. The net effect of this
lengthening is that the time actually spent in an N usec compute loop should be evenly
distributed between N and 3N usec.
NOT UNDERSTOOD YET:
- whatever is causing the loop to run at 1/3 the normal rate is
both kernel dependent & cpu dependent. Some kernels NEVER show noise. Making slight
changes in the kernel can cause the noise to appear or disappear.
- the noise does not always occur on every cpu. Rebooting the system _may_ cause the
noise to move to a different set of cpus.
- sometimes, noise appears or disappears without a kernel reboot. This does
not occur very often, however.
Different kernels/cpus will reference different data addresses to process a timer
interrupt. Dynamic branch prediction may also change timing. This _might_ be an
explanation, but this has not been proven yet.
ADDITIONAL IMPORTANT DATA:
Here is data from pfmon that shows performance monitor statistics. The
data is collected separately for user mode & kernel mode. The most
significant statistic is the user mode BE_LOST_BW_DUE_TO_FE_BR_ILOCK. This
counter indirectly measures cpu stalls in instruction issue.
Only counters that are significantly different between good & bad runs
are listed here. (I have the full set if anyone is interested.)
User Mode counts
----------------------------------------------------------------------------
GOOD RUNS NOISY RUNS RATIO STATISTIC COUNTER
----------------------------------------------------------------------------
426 105935651 248675.24 BE_LOST_BW_DUE_TO_FE_BR_ILOCK
537 1248 2.32 FE_BUBBLE_FILL_RECIRC
58714 30960 .53 FE_LOST_BW_BI
8682390 116343848 13.40 FE_LOST_BW_BR_ILOCK
1086 2426 2.23 FE_LOST_BW_FILL_RECIRC
454 107145640 236003.61 IDEAL_BE_LOST_BW_DUE_TO_FE_BR_ILOCK
15 1265 84.33 IDEAL_BE_LOST_BW_DUE_TO_FE_FILL_RECIRC
13765 964 .07 L2_IFET_CANCELS_ANY
13145 448 .03 L2_IFET_CANCELS_BYPASS
505 129 .26 L2_IFET_CANCELS_DATA_RD
Kernel counts
----------------------------------------------------------------------------
GOOD RUNS NOISY RUNS RATIO STATISTIC COUNTER
----------------------------------------------------------------------------
240 22630 94.29 BE_LOST_BW_DUE_TO_FE_FILL_RECIRC
23652 13271 .56 BUS_OOQ_LIVE_REQ_HI
218034 326080 1.50 FE_BUBBLE_BRANCH
11493 13844 1.20 FE_BUBBLE_FILL_RECIRC
227455 351785 1.55 FE_BUBBLE_GROUP3
50038 309428 6.18 FE_LOST_BW_BI
9370 3393 .36 L2_FILLB_FULL_THIS
14314 24403 1.70 L2_OZQ_CANCELS2_DIDNT_RECIRC
10194 12999 1.28 L2_OZQ_CANCELS2_RECIRC_OVER_SUB
--
Thanks
Jack Steiner (651-683-5302) (vnet 233-5302) steiner@sgi.com
* Re: [Linux-ia64] Single bundle loops on Itanium-2 sometimes run slow
2003-03-14 18:51 [Linux-ia64] Single bundle loops on Itanium-2 sometimes run slow Jack Steiner
@ 2003-03-14 19:46 ` David Mosberger
2003-03-14 20:26 ` Mallick, Asit K
1 sibling, 0 replies; 3+ messages in thread
From: David Mosberger @ 2003-03-14 19:46 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 14 Mar 2003 12:51:43 -0600 (CST), Jack Steiner <steiner@sgi.com> said:
Jack> SUMMARY: It appears that if a single bundle misaligned loop is
Jack> interrupted (timer, etc), when the loop is resumed, it
Jack> sometimes runs at about 1/3 the normal rate until the loop
Jack> completes. (I dont mean 1 or a few iterations of the loop!!)
Jack> 4000000000000eb0: [MIB] adds r8=1,r32
Jack> 4000000000000eb6: adds r32=-1,r32 4000000000000ebc:
Jack> br.cloop.sptk.few 4000000000000eb0 <compute+0x60>;;
Jack> If "noise" is present, pfmon shows significant lost cycles due
Jack> to BE_LOST_BW_DUE_TO_FE_BR_ILOCK. A 15 second sample shows:
Jack> BE_LOST_BW_DUE_TO_FE_BR_ILOCK = 423 for good run
Jack> BE_LOST_BW_DUE_TO_FE_BR_ILOCK = 105935651 for noisy run
Jack> Changing the test so that code is at different addresses
Jack> effects the likelyhood that noise will occur. Sometimes noise
Jack> occurs on only a subset of the cpus. Sometimes booting a
Jack> different kernel changes the likelyhood of seeing noise.
Jack> Has anyone seen this behavior or have an explanation.
Can you check whether the problem goes away when you turn the loop
into a 2-cycle loop? My bet is that it will.
--david
* RE: [Linux-ia64] Single bundle loops on Itanium-2 sometimes run slow
2003-03-14 18:51 [Linux-ia64] Single bundle loops on Itanium-2 sometimes run slow Jack Steiner
2003-03-14 19:46 ` David Mosberger
@ 2003-03-14 20:26 ` Mallick, Asit K
1 sibling, 0 replies; 3+ messages in thread
From: Mallick, Asit K @ 2003-03-14 20:26 UTC (permalink / raw)
To: linux-ia64
Jack,
We are looking at this. I am assuming that if the bundle is aligned then
you do not see this problem. Is my assumption correct?
Thanks,
Asit
> -----Original Message-----
> From: David Mosberger [mailto:davidm@napali.hpl.hp.com]
> Sent: Friday, March 14, 2003 11:47 AM
> To: Jack Steiner
> Cc: linux-ia64@linuxia64.org
> Subject: Re: [Linux-ia64] Single bundle loops on Itanium-2 sometimes
run
> slow
>
> >>>>> On Fri, 14 Mar 2003 12:51:43 -0600 (CST), Jack Steiner
> <steiner@sgi.com> said:
>
> Jack> SUMMARY: It appears that if a single bundle misaligned loop is
> Jack> interrupted (timer, etc), when the loop is resumed, it
> Jack> sometimes runs at about 1/3 the normal rate until the loop
> Jack> completes. (I dont mean 1 or a few iterations of the loop!!)
>
> Jack> 4000000000000eb0: [MIB] adds r8=1,r32
> Jack> 4000000000000eb6: adds r32=-1,r32 4000000000000ebc:
> Jack> br.cloop.sptk.few 4000000000000eb0 <compute+0x60>;;
>
>
> Jack> If "noise" is present, pfmon shows significant lost cycles due
> Jack> to BE_LOST_BW_DUE_TO_FE_BR_ILOCK. A 15 second sample shows:
> Jack> BE_LOST_BW_DUE_TO_FE_BR_ILOCK = 423 for good run
> Jack> BE_LOST_BW_DUE_TO_FE_BR_ILOCK = 105935651 for noisy run
>
> Jack> Changing the test so that code is at different addresses
> Jack> effects the likelyhood that noise will occur. Sometimes noise
> Jack> occurs on only a subset of the cpus. Sometimes booting a
> Jack> different kernel changes the likelyhood of seeing noise.
>
> Jack> Has anyone seen this behavior or have an explanation.
>
> Can you check whether the problem goes away when you turn the loop
> into a 2-cycle loop? My bet is that it will.
>
> --david
>
> _______________________________________________
> Linux-IA64 mailing list
> Linux-IA64@linuxia64.org
> http://lists.linuxia64.org/lists/listinfo/linux-ia64