From mboxrd@z Thu Jan  1 00:00:00 1970
From: Robin Holt <holt@sgi.com>
Date: Mon, 19 Sep 2005 15:17:43 +0000
Subject: Re: Attribute spinlock contention ticks to caller.
Message-Id: <20050919151743.GA32392@lnx-holt.americas.sgi.com>
List-Id: <linux-ia64.vger.kernel.org>
References: <20050914222644.GA5036@lnx-holt.americas.sgi.com>
In-Reply-To: <20050914222644.GA5036@lnx-holt.americas.sgi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

On Sun, Sep 18, 2005 at 06:18:20PM -0700, David Mosberger-Tang wrote:
> Well, it's an example where attributing the spinlock contention time
> to the caller would have completely obfuscated the problem.

Either way, we have obfuscation.  In the one case (attributing to caller),
the obfuscation can be resolved by looking at the code.  In the other
(samples left in the contention routine, where multiple paths contending
on independent locks all look the same), the obfuscation can only be
resolved by repeating the test with different sampling.

Although that sounds simple, what if the test is difficult to execute?
What if this appeared to be a one-time aberration that was captured during
one of many iterations?  The chance to capture it again is gone.

For a more complete illustration, I would like to elaborate on my previous
example.  I had a sample file produced by our benchmarkers.  They had
received the results on their third run after tweaking some app settings
and the results were nearly impossible to believe.  This happened to be
an MPI job where all ranks barrier at the end of a phase so one single
rank being slow results in the entire application being slow.

After the third run, they repeated with the app settings from the
second run and then repeated again with the settings from the third
run.  Neither run showed any signs of a similar problem.  The customer
acceptance test continued.  Before the customer would accept the results,
they needed that anomaly explained.

Fortunately, the customer had required a sampling output from every
run so data had been taken using perfmon and retained.  This was on a
2.4-based system.  The system had eight Ethernet adapters spread across
the machine.  Interrupts for each were targeted to different cpus.

Because sampling was showing the caller, this turned into a simple
question: why was there so much network receive activity?  On some of
the cpus, we noticed a significant number of processes were trying to
enqueue network packets at the same time.  The sample IP showed we were
in a bundle after a spinlock was acquired.

Had we not provided the caller, we would have been left with something
that was nearly impossible to diagnose definitively.  With the unroll,
it became a simple matter of looking at the enabled network services and
finding somebody had run a network benchmark using all eight network
adapters.  We contacted the group responsible for network benchmarks
and the problem was isolated and explained to the customer's satisfaction.

I hope this illustrates that one way of sampling makes it slightly more
difficult to determine that the source of slowdown is contention on
a lock, whereas the other way of sampling makes it impossible to
determine the source of a problem.  Given the choices, I would say
the right way to do the sampling is to attribute the samples to
the caller.
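
As a rough sketch of the difference between the two accounting schemes:
the symbol names and sample counts below are made up for illustration
and are not taken from the run described above.  Each sample is assumed
to carry both the sampled symbol and the function that was trying to
take the lock.

```python
from collections import Counter

# Hypothetical name for the shared out-of-line contention routine.
CONTENTION = "spin_contention"


def histogram(samples, attribute_to_caller):
    """Count samples per symbol.

    Each sample is (ip_symbol, caller_symbol).  When attribute_to_caller
    is true, a sample taken inside the contention routine is charged to
    the function that was spinning on the lock.
    """
    counts = Counter()
    for ip_symbol, caller_symbol in samples:
        if attribute_to_caller and ip_symbol == CONTENTION:
            counts[caller_symbol] += 1
        else:
            counts[ip_symbol] += 1
    return counts


# Made-up samples: two unrelated paths spinning on independent locks,
# plus one sample of ordinary user work.
samples = (
    [(CONTENTION, "net_rx_enqueue")] * 6
    + [(CONTENTION, "page_cache_lookup")] * 3
    + [("user_compute", None)] * 1
)

flat = histogram(samples, attribute_to_caller=False)
by_caller = histogram(samples, attribute_to_caller=True)

print("not attributed:", dict(flat))
print("attributed:    ", dict(by_caller))
```

In the first histogram both contending paths collapse into a single
entry for the contention routine; in the second they stay separate, so
a retained sample file can still be read after the fact.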

Thanks,
Robin