From mboxrd@z Thu Jan 1 00:00:00 1970
From: Robin Holt
Date: Mon, 19 Sep 2005 15:17:43 +0000
Subject: Re: Attribute spinlock contention ticks to caller.
Message-Id: <20050919151743.GA32392@lnx-holt.americas.sgi.com>
List-Id:
References: <20050914222644.GA5036@lnx-holt.americas.sgi.com>
In-Reply-To: <20050914222644.GA5036@lnx-holt.americas.sgi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

On Sun, Sep 18, 2005 at 06:18:20PM -0700, David Mosberger-Tang wrote:
> Well, it's an example where attributing the spinlock contention time
> to the caller would have completely obfuscated the problem.

Either way, we have obfuscation. In the one case (attributing to caller), the obfuscation can be resolved by looking at the code. In the other (multiple paths contending on independent locks), the obfuscation can only be resolved by repeating the test with different sampling. Although that sounds simple, what if the test is difficult to execute? What if this appeared to be a one-time aberration that was captured during one of many iterations? Then the chance to capture it is gone.

For a more complete illustration, I would like to elaborate on my previous example.

I had a sample file produced by our benchmarkers. They had received the results on their third run after tweaking some app settings, and the results were nearly impossible to believe. This happened to be an MPI job where all ranks barrier at the end of a phase, so a single slow rank makes the entire application slow. After the third run, they repeated with the app settings from the second run and then repeated again with the settings from the third run. Neither run showed any signs of a similar problem. The customer acceptance test continued. Before the customer would accept the results, they needed that anomaly explained.
Fortunately, the customer had required a sampling output from every run, so data had been taken using perfmon and retained. This was on a 2.4 based system.

The system had eight Ethernet adapters spread across the machine. Interrupts for each were targeted to different cpus. Because sampling was showing the caller, this turned into a simple question: why was there so much network receive activity? On some of the cpus, we noticed a significant number of processes were trying to enqueue network packets at the same time. The sample IP showed we were in a bundle after a spinlock was acquired. Had we not provided the caller, we would have been left with something that was nearly impossible to diagnose definitively. With the unroll, it became a simple matter of looking at the enabled network services and finding that somebody had run a network benchmark using all eight network adapters. We contacted the group responsible for network benchmarks, and the problem was isolated and explained to the customer's satisfaction.

I hope this illustrates that one way of sampling makes it slightly more difficult to determine that the source of a slowdown is contention on a lock, whereas the other way of sampling makes it impossible to determine the source of a problem. Given the choices, I would say the right way to do the sampling is to attribute the samples to the caller.

Thanks,
Robin