From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jack Steiner
Date: Wed, 28 Mar 2007 01:53:04 +0000
Subject: Re: [PATCH] - Optional method to purge the TLB on SN systems
Message-Id: <20070328015304.GA4366@sgi.com>
List-Id: 
References: <20070327193925.GA8615@sgi.com>
In-Reply-To: <20070327193925.GA8615@sgi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

On Wed, Mar 28, 2007 at 08:46:44AM +0800, Zou Nan hai wrote:
> On Wed, 2007-03-28 at 03:39, Jack Steiner wrote:
> 
> > This patch adds an optional method for purging the TLB on SN IA64 systems.
> > The change should not affect any non-SN system.
> > 
> > Signed-off-by: Jack Steiner
> > 
> > ---
> > 
> > +void
> > +smp_flush_tlb_cpumask (cpumask_t xcpumask)
> > +{
> > +	unsigned short counts[NR_CPUS];
> > +	cpumask_t cpumask = xcpumask;
> > +	int count, mycpu, cpu, flush_mycpu = 0;
> > +
> > +	preempt_disable();
> > +	mycpu = smp_processor_id();
> > +
> > +	for_each_cpu_mask(cpu, cpumask) {
> > +		counts[cpu] = per_cpu(local_flush_count, cpu);
> > +		mb();
> > +		if (cpu == mycpu)
> > +			flush_mycpu = 1;
> > +		else
> > +			smp_send_local_flush_tlb(cpu);
> > +	}
> > +
> > +	if (flush_mycpu)
> > +		smp_local_flush_tlb();
> > +
> > +	for_each_cpu_mask(cpu, cpumask) {
> > +		count = 0;
> > +		while (counts[cpu] == per_cpu(local_flush_count, cpu)) {
> 
> Due to 64k offset of percpu data, the same percpu variable on different
> CPUs are very likely to be on the same cacheline of some levels of
> cache.
> 
> So I think the operation on local_flush_count may be very cache
> unfriendly...

I was concerned about that, too, but testing finally convinced me that
it was not an issue. I think the reason is that it takes a few hundred
nanoseconds per cpu to send an IPI. So rather than a contended cache
line, we have a line that is serially read by multiple cpus. Although
contention can occur, typically multiple cpus are not trying to read the
same line at the same time.
For example (oversimplified): the IPI is sent to cpu 0 at time 0, to
cpu 1 at time ~100, to cpu 2 at time ~200, etc. The IPI requires a
chipset access that takes order-of-memory-access time. Assume it takes
N usec for a cpu to recognize the IPI & call the TLB flushing code.
Cpu 0 reads local_flush_count at time N, cpu 1 reads local_flush_count
at time 100+N, etc. Very little contention, just serial access.

--

I tried a second algorithm where the local_flush_count was kept in
node-local percpu data. That scheme was significantly slower, most
likely because the cpu that initiates the flush will take N (# of cpus)
cache misses to get an initial snapshot of the counts, then another N
cache misses to check for completion. This assumes that a cpu doing a
flush is not the most-recent cpu to do a flush. I believe this is
typical. Keeping the counts in a single array (64 cpus/cache line)
significantly reduces the number of cache misses.

Another disadvantage of keeping counts in per-cpu data is that scanning
the counts trashes the TLB for large NR_CPUS. The counts will be
located in different 16MB granules. Each reference to a cpu's percpu
data will require a different TLB entry to map the address used to
reference the count. To scan N cpus, there will be ~2*N TLB misses,
plus, at the end of the flush, the contents of the TLB are useless for
most kernel or user use.

--

I tried a third algorithm where the counts were kept in a single array
but each count was cacheline aligned to eliminate any possibility of
contention. This was better than the second method that trashed the
TLB; 1 TLB entry will cover the entire array. Unfortunately, this
algorithm still incurs 2*N cache misses & is slower than the current
algorithm.

Does this explanation make sense? If anyone has an alternate algorithm,
I'd be glad to try it.

-- jack