From: Jack Steiner <steiner@sgi.com>
To: linux-ia64@vger.kernel.org
Subject: Re: [PATCH] - Optional method to purge the TLB on SN systems
Date: Wed, 28 Mar 2007 03:26:05 +0000
Message-ID: <20070328032605.GA9496@sgi.com>
In-Reply-To: <20070327193925.GA8615@sgi.com>
On Wed, Mar 28, 2007 at 11:03:50AM +0800, Zou, Nanhai wrote:
> > -----Original Message-----
> > From: Jack Steiner [mailto:steiner@sgi.com]
> > Sent: March 28, 2007 9:53
> > To: Zou, Nanhai
> > Cc: Luck, Tony; Linux-IA64
> > Subject: Re: [PATCH] - Optional method to purge the TLB on SN systems
> >
> > On Wed, Mar 28, 2007 at 08:46:44AM +0800, Zou Nan hai wrote:
> > > On Wed, 2007-03-28 at 03:39, Jack Steiner wrote:
> > >
> > > > This patch adds an optional method for purging the TLB on SN IA64 systems.
> > > > The change should not affect any non-SN system.
> > > >
> > > > Signed-off-by: Jack Steiner <steiner@sgi.com>
> > > >
> > > > ---
> > > >
> > > > +void
> > > > +smp_flush_tlb_cpumask (cpumask_t xcpumask)
> > > > +{
> > > > + unsigned short counts[NR_CPUS];
> > > > + cpumask_t cpumask = xcpumask;
> > > > + int count, mycpu, cpu, flush_mycpu = 0;
> > > > +
> > > > + preempt_disable();
> > > > + mycpu = smp_processor_id();
> > > > +
> > > > + for_each_cpu_mask(cpu, cpumask) {
> > > > + counts[cpu] = per_cpu(local_flush_count, cpu);
> > > > + mb();
> > > > + if (cpu == mycpu)
> > > > + flush_mycpu = 1;
> > > > + else
> > > > + smp_send_local_flush_tlb(cpu);
> > > > + }
> > > > +
> > > > + if (flush_mycpu)
> > > > + smp_local_flush_tlb();
> > > > +
> > > > + for_each_cpu_mask(cpu, cpumask) {
> > > > + count = 0;
> > > > + while (counts[cpu] == per_cpu(local_flush_count, cpu)) {
> > >
> > > Due to the 64k offset of percpu data, the copies of the same percpu
> > > variable on different CPUs are very likely to be on the same cacheline
> > > at some levels of cache.
> > >
> > > So I think the operation on local_flush_count may be very cache
> > > unfriendly...
> >
> > I was concerned about that, too, but testing finally convinced me that
> > it was not an issue. I think the reason is that it takes a few hundred
> > nanoseconds per cpu to send an IPI. So rather than a contended cache
> > line, we have a line that is serially read by multiple cpus. Although
> > contention can occur, typically multiple cpus are not trying to read
> > the same line at the same time.
> >
> > For example (oversimplified), IPI sent to cpu 0 at time 0, to cpu 1 at
> > time ~100, cpu 2 at time ~200, etc. The IPI requires a chipset access
> > that takes order-of-memory-access time. Assume it takes N usec for a
> > cpu to recognize the IPI & call the TLB flushing code. Cpu 0 reads
> > local_flush_count at time N, cpu 1 reads local_flush_count at time
> > 100+N, etc. Very little contention, just serial access.
> >
> > --
> >
> > I tried a second algorithm where the local_flush_count was kept in
> > node-local percpu data. That scheme was significantly slower. Most
> > likely because the cpu that initiates the flush will take N (# of
> > cpus) cache misses to get an initial snapshot of the counts, then
> > another N cache misses to check for completion. This assumes that
> > a cpu doing a flush is not the most-recent cpu to do a flush.
> > I believe this is typical.
> >
> > Keeping the counts in a single array (64cpus/cache line)
> > significantly reduces the number of cache misses.
> >
> >
> > Another disadvantage of keeping counts in per-cpu data is that
> > scanning the counts trashes the TLB for large NR_CPUS. The counts will
> > be located in different 16MB granules. Each reference to cpu's percpu
> > data will require a different TLB entry to map the address used to
> > reference the count. To scan N cpus, there will be ~2*N TLB misses
> > plus at the end of the flush, the contents of the TLB are useless
> > for most kernel or user use.
> >
> > --
> >
> > I tried a third algorithm where the counts were kept in a single array
> > but each count was cacheline aligned to eliminate any possibility
> > of contention. This was better than the second method that trashed
> > the TLB. 1 TLB entry will cover the entire array. Unfortunately,
> > this algorithm still incurs 2*N cache misses & is slower than
> > the current algorithm.
> >
> >
> > Does this explanation make sense? If anyone has an alternate
> > algorithm, I'd be glad to try it.
>
> Yes, putting the counts in a tight array could be better.
> But your original patch is using the second algorithm?
That's embarrassing.
I had several variants of the patch & did a lot of testing with each.
The only difference was in the "counts". Arrays, sizes, alignment,
percpu, etc. It looks like I grabbed the wrong patch.
I want to review my notes & possibly retest to make sure that what I
said was correct about the differences between the patches & the
performance of each.
Stay tuned & thanks for the careful review.
-- jack
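
The cache-line and TLB arithmetic behind the three layouts discussed in
the quoted text can be sketched as below. This is a minimal illustration,
not code from the patch. The helper blocks_touched() and all four macros
are hypothetical names. The 64KB percpu offset and the 16MB granule are
the figures quoted in the thread; the 128-byte cache line is an assumption
inferred from "64cpus/cache line" with 2-byte (unsigned short) counts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for illustration only. */
#define COUNT_SIZE   2UL                   /* assumed sizeof(unsigned short) */
#define CACHE_LINE   128UL                 /* assumed: 64 counts per line */
#define PERCPU_STEP  (64UL * 1024)         /* percpu offset quoted in the thread */
#define GRANULE      (16UL * 1024 * 1024)  /* granule size quoted in the thread */

/*
 * Count the distinct aligned blocks (cache lines or TLB-mapped granules)
 * of size `block` touched by one scan over `n` counters of `size` bytes
 * placed `stride` bytes apart, with the first counter at offset 0.
 * Offsets only increase, so remembering the previous block is enough
 * to avoid counting a block twice.
 */
static size_t blocks_touched(size_t n, size_t stride, size_t size, size_t block)
{
    size_t touched = 0;
    size_t prev = (size_t)-1;   /* sentinel: no block seen yet */

    assert(block > 0 && size > 0 && stride >= size);
    for (size_t i = 0; i < n; i++) {
        size_t first = (i * stride) / block;
        size_t last = (i * stride + size - 1) / block;

        for (size_t b = first; b <= last; b++) {
            if (b != prev) {
                touched++;
                prev = b;
            }
        }
    }
    return touched;
}
```

Under these assumptions, one scan of 64 cpus touches 1 cache line for the
tight array (stride COUNT_SIZE), 64 lines for the cacheline-aligned array
(stride CACHE_LINE), and 64 lines for percpu data (stride PERCPU_STEP).
Passing GRANULE as both stride and block models counts that each sit in a
different granule, which gives one TLB entry per cpu. The flush initiator
scans twice (snapshot, then completion check), so the totals double, which
matches the ~2*N figures quoted above. The sketch counts lines touched,
not actual misses, since it ignores whatever is already cached.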