From: will.deacon@arm.com (Will Deacon)
Date: Mon, 5 Oct 2015 17:31:00 +0100
Subject: [PATCH 04/10] arm64: mm: rewrite ASID allocator and MM context-switching code
In-Reply-To: <20150929084614.GY13823@e104818-lin.cambridge.arm.com>
References: <1442494219-6133-1-git-send-email-will.deacon@arm.com>
 <1442494219-6133-5-git-send-email-will.deacon@arm.com>
 <20150929084614.GY13823@e104818-lin.cambridge.arm.com>
Message-ID: <20151005163100.GB3211@arm.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

Hi Catalin,

On Tue, Sep 29, 2015 at 09:46:15AM +0100, Catalin Marinas wrote:
> On Thu, Sep 17, 2015 at 01:50:13PM +0100, Will Deacon wrote:
> > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> > index 030208767185..6af677c4f118 100644
> > --- a/arch/arm64/include/asm/mmu.h
> > +++ b/arch/arm64/include/asm/mmu.h
> > @@ -17,15 +17,11 @@
> >  #define __ASM_MMU_H
> >
> >  typedef struct {
> > -	unsigned int id;
> > -	raw_spinlock_t	id_lock;
> > -	void		*vdso;
> > +	atomic64_t	id;
> > +	void		*vdso;
> >  } mm_context_t;
> >
> > -#define INIT_MM_CONTEXT(name) \
> > -	.context.id_lock = __RAW_SPIN_LOCK_UNLOCKED(name.context.id_lock),
> > -
> > -#define ASID(mm)	((mm)->context.id & 0xffff)
> > +#define ASID(mm)	((mm)->context.id.counter & 0xffff)
>
> If you changed the id to atomic64_t, can you not use atomic64_read()
> here?

I could, but it forces the access to be volatile, which I don't think is
necessary for any of the users of this macro (i.e. the TLB flushing
code).

> > -#define asid_bits(reg) \
> > -	(((read_cpuid(ID_AA64MMFR0_EL1) & 0xf0) >> 2) + 8)
> > +static u32 asid_bits;
> > +static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
> >
> > -#define ASID_FIRST_VERSION	(1 << MAX_ASID_BITS)
> > +static atomic64_t asid_generation;
> > +static unsigned long *asid_map;
> >
> > -static DEFINE_RAW_SPINLOCK(cpu_asid_lock);
> > -unsigned int cpu_last_asid = ASID_FIRST_VERSION;
> > +static DEFINE_PER_CPU(atomic64_t, active_asids);
> > +static DEFINE_PER_CPU(u64, reserved_asids);
> > +static cpumask_t tlb_flush_pending;
> >
> > -/*
> > - * We fork()ed a process, and we need a new context for the child to run in.
> > - */
> > -void __init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> > +#define ASID_MASK		(~GENMASK(asid_bits - 1, 0))
> > +#define ASID_FIRST_VERSION	(1UL << asid_bits)
> > +#define NUM_USER_ASIDS		ASID_FIRST_VERSION
>
> Apart from NUM_USER_ASIDS, I think we can live with constants for
> ASID_MASK and ASID_FIRST_VERSION (assuming 16-bit ASIDs, together with
> some shifts converted to constants), giving marginally more optimal
> code generation which avoids reading asid_bits all the time. We should
> be ok with a 48-bit generation field.

The main reason for writing it like this is that it's easy to test the
code with different ASID sizes -- you just change asid_bits and all of
the masks change accordingly. If we hardcode ASID_MASK then we'll break
flush_context (which uses it to generate a bitmap index) and, given
that ASID_MASK and ASID_FIRST_VERSION are only used on the slow path,
I'd favour the current code over a micro-optimisation.
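
To make the asid_bits dependency concrete, here's roughly what the
rollover side ends up doing with ASID_MASK. This is a simplified sketch
of the flush_context() loop rather than a verbatim quote of the patch,
so treat the details as illustrative:

	/*
	 * Everything scales with asid_bits, so testing with a smaller
	 * ASID space is a one-line change:
	 */
	#define ASID_MASK		(~GENMASK(asid_bits - 1, 0))	/* generation bits */
	#define ASID_FIRST_VERSION	(1UL << asid_bits)
	#define NUM_USER_ASIDS		ASID_FIRST_VERSION

	/*
	 * Simplified flush_context() loop: harvest the active ASIDs and
	 * mark them in the bitmap, using ~ASID_MASK to strip the
	 * generation so the result is a valid index into asid_map.
	 */
	for_each_possible_cpu(i) {
		asid = atomic64_xchg_relaxed(&per_cpu(active_asids, i), 0);
		if (asid == 0)
			asid = per_cpu(reserved_asids, i);
		__set_bit(asid & ~ASID_MASK, asid_map);	/* 0..NUM_USER_ASIDS-1 */
		per_cpu(reserved_asids, i) = asid;
	}

If ASID_MASK were hardcoded for 16-bit ASIDs, that bitmap index would no
longer match a reduced asid_bits.
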
> > +void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
> >  {
> > -	unsigned int asid;
> > -	unsigned int cpu = smp_processor_id();
> > -	struct mm_struct *mm = current->active_mm;
> > +	unsigned long flags;
> > +	u64 asid;
> > +
> > +	asid = atomic64_read(&mm->context.id);
> >
> >  	/*
> > -	 * current->active_mm could be init_mm for the idle thread immediately
> > -	 * after secondary CPU boot or hotplug. TTBR0_EL1 is already set to
> > -	 * the reserved value, so no need to reset any context.
> > +	 * The memory ordering here is subtle. We rely on the control
> > +	 * dependency between the generation read and the update of
> > +	 * active_asids to ensure that we are synchronised with a
> > +	 * parallel rollover (i.e. this pairs with the smp_wmb() in
> > +	 * flush_context).
> >  	 */
> > -	if (mm == &init_mm)
> > -		return;
> > +	if (!((asid ^ atomic64_read(&asid_generation)) >> asid_bits)
> > +	    && atomic64_xchg_relaxed(&per_cpu(active_asids, cpu), asid))
> > +		goto switch_mm_fastpath;
>
> Just trying to make sense of this ;). With a parallel roll-over, we
> have two cases for the asid check above: it either (1) sees the new
> generation or (2) the old one.
>
> (1) is simple since it falls back on the slow path.
>
> (2) means that it goes on and performs an atomic64_xchg. This may
> happen before or after the active_asids xchg in flush_context(). We
> now have two sub-cases:
>
> a) If the code above sees the active_asids value updated in
> flush_context(), it falls back on the slow path since the xchg returns
> 0. Here we are guaranteed that another read of asid_generation returns
> the new value (by the smp_wmb() in flush_context).
>
> b) The code above sees the old active_asids and goes down the fast path
> as if a roll-over hadn't happened (on this CPU). On the CPU doing the
> roll-over, we want the active_asids xchg to see the new asid. That's
> guaranteed by the atomicity of the xchg implementation (otherwise it
> would be case (a) above).
>
> So what the control dependency actually buys us is that a store
> (exclusive) is not architecturally visible if the generation check
> fails. I guess this only works (with respect to the load) because of
> the exclusiveness of the memory accesses.

This is also the case for non-exclusive stores (i.e. a control
dependency from a load to a store creates order), since we don't permit
speculative writes. So here, the control dependency is between the
atomic64_read of the generation and the store-exclusive part of the
xchg. The exclusiveness then guarantees that we replay the
load-exclusive part of the xchg in the face of contention (due to a
parallel rollover). I've appended a rough sketch of the two sides of
this race below, in case it helps.

You seem to have the gist of it, though.

Will
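
As promised above, a heavily simplified sketch of the two sides of the
race. The fast path is as quoted from the patch; the shape of the
rollover side (where the generation bump and the smp_wmb() sit) is my
paraphrase rather than a quote, so treat those lines as illustrative:

	/* CPU A: check_and_switch_context() fast path */
	asid = atomic64_read(&mm->context.id);
	if (!((asid ^ atomic64_read(&asid_generation)) >> asid_bits) &&	/* (1) read generation    */
	    atomic64_xchg_relaxed(&per_cpu(active_asids, cpu), asid))	/* (2) ldxr/stxr on active_asids */
		goto switch_mm_fastpath;
	/*
	 * The stxr in (2) can only become architecturally visible if the
	 * generation check in (1) passed, so (1) -> (2) is ordered by the
	 * control dependency alone. If the stxr fails because CPU B got
	 * in first, the ldxr is replayed, the xchg returns 0 and we drop
	 * onto the slow path.
	 */

	/* CPU B: rollover (slow path), roughly */
	atomic64_add_return(ASID_FIRST_VERSION, &asid_generation);	/* bump generation */
	smp_wmb();	/* new generation visible before active_asids are zapped */
	for_each_possible_cpu(i)
		asid = atomic64_xchg_relaxed(&per_cpu(active_asids, i), 0);	/* pairs with (2) */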