From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ondřej Bílka
Subject: Re: [RFC PATCH] getcpu_cache system call: caching current CPU number (x86)
Date: Tue, 21 Jul 2015 20:00:51 +0200
Message-ID: <20150721180051.GA24053@domone>
References: <1436724386-30909-1-git-send-email-mathieu.desnoyers@efficios.com> <2010227315.699.1437438300542.JavaMail.zimbra@efficios.com> <20150721073053.GA14716@domone> <894137397.137.1437483493715.JavaMail.zimbra@efficios.com> <20150721151613.GA12856@domone> <1350114812.1035.1437500726799.JavaMail.zimbra@efficios.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Content-Disposition: inline
In-Reply-To: <1350114812.1035.1437500726799.JavaMail.zimbra-vg+e7yoeK/dWk0Htik3J/w@public.gmane.org>
Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Mathieu Desnoyers
Cc: Linus Torvalds, Andy Lutomirski, Ben Maurer, Ingo Molnar, libc-alpha, Andrew Morton, linux-api, rostedt, "Paul E. McKenney", Florian Weimer, Josh Triplett, Lai Jiangshan, Paul Turner, Andrew Hunter, Peter Zijlstra
List-Id: linux-api@vger.kernel.org

On Tue, Jul 21, 2015 at 05:45:26PM +0000, Mathieu Desnoyers wrote:
> ----- On Jul 21, 2015, at 11:16 AM, Ondřej Bílka neleai@seznam.cz wrote:
>
> > On Tue, Jul 21, 2015 at 12:58:13PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Jul 21, 2015, at 3:30 AM, Ondřej Bílka neleai@seznam.cz wrote:
> >>
> >> > On Tue, Jul 21, 2015 at 12:25:00AM +0000, Mathieu Desnoyers wrote:
> >> >> >> Does it solve the Wine problem? If Wine uses gs for something and
> >> >> >> calls a function that does this, Wine still goes boom, right?
> >> >> >
> >> >> > So the advantage of just making a global segment descriptor available
> >> >> > is that it's not *that* expensive to just save/restore segments. So
> >> >> > either wine could do it, or any library users would do it.
> >> >> >
> >> >> > But anyway, I'm not sure this is a good idea. The advantage of it is
> >> >> > that the kernel support really is _very_ minimal.
> >> >>
> >> >> Considering that we'd at least also want this feature on ARM and
> >> >> PowerPC 32/64, and that the gs segment selector approach clashes with
> >> >> existing apps (wine), I'm not sure that implementing a gs segment
> >> >> selector based approach to cpu number caching would lead to an overall
> >> >> decrease in complexity if it leads to performance similar to those of
> >> >> portable approaches.
> >> >>
> >> >> I'm perfectly fine with architecture-specific tweaks that lead to
> >> >> fast-path speedups, but if we have to bite the bullet and implement
> >> >> an approach based on TLS and registering a memory area at thread start
> >> >> through a system call on other architectures anyway, it might end up
> >> >> being less complex to add a new system call on x86 too, especially if
> >> >> fast path overhead is similar.
> >> >>
> >> >> But I'm inclined to think that some aspect of the question eludes me,
> >> >> especially given the amount of interest generated by the gs-segment
> >> >> selector approach. What am I missing ?
> >> >>
> >> > As I wrote before you don't have to bite bullet as I said before. It
> >> > suffices to create 128k element array with cpu for each tid, make that
> >> > mmapable file and userspace could get cpu with nearly same performance
> >> > without hacks.
> >>
> >> I don't see how this would be acceptable on memory-constrained embedded
> >> systems. They have multiple cores, and performance requirements, so
> >> having a fast getcpu would be useful there (e.g. telecom industry),
> >> but they clearly cannot afford a 512kB table per process just for that.
> >>
> > Which just means that you need more complicated api and implementation
> > for that but idea stays same.
> > You would need syscalls
> > register/deregister_cpuid_idx that would give you index used instead
> > tid. A kernel would need to handle that many ids could be registered for
> > each thread and resize mmaped file in syscalls.
>
> I feel we're talking past each other here. What I propose is to implement
> a system call that registers a TLS area. It can be invoked at thread start.
> The kernel can then keep the current CPU number within that registered
> area up-to-date. This system call does not care how the TLS is implemented
> underneath.
>
> My understanding is that you are suggesting a way to speed up TLS accesses
> by creating a table indexed by TID. Although it might lead to interesting
> speed ups useful when reading the TLS, I don't see how you proposal is
> useful in addressing the problem of caching the current CPU number (other
> than possibly speeding up TLS accesses).
>
> Or am I missing something fundamental to your proposal ?
>
No, I am still talking about getting the cpu number. My first proposal is
that the kernel allocates a table of current cpu numbers indexed by tid.
A process could mmap that table and get the cpu with cpu_tid_table[tid].
As you said that the size is a problem, I replied that you need to be more
careful. Instead of the tid you would use a different id that you get
from, say, register_cpucache, store it in a tls variable and get the cpu
with cpu_cid_table[cid]. That reduces the space used to only the threads
that use this.

The tls speedup was a side remark: if you implemented a per-cpu page, you
could also use it to speed up tls. Fast tls access and fast access to the
tid are equivalent, as you could easily implement one with the other.
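
To make the cid lookup concrete, here is a rough userspace sketch in C.
It is only a mock under stated assumptions: register_cpucache is a
hypothetical syscall that does not exist, so it is faked with a counter;
cpu_cid_table is an ordinary static array standing in for the mapping the
kernel would export; MAX_CIDS, CID_NONE, mock_kernel_set_cpu and
cached_getcpu are names made up for illustration. The mock allocator is
not thread safe, which a real kernel-side allocator would have to be.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CIDS 64     /* assumed capacity, chosen only for this sketch */
#define CID_NONE (-1)   /* "this thread has not registered yet" */

/* Stand-in for the table the kernel would export through mmap. */
static volatile int32_t cpu_cid_table[MAX_CIDS];

/* Mock allocator state; in the proposal this would live in the kernel. */
static int next_cid;

/* Per-thread slot index, kept in a tls variable after registration. */
static _Thread_local int my_cid = CID_NONE;

/* Mock of the hypothetical register_cpucache syscall: hand out one slot. */
static int register_cpucache(void)
{
    if (next_cid >= MAX_CIDS)
        return CID_NONE;
    return next_cid++;
}

/* Mock of the scheduler updating the slot when the thread migrates. */
static void mock_kernel_set_cpu(int cid, int32_t cpu)
{
    assert(cid >= 0 && cid < MAX_CIDS);
    cpu_cid_table[cid] = cpu;
}

/* Fast path: one tls load and one array load, no syscall once registered. */
static int32_t cached_getcpu(void)
{
    if (my_cid == CID_NONE)
        my_cid = register_cpucache();
    if (my_cid == CID_NONE)
        return -1;      /* table full: caller falls back to a slow path */
    return cpu_cid_table[my_cid];
}
```

A thread calls cached_getcpu() wherever it would otherwise call getcpu;
the first call registers and every later call is just the two loads. Only
threads that register take a slot, which is the space saving over the
tid-indexed variant (128k entries of 4 bytes each is the 512kB mentioned
above).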