From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030318AbWFOMWu (ORCPT ); Thu, 15 Jun 2006 08:22:50 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1030321AbWFOMWt (ORCPT ); Thu, 15 Jun 2006 08:22:49 -0400 Received: from ecfrec.frec.bull.fr ([129.183.4.8]:41614 "EHLO ecfrec.frec.bull.fr") by vger.kernel.org with ESMTP id S1030318AbWFOMWs (ORCPT ); Thu, 15 Jun 2006 08:22:48 -0400 Message-ID: <44915110.2050100@bull.net> Date: Thu, 15 Jun 2006 14:22:40 +0200 From: Zoltan Menyhart User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20040913 X-Accept-Language: en-us, en, fr, hu MIME-Version: 1.0 To: Christoph Lameter Cc: akpm@osdl.org, ak@suse.de, linux-kernel@vger.kernel.org, linux-ia64@vger.kernel.org Subject: Re: light weight counters: race free through local_t? References: <449033B0.1020206@bull.net> In-Reply-To: X-MIMETrack: Itemize by SMTP Server on ECN002/FR/BULL(Release 5.0.12 |February 13, 2003) at 15/06/2006 14:26:32, Serialize by Router on ECN002/FR/BULL(Release 5.0.12 |February 13, 2003) at 15/06/2006 14:26:33, Serialize complete at 15/06/2006 14:26:33 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii; format=flowed Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Christoph Lameter wrote: > Could you do a clock cycle comparision of an > > atomic_inc(__get_per_cpu(var)) > (the fallback of local_t on ia64) > > vs. > > local_irq_save(flags) > __get_per_cpu(var)++ > local_irq_restore(flags) > (ZVC like implementation) > > vs. > > get_per_cpu(var)++ > put_cpu() > (current light weight counters) The only thing I have at hand is a small test for the 1st case: #include #include #define GET_ITC() \ ({ \ unsigned long ia64_intri_res; \ \ asm volatile ("mov %0=ar.itc" : "=r"(ia64_intri_res)); \ ia64_intri_res; \ }) #define N (1000 * 1000 * 100L) atomic_t data; main(int c, char *v[]) { unsigned long cycles; int i; cycles = GET_ITC(); for (i = 0; i < N; i++) ia64_fetchadd4_rel(&data, 1); cycles = GET_ITC() - cycles; printf("%ld %d\n", cycles / N, atomic_read(&data)); } It gives 11 clock cycles. (The loop organizing instructions are "absorbed".) "atomic_inc(__get_per_cpu(var))" compiles into: mov rx = 0xffffffffffffxxxx // &__get_per_cpu(var) ;; fetchadd4.rel ry = [rx], 1 It _should_ take 11 clock cycles, too. (Assuming it is in L2.) For the 2nd case: With a bit of modification, I can measure what "__get_per_cpu(var)++" costs: 7 or 10 clock cycles, depending on if the chance to find the counter in L1 is 100% or 0%: int data; static inline void store(int *addr, int data){ asm volatile ("st4 [%1] = %0" :: "r"(data), "r"(addr) : "memory"); } static inline int load_nt1(int *addr) { int tmp; asm volatile ("ld4.nt1 %0=[%1]" : "=r"(tmp) : "r" (addr)); return tmp; } main(int c, char *v[]) { unsigned long cycles; int i, d; cycles = GET_ITC(); for (i = 0; i < N; i++) // Avoid optimizing out the "st4" store(&data, data + 1); cycles = GET_ITC() - cycles; printf("%ld %d\n", cycles / N, data); cycles = GET_ITC(); for (i = 0; i < N; i++){ // Do not use L1 d = load_nt1(&data); store(&data, d + 1); } cycles = GET_ITC() - cycles; printf("%ld %d\n", cycles / N, data); } "local_irq_save(flags)" compiles into: mov rx = psr ;; // 13 clock cycles rsm 0x4000 ;; // 5 clock cycles "local_irq_restore(flags)" compiles into (at least): ssm 0x4000 // 5 clock cycles For the 3dr case: If CONFIG_PREEMPT, then you need to add 2 * 7 clock cycles for inc_preempt_count() / dec_preempt_count() + some more for preempt_check_resched(). My conclusion: let's stick to atomic counters. Regards, Zoltan