netdev.vger.kernel.org archive mirror
  • [parent not found: <1383751399-10298-1-git-send-email-nhorman@tuxdriver.com>]
  • * Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
    @ 2013-10-30  5:25 Doug Ledford
      2013-10-30 10:27 ` David Laight
      2013-10-30 11:02 ` Neil Horman
      0 siblings, 2 replies; 48+ messages in thread
    From: Doug Ledford @ 2013-10-30  5:25 UTC
      To: Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, Doug Ledford, linux-kernel, netdev
    
    * Neil Horman <nhorman@tuxdriver.com> wrote:
    > 3) The run times are proportionally larger, but still indicate that Parallel ALU
    > execution is hurting rather than helping, which is counter-intuitive.  I'm
    > looking into it, but thought you might want to see these results in case
    > something jumped out at you
    
    So here's my theory about all of this.
    
    I think that the original observation some years back was a fluke caused by
    either a buggy CPU or a CPU design that is no longer used.
    
    The parallel ALU design of this patch seems OK at first glance, but it means
    that two parallel operations are both trying to set/clear both the overflow
    and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
    some CPU in the past had a set of overflow/carry flags per ALU and did some
    sort of magic to make sure that the last state of those flags across multiple
    ALUs that might have been used in parallelizing work were always in the CPU's
    logical EFLAGS register, or the CPU had buggy microcode that allowed two
    ALUs to operate on data at the same time in situations where they would
    potentially stomp on the carry/overflow flags of the other ALU's operations.
    
    It's my theory that all modern CPUs have this behavior fixed, probably via a
    microcode update, and so trying to do parallel ALU operations like this simply
    has no effect because the CPU (rightly so) serializes the operations to keep
    them from clobbering the overflow/carry flags of the other ALU's operations.
    
    My additional theory then is that the reason you see a slowdown from this
    patch is because the attempt to parallelize the ALU operation has caused
    us to write a series of instructions that, once serialized, are non-optimal
    and hinder smooth pipelining of the data (aka going 0*8, 2*8, 4*8, 6*8, 1*8,
    3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
    order, and since we aren't getting the parallel operation we want, this
    is the net result of the patch).
    
    It would explain things anyway.
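
    To make the two orderings concrete, here is a minimal sketch in portable C.
    It is not the assembly from the patch: the helper names (add_carry, fold64,
    csum_in_order, csum_two_chains) are made up for illustration, and the
    end-around carry is written out by hand rather than with adc.  It only shows
    that one in-order carry chain and two interleaved chains over the even and
    odd 8-byte words give the same folded checksum; it says nothing about which
    one a given CPU runs faster.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones-complement add: fold the carry-out back in, like an adc chain does. */
static uint64_t add_carry(uint64_t sum, uint64_t v)
{
    sum += v;
    return sum + (sum < v);   /* (sum < v) is 1 exactly when the add wrapped */
}

/* Reduce a 64-bit ones-complement sum to 16 bits. */
static uint16_t fold64(uint64_t s)
{
    s = (s & 0xffffffffu) + (s >> 32);
    s = (s & 0xffffffffu) + (s >> 32);
    s = (s & 0xffff) + (s >> 16);
    s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}

/* One dependency chain, words visited in memory order: 0, 1, 2, 3, ... */
static uint16_t csum_in_order(const uint64_t *w, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = add_carry(sum, w[i]);
    return fold64(sum);
}

/* Two independent chains: even-indexed words in one, odd-indexed in the other. */
static uint16_t csum_two_chains(const uint64_t *w, size_t n)
{
    uint64_t even = 0, odd = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        even = add_carry(even, w[i]);
        odd  = add_carry(odd, w[i + 1]);
    }
    if (i < n)                            /* leftover word when n is odd */
        even = add_carry(even, w[i]);

    return fold64(add_carry(even, odd));  /* merge the two partial sums */
}
```

    Because ones-complement addition is commutative and associative, both
    functions return the same value for any input; whether the second form
    actually executes in parallel is up to the hardware, which is the open
    question in this thread.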
    

    end of thread, other threads:[~2013-11-07 21:23 UTC | newest]
    
    Thread overview: 48+ messages
         [not found] <1381510298-20572-1-git-send-email-nhorman@tuxdriver.com>
         [not found] ` <20131012172124.GA18241@gmail.com>
         [not found]   ` <20131014202854.GH26880@hmsreliant.think-freely.org>
         [not found]     ` <1381785560.2045.11.camel@edumazet-glaptop.roam.corp.google.com>
         [not found]       ` <1381789127.2045.22.camel@edumazet-glaptop.roam.corp.google.com>
         [not found]         ` <20131017003421.GA31470@hmsreliant.think-freely.org>
    2013-10-17  8:41           ` [PATCH] x86: Run checksumming in parallel accross multiple alu's Ingo Molnar
    2013-10-17 18:19             ` H. Peter Anvin
    2013-10-17 18:48               ` Eric Dumazet
    2013-10-18  6:43               ` Ingo Molnar
    2013-10-28 16:01             ` Neil Horman
    2013-10-28 16:20               ` Ingo Molnar
    2013-10-28 17:49                 ` Neil Horman
    2013-10-28 16:24               ` Ingo Molnar
    2013-10-28 16:49                 ` David Ahern
    2013-10-28 17:46                 ` Neil Horman
    2013-10-28 18:29                   ` Neil Horman
    2013-10-29  8:25                     ` Ingo Molnar
    2013-10-29 11:20                       ` Neil Horman
    2013-10-29 11:30                         ` Ingo Molnar
    2013-10-29 11:49                           ` Neil Horman
    2013-10-29 12:52                             ` Ingo Molnar
    2013-10-29 13:07                               ` Neil Horman
    2013-10-29 13:11                                 ` Ingo Molnar
    2013-10-29 13:20                                   ` Neil Horman
    2013-10-29 14:17                                   ` Neil Horman
    2013-10-29 14:27                                     ` Ingo Molnar
    2013-10-29 20:26                                       ` Neil Horman
    2013-10-31 10:22                                         ` Ingo Molnar
    2013-10-31 14:33                                           ` Neil Horman
    2013-11-01  9:13                                             ` Ingo Molnar
    2013-11-01 14:06                                               ` Neil Horman
    2013-10-29 14:12                               ` David Ahern
         [not found] ` <1383751399-10298-1-git-send-email-nhorman@tuxdriver.com>
         [not found]   ` <1383751399-10298-3-git-send-email-nhorman@tuxdriver.com>
         [not found]     ` <87iow58eqf.fsf@tassilo.jf.intel.com>
    2013-11-07 21:23       ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
    2013-10-30  5:25 [PATCH] x86: Run checksumming in parallel accross multiple alu's Doug Ledford
    2013-10-30 10:27 ` David Laight
    2013-10-30 11:02 ` Neil Horman
    2013-10-30 12:18   ` David Laight
    2013-10-30 13:22     ` Doug Ledford
    2013-10-30 13:35   ` Doug Ledford
    2013-10-30 14:04     ` David Laight
    2013-10-30 14:52     ` Neil Horman
    2013-10-31 18:30     ` Neil Horman
    2013-11-01  9:21       ` Ingo Molnar
    2013-11-01 15:42       ` Ben Hutchings
    2013-11-01 16:08         ` Neil Horman
    2013-11-01 16:16           ` Ben Hutchings
    2013-11-01 16:18           ` David Laight
    2013-11-01 17:37             ` Neil Horman
    2013-11-01 19:45               ` Joe Perches
    2013-11-01 19:58                 ` Neil Horman
    2013-11-01 20:26                   ` Joe Perches
    2013-11-02  2:07                     ` Neil Horman
    2013-11-04  9:47               ` David Laight
    