All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <dada1@cosmosbay.com>
To: Andrew Gallatin <gallatin@myri.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
	David Miller <davem@davemloft.net>,
	brice@myri.com, sgruszka@redhat.com, netdev@vger.kernel.org
Subject: Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
Date: Thu, 30 Apr 2009 10:17:31 +0200	[thread overview]
Message-ID: <49F95E9B.5020005@cosmosbay.com> (raw)
In-Reply-To: <49F88E5A.8070908@myri.com>

Andrew Gallatin a écrit :
> Eric Dumazet wrote:
> 
>>
>> Sure, probably more cache misses or something...
> 
> Yes, that's what I thought.  The code is much more complete,
> and spread out than LRO, and seems to open itself to cache
> misses.
> 
>> You could try a longer oprofile session (with at least one million
> samples)
>> and :
>>
>> opannotate -a vmlinux >/tmp/FILE
>>
>> And select 3 or 4 suspect functions : inet_gro_receive()
> tcp_gro_receive(),
>> skb_gro_receive(), skb_gro_header()
> 
> Here is the opreport -l output from this machine for GRO for a 25 minute
> profiling run:
> 
> 
> samples  %        image name               app name symbol name
> 3742674  32.2793  vmlinux                  vmlinux copy_user_generic_string
> 890179    7.6775  myri10ge.ko              myri10ge myri10ge_poll
> 547572    4.7226  vmlinux                  vmlinux inet_gro_receive
> 477479    4.1181  vmlinux                  vmlinux skb_gro_receive
> 406562    3.5065  vmlinux                  vmlinux free_hot_cold_page
> 396796    3.4222  vmlinux                  vmlinux tcp_gro_receive
> 332364    2.8665  vmlinux                  vmlinux __rmqueue_smallest
> 319455    2.7552  vmlinux                  vmlinux skb_gro_header
> 269040    2.3204  vmlinux                  vmlinux dev_gro_receive
> 252885    2.1810  vmlinux                  vmlinux free_pages_bulk
> 247832    2.1375  vmlinux                  vmlinux
> get_pageblock_flags_group
> 211592    1.8249  myri10ge.ko              myri10ge myri10ge_alloc_rx_pages
> 208867    1.8014  vmlinux                  vmlinux __list_add
> 201491    1.7378  vmlinux                  vmlinux tcp4_gro_receive
> 187591    1.6179  vmlinux                  vmlinux __napi_gro_receive
> 170156    1.4675  vmlinux                  vmlinux get_page_from_freelist
> 116321    1.0032  vmlinux                  vmlinux                 
> list_del
> 107994    0.9314  vmlinux                  vmlinux                  kfree
> 106434    0.9180  vmlinux                  vmlinux skb_copy_datagram_iovec
> 100675    0.8683  vmlinux                  vmlinux                 
> put_page
> 
> And is here is the opannotate -a output for a few GRO functions.  BTW,
> did you mean -s
> rather than -a?   I'd naively think source might be more helpful.  But
> here is
> what you asked for:
> 
> ffffffff80479f20 <inet_gro_receive>: /* inet_gro_receive total: 547572
> 5.2554 */
>  12187  0.1170 :ffffffff80479f20:       push   %r13
>   2611  0.0251 :ffffffff80479f22:       mov    %rdi,%r13
>                :ffffffff80479f25:       push   %r12
>                :ffffffff80479f27:       push   %rbp
>   4031  0.0387 :ffffffff80479f28:       push   %rbx
>                :ffffffff80479f29:       mov    %rsi,%rbx
>                :ffffffff80479f2c:       mov    $0x14,%esi
>   6303  0.0605 :ffffffff80479f31:       mov    %rbx,%rdi
>                :ffffffff80479f34:       sub    $0x8,%rsp
>                :ffffffff80479f38:       callq  ffffffff804357a1
> <skb_gro_header>
>                :ffffffff80479f3d:       test   %rax,%rax
>   2494  0.0239 :ffffffff80479f40:       mov    %rax,%r8
>                :ffffffff80479f43:       je     ffffffff8047a0a4
> <inet_gro_receive+0x184>
>                :ffffffff80479f49:       movzbl 0x9(%rax),%eax
>   2541  0.0244 :ffffffff80479f4d:       mov
> 0xffffffff80d06280(,%rax,8),%r11
>     33 3.2e-04 :ffffffff80479f55:       test   %r11,%r11
>      5 4.8e-05 :ffffffff80479f58:       je     ffffffff8047a0a4
> <inet_gro_receive+0x184>
>  11016  0.1057 :ffffffff80479f5e:       cmpq   $0x0,0x20(%r11)
>    292  0.0028 :ffffffff80479f63:       je     ffffffff8047a0a4
> <inet_gro_receive+0x184>
>      1 9.6e-06 :ffffffff80479f69:       cmpb   $0x45,(%r8)
>   4297  0.0412 :ffffffff80479f6d:       jne    ffffffff8047a0a4
> <inet_gro_receive+0x184>
>   6086  0.0584 :ffffffff80479f73:       mov    $0x5,%eax
>                :ffffffff80479f78:       mov    %r8,%rcx
>  18706  0.1795 :ffffffff80479f7b:       mov    (%rcx),%edx
>    341  0.0033 :ffffffff80479f7d:       sub    $0x4,%eax
>                :ffffffff80479f80:       jbe    ffffffff80479fa6
> <inet_gro_receive+0x86>
>   4609  0.0442 :ffffffff80479f82:       add    0x4(%rcx),%edx
>    398  0.0038 :ffffffff80479f85:       adc    0x8(%rcx),%edx
>                :ffffffff80479f88:       adc    0xc(%rcx),%edx
>   4310  0.0414 :ffffffff80479f8b:       adc    0x10(%rcx),%edx
>    790  0.0076 :ffffffff80479f8e:       lea    0x4(%rcx),%rcx
>                :ffffffff80479f92:       dec    %eax
>   9097  0.0873 :ffffffff80479f94:       jne    ffffffff80479f8b
> <inet_gro_receive+0x6b>
>    541  0.0052 :ffffffff80479f96:       adc    $0x0,%edx
>                :ffffffff80479f99:       mov    %edx,%eax
>   1919  0.0184 :ffffffff80479f9b:       shr    $0x10,%edx
>    535  0.0051 :ffffffff80479f9e:       add    %ax,%dx
>                :ffffffff80479fa1:       adc    $0x0,%edx
>   3633  0.0349 :ffffffff80479fa4:       not    %edx
>    683  0.0066 :ffffffff80479fa6:       test   %dx,%dx
>      1 9.6e-06 :ffffffff80479fa9:       jne    ffffffff8047a0a4
> <inet_gro_receive+0x184>
>   4725  0.0453 :ffffffff80479faf:       movzwl 0x2(%r8),%eax
>   9728  0.0934 :ffffffff80479fb4:       mov    0x68(%rbx),%edx
>      8 7.7e-05 :ffffffff80479fb7:       mov    $0x1,%ebp
>  43000  0.4127 :ffffffff80479fbc:       sub    0x38(%rbx),%edx
>  11149  0.1070 :ffffffff80479fbf:       mov    %eax,%ecx
>                :ffffffff80479fc1:       shl    $0x8,%eax
>  66497  0.6382 :ffffffff80479fc4:       shr    $0x8,%ecx
>    735  0.0071 :ffffffff80479fc7:       or     %ecx,%eax
>                :ffffffff80479fc9:       movzwl %ax,%eax
>   5459  0.0524 :ffffffff80479fcc:       cmp    %edx,%eax
>    522  0.0050 :ffffffff80479fce:       jne    ffffffff80479fdc
> <inet_gro_receive+0xbc>
>                :ffffffff80479fd0:       xor    %ebp,%ebp
>   5373  0.0516 :ffffffff80479fd2:       cmpw   $0x40,0x6(%r8)
>    345  0.0033 :ffffffff80479fd8:       setne  %bpl
>                :ffffffff80479fdc:       movzwl 0x4(%r8),%eax
>   2384  0.0229 :ffffffff80479fe1:       mov    0x0(%r13),%r10
>    631  0.0061 :ffffffff80479fe5:       mov    %eax,%edx
>                :ffffffff80479fe7:       shl    $0x8,%eax
>   3044  0.0292 :ffffffff80479fea:       shr    $0x8,%edx
>    303  0.0029 :ffffffff80479fed:       or     %edx,%eax
>                :ffffffff80479fef:       movzwl %ax,%r12d
>   2747  0.0264 :ffffffff80479ff3:       jmp    ffffffff8047a071
> <inet_gro_receive+0x151>
>   2109  0.0202 :ffffffff80479ff5:       lea    0x38(%r10),%r9
>     12 1.2e-04 :ffffffff80479ff9:       cmpl   $0x0,0x4(%r9)
>     23 2.2e-04 :ffffffff80479ffe:       je     ffffffff8047a06e
> <inet_gro_receive+0x14e>
>   2104  0.0202 :ffffffff8047a000:       mov    0xac(%r10),%edi
>      2 1.9e-05 :ffffffff8047a007:       add    0xc0(%r10),%rdi
>                :ffffffff8047a00e:       mov    0x9(%rdi),%sil
>   2391  0.0229 :ffffffff8047a012:       mov    0x1(%rdi),%al
>      2 1.9e-05 :ffffffff8047a015:       xor    0x9(%r8),%sil
>      7 6.7e-05 :ffffffff8047a019:       xor    0x1(%r8),%al
>   2101  0.0202 :ffffffff8047a01d:       mov    0xc(%rdi),%edx
>      1 9.6e-06 :ffffffff8047a020:       mov    0x10(%rdi),%ecx
>                :ffffffff8047a023:       xor    0xc(%r8),%edx
>   2775  0.0266 :ffffffff8047a027:       xor    0x10(%r8),%ecx
>                :ffffffff8047a02b:       or     %esi,%eax
>                :ffffffff8047a02d:       movzbl %al,%eax
>  62734  0.6021 :ffffffff8047a030:       or     %edx,%ecx
>                :ffffffff8047a032:       or     %eax,%ecx
>                :ffffffff8047a034:       je     ffffffff8047a040
> <inet_gro_receive+0x120>
>                :ffffffff8047a036:       movl   $0x0,0x4(%r9)
>                :ffffffff8047a03e:       jmp    ffffffff8047a06e
> <inet_gro_receive+0x14e>
>   2106  0.0202 :ffffffff8047a040:       movzwl 0x4(%rdi),%edx
>                :ffffffff8047a044:       mov    0x8(%rdi),%al
>                :ffffffff8047a047:       xor    0x8(%r8),%eax
>  64244  0.6166 :ffffffff8047a04b:       mov    %edx,%ecx
>                :ffffffff8047a04d:       shl    $0x8,%edx
>                :ffffffff8047a050:       shr    $0x8,%ecx
>   2072  0.0199 :ffffffff8047a053:       movzbl %al,%eax
>                :ffffffff8047a056:       or     0x8(%r9),%eax
>                :ffffffff8047a05a:       or     %ecx,%edx
>   2629  0.0252 :ffffffff8047a05c:       add    0xc(%r9),%edx
>      2 1.9e-05 :ffffffff8047a060:       movzwl %dx,%edx
>                :ffffffff8047a063:       xor    %r12d,%edx
>  58223  0.5588 :ffffffff8047a066:       or     %edx,%eax
>      3 2.9e-05 :ffffffff8047a068:       or     %ebp,%eax
>                :ffffffff8047a06a:       mov    %eax,0x8(%r9)
>  21878  0.2100 :ffffffff8047a06e:       mov    (%r10),%r10
>   2156  0.0207 :ffffffff8047a071:       test   %r10,%r10
>                :ffffffff8047a074:       jne    ffffffff80479ff5
> <inet_gro_receive+0xd5>
>   3007  0.0289 :ffffffff8047a07a:       mov    0x38(%rbx),%eax
>     61 5.9e-04 :ffffffff8047a07d:       or     %ebp,0x40(%rbx)
>      3 2.9e-05 :ffffffff8047a080:       mov    %rbx,%rsi
>   3091  0.0297 :ffffffff8047a083:       mov    %r13,%rdi
>     41 3.9e-04 :ffffffff8047a086:       add    $0x14,%eax
>                :ffffffff8047a089:       mov    %eax,0x38(%rbx)
>   3704  0.0355 :ffffffff8047a08c:       sub    0xc0(%rbx),%eax
>     33 3.2e-04 :ffffffff8047a092:       add    0xc8(%rbx),%eax
>                :ffffffff8047a098:       mov    %eax,0xa8(%rbx)
>   2468  0.0237 :ffffffff8047a09e:       callq  *0x20(%r11)
>  20011  0.1921 :ffffffff8047a0a2:       jmp    ffffffff8047a0ab
> <inet_gro_receive+0x18b>
>                :ffffffff8047a0a4:       xor    %eax,%eax
>                :ffffffff8047a0a6:       mov    $0x1,%ebp
>  24082  0.2311 :ffffffff8047a0ab:       or     %ebp,0x40(%rbx)
>    626  0.0060 :ffffffff8047a0ae:       pop    %r10
>   1718  0.0165 :ffffffff8047a0b0:       pop    %rbx
>    446  0.0043 :ffffffff8047a0b1:       pop    %rbp
>   4074  0.0391 :ffffffff8047a0b2:       pop    %r12
>   2089  0.0200 :ffffffff8047a0b4:       pop    %r13
>    434  0.0042 :ffffffff8047a0b6:       retq
> 
> 
> 
> 
> ffffffff80430ea9 <skb_gro_receive>: /* skb_gro_receive total: 477479
> 4.5827 */
>   2158  0.0207 :ffffffff80430ea9:       push   %r15
>   2492  0.0239 :ffffffff80430eab:       mov    %rdi,%r15
>                :ffffffff80430eae:       push   %r14
>                :ffffffff80430eb0:       push   %r13
>   2432  0.0233 :ffffffff80430eb2:       push   %r12
>      1 9.6e-06 :ffffffff80430eb4:       push   %rbp
>      1 9.6e-06 :ffffffff80430eb5:       mov    %rsi,%rbp
>   2430  0.0233 :ffffffff80430eb8:       push   %rbx
>                :ffffffff80430eb9:       sub    $0x8,%rsp
>                :ffffffff80430ebd:       mov    0x68(%rsi),%ecx
>   2420  0.0232 :ffffffff80430ec0:       mov    (%rdi),%r12
>      1 9.6e-06 :ffffffff80430ec3:       mov    %ecx,%r14d
>      1 9.6e-06 :ffffffff80430ec6:       sub    0x38(%rsi),%r14d
>   2317  0.0222 :ffffffff80430eca:       mov    %r14d,%eax
>      1 9.6e-06 :ffffffff80430ecd:       add    0x68(%r12),%eax
>      1 9.6e-06 :ffffffff80430ed2:       cmp    $0xffff,%eax
>   3865  0.0371 :ffffffff80430ed7:       ja     ffffffff80431261
> <skb_gro_receive+0x3b8>
>                :ffffffff80430edd:       mov    0xb8(%r12),%eax
>                :ffffffff80430ee5:       mov    0xc0(%r12),%rdx
>   8082  0.0776 :ffffffff80430eed:       lea    (%rdx,%rax,1),%rsi
>                :ffffffff80430ef1:       cmpq   $0x0,0x18(%rsi)
>      2 1.9e-05 :ffffffff80430ef6:       jne    ffffffff804311ab
> <skb_gro_receive+0x302>
>   9249  0.0888 :ffffffff80430efc:       mov    %ecx,%edi
>                :ffffffff80430efe:       sub    0x6c(%rbp),%edi
>      6 5.8e-05 :ffffffff80430f01:       cmp    0x38(%rbp),%edi
>   3104  0.0298 :ffffffff80430f04:       ja     ffffffff80430fe2
> <skb_gro_receive+0x139>
>      2 1.9e-05 :ffffffff80430f0a:       mov    0xb8(%rbp),%ecx
>                :ffffffff80430f10:       movzwl 0x4(%rsi),%edx
>   8825  0.0847 :ffffffff80430f14:       add    0xc0(%rbp),%rcx
>                :ffffffff80430f1b:       movzwl 0x4(%rcx),%eax
>     21 2.0e-04 :ffffffff80430f1f:       add    %edx,%eax
>  19668  0.1888 :ffffffff80430f21:       cmp    $0x12,%eax
>      1 9.6e-06 :ffffffff80430f24:       ja     ffffffff80431261
> <skb_gro_receive+0x3b8>
>                :ffffffff80430f2a:       mov    0x38(%rcx),%eax
>   1974  0.0189 :ffffffff80430f2d:       add    0x38(%rbp),%eax
>                :ffffffff80430f30:       cld
>                :ffffffff80430f31:       sub    %edi,%eax
>   7666  0.0736 :ffffffff80430f33:       mov    %eax,0x38(%rcx)
>      2 1.9e-05 :ffffffff80430f36:       mov    0xb8(%rbp),%edx
>                :ffffffff80430f3c:       add    0xc0(%rbp),%rdx

Compiler has hard time to optimize these function apparently... :(

skb_shinfo(skb) & skb_shinfo(p) are evaluated many times.

>  52468  0.5036 :ffffffff80430f43:       mov    0x3c(%rdx),%eax
>      2 1.9e-05 :ffffffff80430f46:       add    0x68(%rbp),%eax
>      1 9.6e-06 :ffffffff80430f49:       sub    0x6c(%rbp),%eax
>   6592  0.0633 :ffffffff80430f4c:       sub    0x38(%rbp),%eax
>                :ffffffff80430f4f:       mov    %eax,0x3c(%rdx)
>                :ffffffff80430f52:       mov    0xb8(%r12),%eax
>  23018  0.2209 :ffffffff80430f5a:       add    0xc0(%r12),%rax
>      1 9.6e-06 :ffffffff80430f62:       mov    0xb8(%rbp),%esi
>                :ffffffff80430f68:       add    0xc0(%rbp),%rsi
>   8477  0.0814 :ffffffff80430f6f:       movzwl 0x4(%rax),%edi
>      6 5.8e-05 :ffffffff80430f73:       movzwl 0x4(%rsi),%ecx
>                :ffffffff80430f77:       add    $0x30,%rsi
>  21338  0.2048 :ffffffff80430f7b:       shl    $0x4,%rdi
>      3 2.9e-05 :ffffffff80430f7f:       lea    0x30(%rdi,%rax,1),%rdi
>      1 9.6e-06 :ffffffff80430f84:       shl    $0x4,%rcx
> 150632  1.4457 :ffffffff80430f88:       rep movsb %ds:(%rsi),%es:(%rdi)

ouch... What stupid compiler... should use movsq here :(

we could try to inline the likely case of one fragment copied...


>   3988  0.0383 :ffffffff80430f8a:       mov    0xb8(%r12),%eax
>   2015  0.0193 :ffffffff80430f92:       mov    0xb8(%rbp),%ecx
>     11 1.1e-04 :ffffffff80430f98:       add    0xc0(%r12),%rax
>      8 7.7e-05 :ffffffff80430fa0:       mov    0xc0(%rbp),%rdx
>   3295  0.0316 :ffffffff80430fa7:       mov    0x4(%rdx,%rcx,1),%edx
>                :ffffffff80430fab:       add    %dx,0x4(%rax)
>      8 7.7e-05 :ffffffff80430faf:       mov    0xb8(%rbp),%edx
>   2507  0.0241 :ffffffff80430fb5:       mov    0xc0(%rbp),%rax
>                :ffffffff80430fbc:       movw   $0x0,0x4(%rax,%rdx,1)
>   3233  0.0310 :ffffffff80430fc3:       mov    0x6c(%rbp),%eax
>      1 9.6e-06 :ffffffff80430fc6:       sub    %eax,0xd0(%rbp)
>                :ffffffff80430fcc:       sub    %eax,0x68(%rbp)
>  41540  0.3987 :ffffffff80430fcf:       movl   $0x0,0x6c(%rbp)
>                :ffffffff80430fd6:       movl   $0x1,0x48(%rbp)
>                :ffffffff80430fdd:       jmpq   ffffffff8043123f
> <skb_gro_receive+0x396>
>                :ffffffff80430fe2:       mov    0xc8(%r12),%rax
>                :ffffffff80430fea:       mov    0x20(%r12),%rdi
>                :ffffffff80430fef:       mov    %eax,%r13d
>                :ffffffff80430ff2:       sub    %edx,%r13d
>                :ffffffff80430ff5:       mov    $0x20,%edx
>                :ffffffff80430ffa:       mov    %r13d,%esi
>                :ffffffff80430ffd:       add    0x38(%r12),%esi
>                :ffffffff80431002:       callq  ffffffff8042ffe0

...

>                :ffffffff80431223:       ud2a
>                :ffffffff80431225:       jmp    ffffffff80431225
> <skb_gro_receive+0x37c>
>                :ffffffff80431227:       mov    0xb8(%rbp),%eax
>                :ffffffff8043122d:       orb    $0x10,0x7c(%rbp)
>                :ffffffff80431231:       add    0xc0(%rbp),%rax
>                :ffffffff80431238:       lock addl $0x10000,(%rax)
>  34919  0.3351 :ffffffff8043123f:       add    %r14d,0x6c(%r12)
>   1989  0.0191 :ffffffff80431244:       add    %r14d,0xd0(%r12)
>      1 9.6e-06 :ffffffff8043124c:       xor    %eax,%eax
>                :ffffffff8043124e:       add    %r14d,0x68(%r12)
>  20605  0.1978 :ffffffff80431253:       incl   0x44(%r12)
>                :ffffffff80431258:       movl   $0x1,0x3c(%rbp)
>                :ffffffff8043125f:       jmp    ffffffff80431266
> <skb_gro_receive+0x3bd>
>                :ffffffff80431261:       mov    $0xfffffff9,%eax
>  13260  0.1273 :ffffffff80431266:       pop    %r11
>   1946  0.0187 :ffffffff80431268:       pop    %rbx
>   2010  0.0193 :ffffffff80431269:       pop    %rbp
>     64 6.1e-04 :ffffffff8043126a:       pop    %r12
>   1948  0.0187 :ffffffff8043126c:       pop    %r13
>   2746  0.0264 :ffffffff8043126e:       pop    %r14
>     57 5.5e-04 :ffffffff80431270:       pop    %r15
>   2067  0.0198 :ffffffff80431272:       retq
> 
> ffffffff80460663 <tcp_gro_receive>: /* tcp_gro_receive total: 396796
> 3.8083 */
>   4433  0.0425 :ffffffff80460663:       push   %r15
>   2204  0.0212 :ffffffff80460665:       push   %r14
>                :ffffffff80460667:       mov    %rdi,%r14
>                :ffffffff8046066a:       push   %r13
>   2275  0.0218 :ffffffff8046066c:       push   %r12
>                :ffffffff8046066e:       mov    %rsi,%r12
>                :ffffffff80460671:       mov    $0x14,%esi
>   5933  0.0569 :ffffffff80460676:       mov    %r12,%rdi
>                :ffffffff80460679:       push   %rbp
>                :ffffffff8046067a:       push   %rbx
>   2180  0.0209 :ffffffff8046067b:       sub    $0x8,%rsp
>                :ffffffff8046067f:       callq  ffffffff804357a1
> <skb_gro_header>
>                :ffffffff80460684:       test   %rax,%rax
>   3218  0.0309 :ffffffff80460687:       je     ffffffff804607ed
> <tcp_gro_receive+0x18a>
>                :ffffffff8046068d:       mov    0xc(%rax),%al
>      1 9.6e-06 :ffffffff80460690:       shr    $0x4,%al
>   3528  0.0339 :ffffffff80460693:       movzbl %al,%eax
>                :ffffffff80460696:       lea    0x0(,%rax,4),%r13d
>      1 9.6e-06 :ffffffff8046069e:       cmp    $0x13,%r13d
>   2773  0.0266 :ffffffff804606a2:       jbe    ffffffff804607ed
> <tcp_gro_receive+0x18a>
>                :ffffffff804606a8:       mov    %r13d,%esi
>                :ffffffff804606ab:       mov    %r12,%rdi
>   3327  0.0319 :ffffffff804606ae:       callq  ffffffff804357a1
> <skb_gro_header>
>                :ffffffff804606b3:       test   %rax,%rax
>   2094  0.0201 :ffffffff804606b6:       mov    %rax,%r8
>                :ffffffff804606b9:       je     ffffffff804607ed
> <tcp_gro_receive+0x18a>
>                :ffffffff804606bf:       lea    0x38(%r12),%r15
>   2245  0.0215 :ffffffff804606c4:       add    %r13d,(%r15)
>                :ffffffff804606c7:       mov    0x68(%r12),%ebp
>                :ffffffff804606cc:       sub    0x38(%r12),%ebp
>   2394  0.0230 :ffffffff804606d1:       mov    0xc(%rax),%ebx
>                :ffffffff804606d4:       jmp    ffffffff80460710
> <tcp_gro_receive+0xad>
>   2111  0.0203 :ffffffff804606d6:       lea    0x38(%rdi),%r9
>      3 2.9e-05 :ffffffff804606da:       cmpl   $0x0,0x4(%r9)
>     21 2.0e-04 :ffffffff804606df:       je     ffffffff8046070d
> <tcp_gro_receive+0xaa>
>   2592  0.0249 :ffffffff804606e1:       mov    0xa8(%rdi),%eax
>                :ffffffff804606e7:       mov    0xc0(%rdi),%r10
>                :ffffffff804606ee:       mov    0x2(%r8),%dx
>   2440  0.0234 :ffffffff804606f3:       lea    (%r10,%rax,1),%rcx
>                :ffffffff804606f7:       mov    (%r8),%eax
>      1 9.6e-06 :ffffffff804606fa:       xor    0x2(%rcx),%dx
>   6275  0.0602 :ffffffff804606fe:       xor    (%rcx),%eax
>      3 2.9e-05 :ffffffff80460700:       or     %ax,%dx
>                :ffffffff80460703:       je     ffffffff8046071d
> <tcp_gro_receive+0xba>
>                :ffffffff80460705:       movl   $0x0,0x4(%r9)
>                :ffffffff8046070d:       mov    %rdi,%r14
>   2920  0.0280 :ffffffff80460710:       mov    (%r14),%rdi
>     18 1.7e-04 :ffffffff80460713:       test   %rdi,%rdi
>      2 1.9e-05 :ffffffff80460716:       jne    ffffffff804606d6
> <tcp_gro_receive+0x73>
>     33 3.2e-04 :ffffffff80460718:       jmpq   ffffffff80460807
> <tcp_gro_receive+0x1a4>
>   4253  0.0408 :ffffffff8046071d:       mov    0xe(%r8),%ax
>   2125  0.0204 :ffffffff80460722:       xor    0xe(%rcx),%ax
>      2 1.9e-05 :ffffffff80460726:       mov    %ebx,%edx
>                :ffffffff80460728:       and    $0x8000,%edx
>   8066  0.0774 :ffffffff8046072e:       or     0x8(%r9),%edx
>                :ffffffff80460732:       movzwl %ax,%esi
>                :ffffffff80460735:       mov    0x8(%r8),%eax
>  64740  0.6214 :ffffffff80460739:       xor    0x8(%rcx),%eax
>                :ffffffff8046073c:       or     %eax,%esi
>                :ffffffff8046073e:       mov    %ebx,%eax
>   2084  0.0200 :ffffffff80460740:       xor    0xc(%rcx),%eax
>                :ffffffff80460743:       and    $0x76,%ah
>                :ffffffff80460746:       or     %eax,%edx
>   2132  0.0205 :ffffffff80460748:       or     %edx,%esi
>                :ffffffff8046074a:       mov    $0x14,%edx
>                :ffffffff8046074f:       jmp    ffffffff8046075e
> <tcp_gro_receive+0xfb>
>                :ffffffff80460751:       movslq %edx,%rax
>                :ffffffff80460754:       add    $0x4,%edx
>                :ffffffff80460757:       mov    (%r8,%rax,1),%esi
>                :ffffffff8046075b:       xor    (%rcx,%rax,1),%esi
>   3670  0.0352 :ffffffff8046075e:       test   %esi,%esi
>   2162  0.0208 :ffffffff80460760:       jne    ffffffff80460767
> <tcp_gro_receive+0x104>
>                :ffffffff80460762:       cmp    %r13d,%edx
>      1 9.6e-06 :ffffffff80460765:       jb     ffffffff80460751
> <tcp_gro_receive+0xee>
>  50209  0.4819 :ffffffff80460767:       mov    0xb8(%rdi),%eax
>   4473  0.0429 :ffffffff8046076d:       mov    0x4(%rcx),%edx
>                :ffffffff80460770:       bswap  %edx
>   9554  0.0917 :ffffffff80460772:       mov    0x4(%r8),%ecx
>                :ffffffff80460776:       bswap  %ecx
>                :ffffffff80460778:       movzwl 0x6(%r10,%rax,1),%r13d
>   7572  0.0727 :ffffffff8046077e:       mov    0x68(%rdi),%eax
>                :ffffffff80460781:       sub    0x38(%rdi),%eax
>                :ffffffff80460784:       add    %edx,%eax
>   9803  0.0941 :ffffffff80460786:       xor    %eax,%ecx
>                :ffffffff80460788:       cmp    %r13d,%ebp
>                :ffffffff8046078b:       seta   %al
>  50608  0.4857 :ffffffff8046078e:       test   %ebp,%ebp
>                :ffffffff80460790:       sete   %dl
>                :ffffffff80460793:       or     %edx,%eax
>   3161  0.0303 :ffffffff80460795:       movzbl %al,%eax
>                :ffffffff80460798:       or     %eax,%esi
>                :ffffffff8046079a:       or     %esi,%ecx
>   3278  0.0315 :ffffffff8046079c:       jne    ffffffff804607f6
> <tcp_gro_receive+0x193>
>                :ffffffff8046079e:       mov    %r12,%rsi
>      2 1.9e-05 :ffffffff804607a1:       mov    %r14,%rdi
>   2579  0.0248 :ffffffff804607a4:       callq  ffffffff80430ea9
> <skb_gro_receive>
>   2059  0.0198 :ffffffff804607a9:       test   %eax,%eax
>     49 4.7e-04 :ffffffff804607ab:       jne    ffffffff804607f6
> <tcp_gro_receive+0x193>
>                :ffffffff804607ad:       mov    (%r14),%rcx
>   1945  0.0187 :ffffffff804607b0:       mov    %ebx,%edx
>      3 2.9e-05 :ffffffff804607b2:       and    $0x900,%edx
>                :ffffffff804607b8:       mov    0xa8(%rcx),%eax
>   2530  0.0243 :ffffffff804607be:       add    0xc0(%rcx),%rax
>      3 2.9e-05 :ffffffff804607c5:       or     %edx,0xc(%rax)
>     13 1.2e-04 :ffffffff804607c8:       xor    %eax,%eax
>   4881  0.0468 :ffffffff804607ca:       cmp    %r13d,%ebp
>                :ffffffff804607cd:       setb   %al
>                :ffffffff804607d0:       and    $0x2f00,%ebx
>   1912  0.0184 :ffffffff804607d6:       or     %ebx,%eax
>                :ffffffff804607d8:       test   %rcx,%rcx
>                :ffffffff804607db:       je     ffffffff80460816
> <tcp_gro_receive+0x1b3>
>   2163  0.0208 :ffffffff804607dd:       cmpl   $0x0,0x4(%r15)
>    136  0.0013 :ffffffff804607e2:       je     ffffffff804607e8
> <tcp_gro_receive+0x185>
>   2455  0.0236 :ffffffff804607e4:       test   %eax,%eax
>     57 5.5e-04 :ffffffff804607e6:       je     ffffffff80460816
> <tcp_gro_receive+0x1b3>
>    148  0.0014 :ffffffff804607e8:       mov    %r14,%rdi
>    735  0.0071 :ffffffff804607eb:       jmp    ffffffff80460818
> <tcp_gro_receive+0x1b5>
>                :ffffffff804607ed:       xor    %edi,%edi
>                :ffffffff804607ef:       mov    $0x1,%eax
>                :ffffffff804607f4:       jmp    ffffffff80460818
> <tcp_gro_receive+0x1b5>
>     68 6.5e-04 :ffffffff804607f6:       xor    %eax,%eax
>      1 9.6e-06 :ffffffff804607f8:       test   %ebp,%ebp
>     67 6.4e-04 :ffffffff804607fa:       sete   %al
>     47 4.5e-04 :ffffffff804607fd:       and    $0x2f00,%ebx
>                :ffffffff80460803:       or     %ebx,%eax
>     58 5.6e-04 :ffffffff80460805:       jmp    ffffffff804607dd
> <tcp_gro_receive+0x17a>
>    122  0.0012 :ffffffff80460807:       xor    %eax,%eax
>      9 8.6e-05 :ffffffff80460809:       test   %ebp,%ebp
>                :ffffffff8046080b:       sete   %al
>     67 6.4e-04 :ffffffff8046080e:       and    $0x2f00,%ebx
>      6 5.8e-05 :ffffffff80460814:       or     %ebx,%eax
>   1995  0.0191 :ffffffff80460816:       xor    %edi,%edi
>     68 6.5e-04 :ffffffff80460818:       or     %eax,0x40(%r12)
>    275  0.0026 :ffffffff8046081d:       mov    %rdi,%rax
>   2037  0.0196 :ffffffff80460820:       pop    %r11
>    191  0.0018 :ffffffff80460822:       pop    %rbx
>   4346  0.0417 :ffffffff80460823:       pop    %rbp
>   4739  0.0455 :ffffffff80460824:       pop    %r12
>    167  0.0016 :ffffffff80460826:       pop    %r13
>  23735  0.2278 :ffffffff80460828:       pop    %r14
>  56070  0.5381 :ffffffff8046082a:       pop    %r15
>    140  0.0013 :ffffffff8046082c:       retq
> 
> ffffffff804357a1 <skb_gro_header>: /* skb_gro_header total: 319455
> 3.0660 */
>  13604  0.1306 :ffffffff804357a1:       push   %rbp
>  14938  0.1434 :ffffffff804357a2:       push   %rbx
>                :ffffffff804357a3:       mov    %rdi,%rbx
>                :ffffffff804357a6:       sub    $0x8,%rsp
>  18392  0.1765 :ffffffff804357aa:       mov    0x38(%rdi),%ebp
>                :ffffffff804357ad:       mov    0x68(%rdi),%edx
>      1 9.6e-06 :ffffffff804357b0:       add    %ebp,%esi
>  20559  0.1973 :ffffffff804357b2:       mov    %edx,%edi
>                :ffffffff804357b4:       sub    0x6c(%rbx),%edi
>                :ffffffff804357b7:       jne    ffffffff804357cc
> <skb_gro_header+0x2b>
>  36626  0.3515 :ffffffff804357b9:       mov    0xb8(%rbx),%ecx
>      2 1.9e-05 :ffffffff804357bf:       mov    0xc0(%rbx),%rax
>      3 2.9e-05 :ffffffff804357c6:       cmp    %esi,0x3c(%rax,%rcx,1)
>  18577  0.1783 :ffffffff804357ca:       jae    ffffffff804357ee
> <skb_gro_header+0x4d>
>                :ffffffff804357cc:       cmp    %edi,%esi
>                :ffffffff804357ce:       jbe    ffffffff804357e3
> <skb_gro_header+0x42>
>                :ffffffff804357d0:       cmp    %edx,%esi
>                :ffffffff804357d2:       ja     ffffffff80435833
> <skb_gro_header+0x92>
>                :ffffffff804357d4:       sub    %edi,%esi
>                :ffffffff804357d6:       mov    %rbx,%rdi
>                :ffffffff804357d9:       callq  ffffffff8042f6ee
> <__pskb_pull_tail>
>                :ffffffff804357de:       test   %rax,%rax
>                :ffffffff804357e1:       je     ffffffff80435833
> <skb_gro_header+0x92>
>                :ffffffff804357e3:       mov    %ebp,%eax
>                :ffffffff804357e5:       add    0xc8(%rbx),%rax
>                :ffffffff804357ec:       jmp    ffffffff80435835
> <skb_gro_header+0x94>
>      3 2.9e-05 :ffffffff804357ee:       add    0xc0(%rbx),%rcx
>  25999  0.2495 :ffffffff804357f5:       mov    $0x1e0000000000,%rax
>                :ffffffff804357ff:       mov    $0x6db6db6db6db6db7,%rdx

OK, sizeof(struct page) is 0x38, we know it hurts some workloads.
It would be better to waste few bytes but to align them on cache lines here.

>  44557  0.4276 :ffffffff80435809:       add    0x30(%rcx),%rax
>                :ffffffff8043580d:       sar    $0x3,%rax
>  12588  0.1208 :ffffffff80435811:       imul   %rdx,%rax
>  10104  0.0970 :ffffffff80435815:       mov    $0xffff880000000000,%rdx
>                :ffffffff8043581f:       shl    $0xc,%rax
>                :ffffffff80435823:       add    %rdx,%rax
>  16404  0.1574 :ffffffff80435826:       mov    0x38(%rcx),%edx
>                :ffffffff80435829:       add    %rdx,%rax
>                :ffffffff8043582c:       mov    %ebp,%edx
>  15264  0.1465 :ffffffff8043582e:       add    %rdx,%rax
>                :ffffffff80435831:       jmp    ffffffff80435835
> <skb_gro_header+0x94>
>                :ffffffff80435833:       xor    %eax,%eax
>  45844  0.4400 :ffffffff80435835:       pop    %r10
>      2 1.9e-05 :ffffffff80435837:       pop    %rbx
>  12844  0.1233 :ffffffff80435838:       pop    %rbp
>  13144  0.1262 :ffffffff80435839:       retq
> 

I wonder if you could try to enlarge 'struct page' by 8 bytes and redo a test...

Here is a patch to combine two ideas. But it wont allow GRO to go much faster I guess :(

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0e80e26..44e97e2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -98,6 +98,7 @@ struct page {
 #ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
 	unsigned long debug_flags;	/* Use atomic bitops on this */
 #endif
+	unsigned long _pad; /* so that sizeof(struct page) is 64 bytes */
 };
 
 /*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ce6356c..74a6900 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2660,28 +2660,37 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 	struct sk_buff *nskb;
 	unsigned int headroom;
 	unsigned int len = skb_gro_len(skb);
+	int delta;
+	struct skb_shared_info *skb_shinfo_p = skb_shinfo(p);
 
 	if (p->len + len >= 65536)
 		return -E2BIG;
 
-	if (skb_shinfo(p)->frag_list)
+	delta = skb_gro_offset(skb) - skb_headlen(skb);
+	if (skb_shinfo_p->frag_list)
 		goto merge;
-	else if (skb_headlen(skb) <= skb_gro_offset(skb)) {
-		if (skb_shinfo(p)->nr_frags + skb_shinfo(skb)->nr_frags >
+	if (delta >= 0) {
+		struct skb_shared_info *skb_shinfo_skb = skb_shinfo(skb);
+
+		if (skb_shinfo_p->nr_frags + skb_shinfo_skb->nr_frags >
 		    MAX_SKB_FRAGS)
 			return -E2BIG;
 
-		skb_shinfo(skb)->frags[0].page_offset +=
-			skb_gro_offset(skb) - skb_headlen(skb);
-		skb_shinfo(skb)->frags[0].size -=
-			skb_gro_offset(skb) - skb_headlen(skb);
-
-		memcpy(skb_shinfo(p)->frags + skb_shinfo(p)->nr_frags,
-		       skb_shinfo(skb)->frags,
-		       skb_shinfo(skb)->nr_frags * sizeof(skb_frag_t));
+		skb_shinfo_skb->frags[0].page_offset += delta;
+		skb_shinfo_skb->frags[0].size -= delta;
 
-		skb_shinfo(p)->nr_frags += skb_shinfo(skb)->nr_frags;
-		skb_shinfo(skb)->nr_frags = 0;
+		if (likely(skb_shinfo_skb->nr_frags == 1)) {
+			memcpy(skb_shinfo_p->frags + skb_shinfo_p->nr_frags,
+				skb_shinfo_skb->frags,
+				sizeof(skb_frag_t));
+			skb_shinfo_p->nr_frags += 1;
+		} else {
+			memcpy(skb_shinfo_p->frags + skb_shinfo_p->nr_frags,
+			       skb_shinfo_skb->frags,
+			       skb_shinfo_skb->nr_frags * sizeof(skb_frag_t));
+			skb_shinfo_p->nr_frags += skb_shinfo_skb->nr_frags;
+		}
+		skb_shinfo_skb->nr_frags = 0;
 
 		skb->truesize -= skb->data_len;
 		skb->len -= skb->data_len;
@@ -2726,12 +2735,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 
 	p = nskb;
 
+	delta = skb_gro_offset(skb) - skb_headlen(skb);
 merge:
-	if (skb_gro_offset(skb) > skb_headlen(skb)) {
-		skb_shinfo(skb)->frags[0].page_offset +=
-			skb_gro_offset(skb) - skb_headlen(skb);
-		skb_shinfo(skb)->frags[0].size -=
-			skb_gro_offset(skb) - skb_headlen(skb);
+	if (delta > 0) {
+		skb_shinfo(skb)->frags[0].page_offset += delta;
+		skb_shinfo(skb)->frags[0].size -= delta;
 		skb_gro_reset_offset(skb);
 		skb_gro_pull(skb, skb_headlen(skb));
 	}



  parent reply	other threads:[~2009-04-30  8:18 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-15  8:09 [PATCH] myr10ge: again fix lro_gen_skb() alignment Stanislaw Gruszka
2009-04-15  9:28 ` David Miller
2009-04-15  9:48   ` Brice Goglin
2009-04-15 10:02     ` David Miller
2009-04-15 13:01       ` Andrew Gallatin
2009-04-15 21:04         ` Andrew Gallatin
2009-04-15 23:42           ` David Miller
2009-04-16  8:50             ` Herbert Xu
2009-04-16  9:02               ` David Miller
2009-04-21 19:19               ` Andrew Gallatin
2009-04-22 10:48                 ` Herbert Xu
2009-04-22 15:37                   ` Andrew Gallatin
2009-04-24  5:45                     ` Herbert Xu
2009-04-24 12:45                       ` Andrew Gallatin
2009-04-24 12:51                         ` Herbert Xu
2009-04-24 17:13                         ` Rick Jones
2009-04-24 16:16                       ` Andrew Gallatin
2009-04-24 16:30                         ` Herbert Xu
2009-04-24 16:31                           ` Herbert Xu
2009-04-27  8:05                         ` Herbert Xu
2009-04-27  8:07                           ` Herbert Xu
2009-04-27  9:32                             ` David Miller
2009-04-27 11:01                               ` Herbert Xu
2009-04-27 12:45                             ` David Miller
2009-04-27 12:45                           ` David Miller
2009-04-28  6:12                           ` Herbert Xu
2009-04-28 15:00                             ` Andrew Gallatin
2009-04-28 15:02                               ` David Miller
2009-04-28 15:20                               ` Herbert Xu
2009-04-28 15:44                                 ` Andrew Gallatin
2009-04-28 21:12                                 ` Andrew Gallatin
2009-04-29 13:42                                   ` Andrew Gallatin
2009-04-29 13:53                                     ` Eric Dumazet
2009-04-29 14:18                                       ` Andrew Gallatin
2009-04-29 15:26                                         ` Eric Dumazet
2009-04-29 17:28                                           ` Andrew Gallatin
2009-04-30  8:10                                             ` Herbert Xu
2009-04-30  8:14                                               ` Herbert Xu
2009-04-30  8:17                                             ` Eric Dumazet [this message]
2009-04-30 19:14                                               ` Andrew Gallatin
2009-04-23  8:00                 ` Herbert Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49F95E9B.5020005@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=brice@myri.com \
    --cc=davem@davemloft.net \
    --cc=gallatin@myri.com \
    --cc=herbert@gondor.apana.org.au \
    --cc=netdev@vger.kernel.org \
    --cc=sgruszka@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.