From mboxrd@z Thu Jan  1 00:00:00 1970
From: Michal Schmidt
Subject: significant napi_gro_frags() slowdown
Date: Thu, 27 Mar 2014 17:39:11 +0100
Message-ID: <5334542F.1050702@redhat.com>
To: netdev@vger.kernel.org
Cc: Jerry Chu, Eric Dumazet, "David S. Miller"

Hello,

I received a report about a performance regression on a simple workload:
receiving a single TCP stream using a be2net NIC. Previously the reporter
was able to reach more than 9 Gb/s; now he sees less than 7 Gb/s.

On my system the performance loss was not as obvious, but profiling with
perf showed significant time spent copying memory in __pskb_pull_tail().
Here is an example perf histogram from 3.14-rc8:

-  14.03%  swapper    [kernel.kallsyms]  [k] memcpy
   - memcpy
      - 99.99% __pskb_pull_tail
         - 46.16% tcp_gro_receive
              tcp4_gro_receive
              inet_gro_receive
              dev_gro_receive
              napi_gro_frags
              be_process_rx
              be_poll
              net_rx_action
              __do_softirq
            + irq_exit
         - 31.18% napi_gro_frags
              be_process_rx
              be_poll
              net_rx_action
              __do_softirq
              irq_exit
              do_IRQ
            + ret_from_intr
         - 22.66% inet_gro_receive
              dev_gro_receive
              napi_gro_frags
              be_process_rx
              be_poll
              net_rx_action
              __do_softirq
              irq_exit
              do_IRQ
            + ret_from_intr
-  13.44%  netserver  [kernel.kallsyms]  [k] copy_user_generic_string
   - copy_user_generic_string
      - skb_copy_datagram_iovec
         + 56.70% skb_copy_datagram_iovec
         + 32.27% tcp_recvmsg
         + 11.03% tcp_rcv_established
-   5.64%  swapper    [kernel.kallsyms]  [k] __pskb_pull_tail
   - __pskb_pull_tail
      + 48.58% tcp_gro_receive
      + 24.48% inet_gro_receive
      + 24.11% napi_gro_frags
      +  1.28% tcp4_gro_receive
      +  0.91% be_process_rx
      +  0.64% dev_gro_receive
+   5.13%  swapper    [kernel.kallsyms]  [k] skb_copy_bits
...

Bisection identified this first bad commit:

commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36
Author: Jerry Chu
Date:   Wed Dec 11 20:53:45 2013 -0800

    net-gro: Prepare GRO stack for the upcoming tunneling support

Before this commit, the GRO code was able to access the received packets'
headers quickly via NAPI_GRO_CB(skb)->frag0. After the commit, this
optimization no longer applies: napi_frags_skb() now always pulls the
Ethernet header out of the first frag into the skb's linear area, frag0
is left unset, and the .gro_receive functions consequently take their
slow paths, pulling the remaining headers with __pskb_pull_tail().
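To make the mechanism concrete, here is roughly what the relevant code
looks like. This is paraphrased and simplified from net/core/dev.c,
include/linux/netdevice.h and net/ipv4/af_inet.c around this commit, so
treat it as a sketch consistent with the traces below, not the exact
source:

  /* napi_frags_skb() after the commit: the Ethernet header is pulled
   * unconditionally. For napi_gro_frags() users all of the packet data
   * lives in page frags, so pskb_may_pull() ends up in
   * __pskb_pull_tail() and memcpy. */
  static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
  {
          struct sk_buff *skb = napi->skb;

          napi->skb = NULL;

          if (unlikely(!pskb_may_pull(skb, sizeof(struct ethhdr)))) {
                  napi_reuse_skb(napi, skb);
                  return NULL;
          }
          skb->protocol = eth_type_trans(skb, skb->dev);

          return skb;
  }

The .gro_receive handlers then use the usual fast/slow header-access
pattern, e.g. this excerpt from inet_gro_receive():

  off  = skb_gro_offset(skb);
  hlen = off + sizeof(*iph);
  iph  = skb_gro_header_fast(skb, off);   /* reads frag0 + offset */
  if (skb_gro_header_hard(skb, hlen)) {
          /* slow path: pskb_may_pull() -> __pskb_pull_tail() */
          iph = skb_gro_header_slow(skb, hlen, off);
          if (unlikely(!iph))
                  goto out;
  }

With frag0 unset, frag0_len is 0, so skb_gro_header_hard() is always
true and every handler falls into skb_gro_header_slow(), which is
exactly what the profile above and the traces below show.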
To demonstrate the call chains, let's look at the function graphs traced
by ftrace.

Before the commit:

 13)               |  napi_gro_frags() {
 13)   0.038 us    |    skb_gro_reset_offset();
 13)               |    dev_gro_receive() {
 13)               |      inet_gro_receive() {
 13)               |        tcp4_gro_receive() {
 13)               |          tcp_gro_receive() {
 13)   0.059 us    |            skb_gro_receive();
 13)   0.385 us    |          }
 13)   0.675 us    |        }
 13)   1.012 us    |      }
 13)   1.336 us    |    }
 13)   0.040 us    |    napi_reuse_skb.isra.57();
 13)   2.235 us    |  }

After the commit:

  7)               |  napi_gro_frags() {
  7)               |    __pskb_pull_tail() {
  7)   0.204 us    |      skb_copy_bits();
  7)   0.551 us    |    }
  7)   0.046 us    |    eth_type_trans();
  7)               |    dev_gro_receive() {
  7)               |      inet_gro_receive() {
  7)               |        __pskb_pull_tail() {
  7)   0.095 us    |          skb_copy_bits();
  7)   0.412 us    |        }
  7)               |        tcp4_gro_receive() {
  7)               |          tcp_gro_receive() {
  7)               |            __pskb_pull_tail() {
  7)   0.095 us    |              skb_copy_bits();
  7)   0.410 us    |            }
  7)               |            __pskb_pull_tail() {
  7)   0.095 us    |              skb_copy_bits();
  7)   0.412 us    |            }
  7)   0.055 us    |            skb_gro_receive();
  7)   1.771 us    |          }
  7)   2.077 us    |        }
  7)   3.078 us    |      }
  7)   3.403 us    |    }
  7)   0.043 us    |    napi_reuse_skb.isra.72();
  7)   5.152 us    |  }

Here are typical times spent in napi_gro_frags(), measured using ftrace
with napi_gro_frags() traced as a leaf function (thus avoiding the
tracing overhead in its child functions).

Before the commit:

...
 12)   0.130 us    |  napi_gro_frags();
 12)   0.129 us    |  napi_gro_frags();
 12)   0.132 us    |  napi_gro_frags();
 12)   0.128 us    |  napi_gro_frags();
 12)   0.126 us    |  napi_gro_frags();
 12)   0.129 us    |  napi_gro_frags();
 12)   1.423 us    |  napi_gro_frags();
...

After the commit:

...
 20)   0.812 us    |  napi_gro_frags();
 20)   0.820 us    |  napi_gro_frags();
 20)   0.811 us    |  napi_gro_frags();
 20)   0.812 us    |  napi_gro_frags();
 20)   0.814 us    |  napi_gro_frags();
 20)   0.819 us    |  napi_gro_frags();
 20)   1.526 us    |  napi_gro_frags();
...

This should affect not just be2net, but all drivers that use
napi_gro_frags().

Any suggestions on how to restore the frag0 optimization?

Thanks,
Michal
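P.S. To spell out why pulling the Ethernet header defeats the
optimization: before the commit, frag0 was set up by
skb_gro_reset_offset() (visible in the "before" trace above), and only
when all headers still sit in the first page frag. Roughly, paraphrased
and simplified from net/core/dev.c:

  static void skb_gro_reset_offset(struct sk_buff *skb)
  {
          const struct skb_shared_info *pinfo = skb_shinfo(skb);
          const skb_frag_t *frag0 = &pinfo->frags[0];

          NAPI_GRO_CB(skb)->data_offset = 0;
          NAPI_GRO_CB(skb)->frag0 = NULL;
          NAPI_GRO_CB(skb)->frag0_len = 0;

          /* frag0 is only usable while every header still lives in the
           * first page frag, i.e. nothing has been pulled into the
           * linear area yet. */
          if (skb_mac_header(skb) == skb_tail_pointer(skb) &&
              pinfo->nr_frags &&
              !PageHighMem(skb_frag_page(frag0))) {
                  NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
                  NAPI_GRO_CB(skb)->frag0_len = skb_frag_size(frag0);
          }
  }

Once napi_frags_skb() has pulled the Ethernet header,
skb_mac_header(skb) != skb_tail_pointer(skb), so this condition can
never hold for napi_gro_frags() users and frag0 stays NULL.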