Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Is this 32-bit NCM?y
From: Kevin Zhu @ 2014-12-05  2:20 UTC (permalink / raw)
  To: Enrico Mioso, Bjørn Mork
  Cc: Midge Shaojun  Tan, Eli Britstein, Alex Strizhevsky,
	youtux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	linux-usb-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <alpine.LNX.2.03.1412041326160.9926-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 4821 bytes --]

Regarding the location of NDP, it should be easy to fix. It can be added
to the end of the NTB only after it's ready to send. Regarding the
concern to other devices, as there's a particular driver for Huawei
devices in kernel, which is huawei_cdc_ncm, maybe we can just fix the TX
function there to avoid breaking other devices.

Regards,
Kevin

On 12/04/2014 08:28 PM, Enrico Mioso wrote:
> ... DHCP will work with some DHCPNACKS in the meanwhile, but ping
> stops working at all.
> Otherwise, it works with the standard value:
>
> --- 8.8.8.8 ping statistics ---
> 48 packets transmitted, 48 received, 0% packet loss, time 47004ms
> rtt min/avg/max/mdev = 362.084/392.878/523.132/33.636 ms
>
> And I was expecting effectively to see some lost packets, but
> instead... no.
>
>
> On Thu, 4 Dec 2014, BjÃ¸rn Mork wrote:
>
>> Date: Thu, 4 Dec 2014 12:44:56
>> From: BjÃ¸rn Mork <bjorn@mork.no>
>> To: Midge Shaojun Tan <ShaojunMidge.Tan@audiocodes.com>
>> Cc: Enrico Mioso <mrkiko.rs@gmail.com>,
>>     Kevin Zhu <Mingying.Zhu@audiocodes.com>,
>>     Eli Britstein <Eli.Britstein@audiocodes.com>,
>>     Alex Strizhevsky <alexxst@gmail.com>,
>>     "youtux@gmail.com" <youtux@gmail.com>,
>>     "linux-usb@vger.kernel.org" <linux-usb@vger.kernel.org>,
>>     "netdev@vger.kernel.org" <netdev@vger.kernel.org>
>> Subject: Re: Is this 32-bit NCM?y
>>
>> "Midge Shaojun  Tan" <ShaojunMidge.Tan@audiocodes.com> writes:
>>
>>> Hi all,
>>>
>>> I test OK with kervel 3.16.4
>>> Need disable other Ethernet network, just like eth1. (Then the DNS
>>> and route is OK)
>>> And also need disable arp, (ifconfig wwan0 -arp up), because China
>>> UNICOM don't respond the ARP message.
>>
>> The ARP functionality is independent of operator.  It is handled
>> internally by the modem firmware.  There are no MAC addresses or
>> ethernet headers transmitted over the radio link.  That's all faked by
>> the modem.  All MAC addresses and ethernet headers are local to the
>> modem<->host USB link.
>>
>>> With new mode switch string: /etc/usb_modeswitch.d/12d1:14fe
>>> Please see the patch and check whether it is correct?
>>
>> I see that you have two changes there:
>>
>> 1) the ETH_HLEN adjustment of ctx->tx_remainder is dropped
>> 2) the NDP is placed after the first frame.
>>
>> I haven't verified the effect of the tx_remainder change, but I assume
>> it fixes an alignment problem for this device.  I'd like to look more at
>> the effect of this for different values of wNdpOutPayloadRemainder and
>> wNdpOutDivisor.
>>
>> We can choose to put the NDP at the end of the NTB if we find that this
>> fixes some problem, but doing so by default for every NCM and MBIM
>> device is a bit risky. If we accept that some devices are so buggy that
>> the NDP cannot be placed anywhere (as required by the spec), then we
>> have to assume that this goes both ways.  Which means that moving the
>> NDP to the end of the NTB might break some other device.  We just don't
>> know that since we haven't ever tried it.
>>
>> And your fix doesn't really move it to the end either.  It just places
>> the NDP after the first ethernet packet.  Which happens to be the end if
>> there is only one packet in the NTB. But if we aggregate more packets
>> into this NTB then the result will look like this:
>>
>> NTH
>> eth packet 1
>> NDP
>> eth packet 2
>> ..
>> eth packet N
>>
>> I'm not convinced this modem will handle that if it cannot handle the
>> NDP being before the first packet...  This needs to be tested. Try
>> increasing /sys/class/net/wwan0/cdc_ncm/tx_timer_usecs to force the
>> driver to aggregate packets and see if everything still works.
>> Preferably while looking at the resulting NTB to verify that it does
>> contain more than one ethernet packet.
>>
>> I realize I sound a bit negative now.  This is absolutely not my
>> intention. This is great work, providing some real progress wrt figuring
>> out what goes on here.  Thanks a lot!  I am sure we can sort out the
>> remaining issues, which are really minor compared to what you have found
>> so far.
>>
>>
>>
>> BjÃ¸rn
>>
>>
This email and any files transmitted with it are confidential material. They are intended solely for the use of the designated individual or entity to whom they are addressed. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful.

If you have received this email in error please immediately notify the sender and delete or destroy any copy of this message
N‹§²æìr¸›yúèšØb²X¬¶Ç§vØ^–)Þº{.nÇ+‰·¥Š{±ºÆâžØ^n‡r¡ö¦zË\x1aëh™¨èÚ&¢îý»\x05ËÛÔØï¦v¬Îf\x1dp)¹¹br	šê+€Ê+zf£¢·hšˆ§~††Ûiÿûàz¹\x1e®w¥¢¸?™¨èÚ&¢)ß¢^[f

^ permalink raw reply

* Re: [PATCH] x86: bpf_jit_comp: Remove inline from static function definitions
From: Joe Perches @ 2014-12-05  1:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Quentin Lambert, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86@kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CAADnVQ+5T3MDiC7RwVwdyu2PEhGy0VOrp7=LWeENrX_USXevnw@mail.gmail.com>

On Thu, 2014-12-04 at 17:21 -0800, Alexei Starovoitov wrote:
> Joe, please mention dependencies next time.

No real need here.
It's in the same thread.

^ permalink raw reply

* Re: [PATCH] x86: bpf_jit_comp: Remove inline from static function definitions
From: Alexei Starovoitov @ 2014-12-05  1:21 UTC (permalink / raw)
  To: Joe Perches
  Cc: Eric Dumazet, Quentin Lambert, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86@kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1417741284.2721.26.camel@perches.com>

On Thu, Dec 4, 2014 at 5:01 PM, Joe Perches <joe@perches.com> wrote:
> Let the compiler decide instead.
>
> No change in object size x86-64 -O2 no profiling
>
> Signed-off-by: Joe Perches <joe@perches.com>
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>

Acked-by: Alexei Starovoitov <ast@plumgrid.com>

Dave, this is on top of previous patch:
http://patchwork.ozlabs.org/patch/417960/

Joe, please mention dependencies next time.

^ permalink raw reply

* Re: [PATCH net-next 3/3] ip: Add support for IP_CHECKSUM cmsg
From: Eric Dumazet @ 2014-12-05  1:17 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <1417653868-14922-4-git-send-email-therbert@google.com>

On Wed, 2014-12-03 at 16:44 -0800, Tom Herbert wrote:

> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 221b53f..bba2e06 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1315,8 +1315,16 @@ try_again:
>  		memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
>  		*addr_len = sizeof(*sin);
>  	}
> -	if (inet->cmsg_flags)
> +	if (inet->cmsg_flags) {
> +		/* Pull checksum past UDP header in case we are providing
> +		 * checksum in cmsg.
> +		 */
> +		if (inet->cmsg_flags & IP_CMSG_CHECKSUM)
> +			skb_postpull_rcsum(skb, skb->data,
> +					   sizeof(struct udphdr));
> +
>  		ip_cmsg_recv(msg, skb);
> +	}


Have you tried MSG_PEEK ?

You can not modify the skb, you need to do something else.

^ permalink raw reply

* Re: [PATCH net-next 1/2] tcp_cubic: add SNMP counters to track how effective is Hystart
From: Eric Dumazet @ 2014-12-05  1:10 UTC (permalink / raw)
  To: Florian Westphal
  Cc: David Miller, netdev, Nandita Dukkipati, Neal Cardwell,
	Yuchung Cheng, Sangtae Ha
In-Reply-To: <20141205001754.GG16959@breakpoint.cc>

On Fri, 2014-12-05 at 01:17 +0100, Florian Westphal wrote:

> Alternatively we could add INET_DIAG_CUBICINFO and export such info via ss
> tool.

Thats a complete different model.

When studying some param changes on servers handling million of flows,
SNMP hosts stats are giving the picture, without having to collect huge
ss info in the background and aggregating it.

^ permalink raw reply

* [PATCH] x86: bpf_jit_comp: Remove inline from static function definitions
From: Joe Perches @ 2014-12-05  1:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, Quentin Lambert, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86@kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1417740382.22424.40.camel@edumazet-glaptop2.roam.corp.google.com>

Let the compiler decide instead.

No change in object size x86-64 -O2 no profiling

Signed-off-by: Joe Perches <joe@perches.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
---
On Thu, 2014-12-04 at 16:46 -0800, Eric Dumazet wrote:
> On Thu, 2014-12-04 at 15:31 -0800, Alexei Starovoitov wrote:
> 
> > well, it is a trivial function even from compiler point of view.
> > Dropping inline keyword doesn't help. gcc still inlines them.
> > Changing all 3 functions to _noinline_ doesn't help either.
> > So I think this patch is actually quite helpful to reduce code size.
> 
> Well, again this might depend on CONFIG_CC_OPTIMIZE_FOR_SIZE

-Os has a different size delta, but it's still
smaller using this new function.

> I guess people trying to get very small kernels are using this option.
> 
> My point was : If we care about code size, we should also remove these
> inline keywords at the same time, to increase SNR of netdev/lkml lists.

Because there's no object change here, inline removals
would probably be a good thing for this file.

 arch/x86/net/bpf_jit_comp.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 09e2cea..626e013 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -24,7 +24,7 @@ extern u8 sk_load_byte_positive_offset[];
 extern u8 sk_load_word_negative_offset[], sk_load_half_negative_offset[];
 extern u8 sk_load_byte_negative_offset[];
 
-static inline u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
+static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
 {
 	if (len == 1)
 		*ptr = bytes;
@@ -52,12 +52,12 @@ static inline u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
 #define EMIT4_off32(b1, b2, b3, b4, off) \
 	do {EMIT4(b1, b2, b3, b4); EMIT(off, 4); } while (0)
 
-static inline bool is_imm8(int value)
+static bool is_imm8(int value)
 {
 	return value <= 127 && value >= -128;
 }
 
-static inline bool is_simm32(s64 value)
+static bool is_simm32(s64 value)
 {
 	return value == (s64) (s32) value;
 }
@@ -94,7 +94,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
 #define X86_JGE 0x7D
 #define X86_JG  0x7F
 
-static inline void bpf_flush_icache(void *start, void *end)
+static void bpf_flush_icache(void *start, void *end)
 {
 	mm_segment_t old_fs = get_fs();
 
@@ -133,7 +133,7 @@ static const int reg2hex[] = {
  * which need extra byte of encoding.
  * rax,rcx,...,rbp have simpler encoding
  */
-static inline bool is_ereg(u32 reg)
+static bool is_ereg(u32 reg)
 {
 	return (1 << reg) & (BIT(BPF_REG_5) |
 			     BIT(AUX_REG) |
@@ -143,14 +143,14 @@ static inline bool is_ereg(u32 reg)
 }
 
 /* add modifiers if 'reg' maps to x64 registers r8..r15 */
-static inline u8 add_1mod(u8 byte, u32 reg)
+static u8 add_1mod(u8 byte, u32 reg)
 {
 	if (is_ereg(reg))
 		byte |= 1;
 	return byte;
 }
 
-static inline u8 add_2mod(u8 byte, u32 r1, u32 r2)
+static u8 add_2mod(u8 byte, u32 r1, u32 r2)
 {
 	if (is_ereg(r1))
 		byte |= 1;
@@ -160,13 +160,13 @@ static inline u8 add_2mod(u8 byte, u32 r1, u32 r2)
 }
 
 /* encode 'dst_reg' register into x64 opcode 'byte' */
-static inline u8 add_1reg(u8 byte, u32 dst_reg)
+static u8 add_1reg(u8 byte, u32 dst_reg)
 {
 	return byte + reg2hex[dst_reg];
 }
 
 /* encode 'dst_reg' and 'src_reg' registers into x64 opcode 'byte' */
-static inline u8 add_2reg(u8 byte, u32 dst_reg, u32 src_reg)
+static u8 add_2reg(u8 byte, u32 dst_reg, u32 src_reg)
 {
 	return byte + reg2hex[dst_reg] + (reg2hex[src_reg] << 3);
 }

^ permalink raw reply related

* Re: [PATCH net-next 1/2] tcp_cubic: add SNMP counters to track how effective is Hystart
From: Neal Cardwell @ 2014-12-05  0:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Nandita Dukkipati, Yuchung Cheng,
	Sangtae Ha
In-Reply-To: <1417738403.22424.28.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, Dec 4, 2014 at 7:13 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> When deploying FQ pacing, one thing we noticed is that CUBIC Hystart
> triggers too soon.
>
> Having SNMP counters to have an idea of how often the various Hystart
> methods trigger is useful prior to any modifications.
>
> This patch adds SNMP counters tracking, how many time "ack train" or
> "Delay" based Hystart triggers, and cumulative sum of cwnd at the time
> Hystart decided to end SS (Slow Start)
>
> myhost:~# nstat -a | grep Hystart
> TcpExtTCPHystartTrainDetect     9                  0.0
> TcpExtTCPHystartTrainCwnd       20650              0.0
> TcpExtTCPHystartDelayDetect     10                 0.0
> TcpExtTCPHystartDelayCwnd       360                0.0
>
> ->
>  Train detection was triggered 9 times, and average cwnd was
>  20650/9=2294,
>  Delay detection was triggered 10 times and average cwnd was 36
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

Very nice. IMHO it's great to have these as SNMP counters, so we don't
need special tools to get aggregate stats for all connections.

neal

^ permalink raw reply

* Re: [PATCH net-next 2/2] tcp_cubic: refine Hystart delay threshold
From: Neal Cardwell @ 2014-12-05  0:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Nandita Dukkipati, Yuchung Cheng,
	Sangtae Ha
In-Reply-To: <1417738429.22424.29.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, Dec 4, 2014 at 7:13 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> In commit 2b4636a5f8ca ("tcp_cubic: make the delay threshold of HyStart
> less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms.
>
> The remaining problem is that using delay_min + (delay_min/16) as the
> threshold is too sensitive.
>
> 6.25 % of variation is too small for rtt above 60 ms, which are not
> uncommon.
>
> Lets use 12.5 % instead (delay_min + (delay_min/8))

Very nice, Eric!

Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH] x86: bpf_jit_comp: Reduce is_ereg() code size
From: Eric Dumazet @ 2014-12-05  0:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Perches, Quentin Lambert, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86@kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CAADnVQ+3jiYOYReRLPUsFeG1p66ezo-TXBLX3BGQQh7V8QX0tQ@mail.gmail.com>

On Thu, 2014-12-04 at 15:31 -0800, Alexei Starovoitov wrote:

> well, it is a trivial function even from compiler point of view.
> Dropping inline keyword doesn't help. gcc still inlines them.
> Changing all 3 functions to _noinline_ doesn't help either.
> So I think this patch is actually quite helpful to reduce code size.

Well, again this might depend on CONFIG_CC_OPTIMIZE_FOR_SIZE

I guess people trying to get very small kernels are using this option.

My point was : If we care about code size, we should also remove these
inline keywords at the same time, to increase SNR of netdev/lkml lists.

^ permalink raw reply

* [PATCH iproute2] ip monitor: Fix issue when timestamp is printed w/o msg
From: Vadim Kochan @ 2014-12-05  0:18 UTC (permalink / raw)
  To: netdev; +Cc: Vadim Kochan

The issue was observed when IPv6 router broadcasted NDUSEROPT
messages which are not handled by monitor and caused printing
'Timestamps' w/o message because such kind of rtnl messages is not
handled by monitor.

As 'ip monitor' by default subscribes to the all mcast rtnl groups except
RTGRP_TC then all messages of these rtnl groups which are not handled by
monitor may cause such issues.

Fixed by subscribing by default to rtnl mcast groups which are
supported by 'ip monitor'.

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
---
 ip/ipmonitor.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 4cc75f4..4708e54 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -134,14 +134,6 @@ static int accept_msg(const struct sockaddr_nl *who,
 		fprintf(fp, "Timestamp: %s %lu us\n", tstr, usecs);
 		return 0;
 	}
-	if (n->nlmsg_type == RTM_NEWQDISC ||
-	    n->nlmsg_type == RTM_DELQDISC ||
-	    n->nlmsg_type == RTM_NEWTCLASS ||
-	    n->nlmsg_type == RTM_DELTCLASS ||
-	    n->nlmsg_type == RTM_NEWTFILTER ||
-	    n->nlmsg_type == RTM_DELTFILTER ||
-	    n->nlmsg_type == RTM_NEWNDUSEROPT)
-		return 0;
 	if (n->nlmsg_type != NLMSG_ERROR && n->nlmsg_type != NLMSG_NOOP &&
 	    n->nlmsg_type != NLMSG_DONE) {
 		fprintf(fp, "Unknown message: type=0x%08x(%d) flags=0x%08x(%d)"
@@ -155,7 +147,7 @@ static int accept_msg(const struct sockaddr_nl *who,
 int do_ipmonitor(int argc, char **argv)
 {
 	char *file = NULL;
-	unsigned groups = ~RTMGRP_TC;
+	unsigned groups = 0;
 	int llink=0;
 	int laddr=0;
 	int lroute=0;
@@ -165,6 +157,18 @@ int do_ipmonitor(int argc, char **argv)
 	int lnetconf=0;
 	int ifindex=0;
 
+	groups |= nl_mgrp(RTNLGRP_LINK);
+	groups |= nl_mgrp(RTNLGRP_IPV4_IFADDR);
+	groups |= nl_mgrp(RTNLGRP_IPV6_IFADDR);
+	groups |= nl_mgrp(RTNLGRP_IPV4_ROUTE);
+	groups |= nl_mgrp(RTNLGRP_IPV6_ROUTE);
+	groups |= nl_mgrp(RTNLGRP_IPV4_MROUTE);
+	groups |= nl_mgrp(RTNLGRP_IPV6_MROUTE);
+	groups |= nl_mgrp(RTNLGRP_IPV6_PREFIX);
+	groups |= nl_mgrp(RTNLGRP_NEIGH);
+	groups |= nl_mgrp(RTNLGRP_IPV4_NETCONF);
+	groups |= nl_mgrp(RTNLGRP_IPV6_NETCONF);
+
 	rtnl_close(&rth);
 
 	while (argc > 0) {
@@ -195,7 +199,6 @@ int do_ipmonitor(int argc, char **argv)
 			lnetconf = 1;
 			groups = 0;
 		} else if (strcmp(*argv, "all") == 0) {
-			groups = ~RTMGRP_TC;
 			prefix_banner=1;
 		} else if (matches(*argv, "help") == 0) {
 			usage();
-- 
2.1.3

^ permalink raw reply related

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Du Fan @ 2014-12-05  0:25 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Pravin Shelar, Thomas Graf, Michael S. Tsirkin, Du, Fan,
	Jason Wang, netdev@vger.kernel.org, davem@davemloft.net,
	fw@strlen.de, dev@openvswitch.org
In-Reply-To: <CAEP_g=9AH_RShKujVeUmty6BKPJZt=u_-h8NnUL1kX+tSYdBxQ@mail.gmail.com>


On 2014/12/5 7:23, Jesse Gross wrote:
> On Wed, Dec 3, 2014 at 11:48 PM, Du Fan <fengyuleidian0615@gmail.com> wrote:
>> 于 2014年12月04日 06:51, Jesse Gross 写道:
>>> My proposal would be something like this:
>>>    * For L2, reduce the VM MTU to the lowest common denominator on the
>>> segment.
>>>    * For L3, use path MTU discovery or fragment inner packet (i.e.
>>> normal routing behavior).
>>>    * As a last resort (such as if using an old version of virtio in the
>>> guest), fragment the tunnel packet.
>>
>> After some investigation on OpenvSwitch package, it seems before this
>> commit: 06021dcb "datapath: compat: Fix compilation 3.11" OpenvSwitch
>> package is doing GSO on its own.
>>
>> rpl_ip_local_out
>>    -> tnl_skb_gso_segment
>>        ^^^^^^^^^^^^^^^
>>           Perform GSO  in above function
>>      -> ip_local_out
>>            .
>>            .
>>          -> ip_finish_output
>>
>> Which means, when over-MTU-sized skb enter ip_local_out, it's not a gso
>> type skb, so the stack perform ip fragmentation, and send them out.
>> So, over-MTU-sized skb did travel through stack into outside.
>>
>> Why not dropping such OpenvSwitch level GSO operation after 3.10?
> The change in 3.11 was that the tunnel infrastructure used by OVS was
> upstreamed and shared by all implementations. It's not right to
> perform GSO in OVS itself as it prevents the logic from being used by
> other components. Breaking up the packet in OVS also eliminates some
> of the benefits of GSO by shortening the optimized path and prevents
> offloading to hardware.
Thanks for your explanation, I understand its background better now.

^ permalink raw reply

* Re: [PATCH net-next 1/2] tcp_cubic: add SNMP counters to track how effective is Hystart
From: Florian Westphal @ 2014-12-05  0:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Nandita Dukkipati, Neal Cardwell,
	Yuchung Cheng, Sangtae Ha
In-Reply-To: <1417738403.22424.28.camel@edumazet-glaptop2.roam.corp.google.com>

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> When deploying FQ pacing, one thing we noticed is that CUBIC Hystart
> triggers too soon.
> 
> Having SNMP counters to have an idea of how often the various Hystart
> methods trigger is useful prior to any modifications.
> 
> This patch adds SNMP counters tracking, how many time "ack train" or
> "Delay" based Hystart triggers, and cumulative sum of cwnd at the time
> Hystart decided to end SS (Slow Start)
> 
> myhost:~# nstat -a | grep Hystart
> TcpExtTCPHystartTrainDetect     9                  0.0
> TcpExtTCPHystartTrainCwnd       20650              0.0
> TcpExtTCPHystartDelayDetect     10                 0.0
> TcpExtTCPHystartDelayCwnd       360                0.0
> 
> ->
>  Train detection was triggered 9 times, and average cwnd was
>  20650/9=2294,
>  Delay detection was triggered 10 times and average cwnd was 36

Alternatively we could add INET_DIAG_CUBICINFO and export such info via ss
tool.

^ permalink raw reply

* [PATCH net-next 2/2] tcp_cubic: refine Hystart delay threshold
From: Eric Dumazet @ 2014-12-05  0:13 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Nandita Dukkipati, Neal Cardwell, Yuchung Cheng,
	Sangtae Ha

From: Eric Dumazet <edumazet@google.com>

In commit 2b4636a5f8ca ("tcp_cubic: make the delay threshold of HyStart
less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms.

The remaining problem is that using delay_min + (delay_min/16) as the
threshold is too sensitive.

6.25 % of variation is too small for rtt above 60 ms, which are not
uncommon.

Lets use 12.5 % instead (delay_min + (delay_min/8))

Tested:
 80 ms RTT between peers, FQ/pacing packet scheduler on sender.
 10 bulk transfers of 10 seconds :

nstat >/dev/null
for i in `seq 1 10`
 do
   netperf -H remote -- -k THROUGHPUT | grep THROUGHPUT
 done
nstat | grep Hystart

With the 6.25 % threshold :

THROUGHPUT=20.66
THROUGHPUT=249.38
THROUGHPUT=254.10
THROUGHPUT=14.94
THROUGHPUT=251.92
THROUGHPUT=237.73
THROUGHPUT=19.18
THROUGHPUT=252.89
THROUGHPUT=21.32
THROUGHPUT=15.58
TcpExtTCPHystartTrainDetect     2                  0.0
TcpExtTCPHystartTrainCwnd       4756               0.0
TcpExtTCPHystartDelayDetect     5                  0.0
TcpExtTCPHystartDelayCwnd       180                0.0

With the 12.5 % threshold 
THROUGHPUT=251.09
THROUGHPUT=247.46
THROUGHPUT=250.92
THROUGHPUT=248.91
THROUGHPUT=250.88
THROUGHPUT=249.84
THROUGHPUT=250.51
THROUGHPUT=254.15
THROUGHPUT=250.62
THROUGHPUT=250.89
TcpExtTCPHystartTrainDetect     1                  0.0
TcpExtTCPHystartTrainCwnd       3175               0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_cubic.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index c1d07c7ed03d7d37fa28d1509093e686f78134d2..6b6002416a73950d493661ea1459870f49917efc 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -393,7 +393,7 @@ static void hystart_update(struct sock *sk, u32 delay)
 			ca->sample_cnt++;
 		} else {
 			if (ca->curr_rtt > ca->delay_min +
-			    HYSTART_DELAY_THRESH(ca->delay_min>>4)) {
+			    HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
 				ca->found |= HYSTART_DELAY;
 				NET_INC_STATS_BH(sock_net(sk),
 						 LINUX_MIB_TCPHYSTARTDELAYDETECT);

^ permalink raw reply related

* [PATCH net-next 1/2] tcp_cubic: add SNMP counters to track how effective is Hystart
From: Eric Dumazet @ 2014-12-05  0:13 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Nandita Dukkipati, Neal Cardwell, Yuchung Cheng,
	Sangtae Ha

From: Eric Dumazet <edumazet@google.com>

When deploying FQ pacing, one thing we noticed is that CUBIC Hystart
triggers too soon.

Having SNMP counters to have an idea of how often the various Hystart
methods trigger is useful prior to any modifications.

This patch adds SNMP counters tracking, how many time "ack train" or
"Delay" based Hystart triggers, and cumulative sum of cwnd at the time
Hystart decided to end SS (Slow Start)

myhost:~# nstat -a | grep Hystart
TcpExtTCPHystartTrainDetect     9                  0.0
TcpExtTCPHystartTrainCwnd       20650              0.0
TcpExtTCPHystartDelayDetect     10                 0.0
TcpExtTCPHystartDelayCwnd       360                0.0

->
 Train detection was triggered 9 times, and average cwnd was
 20650/9=2294,
 Delay detection was triggered 10 times and average cwnd was 36

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/uapi/linux/snmp.h |    4 ++++
 net/ipv4/proc.c           |    4 ++++
 net/ipv4/tcp_cubic.c      |   31 ++++++++++++++++++++++---------
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 30f541b32895..b22224100011 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -266,6 +266,10 @@ enum
 	LINUX_MIB_TCPWANTZEROWINDOWADV,		/* TCPWantZeroWindowAdv */
 	LINUX_MIB_TCPSYNRETRANS,		/* TCPSynRetrans */
 	LINUX_MIB_TCPORIGDATASENT,		/* TCPOrigDataSent */
+	LINUX_MIB_TCPHYSTARTTRAINDETECT,	/* TCPHystartTrainDetect */
+	LINUX_MIB_TCPHYSTARTTRAINCWND,		/* TCPHystartTrainCwnd */
+	LINUX_MIB_TCPHYSTARTDELAYDETECT,	/* TCPHystartDelayDetect */
+	LINUX_MIB_TCPHYSTARTDELAYCWND,		/* TCPHystartDelayCwnd */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 6513ade8d6dc..8f9cd200ce20 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -288,6 +288,10 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPWantZeroWindowAdv", LINUX_MIB_TCPWANTZEROWINDOWADV),
 	SNMP_MIB_ITEM("TCPSynRetrans", LINUX_MIB_TCPSYNRETRANS),
 	SNMP_MIB_ITEM("TCPOrigDataSent", LINUX_MIB_TCPORIGDATASENT),
+	SNMP_MIB_ITEM("TCPHystartTrainDetect", LINUX_MIB_TCPHYSTARTTRAINDETECT),
+	SNMP_MIB_ITEM("TCPHystartTrainCwnd", LINUX_MIB_TCPHYSTARTTRAINCWND),
+	SNMP_MIB_ITEM("TCPHystartDelayDetect", LINUX_MIB_TCPHYSTARTDELAYDETECT),
+	SNMP_MIB_ITEM("TCPHystartDelayCwnd", LINUX_MIB_TCPHYSTARTDELAYCWND),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 20de0118c98e..c1d07c7ed03d 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -363,16 +363,28 @@ static void hystart_update(struct sock *sk, u32 delay)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct bictcp *ca = inet_csk_ca(sk);
 
-	if (!(ca->found & hystart_detect)) {
+	if (ca->found & hystart_detect)
+		return;
+
+	if (hystart_detect & HYSTART_ACK_TRAIN) {
 		u32 now = bictcp_clock();
 
 		/* first detection parameter - ack-train detection */
 		if ((s32)(now - ca->last_ack) <= hystart_ack_delta) {
 			ca->last_ack = now;
-			if ((s32)(now - ca->round_start) > ca->delay_min >> 4)
+			if ((s32)(now - ca->round_start) > ca->delay_min >> 4) {
 				ca->found |= HYSTART_ACK_TRAIN;
+				NET_INC_STATS_BH(sock_net(sk),
+						 LINUX_MIB_TCPHYSTARTTRAINDETECT);
+				NET_ADD_STATS_BH(sock_net(sk),
+						 LINUX_MIB_TCPHYSTARTTRAINCWND,
+						 tp->snd_cwnd);
+				tp->snd_ssthresh = tp->snd_cwnd;
+			}
 		}
+	}
 
+	if (hystart_detect & HYSTART_DELAY) {
 		/* obtain the minimum delay of more than sampling packets */
 		if (ca->sample_cnt < HYSTART_MIN_SAMPLES) {
 			if (ca->curr_rtt == 0 || ca->curr_rtt > delay)
@@ -381,15 +393,16 @@ static void hystart_update(struct sock *sk, u32 delay)
 			ca->sample_cnt++;
 		} else {
 			if (ca->curr_rtt > ca->delay_min +
-			    HYSTART_DELAY_THRESH(ca->delay_min>>4))
+			    HYSTART_DELAY_THRESH(ca->delay_min>>4)) {
 				ca->found |= HYSTART_DELAY;
+				NET_INC_STATS_BH(sock_net(sk),
+						 LINUX_MIB_TCPHYSTARTDELAYDETECT);
+				NET_ADD_STATS_BH(sock_net(sk),
+						 LINUX_MIB_TCPHYSTARTDELAYCWND,
+						 tp->snd_cwnd);
+				tp->snd_ssthresh = tp->snd_cwnd;
+			}
 		}
-		/*
-		 * Either one of two conditions are met,
-		 * we exit from slow start immediately.
-		 */
-		if (ca->found & hystart_detect)
-			tp->snd_ssthresh = tp->snd_cwnd;
 	}
 }
 

^ permalink raw reply related

* Re: [PATCH] x86: bpf_jit_comp: Reduce is_ereg() code size
From: Joe Perches @ 2014-12-04 23:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, Quentin Lambert, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1417734891.22424.2.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, 2014-12-04 at 15:14 -0800, Eric Dumazet wrote:
> On Thu, 2014-12-04 at 15:00 -0800, Joe Perches wrote:
> > Use the (1 << reg) & mask trick to reduce code size.
[]
> Really, the root cause of this is the 'inline' abuse in non fast paths
> for non trivial functions.

There is no object size change with is_ereg()
defined "static inline" or "static"

Curiously, if you mark it noinline, the size increases.

gcc 4.9.1, x86-64, -O2 no profiling support

$ size arch/x86/net/bpf_jit_comp.o.st*
   text	   data	    bss	    dec	    hex	filename
  10679	      4	      0	  10683	   29bb	arch/x86/net/bpf_jit_comp.o.static_inline
  11535	      4	      0	  11539	   2d13	arch/x86/net/bpf_jit_comp.o.static_noinline
  10679	      4	      0	  10683	   29bb	arch/x86/net/bpf_jit_comp.o.static_without_inline

^ permalink raw reply

* Re: [PATCH] x86: bpf_jit_comp: Reduce is_ereg() code size
From: Alexei Starovoitov @ 2014-12-04 23:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Joe Perches, Quentin Lambert, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86@kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1417734891.22424.2.camel@edumazet-glaptop2.roam.corp.google.com>

On Thu, Dec 4, 2014 at 3:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2014-12-04 at 15:00 -0800, Joe Perches wrote:
>> Use the (1 << reg) & mask trick to reduce code size.
>>
>> x86-64 size difference -O2 without profiling for various
>> gcc versions:
>
>>
>> Signed-off-by: Joe Perches <joe@perches.com>
>> ---
>>
>> compiled, untested by me, but per Alexei Starovoitov this passes
>> the test_bpf suite
>
> Really, the root cause of this is the 'inline' abuse in non fast paths
> for non trivial functions.

well, it is a trivial function even from compiler point of view.
Dropping inline keyword doesn't help. gcc still inlines them.
Changing all 3 functions to _noinline_ doesn't help either.
So I think this patch is actually quite helpful to reduce code size.

^ permalink raw reply

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Jesse Gross @ 2014-12-04 23:23 UTC (permalink / raw)
  To: Du Fan
  Cc: Pravin Shelar, Thomas Graf, Michael S. Tsirkin, Du, Fan,
	Jason Wang, netdev@vger.kernel.org, davem@davemloft.net,
	fw@strlen.de, dev@openvswitch.org
In-Reply-To: <548011DE.4000208@gmail.com>

On Wed, Dec 3, 2014 at 11:48 PM, Du Fan <fengyuleidian0615@gmail.com> wrote:
> 于 2014年12月04日 06:51, Jesse Gross 写道:
>>
>> My proposal would be something like this:
>>   * For L2, reduce the VM MTU to the lowest common denominator on the
>> segment.
>>   * For L3, use path MTU discovery or fragment inner packet (i.e.
>> normal routing behavior).
>>   * As a last resort (such as if using an old version of virtio in the
>> guest), fragment the tunnel packet.
>
>
> After some investigation on OpenvSwitch package, it seems before this
> commit: 06021dcb "datapath: compat: Fix compilation 3.11" OpenvSwitch
> package is doing GSO on its own.
>
> rpl_ip_local_out
>   -> tnl_skb_gso_segment
>       ^^^^^^^^^^^^^^^
>          Perform GSO  in above function
>     -> ip_local_out
>           .
>           .
>         -> ip_finish_output
>
> Which means, when over-MTU-sized skb enter ip_local_out, it's not a gso
> type skb, so the stack perform ip fragmentation, and send them out.
> So, over-MTU-sized skb did travel through stack into outside.
>
> Why not dropping such OpenvSwitch level GSO operation after 3.10?

The change in 3.11 was that the tunnel infrastructure used by OVS was
upstreamed and shared by all implementations. It's not right to
perform GSO in OVS itself as it prevents the logic from being used by
other components. Breaking up the packet in OVS also eliminates some
of the benefits of GSO by shortening the optimized path and prevents
offloading to hardware.

^ permalink raw reply

* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Jesse Gross @ 2014-12-04 23:19 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev@vger.kernel.org,
	davem@davemloft.net, fw@strlen.de, dev@openvswitch.org,
	Pravin Shelar
In-Reply-To: <20141204092650.GA13660@casper.infradead.org>

On Thu, Dec 4, 2014 at 1:26 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/03/14 at 05:51pm, Jesse Gross wrote:
>> I think it depends on where you put the PMTU check. If routing is
>> happening in OVS where it is decomposed in several discrete actions
>> like set MAC and decrement TTL then perhaps there is another explicit
>> action to check the MTU. If it is happening in the context of the IP
>> stack, then ICMP generation occurs automatically and if you get that
>> if you write a flow to send a packet there. In each case, it seems
>> like a flow would be steering you by way of an action to do routing so
>> you would have fine grained control. I don't see this as conflicting
>> with L3 ACLs in an L2 context in the same way that you don't have to
>> automatically decrement the TTL.
>
> OK, I was under the impression that you would detect L3 behaviour
> desire in OVS without an explicit action. Hence the worry on wrong
> classification for L3 ACLs.
>
> Whether we mark the packet and defer the MTU check or we check right
> in the action is an implementation detail in the end. I think we
> agree on concept level that we need an API to control behaviour.

Yes, I agree.

^ permalink raw reply

* Re: 3.12.33 - BUG xfrm_selector_match+0x25/0x2f6
From: Julian Anastasov @ 2014-12-04 23:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Smart Weblications GmbH - Florian Wiessner, netdev, LKML, stable
In-Reply-To: <20141204075627.GE6390@secunet.com>


	Hello,

On Thu, 4 Dec 2014, Steffen Klassert wrote:

> > [16623.096721] Call Trace:
> > [16623.096744]  <IRQ>
> > [16623.096749]  [<ffffffff81547a7c>] ? xfrm_sk_policy_lookup+0x44/0x9b
> > [16623.096802]  [<ffffffff81547ef7>] ? xfrm_lookup+0x91/0x446
> > [16623.096832]  [<ffffffff81541316>] ? ip_route_me_harder+0x150/0x1b0
> > [16623.096865]  [<ffffffffa01b6457>] ? ip_vs_route_me_harder+0x86/0x91 [ip_vs]
> > [16623.096899]  [<ffffffffa01b797a>] ? ip_vs_out+0x2d3/0x5bc [ip_vs]
> > [16623.096930]  [<ffffffff81501420>] ? ip_rcv_finish+0x2b8/0x2b8
> 
> I really wonder why the xfrm_sk_policy_lookup codepath is taken here.
> It looks like this is the processing of an inbound ipv4 packet that
> is going to be rerouted to the output path by ipvs, so this packet
> should not have socket context at all.

	In above trace looks like IPVS-NAT is used between
local client and some real server. IPVS handles this skb
at LOCAL_IN and calls ip_vs_route_me_harder(). If we have
skb->sk at LOCAL_IN, my first thought is about early demux.

	If I remember correctly, looking at commit f5a41847acc535e2
("ipvs: move ip_route_me_harder for ICMP") that introduced
this rerouting (2.6.37), it was needed because at that time TCP
used rt_src from received skb to select daddr in ip_send_reply().
As packets to server are DNAT-ed and packets to client are
SNAT-ed we used rerouting to fill rt_src with correct IP
after SNAT.

	Now when routing cache is removed in 3.6 and
tcp_v4_send_reset() is changed to provide ip_hdr(skb)->saddr
instead of rt_src it should be safe to remove this rerouting,
it is enough that ip_hdr(skb)->saddr was updated on IPVS-SNAT at
LOCAL_IN. In fact, rt_src was removed early in 3.0 with
commit 0a5ebb8000c5362 ("ipv4: Pass explicit daddr arg to 
ip_send_reply().").

	This is only to explain above stack. Not sure
if problem is related somehow to early demux but such
commits look interesting:

- commit 6b8dbcf2c44fd7a ("bridge: netfilter: orphan skb before invoking 
ip netfilter hooks")

	Also, it would be good to know which 3.x kernel between
3.13 and 3.17 fixes the problem, it will narrow the search.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH] x86: bpf_jit_comp: Reduce is_ereg() code size
From: Eric Dumazet @ 2014-12-04 23:14 UTC (permalink / raw)
  To: Joe Perches
  Cc: Alexei Starovoitov, Quentin Lambert, David S. Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1417734048.2721.22.camel@perches.com>

On Thu, 2014-12-04 at 15:00 -0800, Joe Perches wrote:
> Use the (1 << reg) & mask trick to reduce code size.
> 
> x86-64 size difference -O2 without profiling for various
> gcc versions:

> 
> Signed-off-by: Joe Perches <joe@perches.com>
> ---
> 
> compiled, untested by me, but per Alexei Starovoitov this passes
> the test_bpf suite

Really, the root cause of this is the 'inline' abuse in non fast paths
for non trivial functions.

^ permalink raw reply

* Re: [PATCH] x86: bpf_jit_comp: Reduce is_ereg() code size
From: Alexei Starovoitov @ 2014-12-04 23:12 UTC (permalink / raw)
  To: Joe Perches, Daniel Borkmann
  Cc: Quentin Lambert, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86@kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1417734048.2721.22.camel@perches.com>

On Thu, Dec 4, 2014 at 3:00 PM, Joe Perches <joe@perches.com> wrote:
> Use the (1 << reg) & mask trick to reduce code size.
>
> x86-64 size difference -O2 without profiling for various
> gcc versions:
>
> $ size arch/x86/net/bpf_jit_comp.o*
>    text    data     bss     dec     hex filename
>    9266       4       0    9270    2436 arch/x86/net/bpf_jit_comp.o.4.4.new
>   10042       4       0   10046    273e arch/x86/net/bpf_jit_comp.o.4.4.old
>    9109       4       0    9113    2399 arch/x86/net/bpf_jit_comp.o.4.6.new
>    9717       4       0    9721    25f9 arch/x86/net/bpf_jit_comp.o.4.6.old
>    8789       4       0    8793    2259 arch/x86/net/bpf_jit_comp.o.4.7.new
>   10245       4       0   10249    2809 arch/x86/net/bpf_jit_comp.o.4.7.old
>    9671       4       0    9675    25cb arch/x86/net/bpf_jit_comp.o.4.9.new
>   10679       4       0   10683    29bb arch/x86/net/bpf_jit_comp.o.4.9.old
>
> Signed-off-by: Joe Perches <joe@perches.com>

probably it was worth noting in comment that
reg is 4-bit value and AUX_REG==12, so it won't overflow.

Dave, it's for net-next of course.

Suggested-by: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Alexei Starovoitov <ast@plumgrid.com>

^ permalink raw reply

* [PATCH] x86: bpf_jit_comp: Reduce is_ereg() code size
From: Joe Perches @ 2014-12-04 23:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Quentin Lambert, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAADnVQKtYkKtsd9rUFm2yfMoh7F_SRFPVP8hpvSY00KpmB-NNg@mail.gmail.com>

Use the (1 << reg) & mask trick to reduce code size.

x86-64 size difference -O2 without profiling for various
gcc versions:

$ size arch/x86/net/bpf_jit_comp.o*
   text    data     bss     dec     hex filename
   9266       4       0    9270    2436 arch/x86/net/bpf_jit_comp.o.4.4.new
  10042       4       0   10046    273e arch/x86/net/bpf_jit_comp.o.4.4.old
   9109       4       0    9113    2399 arch/x86/net/bpf_jit_comp.o.4.6.new
   9717       4       0    9721    25f9 arch/x86/net/bpf_jit_comp.o.4.6.old
   8789       4       0    8793    2259 arch/x86/net/bpf_jit_comp.o.4.7.new
  10245       4       0   10249    2809 arch/x86/net/bpf_jit_comp.o.4.7.old
   9671       4       0    9675    25cb arch/x86/net/bpf_jit_comp.o.4.9.new
  10679       4       0   10683    29bb arch/x86/net/bpf_jit_comp.o.4.9.old

Signed-off-by: Joe Perches <joe@perches.com>
---

compiled, untested by me, but per Alexei Starovoitov this passes
the test_bpf suite

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 3f62734..09e2cea 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -135,11 +135,11 @@ static const int reg2hex[] = {
  */
 static inline bool is_ereg(u32 reg)
 {
-	if (reg == BPF_REG_5 || reg == AUX_REG ||
-	    (reg >= BPF_REG_7 && reg <= BPF_REG_9))
-		return true;
-	else
-		return false;
+	return (1 << reg) & (BIT(BPF_REG_5) |
+			     BIT(AUX_REG) |
+			     BIT(BPF_REG_7) |
+			     BIT(BPF_REG_8) |
+			     BIT(BPF_REG_9));
 }
 
 /* add modifiers if 'reg' maps to x64 registers r8..r15 */

^ permalink raw reply related

* Re: [PATCH v2 1/6] net-PPP: Replacement of a printk() call by pr_warn() in mppe_rekey()
From: Joe Perches @ 2014-12-04 22:45 UTC (permalink / raw)
  To: SF Markus Elfring
  Cc: Sergei Shtylyov, Paul Mackerras, linux-ppp, netdev, Eric Dumazet,
	LKML, kernel-janitors, Julia Lawall
In-Reply-To: <5480DFB5.4090708@users.sourceforge.net>

On Thu, 2014-12-04 at 23:27 +0100, SF Markus Elfring wrote:
> >> diff --git a/drivers/net/ppp/ppp_mppe.c b/drivers/net/ppp/ppp_mppe.c
> > []
> >> @@ -172,9 +172,8 @@ static void mppe_rekey(struct ppp_mppe_state * state, int initial_key)
> >>  		setup_sg(sg_in, state->sha1_digest, state->keylen);
> >>  		setup_sg(sg_out, state->session_key, state->keylen);
> >>  		if (crypto_blkcipher_encrypt(&desc, sg_out, sg_in,
> >> -					     state->keylen) != 0) {
> >> -    		    printk(KERN_WARNING "mppe_rekey: cipher_encrypt failed\n");
> >> -		}
> >> +					     state->keylen) != 0)
> >> +			pr_warn("mppe_rekey: cipher_encrypt failed\n");
> > 
> > It's generally nicer to replace embedded function names
> > with "%s: ", __func__
> > 
> > 			pr_warn("%s: cipher_encrypt failed\n", __func__);
> 
> Do you want that I send a third patch series for the fine-tuning of these parameters?

If you want.

I just wanted you to be aware of it for future patches.

^ permalink raw reply

* Re: [PATCH] x86: bpf_jit_comp: simplify trivial boolean return
From: Alexei Starovoitov @ 2014-12-04 22:44 UTC (permalink / raw)
  To: Joe Perches
  Cc: David Laight, Quentin Lambert, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86@kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu, Dec 4, 2014 at 10:05 AM, Joe Perches <joe@perches.com> wrote:
> On Thu, 2014-12-04 at 07:56 -0800, Alexei Starovoitov wrote:
>> On Thu, Dec 4, 2014 at 1:26 AM, Joe Perches <joe@perches.com> wrote:
>> > On Thu, 2014-11-27 at 10:49 -0800, Joe Perches wrote:
>> >> On Thu, 2014-11-27 at 12:25 +0000, David Laight wrote:
>> >> > Why the change in data?
>> >>
>> >> btw: without gcov and using -O2
>> >>
>> >> $ size arch/x86/net/bpf_jit_comp.o*
>> >>    text          data     bss     dec     hex filename
>> >>    9671             4       0    9675    25cb arch/x86/net/bpf_jit_comp.o.new
>> >>   10679             4       0   10683    29bb arch/x86/net/bpf_jit_comp.o.old
>> >
>> > Alexei?
>> >
>> > Is this 10% reduction in size a good reason to change the code?
>>
>> yes.
>> I believe you're seeing it with gcc 4.9. I wanted to double
>> check what 4.6 and 4.7 are doing. If they're not suddenly
>> increase code size then resubmit it for inclusion please.
>
> I get these sizes for these compilers
> (x86-64, -O2, without profiling)
>
> $ size arch/x86/net/bpf_jit_comp.o*
>    text    data     bss     dec     hex filename
>    9266       4       0    9270    2436 arch/x86/net/bpf_jit_comp.o.4.4.new
>   10042       4       0   10046    273e arch/x86/net/bpf_jit_comp.o.4.4.old
>    9109       4       0    9113    2399 arch/x86/net/bpf_jit_comp.o.4.6.new
>    9717       4       0    9721    25f9 arch/x86/net/bpf_jit_comp.o.4.6.old
>    8789       4       0    8793    2259 arch/x86/net/bpf_jit_comp.o.4.7.new
>   10245       4       0   10249    2809 arch/x86/net/bpf_jit_comp.o.4.7.old
>    9671       4       0    9675    25cb arch/x86/net/bpf_jit_comp.o.4.9.new
>   10679       4       0   10683    29bb arch/x86/net/bpf_jit_comp.o.4.9.old
>
> I am a bit surprised by the size variations

yeah. the difference is surprising.
Just tried with 4.7 and my regular config and I see the same difference.
Looks like gcc wasn't able to fold conditions into cmov
and used a bunch of cmp/jmp
Since is_ereg() was inlined ~70 times on its own and as
part of other functions, the difference of 3-4 instructions
may a large difference in total size.
test_bpf also passes, so please resubmit properly.

^ permalink raw reply

* Re: [PATCH v2 1/6] net-PPP: Replacement of a printk() call by pr_warn() in mppe_rekey()
From: SF Markus Elfring @ 2014-12-04 22:27 UTC (permalink / raw)
  To: Joe Perches
  Cc: Sergei Shtylyov, Paul Mackerras, linux-ppp, netdev, Eric Dumazet,
	LKML, kernel-janitors, Julia Lawall
In-Reply-To: <1417731809.2721.17.camel@perches.com>

>> diff --git a/drivers/net/ppp/ppp_mppe.c b/drivers/net/ppp/ppp_mppe.c
> []
>> @@ -172,9 +172,8 @@ static void mppe_rekey(struct ppp_mppe_state * state, int initial_key)
>>  		setup_sg(sg_in, state->sha1_digest, state->keylen);
>>  		setup_sg(sg_out, state->session_key, state->keylen);
>>  		if (crypto_blkcipher_encrypt(&desc, sg_out, sg_in,
>> -					     state->keylen) != 0) {
>> -    		    printk(KERN_WARNING "mppe_rekey: cipher_encrypt failed\n");
>> -		}
>> +					     state->keylen) != 0)
>> +			pr_warn("mppe_rekey: cipher_encrypt failed\n");
> 
> It's generally nicer to replace embedded function names
> with "%s: ", __func__
> 
> 			pr_warn("%s: cipher_encrypt failed\n", __func__);

Do you want that I send a third patch series for the fine-tuning of these parameters?

Regards,
Martkus


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox