Netdev List
* Re: [PATCHv2 net 3/3] net: sched: ife: check on metadata length
From: Eric Dumazet @ 2018-04-19 21:50 UTC (permalink / raw)
  To: Alexander Aring, yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel
In-Reply-To: <20180419214438.6801-4-aring@mojatatu.com>



On 04/19/2018 02:44 PM, Alexander Aring wrote:
> This patch checks whether the sk buffer is long enough to dereference the
> ife header. If not, NULL is returned to signal a malformed ife packet.
> This avoids crashing the kernel from outside.
> 
> Signed-off-by: Alexander Aring <aring@mojatatu.com>
> Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
> ---
>  net/ife/ife.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/ife/ife.c b/net/ife/ife.c
> index 7fbe70a0af4b..93e8c36ce6ec 100644
> --- a/net/ife/ife.c
> +++ b/net/ife/ife.c
> @@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
>  	u16 ifehdrln;
>  
>  	ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
> +	if (skb->len < skb->dev->hard_header_len + IFE_METAHDRLEN)
> +		return NULL;
> +
>  	ifehdrln = ntohs(ifehdr->metalen);
>  	total_pull = skb->dev->hard_header_len + ifehdrln;
>  
> 

Nope, please use pskb_may_pull()
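
As an illustration (not part of the reply above): skb->len counts every byte
of the packet, including bytes held in page fragments, while ifehdr is read
through skb->data, which only covers the linear area. A length comparison
alone therefore does not prove the header is addressable. The userspace toy
below models that difference. struct toy_skb and both helpers are
hypothetical names, and the "pull" is reduced to bookkeeping; the real
pskb_may_pull() copies data into the linear area and can fail on allocation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an sk_buff: only part of the packet is linear. */
struct toy_skb {
	size_t len;     /* total bytes in the packet (linear + paged) */
	size_t headlen; /* bytes directly addressable through ->data */
};

/* The style of check used in the patch: total length only. */
bool toy_len_check(const struct toy_skb *skb, size_t need)
{
	return skb->len >= need;
}

/* Model of pskb_may_pull() semantics: succeed only if 'need' bytes
 * are (or can be made) linear.
 */
bool toy_may_pull(struct toy_skb *skb, size_t need)
{
	if (need <= skb->headlen)
		return true;	/* already linear */
	if (need > skb->len)
		return false;	/* packet is simply too short */

	/* The kernel would move bytes out of the fragments here;
	 * the toy only records the new linear length.
	 */
	skb->headlen = need;
	return true;
}
```

With len = 64 and headlen = 8, toy_len_check() accepts a 16-byte header even
though only 8 bytes are addressable; toy_may_pull() accepts it only after
extending the linear area.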


* [PATCHv2 net 3/3] net: sched: ife: check on metadata length
From: Alexander Aring @ 2018-04-19 21:44 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180419214438.6801-1-aring@mojatatu.com>

This patch checks whether the sk buffer is long enough to dereference the
ife header. If not, NULL is returned to signal a malformed ife packet.
This avoids crashing the kernel from outside.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/ife/ife.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7fbe70a0af4b..93e8c36ce6ec 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
 	u16 ifehdrln;
 
 	ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
+	if (skb->len < skb->dev->hard_header_len + IFE_METAHDRLEN)
+		return NULL;
+
 	ifehdrln = ntohs(ifehdr->metalen);
 	total_pull = skb->dev->hard_header_len + ifehdrln;
 
-- 
2.11.0


* [PATCHv2 net 2/3] net: sched: ife: handle malformed tlv length
From: Alexander Aring @ 2018-04-19 21:44 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180419214438.6801-1-aring@mojatatu.com>

There is currently no handling to check on an invalid tlv length. This
patch adds such handling to avoid killing the kernel with a malformed
ife packet.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 35 +++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  7 ++++++-
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/include/net/ife.h b/include/net/ife.h
index 44b9c00f7223..e117617e3c34 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -12,7 +12,8 @@
 void *ife_encode(struct sk_buff *skb, u16 metalen);
 void *ife_decode(struct sk_buff *skb, u16 *metalen);
 
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen);
 int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
 			const void *dval);
 
diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7d1ec76e7f43..7fbe70a0af4b 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -92,12 +92,43 @@ struct meta_tlvhdr {
 	__be16 len;
 };
 
+static bool __ife_tlv_meta_valid(const unsigned char *skbdata,
+				 const unsigned char *ifehdr_end)
+{
+	const struct meta_tlvhdr *tlv;
+	u16 tlvlen;
+
+	if (unlikely(skbdata + sizeof(*tlv) > ifehdr_end))
+		return false;
+
+	tlv = (const struct meta_tlvhdr *)skbdata;
+	tlvlen = ntohs(tlv->len);
+
+	/* tlv length field is inc header, check on minimum */
+	if (tlvlen < NLA_HDRLEN)
+		return false;
+
+	/* overflow by NLA_ALIGN check */
+	if (NLA_ALIGN(tlvlen) < tlvlen)
+		return false;
+
+	if (unlikely(skbdata + NLA_ALIGN(tlvlen) > ifehdr_end))
+		return false;
+
+	return true;
+}
+
 /* Caller takes care of presenting data in network order
  */
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen)
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen)
 {
-	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+	struct meta_tlvhdr *tlv;
+
+	if (!__ife_tlv_meta_valid(skbdata, ifehdr_end))
+		return NULL;
 
+	tlv = (struct meta_tlvhdr *)skbdata;
 	*dlen = ntohs(tlv->len) - NLA_HDRLEN;
 	*attrtype = ntohs(tlv->type);
 
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 49b8ab551fbe..8527cfdc446d 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -682,7 +682,12 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
 		u16 mtype;
 		u16 dlen;
 
-		curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
+		curr_data = ife_tlv_meta_decode(tlv_data, ifehdr_end, &mtype,
+						&dlen, NULL);
+		if (!curr_data) {
+			qstats_drop_inc(this_cpu_ptr(ife->common.cpu_qstats));
+			return TC_ACT_SHOT;
+		}
 
 		if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
 			/* abuse overlimits to count when we receive metadata
-- 
2.11.0
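
As an illustration (not part of the patch): the checks in
__ife_tlv_meta_valid() can be modelled in userspace as below. This is a
sketch under assumptions: toy_tlv_valid() and the TOY_ constant are
hypothetical names, the TLV header is taken to be a 2-byte type followed by
a 2-byte big-endian length that includes the header, and alignment is to
4 bytes. It compares sizes instead of pointers, so the rounded-up length is
never added to a pointer.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_TLV_HDRLEN ((size_t)4)	/* type (2 bytes) + len (2 bytes) */

/* Round n up to the next multiple of 4. */
size_t toy_tlv_align(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/* Return true if one TLV starting at 'p' fits entirely before 'end'. */
bool toy_tlv_valid(const unsigned char *p, const unsigned char *end)
{
	size_t room, tlvlen;

	if (p > end)
		return false;
	room = (size_t)(end - p);

	/* Is there room for the TLV header itself? */
	if (room < TOY_TLV_HDRLEN)
		return false;

	/* Length is big-endian and includes the header. */
	tlvlen = ((size_t)p[2] << 8) | p[3];
	if (tlvlen < TOY_TLV_HDRLEN)
		return false;

	/* The padded TLV must not run past the end of the metadata. */
	return toy_tlv_align(tlvlen) <= room;
}
```

A TLV whose length field claims 12 bytes inside an 8-byte metadata area is
rejected, as is one whose length field is smaller than the header.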


* [PATCHv2 net 1/3] net: sched: ife: signal not finding metaid
From: Alexander Aring @ 2018-04-19 21:44 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180419214438.6801-1-aring@mojatatu.com>

We need to record stats for received metadata that we don't know how
to process. Have find_decode_metaid() return -ENOENT to capture this.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/act_ife.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index a5994cf0512b..49b8ab551fbe 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -652,7 +652,7 @@ static int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
 		}
 	}
 
-	return 0;
+	return -ENOENT;
 }
 
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
-- 
2.11.0
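
As an illustration (not part of the patch): with the old "return 0", a
caller that tests the return value cannot tell "decoded" from "no handler
found". The userspace sketch below shows the pattern the change enables.
All names (toy_find_metaid, toy_account, struct toy_stats) are hypothetical,
and the lookup is a plain array scan rather than the list walk in
act_ife.c.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct toy_stats {
	unsigned int decoded;	/* metadata we knew how to process */
	unsigned int unknown;	/* metadata with no matching handler */
};

/* Return 0 if 'metaid' is in 'known', -ENOENT otherwise. */
int toy_find_metaid(const uint16_t *known, size_t n, uint16_t metaid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (known[i] == metaid)
			return 0;

	return -ENOENT;
}

/* The caller can now count unknown metadata instead of silently
 * treating it as a success.
 */
void toy_account(struct toy_stats *st, const uint16_t *known, size_t n,
		 uint16_t metaid)
{
	if (toy_find_metaid(known, n, metaid))
		st->unknown++;
	else
		st->decoded++;
}
```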


* [PATCHv2 net 0/3] net: sched: ife: malformed ife packet fixes
From: Alexander Aring @ 2018-04-19 21:44 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring

As promised at netdev 2.2 tc workshop I am working on adding scapy support for
tdc testing. It is still work in progress. I will submit the patches to tdc
later (they are not in good shape yet). The good news is I have been able to
find bugs which normal packet testing would not be able to find.
With fuzz testing I was able to craft certain malformed packets that the IFE
action was not able to deal with. This patch set fixes those bugs.

changes since v2:
 - remove inline from __ife_tlv_meta_valid
 - add const to cast to meta_tlvhdr
 - add acked and reviewed tags

Alexander Aring (3):
  net: sched: ife: signal not finding metaid
  net: sched: ife: handle malformed tlv length
  net: sched: ife: check on metadata length

 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 38 ++++++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  9 +++++++--
 3 files changed, 45 insertions(+), 5 deletions(-)

-- 
2.11.0


* Re: [PATCH v4 00/10] New network driver for Amiga X-Surf 100 (m68k)
From: Michael Schmitz @ 2018-04-19 21:36 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Andrew Lunn, Finn Thain, Geert Uytterhoeven,
	Florian Fainelli, Linux/m68k, Michael Karcher
In-Reply-To: <20180419.161138.825724439328248224.davem@davemloft.net>

Thanks Dave!

And many thanks to all the reviewers and testers!

Cheers,

  Michael


On Fri, Apr 20, 2018 at 8:11 AM, David Miller <davem@davemloft.net> wrote:
> From: Michael Schmitz <schmitzmic@gmail.com>
> Date: Thu, 19 Apr 2018 14:05:17 +1200
>
>> This patch series adds support for the Individual Computers X-Surf 100
>> network card for m68k Amiga, a network adapter based on the AX88796 chip set.
>
> Series applied, thank you.


* [net-next:master 23/31] drivers/net/hyperv/rndis_filter.c:1243: undefined reference to `ucs2_as_utf8'
From: kbuild test robot @ 2018-04-19 21:33 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: kbuild-all, netdev

[-- Attachment #1: Type: text/plain, Size: 1692 bytes --]

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   bda73d4ec943c4f7479603a59cad09a07c6c729a
commit: 0fe554a46a0ff855376053c7e4204673b7879f05 [23/31] hv_netvsc: propogate Hyper-V friendly name into interface alias
config: x86_64-randconfig-b0-04200208 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        git checkout 0fe554a46a0ff855376053c7e4204673b7879f05
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/net/hyperv/rndis_filter.o: In function `rndis_get_friendly_name':
>> drivers/net/hyperv/rndis_filter.c:1243: undefined reference to `ucs2_as_utf8'

vim +1243 drivers/net/hyperv/rndis_filter.c

  1226	
  1227	static void rndis_get_friendly_name(struct net_device *net,
  1228					    struct rndis_device *rndis_device,
  1229					    struct netvsc_device *net_device)
  1230	{
  1231		ucs2_char_t wname[256];
  1232		unsigned long len;
  1233		u8 ifalias[256];
  1234		u32 size;
  1235	
  1236		size = sizeof(wname);
  1237		if (rndis_filter_query_device(rndis_device, net_device,
  1238					      RNDIS_OID_GEN_FRIENDLY_NAME,
  1239					      wname, &size) != 0)
  1240			return;
  1241	
  1242		/* Convert Windows Unicode string to UTF-8 */
> 1243		len = ucs2_as_utf8(ifalias, wname, sizeof(ifalias));
  1244	
  1245		/* ignore the default value from host */
  1246		if (strcmp(ifalias, "Network Adapter") != 0)
  1247			dev_set_alias(net, ifalias, len);
  1248	}
  1249	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31389 bytes --]


* [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-19 21:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Andrew Morton, linux-mm, eric.dumazet, edumazet,
	netdev, linux-kernel, jasowang, virtualization, dm-devel,
	Vlastimil Babka
In-Reply-To: <20180419193554-mutt-send-email-mst@kernel.org>



On Thu, 19 Apr 2018, Michael S. Tsirkin wrote:

> Maybe make it conditional on CONFIG_DEBUG_SG too?
> Otherwise I think you just trigger a hard to debug memory corruption.

OK, here I resend the patch with CONFIG_DEBUG_SG. With CONFIG_DEBUG_SG, 
the DMA API will print a stacktrace where the misuse happened, so it's 
much easier to debug than with CONFIG_DEBUG_VM.

Fedora doesn't use CONFIG_DEBUG_SG in its default kernel (it only uses it 
in the debugging kernel), so users won't be hurt by this.



From: Mikulas Patocka <mpatocka@redhat.com>
Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.

Unfortunately, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code.

These bugs are hard to reproduce because kvmalloc falls back to vmalloc
only if memory is fragmented.

In order to detect these bugs reliably I submit this patch that changes
kvmalloc to always use vmalloc if CONFIG_DEBUG_SG is turned on.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/util.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
+++ linux-2.6/mm/util.c	2018-04-19 23:14:14.000000000 +0200
@@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
  */
 void *kvmalloc_node(size_t size, gfp_t flags, int node)
 {
+#ifndef CONFIG_DEBUG_SG
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 
@@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
 	 */
 	if (ret || size <= PAGE_SIZE)
 		return ret;
+#endif
 
 	return __vmalloc_node_flags_caller(size, node, flags,
 			__builtin_return_address(0));
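
As an illustration (not part of the patch): the decision the patched
kvmalloc_node() makes can be modelled in userspace as below. The names are
hypothetical, 'kmalloc_ok' stands in for whether the contiguous allocation
attempt succeeds, 'debug_force' stands in for building with the debug
option, and the vmalloc step is assumed to succeed.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_backend { TOY_NONE, TOY_KMALLOC, TOY_VMALLOC };

/* Which allocator would a kvmalloc-like helper end up using? */
enum toy_backend toy_kvmalloc_backend(size_t size, size_t page_size,
				      bool kmalloc_ok, bool debug_force)
{
	if (!debug_force) {
		/* Normal build: try the contiguous allocator first. */
		if (kmalloc_ok)
			return TOY_KMALLOC;
		/* No fallback for requests of a page or less. */
		if (size <= page_size)
			return TOY_NONE;
	}

	/* Fallback path, or the only path when the debug option is set. */
	return TOY_VMALLOC;
}
```

In the debug case every request, even a small one that kmalloc could serve,
comes back from the vmalloc path, so code that wrongly assumes kmalloc
memory fails on every run instead of only under fragmentation.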


* [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Martin KaFai Lau @ 2018-04-19 21:23 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team

This patch is not a proper fix and mainly serves as a basis for discussion.
It is based on net-next which I have been using to debug the issue.

The change that works around the issue is in ensure_mtu_is_adequate().
The other changes are the ripple effect on function arguments.

This bug was uncovered by one of our legacy services that
is still using ipvs for load balancing.  In that setup,
ipvs encapsulates the ipv6-tcp packet in another ipv6 hdr
before transmitting it out of eth0.

The problem is that the kernel stack could pass an skb (which
originated from a sys_write(tcp_fd)) to the driver with skb->len
bigger than the device MTU.  In one NIC setup (with gso and tso off)
that we are using, it upset the NIC/driver and caused the tx queue
to stall for tens of seconds, which is how it got uncovered.
(On the NIC side, the NIC firmware and driver have been fixed
to avoid this tx queue stall after seeing this skb).

On the kernel side, based on the commit log, this bug should have
been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").

Before commit 815d22e55b0e, ipv6_gso_segment() would just error
out (-EPROTONOSUPPORT) because the packet being transmitted is an ip6ip6.
Because of this error, the packet was never passed to the driver.  The TCP
stack then timed out and TCP mtu probing eventually kicked in to
lower the skb->len enough to avoid gso_segment.

After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
-> tcp6_gso_segment() segments the packet based on a mss
that does not account for the extra IPv6 hdr.

Here is a stack from the WARN_ON() that we added to the driver to
capture the issue:
[ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
...
[ 1129.016536] Call Trace:
[ 1129.021412]  ? skb_release_data+0xfc/0x120
[ 1129.029587]  ? kfree_skbmem+0x64/0x70
[ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
[ 1129.045262]  sch_direct_xmit+0x10f/0x280
[ 1129.053111]  __qdisc_run+0x223/0x5a0
[ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
[ 1129.068268]  dev_queue_xmit+0x10/0x20
[ 1129.075573]  ? dev_queue_xmit+0x10/0x20
[ 1129.083218]  ip6_finish_output2+0x2db/0x490
[ 1129.091573]  ip6_finish_output+0x125/0x190
[ 1129.099754]  ip6_output+0x5f/0x100
[ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
[ 1129.114212]  ip6_local_out+0x35/0x40
[ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
[ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
[ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
[ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
[ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
[ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
[ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
[ 1129.188849]  nf_hook_slow+0x3e/0xc0
[ 1129.195800]  ip6_xmit+0x30b/0x540
[ 1129.202421]  ? ac6_proc_exit+0x20/0x20
[ 1129.209909]  inet6_csk_xmit+0x82/0xd0
[ 1129.217207]  ? lock_timer_base+0x76/0xa0
[ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
[ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
[ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
[ 1129.250106]  tcp_push+0xde/0x100
[ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
[ 1129.264910]  tcp_sendmsg+0x2c/0x50
[ 1129.271703]  inet_sendmsg+0x31/0xb0
[ 1129.278672]  sock_write_iter+0xf8/0x110
[ 1129.286335]  new_sync_write+0xd9/0x120
[ 1129.293823]  vfs_write+0x18d/0x1e0
[ 1129.300614]  SyS_write+0x48/0xa0
[ 1129.307045]  do_syscall_64+0x69/0x1e0
[ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
...
[ 1129.648183] ---[ end trace 635061c9c300799e ]---
[ 1129.657407] skb->len:1554 MTU:1522

The tcp flow is connecting from the address ending ':27:0' to the ':85'.

[host-a] > ip -6 r show table local
local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium

[host-a] > ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
       valid_lft forever preferred_lft forever

[host-a] > cat /proc/net/ip_vs
TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
  -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
  -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
  -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7

[host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
send-something-long-here-to-trigger-the-bug

Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
can also sidestep the issue.  Like this:

> ip -6 r show table local
local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 11c416f3d6e3..88cc0d53ebce 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
 		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
 }
 
-static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
+static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
 					  int rt_mode,
 					  struct ip_vs_iphdr *ipvsh,
 					  struct sk_buff *skb, int mtu)
 {
+	struct netns_ipvs *ipvs = cp->ipvs;
+
 #ifdef CONFIG_IP_VS_IPV6
-	if (skb_af == AF_INET6) {
+	if (cp->af == AF_INET6) {
 		struct net *net = ipvs->net;
 
 		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
@@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 		}
 	}
 
+	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
+		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
+		unsigned short hdr_len = (th->doff << 2) +
+			skb_network_header_len(skb);
+
+		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
+			skb_decrease_gso_size(skb_shinfo(skb),
+					      skb_shinfo(skb)->gso_size -
+					      (mtu - hdr_len));
+	}
+
 	return true;
 }
 
@@ -305,13 +318,15 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 
 /* Get route to destination or remote server */
 static int
-__ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
+__ip_vs_get_out_rt(struct ip_vs_conn *cp, struct sk_buff *skb,
 		   struct ip_vs_dest *dest,
 		   __be32 daddr, int rt_mode, __be32 *ret_saddr,
 		   struct ip_vs_iphdr *ipvsh)
 {
+	struct netns_ipvs *ipvs = cp->ipvs;
 	struct net *net = ipvs->net;
 	struct ip_vs_dest_dst *dest_dst;
+	int skb_af = cp->af;
 	struct rtable *rt;			/* Route to the other host */
 	int mtu;
 	int local, noref = 1;
@@ -389,7 +404,7 @@ __ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
 		maybe_update_pmtu(skb_af, skb, mtu);
 	}
 
-	if (!ensure_mtu_is_adequate(ipvs, skb_af, rt_mode, ipvsh, skb, mtu))
+	if (!ensure_mtu_is_adequate(cp, rt_mode, ipvsh, skb, mtu))
 		goto err_put;
 
 	skb_dst_drop(skb);
@@ -455,13 +470,15 @@ __ip_vs_route_output_v6(struct net *net, struct in6_addr *daddr,
  * Get route to destination or remote server
  */
 static int
-__ip_vs_get_out_rt_v6(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
+__ip_vs_get_out_rt_v6(struct ip_vs_conn *cp, struct sk_buff *skb,
 		      struct ip_vs_dest *dest,
 		      struct in6_addr *daddr, struct in6_addr *ret_saddr,
 		      struct ip_vs_iphdr *ipvsh, int do_xfrm, int rt_mode)
 {
+	struct netns_ipvs *ipvs = cp->ipvs;
 	struct net *net = ipvs->net;
 	struct ip_vs_dest_dst *dest_dst;
+	int skb_af = cp->af;
 	struct rt6_info *rt;			/* Route to the other host */
 	struct dst_entry *dst;
 	int mtu;
@@ -541,7 +558,7 @@ __ip_vs_get_out_rt_v6(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
 		maybe_update_pmtu(skb_af, skb, mtu);
 	}
 
-	if (!ensure_mtu_is_adequate(ipvs, skb_af, rt_mode, ipvsh, skb, mtu))
+	if (!ensure_mtu_is_adequate(cp, rt_mode, ipvsh, skb, mtu))
 		goto err_put;
 
 	skb_dst_drop(skb);
@@ -679,7 +696,7 @@ ip_vs_bypass_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	if (__ip_vs_get_out_rt(cp->ipvs, cp->af, skb, NULL, iph->daddr,
+	if (__ip_vs_get_out_rt(cp, skb, NULL, iph->daddr,
 			       IP_VS_RT_MODE_NON_LOCAL, NULL, ipvsh) < 0)
 		goto tx_error;
 
@@ -708,7 +725,7 @@ ip_vs_bypass_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	if (__ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, NULL,
+	if (__ip_vs_get_out_rt_v6(cp, skb, NULL,
 				  &iph->daddr, NULL,
 				  ipvsh, 0, IP_VS_RT_MODE_NON_LOCAL) < 0)
 		goto tx_error;
@@ -753,7 +770,7 @@ ip_vs_nat_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	}
 
 	was_input = rt_is_input_route(skb_rtable(skb));
-	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
 				   IP_VS_RT_MODE_RDR, NULL, ipvsh);
@@ -839,7 +856,7 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 		IP_VS_DBG(10, "filled cport=%d\n", ntohs(*p));
 	}
 
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6,
 				      NULL, ipvsh, 0,
 				      IP_VS_RT_MODE_LOCAL |
@@ -1031,7 +1048,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt(ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
 				   IP_VS_RT_MODE_CONNECT |
@@ -1129,7 +1146,7 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6,
 				      &saddr, ipvsh, 1,
 				      IP_VS_RT_MODE_LOCAL |
@@ -1218,7 +1235,7 @@ ip_vs_dr_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
 				   IP_VS_RT_MODE_KNOWN_NH, NULL, ipvsh);
@@ -1252,7 +1269,7 @@ ip_vs_dr_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	EnterFunction(10);
 
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6,
 				      NULL, ipvsh, 0,
 				      IP_VS_RT_MODE_LOCAL |
@@ -1317,7 +1334,7 @@ ip_vs_icmp_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	rt_mode = (hooknum != NF_INET_FORWARD) ?
 		  IP_VS_RT_MODE_LOCAL | IP_VS_RT_MODE_NON_LOCAL |
 		  IP_VS_RT_MODE_RDR : IP_VS_RT_MODE_NON_LOCAL;
-	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip, rt_mode,
+	local = __ip_vs_get_out_rt(cp, skb, cp->dest, cp->daddr.ip, rt_mode,
 				   NULL, iph);
 	if (local < 0)
 		goto tx_error;
@@ -1406,7 +1423,7 @@ ip_vs_icmp_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 	rt_mode = (hooknum != NF_INET_FORWARD) ?
 		  IP_VS_RT_MODE_LOCAL | IP_VS_RT_MODE_NON_LOCAL |
 		  IP_VS_RT_MODE_RDR : IP_VS_RT_MODE_NON_LOCAL;
-	local = __ip_vs_get_out_rt_v6(cp->ipvs, cp->af, skb, cp->dest,
+	local = __ip_vs_get_out_rt_v6(cp, skb, cp->dest,
 				      &cp->daddr.in6, NULL, ipvsh, 0, rt_mode);
 	if (local < 0)
 		goto tx_error;
-- 
2.9.5
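
As an illustration (not part of the patch): the gso_size adjustment added to
ensure_mtu_is_adequate() amounts to clamping gso_size to what fits in the
MTU after the inner headers. The userspace sketch below shows the
arithmetic. toy_adjust_gso_size() is a hypothetical name, and the example
numbers are assumptions: an MTU of 1460 as in the route workaround above,
and 72 bytes of inner headers (a 40-byte IPv6 header plus a 32-byte TCP
header with options).

```c
#include <stdint.h>

/* Return the gso_size to use so that hdr_len + gso_size <= mtu.
 * A gso_size of 0 (not a GSO packet) is left untouched, as is any
 * value that already fits.
 */
uint16_t toy_adjust_gso_size(uint16_t gso_size, unsigned int mtu,
			     unsigned int hdr_len)
{
	if (gso_size && mtu > hdr_len && mtu - hdr_len < gso_size)
		return (uint16_t)(mtu - hdr_len);

	return gso_size;
}
```

Under those assumed numbers, a gso_size of 1428 is lowered to
1460 - 72 = 1388, while the same packet against an MTU of 1500 is left
alone.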


* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-19 21:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	linux-mm, edumazet, bhutchings, David Miller, Vlastimil Babka
In-Reply-To: <20180419124751.8884e516e99825d83da3d87a@linux-foundation.org>



On Thu, 19 Apr 2018, Andrew Morton wrote:

> On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> > 
> > Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> > 
> > These bugs are hard to reproduce because kvmalloc falls back to vmalloc
> > only if memory is fragmented.
> 
> Yes, that's nasty.
> 
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > 
> > ...
> >
> > --- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
> > +++ linux-2.6/mm/util.c	2018-04-18 16:00:43.000000000 +0200
> > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> >   */
> >  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> >  {
> > +#ifndef CONFIG_DEBUG_VM
> >  	gfp_t kmalloc_flags = flags;
> >  	void *ret;
> >  
> > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> >  	 */
> >  	if (ret || size <= PAGE_SIZE)
> >  		return ret;
> > +#endif
> >  
> >  	return __vmalloc_node_flags_caller(size, node, flags,
> >  			__builtin_return_address(0));
> 
> Well, it doesn't have to be done at compile-time, does it?  We could
> add a knob (in debugfs, presumably) which enables this at runtime. 
> That's far more user-friendly.

But who will turn it on in debugfs? It should be default for debugging 
kernels, so that users using them would report the error.

Conditioning it on CONFIG_DEBUG_SG is better than CONFIG_DEBUG_VM because it
will print a stacktrace where the incorrect use happened.

Mikulas


* Re: [PATCH net 5/5] nfp: remove false positive offloads in flower vxlan
From: Or Gerlitz @ 2018-04-19 21:11 UTC (permalink / raw)
  To: John Hurley; +Cc: Jakub Kicinski, Linux Netdev List, oss-drivers, Simon Horman
In-Reply-To: <CAK+XE=mvnVaLnJqV+5dm+XaN90YjmFF_gkFCxPBmXAgL_ts=ng@mail.gmail.com>

On Thu, Apr 19, 2018 at 1:31 AM, John Hurley <john.hurley@netronome.com> wrote:
> On Wed, Apr 18, 2018 at 7:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Wed, Apr 18, 2018 at 3:31 PM, John Hurley <john.hurley@netronome.com> wrote:
>>> On Wed, Apr 18, 2018 at 8:43 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>>> On Fri, Nov 17, 2017 at 4:06 AM, Jakub Kicinski
>>>> <jakub.kicinski@netronome.com> wrote:
>>>>> From: John Hurley <john.hurley@netronome.com>
>>>>>
>>>>> Pass information to the match offload on whether or not the repr is the
>>>>> ingress or egress dev. Only accept tunnel matches if repr is the egress dev.
>>>>>
>>>>> This means rules such as the following are successfully offloaded:
>>>>> tc .. add dev vxlan0 .. enc_dst_port 4789 .. action redirect dev nfp_p0
>>>>>
>>>>> While rules such as the following are rejected:
>>>>> tc .. add dev nfp_p0 .. enc_dst_port 4789 .. action redirect dev vxlan0
>>>>
>>>> cool
>>>>
>>>>
>>>>> Also reject non tunnel flows that are offloaded to an egress dev.
>>>>> Non tunnel matches assume that the offload dev is the ingress port and
>>>>> offload a match accordingly.
>>>>
>>>> not following on the "Also" here, see below
>>>>
>>>>
>>>>> diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>>>> index a0193e0c24a0..f5d73b83dcc2 100644
>>>>> --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>>>> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>>>> @@ -131,7 +131,8 @@ static bool nfp_flower_check_higher_than_mac(struct tc_cls_flower_offload *f)
>>>>>
>>>>>  static int
>>>>>  nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>>> -                               struct tc_cls_flower_offload *flow)
>>>>> +                               struct tc_cls_flower_offload *flow,
>>>>> +                               bool egress)
>>>>>  {
>>>>>         struct flow_dissector_key_basic *mask_basic = NULL;
>>>>>         struct flow_dissector_key_basic *key_basic = NULL;
>>>>> @@ -167,6 +168,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>>>                         skb_flow_dissector_target(flow->dissector,
>>>>>                                                   FLOW_DISSECTOR_KEY_ENC_CONTROL,
>>>>>                                                   flow->key);
>>>>> +               if (!egress)
>>>>> +                       return -EOPNOTSUPP;
>>>>> +
>>>>>                 if (mask_enc_ctl->addr_type != 0xffff ||
>>>>>                     enc_ctl->addr_type != FLOW_DISSECTOR_KEY_IPV4_ADDRS)
>>>>>                         return -EOPNOTSUPP;
>>>>> @@ -194,6 +198,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>>>
>>>>>                 key_layer |= NFP_FLOWER_LAYER_VXLAN;
>>>>>                 key_size += sizeof(struct nfp_flower_vxlan);
>>>>> +       } else if (egress) {
>>>>> +               /* Reject non tunnel matches offloaded to egress repr. */
>>>>> +               return -EOPNOTSUPP;
>>>>>         }
>>>>
>>>> with these two hunks we get: egress <- IFF -> encap match, right?
>>>>
>>>> (1) we can't offload the egress way if there isn't matching on encap headers
>>>> (2) we can't go the matching on encap headers way if we are not egress
>>>>
>>>
>>> yes, this is correct.
>>> With the block code and egdev offload, we do not have access to the
>>> ingress netdev when doing an offload.
>>> We need to use the encap headers (especially the enc_port) to
>>> distinguish the type of tunnel used and, therefore, require that the
>>> encap matches be present before offloading.
>>>
>>>> what other cases are rejected by this logic?
>>>>
>>>
>>> Yes, some other cases may be rejected (like veth mentioned below).
>>
>> my claim is that the veth case I mentioned below will not be rejected
>> if it has the matching on encap headers, and a wrong rule will be set
>> into hw, agree?
>>
>
> yes, unfortunately this is correct.
> Without having access to the ingress netdev we have to put as many
> restrictions as possible to ensure it is 'almost certainly' a given
> ingress netdev but extreme cases can bypass this.
>
>>> However, this is better than allowing rules to be incorrectly
>>> offloaded (as could have happened before these changes).
>>
>>> Currently, we are looking at offloading flows on other ingress devices
>>> such as bonds so this will require a change to the driver code here.
>>
>> for the ingress side, Jiri suggested that the slave devices (uplink reps),
>> will be just getting all the rules set on the bond, so I am not sure what
>> problem you see here... for decap it will be still vxlan --> vf rep and your
>> egress logic will allow it.
>>
>
> Yes, Jiri suggested on another thread that the bonds simply relay
> rules to their slaves.
> This will work fine if uplink reprs are enslaved by a bond before
> rules are added to it.
> It would also assume that uplink reprs are not removed from/added to
> the bond at later stages.
> Doing this would require flushing the bond rules or writing all
> existing rules to one of the slaves but not others.
> Do you have any opinions on handling such situations?

I have now looked at the thread you posted recently; there were some
responses on the matters you raised here. I guess we (MLNX) will get there
soon too.
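
The "egress <- IFF -> encap match" rule confirmed above can be sketched as
a small truth table. This is only an illustration: check_egress_encap()
and its two boolean parameters are hypothetical stand-ins, not the
driver's actual code in nfp_flower_calculate_key_layers().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the two hunks quoted above: a rule is
 * accepted only when "offloaded via an egress repr" and "matches on
 * encap headers" are both true or both false.
 */
static int check_egress_encap(bool egress, bool has_encap_match)
{
	/* (2) matching on encap headers is only allowed on egress */
	if (has_encap_match && !egress)
		return -EOPNOTSUPP;

	/* (1) an egress offload must match on encap headers */
	if (!has_encap_match && egress)
		return -EOPNOTSUPP;

	return 0;
}
```

Note that the check only sees the two flags. A rule whose ingress device
is not a tunnel netdev (the veth case mentioned in the thread) still
passes when both flags are set, which is the limitation acknowledged
above.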

* Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
From: Martin KaFai Lau @ 2018-04-19 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180419194034.GB3254@kernel.org>

On Thu, Apr 19, 2018 at 04:40:34PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, Apr 18, 2018 at 03:55:56PM -0700, Martin KaFai Lau escreveu:
> > This patch introduces BPF Type Format (BTF).
> > 
> > BTF (BPF Type Format) is the meta data format which describes
> > the data types of BPF program/map.  Hence, it basically focuses
> > on the C programming language which the modern BPF is primarily
> > using.  The first use case is to provide a generic pretty print
> > capability for a BPF map.
> > 
> > A modified pahole that can convert dwarf to BTF is here:
> > https://github.com/iamkafai/pahole/tree/btf
> > (Arnaldo, there are some BTF_KIND numbering changes on
> >  Apr 18th, d61426c1571)
> 
> Thanks for letting me know, I'm starting to look at this,
Thanks for reviewing.  Feel free to comment directly on the github diff.
Also, I think it may make sense to wait for the kernel pieces to land
first.


> 
> - Arnaldo
>  
> > Please see individual patch for details.
> > 
> > v5:
> > - Remove BTF_KIND_FLOAT and BTF_KIND_FUNC which are not
> >   currently used.  They can be added in the future.
> >   Some bpf_df_xxx() are removed together.
> > - Add comment in patch 7 to clarify that the new bpffs_map_fops
> >   should not be extended further.
> > 
> > v4:
> > - Fix warning (remove unneeded semicolon)
> > - Remove a redundant variable (nr_bytes) from btf_int_check_meta() in
> >   patch 1.  Caught by W=1.
> > 
> > v3:
> > - Rebase to bpf-next
> > - Fix sparse warning (by adding static)
> > - Add BTF header logging: btf_verifier_log_hdr()
> > - Fix the alignment test on btf->type_off
> > - Add tests for the BTF header
> > - Lower the max BTF size to 16MB.  It should be enough
> >   for some time.  We could raise it later if it would
> >   be needed.
> > 
> > v2:
> > - Use kvfree where needed in patch 1 and 2
> > - Also consider BTF_INT_OFFSET() in the btf_int_check_meta()
> >   in patch 1
> > - Fix an incorrect goto target in map_create() during
> >   the btf-error-path in patch 7
> > - re-org some local vars to keep the rev xmas tree in btf.c
> > 
> > Martin KaFai Lau (10):
> >   bpf: btf: Introduce BPF Type Format (BTF)
> >   bpf: btf: Validate type reference
> >   bpf: btf: Check members of struct/union
> >   bpf: btf: Add pretty print capability for data with BTF type info
> >   bpf: btf: Add BPF_BTF_LOAD command
> >   bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd
> >   bpf: btf: Add pretty print support to the basic arraymap
> >   bpf: btf: Sync bpf.h and btf.h to tools/
> >   bpf: btf: Add BTF support to libbpf
> >   bpf: btf: Add BTF tests
> > 
> >  include/linux/bpf.h                          |   20 +-
> >  include/linux/btf.h                          |   48 +
> >  include/uapi/linux/bpf.h                     |   12 +
> >  include/uapi/linux/btf.h                     |  130 ++
> >  kernel/bpf/Makefile                          |    1 +
> >  kernel/bpf/arraymap.c                        |   50 +
> >  kernel/bpf/btf.c                             | 2064 ++++++++++++++++++++++++++
> >  kernel/bpf/inode.c                           |  156 +-
> >  kernel/bpf/syscall.c                         |   51 +-
> >  tools/include/uapi/linux/bpf.h               |   12 +
> >  tools/include/uapi/linux/btf.h               |  130 ++
> >  tools/lib/bpf/Build                          |    2 +-
> >  tools/lib/bpf/bpf.c                          |   92 +-
> >  tools/lib/bpf/bpf.h                          |   16 +
> >  tools/lib/bpf/btf.c                          |  374 +++++
> >  tools/lib/bpf/btf.h                          |   22 +
> >  tools/lib/bpf/libbpf.c                       |  148 +-
> >  tools/lib/bpf/libbpf.h                       |    3 +
> >  tools/testing/selftests/bpf/Makefile         |   26 +-
> >  tools/testing/selftests/bpf/test_btf.c       | 1669 +++++++++++++++++++++
> >  tools/testing/selftests/bpf/test_btf_haskv.c |   48 +
> >  tools/testing/selftests/bpf/test_btf_nokv.c  |   43 +
> >  22 files changed, 5076 insertions(+), 41 deletions(-)
> >  create mode 100644 include/linux/btf.h
> >  create mode 100644 include/uapi/linux/btf.h
> >  create mode 100644 kernel/bpf/btf.c
> >  create mode 100644 tools/include/uapi/linux/btf.h
> >  create mode 100644 tools/lib/bpf/btf.c
> >  create mode 100644 tools/lib/bpf/btf.h
> >  create mode 100644 tools/testing/selftests/bpf/test_btf.c
> >  create mode 100644 tools/testing/selftests/bpf/test_btf_haskv.c
> >  create mode 100644 tools/testing/selftests/bpf/test_btf_nokv.c
> > 
> > -- 
> > 2.9.5

* Re: [PATCH net-next 0/4] tracking TCP data delivery and ECN stats
From: Yuchung Cheng @ 2018-04-19 20:52 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh
In-Reply-To: <20180419.130710.1404241211516484963.davem@davemloft.net>

On Thu, Apr 19, 2018 at 10:07 AM, David Miller <davem@davemloft.net> wrote:
>
> From: Yuchung Cheng <ycheng@google.com>
> Date: Tue, 17 Apr 2018 23:18:45 -0700
>
> > This patch series improves tracking of the data delivery status
> >   1. minor improvement on SYN data
> >   2. accounting bytes delivered with CE marks
> >   3. exporting the delivery stats to applications
> >
> > so that users can get a better sense of TCP performance at per host,
> > per connection, and even per application message level.
>
> Definitely useful, so series applied.
Thanks.

The TCP socket is getting bigger and bigger :-( I am cooking a patch set
to simplify loss recovery that should help conserve space.

>
> But it is not lost upon me that slowly over time tcp sockets are
> bloating quite a bit...

* Re: [PATCH net 0/1] net/smc: shutdown fix
From: David Miller @ 2018-04-19 20:39 UTC (permalink / raw)
  To: ubraun
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, stephen,
	ubraun
In-Reply-To: <20180419135640.2907-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Thu, 19 Apr 2018 15:56:39 +0200

> This patch fixes the problem and is a candidate for -stable.

Ok, queued up.

* Re: [PATCH net 1/1] net/smc: fix shutdown in state SMC_LISTEN
From: David Miller @ 2018-04-19 20:39 UTC (permalink / raw)
  To: ubraun
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, stephen,
	ubraun
In-Reply-To: <20180419135640.2907-2-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Thu, 19 Apr 2018 15:56:40 +0200

> From: Ursula Braun <ubraun@linux.vnet.ibm.com>
> 
> Calling shutdown with SHUT_RD and SHUT_RDWR for a listening SMC socket
> crashes, because
>    commit 127f49705823 ("net/smc: release clcsock from tcp_listen_worker")
> releases the internal clcsock in smc_close_active() and sets smc->clcsock
> to NULL.
> For SHUT_RD the smc_close_active() call is removed.
> For SHUT_RDWR the kernel_sock_shutdown() call is omitted, since the
> clcsock is already released.
> 
> Fixes: 127f49705823 ("net/smc: release clcsock from tcp_listen_worker")
> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
> Reported-by: Stephen Hemminger <stephen@networkplumber.org>

Applied, thank you.

* Re: [PATCH net] bnxt_en: Fix memory fault in bnxt_ethtool_init()
From: David Miller @ 2018-04-19 20:35 UTC (permalink / raw)
  To: michael.chan; +Cc: netdev, kernel-team
In-Reply-To: <1524122176-13511-1-git-send-email-michael.chan@broadcom.com>

From: Michael Chan <michael.chan@broadcom.com>
Date: Thu, 19 Apr 2018 03:16:16 -0400

> From: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
> 
> In some firmware images, the length of BNX_DIR_TYPE_PKG_LOG nvram type
> could be greater than the fixed buffer length of 4096 bytes allocated by
> the driver.  This was causing HWRM_NVM_READ to copy more data to the buffer
> than the allocated size, causing general protection fault.
> 
> Fix the issue by allocating the exact buffer length returned by
> HWRM_NVM_FIND_DIR_ENTRY, instead of 4096.  Move the kzalloc() call
> into the bnxt_get_pkgver() function.
> 
> Fixes: 3ebf6f0a09a2 ("bnxt_en: Add installed-package firmware version reporting via Ethtool GDRVINFO")
> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>

Applied, thanks Michael.
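
The fix described in the commit message (size the buffer from the length
the directory lookup reports, rather than from a fixed 4096) can be
sketched in userspace C. Everything below is hypothetical: struct
nvm_entry, overflows_fixed_buf() and read_entry_exact() are illustrative
names and do not correspond to the bnxt_en driver or to HWRM calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FIXED_BUF_LEN 4096	/* the old fixed size named in the commit message */

/* Hypothetical stand-in for an nvram directory entry. */
struct nvm_entry {
	const unsigned char *data;
	size_t len;		/* length reported by the directory lookup */
};

/* True when copying the entry into the old fixed buffer would overrun it. */
static int overflows_fixed_buf(const struct nvm_entry *e)
{
	return e->len > FIXED_BUF_LEN;
}

/*
 * Allocate exactly what the entry needs (plus a terminating NUL) and copy
 * it. Returns NULL on allocation failure; the caller frees the result.
 */
static char *read_entry_exact(const struct nvm_entry *e)
{
	char *buf = calloc(1, e->len + 1);

	if (!buf)
		return NULL;
	memcpy(buf, e->data, e->len);
	return buf;
}
```

With a 5000-byte entry, overflows_fixed_buf() reports the overrun the old
code would have hit, while read_entry_exact() copies the whole entry
safely.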

* Re: [PATCH] net: hns: Avoid action name truncation
From: David Miller @ 2018-04-19 20:30 UTC (permalink / raw)
  To: dann.frazier; +Cc: yisen.zhuang, salil.mehta, netdev, linux-kernel, linyunsheng
In-Reply-To: <20180419035541.6318-1-dann.frazier@canonical.com>

From: dann frazier <dann.frazier@canonical.com>
Date: Wed, 18 Apr 2018 21:55:41 -0600

> When longer interface names are used, the action names exposed in
> /proc/interrupts and /proc/irq/* may be truncated. For example, when
> using the predictable name algorithm in systemd on a HiSilicon D05,
> I see:
> 
>   ubuntu@d05-3:~$  grep enahisic2i0-tx /proc/interrupts | sed 's/.* //'
>   enahisic2i0-tx0
>   enahisic2i0-tx1
>   [...]
>   enahisic2i0-tx8
>   enahisic2i0-tx9
>   enahisic2i0-tx1
>   enahisic2i0-tx1
>   enahisic2i0-tx1
>   enahisic2i0-tx1
>   enahisic2i0-tx1
>   enahisic2i0-tx1
> 
> Increase the max ring name length to allow for an interface name
> of IFNAMSIZ. After this change, I now see:
> 
>   $ grep enahisic2i0-tx /proc/interrupts | sed 's/.* //'
>   enahisic2i0-tx0
>   enahisic2i0-tx1
>   enahisic2i0-tx2
>   [...]
>   enahisic2i0-tx8
>   enahisic2i0-tx9
>   enahisic2i0-tx10
>   enahisic2i0-tx11
>   enahisic2i0-tx12
>   enahisic2i0-tx13
>   enahisic2i0-tx14
>   enahisic2i0-tx15
> 
> Signed-off-by: dann frazier <dann.frazier@canonical.com>

Applied, thank you.
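
The truncation in the quoted output can be reproduced in userspace with
snprintf(). This is a sketch under assumptions: the 16-byte limit is
inferred from the truncated names above (15 visible characters plus a
NUL), the 32-byte "wide" size is chosen only for illustration and is not
the value used by the patch, and format_ring_name() is a hypothetical
helper rather than the driver's code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define OLD_RING_NAME_LEN  16	/* assumed: inferred from the truncated output */
#define WIDE_RING_NAME_LEN 32	/* illustrative only, not the patch's value */

/*
 * Format "<ifname>-tx<idx>" into buf. Returns 1 if the whole name fit,
 * 0 if snprintf() had to truncate it (or failed).
 */
static int format_ring_name(char *buf, size_t len, const char *ifname, int idx)
{
	int n = snprintf(buf, len, "%s-tx%d", ifname, idx);

	return n >= 0 && (size_t)n < len;
}
```

With the 16-byte buffer, "enahisic2i0-tx10" loses its last digit and
becomes indistinguishable from "enahisic2i0-tx1", which matches the
repeated entries in the first listing.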

* Re: [PATCH net-next 00/11] Modernize mdio-gpio
From: Andrew Lunn @ 2018-04-19 20:20 UTC (permalink / raw)
  To: Linus Walleij; +Cc: David Miller, netdev, Florian Fainelli
In-Reply-To: <CACRpkdaXuJHMBS5Vodrj4CzgB7MwiJp_gXJuQrjEOLF4ARPydQ@mail.gmail.com>

On Thu, Apr 19, 2018 at 09:52:09PM +0200, Linus Walleij wrote:
> On Thu, Apr 19, 2018 at 1:02 AM, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > This patchset is inspired by a previous version by Linus Walleij
> >
> > It reworks the mdio-gpio code to make use of gpio descriptors instead
> > of gpio numbers. However compared to the previous version, it retains
> > support for platform devices. It does however remove the platform_data
> > header file. The needed GPIOs are now passed by making use of a gpiod
> > lookup table. e.g:
> 
> Looks good to me, but wasn't this what Florian was NACKing?

Hi Linus

At the time, I don't think either Florian or I knew about gpiod lookup
tables. It was only when I got deep into this patchset that I found them.

> I thought he was going to add some x86 MDIO using platform data,
> and then I suppose he wanted to use something more than some
> GPIO descriptors, maybe IRQ etc (who knows)?

I now have said x86 MDIO device, connecting to an Ethernet switch. It
works :-)

      Andrew

* Re: [PATCH v4 00/10] New network driver for Amiga X-Surf 100 (m68k)
From: David Miller @ 2018-04-19 20:11 UTC (permalink / raw)
  To: schmitzmic
  Cc: netdev, andrew, fthain, geert, f.fainelli, linux-m68k,
	Michael.Karcher
In-Reply-To: <1524103526-12240-1-git-send-email-schmitzmic@gmail.com>

From: Michael Schmitz <schmitzmic@gmail.com>
Date: Thu, 19 Apr 2018 14:05:17 +1200

> This patch series adds support for the Individual Computers X-Surf 100
> network card for m68k Amiga, a network adapter based on the AX88796 chip set.

Series applied, thank you.

* Re: [Xen-devel] [PATCH] xen-netfront: Fix hang on device removal
From: Simon Gaiser @ 2018-04-19 20:09 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: netdev, xen-devel, Eduardo Otubo, Juergen Gross, Boris Ostrovsky,
	open list
In-Reply-To: <CAKf6xpuusCJ0DMJ_G3hG5pd9vd8rUBH=VPa4TCvaoptGto-=Zw@mail.gmail.com>


Jason Andryuk:
> On Thu, Apr 19, 2018 at 2:10 PM, Simon Gaiser
> <simon@invisiblethingslab.com> wrote:
>> Jason Andryuk:
>>> A toolstack may delete the vif frontend and backend xenstore entries
>>> while xen-netfront is in the removal code path.  In that case, the
>>> checks for xenbus_read_driver_state would return XenbusStateUnknown, and
>>> xennet_remove would hang indefinitely.  This hang prevents system
>>> shutdown.
>>>
>>> xennet_remove must be able to handle XenbusStateUnknown, and
>>> netback_changed must also wake up the wake_queue for that state as well.
>>>
>>> Fixes: 5b5971df3bc2 ("xen-netfront: remove warning when unloading module")
>>
>> I think this should go into stable since AFAIK the hanging network
>> device can only be fixed by rebooting the guest. AFAICS this affects all
>> 4.* branches since 5b5971df3bc2 got backported to them.
>>
>> Upstream commit c2d2e6738a209f0f9dffa2dc8e7292fc45360d61.
> 
> Simon,
> 
> Yes, I agree.  I actually submitted the request to stable earlier
> today, so hopefully it gets added soon.

Ok, great. (I checked the stable patch queue, but didn't check the
mailing list archive).

> Have you experienced this hang?

Yes, it's affecting the kernel shipped by Qubes OS (see [1]).

Thanks, Simon.

[1]: https://github.com/QubesOS/qubes-issues/issues/3657


* Re: [PATCH net-next 01/11] net: phy_ mdio-gpio: Fixup , which should be ;
From: David Miller @ 2018-04-19 20:00 UTC (permalink / raw)
  To: andrew; +Cc: netdev, f.fainelli, linus.walleij
In-Reply-To: <1524092579-15625-2-git-send-email-andrew@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Thu, 19 Apr 2018 01:02:49 +0200

> @@ -161,7 +161,7 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
>  	if (!new_bus)
>  		goto out;
>  
> -	new_bus->name = "GPIO Bitbanged MDIO",
> +	new_bus->name = "GPIO Bitbanged MDIO";

Would be so great to find a way to automatically detect these somehow.

Yes, they are useful when controlling evaluation in weird ways in
macros etc.  However most of the time if a ',' is used in a context
where a ';' also works, it's unintentional.
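
To illustrate why a stray ',' is worth catching: in the hunk above the
comma happens to be harmless, but after an unbraced `if` it silently pulls
the next statement into the conditional. The two helpers below are
hypothetical examples written for this note, not code from the patch.

```c
#include <assert.h>

/* Typo version: the ',' makes "*b = 2" part of the if body. */
static void set_with_comma(int cond, int *a, int *b)
{
	if (cond)
		*a = 1,
	*b = 2;		/* looks independent, but only runs when cond is true */
}

/* Intended version: "*b = 2" always runs. */
static void set_with_semicolon(int cond, int *a, int *b)
{
	if (cond)
		*a = 1;
	*b = 2;
}
```

Both functions compile cleanly as valid C, which is why such typos tend
to survive review.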

* Re: [PATCH net-next 00/11] Modernize mdio-gpio
From: David Miller @ 2018-04-19 19:59 UTC (permalink / raw)
  To: andrew; +Cc: netdev, f.fainelli, linus.walleij
In-Reply-To: <1524092579-15625-1-git-send-email-andrew@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Thu, 19 Apr 2018 01:02:48 +0200

> This patchset is inspired by a previous version by Linus Walleij
> 
> It reworks the mdio-gpio code to make use of gpio descriptors instead
> of gpio numbers. However compared to the previous version, it retains
> support for platform devices. It does however remove the platform_data
> header file. The needed GPIOs are now passed by making use of a gpiod
> lookup table. e.g:
> 
> static struct gpiod_lookup_table zii_scu_mdio_gpiod_table = {
> 	.dev_id = "mdio-gpio.0",
> 	.table = {
> 		GPIO_LOOKUP_IDX("gpio_ich", 17, NULL, MDIO_GPIO_MDC,
> 				GPIO_ACTIVE_HIGH),
> 		GPIO_LOOKUP_IDX("gpio_ich", 2, NULL, MDIO_GPIO_MDIO,
> 				GPIO_ACTIVE_HIGH),
> 		GPIO_LOOKUP_IDX("gpio_ich", 21, NULL, MDIO_GPIO_MDO,
> 				GPIO_ACTIVE_LOW),
> 	},
> };

Nice set of simplifications, applied.

* Re: [PATCH net-next 00/11] Modernize mdio-gpio
From: Linus Walleij @ 2018-04-19 19:52 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: David Miller, netdev, Florian Fainelli
In-Reply-To: <1524092579-15625-1-git-send-email-andrew@lunn.ch>

On Thu, Apr 19, 2018 at 1:02 AM, Andrew Lunn <andrew@lunn.ch> wrote:

> This patchset is inspired by a previous version by Linus Walleij
>
> It reworks the mdio-gpio code to make use of gpio descriptors instead
> of gpio numbers. However compared to the previous version, it retains
> support for platform devices. It does however remove the platform_data
> header file. The needed GPIOs are now passed by making use of a gpiod
> lookup table. e.g:

Looks good to me, but wasn't this what Florian was NACKing?

I thought he was going to add some x86 MDIO using platform data,
and then I suppose he wanted to use something more than some
GPIO descriptors, maybe IRQ etc (who knows)?

If he only needs to put in GPIO descriptors then this is fine of
course.

Anyway the series has a solid:
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Thanks,
Linus Walleij

* [PATCH v2] net: ethernet: ti: cpsw: fix tx vlan priority mapping
From: Ivan Khoronzhuk @ 2018-04-19 19:49 UTC (permalink / raw)
  To: grygorii.strashko
  Cc: davem, linux-omap, netdev, linux-kernel, Ivan Khoronzhuk

CPDMA_TX_PRIORITY_MAP is in reality the VLAN PCP field priority mapping
register, and it basically replaces the VLAN PCP field for tagged packets.
So set it to a 1:1 mapping. Otherwise it causes unexpected changes in
egress VLAN-tagged packets, such as prio 2 -> prio 5.

Fixes: e05107e6b747 ("net: ethernet: ti: cpsw: add multi queue support")
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
Based on net/master

 drivers/net/ethernet/ti/cpsw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 3037127..74f8284 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -129,7 +129,7 @@ do {								\
 
 #define RX_PRIORITY_MAPPING	0x76543210
 #define TX_PRIORITY_MAPPING	0x33221100
-#define CPDMA_TX_PRIORITY_MAP	0x01234567
+#define CPDMA_TX_PRIORITY_MAP	0x76543210
 
 #define CPSW_VLAN_AWARE		BIT(1)
 #define CPSW_RX_VLAN_ENCAP	BIT(2)
-- 
2.7.4
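
The effect of the two register values can be checked with a small lookup
helper. This is a sketch under an assumption: one 4-bit field per
priority, with priority 0 in the lowest nibble. That layout is not
spelled out in the patch, but it is consistent with the "prio 2 -> prio 5"
example in the commit message. map_prio() is a hypothetical helper, not
driver code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout: nibble N of the register holds the priority that
 * incoming priority N is mapped to (priority 0 in bits 3:0).
 */
static unsigned int map_prio(uint32_t map, unsigned int prio)
{
	return (map >> (4 * (prio & 7))) & 0xf;
}
```

Under that assumption the old value 0x01234567 reverses the priorities
(2 becomes 5, 0 becomes 7), while the new value 0x76543210 maps every
priority to itself.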

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Andrew Morton @ 2018-04-19 19:47 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: David Miller, linux-mm, eric.dumazet, edumazet, bhutchings,
	netdev, linux-kernel, mst, jasowang, virtualization, dm-devel,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804191207380.31175@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:

> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
> 
> Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
> 
> These bugs are hard to reproduce because kvmalloc falls back to vmalloc
> only if memory is fragmented.

Yes, that's nasty.

> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> 
> ...
>
> --- linux-2.6.orig/mm/util.c	2018-04-18 15:46:23.000000000 +0200
> +++ linux-2.6/mm/util.c	2018-04-18 16:00:43.000000000 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
> +#ifndef CONFIG_DEBUG_VM
>  	gfp_t kmalloc_flags = flags;
>  	void *ret;
>  
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
>  	 */
>  	if (ret || size <= PAGE_SIZE)
>  		return ret;
> +#endif
>  
>  	return __vmalloc_node_flags_caller(size, node, flags,
>  			__builtin_return_address(0));

Well, it doesn't have to be done at compile-time, does it?  We could
add a knob (in debugfs, presumably) which enables this at runtime. 
That's far more user-friendly.
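
The runtime-knob idea can be sketched as a userspace model of the
decision kvmalloc_node() makes. This is only an illustration: the
force_vmalloc flag, pick_backend() and SKETCH_PAGE_SIZE are hypothetical
names, the debugfs wiring is not shown, and no real kernel allocator is
called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096	/* stand-in for PAGE_SIZE */

enum backend { BACKEND_KMALLOC, BACKEND_VMALLOC };

/* Hypothetical runtime knob; in the kernel it could be set via debugfs. */
static bool force_vmalloc;

/*
 * Model of which allocator's result kvmalloc would return.
 * kmalloc_ok says whether the kmalloc attempt would have succeeded.
 */
static enum backend pick_backend(size_t size, bool kmalloc_ok)
{
	/* Debug mode: skip kmalloc entirely so misuse shows up every time. */
	if (force_vmalloc)
		return BACKEND_VMALLOC;

	/* Normal mode: keep the kmalloc result if it worked or size is small. */
	if (kmalloc_ok || size <= SKETCH_PAGE_SIZE)
		return BACKEND_KMALLOC;

	return BACKEND_VMALLOC;
}
```

Compared with the #ifndef CONFIG_DEBUG_VM approach in the patch, a flag
like this could be flipped on a running system without rebuilding the
kernel.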
