Netdev List
 help / color / mirror / Atom feed
* DSA
From: Dave Richards @ 2018-04-27 18:10 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hello,

I am building a prototype for a new product based on a Lanner, Inc. embedded PC.  It is an Intel Celeron-based system with two host I210 GbE chips connected to 2 MV88E6172 chips (one NIC to one switch).  Everything appears to show up hardware-wise.  My question is, what is the next step?  How does DSA know which NICs are intended to be masters?  Is this supposed to be auto-detected or is this knowledge supposed to be communicated explicitly.  Reading through the DSA driver code I see that there is a check of the OF property list for the device for a "label"/"cpu" property/value pair that needs to be present.  Who sets this and when?

I'm sorry for this basic question, but Google has not enlightened me.

	Thanks!

	Dave


Dave Richards
VP Software Engineering
Impinj, Inc
400 Fairview Ave N. #1200
Seattle, WA 
O: (206) 812-9863

^ permalink raw reply

* Re: [PATCH bpf-next v2 00/15] Introducing AF_XDP support
From: Björn Töpel @ 2018-04-27 18:12 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <CAF=yD-JUezRSGP_6f=WDHqcznFEW4K95hd6qVK8BaAr6Q5VR5Q@mail.gmail.com>

2018-04-27 19:16 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Fri, Apr 27, 2018 at 8:17 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> This patch set introduces a new address family called AF_XDP that is
>> optimized for high performance packet processing and, in upcoming
>> patch sets, zero-copy semantics. In this v2 version, we have removed
>> all zero-copy related code in order to make it smaller, simpler and
>> hopefully more review friendly. This patch set only supports copy-mode
>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>> work has already been accepted. We will publish our zero-copy support
>> for RX and TX on top of his patch sets at a later point in time.
>
>> Changes from V1:
>>
>> * Fixes to bugs spotted by Will in his review
>> * Implemented the performance otimization to BPF_MAP_TYPE_XSKMAP
>>   suggested by Will
>
> An xsk may only exist in one map at a time. Is this somehow assured?
>

Actually this is *not* the case. An xsk may reside in many maps, and
multiple times in the same map. So it's not assured at all. :-)

The restriction for an xsk is per netdev/queue/umem (and) the napi
context guarantee the SPSC constraint.

For the record, your XSKMAP suggestion gave ~100kpps in the ingress
path! Very nice!

>> * Refactored packet_direct_xmit to become a common function
>>   in core/dev.c as suggested by Will
>> * Added documentation as suggested by Jesper
>> * Proper page unpinning as suggested by MST
>> * Some minor code cleanups
>
> Everything else looks great to me. If the above is correct (or corrected)
>
> Acked-by: Willem de Bruijn <willemb@google.com>
>

Thanks for the in-depth review, Will! Very much appreciated! (bow)


Björn

> I did not read everything again, but applied both patchsets on top of
> bpf-next to do a diff of diffs. In case others find it useful:
>
>   https://github.com/wdebruij/linux/tree/bpf-next-afxdp-v1
>   https://github.com/wdebruij/linux/tree/bpf-next-afxdp-v2

^ permalink raw reply

* [PATCH 0/3] Clean up users of skb_tx_hash and __skb_tx_hash
From: Alexander Duyck @ 2018-04-27 18:06 UTC (permalink / raw)
  To: netdev, davem
  Cc: linux-rdma, dennis.dalessandro, niranjana.vishwanathapura, tariqt

I am in the process of doing some work to try and enable macvlan Tx queue
selection without using ndo_select_queue. As a part of that I will likely
need to make changes to skb_tx_hash. As such this is a clean up or refactor
of the two spots where he function has been used. In both cases it didn't
really seem like the function was being used correctly so I have updated
both code paths to not make use of the function.

My current development environment doesn't have an mlx4 or OPA vnic
available so the changes to those have been build tested only.

---

Alexander Duyck (3):
      opa_vnic: Just use skb_get_hash instead of skb_tx_hash
      mlx4: Don't bother using skb_tx_hash in mlx4_en_select_queue
      net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash


 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c   |   21 ++++++++++----------
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h    |    2 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |    2 +-
 include/linux/netdevice.h                          |   13 ------------
 net/core/dev.c                                     |   10 ++++------
 6 files changed, 17 insertions(+), 33 deletions(-)

^ permalink raw reply

* [PATCH 1/3] opa_vnic: Just use skb_get_hash instead of skb_tx_hash
From: Alexander Duyck @ 2018-04-27 18:06 UTC (permalink / raw)
  To: netdev, davem
  Cc: linux-rdma, dennis.dalessandro, niranjana.vishwanathapura, tariqt
In-Reply-To: <20180427180142.4883.96259.stgit@ahduyck-green-test.jf.intel.com>

This patch is meant to clean up how the opa_vnic is obtaining entropy from
Tx packets.

The code as it was written was claiming to get 16 bits of hash, but from
what I can tell it was only ever actually getting 14 bits as it was limited
to 0 - (2^15 - 1). It then was folding the result to get a 8 bit value for
entropy.

Instead of throwing away all that input I am cutting out the middle man and
instead having the code call skb_get_hash directly and then folding the 32
bit value into a 8 bit value using a pair of shifts and XOR operations.

Execution wise this new approach should provide more entropy and be faster
since we are bypassing the reciprocal multiplication to reduce the 32b
value to 16b and instead just using a shift/XOR combination.

In addition we can drop the unneeded adapter value from the call to get the
entropy since the netdev itself isn't even needed.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c   |   21 ++++++++++----------
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h    |    2 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  |    2 +-
 3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
index 4be3aef..267da82 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
@@ -443,17 +443,16 @@ static u8 opa_vnic_get_rc(struct __opa_veswport_info *info,
 }
 
 /* opa_vnic_calc_entropy - calculate the packet entropy */
-u8 opa_vnic_calc_entropy(struct opa_vnic_adapter *adapter, struct sk_buff *skb)
+u8 opa_vnic_calc_entropy(struct sk_buff *skb)
 {
-	u16 hash16;
-
-	/*
-	 * Get flow based 16-bit hash and then XOR the upper and lower bytes
-	 * to get the entropy.
-	 * __skb_tx_hash limits qcount to 16 bits. Hence, get 15-bit hash.
-	 */
-	hash16 = __skb_tx_hash(adapter->netdev, skb, BIT(15));
-	return (u8)((hash16 >> 8) ^ (hash16 & 0xff));
+	u32 hash = skb_get_hash(skb);
+
+	/* store XOR of all bytes in lower 8 bits */
+	hash ^= hash >> 8;
+	hash ^= hash >> 16;
+
+	/* return lower 8 bits as entropy */
+	return (u8)(hash & 0xFF);
 }
 
 /* opa_vnic_get_def_port - get default port based on entropy */
@@ -490,7 +489,7 @@ void opa_vnic_encap_skb(struct opa_vnic_adapter *adapter, struct sk_buff *skb)
 
 	hdr = skb_push(skb, OPA_VNIC_HDR_LEN);
 
-	entropy = opa_vnic_calc_entropy(adapter, skb);
+	entropy = opa_vnic_calc_entropy(skb);
 	def_port = opa_vnic_get_def_port(adapter, entropy);
 	len = opa_vnic_wire_length(skb);
 	dlid = opa_vnic_get_dlid(adapter, skb, def_port);
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
index afd95f4..43ac61f 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
@@ -299,7 +299,7 @@ struct opa_vnic_adapter *opa_vnic_add_netdev(struct ib_device *ibdev,
 void opa_vnic_rem_netdev(struct opa_vnic_adapter *adapter);
 void opa_vnic_encap_skb(struct opa_vnic_adapter *adapter, struct sk_buff *skb);
 u8 opa_vnic_get_vl(struct opa_vnic_adapter *adapter, struct sk_buff *skb);
-u8 opa_vnic_calc_entropy(struct opa_vnic_adapter *adapter, struct sk_buff *skb);
+u8 opa_vnic_calc_entropy(struct sk_buff *skb);
 void opa_vnic_process_vema_config(struct opa_vnic_adapter *adapter);
 void opa_vnic_release_mac_tbl(struct opa_vnic_adapter *adapter);
 void opa_vnic_query_mac_tbl(struct opa_vnic_adapter *adapter,
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
index ce57e0f..0c8aec6 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
@@ -104,7 +104,7 @@ static u16 opa_vnic_select_queue(struct net_device *netdev, struct sk_buff *skb,
 
 	/* pass entropy and vl as metadata in skb */
 	mdata = skb_push(skb, sizeof(*mdata));
-	mdata->entropy =  opa_vnic_calc_entropy(adapter, skb);
+	mdata->entropy = opa_vnic_calc_entropy(skb);
 	mdata->vl = opa_vnic_get_vl(adapter, skb);
 	rc = adapter->rn_ops->ndo_select_queue(netdev, skb,
 					       accel_priv, fallback);

^ permalink raw reply related

* [PATCH 2/3] mlx4: Don't bother using skb_tx_hash in mlx4_en_select_queue
From: Alexander Duyck @ 2018-04-27 18:06 UTC (permalink / raw)
  To: netdev, davem
  Cc: linux-rdma, dennis.dalessandro, niranjana.vishwanathapura, tariqt
In-Reply-To: <20180427180142.4883.96259.stgit@ahduyck-green-test.jf.intel.com>

The code in the fallback path has supported XDP in conjunction with the Tx
traffic classification for TCs for over a year now. So instead of just
calling skb_tx_hash for every packet we are better off using the fallback
since that will record the Tx queue to the socket and then that can be used
instead of having to recompute the hash every time.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 6b68537..0227786 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -694,7 +694,7 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 	u16 rings_p_up = priv->num_tx_rings_p_up;
 
 	if (netdev_get_num_tc(dev))
-		return skb_tx_hash(dev, skb);
+		return fallback(dev, skb);
 
 	return fallback(dev, skb) % rings_p_up;
 }

^ permalink raw reply related

* Re: [PATCH net-next 03/13] sctp: remove an if() that is always true
From: Marcelo Ricardo Leitner @ 2018-04-27 18:13 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, linux-sctp, Vlad Yasevich, Xin Long
In-Reply-To: <20180427105050.GA22078@hmswarspite.think-freely.org>

On Fri, Apr 27, 2018 at 06:50:50AM -0400, Neil Horman wrote:
> On Thu, Apr 26, 2018 at 04:58:52PM -0300, Marcelo Ricardo Leitner wrote:
> > As noticed by Xin Long, the if() here is always true as PMTU can never
> > be 0.
> >
> > Reported-by: Xin Long <lucien.xin@gmail.com>
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  net/sctp/associola.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> > index b3aa95222bd52113295cb246c503c903bdd5c353..c5ed09cfa8423b17546e3d45f6d06db03af66384 100644
> > --- a/net/sctp/associola.c
> > +++ b/net/sctp/associola.c
> > @@ -1397,10 +1397,8 @@ void sctp_assoc_sync_pmtu(struct sctp_association *asoc)
> >  			pmtu = t->pathmtu;
> >  	}
> >
> > -	if (pmtu) {
> > -		asoc->pathmtu = pmtu;
> > -		asoc->frag_point = sctp_frag_point(asoc, pmtu);
> > -	}
> > +	asoc->pathmtu = pmtu;
> > +	asoc->frag_point = sctp_frag_point(asoc, pmtu);
> >
> Can you double check this?  Looking at it, it seems far fetched, but if someone

Sure.

> sends a crafted icmp dest unreach message to the host, pmtu_sending might be
> able to get set for an association (which may have no transports established
> yet), and if so, on the first packet send sctp_assoc_sync_pmtu can be called,
> leading to a fall through in the loop over all transports, and pmtu being zero.
> It seems like a far fetched set of circumstances, I know, but if it can happen,
> I think you might see a crash in sctp_frag_point due to an underflow of the frag
> value

If I got you right, this situation would not happen because when
handling the icmp it will check if there is a transport and ignore it
otherwise.

  Marcelo

>
> Neil
>
> >  	pr_debug("%s: asoc:%p, pmtu:%d, frag_point:%d\n", __func__, asoc,
> >  		 asoc->pathmtu, asoc->frag_point);
> > --
> > 2.14.3
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* [PATCH 3/3] net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash
From: Alexander Duyck @ 2018-04-27 18:06 UTC (permalink / raw)
  To: netdev, davem
  Cc: linux-rdma, dennis.dalessandro, niranjana.vishwanathapura, tariqt
In-Reply-To: <20180427180142.4883.96259.stgit@ahduyck-green-test.jf.intel.com>

I am dropping the export of __skb_tx_hash as after my patches nobody is
using it outside of the net/core/dev.c file. In addition I am renaming and
repurposing it to just be a static declaration of skb_tx_hash since that
was the only user for it at this point. By doing this the compiler can
inline it into __netdev_pick_tx as that will improve performance.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/netdevice.h |   13 -------------
 net/core/dev.c            |   10 ++++------
 2 files changed, 4 insertions(+), 19 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503..6da5371 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3213,19 +3213,6 @@ static inline int netif_set_xps_queue(struct net_device *dev,
 }
 #endif
 
-u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
-		  unsigned int num_tx_queues);
-
-/*
- * Returns a Tx hash for the given packet when dev->real_num_tx_queues is used
- * as a distribution range limit for the returned value.
- */
-static inline u16 skb_tx_hash(const struct net_device *dev,
-			      struct sk_buff *skb)
-{
-	return __skb_tx_hash(dev, skb, dev->real_num_tx_queues);
-}
-
 /**
  *	netif_is_multiqueue - test if device has multiple transmit queues
  *	@dev: network device
diff --git a/net/core/dev.c b/net/core/dev.c
index 9b04a9f..7584d9c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2614,17 +2614,16 @@ void netif_device_attach(struct net_device *dev)
  * Returns a Tx hash based on the given packet descriptor a Tx queues' number
  * to be used as a distribution range.
  */
-u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
-		  unsigned int num_tx_queues)
+static u16 skb_tx_hash(const struct net_device *dev, struct sk_buff *skb)
 {
 	u32 hash;
 	u16 qoffset = 0;
-	u16 qcount = num_tx_queues;
+	u16 qcount = dev->real_num_tx_queues;
 
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
-		while (unlikely(hash >= num_tx_queues))
-			hash -= num_tx_queues;
+		while (unlikely(hash >= qcount))
+			hash -= qcount;
 		return hash;
 	}
 
@@ -2637,7 +2636,6 @@ u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
 
 	return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
 }
-EXPORT_SYMBOL(__skb_tx_hash);
 
 static void skb_warn_bad_offload(const struct sk_buff *skb)
 {

^ permalink raw reply related

* [PATCH net] bridge: netfilter stp fix reference to uninitialized data
From: Stephen Hemminger @ 2018-04-27 18:16 UTC (permalink / raw)
  To: pablo, kadlec, fw, davem
  Cc: netfilter-devel, bridge, netdev, Stephen Hemminger,
	Stephen Hemminger

The destination mac (destmac) is only valid if EBT_DESTMAC flag
is set. Fix by changing the order of the comparison to look for
the flag first.

Reported-by: syzbot+5c06e318fc558cc27823@syzkaller.appspotmail.com
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
Note: no fixes since this bug goes back to pre-git days.
Should go to stable as well.

 net/bridge/netfilter/ebt_stp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bridge/netfilter/ebt_stp.c b/net/bridge/netfilter/ebt_stp.c
index 47ba98db145d..46c1fe7637ea 100644
--- a/net/bridge/netfilter/ebt_stp.c
+++ b/net/bridge/netfilter/ebt_stp.c
@@ -161,8 +161,8 @@ static int ebt_stp_mt_check(const struct xt_mtchk_param *par)
 	/* Make sure the match only receives stp frames */
 	if (!par->nft_compat &&
 	    (!ether_addr_equal(e->destmac, eth_stp_addr) ||
-	     !is_broadcast_ether_addr(e->destmsk) ||
-	     !(e->bitmask & EBT_DESTMAC)))
+	     !(e->bitmask & EBT_DESTMAC) ||
+	     !is_broadcast_ether_addr(e->destmsk)))
 		return -EINVAL;
 
 	return 0;
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net-next] net: sch: prio: Set bands to default on delete instead of noop
From: Cong Wang @ 2018-04-27 18:20 UTC (permalink / raw)
  To: Nogah Frankel
  Cc: Linux Kernel Network Developers, David Miller, Jiri Pirko,
	Jamal Hadi Salim, mlxsw
In-Reply-To: <1524749556-36199-1-git-send-email-nogahf@mellanox.com>

On Thu, Apr 26, 2018 at 6:32 AM, Nogah Frankel <nogahf@mellanox.com> wrote:
> When a band is created, it is set to the default qdisc, which is
> "invisible" pfifo.


Isn't TCA_DUMP_INVISIBLE for dumping this invisible qdisc?


> However, if a band is set to a qdisc that is later being deleted, it will
> be set to noop qdisc. This can cause a packet loss, while there is no clear
> user indication for it. ("invisible" qdisc are not being shown by default).
> This patch sets a band to the default qdisc, rather then the noop qdisc, on
> delete operation.

It is set to noop historically, may be not reasonable but changing
it could break things.

What's wrong with using TCA_DUMP_INVISIBLE to dump it?

^ permalink raw reply

* Re: DSA
From: Florian Fainelli @ 2018-04-27 18:32 UTC (permalink / raw)
  To: Dave Richards, netdev@vger.kernel.org, andrew, vivien.didelot
In-Reply-To: <MWHPR06MB3503CE521D6993C7786A3E93DC8D0@MWHPR06MB3503.namprd06.prod.outlook.com>

Hello,

On 04/27/2018 11:10 AM, Dave Richards wrote:
> Hello,
> 
> I am building a prototype for a new product based on a Lanner, Inc. embedded PC.  It is an Intel Celeron-based system with two host I210 GbE chips connected to 2 MV88E6172 chips (one NIC to one switch).  Everything appears to show up hardware-wise.  My question is, what is the next step?  How does DSA know which NICs are intended to be masters?  Is this supposed to be auto-detected or is this knowledge supposed to be communicated explicitly.  Reading through the DSA driver code I see that there is a check of the OF property list for the device for a "label"/"cpu" property/value pair that needs to be present.  Who sets this and when?

On system where Device Tree can be used, we expect you to declare all
relevant peripherals in Device Tree and those would include the Ethernet
controller (i210) and the Ethernet switches. An example of this can be
found here:

On x86, there is not an universal use of Device Tree, so we can use
something along these lines to register an Ethernet switch through DSA:

https://github.com/lunn/linux/commit/34055b931848545b6ba11ee50b88e89aeb02c9a5

There might be a way for you to use a conjuction of DMI Match entries to
match your specific board design and based on that run the DSA switch
registration code, indicating the port mapping and the Ethernet
controller to be used as a "master device.
-- 
Florian

^ permalink raw reply

* Re: [pull request][net 0/7] Mellanox, mlx5 fixes 2018-04-26
From: David Miller @ 2018-04-27 18:32 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20180426195842.29665-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Thu, 26 Apr 2018 12:58:35 -0700

> This pull request includes fixes for mlx5 core and netdev driver.
> 
> Please pull and let me know if there's any problems.

Pulled.

> For -stable v4.12
>     net/mlx5e: TX, Use correct counter in dma_map error flow
> For -stable v4.13
>     net/mlx5: Avoid cleaning flow steering table twice during error flow
> For -stable v4.14
>     net/mlx5e: Allow offloading ipv4 header re-write for icmp
> For -stable v4.15
>     net/mlx5e: DCBNL fix min inline header size for dscp
> For -stable v4.16
>     net/mlx5: Fix mlx5_get_vector_affinity function

Queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next] selftests: pmtu: Minimum MTU for vti6 is 68
From: David Miller @ 2018-04-27 18:33 UTC (permalink / raw)
  To: sbrivio; +Cc: steffen.klassert, lucien.xin, alexey.kodanev, jarod, sd, netdev
In-Reply-To: <c2369c8f004006b33007bad40b63c35f50ff3c23.1524764073.git.sbrivio@redhat.com>

From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 26 Apr 2018 19:41:02 +0200

> A vti6 interface can carry IPv4 packets too.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Applied, thank you.

^ permalink raw reply

* Re: Suggestions on iterating eBPF maps
From: Chenbo Feng @ 2018-04-27 18:33 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Lorenzo Colitti, Joel Fernandes
In-Reply-To: <CAMOXUJm3pYr4RwzJSYop3jFT+NjrcjKgm=QGZbbP_1U2299JhQ@mail.gmail.com>

resend with  plain text

On Fri, Apr 27, 2018 at 11:22 AM Chenbo Feng <fengc@google.com> wrote:

> Hi net-next,

> When doing the eBPF tools user-space development I noticed that the map
iterating process in user-space have some little flaws. If we want to dump
the whole map. The only way now I know is to use a null key to start the
iteration and keep calling bpf_get_next_key and bpf_look_up_elem for each
new key value pair until we reach the end of the map. I noticed the
bpftools recently added used the similar approach.

> The overhead of repeating syscalls is acceptable, but the race problem
come with this iteration process is a little annoying. If the current key
we are using get deleted before we do the syscall to get the next key . The
next key returned will start from the beginning of the map again and some
entry will be dumped again depending on the position of the key deleted. If
the racing problem is within the same userspace process, it can be easily
fixed by adding some read/write locks. However, if multiple processes is
reading the map through pinned fd while there is one process is editing the
map entry or the kernel program is deleting entries, it become harder to
get a consistent and correct map dump.

> We are wondering if there is already implementation we didn't notice in
mainline kernel that help improved this iteration process and addressed the
racing problem mentioned above? If not, what can be down to address the
issue above. One thing we came up with is to use a single entry bpf map as
a across process lock to prevent multiple userspace process to read/write
other maps at the same time. But I don't know how safe this solution is
since there will still be a race to read the lock map value and setup the
lock.

> Thanks
> Chenbo Feng

^ permalink raw reply

* Re: [PATCH net-next 00/13] sctp: refactor MTU handling
From: David Miller @ 2018-04-27 18:42 UTC (permalink / raw)
  To: marcelo.leitner; +Cc: netdev, linux-sctp, vyasevich, nhorman, lucien.xin
In-Reply-To: <cover.1524772453.git.marcelo.leitner@gmail.com>

From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Thu, 26 Apr 2018 16:58:49 -0300

> Currently MTU handling is spread over SCTP stack. There are multiple
> places doing same/similar calculations and updating them is error prone
> as one spot can easily be left out.
> 
> This patchset converges it into a more concise and consistent code. In
> general, it moves MTU handling from functions with bigger objectives,
> such as sctp_assoc_add_peer(), to specific functions.
> 
> It's also a preparation for the next patchset, which removes the
> duplication between sctp_make_op_error_space and
> sctp_make_op_error_fixed and relies on sctp_mtu_payload introduced here.
> 
> More details on each patch.

Series applied, thanks!

^ permalink raw reply

* Re: Request for stable 4.14.x inclusion: net: don't call update_pmtu unconditionally
From: Eddie Chapman @ 2018-04-27 18:43 UTC (permalink / raw)
  To: Thomas Deutschmann, Greg KH; +Cc: stable, davem, nicolas.dichtel, netdev
In-Reply-To: <f663a39c-25f4-0558-21e0-a25c21287384@gentoo.org>

On 27/04/18 19:07, Thomas Deutschmann wrote:
> Hi Greg,
> 
> first, we need to cherry-pick another patch first:
>   
>>  From 52a589d51f1008f62569bf89e95b26221ee76690 Mon Sep 17 00:00:00 2001
>> From: Xin Long <lucien.xin@gmail.com>
>> Date: Mon, 25 Dec 2017 14:43:58 +0800
>> Subject: [PATCH] geneve: update skb dst pmtu on tx path
>>
>> Commit a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path") has fixed
>> a performance issue caused by the change of lower dev's mtu for vxlan.
>>
>> The same thing needs to be done for geneve as well.
>>
>> Note that geneve cannot adjust it's mtu according to lower dev's mtu
>> when creating it. The performance is very low later when netperfing
>> over it without fixing the mtu manually. This patch could also avoid
>> this issue.
>>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> Signed-off-by: David S. Miller <davem@davemloft.net>

Oops, I completely missed that the coreos patch doesn't have the geneve 
hunk that is in the original 4.15 patch. I don't load the geneve module 
on my box hence why no problems surfaced on my machine.

Thanks Thomas for the correct instructions. Ignore my message Greg, I'll 
drop back into the shadows where I belong, sorry for the noise!

^ permalink raw reply

* [PATCH v5 net-next 0/3] lan78xx updates along with Fixed phy Support
From: Raghuram Chary J @ 2018-04-27 18:47 UTC (permalink / raw)
  To: davem; +Cc: netdev, unglinuxdriver, woojung.huh, raghuramchary.jallipalli

These series of patches handle few modifications in driver
and adds support for fixed phy.

Raghuram Chary J (3):
  lan78xx: Lan7801 Support for Fixed PHY
  lan78xx: Remove DRIVER_VERSION for lan78xx driver
  lan78xx: Modify error messages

 drivers/net/usb/Kconfig   |   1 +
 drivers/net/usb/lan78xx.c | 110 ++++++++++++++++++++++++++++++++--------------
 2 files changed, 79 insertions(+), 32 deletions(-)

-- 
2.16.2

^ permalink raw reply

* [PATCH v5 net-next 1/3] lan78xx: Lan7801 Support for Fixed PHY
From: Raghuram Chary J @ 2018-04-27 18:47 UTC (permalink / raw)
  To: davem; +Cc: netdev, unglinuxdriver, woojung.huh, raghuramchary.jallipalli
In-Reply-To: <20180427184744.28798-1-raghuramchary.jallipalli@microchip.com>

Adding Fixed PHY support to the lan78xx driver.

Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
---
v0->v1:
   * Remove driver version #define
   * Modify netdev_info to netdev_dbg
   * Move lan7801 specific to new routine and add switch case
   * Minor cleanup

v1->v2:
   * Removed fixedphy variable and used phy_is_pseudo_fixed_link() check.
v2->v3:
   * Revert driver version, debug statment changes for separate patch.
   * Modify lan7801 specific routine with return type struct phy_device.
v3->v4:
   * Modify lan7801 specific routine by removing phydev arg and get phydev.
v4->v5:
   * Patched in latest net-next.
   * Also handled error condition for fixedphy.
---
 drivers/net/usb/Kconfig   |   1 +
 drivers/net/usb/lan78xx.c | 104 +++++++++++++++++++++++++++++++++-------------
 2 files changed, 77 insertions(+), 28 deletions(-)

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index f28bd74ac275..418b0904cecb 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -111,6 +111,7 @@ config USB_LAN78XX
 	select MII
 	select PHYLIB
 	select MICROCHIP_PHY
+	select FIXED_PHY
 	help
 	  This option adds support for Microchip LAN78XX based USB 2
 	  & USB 3 10/100/1000 Ethernet adapters.
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index c59f8afd0d73..81dfd10c3b92 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -36,7 +36,7 @@
 #include <linux/irq.h>
 #include <linux/irqchip/chained_irq.h>
 #include <linux/microchipphy.h>
-#include <linux/phy.h>
+#include <linux/phy_fixed.h>
 #include <linux/of_mdio.h>
 #include <linux/of_net.h>
 #include "lan78xx.h"
@@ -2063,52 +2063,91 @@ static int ksz9031rnx_fixup(struct phy_device *phydev)
 	return 1;
 }
 
-static int lan78xx_phy_init(struct lan78xx_net *dev)
+static struct phy_device *lan7801_phy_init(struct lan78xx_net *dev)
 {
+	u32 buf;
 	int ret;
-	u32 mii_adv;
+	struct fixed_phy_status fphy_status = {
+		.link = 1,
+		.speed = SPEED_1000,
+		.duplex = DUPLEX_FULL,
+	};
 	struct phy_device *phydev;
 
 	phydev = phy_find_first(dev->mdiobus);
 	if (!phydev) {
-		netdev_err(dev->net, "no PHY found\n");
-		return -EIO;
-	}
-
-	if ((dev->chipid == ID_REV_CHIP_ID_7800_) ||
-	    (dev->chipid == ID_REV_CHIP_ID_7850_)) {
-		phydev->is_internal = true;
-		dev->interface = PHY_INTERFACE_MODE_GMII;
-
-	} else if (dev->chipid == ID_REV_CHIP_ID_7801_) {
+		netdev_dbg(dev->net, "PHY Not Found!! Registering Fixed PHY\n");
+		phydev = fixed_phy_register(PHY_POLL, &fphy_status, -1,
+					    NULL);
+		if (IS_ERR(phydev)) {
+			netdev_err(dev->net, "No PHY/fixed_PHY found\n");
+			return NULL;
+		}
+		netdev_dbg(dev->net, "Registered FIXED PHY\n");
+		dev->interface = PHY_INTERFACE_MODE_RGMII;
+		ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
+					MAC_RGMII_ID_TXC_DELAY_EN_);
+		ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
+		ret = lan78xx_read_reg(dev, HW_CFG, &buf);
+		buf |= HW_CFG_CLK125_EN_;
+		buf |= HW_CFG_REFCLK25_EN_;
+		ret = lan78xx_write_reg(dev, HW_CFG, buf);
+	} else {
 		if (!phydev->drv) {
 			netdev_err(dev->net, "no PHY driver found\n");
-			return -EIO;
+			return NULL;
 		}
-
 		dev->interface = PHY_INTERFACE_MODE_RGMII;
-
 		/* external PHY fixup for KSZ9031RNX */
 		ret = phy_register_fixup_for_uid(PHY_KSZ9031RNX, 0xfffffff0,
 						 ksz9031rnx_fixup);
 		if (ret < 0) {
 			netdev_err(dev->net, "fail to register fixup\n");
-			return ret;
+			return NULL;
 		}
 		/* external PHY fixup for LAN8835 */
 		ret = phy_register_fixup_for_uid(PHY_LAN8835, 0xfffffff0,
 						 lan8835_fixup);
 		if (ret < 0) {
 			netdev_err(dev->net, "fail to register fixup\n");
-			return ret;
+			return NULL;
 		}
 		/* add more external PHY fixup here if needed */
 
 		phydev->is_internal = false;
-	} else {
-		netdev_err(dev->net, "unknown ID found\n");
-		ret = -EIO;
-		goto error;
+	}
+	return phydev;
+}
+
+static int lan78xx_phy_init(struct lan78xx_net *dev)
+{
+	int ret;
+	u32 mii_adv;
+	struct phy_device *phydev;
+
+	switch (dev->chipid) {
+	case ID_REV_CHIP_ID_7801_:
+		phydev = lan7801_phy_init(dev);
+		if (!phydev) {
+			netdev_err(dev->net, "lan7801: PHY Init Failed");
+			return -EIO;
+		}
+		break;
+
+	case ID_REV_CHIP_ID_7800_:
+	case ID_REV_CHIP_ID_7850_:
+		phydev = phy_find_first(dev->mdiobus);
+		if (!phydev) {
+			netdev_err(dev->net, "no PHY found\n");
+			return -EIO;
+		}
+		phydev->is_internal = true;
+		dev->interface = PHY_INTERFACE_MODE_GMII;
+		break;
+
+	default:
+		netdev_err(dev->net, "Unknown CHIP ID found\n");
+		return -EIO;
 	}
 
 	/* if phyirq is not set, use polling mode in phylib */
@@ -2127,6 +2166,16 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 	if (ret) {
 		netdev_err(dev->net, "can't attach PHY to %s\n",
 			   dev->mdiobus->id);
+		if (dev->chipid == ID_REV_CHIP_ID_7801_) {
+			if (phy_is_pseudo_fixed_link(phydev)) {
+				fixed_phy_unregister(phydev);
+			} else {
+				phy_unregister_fixup_for_uid(PHY_KSZ9031RNX,
+							     0xfffffff0);
+				phy_unregister_fixup_for_uid(PHY_LAN8835,
+							     0xfffffff0);
+			}
+		}
 		return -EIO;
 	}
 
@@ -2166,12 +2215,6 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 	dev->fc_autoneg = phydev->autoneg;
 
 	return 0;
-
-error:
-	phy_unregister_fixup_for_uid(PHY_KSZ9031RNX, 0xfffffff0);
-	phy_unregister_fixup_for_uid(PHY_LAN8835, 0xfffffff0);
-
-	return ret;
 }
 
 static int lan78xx_set_rx_max_frame_length(struct lan78xx_net *dev, int size)
@@ -3569,6 +3612,7 @@ static void lan78xx_disconnect(struct usb_interface *intf)
 	struct lan78xx_net		*dev;
 	struct usb_device		*udev;
 	struct net_device		*net;
+	struct phy_device		*phydev;
 
 	dev = usb_get_intfdata(intf);
 	usb_set_intfdata(intf, NULL);
@@ -3577,12 +3621,16 @@ static void lan78xx_disconnect(struct usb_interface *intf)
 
 	udev = interface_to_usbdev(intf);
 	net = dev->net;
+	phydev = net->phydev;
 
 	phy_unregister_fixup_for_uid(PHY_KSZ9031RNX, 0xfffffff0);
 	phy_unregister_fixup_for_uid(PHY_LAN8835, 0xfffffff0);
 
 	phy_disconnect(net->phydev);
 
+	if (phy_is_pseudo_fixed_link(phydev))
+		fixed_phy_unregister(phydev);
+
 	unregister_netdev(net);
 
 	cancel_delayed_work_sync(&dev->wq);
-- 
2.16.2

^ permalink raw reply related

* [PATCH v5 net-next 2/3] lan78xx: Remove DRIVER_VERSION for lan78xx driver
From: Raghuram Chary J @ 2018-04-27 18:47 UTC (permalink / raw)
  To: davem; +Cc: netdev, unglinuxdriver, woojung.huh, raghuramchary.jallipalli
In-Reply-To: <20180427184744.28798-1-raghuramchary.jallipalli@microchip.com>

Remove driver version info from the lan78xx driver.

Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
---
 drivers/net/usb/lan78xx.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 81dfd10c3b92..54f8db887e3d 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -44,7 +44,6 @@
 #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
 #define DRIVER_DESC	"LAN78XX USB 3.0 Gigabit Ethernet Devices"
 #define DRIVER_NAME	"lan78xx"
-#define DRIVER_VERSION	"1.0.6"
 
 #define TX_TIMEOUT_JIFFIES		(5 * HZ)
 #define THROTTLE_JIFFIES		(HZ / 8)
@@ -1503,7 +1502,6 @@ static void lan78xx_get_drvinfo(struct net_device *net,
 	struct lan78xx_net *dev = netdev_priv(net);
 
 	strncpy(info->driver, DRIVER_NAME, sizeof(info->driver));
-	strncpy(info->version, DRIVER_VERSION, sizeof(info->version));
 	usb_make_path(dev->udev, info->bus_info, sizeof(info->bus_info));
 }
 
-- 
2.16.2

^ permalink raw reply related

* [PATCH v5 net-next 3/3] lan78xx: Modify error messages
From: Raghuram Chary J @ 2018-04-27 18:47 UTC (permalink / raw)
  To: davem; +Cc: netdev, unglinuxdriver, woojung.huh, raghuramchary.jallipalli
In-Reply-To: <20180427184744.28798-1-raghuramchary.jallipalli@microchip.com>

Modify the error messages when phy registration fails.

Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
---
 drivers/net/usb/lan78xx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 54f8db887e3d..4b930c9faa16 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2100,14 +2100,14 @@ static struct phy_device *lan7801_phy_init(struct lan78xx_net *dev)
 		ret = phy_register_fixup_for_uid(PHY_KSZ9031RNX, 0xfffffff0,
 						 ksz9031rnx_fixup);
 		if (ret < 0) {
-			netdev_err(dev->net, "fail to register fixup\n");
+			netdev_err(dev->net, "fail to register fixup for PHY_KSZ9031RNX\n");
 			return NULL;
 		}
 		/* external PHY fixup for LAN8835 */
 		ret = phy_register_fixup_for_uid(PHY_LAN8835, 0xfffffff0,
 						 lan8835_fixup);
 		if (ret < 0) {
-			netdev_err(dev->net, "fail to register fixup\n");
+			netdev_err(dev->net, "fail to register fixup for PHY_LAN8835\n");
 			return NULL;
 		}
 		/* add more external PHY fixup here if needed */
-- 
2.16.2

^ permalink raw reply related

* RE: [PATCH v1 net-next] lan78xx: Lan7801 Support for Fixed PHY
From: RaghuramChary.Jallipalli @ 2018-04-27 18:46 UTC (permalink / raw)
  To: davem; +Cc: netdev, UNGLinuxDriver, Woojung.Huh
In-Reply-To: <20180427.110022.1044855000838407292.davem@davemloft.net>

> >    * Minor cleanup
> 
> This patch doesn't apply cleanly to net-next, please respin.

Apologies. Have sent updated version v5.

Thanks,
-Raghu

^ permalink raw reply

* Re: [PATCH net-next] hv_netvsc: simplify receive side calling arguments
From: David Miller @ 2018-04-27 18:46 UTC (permalink / raw)
  To: stephen; +Cc: haiyangz, netdev, sthemmin
In-Reply-To: <20180426213425.13135-1-sthemmin@microsoft.com>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Thu, 26 Apr 2018 14:34:25 -0700

> The calls up from the napi poll reading the receive ring had many
> places where an argument was being recreated. I.e the caller already
> had the value and wasn't passing it, then the callee would use
> known relationship to determine the same value. Simpler and faster
> to just pass arguments needed.
> 
> Also, add const in a couple places where message is being only read.
> 
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>

Applied, thanks Stephen.

^ permalink raw reply

* [PATCH net-next 2/2] net-backports: selftest: add test for TCP_INQ
From: Soheil Hassas Yeganeh @ 2018-04-27 18:50 UTC (permalink / raw)
  To: davem, netdev; +Cc: ycheng, ncardwell, edumazet, willemb, Soheil Hassas Yeganeh
In-Reply-To: <20180427185038.32714-1-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
---
 tools/testing/selftests/net/Makefile  |   3 +-
 tools/testing/selftests/net/tcp_inq.c | 189 ++++++++++++++++++++++++++
 2 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_inq.c

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index df9102ec7b7af..0a1821f8dfb18 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh pmtu.sh udpgso.sh
 TEST_PROGS += udpgso_bench.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
-TEST_GEN_FILES += tcp_mmap
+TEST_GEN_FILES += tcp_mmap tcp_inq
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
@@ -18,3 +18,4 @@ include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
+$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_inq.c b/tools/testing/selftests/net/tcp_inq.c
new file mode 100644
index 0000000000000..3f6a27efbe5cf
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_inq.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Soheil Hassas Yeganeh (soheil@google.com)
+ *
+ * Simple example on how to use TCP_INQ and TCP_CM_INQ.
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ */
+#define _GNU_SOURCE
+
+#include <error.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+#ifndef TCP_INQ
+#define TCP_INQ 35
+#endif
+
+#ifndef TCP_CM_INQ
+#define TCP_CM_INQ TCP_INQ
+#endif
+
+#define BUF_SIZE 8192
+#define CMSG_SIZE 32
+
+static int family = AF_INET6;
+static socklen_t addr_len = sizeof(struct sockaddr_in6);
+static int port = 4974;
+
+static void setup_loopback_addr(int family, struct sockaddr_storage *sockaddr)
+{
+	struct sockaddr_in6 *addr6 = (void *) sockaddr;
+	struct sockaddr_in *addr4 = (void *) sockaddr;
+
+	switch (family) {
+	case PF_INET:
+		memset(addr4, 0, sizeof(*addr4));
+		addr4->sin_family = AF_INET;
+		addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+		addr4->sin_port = htons(port);
+		break;
+	case PF_INET6:
+		memset(addr6, 0, sizeof(*addr6));
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_addr = in6addr_loopback;
+		addr6->sin6_port = htons(port);
+		break;
+	default:
+		error(1, 0, "illegal family");
+	}
+}
+
+void *start_server(void *arg)
+{
+	int server_fd = (int)(unsigned long)arg;
+	struct sockaddr_in addr;
+	socklen_t addrlen = sizeof(addr);
+	char *buf;
+	int fd;
+	int r;
+
+	buf = malloc(BUF_SIZE);
+
+	for (;;) {
+		fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
+		if (fd == -1) {
+			perror("accept");
+			break;
+		}
+		do {
+			r = send(fd, buf, BUF_SIZE, 0);
+		} while (r < 0 && errno == EINTR);
+		if (r < 0)
+			perror("send");
+		if (r != BUF_SIZE)
+			fprintf(stderr, "can only send %d bytes\n", r);
+		/* TCP_INQ can overestimate in-queue by one byte if we send
+		 * the FIN packet. Sleep for 1 second, so that the client
+		 * likely invoked recvmsg().
+		 */
+		sleep(1);
+		close(fd);
+	}
+
+	free(buf);
+	close(server_fd);
+	pthread_exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+	struct sockaddr_storage listen_addr, addr;
+	int c, one = 1, inq = -1;
+	pthread_t server_thread;
+	char cmsgbuf[CMSG_SIZE];
+	struct iovec iov[1];
+	struct cmsghdr *cm;
+	struct msghdr msg;
+	int server_fd, fd;
+	char *buf;
+
+	while ((c = getopt(argc, argv, "46p:")) != -1) {
+		switch (c) {
+		case '4':
+			family = PF_INET;
+			addr_len = sizeof(struct sockaddr_in);
+			break;
+		case '6':
+			family = PF_INET6;
+			addr_len = sizeof(struct sockaddr_in6);
+			break;
+		case 'p':
+			port = atoi(optarg);
+			break;
+		}
+	}
+
+	server_fd = socket(family, SOCK_STREAM, 0);
+	if (server_fd < 0)
+		error(1, errno, "server socket");
+	setup_loopback_addr(family, &listen_addr);
+	if (setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR,
+		       &one, sizeof(one)) != 0)
+		error(1, errno, "setsockopt(SO_REUSEADDR)");
+	if (bind(server_fd, (const struct sockaddr *)&listen_addr,
+		 addr_len) == -1)
+		error(1, errno, "bind");
+	if (listen(server_fd, 128) == -1)
+		error(1, errno, "listen");
+	if (pthread_create(&server_thread, NULL, start_server,
+			   (void *)(unsigned long)server_fd) != 0)
+		error(1, errno, "pthread_create");
+
+	fd = socket(family, SOCK_STREAM, 0);
+	if (fd < 0)
+		error(1, errno, "client socket");
+	setup_loopback_addr(family, &addr);
+	if (connect(fd, (const struct sockaddr *)&addr, addr_len) == -1)
+		error(1, errno, "connect");
+	if (setsockopt(fd, SOL_TCP, TCP_INQ, &one, sizeof(one)) != 0)
+		error(1, errno, "setsockopt(TCP_INQ)");
+
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_iov = iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = cmsgbuf;
+	msg.msg_controllen = sizeof(cmsgbuf);
+	msg.msg_flags = 0;
+
+	buf = malloc(BUF_SIZE);
+	iov[0].iov_base = buf;
+	iov[0].iov_len = BUF_SIZE / 2;
+
+	if (recvmsg(fd, &msg, 0) != iov[0].iov_len)
+		error(1, errno, "recvmsg");
+	if (msg.msg_flags & MSG_CTRUNC)
+		error(1, 0, "control message is truncated");
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
+		if (cm->cmsg_level == SOL_TCP && cm->cmsg_type == TCP_CM_INQ)
+			inq = *((int *) CMSG_DATA(cm));
+
+	if (inq != BUF_SIZE - iov[0].iov_len) {
+		fprintf(stderr, "unexpected inq: %d\n", inq);
+		exit(1);
+	}
+
+	printf("PASSED\n");
+	free(buf);
+	close(fd);
+	return 0;
+}
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* [PATCH net-next 1/2] tcp: send in-queue bytes in cmsg upon read
From: Soheil Hassas Yeganeh @ 2018-04-27 18:50 UTC (permalink / raw)
  To: davem, netdev; +Cc: ycheng, ncardwell, edumazet, willemb, Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the size of receive queue length, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
---
 include/linux/tcp.h      |  2 +-
 include/net/tcp.h        |  8 ++++++++
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c           | 27 +++++++++++++++++++++++----
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
 		unused:2;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
-		unused1	    : 1,
+		recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e3df173..0986836b5df5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1951,6 +1951,14 @@ static inline int tcp_inq(struct sock *sk)
 	return answ;
 }
 
+static inline int tcp_inq_hint(const struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	return max_t(int, 0,
+		     READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq));
+}
+
 int tcp_peek_len(struct socket *sock);
 
 static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 379b08700a542..d4cdd25a7bd48 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,9 @@ enum {
 #define TCP_MD5SIG_EXT		32	/* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY	33	/* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE	34	/* Enable TFO without a TFO cookie */
+#define TCP_INQ			35	/* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ		TCP_INQ
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dfd090ea54ad4..5a7056980f730 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1910,13 +1910,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	u32 peek_seq;
 	u32 *seq;
 	unsigned long used;
-	int err;
+	int err, inq;
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
 	u32 urg_hole = 0;
 	struct scm_timestamping tss;
 	bool has_tss = false;
+	bool has_cmsg;
 
 	if (unlikely(flags & MSG_ERRQUEUE))
 		return inet_recv_error(sk, msg, len, addr_len);
@@ -1931,6 +1932,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	if (sk->sk_state == TCP_LISTEN)
 		goto out;
 
+	has_cmsg = tp->recvmsg_inq;
 	timeo = sock_rcvtimeo(sk, nonblock);
 
 	/* Urgent data needs to be handled specially. */
@@ -2117,6 +2119,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		if (TCP_SKB_CB(skb)->has_rxtstamp) {
 			tcp_update_recv_tstamps(skb, &tss);
 			has_tss = true;
+			has_cmsg = true;
 		}
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
 			goto found_fin_ok;
@@ -2136,13 +2139,20 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
 
-	if (has_tss)
-		tcp_recv_timestamp(msg, sk, &tss);
-
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
 
 	release_sock(sk);
+
+	if (has_cmsg) {
+		if (has_tss)
+			tcp_recv_timestamp(msg, sk, &tss);
+		if (tp->recvmsg_inq) {
+			inq = tcp_inq_hint(sk);
+			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
+		}
+	}
+
 	return copied;
 
 out:
@@ -3011,6 +3021,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		tp->notsent_lowat = val;
 		sk->sk_write_space(sk);
 		break;
+	case TCP_INQ:
+		if (val > 1 || val < 0)
+			err = -EINVAL;
+		else
+			tp->recvmsg_inq = val;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -3436,6 +3452,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_NOTSENT_LOWAT:
 		val = tp->notsent_lowat;
 		break;
+	case TCP_INQ:
+		val = tp->recvmsg_inq;
+		break;
 	case TCP_SAVE_SYN:
 		val = tp->save_syn;
 		break;
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* Re: [PATCH net-next v2 00/14] bnxt_en: Net-next updates.
From: David Miller @ 2018-04-27 18:53 UTC (permalink / raw)
  To: michael.chan; +Cc: netdev
In-Reply-To: <1524779084-4016-1-git-send-email-michael.chan@broadcom.com>

From: Michael Chan <michael.chan@broadcom.com>
Date: Thu, 26 Apr 2018 17:44:30 -0400

> This series has 3 main features.  The first is to add mqprio TC to
> hardware queue mapping to avoid reprogramming hardware CoS queue
> watermarks during run-time.  The second is DIM improvements from
> Andy Gospo.  The third is some improvements to VF resource allocations
> when supporting large numbers of VFs with more limited resources.
> 
> There are some additional minor improvements and a new function level
> discard counter.
> 
> v2: Fixed EEPROM typo noted by Andrew Lunn.

Series applied, thanks Michael.

^ permalink raw reply

* Re: [PATCH net-next 2/2] net-backports: selftest: add test for TCP_INQ
From: Soheil Hassas Yeganeh @ 2018-04-27 18:52 UTC (permalink / raw)
  To: David Miller, netdev
In-Reply-To: <20180427185038.32714-2-soheil.kdev@gmail.com>

On Fri, Apr 27, 2018 at 2:50 PM, Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Neal Cardwell <ncardwell@google.com>

Really sorry about the wrong patch subject. I'll send a V2 with the
corrected subject momentarily.

> ---
>  tools/testing/selftests/net/Makefile  |   3 +-
>  tools/testing/selftests/net/tcp_inq.c | 189 ++++++++++++++++++++++++++
>  2 files changed, 191 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/net/tcp_inq.c
>
> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
> index df9102ec7b7af..0a1821f8dfb18 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh pmtu.sh udpgso.sh
>  TEST_PROGS += udpgso_bench.sh
>  TEST_GEN_FILES =  socket
>  TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
> -TEST_GEN_FILES += tcp_mmap
> +TEST_GEN_FILES += tcp_mmap tcp_inq
>  TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
>  TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
>  TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
> @@ -18,3 +18,4 @@ include ../lib.mk
>
>  $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
>  $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
> +$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
> diff --git a/tools/testing/selftests/net/tcp_inq.c b/tools/testing/selftests/net/tcp_inq.c
> new file mode 100644
> index 0000000000000..3f6a27efbe5cf
> --- /dev/null
> +++ b/tools/testing/selftests/net/tcp_inq.c
> @@ -0,0 +1,189 @@
> +/*
> + * Copyright 2018 Google Inc.
> + * Author: Soheil Hassas Yeganeh (soheil@google.com)
> + *
> + * Simple example on how to use TCP_INQ and TCP_CM_INQ.
> + *
> + * License (GPLv2):
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
> + * more details.
> + */
> +#define _GNU_SOURCE
> +
> +#include <error.h>
> +#include <netinet/in.h>
> +#include <netinet/tcp.h>
> +#include <pthread.h>
> +#include <stdio.h>
> +#include <errno.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/socket.h>
> +#include <unistd.h>
> +
> +#ifndef TCP_INQ
> +#define TCP_INQ 35
> +#endif
> +
> +#ifndef TCP_CM_INQ
> +#define TCP_CM_INQ TCP_INQ
> +#endif
> +
> +#define BUF_SIZE 8192
> +#define CMSG_SIZE 32
> +
> +static int family = AF_INET6;
> +static socklen_t addr_len = sizeof(struct sockaddr_in6);
> +static int port = 4974;
> +
> +static void setup_loopback_addr(int family, struct sockaddr_storage *sockaddr)
> +{
> +       struct sockaddr_in6 *addr6 = (void *) sockaddr;
> +       struct sockaddr_in *addr4 = (void *) sockaddr;
> +
> +       switch (family) {
> +       case PF_INET:
> +               memset(addr4, 0, sizeof(*addr4));
> +               addr4->sin_family = AF_INET;
> +               addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> +               addr4->sin_port = htons(port);
> +               break;
> +       case PF_INET6:
> +               memset(addr6, 0, sizeof(*addr6));
> +               addr6->sin6_family = AF_INET6;
> +               addr6->sin6_addr = in6addr_loopback;
> +               addr6->sin6_port = htons(port);
> +               break;
> +       default:
> +               error(1, 0, "illegal family");
> +       }
> +}
> +
> +void *start_server(void *arg)
> +{
> +       int server_fd = (int)(unsigned long)arg;
> +       struct sockaddr_in addr;
> +       socklen_t addrlen = sizeof(addr);
> +       char *buf;
> +       int fd;
> +       int r;
> +
> +       buf = malloc(BUF_SIZE);
> +
> +       for (;;) {
> +               fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
> +               if (fd == -1) {
> +                       perror("accept");
> +                       break;
> +               }
> +               do {
> +                       r = send(fd, buf, BUF_SIZE, 0);
> +               } while (r < 0 && errno == EINTR);
> +               if (r < 0)
> +                       perror("send");
> +               if (r != BUF_SIZE)
> +                       fprintf(stderr, "can only send %d bytes\n", r);
> +               /* TCP_INQ can overestimate in-queue by one byte if we send
> +                * the FIN packet. Sleep for 1 second, so that the client
> +                * likely invoked recvmsg().
> +                */
> +               sleep(1);
> +               close(fd);
> +       }
> +
> +       free(buf);
> +       close(server_fd);
> +       pthread_exit(0);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +       struct sockaddr_storage listen_addr, addr;
> +       int c, one = 1, inq = -1;
> +       pthread_t server_thread;
> +       char cmsgbuf[CMSG_SIZE];
> +       struct iovec iov[1];
> +       struct cmsghdr *cm;
> +       struct msghdr msg;
> +       int server_fd, fd;
> +       char *buf;
> +
> +       while ((c = getopt(argc, argv, "46p:")) != -1) {
> +               switch (c) {
> +               case '4':
> +                       family = PF_INET;
> +                       addr_len = sizeof(struct sockaddr_in);
> +                       break;
> +               case '6':
> +                       family = PF_INET6;
> +                       addr_len = sizeof(struct sockaddr_in6);
> +                       break;
> +               case 'p':
> +                       port = atoi(optarg);
> +                       break;
> +               }
> +       }
> +
> +       server_fd = socket(family, SOCK_STREAM, 0);
> +       if (server_fd < 0)
> +               error(1, errno, "server socket");
> +       setup_loopback_addr(family, &listen_addr);
> +       if (setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR,
> +                      &one, sizeof(one)) != 0)
> +               error(1, errno, "setsockopt(SO_REUSEADDR)");
> +       if (bind(server_fd, (const struct sockaddr *)&listen_addr,
> +                addr_len) == -1)
> +               error(1, errno, "bind");
> +       if (listen(server_fd, 128) == -1)
> +               error(1, errno, "listen");
> +       if (pthread_create(&server_thread, NULL, start_server,
> +                          (void *)(unsigned long)server_fd) != 0)
> +               error(1, errno, "pthread_create");
> +
> +       fd = socket(family, SOCK_STREAM, 0);
> +       if (fd < 0)
> +               error(1, errno, "client socket");
> +       setup_loopback_addr(family, &addr);
> +       if (connect(fd, (const struct sockaddr *)&addr, addr_len) == -1)
> +               error(1, errno, "connect");
> +       if (setsockopt(fd, SOL_TCP, TCP_INQ, &one, sizeof(one)) != 0)
> +               error(1, errno, "setsockopt(TCP_INQ)");
> +
> +       msg.msg_name = NULL;
> +       msg.msg_namelen = 0;
> +       msg.msg_iov = iov;
> +       msg.msg_iovlen = 1;
> +       msg.msg_control = cmsgbuf;
> +       msg.msg_controllen = sizeof(cmsgbuf);
> +       msg.msg_flags = 0;
> +
> +       buf = malloc(BUF_SIZE);
> +       iov[0].iov_base = buf;
> +       iov[0].iov_len = BUF_SIZE / 2;
> +
> +       if (recvmsg(fd, &msg, 0) != iov[0].iov_len)
> +               error(1, errno, "recvmsg");
> +       if (msg.msg_flags & MSG_CTRUNC)
> +               error(1, 0, "control message is truncated");
> +
> +       for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
> +               if (cm->cmsg_level == SOL_TCP && cm->cmsg_type == TCP_CM_INQ)
> +                       inq = *((int *) CMSG_DATA(cm));
> +
> +       if (inq != BUF_SIZE - iov[0].iov_len) {
> +               fprintf(stderr, "unexpected inq: %d\n", inq);
> +               exit(1);
> +       }
> +
> +       printf("PASSED\n");
> +       free(buf);
> +       close(fd);
> +       return 0;
> +}
> --
> 2.17.0.441.gb46fe60e1d-goog
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox