Netdev List


* Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection
From: Paolo Abeni @ 2018-10-22 12:51 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Willem de Bruijn
In-Reply-To: <20181022114317.GI3823@gauss3.secunet.de>

Hi,

On Mon, 2018-10-22 at 13:43 +0200, Steffen Klassert wrote:
> On Fri, Oct 19, 2018 at 04:25:16PM +0200, Paolo Abeni wrote:
> > +
> > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> > +					      struct sk_buff *skb)
> > +{
> > +	struct sk_buff *segs;
> > +
> > +	/* the GSO CB lays after the UDP one, no need to save and restore any
> > +	 * CB fragment, just initialize it
> > +	 */
> > +	segs = __skb_gso_segment(skb, NETIF_F_SG, false);
> > +	if (unlikely(IS_ERR(segs)))
> > +		kfree_skb(skb);
> > +	else if (segs)
> > +		consume_skb(skb);
> > +	return segs;
> > +}
> > +
> > +
> 
> One empty line too much.

Thank you, will handle in the next iteration.

> >  #define udp_portaddr_for_each_entry(__sk, list) \
> >  	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
> >  
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 2331ac9de954..0d55145ce9f5 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1909,7 +1909,7 @@ EXPORT_SYMBOL(udp_encap_enable);
> >   * Note that in the success and error cases, the skb is assumed to
> >   * have either been requeued or freed.
> >   */
> > -static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > +static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
> >  {
> >  	struct udp_sock *up = udp_sk(sk);
> >  	int is_udplite = IS_UDPLITE(sk);
> > @@ -2012,6 +2012,29 @@ static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> >  	return -1;
> >  }
> >  
> > +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int proto);
> > +
> > +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > +{
> > +	struct sk_buff *next, *segs;
> > +	int ret;
> > +
> > +	if (likely(!udp_unexpected_gso(sk, skb)))
> > +		return udp_queue_rcv_one_skb(sk, skb);
> > +
> > +	BUILD_BUG_ON(sizeof(struct udp_skb_cb) > SKB_SGO_CB_OFFSET);
> > +	__skb_push(skb, -skb_mac_offset(skb));
> > +	segs = udp_rcv_segment(sk, skb);
> > +	for (skb = segs; skb; skb = next) {
> > +		next = skb->next;
> > +		__skb_pull(skb, skb_transport_offset(skb));
> > +		ret = udp_queue_rcv_one_skb(sk, skb);
> 
> udp_queue_rcv_one_skb() starts with doing a xfrm4_policy_check().
> Maybe we can do this on the GSO packet instead of the segments.
> So far this code is just for handling a corner case, but this might
> change.

I thought about keeping the policy check here, but then I preferred
what looked like the safest option. Perhaps we can improve this with a
follow-up?

Cheers,

Paolo
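
A minimal userspace sketch of the list walk in the quoted udp_queue_rcv_skb(), assuming a hypothetical `struct seg` in place of `struct sk_buff` and a handler that takes ownership of each segment. It only illustrates why `next` is read before the segment is handed off, and that the per-segment handler (and anything it does, such as a policy check) runs once per segment.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a segment; not the kernel's struct sk_buff. */
struct seg {
	struct seg *next;
	int len;
};

/* Build a list of n segments, each of length len. Returns NULL if n <= 0
 * or if an allocation fails.
 */
static struct seg *seg_list_alloc(int n, int len)
{
	struct seg *head = NULL;

	while (n-- > 0) {
		struct seg *s = malloc(sizeof(*s));

		if (!s) {
			/* Undo the partial list before giving up. */
			while (head) {
				struct seg *next = head->next;

				free(head);
				head = next;
			}
			return NULL;
		}
		s->len = len;
		s->next = head;
		head = s;
	}
	return head;
}

/* Takes ownership of one segment and frees it, returning its length. */
static int seg_deliver_one(struct seg *s)
{
	int len = s->len;

	free(s);
	return len;
}

/* Walk the list, reading ->next before the segment is handed off,
 * because the handler may free it.
 */
static int seg_deliver_all(struct seg *segs)
{
	struct seg *s, *next;
	int total = 0;

	for (s = segs; s; s = next) {
		next = s->next;
		total += seg_deliver_one(s);
	}
	return total;
}
```

Calling `seg_deliver_all(seg_list_alloc(3, 100))` delivers three segments and sums their lengths.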

* Re: [PATCH 01/20] ath6kl: Mark expected switch fall-through
From: Steve deRosier @ 2018-10-22 21:11 UTC (permalink / raw)
  To: gustavo
  Cc: ath9k-devel, Kalle Valo, linux-wireless, davem,
	Network Development, LKML
In-Reply-To: <7932c27ae4e53aceacb7a0a1ae26db91432e5014.1540239684.git.gustavo@embeddedor.com>

On Mon, Oct 22, 2018 at 1:37 PM Gustavo A. R. Silva
<gustavo@embeddedor.com> wrote:
>
> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
>
> Addresses-Coverity-ID: 201383 ("Missing break in switch")
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> ---
>  drivers/net/wireless/ath/ath6kl/main.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/wireless/ath/ath6kl/main.c b/drivers/net/wireless/ath/ath6kl/main.c
> index cb59016..5e7ea83 100644
> --- a/drivers/net/wireless/ath/ath6kl/main.c
> +++ b/drivers/net/wireless/ath/ath6kl/main.c
> @@ -389,6 +389,7 @@ void ath6kl_connect_ap_mode_bss(struct ath6kl_vif *vif, u16 channel)
>                 if (!ik->valid || ik->key_type != WAPI_CRYPT)
>                         break;
>                 /* for WAPI, we need to set the delayed group key, continue: */
> +               /* fall through */
>         case WPA_PSK_AUTH:
>         case WPA2_PSK_AUTH:
>         case (WPA_PSK_AUTH | WPA2_PSK_AUTH):
> --
> 2.7.4
>

Looks fine to me. Though it's too bad that the compiler can't detect
the fall through as already clearly commented.

Acked-by: Steve deRosier <derosier@cal-sierra.com>

- Steve
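
A small sketch of the annotated fall-through pattern, assuming hypothetical enum values loosely modeled on the quoted switch (they are not the ath6kl definitions). GCC's -Wimplicit-fallthrough matches specific comment wordings such as "fall through", which is presumably why the existing "continue:" comment is not enough on its own.

```c
/* Hypothetical auth types, for illustration only. */
enum auth_mode {
	AUTH_NONE,
	AUTH_WAPI,
	AUTH_WPA_PSK,
};

/* Returns 1 if the group key needs to be set for this mode. */
static int needs_group_key(enum auth_mode mode, int key_valid)
{
	switch (mode) {
	case AUTH_NONE:
		break;
	case AUTH_WAPI:
		if (!key_valid)
			break;
		/* for WAPI, the delayed group key is set as for WPA */
		/* fall through */
	case AUTH_WPA_PSK:
		return 1;
	}
	return 0;
}
```

With the marker comment in place, the WAPI case with a valid key shares the WPA path without triggering the warning.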

* [PATCH] net/wan/fsl_ucc_hdlc: error counters
From: Mathias Thore @ 2018-10-22 12:55 UTC (permalink / raw)
  To: qiang.zhao, linuxppc-dev, netdev, joakim.tjernlund,
	david.gounaris
  Cc: Mathias Thore

Extract error information from rx and tx buffer descriptors,
and update error counters.

Signed-off-by: Mathias Thore <mathias.thore@infinera.com>
---
 drivers/net/wan/fsl_ucc_hdlc.c | 51 +++++++++++++++++++++++++++++-----
 include/soc/fsl/qe/ucc_fast.h  |  8 +++++-
 2 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wan/fsl_ucc_hdlc.c b/drivers/net/wan/fsl_ucc_hdlc.c
index 5f0366a125e2..934fc149c6b3 100644
--- a/drivers/net/wan/fsl_ucc_hdlc.c
+++ b/drivers/net/wan/fsl_ucc_hdlc.c
@@ -36,6 +36,7 @@
 #define DRV_NAME "ucc_hdlc"
 
 #define TDM_PPPOHT_SLIC_MAXIN
+#define RX_BD_ERRORS (R_CD_S | R_OV_S | R_CR_S | R_AB_S | R_NO_S | R_LG_S)
 
 static struct ucc_tdm_info utdm_primary_info = {
 	.uf_info = {
@@ -420,12 +421,25 @@ static netdev_tx_t ucc_hdlc_tx(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static int hdlc_tx_restart(struct ucc_hdlc_private *priv)
+{
+	u32 cecr_subblock;
+
+	cecr_subblock =
+		ucc_fast_get_qe_cr_subblock(priv->ut_info->uf_info.ucc_num);
+
+	qe_issue_cmd(QE_RESTART_TX, cecr_subblock,
+		     QE_CR_PROTOCOL_UNSPECIFIED, 0);
+	return 0;
+}
+
 static int hdlc_tx_done(struct ucc_hdlc_private *priv)
 {
 	/* Start from the next BD that should be filled */
 	struct net_device *dev = priv->ndev;
 	struct qe_bd *bd;		/* BD pointer */
 	u16 bd_status;
+	int tx_restart = 0;
 
 	bd = priv->dirty_tx;
 	bd_status = ioread16be(&bd->status);
@@ -434,6 +448,15 @@ static int hdlc_tx_done(struct ucc_hdlc_private *priv)
 	while ((bd_status & T_R_S) == 0) {
 		struct sk_buff *skb;
 
+		if (bd_status & T_UN_S) { /* Underrun */
+			dev->stats.tx_fifo_errors++;
+			tx_restart = 1;
+		}
+		if (bd_status & T_CT_S) { /* Carrier lost */
+			dev->stats.tx_carrier_errors++;
+			tx_restart = 1;
+		}
+
 		/* BD contains already transmitted buffer.   */
 		/* Handle the transmitted buffer and release */
 		/* the BD to be used with the current frame  */
@@ -465,6 +488,9 @@ static int hdlc_tx_done(struct ucc_hdlc_private *priv)
 	}
 	priv->dirty_tx = bd;
 
+	if (tx_restart)
+		hdlc_tx_restart(priv);
+
 	return 0;
 }
 
@@ -483,11 +509,22 @@ static int hdlc_rx_done(struct ucc_hdlc_private *priv, int rx_work_limit)
 
 	/* while there are received buffers and BD is full (~R_E) */
 	while (!((bd_status & (R_E_S)) || (--rx_work_limit < 0))) {
-		if (bd_status & R_OV_S)
-			dev->stats.rx_over_errors++;
-		if (bd_status & R_CR_S) {
-			dev->stats.rx_crc_errors++;
-			dev->stats.rx_dropped++;
+		if (bd_status & (RX_BD_ERRORS)) {
+			dev->stats.rx_errors++;
+
+			if (bd_status & R_CD_S)
+				dev->stats.collisions++;
+			if (bd_status & R_OV_S)
+				dev->stats.rx_fifo_errors++;
+			if (bd_status & R_CR_S)
+				dev->stats.rx_crc_errors++;
+			if (bd_status & R_AB_S)
+				dev->stats.rx_over_errors++;
+			if (bd_status & R_NO_S)
+				dev->stats.rx_frame_errors++;
+			if (bd_status & R_LG_S)
+				dev->stats.rx_length_errors++;
+
 			goto recycle;
 		}
 		bdbuffer = priv->rx_buffer +
@@ -535,7 +572,7 @@ static int hdlc_rx_done(struct ucc_hdlc_private *priv, int rx_work_limit)
 		netif_receive_skb(skb);
 
 recycle:
-		iowrite16be(bd_status | R_E_S | R_I_S, &bd->status);
+		iowrite16be((bd_status & R_W_S) | R_E_S | R_I_S, &bd->status);
 
 		/* update to point at the next bd */
 		if (bd_status & R_W_S) {
@@ -611,7 +648,7 @@ static irqreturn_t ucc_hdlc_irq_handler(int irq, void *dev_id)
 
 	/* Errors and other events */
 	if (ucce >> 16 & UCC_HDLC_UCCE_BSY)
-		dev->stats.rx_errors++;
+		dev->stats.rx_missed_errors++;
 	if (ucce >> 16 & UCC_HDLC_UCCE_TXE)
 		dev->stats.tx_errors++;
 
diff --git a/include/soc/fsl/qe/ucc_fast.h b/include/soc/fsl/qe/ucc_fast.h
index 3ee9e7c1a7d7..dcd6b865b590 100644
--- a/include/soc/fsl/qe/ucc_fast.h
+++ b/include/soc/fsl/qe/ucc_fast.h
@@ -41,8 +41,12 @@
 #define R_L_S	0x0800	/* last */
 #define R_F_S	0x0400	/* first */
 #define R_CM_S	0x0200	/* continuous mode */
+#define R_LG_S  0x0020  /* frame length */
+#define R_NO_S  0x0010  /* nonoctet */
+#define R_AB_S  0x0008  /* abort */
 #define R_CR_S	0x0004	/* crc */
-#define R_OV_S	0x0002	/* crc */
+#define R_OV_S	0x0002	/* overrun */
+#define R_CD_S  0x0001  /* carrier detect */
 
 /* transmit BD's status */
 #define T_R_S	0x8000	/* ready bit */
@@ -51,6 +55,8 @@
 #define T_L_S	0x0800	/* last */
 #define T_TC_S	0x0400	/* crc */
 #define T_TM_S	0x0200	/* continuous mode */
+#define T_UN_S  0x0002  /* hdlc underrun */
+#define T_CT_S  0x0001  /* hdlc carrier lost */
 
 /* Rx Data buffer must be 4 bytes aligned in most cases */
 #define UCC_FAST_RX_ALIGN			4
-- 
2.18.1
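
A userspace sketch of the rx accounting and the recycle mask from the patch above. The error bit values are copied from the ucc_fast.h hunk; the values of R_E_S, R_W_S and R_I_S are assumptions for illustration (they are not shown in the patch), and `struct rx_counters` is a hypothetical stand-in for `dev->stats`.

```c
#include <stdint.h>

/* Error bits as defined in the quoted ucc_fast.h hunk. */
#define R_LG_S	0x0020	/* frame length */
#define R_NO_S	0x0010	/* nonoctet */
#define R_AB_S	0x0008	/* abort */
#define R_CR_S	0x0004	/* crc */
#define R_OV_S	0x0002	/* overrun */
#define R_CD_S	0x0001	/* carrier detect */
#define RX_BD_ERRORS (R_CD_S | R_OV_S | R_CR_S | R_AB_S | R_NO_S | R_LG_S)

/* Control bits: values assumed here, not taken from the patch. */
#define R_E_S	0x8000	/* empty */
#define R_W_S	0x2000	/* wrap */
#define R_I_S	0x1000	/* interrupt */

/* Hypothetical counter block standing in for dev->stats. */
struct rx_counters {
	unsigned long errors, collisions, fifo, crc, over, frame, length;
};

/* Update the counters the way the patched hdlc_rx_done() does.
 * Returns 1 if the descriptor carried an error (frame is recycled).
 */
static int rx_account(uint16_t bd_status, struct rx_counters *c)
{
	if (!(bd_status & RX_BD_ERRORS))
		return 0;

	c->errors++;
	if (bd_status & R_CD_S)
		c->collisions++;
	if (bd_status & R_OV_S)
		c->fifo++;
	if (bd_status & R_CR_S)
		c->crc++;
	if (bd_status & R_AB_S)
		c->over++;
	if (bd_status & R_NO_S)
		c->frame++;
	if (bd_status & R_LG_S)
		c->length++;
	return 1;
}

/* Status written back on recycle: keep only the wrap bit, so stale
 * error bits are not carried into the next use of the descriptor.
 */
static uint16_t rx_recycle_status(uint16_t bd_status)
{
	return (uint16_t)((bd_status & R_W_S) | R_E_S | R_I_S);
}
```

The second helper mirrors the changed iowrite16be() argument: the old expression would have written the error bits back along with R_E_S and R_I_S.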

* Re: [RFC PATCH v2 02/10] udp: implement GRO for plain UDP sockets.
From: Paolo Abeni @ 2018-10-22 13:41 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Willem de Bruijn
In-Reply-To: <20181022112451.GH3823@gauss3.secunet.de>

On Mon, 2018-10-22 at 13:24 +0200, Steffen Klassert wrote:
> On Fri, Oct 19, 2018 at 04:25:12PM +0200, Paolo Abeni wrote:
> >  
> > +#define UDO_GRO_CNT_MAX 64
> 
> Maybe better UDP_GRO_CNT_MAX?

Oops, typo. Yes, sure, will address in the next iteration.

> Btw. do we really need this explicit limit?
> We should not get more than 64 packets during
> one napi poll cycle.

With HZ >= 1000, gro_flush happens at most once per jiffy: we can
have many more than 64 packets per segment, with an appropriate packet
length.

> 
> > +static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
> > +					       struct sk_buff *skb)
> > +{
> > +	struct udphdr *uh = udp_hdr(skb);
> > +	struct sk_buff *pp = NULL;
> > +	struct udphdr *uh2;
> > +	struct sk_buff *p;
> > +
> > +	/* requires non zero csum, for simmetry with GSO */
> > +	if (!uh->check) {
> > +		NAPI_GRO_CB(skb)->flush = 1;
> > +		return NULL;
> > +	}
> 
> Why is the requirement of checksums different than in 
> udp_gro_receive? It's not that I care much about UDP
> packets without a checksum, but you would not need
> to implement your own loop if the requirement could
> be the same as in udp_gro_receive.

Uhm...
AFAIU, we need to generate aggregated packets that UDP GSO is able to
process/segment. I was unable to get a nocsum packet segmented (possibly
PEBKAC), so I enforced that condition on the rx path.

@Willem: did I see a ghost here? Is UDP_SEGMENT fine with no-checksum
segments?

> > +
> > +	/* pull encapsulating udp header */
> > +	skb_gro_pull(skb, sizeof(struct udphdr));
> > +	skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
> > +
> > +	list_for_each_entry(p, head, list) {
> > +		if (!NAPI_GRO_CB(p)->same_flow)
> > +			continue;
> > +
> > +		uh2 = udp_hdr(p);
> > +
> > +		/* Match ports only, as csum is always non zero */
> > +		if ((*(u32 *)&uh->source != *(u32 *)&uh2->source)) {
> > +			NAPI_GRO_CB(p)->same_flow = 0;
> > +			continue;
> > +		}
> > +
> > +		/* Terminate the flow on len mismatch or if it grow "too much".
> > +		 * Under small packet flood GRO count could elsewhere grow a lot
> > +		 * leading to execessive truesize values
> > +		 */
> > +		if (!skb_gro_receive(p, skb) &&
> > +		    NAPI_GRO_CB(p)->count > UDO_GRO_CNT_MAX)
> 
> This allows to merge UDO_GRO_CNT_MAX + 1 packets.

Thanks, will address in the next iteration.

Cheers,

Paolo
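
A small simulation of the off-by-one that Steffen points out, assuming the aggregate's count starts at 1 for the head packet and grows by 1 per successful merge. The macro name and helper are hypothetical; only the limit of 64 comes from the patch.

```c
#include <stdbool.h>

#define GRO_CNT_MAX 64	/* same value as the macro in the quoted patch */

/* Return how many packets sit in one aggregate when the flush
 * condition first fires. With inclusive == false the test is
 * "count > GRO_CNT_MAX", as in the patch; with inclusive == true
 * it is "count >= GRO_CNT_MAX".
 */
static unsigned int packets_before_flush(bool inclusive)
{
	unsigned int count = 1;	/* the head packet */

	for (;;) {
		count++;	/* one more packet merged */
		if (inclusive ? count >= GRO_CNT_MAX : count > GRO_CNT_MAX)
			return count;
	}
}
```

Under these assumptions the strict comparison lets the aggregate reach GRO_CNT_MAX + 1 packets, while the inclusive one stops at GRO_CNT_MAX.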

* [PATCHv2 1/2] arm64: dts: stratix10: Support Ethernet Jumbo frame
From: thor.thayer @ 2018-10-22 22:22 UTC (permalink / raw)
  To: peppe.cavallaro, alexandre.torgue, joabreu, davem, dinguyen,
	robh+dt, mark.rutland
  Cc: devicetree, netdev, linux-kernel, Thor Thayer

From: Thor Thayer <thor.thayer@linux.intel.com>

Properly specify the RX and TX FIFO size which is important
for Jumbo frames.
Update the max-frame-size to support Jumbo frames.

Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com>
---
v2 Shorten title. No change to DTS.
---
 arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi      | 6 ++++++
 arch/arm64/boot/dts/altera/socfpga_stratix10_socdk.dts | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
index d033da401c26..8a2641c742ae 100644
--- a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
+++ b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
@@ -137,6 +137,8 @@
 			reset-names = "stmmaceth", "stmmaceth-ocp";
 			clocks = <&clkmgr STRATIX10_EMAC0_CLK>;
 			clock-names = "stmmaceth";
+			tx-fifo-depth = <16384>;
+			rx-fifo-depth = <16384>;
 			status = "disabled";
 		};
 
@@ -150,6 +152,8 @@
 			reset-names = "stmmaceth", "stmmaceth-ocp";
 			clocks = <&clkmgr STRATIX10_EMAC1_CLK>;
 			clock-names = "stmmaceth";
+			tx-fifo-depth = <16384>;
+			rx-fifo-depth = <16384>;
 			status = "disabled";
 		};
 
@@ -163,6 +167,8 @@
 			reset-names = "stmmaceth", "stmmaceth-ocp";
 			clocks = <&clkmgr STRATIX10_EMAC2_CLK>;
 			clock-names = "stmmaceth";
+			tx-fifo-depth = <16384>;
+			rx-fifo-depth = <16384>;
 			status = "disabled";
 		};
 
diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10_socdk.dts b/arch/arm64/boot/dts/altera/socfpga_stratix10_socdk.dts
index 6edc4fa9fd42..7c661753bfaf 100644
--- a/arch/arm64/boot/dts/altera/socfpga_stratix10_socdk.dts
+++ b/arch/arm64/boot/dts/altera/socfpga_stratix10_socdk.dts
@@ -76,7 +76,7 @@
 	phy-mode = "rgmii";
 	phy-handle = <&phy0>;
 
-	max-frame-size = <3800>;
+	max-frame-size = <9000>;
 
 	mdio0 {
 		#address-cells = <1>;
-- 
2.7.4

* [PATCHv2 2/2] net: stmmac: Set OWN bit for jumbo frames
From: thor.thayer @ 2018-10-22 22:22 UTC (permalink / raw)
  To: peppe.cavallaro, alexandre.torgue, joabreu, davem, dinguyen,
	robh+dt, mark.rutland
  Cc: devicetree, netdev, linux-kernel, Thor Thayer
In-Reply-To: <1540246946-12784-1-git-send-email-thor.thayer@linux.intel.com>

From: Thor Thayer <thor.thayer@linux.intel.com>

A ping with a jumbo packet gets no reply and triggers a watchdog timeout:

[   46.059616] ------------[ cut here ]------------
[   46.064268] NETDEV WATCHDOG: eth0 (socfpga-dwmac): transmit queue 0 timed out
[   46.071471] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x2cc/0x2d8
[   46.079708] Modules linked in:
[   46.082761] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.18.0-00115-gc262be665854-dirty #264
[   46.091082] Hardware name: SoCFPGA Stratix 10 SoCDK (DT)
[   46.096377] pstate: 20000005 (nzCv daif -PAN -UAO)
[   46.101152] pc : dev_watchdog+0x2cc/0x2d8
[   46.105149] lr : dev_watchdog+0x2cc/0x2d8
[   46.109144] sp : ffff00000800bd80
[   46.112447] x29: ffff00000800bd80 x28: ffff80007a9b4940
[   46.117744] x27: 00000000ffffffff x26: ffff80007aa183b0
[   46.123040] x25: 0000000000000001 x24: 0000000000000140
[   46.128336] x23: ffff80007aa1839c x22: ffff80007aa17fb0
[   46.133632] x21: ffff80007aa18000 x20: ffff0000091a7000
[   46.138927] x19: 0000000000000000 x18: ffffffffffffffff
[   46.144223] x17: 0000000000000000 x16: 0000000000000000
[   46.149519] x15: ffff0000091a96c8 x14: 07740775076f0720
[   46.154814] x13: 07640765076d0769 x12: 0774072007300720
[   46.160110] x11: 0765077507650775 x10: 0771072007740769
[   46.165406] x9 : 076d0773076e0761 x8 : 077207740720073a
[   46.170702] x7 : 072907630761076d x6 : ffff80007ff9a0c0
[   46.175997] x5 : ffff80007ff9a0c0 x4 : 0000000000000002
[   46.181293] x3 : 0000000000000000 x2 : ffff0000091ac180
[   46.186589] x1 : e6a742ebe628e800 x0 : 0000000000000000
[   46.191885] Call trace:
[   46.194326]  dev_watchdog+0x2cc/0x2d8
[   46.197980]  call_timer_fn+0x20/0x78
[   46.201544]  expire_timers+0xa4/0xb0
[   46.205108]  run_timer_softirq+0xe4/0x198
[   46.209107]  __do_softirq+0x114/0x210
[   46.212760]  irq_exit+0xd0/0xd8
[   46.215895]  __handle_domain_irq+0x60/0xb0
[   46.219977]  gic_handle_irq+0x58/0xa8
[   46.223628]  el1_irq+0xb0/0x128
[   46.226761]  arch_cpu_idle+0x10/0x18
[   46.230326]  do_idle+0x1d4/0x288
[   46.233544]  cpu_startup_entry+0x24/0x28
[   46.237457]  secondary_start_kernel+0x17c/0x1c0
[   46.241971] ---[ end trace 57048cd1372cd828 ]---

Inspection of the queue showed that jumbo packets were not sent out.
The ring-mode jumbo frame function needs to set the OWN bit so that
the packet is sent.

Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com>
---
v2  Add fix for jumbo frames > 8K (same OWN bit)
---
 drivers/net/ethernet/stmicro/stmmac/ring_mode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
index a7ffc73fffe8..abc3f85270cd 100644
--- a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
@@ -59,7 +59,7 @@ static int jumbo_frm(void *p, struct sk_buff *skb, int csum)
 
 		desc->des3 = cpu_to_le32(des2 + BUF_SIZE_4KiB);
 		stmmac_prepare_tx_desc(priv, desc, 1, bmax, csum,
-				STMMAC_RING_MODE, 0, false, skb->len);
+				STMMAC_RING_MODE, 1, false, skb->len);
 		tx_q->tx_skbuff[entry] = NULL;
 		entry = STMMAC_GET_ENTRY(entry, DMA_TX_SIZE);
 
@@ -91,7 +91,7 @@ static int jumbo_frm(void *p, struct sk_buff *skb, int csum)
 		tx_q->tx_skbuff_dma[entry].is_jumbo = true;
 		desc->des3 = cpu_to_le32(des2 + BUF_SIZE_4KiB);
 		stmmac_prepare_tx_desc(priv, desc, 1, nopaged_len, csum,
-				STMMAC_RING_MODE, 0, true, skb->len);
+				STMMAC_RING_MODE, 1, true, skb->len);
 	}
 
 	tx_q->cur_tx = entry;
-- 
2.7.4
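
A simplified model of why the OWN bit matters, assuming a hypothetical descriptor layout with OWN in bit 31 of the first word and a DMA engine that stops at the first descriptor it does not own. None of these names are the stmmac driver's; the sketch only shows that a descriptor prepared without OWN is never transmitted, which would be consistent with the watchdog timeout in the log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor; bit 31 as the OWN flag is an assumption. */
#define DESC_OWN (1u << 31)

struct tx_desc {
	uint32_t des0;	/* status/control word */
	uint32_t len;	/* buffer length */
};

/* Fill a descriptor; set_own hands it over to the (simulated) DMA. */
static void prepare_tx_desc(struct tx_desc *d, uint32_t len, bool set_own)
{
	d->len = len;
	d->des0 = set_own ? DESC_OWN : 0;
}

/* Simulated DMA pass: transmit descriptors in ring order and stop at
 * the first one still owned by the CPU. Returns the number sent.
 */
static unsigned int dma_transmit(const struct tx_desc *ring, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		if (!(ring[i].des0 & DESC_OWN))
			break;
	return i;
}
```

In this model a two-descriptor jumbo frame whose first descriptor lacks OWN sends nothing, even if the second descriptor is owned by the DMA.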

* Re: [PATCH net-next 0/6] net/ncsi: Allow enabling multiple packages & channels
From: Samuel Mendoza-Jonas @ 2018-10-22 22:24 UTC (permalink / raw)
  To: Justin.Lee1, davem; +Cc: netdev, linux-kernel, openbmc
In-Reply-To: <79ee8c9befaa40e781bfeba4ba25d396@AUSX13MPS302.AMER.DELL.COM>

On Fri, 2018-10-19 at 21:38 +0000, Justin.Lee1@Dell.com wrote:
> Hi Sam,
> 
> When I run the testing (enable multi-package (package 0 and 1) and 
> multi-channel (channel 0 and 1 on both packages)), I see there are two channels
> with TX enable.
> 
> When I unplug/plug network cable, it seems the behavior is to keep one TX per
> package, is it the behavior you expect? All channels are behind the same eth2,
> I would expect there is only one active TX.
> 
> I create the below file to expose driver's status. RX and TX are printing 
> from the follow variables.
> nc->modes[NCSI_MODE_ENABLE].enable,
> nc->modes[NCSI_MODE_TX_ENABLE].enable,
> 
> cat /sys/kernel/debug/ncsi_protocol/ncsi_device_status;
> IFIDX IFNAME NAME   PID CID RX TX MP MC WP WC PS RU CR NQ
> =========================================================
>   2   eth2   ncsi0  000 000 1  1  1  1  1  1  1  1  1  0 
>   2   eth2   ncsi1  000 001 1  0  1  1  1  1  1  1  1  0 
>   2   eth2   ncsi2  001 000 1  1  1  1  1  1  1  1  1  0 
>   2   eth2   ncsi3  001 001 1  0  1  1  1  1  1  1  1  0 
> =========================================================
> M:  Multi-mode      P:  Package ID
> W:  Whitelist       C:  Channel ID
> PS: Poll Status
> RU: Running
> CR: Carrier OK
> NQ: Queue Stopped

Hi Justin,

Thanks for testing; this is probably an oversight in
ncsi_channel_is_tx(), where channels from a package other than the
current one could be ignored when checking for the current channel's link
status. I'll update this to be more robust.

Sam
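
A short sketch of the check implied by Justin's report, using the PID, CID, RX and TX columns from the status table quoted above. The struct and function names are hypothetical and are not part of the NCSI driver.

```c
#include <stddef.h>

/* One row of the reported status table; field names are hypothetical. */
struct chan_status {
	int package;
	int channel;
	int rx;
	int tx;
};

/* The four rows reported for eth2 (PID, CID, RX, TX columns only). */
static const struct chan_status reported[] = {
	{ 0, 0, 1, 1 },
	{ 0, 1, 1, 0 },
	{ 1, 0, 1, 1 },
	{ 1, 1, 1, 0 },
};

/* Count the channels with TX enabled across all packages. */
static size_t count_tx_enabled(const struct chan_status *c, size_t n)
{
	size_t i, active = 0;

	for (i = 0; i < n; i++)
		if (c[i].tx)
			active++;
	return active;
}
```

For the reported rows the count is 2 (one per package), whereas Justin expects a single active TX channel behind eth2.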

> 
> Thanks,
> Justin
> 
> 
> From: Samuel Mendoza-Jonas <sam@mendozajonas.com>
> Date: Thu, 18 Oct 2018 14:59:11 +1100
> 
> > This series extends the NCSI driver to configure multiple packages
> > and/or channels simultaneously. Since the RFC series this includes a few
> > extra changes to fix areas in the driver that either made this harder or
> > were roadblocks due to deviations from the NCSI specification.
> > 
> > Patches 1 & 2 fix two issues where the driver made assumptions about the
> > capabilities of the NCSI topology.
> > Patches 3 & 4 change some internal semantics slightly to make multi-mode
> > easier.
> > Patch 5 introduces a cleaner way of reconfiguring the NCSI configuration
> > and keeping track of channel states.
> > Patch 6 implements the main multi-package/multi-channel configuration,
> > configured via the Netlink interface.
> > 
> > Readers who have an interesting NCSI setup - especially multi-package
> > with HWA - please test! I think I've covered all permutations but I
> > don't have infinite hardware to test on.
> 
> 

* Re: [PATCH net-next 2/3] sctp: add sock_reuseport for the sock in __sctp_hash_endpoint
From: Marcelo Ricardo Leitner @ 2018-10-22 14:15 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, Neil Horman, davem
In-Reply-To: <661578e3134c79c575d934b3267b327773fd34f7.1540095102.git.lucien.xin@gmail.com>

On Sun, Oct 21, 2018 at 12:43:37PM +0800, Xin Long wrote:
> This is a part of sk_reuseport support for sctp. It defines a helper
> sctp_bind_addrs_check() to check if the bind_addrs in two socks are
> matched. It will add sock_reuseport if they are completely matched,
> and return err if they are partly matched, and alloc sock_reuseport
> if all socks are not matched at all.
> 
> It will work until sk_reuseport support is added in
> sctp_get_port_local() in the next patch.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  include/net/sctp/sctp.h    |  2 +-
>  include/net/sctp/structs.h |  2 ++
>  net/core/sock_reuseport.c  |  1 +
>  net/sctp/bind_addr.c       | 28 ++++++++++++++++++++++
>  net/sctp/input.c           | 60 +++++++++++++++++++++++++++++++++++++++-------
>  net/sctp/socket.c          |  3 +--
>  6 files changed, 85 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index 8c2caa3..b8cd58d 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -152,7 +152,7 @@ int sctp_primitive_RECONF(struct net *net, struct sctp_association *asoc,
>   */
>  int sctp_rcv(struct sk_buff *skb);
>  void sctp_v4_err(struct sk_buff *skb, u32 info);
> -void sctp_hash_endpoint(struct sctp_endpoint *);
> +int sctp_hash_endpoint(struct sctp_endpoint *ep);
>  void sctp_unhash_endpoint(struct sctp_endpoint *);
>  struct sock *sctp_err_lookup(struct net *net, int family, struct sk_buff *,
>  			     struct sctphdr *, struct sctp_association **,
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index a11f937..15d017f 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -1190,6 +1190,8 @@ int sctp_bind_addr_conflict(struct sctp_bind_addr *, const union sctp_addr *,
>  			 struct sctp_sock *, struct sctp_sock *);
>  int sctp_bind_addr_state(const struct sctp_bind_addr *bp,
>  			 const union sctp_addr *addr);
> +int sctp_bind_addrs_check(struct sctp_sock *sp,
> +			  struct sctp_sock *sp2, int cnt2);
>  union sctp_addr *sctp_find_unmatch_addr(struct sctp_bind_addr	*bp,
>  					const union sctp_addr	*addrs,
>  					int			addrcnt,
> diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> index ba5cba5..d8fe3e5 100644
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -187,6 +187,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
>  		call_rcu(&old_reuse->rcu, reuseport_free_rcu);
>  	return 0;
>  }
> +EXPORT_SYMBOL(reuseport_add_sock);
>  
>  void reuseport_detach_sock(struct sock *sk)
>  {
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index 7df3704..78d0d93 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -337,6 +337,34 @@ int sctp_bind_addr_match(struct sctp_bind_addr *bp,
>  	return match;
>  }
>  
> +int sctp_bind_addrs_check(struct sctp_sock *sp,
> +			  struct sctp_sock *sp2, int cnt2)
> +{
> +	struct sctp_bind_addr *bp2 = &sp2->ep->base.bind_addr;
> +	struct sctp_bind_addr *bp = &sp->ep->base.bind_addr;
> +	struct sctp_sockaddr_entry *laddr, *laddr2;
> +	bool exist = false;
> +	int cnt = 0;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
> +		list_for_each_entry_rcu(laddr2, &bp2->address_list, list) {
> +			if (sp->pf->af->cmp_addr(&laddr->a, &laddr2->a) &&
> +			    laddr->valid == laddr2->valid) {

I think that by this point, in the normal run, laddr2->valid will always
be true, but as written it gives the impression that it accepts 0 == 0
too, which would be bad. Maybe a fast BINDX_REM/BINDX_ADD could trigger
laddr2->valid = 0 in there, not sure.

Anyway, maybe use '... laddr->valid && laddr2->valid' instead, or do
you really want to allow the 0 == 0 case?
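
A tiny sketch of the difference between the two conditions, using plain booleans in place of the `valid` fields. The helper names are hypothetical.

```c
#include <stdbool.h>

/* Equality, as in the patch: also true when both entries are invalid. */
static bool valid_match_eq(bool valid, bool valid2)
{
	return valid == valid2;
}

/* Conjunction, as suggested: true only when both entries are valid. */
static bool valid_match_and(bool valid, bool valid2)
{
	return valid && valid2;
}
```

The two helpers differ only for the (false, false) input, which is the 0 == 0 case in question.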

> +				exist = true;
> +				goto next;
> +			}
> +		}
> +		cnt = 0;
> +		break;
> +next:
> +		cnt++;
> +	}
> +	rcu_read_unlock();
> +
> +	return (cnt == cnt2) ? 0 : (exist ? -EEXIST : 1);
> +}
> +
>  /* Does the address 'addr' conflict with any addresses in
>   * the bp.
>   */
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index 60ede89..6bfeb10 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -723,43 +723,87 @@ static int sctp_rcv_ootb(struct sk_buff *skb)
>  }
>  
>  /* Insert endpoint into the hash table.  */
> -static void __sctp_hash_endpoint(struct sctp_endpoint *ep)
> +static int __sctp_hash_endpoint(struct sctp_endpoint *ep)
>  {
> -	struct net *net = sock_net(ep->base.sk);
> -	struct sctp_ep_common *epb;
> +	struct sock *sk = ep->base.sk;
> +	struct net *net = sock_net(sk);
>  	struct sctp_hashbucket *head;
> +	struct sctp_ep_common *epb;
>  
>  	epb = &ep->base;
> -
>  	epb->hashent = sctp_ep_hashfn(net, epb->bind_addr.port);
>  	head = &sctp_ep_hashtable[epb->hashent];
>  
> +	if (sk->sk_reuseport) {
> +		bool any = sctp_is_ep_boundall(sk);
> +		struct sctp_ep_common *epb2;
> +		struct list_head *list;
> +		int cnt = 0, err = 1;
> +
> +		list_for_each(list, &ep->base.bind_addr.address_list)
> +			cnt++;
> +
> +		sctp_for_each_hentry(epb2, &head->chain) {
> +			struct sock *sk2 = epb2->sk;
> +
> +			if (!net_eq(sock_net(sk2), net) || sk2 == sk ||
> +			    !uid_eq(sock_i_uid(sk2), sock_i_uid(sk)) ||
> +			    !sk2->sk_reuseport)
> +				continue;
> +
> +			err = sctp_bind_addrs_check(sctp_sk(sk2),
> +						    sctp_sk(sk), cnt);
> +			if (!err) {
> +				err = reuseport_add_sock(sk, sk2, any);
> +				if (err)
> +					return err;
> +				break;
> +			} else if (err < 0) {
> +				return err;
> +			}
> +		}
> +
> +		if (err) {
> +			err = reuseport_alloc(sk, any);
> +			if (err)
> +				return err;
> +		}
> +	}
> +
>  	write_lock(&head->lock);
>  	hlist_add_head(&epb->node, &head->chain);
>  	write_unlock(&head->lock);
> +	return 0;
>  }
>  
>  /* Add an endpoint to the hash. Local BH-safe. */
> -void sctp_hash_endpoint(struct sctp_endpoint *ep)
> +int sctp_hash_endpoint(struct sctp_endpoint *ep)
>  {
> +	int err;
> +
>  	local_bh_disable();
> -	__sctp_hash_endpoint(ep);
> +	err = __sctp_hash_endpoint(ep);
>  	local_bh_enable();
> +
> +	return err;
>  }
>  
>  /* Remove endpoint from the hash table.  */
>  static void __sctp_unhash_endpoint(struct sctp_endpoint *ep)
>  {
> -	struct net *net = sock_net(ep->base.sk);
> +	struct sock *sk = ep->base.sk;
>  	struct sctp_hashbucket *head;
>  	struct sctp_ep_common *epb;
>  
>  	epb = &ep->base;
>  
> -	epb->hashent = sctp_ep_hashfn(net, epb->bind_addr.port);
> +	epb->hashent = sctp_ep_hashfn(sock_net(sk), epb->bind_addr.port);
>  
>  	head = &sctp_ep_hashtable[epb->hashent];
>  
> +	if (rcu_access_pointer(sk->sk_reuseport_cb))
> +		reuseport_detach_sock(sk);
> +
>  	write_lock(&head->lock);
>  	hlist_del_init(&epb->node);
>  	write_unlock(&head->lock);
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index fc0386e..44e7d8c 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -7850,8 +7850,7 @@ static int sctp_listen_start(struct sock *sk, int backlog)
>  	}
>  
>  	sk->sk_max_ack_backlog = backlog;
> -	sctp_hash_endpoint(ep);
> -	return 0;
> +	return sctp_hash_endpoint(ep);
>  }
>  
>  /*
> -- 
> 2.1.0
> 

* Re: [PATCH net-next 1/3] sctp: do reuseport_select_sock in __sctp_rcv_lookup_endpoint
From: Marcelo Ricardo Leitner @ 2018-10-22 14:17 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, Neil Horman, davem
In-Reply-To: <08b533092a01a8f7cf2eb4c459fe3570a8df702b.1540095102.git.lucien.xin@gmail.com>

On Sun, Oct 21, 2018 at 12:43:36PM +0800, Xin Long wrote:
> This is a part of sk_reuseport support for sctp, and it selects a
> sock by the hashkey of lport, paddr and dport by default. It will
> work until sk_reuseport support is added in sctp_get_port_local()
> in the next patch.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/sctp/input.c | 69 +++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 41 insertions(+), 28 deletions(-)
> 
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index 5c36a99..60ede89 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -57,6 +57,7 @@
>  #include <net/sctp/checksum.h>
>  #include <net/net_namespace.h>
>  #include <linux/rhashtable.h>
> +#include <net/sock_reuseport.h>
>  
>  /* Forward declarations for internal helpers. */
>  static int sctp_rcv_ootb(struct sk_buff *);
> @@ -65,8 +66,10 @@ static struct sctp_association *__sctp_rcv_lookup(struct net *net,
>  				      const union sctp_addr *paddr,
>  				      const union sctp_addr *laddr,
>  				      struct sctp_transport **transportp);
> -static struct sctp_endpoint *__sctp_rcv_lookup_endpoint(struct net *net,
> -						const union sctp_addr *laddr);
> +static struct sctp_endpoint *__sctp_rcv_lookup_endpoint(
> +					struct net *net, struct sk_buff *skb,
> +					const union sctp_addr *laddr,
> +					const union sctp_addr *daddr);
>  static struct sctp_association *__sctp_lookup_association(
>  					struct net *net,
>  					const union sctp_addr *local,
> @@ -171,7 +174,7 @@ int sctp_rcv(struct sk_buff *skb)
>  	asoc = __sctp_rcv_lookup(net, skb, &src, &dest, &transport);
>  
>  	if (!asoc)
> -		ep = __sctp_rcv_lookup_endpoint(net, &dest);
> +		ep = __sctp_rcv_lookup_endpoint(net, skb, &dest, &src);
>  
>  	/* Retrieve the common input handling substructure. */
>  	rcvr = asoc ? &asoc->base : &ep->base;
> @@ -770,16 +773,35 @@ void sctp_unhash_endpoint(struct sctp_endpoint *ep)
>  	local_bh_enable();
>  }
>  
> +static inline __u32 sctp_hashfn(const struct net *net, __be16 lport,
> +				const union sctp_addr *paddr, __u32 seed)
> +{
> +	__u32 addr;
> +
> +	if (paddr->sa.sa_family == AF_INET6)
> +		addr = jhash(&paddr->v6.sin6_addr, 16, seed);
> +	else
> +		addr = (__force __u32)paddr->v4.sin_addr.s_addr;
> +
> +	return  jhash_3words(addr, ((__force __u32)paddr->v4.sin_port) << 16 |
> +			     (__force __u32)lport, net_hash_mix(net), seed);
> +}
> +
>  /* Look up an endpoint. */
> -static struct sctp_endpoint *__sctp_rcv_lookup_endpoint(struct net *net,
> -						const union sctp_addr *laddr)
> +static struct sctp_endpoint *__sctp_rcv_lookup_endpoint(
> +					struct net *net, struct sk_buff *skb,
> +					const union sctp_addr *laddr,
> +					const union sctp_addr *paddr)
>  {
>  	struct sctp_hashbucket *head;
>  	struct sctp_ep_common *epb;
>  	struct sctp_endpoint *ep;
> +	struct sock *sk;
> +	__be32 lport;

This could be a __be16 one.
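
A brief sketch of why 16 bits are enough for the local port: the hash helper in the quoted patch packs the peer port and the local port into one 32-bit word. The types below are plain stdint stand-ins for __be16 and __u32, and the function name is hypothetical.

```c
#include <stdint.h>

/* Pack two 16-bit ports into one 32-bit word, peer port in the upper
 * half and local port in the lower half.
 */
static uint32_t pack_ports(uint16_t peer_port, uint16_t local_port)
{
	return ((uint32_t)peer_port << 16) | (uint32_t)local_port;
}
```

A wider type for the local port would not add information here, since sin_port itself is a 16-bit field.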

>  	int hash;
>  
> -	hash = sctp_ep_hashfn(net, ntohs(laddr->v4.sin_port));
> +	lport = laddr->v4.sin_port;
> +	hash = sctp_ep_hashfn(net, ntohs(lport));
>  	head = &sctp_ep_hashtable[hash];
>  	read_lock(&head->lock);
>  	sctp_for_each_hentry(epb, &head->chain) {
> @@ -791,6 +813,15 @@ static struct sctp_endpoint *__sctp_rcv_lookup_endpoint(struct net *net,
>  	ep = sctp_sk(net->sctp.ctl_sock)->ep;
>  
>  hit:
> +	sk = ep->base.sk;
> +	if (sk->sk_reuseport) {
> +		__u32 phash = sctp_hashfn(net, lport, paddr, 0);
> +
> +		sk = reuseport_select_sock(sk, phash, skb,
> +					   sizeof(struct sctphdr));
> +		if (sk)
> +			ep = sctp_sk(sk)->ep;
> +	}
>  	sctp_endpoint_hold(ep);
>  	read_unlock(&head->lock);
>  	return ep;
> @@ -829,35 +860,17 @@ static inline int sctp_hash_cmp(struct rhashtable_compare_arg *arg,
>  static inline __u32 sctp_hash_obj(const void *data, u32 len, u32 seed)
>  {
>  	const struct sctp_transport *t = data;
> -	const union sctp_addr *paddr = &t->ipaddr;
> -	const struct net *net = sock_net(t->asoc->base.sk);
> -	__be16 lport = htons(t->asoc->base.bind_addr.port);
> -	__u32 addr;
> -
> -	if (paddr->sa.sa_family == AF_INET6)
> -		addr = jhash(&paddr->v6.sin6_addr, 16, seed);
> -	else
> -		addr = (__force __u32)paddr->v4.sin_addr.s_addr;
>  
> -	return  jhash_3words(addr, ((__force __u32)paddr->v4.sin_port) << 16 |
> -			     (__force __u32)lport, net_hash_mix(net), seed);
> +	return sctp_hashfn(sock_net(t->asoc->base.sk),
> +			   htons(t->asoc->base.bind_addr.port),
> +			   &t->ipaddr, seed);
>  }
>  
>  static inline __u32 sctp_hash_key(const void *data, u32 len, u32 seed)
>  {
>  	const struct sctp_hash_cmp_arg *x = data;
> -	const union sctp_addr *paddr = x->paddr;
> -	const struct net *net = x->net;
> -	__be16 lport = x->lport;
> -	__u32 addr;
> -
> -	if (paddr->sa.sa_family == AF_INET6)
> -		addr = jhash(&paddr->v6.sin6_addr, 16, seed);
> -	else
> -		addr = (__force __u32)paddr->v4.sin_addr.s_addr;
>  
> -	return  jhash_3words(addr, ((__force __u32)paddr->v4.sin_port) << 16 |
> -			     (__force __u32)lport, net_hash_mix(net), seed);
> +	return sctp_hashfn(x->net, x->lport, x->paddr, seed);
>  }
>  
>  static const struct rhashtable_params sctp_hash_params = {
> -- 
> 2.1.0
> 
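
On the remark above that lport could be a __be16: a minimal user-space
sketch, assuming plain host-order integers instead of the kernel's
__be16/__force casts, of how the quoted sctp_hashfn() folds the two ports
into one 32-bit word. pack_ports() is a hypothetical name used only for
illustration; the point is that each port occupies exactly 16 bits.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: fold two 16-bit ports into
 * one 32-bit word, peer port in the upper half, local port in the lower.
 */
static uint32_t pack_ports(uint16_t peer_port, uint16_t local_port)
{
	return ((uint32_t)peer_port << 16) | (uint32_t)local_port;
}
```

Since the local port only ever fills the lower 16 bits of that word, a
16-bit type is enough to carry it.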


* Re: [PATCH net-next 0/3] sctp: add support for sk_reuseport
From: Marcelo Ricardo Leitner @ 2018-10-22 14:20 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, Neil Horman, davem
In-Reply-To: <cover.1540095102.git.lucien.xin@gmail.com>

On Sun, Oct 21, 2018 at 12:43:35PM +0800, Xin Long wrote:
> sctp sk_reuseport allows multiple socks to listen on the same port and
> addresses, as long as these socks have the same uid. This works pretty
> much as TCP/UDP does; the only difference is that sctp is multi-homed
> and all the bind_addrs in these socks have to match completely,
> otherwise listen() will return an error.
> 

FWIW, I won't be able to review this patchset thoroughly. The 2 small
comments that I sent are all I have.

Thanks,
Marcelo
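
For readers following the reuseport discussion, a minimal sketch of how a
hash can be turned into an index among N listening sockets. This assumes
the multiply-and-shift scaling that the kernel's reciprocal_scale() helper
uses; pick_sock_index() is a hypothetical user-space stand-in and not code
from this patchset.

```c
#include <stdint.h>

/* Hypothetical stand-in: map a 32-bit hash onto [0, nsocks) without a
 * division, by scaling the hash with a 64-bit multiply and a shift.
 */
static uint32_t pick_sock_index(uint32_t hash, uint32_t nsocks)
{
	return (uint32_t)(((uint64_t)hash * nsocks) >> 32);
}
```

With a well-mixed hash, packets from the same peer keep landing on the
same socket while different peers spread across the group.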


* Re: [PATCH 1/1] ptp: ptp_dte: simplify getting .driver_data
From: Richard Cochran @ 2018-10-22 22:43 UTC (permalink / raw)
  To: Wolfram Sang; +Cc: linux-kernel, linux-renesas-soc, netdev
In-Reply-To: <20181021200039.1933-2-wsa+renesas@sang-engineering.com>

On Sun, Oct 21, 2018 at 10:00:39PM +0200, Wolfram Sang wrote:
> We should get 'driver_data' from 'struct device' directly. Going via
> platform_device is an unneeded step back and forth.

Acked-by: Richard Cochran <richardcochran@gmail.com>


* Re: [PATCH net-next 1/6] qed: Add doorbell overflow recovery mechanism
From: kbuild test robot @ 2018-10-22 14:26 UTC (permalink / raw)
  To: Ariel Elior
  Cc: kbuild-all, davem, netdev, Ariel Elior, Michal Kalderon,
	Tomer Tayar
In-Reply-To: <20181022122743.20384-2-Ariel.Elior@cavium.com>

[-- Attachment #1: Type: text/plain, Size: 2021 bytes --]

Hi Ariel,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Ariel-Elior/qed-Doorbell-overflow-recovery/20181022-212749
config: i386-randconfig-x005-201842 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from drivers/net/ethernet/qlogic/qed/qed_dev.c:47:0:
   drivers/net/ethernet/qlogic/qed/qed_dev.c: In function 'qed_db_recovery_ring':
>> include/linux/qed/qed_if.h:466:40: error: implicit declaration of function 'writeq'; did you mean 'writel'? [-Werror=implicit-function-declaration]
    #define DIRECT_REG_WR64(reg_addr, val) writeq((u32)val, \
                                           ^
>> drivers/net/ethernet/qlogic/qed/qed_dev.c:342:4: note: in expansion of macro 'DIRECT_REG_WR64'
       DIRECT_REG_WR64(db_entry->db_addr,
       ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   In file included from drivers/net//ethernet/qlogic/qed/qed_dev.c:47:0:
   drivers/net//ethernet/qlogic/qed/qed_dev.c: In function 'qed_db_recovery_ring':
>> include/linux/qed/qed_if.h:466:40: error: implicit declaration of function 'writeq'; did you mean 'writel'? [-Werror=implicit-function-declaration]
    #define DIRECT_REG_WR64(reg_addr, val) writeq((u32)val, \
                                           ^
   drivers/net//ethernet/qlogic/qed/qed_dev.c:342:4: note: in expansion of macro 'DIRECT_REG_WR64'
       DIRECT_REG_WR64(db_entry->db_addr,
       ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +466 include/linux/qed/qed_if.h

   465	
 > 466	#define DIRECT_REG_WR64(reg_addr, val) writeq((u32)val,	\
   467						      (void __iomem *)(reg_addr))
   468	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29652 bytes --]
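
The error above says that writeq() is not declared in this i386 build.
One common way to cope on 32-bit targets is to split the 64-bit store
into two 32-bit stores, low half first; in the kernel,
<linux/io-64-nonatomic-lo-hi.h> provides such a fallback. The sketch below
is a user-space illustration only: write64_lo_hi() is a hypothetical
name, the "register" is ordinary memory, and whether the qed doorbell
tolerates a non-atomic 64-bit write is an assumption this report does not
confirm.

```c
#include <stdint.h>

/* Hypothetical illustration: emulate a 64-bit register write with two
 * 32-bit stores, low word first, then high word.
 */
static void write64_lo_hi(volatile uint32_t *reg, uint64_t val)
{
	reg[0] = (uint32_t)val;		/* low 32 bits */
	reg[1] = (uint32_t)(val >> 32);	/* high 32 bits */
}
```

Note also that the (u32) cast in the quoted DIRECT_REG_WR64 macro would
truncate val to 32 bits before the write, which may not be what a 64-bit
register write intends.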


* [PATCH bpf] bpf: add bpf_jit_limit knob to restrict unpriv allocations
From: Daniel Borkmann @ 2018-10-22 23:11 UTC (permalink / raw)
  To: ast
  Cc: rick.p.edgecombe, eric.dumazet, jannh, keescook, linux-kernel,
	netdev, Daniel Borkmann

Rick reported that the BPF JIT could potentially fill the entire module
space with BPF programs from unprivileged users, which would prevent later
attempts to load normal kernel modules or privileged BPF programs, for
example. Before commit 290af86629b2 ("bpf: introduce BPF_JIT_ALWAYS_ON
config"), if the JIT was enabled but failed to generate the image, we
would always fall back to the BPF interpreter. Nowadays, when
CONFIG_BPF_JIT_ALWAYS_ON is set, the load aborts with a failure since
the BPF interpreter is compiled out.

Add a global limit and enforce it for unprivileged users, such that if the
BPF interpreter is compiled out we fail once the limit has been reached,
and if it is compiled in we fall back to it earlier without using module
memory. As a next step, fair sharing among unprivileged users can be
addressed, in particular for the case where we would fail hard once the
limit is reached.

Fixes: 290af86629b2 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: LKML <linux-kernel@vger.kernel.org>
---
 Hi Rick, I've reworked the original patch into something much simpler
 which is only focussing on the actual main issue we want to resolve right
 now as a first step to make some forward progress, that is, limiting usage
 on the JIT for unprivileged users. Tested the below on x86 and arm64.
 (Trimmed down massive Cc list as well a bit and Cc'ed people related to
 commits referenced and netdev where BPF patches are usually discussed.)
 Thanks a lot!

 Documentation/sysctl/net.txt |  8 ++++++++
 include/linux/filter.h       |  1 +
 kernel/bpf/core.c            | 49 +++++++++++++++++++++++++++++++++++++++++---
 net/core/sysctl_net_core.c   | 10 +++++++--
 4 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 9ecde51..2793d4e 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -92,6 +92,14 @@ Values :
 	0 - disable JIT kallsyms export (default value)
 	1 - enable JIT kallsyms export for privileged users only
 
+bpf_jit_limit
+-------------
+
+This enforces a global limit for memory allocations to the BPF JIT
+compiler in order to reject unprivileged JIT requests once it has
+been surpassed. bpf_jit_limit contains the value of the global limit
+in bytes.
+
 dev_weight
 --------------
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 91b4c93..de629b7 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -854,6 +854,7 @@ bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
 extern int bpf_jit_enable;
 extern int bpf_jit_harden;
 extern int bpf_jit_kallsyms;
+extern int bpf_jit_limit;
 
 typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7c7eeea..6377225 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -365,10 +365,13 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 }
 
 #ifdef CONFIG_BPF_JIT
+# define BPF_JIT_LIMIT_DEFAULT	(PAGE_SIZE * 40000)
+
 /* All BPF JIT sysctl knobs here. */
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
 int bpf_jit_harden   __read_mostly;
 int bpf_jit_kallsyms __read_mostly;
+int bpf_jit_limit    __read_mostly = BPF_JIT_LIMIT_DEFAULT;
 
 static __always_inline void
 bpf_get_prog_addr_region(const struct bpf_prog *prog,
@@ -577,27 +580,64 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
 	return ret;
 }
 
+static atomic_long_t bpf_jit_current;
+
+#if defined(MODULES_VADDR)
+static int __init bpf_jit_charge_init(void)
+{
+	/* Only used as heuristic here to derive limit. */
+	bpf_jit_limit = min_t(u64, round_up((MODULES_END - MODULES_VADDR) >> 2,
+					    PAGE_SIZE), INT_MAX);
+	return 0;
+}
+pure_initcall(bpf_jit_charge_init);
+#endif
+
+static int bpf_jit_charge_modmem(u32 pages)
+{
+	if (atomic_long_add_return(pages, &bpf_jit_current) >
+	    (bpf_jit_limit >> PAGE_SHIFT)) {
+		if (!capable(CAP_SYS_ADMIN)) {
+			atomic_long_sub(pages, &bpf_jit_current);
+			return -EPERM;
+		}
+	}
+
+	return 0;
+}
+
+static void bpf_jit_uncharge_modmem(u32 pages)
+{
+	atomic_long_sub(pages, &bpf_jit_current);
+}
+
 struct bpf_binary_header *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
 		     bpf_jit_fill_hole_t bpf_fill_ill_insns)
 {
 	struct bpf_binary_header *hdr;
-	unsigned int size, hole, start;
+	u32 size, hole, start, pages;
 
 	/* Most of BPF filters are really small, but if some of them
 	 * fill a page, allow at least 128 extra bytes to insert a
 	 * random section of illegal instructions.
 	 */
 	size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
+	pages = size / PAGE_SIZE;
+
+	if (bpf_jit_charge_modmem(pages))
+		return NULL;
 	hdr = module_alloc(size);
-	if (hdr == NULL)
+	if (!hdr) {
+		bpf_jit_uncharge_modmem(pages);
 		return NULL;
+	}
 
 	/* Fill space with illegal/arch-dep instructions. */
 	bpf_fill_ill_insns(hdr, size);
 
-	hdr->pages = size / PAGE_SIZE;
+	hdr->pages = pages;
 	hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)),
 		     PAGE_SIZE - sizeof(*hdr));
 	start = (get_random_int() % hole) & ~(alignment - 1);
@@ -610,7 +650,10 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 
 void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 {
+	u32 pages = hdr->pages;
+
 	module_memfree(hdr);
+	bpf_jit_uncharge_modmem(pages);
 }
 
 /* This symbol is only overridden by archs that have different
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index b1a2c5e..37b4667 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -279,7 +279,6 @@ static int proc_dointvec_minmax_bpf_enable(struct ctl_table *table, int write,
 	return ret;
 }
 
-# ifdef CONFIG_HAVE_EBPF_JIT
 static int
 proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
 				    void __user *buffer, size_t *lenp,
@@ -290,7 +289,6 @@ proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
 
 	return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 }
-# endif
 #endif
 
 static struct ctl_table net_core_table[] = {
@@ -397,6 +395,14 @@ static struct ctl_table net_core_table[] = {
 		.extra2		= &one,
 	},
 # endif
+	{
+		.procname	= "bpf_jit_limit",
+		.data		= &bpf_jit_limit,
+		.maxlen		= sizeof(int),
+		.mode		= 0600,
+		.proc_handler	= proc_dointvec_minmax_bpf_restricted,
+		.extra1		= &one,
+	},
 #endif
 	{
 		.procname	= "netdev_tstamp_prequeue",
-- 
2.9.5
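
To make the accounting in bpf_jit_charge_modmem() easier to follow, here
is a minimal user-space sketch of the same charge/uncharge pattern using
C11 atomics. All names (jit_charge, jit_uncharge, jit_limit_pages) and the
limit of 4 pages are hypothetical; the privileged flag stands in for the
capable(CAP_SYS_ADMIN) check. For scale, with 4 KiB pages the default
PAGE_SIZE * 40000 in the patch works out to 163,840,000 bytes.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static atomic_long jit_current_pages;	/* pages currently charged */
static long jit_limit_pages = 4;	/* illustrative limit, in pages */

/* Charge first, then check: an unprivileged caller that pushes the
 * counter over the limit is rolled back and rejected. A privileged
 * caller is allowed over the limit and stays charged.
 */
static int jit_charge(uint32_t pages, bool privileged)
{
	long now = atomic_fetch_add(&jit_current_pages, (long)pages) +
		   (long)pages;

	if (now > jit_limit_pages && !privileged) {
		atomic_fetch_sub(&jit_current_pages, (long)pages);
		return -1;
	}
	return 0;
}

/* Give the pages back when the image is freed. */
static void jit_uncharge(uint32_t pages)
{
	atomic_fetch_sub(&jit_current_pages, (long)pages);
}
```

Charging before the check keeps the fast path to a single atomic
operation, at the cost of the counter briefly overshooting while a
rejected request is being rolled back.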


* Re: [PATCH 0/2] Mark expected switch fall-throughs and fix missing break
From: Gustavo A. R. Silva @ 2018-10-22 14:53 UTC (permalink / raw)
  To: Jes Sorensen, linux-kernel
  Cc: Kalle Valo, linux-wireless, David S. Miller, netdev
In-Reply-To: <a1b9b68a-ca94-f02f-4a87-bbd46782c41c@gmail.com>

On 10/22/18 4:36 PM, Jes Sorensen wrote:
> On 10/22/18 7:49 AM, Gustavo A. R. Silva wrote:
>> In preparation to enabling -Wimplicit-fallthrough, this patchset aims
>> to mark multiple switch cases where we are expecting to fall through.
>>
>> Also, the second patch in this series fixes a missing break in switch.
> 
> Enabling that sounds like a great way to inflict pain and suffering.
> 

Not really. -Wimplicit-fallthrough will not be enabled until after all the
current warnings have been addressed.

There are 600 of these issues left. So, hopefully I will complete this task
during the next development cycle.

Thanks
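
For anyone unfamiliar with what "marking" a fall-through looks like, a
small self-contained sketch follows. It assumes the comment-style
annotation that GCC's -Wimplicit-fallthrough recognizes at its default
level; level_to_flags() and its flag values are made up for illustration
and do not come from the patches.

```c
/* Hypothetical example: each level also gets the flags of the levels
 * below it, so the cases fall through on purpose and say so.
 */
static int level_to_flags(int level)
{
	int flags = 0;

	switch (level) {
	case 2:
		flags |= 0x4;
		/* fall through */
	case 1:
		flags |= 0x2;
		/* fall through */
	case 0:
		flags |= 0x1;
		break;
	default:
		flags = -1;
		break;
	}
	return flags;
}
```

A case that drops into the next one without such a marker is what the
warning reports, which is how a genuinely missing break gets noticed.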


* Re: CRC errors between mvneta and macb
From: Richard Genoud @ 2018-10-22 15:15 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: linux-kernel, Thomas Petazzoni, Antoine Tenart, Gregory CLEMENT,
	Yelena Krivosheev, Maxime Chevallier, Nicolas Ferre, netdev
In-Reply-To: <c24489f6-3b9a-f26b-bc7f-b564d5567294@sorico.fr>

Le 22/10/2018 à 08:51, Richard Genoud a écrit :
> Le 19/10/2018 à 17:44, Willy Tarreau a écrit :
>> On Fri, Oct 19, 2018 at 05:15:03PM +0200, Richard Genoud wrote:
>>> When there's a CRC error, the TXCLK has its polarity inverted...
>>> That's a clue !
>>>
>>> But this TXCLK (25MHz) is not used on the g35-ek.
>>> Only the REFCLK/XT2 (50MHz) is used to synchronise the PHY and the macb.
>>> So I guess that the TXCLK has a role to play to generate TX+/TX-
>>
>> Well, just a stupid idea, maybe when this signal is inverted, the TX+/TX-
>> are desynchronized by half a clock and are not always properly interpreted
>> on the other side ?
>>
>> Willy
>>
> 
> I must admit that I'm not familiar with the PHY internals, I'll have to
> dig into that.
> 
> Richard.
> 

I dug further into the subject, and I think I found what Marvell's
PHY/MAC doesn't like.

First of all, I forced the link to 10Mbit/s full duplex on both sides,
as the Manchester code is "easier" to decode than the 4B5B-MLT3 used for
fast ethernet.

Fortunately, the FCS errors are still present at 10Mbit/s.

After analyzing the ethernet frame on the Davicom PHY's output (pin
TX+), I found that the FCS errors occur when the ethernet preamble
is longer than 56 bits (something like 58 or 60 bits).

To put it another way, instead of having 28 times 1-0 followed by
the SFD (10101011), I see 29 or 30 times 1-0 followed by the SFD
(sometimes 29, sometimes 30).

Should a longer preamble be considered an FCS error? It seems a
little harsh since the point of the preamble is to synchronize the frame.

I don't know what the 802.3 standard says about that.

Regards,
Richard.
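
To make the counting above reproducible, here is a minimal sketch that
takes a captured bit sequence as a string of '0'/'1' characters and
reports how many 1-0 pairs precede the SFD. preamble_pairs() is a
hypothetical helper; it assumes the capture starts at the first preamble
bit and that the SFD is 10101011 as described above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the "10" pairs of the preamble.
 * The SFD (10101011) itself is three "10" pairs followed by "11", so the
 * function counts all "10" pairs up to the closing "11" and subtracts 3.
 * Returns -1 if the sequence does not end in a recognizable SFD.
 */
static int preamble_pairs(const char *bits)
{
	size_t n = strlen(bits);
	size_t i = 0;
	int pairs = 0;

	while (i + 1 < n && bits[i] == '1' && bits[i + 1] == '0') {
		pairs++;
		i += 2;
	}
	if (i + 1 >= n || bits[i] != '1' || bits[i + 1] != '1')
		return -1;	/* no "11" terminating the SFD */
	if (pairs < 3)
		return -1;	/* too short to contain the SFD */
	return pairs - 3;
}
```

A standard 56-bit preamble would give 28; the captures described above
would give 29 or 30.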


* Re: [RFC PATCH v2 02/10] udp: implement GRO for plain UDP sockets.
From: Willem de Bruijn @ 2018-10-22 15:15 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Network Development, Willem de Bruijn, steffen.klassert
In-Reply-To: <c046191abff49ddaef72b870182d4e922b8d80a2.camel@redhat.com>

On Mon, Oct 22, 2018 at 6:13 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2018-10-21 at 16:06 -0400, Willem de Bruijn wrote:
> > On Fri, Oct 19, 2018 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso
> > > with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also
> > > eligible for GRO in the rx path: UDP segments directed to such socket
> > > are assembled into a larger GSO_UDP_L4 packet.
> > >
> > > The core UDP GRO support is enabled with setsockopt(UDP_GRO).
> > >
> > > Initial benchmark numbers:
> > >
> > > Before:
> > > udp rx:   1079 MB/s   769065 calls/s
> > >
> > > After:
> > > udp rx:   1466 MB/s    24877 calls/s
> > >
> > >
> > > This change introduces a side effect in respect to UDP tunnels:
> > > after a UDP tunnel creation, now the kernel performs a lookup per ingress
> > > UDP packet, while before such lookup happened only if the ingress packet
> > > carried a valid internal header csum.
> > >
> > > v1 -> v2:
> > >  - use a new option to enable UDP GRO
> > >  - use static keys to protect the UDP GRO socket lookup
> > >
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > ---
> > >  include/linux/udp.h      |   3 +-
> > >  include/uapi/linux/udp.h |   1 +
> > >  net/ipv4/udp.c           |   7 +++
> > >  net/ipv4/udp_offload.c   | 109 +++++++++++++++++++++++++++++++--------
> > >  net/ipv6/udp_offload.c   |   6 +--
> > >  5 files changed, 98 insertions(+), 28 deletions(-)
> > >
> > > diff --git a/include/linux/udp.h b/include/linux/udp.h
> > > index a4dafff407fb..f613b329852e 100644
> > > --- a/include/linux/udp.h
> > > +++ b/include/linux/udp.h
> > > @@ -50,11 +50,12 @@ struct udp_sock {
> > >         __u8             encap_type;    /* Is this an Encapsulation socket? */
> > >         unsigned char    no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
> > >                          no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
> > > -                        encap_enabled:1; /* This socket enabled encap
> > > +                        encap_enabled:1, /* This socket enabled encap
> > >                                            * processing; UDP tunnels and
> > >                                            * different encapsulation layer set
> > >                                            * this
> > >                                            */
> > > +                        gro_enabled:1; /* Can accept GRO packets */
> > >
> > >         /*
> > >          * Following member retains the information to create a UDP header
> > >          * when the socket is uncorked.
> > > diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
> > > index 09502de447f5..30baccb6c9c4 100644
> > > --- a/include/uapi/linux/udp.h
> > > +++ b/include/uapi/linux/udp.h
> > > @@ -33,6 +33,7 @@ struct udphdr {
> > >  #define UDP_NO_CHECK6_TX 101   /* Disable sending checksum for UDP6X */
> > >  #define UDP_NO_CHECK6_RX 102   /* Disable accpeting checksum for UDP6 */
> > >  #define UDP_SEGMENT    103     /* Set GSO segmentation size */
> > > +#define UDP_GRO                104     /* This socket can receive UDP GRO packets */
> > >
> > >  /* UDP encapsulation types */
> > >  #define UDP_ENCAP_ESPINUDP_NON_IKE     1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index 9fcb5374e166..3c277378814f 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -115,6 +115,7 @@
> > >  #include "udp_impl.h"
> > >  #include <net/sock_reuseport.h>
> > >  #include <net/addrconf.h>
> > > +#include <net/udp_tunnel.h>
> > >
> > >  struct udp_table udp_table __read_mostly;
> > >  EXPORT_SYMBOL(udp_table);
> > > @@ -2459,6 +2460,12 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
> > >                 up->gso_size = val;
> > >                 break;
> > >
> > > +       case UDP_GRO:
> > > +               if (valbool)
> > > +                       udp_tunnel_encap_enable(sk->sk_socket);
> > > +               up->gro_enabled = valbool;
> >
> > The socket lock is not held here, so multiple updates to
> > up->gro_enabled and the up->encap_enabled and the static branch can
> > race. Syzkaller is adept at generating those.
>
> Good catch. I was fooled by the existing code. I think there
> are potentially similar issues for UDP_ENCAP, UDPLITE_SEND_CSCOV, ...
>
> Since the rx path doesn't take it anymore and we don't risk starving, I
> think we could/should always acquire the socket lock on setsockopt,
> wdyt?

Agreed. We had to add a lot of those in packet_setsockopt for the same reason.
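
For context, this is roughly how user space would flip the option under
discussion. A minimal sketch, assuming the UDP_GRO value 104 from the
quoted uapi hunk and a kernel that carries this series;
enable_udp_gro() is a hypothetical helper name.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef UDP_GRO
#define UDP_GRO 104	/* value taken from the quoted uapi hunk */
#endif

/* Hypothetical helper: enable UDP GRO on a UDP socket.
 * Returns 0 on success or a negative errno value on failure.
 */
static int enable_udp_gro(int fd)
{
	int one = 1;

	if (setsockopt(fd, IPPROTO_UDP, UDP_GRO, &one, sizeof(one)) < 0)
		return -errno;
	return 0;
}
```

Two threads calling this concurrently on the same socket is the kind of
sequence that would exercise the race described above if the socket lock
is not taken in the setsockopt path.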


* [PATCH RFC net-next 2/3] tcp: BPF_TCP_INFO_NOTIFY support
From: Sowmini Varadhan @ 2018-10-22 15:23 UTC (permalink / raw)
  To: sowmini.varadhan, netdev; +Cc: edumazet, brakmo
In-Reply-To: <cover.1540220847.git.sowmini.varadhan@oracle.com>

We want to be able to set up the monitoring application so that it can
be asynchronously notified when "interesting" events happen, e.g., when
application-determined thresholds on parameters like RTT estimate, number
of retransmissions, RTO are reached.

The bpf_sock_ops infrastructure provided as part of Commit 40304b2a1567
("bpf: BPF support for sock_ops") provides an elegant way to trigger
this asynchronous notification. The BPF program can examine the
current TCP state reported in the bpf_sock_ops and conditionally
return a (new) status BPF_TCP_INFO_NOTIFY. The return status is used
by the caller to queue up a tcp_info notification for the application.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/net/tcp.h        |   15 +++++++++++++--
 include/uapi/linux/bpf.h |    4 ++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0d29292..df06a9f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -47,6 +47,7 @@
 #include <linux/seq_file.h>
 #include <linux/memcontrol.h>
 #include <linux/bpf-cgroup.h>
+#include <linux/sock_diag.h>
 
 extern struct inet_hashinfo tcp_hashinfo;
 
@@ -2065,6 +2066,12 @@ struct tcp_ulp_ops {
 	__MODULE_INFO(alias, alias_userspace, name);		\
 	__MODULE_INFO(alias, alias_tcp_ulp, "tcp-ulp-" name)
 
+#define	TCPDIAG_CB(sk)							\
+do {									\
+	if (unlikely(sk->sk_net_refcnt && sock_diag_has_listeners(sk)))	\
+		sock_diag_broadcast(sk);				\
+} while (0)
+
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
  * program does not support the chosen operation or there is no BPF
@@ -2088,9 +2095,13 @@ static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 		memcpy(sock_ops.args, args, nargs * sizeof(*args));
 
 	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
-	if (ret == 0)
+	if (ret == 0) {
 		ret = sock_ops.reply;
-	else
+
+		/* XXX would be nice if we could use replylong[1] here */
+		if (ret == BPF_TCP_INFO_NOTIFY)
+			TCPDIAG_CB(sk);
+	} else
 		ret = -1;
 	return ret;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aa5ccd2..bc45e5e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2678,6 +2678,10 @@ enum {
 	BPF_TCP_MAX_STATES	/* Leave at the end! */
 };
 
+enum {
+	BPF_TCP_INFO_NOTIFY = 2
+};
+
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
 #define TCP_BPF_SNDCWND_CLAMP	1002	/* Set sndcwnd_clamp */
 
-- 
1.7.1


* [PATCH RFC net-next 1/3] sock_diag: Refactor inet_sock_diag_destroy code
From: Sowmini Varadhan @ 2018-10-22 15:23 UTC (permalink / raw)
  To: sowmini.varadhan, netdev; +Cc: edumazet, brakmo
In-Reply-To: <cover.1540220847.git.sowmini.varadhan@oracle.com>

We want to use the inet_sock_diag_destroy code to send notifications
for more types of TCP events than just socket_close(), so refactor
the code to allow this.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/sock_diag.h      |   18 +++++++++++++-----
 include/uapi/linux/sock_diag.h |    2 ++
 net/core/sock.c                |    4 ++--
 net/core/sock_diag.c           |   11 ++++++-----
 4 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h
index 15fe980..df85767 100644
--- a/include/linux/sock_diag.h
+++ b/include/linux/sock_diag.h
@@ -34,7 +34,7 @@ int sock_diag_put_filterinfo(bool may_report_filterinfo, struct sock *sk,
 			     struct sk_buff *skb, int attrtype);
 
 static inline
-enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk)
+enum sknetlink_groups sock_diag_group(const struct sock *sk)
 {
 	switch (sk->sk_family) {
 	case AF_INET:
@@ -43,7 +43,15 @@ enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk)
 
 		switch (sk->sk_protocol) {
 		case IPPROTO_TCP:
-			return SKNLGRP_INET_TCP_DESTROY;
+			switch (sk->sk_state) {
+			case TCP_ESTABLISHED:
+				return SKNLGRP_INET_TCP_CONNECTED;
+			case TCP_SYN_SENT:
+			case TCP_SYN_RECV:
+				return SKNLGRP_INET_TCP_3WH;
+			default:
+				return SKNLGRP_INET_TCP_DESTROY;
+			}
 		case IPPROTO_UDP:
 			return SKNLGRP_INET_UDP_DESTROY;
 		default:
@@ -67,15 +75,15 @@ enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk)
 }
 
 static inline
-bool sock_diag_has_destroy_listeners(const struct sock *sk)
+bool sock_diag_has_listeners(const struct sock *sk)
 {
 	const struct net *n = sock_net(sk);
-	const enum sknetlink_groups group = sock_diag_destroy_group(sk);
+	const enum sknetlink_groups group = sock_diag_group(sk);
 
 	return group != SKNLGRP_NONE && n->diag_nlsk &&
 		netlink_has_listeners(n->diag_nlsk, group);
 }
-void sock_diag_broadcast_destroy(struct sock *sk);
+void sock_diag_broadcast(struct sock *sk);
 
 int sock_diag_destroy(struct sock *sk, int err);
 #endif
diff --git a/include/uapi/linux/sock_diag.h b/include/uapi/linux/sock_diag.h
index e592500..4252674 100644
--- a/include/uapi/linux/sock_diag.h
+++ b/include/uapi/linux/sock_diag.h
@@ -32,6 +32,8 @@ enum sknetlink_groups {
 	SKNLGRP_INET_UDP_DESTROY,
 	SKNLGRP_INET6_TCP_DESTROY,
 	SKNLGRP_INET6_UDP_DESTROY,
+	SKNLGRP_INET_TCP_3WH,
+	SKNLGRP_INET_TCP_CONNECTED,
 	__SKNLGRP_MAX,
 };
 #define SKNLGRP_MAX	(__SKNLGRP_MAX - 1)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7e8796a..6684840 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1600,8 +1600,8 @@ static void __sk_free(struct sock *sk)
 	if (likely(sk->sk_net_refcnt))
 		sock_inuse_add(sock_net(sk), -1);
 
-	if (unlikely(sk->sk_net_refcnt && sock_diag_has_destroy_listeners(sk)))
-		sock_diag_broadcast_destroy(sk);
+	if (unlikely(sk->sk_net_refcnt && sock_diag_has_listeners(sk)))
+		sock_diag_broadcast(sk);
 	else
 		sk_destruct(sk);
 }
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 3312a58..dbd9e65 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -116,14 +116,14 @@ static size_t sock_diag_nlmsg_size(void)
 	       + nla_total_size_64bit(sizeof(struct tcp_info))); /* INET_DIAG_INFO */
 }
 
-static void sock_diag_broadcast_destroy_work(struct work_struct *work)
+static void sock_diag_broadcast_work(struct work_struct *work)
 {
 	struct broadcast_sk *bsk =
 		container_of(work, struct broadcast_sk, work);
 	struct sock *sk = bsk->sk;
 	const struct sock_diag_handler *hndl;
 	struct sk_buff *skb;
-	const enum sknetlink_groups group = sock_diag_destroy_group(sk);
+	const enum sknetlink_groups group = sock_diag_group(sk);
 	int err = -1;
 
 	WARN_ON(group == SKNLGRP_NONE);
@@ -144,11 +144,12 @@ static void sock_diag_broadcast_destroy_work(struct work_struct *work)
 	else
 		kfree_skb(skb);
 out:
-	sk_destruct(sk);
+	if (group <= SKNLGRP_INET6_UDP_DESTROY)
+		sk_destruct(sk);
 	kfree(bsk);
 }
 
-void sock_diag_broadcast_destroy(struct sock *sk)
+void sock_diag_broadcast(struct sock *sk)
 {
 	/* Note, this function is often called from an interrupt context. */
 	struct broadcast_sk *bsk =
@@ -156,7 +157,7 @@ void sock_diag_broadcast_destroy(struct sock *sk)
 	if (!bsk)
 		return sk_destruct(sk);
 	bsk->sk = sk;
-	INIT_WORK(&bsk->work, sock_diag_broadcast_destroy_work);
+	INIT_WORK(&bsk->work, sock_diag_broadcast_work);
 	queue_work(broadcast_wq, &bsk->work);
 }
 
-- 
1.7.1
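
The state-to-group mapping that this patch adds to sock_diag_group() can
be summarized with a small user-space sketch. The mock_ enums below are
hypothetical stand-ins for the kernel's TCP states and SKNLGRP_ values;
only the shape of the switch mirrors the patch.

```c
/* Hypothetical stand-ins for the kernel enums. */
enum mock_tcp_state {
	MOCK_TCP_ESTABLISHED,
	MOCK_TCP_SYN_SENT,
	MOCK_TCP_SYN_RECV,
	MOCK_TCP_CLOSE,
};

enum mock_group {
	MOCK_GRP_TCP_DESTROY,
	MOCK_GRP_TCP_3WH,
	MOCK_GRP_TCP_CONNECTED,
};

/* Established sockets map to the "connected" group, sockets still in the
 * three-way handshake map to the "3WH" group, everything else keeps the
 * existing "destroy" group.
 */
static enum mock_group mock_tcp_group(enum mock_tcp_state state)
{
	switch (state) {
	case MOCK_TCP_ESTABLISHED:
		return MOCK_GRP_TCP_CONNECTED;
	case MOCK_TCP_SYN_SENT:
	case MOCK_TCP_SYN_RECV:
		return MOCK_GRP_TCP_3WH;
	default:
		return MOCK_GRP_TCP_DESTROY;
	}
}
```

Because the group is now derived from the socket state at the time of the
call, the same socket can report to different groups over its lifetime.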


* [PATCH RFC net-next 3/3] bpf: Added a sample for tcp_info_notify callback
From: Sowmini Varadhan @ 2018-10-22 15:24 UTC (permalink / raw)
  To: sowmini.varadhan, netdev; +Cc: edumazet, brakmo
In-Reply-To: <cover.1540220847.git.sowmini.varadhan@oracle.com>

Simple Proof-Of-Concept test program for BPF_TCP_INFO_NOTIFY
(will move this to testing/selftests/net later)

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 samples/bpf/Makefile          |    1 +
 samples/bpf/tcp_notify_kern.c |   73 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+), 0 deletions(-)
 create mode 100644 samples/bpf/tcp_notify_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index be0a961..d937bd2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -152,6 +152,7 @@ always += tcp_bufs_kern.o
 always += tcp_cong_kern.o
 always += tcp_iw_kern.o
 always += tcp_clamp_kern.o
+always += tcp_notify_kern.o
 always += tcp_basertt_kern.o
 always += tcp_tos_reflect_kern.o
 always += xdp_redirect_kern.o
diff --git a/samples/bpf/tcp_notify_kern.c b/samples/bpf/tcp_notify_kern.c
new file mode 100644
index 0000000..bc4efd8
--- /dev/null
+++ b/samples/bpf/tcp_notify_kern.c
@@ -0,0 +1,73 @@
+/* Sample BPF program to demonstrate how to trigger async tcp_info
+ * notification based on thresholds determined in the filter.
+ * The example here will trigger notification if skops->total_retrans >= 16
+ *
+ * Use load_sock_ops to load this BPF program.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define DEBUG 0
+
+#define bpf_printk(fmt, ...)					\
+({								\
+	       char ____fmt[] = fmt;				\
+	       bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				##__VA_ARGS__);			\
+})
+
+SEC("sockops")
+int bpf_tcp_info_notify(struct bpf_sock_ops *skops)
+{
+	int bufsize = 150000;
+	int to_init = 10;
+	int clamp = 100;
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if either port number is 5001 (default iperf port)
+	 */
+	if (bpf_ntohl(skops->remote_port) != 5001 &&
+	    skops->local_port != 5001) {
+		skops->reply = -1;
+		return 0;
+	}
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_printk("BPF command: %d\n", op);
+#endif
+
+	switch (op) {
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		bpf_sock_ops_cb_flags_set(skops,
+			(BPF_SOCK_OPS_RETRANS_CB_FLAG|
+			BPF_SOCK_OPS_RTO_CB_FLAG));
+		rv = 1;
+		break;
+	case BPF_SOCK_OPS_RETRANS_CB:
+	case BPF_SOCK_OPS_RTO_CB:
+		if (skops->total_retrans < 16)
+			rv = 1; /* skip */
+		else
+			rv = BPF_TCP_INFO_NOTIFY;
+		break;
+	default:
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_printk("Returning %d\n", rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
1.7.1


* [PATCH RFC net-next 0/3] Extensions to allow asynchronous TCP_INFO notifications based on congestion parameters
From: Sowmini Varadhan @ 2018-10-22 15:23 UTC (permalink / raw)
  To: sowmini.varadhan, netdev; +Cc: edumazet, brakmo

Problem statement:
  We would like to monitor some subset of TCP sockets in user-space,
  (the monitoring application would define 4-tuples it wants to monitor)
  using TCP_INFO stats to analyze reported problems. The idea is to
  use those stats to see where the bottlenecks are likely to be ("is it
  application-limited?" or "is there evidence of BufferBloat in the
  path?" etc)

  Today we can do this by periodically polling for tcp_info, but this
  could be made more efficient if the kernel would asynchronously
  notify the application via tcp_info when some "interesting"
  thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
  are reached. And to make this effective, it is better if
  we could apply the threshold check *before* constructing the
  tcp_info netlink notification, so that we don't waste resources
  constructing notifications that will be discarded by the filter.

One idea, implemented in this patchset, is to extend the tcp_call_bpf()
infra so that the BPF kernel module (the sock_ops filter/callback)
can examine the values in the sock_ops, apply any thresholds it wants,
and return some new status ("BPF_TCP_INFO_NOTIFY"). Use this status in
the tcp stack to queue up a tcp_info notification (similar to
sock_diag_broadcast_destroy() today..)

Patch 1 in this set refactors the existing sock_diag code so that
the functions can be reused for notifications from states other than CLOSE.

Patch 2 provides a minor extension to tcp_call_bpf() so that it
will queue a tcp_info notification if the BPF callout returns
BPF_TCP_INFO_NOTIFY.

Patch 3, provided strictly as a demonstration/PoC to aid in reviewing
this proposal, shows a simple sample/bpf example where we trigger the
tcp_info notification for an iperf connection once the number of
retransmits reaches 16.
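
The threshold check described above can be modelled in plain user-space C.
This is only a sketch of the decision the sample filter makes: the
sketch_* names and the numeric reply values are hypothetical stand-ins,
and the real callback ids and BPF_TCP_INFO_NOTIFY come from the patches
themselves.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the sock_ops callback ids; the real ones
 * live in include/uapi/linux/bpf.h.
 */
enum sketch_op {
	SKETCH_OP_ESTABLISHED,
	SKETCH_OP_RETRANS,
	SKETCH_OP_RTO,
	SKETCH_OP_OTHER,
};

#define SKETCH_REPLY_SKIP	1
#define SKETCH_REPLY_NOTIFY	2	/* placeholder for BPF_TCP_INFO_NOTIFY */
#define SKETCH_RETRANS_LIMIT	16

/* Return the reply the filter would hand back to the TCP stack. */
static int sketch_filter_reply(enum sketch_op op, uint32_t total_retrans)
{
	switch (op) {
	case SKETCH_OP_RETRANS:
	case SKETCH_OP_RTO:
		/* The threshold is applied before any notification is built. */
		if (total_retrans < SKETCH_RETRANS_LIMIT)
			return SKETCH_REPLY_SKIP;
		return SKETCH_REPLY_NOTIFY;
	case SKETCH_OP_ESTABLISHED:
		return SKETCH_REPLY_SKIP;
	default:
		return -1;
	}
}
```

With these assumptions, sketch_filter_reply(SKETCH_OP_RTO, 16) asks for a
notification, while the same call with 15 retransmits does not.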

Sowmini Varadhan (3):
  sock_diag: Refactor inet_sock_diag_destroy code
  tcp: BPF_TCP_INFO_NOTIFY support
  bpf: Added a sample for tcp_info_notify callback

 include/linux/sock_diag.h      |   18 +++++++---
 include/net/tcp.h              |   15 +++++++-
 include/uapi/linux/bpf.h       |    4 ++
 include/uapi/linux/sock_diag.h |    2 +
 net/core/sock.c                |    4 +-
 net/core/sock_diag.c           |   11 +++---
 samples/bpf/Makefile           |    1 +
 samples/bpf/tcp_notify_kern.c  |   73 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 114 insertions(+), 14 deletions(-)
 create mode 100644 samples/bpf/tcp_notify_kern.c

^ permalink raw reply

* [PATCH bpf] bpf: devmap: fix wrong interface selection in notifier_call
From: Taehee Yoo @ 2018-10-22 15:41 UTC (permalink / raw)
  To: daniel, ast; +Cc: netdev, ap420073

dev_map_notification() removes an interface from the devmap when the
unregistering interface's ifindex matches.
Checking only the ifindex is not enough, because an interface in another
netns can have the same ifindex, so the wrong interface could be selected.
Hence a net_eq() check is needed.

Fixes: 2ddf71e23cc2 ("net: add notifier hooks for devmap bpf map")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 kernel/bpf/devmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 141710b82a6c..0d9211e49a4a 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -496,6 +496,7 @@ static int dev_map_notification(struct notifier_block *notifier,
 				ulong event, void *ptr)
 {
 	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
+	struct net *net = dev_net(netdev);
 	struct bpf_dtab *dtab;
 	int i;
 
@@ -512,7 +513,7 @@ static int dev_map_notification(struct notifier_block *notifier,
 				struct bpf_dtab_netdev *dev, *odev;
 
 				dev = READ_ONCE(dtab->netdev_map[i]);
-				if (!dev ||
+				if (!dev || !net_eq(net, dev_net(dev->dev)) ||
 				    dev->dev->ifindex != netdev->ifindex)
 					continue;
 				odev = cmpxchg(&dtab->netdev_map[i], dev, NULL);
-- 
2.17.1
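
To illustrate why the ifindex alone is ambiguous, here is a small
user-space model of the fixed comparison. The sketch_* types are
hypothetical stand-ins for struct net and struct net_device, and the
pointer comparison stands in for net_eq(); this is not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net. */
struct sketch_net {
	int id;
};

/* Hypothetical stand-in for struct net_device. */
struct sketch_dev {
	const struct sketch_net *net;	/* owning namespace */
	int ifindex;			/* unique only within that namespace */
};

/* Match on namespace first, then on ifindex, mirroring the fixed check. */
static bool sketch_same_dev(const struct sketch_dev *entry,
			    const struct sketch_dev *gone)
{
	if (!entry)
		return false;
	return entry->net == gone->net && entry->ifindex == gone->ifindex;
}
```

Two devices with ifindex 2 in different namespaces compare as different
here, whereas an ifindex-only comparison would treat them as the same.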

^ permalink raw reply related

* Re: [RFC PATCH v2 03/10] udp: add support for UDP_GRO cmsg
From: Paolo Abeni @ 2018-10-22 15:44 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Network Development, Willem de Bruijn, steffen.klassert
In-Reply-To: <CAF=yD-+JOJJX+vOGuV3z3aYNEdjNQ=U0+OaZdeuVvOTrRa-oRA@mail.gmail.com>

On Sun, 2018-10-21 at 16:07 -0400, Willem de Bruijn wrote:
> On Fri, Oct 19, 2018 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
> > datagram size. User-space can use such info to compute the original
> > packets layout.
> > 
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > CHECK: should we use a separate setsockopt to explicitly enable
> > gso_size cmsg reception? So that user space can enable UDP_GRO and
> > fetch cmsg without forcefully receiving GRO related info.
> 
> A user can avoid the message by not passing control data. Though in
> most practical cases it seems unsafe to do so, anyway, as the path MTU
> can be lower than the expected device MTU.
> 
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 3c277378814f..2331ac9de954 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1714,6 +1714,10 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
> >                 memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
> >                 *addr_len = sizeof(*sin);
> >         }
> > +
> > +       if (udp_sk(sk)->gro_enabled)
> > +               udp_cmsg_recv(msg, sk, skb);
> > +
> 
> Perhaps we can avoid adding a branch by setting a bit in
> inet->cmsg_flags for gso_size to let the below branch handle the cmsg
> processing.

Uhmm... I think that for ipv6 sockets we need to set a bit in rxopt
instead (and we already have some conditionals we could use for ipv6
socket recv cmsg processing).

> I'd still set that as part of the UDP_GRO setsockopt. Though if you
> insist it could be a value 2 instead of 1, effectively allowing the
> above mentioned opt-out.

I'm ok with the current impl (no additional value to opt-in the UDP GRO
cmsg).

Cheers,

Paolo
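
For reference, a rough user-space sketch of how a receiver might read the
segment size from the cmsg and work out the original packet layout. It
assumes the cmsg is delivered at level SOL_UDP with type UDP_GRO and an
int payload, and that UDP_GRO is 104 as in later mainline headers; the
RFC under discussion may differ. The sketch_* helpers are hypothetical.

```c
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_GRO
#define UDP_GRO 104	/* assumed value, taken from later mainline headers */
#endif

/* Return the segment size carried by a UDP_GRO cmsg, or 0 if absent. */
static int sketch_gro_size(struct msghdr *msg)
{
	struct cmsghdr *cm;
	int gso_size = 0;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == SOL_UDP && cm->cmsg_type == UDP_GRO) {
			memcpy(&gso_size, CMSG_DATA(cm), sizeof(gso_size));
			break;
		}
	}
	return gso_size;
}

/* Number of original datagrams packed into a buffer of len bytes. */
static int sketch_segment_count(int len, int gso_size)
{
	if (len <= 0)
		return 0;
	if (gso_size <= 0)
		return 1;	/* no GRO info: treat as a single datagram */
	return (len + gso_size - 1) / gso_size;
}
```

All segments except possibly the last are gso_size bytes long; the last
one holds whatever remains.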

^ permalink raw reply

* Re: [PATCH] wireless: mark expected switch fall-throughs
From: Gustavo A. R. Silva @ 2018-10-23  0:07 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1530880164.3197.38.camel-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>



On 7/6/18 2:29 PM, Johannes Berg wrote:
> Hi Gustavo,
> 
>> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
>> where we are expecting to fall through.
> 
> You dropped the remark saying you didn't review them, but did you?
> 

I'll add it in v2.

>>  	case NL80211_CHAN_WIDTH_20:
>>  		if (!ht_cap->ht_supported)
>>  			return false;
>> +		/* else: fall through */
> 
> What's the point in else:?
> 
> We also don't necessarily write
> 
> if (!...)
>   return false;
> else
>   do_something();
> 
> but rather
> 
> if (!...)
>   return false;
> do_something().
> 
> I think I'd prefer without the "else:"
> 

Sure thing. I'll change this in v2.

I'll send v2 shortly.

Thanks for the feedback.
--
Gustavo
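
A minimal stand-alone illustration of the annotation style agreed on
above, with the plain "fall through" comment and no "else:". The enum and
function are hypothetical and only loosely modelled on the quoted hunk.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the nl80211 channel widths. */
enum sketch_width {
	SKETCH_WIDTH_20_NOHT,
	SKETCH_WIDTH_20,
	SKETCH_WIDTH_OTHER,
};

/* Return the width in MHz, or -1 if the width is not usable. */
static int sketch_width_mhz(enum sketch_width w, bool ht_supported)
{
	switch (w) {
	case SKETCH_WIDTH_20:
		if (!ht_supported)
			return -1;
		/* fall through */
	case SKETCH_WIDTH_20_NOHT:
		return 20;
	default:
		return -1;
	}
}
```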

^ permalink raw reply

* Re: [PATCH net-next 3/4] net: phy-c45: Implement reset/suspend/resume callbacks
From: Russell King - ARM Linux @ 2018-10-22 15:48 UTC (permalink / raw)
  To: Jose Abreu
  Cc: Andrew Lunn, netdev, Florian Fainelli, David S. Miller,
	Joao Pinto
In-Reply-To: <8114f345-0efb-9a64-b867-2cfe2fba09ab@synopsys.com>

On Mon, Oct 22, 2018 at 01:47:48PM +0100, Jose Abreu wrote:
> Hello,
> 
> On 22-10-2018 13:28, Andrew Lunn wrote:
> >>  EXPORT_SYMBOL_GPL(gen10g_resume);
> >> @@ -327,7 +381,7 @@ struct phy_driver genphy_10g_driver = {
> >>  	.phy_id         = 0xffffffff,
> >>  	.phy_id_mask    = 0xffffffff,
> >>  	.name           = "Generic 10G PHY",
> >> -	.soft_reset	= gen10g_no_soft_reset,
> >> +	.soft_reset	= gen10g_soft_reset,
> >>  	.config_init    = gen10g_config_init,
> >>  	.features       = 0,
> >>  	.aneg_done	= genphy_c45_aneg_done,
> > Hi Jose
> >
> > You need to be careful here. There is a reason this is called
> > gen10g_no_soft_reset, rather than having an empty
> > gen10g_soft_reset. Some PHYs break when you do a reset.  So adding a
> > gen10g_soft_reset is fine, but don't change this here, without first
> > understanding the history, and talking to Russell King.
> 
> Hmm, the reset function only interacts with standard PCS
> registers, which should always be available ...
> 
> From my tests I need to do at least 1 reset during power-up so in
> ultimate case I can add a feature quirk or similar.
> 
> Russell, can you please comment ?

Setting the reset bit on 88x3310 causes the entire device to become
completely inaccessible until hardware reset.  Therefore, this bit
must _never_ be set for these devices.  That said, we have a separate
driver for these PHYs, but that will only be used for them if it's
present in the kernel.  If we accidentally fall back to the generic
driver, then we'll screw the 88x3310 until a full hardware reset.

We also have a bunch of net devices that make use of this crippled
"generic" 10G support - we don't know whether resetting the PHY
for those systems will cause a regression - maybe board firmware
already configured the PHY?  I can't say either way on that, except
that we've had crippled 10G support in PHYLIB for a number of years
now _with_ users, and adding reset support drastically changes the
subsystem's behaviour for these users.

I would recommend not touching the generic 10G driver, but instead
implement your own driver for your PHY to avoid causing regressions.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply

* Re: [RFC PATCH v2 02/10] udp: implement GRO for plain UDP sockets.
From: Willem de Bruijn @ 2018-10-22 15:51 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: steffen.klassert, Network Development, Willem de Bruijn
In-Reply-To: <b69077c93ec047845d4b22a57fa6f89b63c0639c.camel@redhat.com>

> >
> > > +static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
> > > +                                          struct sk_buff *skb)
> > > +{
> > > +   struct udphdr *uh = udp_hdr(skb);
> > > +   struct sk_buff *pp = NULL;
> > > +   struct udphdr *uh2;
> > > +   struct sk_buff *p;
> > > +
> > > +   /* requires non zero csum, for symmetry with GSO */
> > > +   if (!uh->check) {
> > > +           NAPI_GRO_CB(skb)->flush = 1;
> > > +           return NULL;
> > > +   }
> >
> > Why is the requirement of checksums different than in
> > udp_gro_receive? It's not that I care much about UDP
> > packets without a checksum, but you would not need
> > to implement your own loop if the requirement could
> > be the same as in udp_gro_receive.

It would be nice if we could deduplicate the loops, but even without
the checksum difference they look to me a bit too different for it to be
practical, also with the constraints on segment length and max aggregation.

> uhm....
> AFAIU, we need to generate aggregated packets that UDP GSO is able to
> process/segment. I was unable to get a nocsum packet segment (possibly
> PEBKAC) so I enforced that condition on the rx path.
>
> @Willem: did I see ghost here? is UDP_SEGMENT fine with no checksum
> segment?

udp_send_skb() fails with EIO if ip_summed is anything but CHECKSUM_PARTIAL.

But that's not in the forwarding path. Still, __udp_gso_segment as is
depends on that invariant and will not handle packets with zero
checksum correctly. It unconditionally adjusts uh->check. That could
be changed, of course.
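
As a rough illustration of what a guarded adjustment could look like,
here is a user-space sketch of an incremental checksum update (RFC 1624,
eqn. 3) that leaves a zero UDP checksum alone. It is not the
__udp_gso_segment code, which adjusts a partial checksum; the sketch_*
helpers are hypothetical and operate on a full checksum in host byte
order.

```c
#include <stdint.h>

/* Fold a 32-bit sum into 16 bits with end-around carry. */
static uint16_t sketch_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Update a UDP checksum after one 16-bit field changes from old_val to
 * new_val. A zero checksum means "no checksum" for UDP over IPv4 and is
 * left untouched; a computed zero is sent as 0xffff.
 */
static uint16_t sketch_udp_csum_replace(uint16_t check, uint16_t old_val,
					uint16_t new_val)
{
	uint32_t sum;

	if (!check)
		return 0;

	sum = (uint16_t)~check;
	sum += (uint16_t)~old_val;
	sum += new_val;
	check = (uint16_t)~sketch_fold(sum);
	return check ? check : 0xffff;
}
```

For example, a checksum of 0xedbb covering the words 0x0010 and 0x1234
becomes 0xedab when 0x0010 is replaced by 0x0020, while a zero checksum
stays zero.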

^ permalink raw reply
