Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] fib_rules: rename FRA_PROTOCOL to FRA_ORIGINATOR
From: Nicolas Dichtel @ 2018-03-21 15:51 UTC (permalink / raw)
  To: David Miller, dsahern; +Cc: netdev, sharpd
In-Reply-To: <20180321.111552.1322071853274264739.davem@davemloft.net>

Le 21/03/2018 à 16:15, David Miller a écrit :
> From: David Ahern <dsahern@gmail.com>
> Date: Wed, 21 Mar 2018 09:00:09 -0600
> 
>> The rule->proto value is not used as a selector. It is passed in, stored
>> on a rule and returned to userspace. It is book keeping only so an admin
>> has some idea of which program installed the rule.
> 
> This is how I understand this value to work as well.
> 
Me too, and it's the reason why I propose to rename it, to make this clear.
'originator' is a lot more explicit than 'proto'.

^ permalink raw reply

* [PATCH net-next 1/1] tc-testing: Correct compound statements for namespace execution
From: Lucas Bates @ 2018-03-21 15:49 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri, Lucas Bates

If tdc is executing test cases inside a namespace, only the
first command in a compound statement will be executed inside
the namespace by tdc. As a result, the subsequent commands
are not executed inside the namespace and the test will fail.

Example:

for i in {x..y}; do args="foo"; done && tc actions add $args

The namespace execution feature will prepend 'ip netns exec'
to the command:

ip netns exec tcut for i in {x..y}; do args="foo"; done && \
  tc actions add $args

So the actual tc command is not parsed by the shell as being
part of the namespace execution.

Enclosing these compound statements inside a bash invocation
with proper escape characters resolves the problem by creating
a subshell inside the namespace.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
---
 tools/testing/selftests/tc-testing/tc-tests/actions/gact.json | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/gact.json b/tools/testing/selftests/tc-testing/tc-tests/actions/gact.json
index ae96d03..68c9102 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/gact.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/gact.json
@@ -481,7 +481,7 @@
                 255
             ]
         ],
-        "cmdUnderTest": "for i in `seq 1 32`; do cmd=\"action pass index $i \"; args=\"$args$cmd\"; done && $TC actions add $args",
+        "cmdUnderTest": "bash -c \"for i in \\`seq 1 32\\`; do cmd=\\\"action pass index \\$i \\\"; args=\"\\$args\\$cmd\"; done && $TC actions add \\$args\"",
         "expExitCode": "0",
         "verifyCmd": "$TC actions list action gact",
         "matchPattern": "^[ \t]+index [0-9]+ ref",
@@ -505,7 +505,7 @@
                 255
             ]
         ],
-        "cmdUnderTest": "for i in `seq 1 32`; do cmd=\"action continue index $i cookie aabbccddeeff112233445566778800a1 \"; args=\"$args$cmd\"; done && $TC actions add $args",
+        "cmdUnderTest": "bash -c \"for i in \\`seq 1 32\\`; do cmd=\\\"action continue index \\$i cookie aabbccddeeff112233445566778800a1 \\\"; args=\"\\$args\\$cmd\"; done && $TC actions add \\$args\"",
         "expExitCode": "0",
         "verifyCmd": "$TC actions list action gact",
         "matchPattern": "^[ \t]+index [0-9]+ ref",
@@ -528,13 +528,13 @@
                 1,
                 255
             ],
-            "for i in `seq 1 32`; do cmd=\"action continue index $i \"; args=\"$args$cmd\"; done && $TC actions add $args"
+            "bash -c \"for i in \\`seq 1 32\\`; do cmd=\\\"action continue index \\$i \\\"; args=\\\"\\$args\\$cmd\\\"; done && $TC actions add \\$args\""
         ],
-        "cmdUnderTest": "for i in `seq 1 32`; do cmd=\"action gact index $i \"; args=\"$args$cmd\"; done && $TC actions del $args",
+        "cmdUnderTest": "bash -c \"for i in \\`seq 1 32\\`; do cmd=\\\"action gact index \\$i \\\"; args=\"\\$args\\$cmd\"; done && $TC actions del \\$args\"",
         "expExitCode": "0",
         "verifyCmd": "$TC actions list action gact",
         "matchPattern": "^[ \t]+index [0-9]+ ref",
         "matchCount": "0",
         "teardown": []
     }
-]
\ No newline at end of file
+]
--
2.7.4

^ permalink raw reply related

* Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Andy Lutomirski @ 2018-03-21 15:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Thomas Gleixner, David Laight, Rahul Lakkireddy,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, mingo@redhat.com, hpa@zytor.com,
	davem@davemloft.net, akpm@linux-foundation.org,
	ganeshgr@chelsio.com, nirranjan@chelsio.com, indranil@chelsio.com,
	Andy Lutomirski, Peter Zijlstra, Fenghua Yu, Eric 
In-Reply-To: <20180321063256.bdqcpvgb3auxzwzk@gmail.com>

On Wed, Mar 21, 2018 at 6:32 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> And even if you ignore that "maintenance problems down the line" issue
>> ("we can fix them when they happen") I don't want to see games like
>> this, because I'm pretty sure it breaks the optimized xsave by tagging
>> the state as being dirty.
>
> That's true - and it would penalize the context switch cost of the affected task
> for the rest of its lifetime, as I don't think there's much that clears XINUSE
> other than a FINIT, which is rarely done by user-space.
>
>> So no. Don't use vector stuff in the kernel. It's not worth the pain.
>
> I agree, but:
>
>> The *only* valid use is pretty much crypto, and even there it has had issues.
>> Benchmarks use big arrays and/or dense working sets etc to "prove" how good the
>> vector version is, and then you end up in situations where it's used once per
>> fairly small packet for an interrupt, and it's actually much worse than doing it
>> by hand.
>
> That's mainly because the XSAVE/XRESTOR done by kernel_fpu_begin()/end() is so
> expensive, so this argument is somewhat circular.

If we do the deferred restore, then the XSAVE/XRSTOR happens at most
once per kernel entry, which isn't so bad IMO.  Also, with PTI, kernel
entries are already so slow that this will be mostly in the noise :(

--Andy

^ permalink raw reply

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
From: Boris Pismenny @ 2018-03-21 15:38 UTC (permalink / raw)
  To: Dave Watson, Saeed Mahameed
  Cc: David S. Miller, netdev, Ilya Lesokhin, Aviad Yehezkel
In-Reply-To: <20180321150822.GA92320@davejwatson-mba.local>



On 3/21/2018 5:08 PM, Dave Watson wrote:
> On 03/19/18 07:45 PM, Saeed Mahameed wrote:
>> +#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
>> +	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
>> +	 TLS_DRIVER_STATE_SIZE)
>> +
>> +	pfrag = sk_page_frag(sk);
>> +
>> +	/* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and
> 
> I think the define is actually TLS_HEADER_SIZE, no KTLS_ prefix
> 

Fixed. Thanks.

>> +	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
>> +
>> +	ctx->rec_seq_size = rec_seq_size;
>> +	/* worst case is:
>> +	 * MAX_SKB_FRAGS in tls_record_info
>> +	 * MAX_SKB_FRAGS + 1 in SKB head an frags.
> 
> spelling
> 

Fixed. Thanks.

>> +int tls_sw_fallback_init(struct sock *sk,
>> +			 struct tls_offload_context *offload_ctx,
>> +			 struct tls_crypto_info *crypto_info)
>> +{
>> +	int rc;
>> +	const u8 *key;
>> +
>> +	offload_ctx->aead_send =
>> +	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
> 
> in tls_sw we went with async + crypto_wait_req, any reason to not do
> that here?  Otherwise I think you still get the software gcm on x86
> instead of aesni without additional changes.
> 

Yes, synchronous crypto code runs to handle a software fallback in 
validate_xmit_skb, where waiting is not possible. I know Steffen 
recently added support for calling async crypto from validate_xmit_skb, 
but it wasn't available when we were writing these patches.

I think we could implemented async support in the future based on the 
infrastructure introduced by Steffen.

>> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
>> index d824d548447e..e0dface33017 100644
>> --- a/net/tls/tls_main.c
>> +++ b/net/tls/tls_main.c
>> @@ -54,6 +54,9 @@ enum {
>>   enum {
>>   	TLS_BASE_TX,
>>   	TLS_SW_TX,
>> +#ifdef CONFIG_TLS_DEVICE
>> +	TLS_HW_TX,
>> +#endif
>>   	TLS_NUM_CONFIG,
>>   };
> 
> I have posted SW_RX patches, do you forsee any issues with SW_RX + HW_TX?
> 

No, but I haven't tested these patches with the SW_RX patches.
I'll try to rebase your V2 SW_RX patches over this series tomorrow and 
run some tests.

> Thanks
> 

^ permalink raw reply

* Re: [PATCH net-next v4 2/2] net: bpf: add a test for skb_segment in test_bpf module
From: Eric Dumazet @ 2018-03-21 15:26 UTC (permalink / raw)
  To: Yonghong Song, edumazet, ast, daniel, diptanu, netdev; +Cc: kernel-team
In-Reply-To: <20180321064722.1411857-3-yhs@fb.com>



On 03/20/2018 11:47 PM, Yonghong Song wrote:
> +static __init int test_skb_segment(void)
> +{
> +	netdev_features_t features;
> +	struct sk_buff *skb;
> +	int ret = -1;
> +
> +	features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
> +		   NETIF_F_IPV6_CSUM;
> +	features |= NETIF_F_RXCSUM;
> +	skb = build_test_skb();
> +	if (!skb) {
> +		pr_info("%s: failed to build_test_skb", __func__);
> +		goto done;
> +	}
> +
> +	if (skb_segment(skb, features)) {
> +		ret = 0;
> +		pr_info("%s: success in skb_segment!", __func__);
> +	} else {
> +		pr_info("%s: failed in skb_segment!", __func__);
> +	}
> +	kfree_skb(skb);

If skb_segmen() was successful (original) skb was already freed.

kfree_skb(old_skb) should thus panic the box, if you run this code
on a kernel having some debugging features like KASAN

So you must store in a variable the return of skb_segment(),
to be able to free skb(s), using kfree_skb_list()


> +done:
> +	return ret;
> +}
> +

^ permalink raw reply

* Re: [PATCH net-next] fib_rules: rename FRA_PROTOCOL to FRA_ORIGINATOR
From: David Miller @ 2018-03-21 15:15 UTC (permalink / raw)
  To: dsahern; +Cc: nicolas.dichtel, netdev, sharpd
In-Reply-To: <cb26a8d2-c2a9-6b93-3711-c18888c7c010@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Wed, 21 Mar 2018 09:00:09 -0600

> The rule->proto value is not used as a selector. It is passed in, stored
> on a rule and returned to userspace. It is book keeping only so an admin
> has some idea of which program installed the rule.

This is how I understand this value to work as well.

^ permalink raw reply

* [PATCH net-next v2] net: mvpp2: Don't use dynamic allocs for local variables
From: Maxime Chevallier @ 2018-03-21 15:14 UTC (permalink / raw)
  To: davem
  Cc: Maxime Chevallier, netdev, linux-kernel, Antoine Tenart,
	thomas.petazzoni, gregory.clement, miquel.raynal, nadavh, stefanc,
	ymarkman, mw

Some helper functions that search for given entries in the TCAM filter
on PPv2 controller make use of dynamically alloced temporary variables,
allocated with GFP_KERNEL. These functions can be called in atomic
context, and dynamic alloc is not really needed in these cases anyways.

This commit gets rid of dynamic allocs and use stack allocation in the
following functions, and where they're used :
 - mvpp2_prs_flow_find
 - mvpp2_prs_vlan_find
 - mvpp2_prs_double_vlan_find
 - mvpp2_prs_mac_da_range_find

For all these functions, instead of returning an temporary object
representing the TCAM entry, we simply return the TCAM id that matches
the requested entry.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
---
V2: Remove unnecessary brackets, following Antoine Tenart's review.

 drivers/net/ethernet/marvell/mvpp2.c | 289 +++++++++++++++--------------------
 1 file changed, 127 insertions(+), 162 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index 9bd35f2291d6..28e33e139178 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -1913,16 +1913,11 @@ static void mvpp2_prs_sram_offset_set(struct mvpp2_prs_entry *pe,
 }
 
 /* Find parser flow entry */
-static struct mvpp2_prs_entry *mvpp2_prs_flow_find(struct mvpp2 *priv, int flow)
+static int mvpp2_prs_flow_find(struct mvpp2 *priv, int flow)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-	if (!pe)
-		return NULL;
-	mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_FLOWS);
-
 	/* Go through the all entires with MVPP2_PRS_LU_FLOWS */
 	for (tid = MVPP2_PRS_TCAM_SRAM_SIZE - 1; tid >= 0; tid--) {
 		u8 bits;
@@ -1931,17 +1926,16 @@ static struct mvpp2_prs_entry *mvpp2_prs_flow_find(struct mvpp2 *priv, int flow)
 		    priv->prs_shadow[tid].lu != MVPP2_PRS_LU_FLOWS)
 			continue;
 
-		pe->index = tid;
-		mvpp2_prs_hw_read(priv, pe);
-		bits = mvpp2_prs_sram_ai_get(pe);
+		pe.index = tid;
+		mvpp2_prs_hw_read(priv, &pe);
+		bits = mvpp2_prs_sram_ai_get(&pe);
 
 		/* Sram store classification lookup ID in AI bits [5:0] */
 		if ((bits & MVPP2_PRS_FLOW_ID_MASK) == flow)
-			return pe;
+			return tid;
 	}
-	kfree(pe);
 
-	return NULL;
+	return -ENOENT;
 }
 
 /* Return first free tcam index, seeking from start to end */
@@ -2189,16 +2183,12 @@ static void mvpp2_prs_dsa_tag_ethertype_set(struct mvpp2 *priv, int port,
 }
 
 /* Search for existing single/triple vlan entry */
-static struct mvpp2_prs_entry *mvpp2_prs_vlan_find(struct mvpp2 *priv,
-						   unsigned short tpid, int ai)
+static int mvpp2_prs_vlan_find(struct mvpp2 *priv, unsigned short tpid, int ai)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-	if (!pe)
-		return NULL;
-	mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_VLAN);
+	memset(&pe, 0, sizeof(pe));
 
 	/* Go through the all entries with MVPP2_PRS_LU_VLAN */
 	for (tid = MVPP2_PE_FIRST_FREE_TID;
@@ -2210,19 +2200,19 @@ static struct mvpp2_prs_entry *mvpp2_prs_vlan_find(struct mvpp2 *priv,
 		    priv->prs_shadow[tid].lu != MVPP2_PRS_LU_VLAN)
 			continue;
 
-		pe->index = tid;
+		pe.index = tid;
 
-		mvpp2_prs_hw_read(priv, pe);
-		match = mvpp2_prs_tcam_data_cmp(pe, 0, swab16(tpid));
+		mvpp2_prs_hw_read(priv, &pe);
+		match = mvpp2_prs_tcam_data_cmp(&pe, 0, swab16(tpid));
 		if (!match)
 			continue;
 
 		/* Get vlan type */
-		ri_bits = mvpp2_prs_sram_ri_get(pe);
+		ri_bits = mvpp2_prs_sram_ri_get(&pe);
 		ri_bits &= MVPP2_PRS_RI_VLAN_MASK;
 
 		/* Get current ai value from tcam */
-		ai_bits = mvpp2_prs_tcam_ai_get(pe);
+		ai_bits = mvpp2_prs_tcam_ai_get(&pe);
 		/* Clear double vlan bit */
 		ai_bits &= ~MVPP2_PRS_DBL_VLAN_AI_BIT;
 
@@ -2231,33 +2221,30 @@ static struct mvpp2_prs_entry *mvpp2_prs_vlan_find(struct mvpp2 *priv,
 
 		if (ri_bits == MVPP2_PRS_RI_VLAN_SINGLE ||
 		    ri_bits == MVPP2_PRS_RI_VLAN_TRIPLE)
-			return pe;
+			return tid;
 	}
-	kfree(pe);
 
-	return NULL;
+	return -ENOENT;
 }
 
 /* Add/update single/triple vlan entry */
 static int mvpp2_prs_vlan_add(struct mvpp2 *priv, unsigned short tpid, int ai,
 			      unsigned int port_map)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid_aux, tid;
 	int ret = 0;
 
-	pe = mvpp2_prs_vlan_find(priv, tpid, ai);
+	tid = mvpp2_prs_vlan_find(priv, tpid, ai);
 
-	if (!pe) {
+	if (tid < 0) {
 		/* Create new tcam entry */
 		tid = mvpp2_prs_tcam_first_free(priv, MVPP2_PE_LAST_FREE_TID,
 						MVPP2_PE_FIRST_FREE_TID);
 		if (tid < 0)
 			return tid;
 
-		pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-		if (!pe)
-			return -ENOMEM;
+		memset(&pe, 0, sizeof(pe));
 
 		/* Get last double vlan tid */
 		for (tid_aux = MVPP2_PE_LAST_FREE_TID;
@@ -2268,49 +2255,47 @@ static int mvpp2_prs_vlan_add(struct mvpp2 *priv, unsigned short tpid, int ai,
 			    priv->prs_shadow[tid_aux].lu != MVPP2_PRS_LU_VLAN)
 				continue;
 
-			pe->index = tid_aux;
-			mvpp2_prs_hw_read(priv, pe);
-			ri_bits = mvpp2_prs_sram_ri_get(pe);
+			pe.index = tid_aux;
+			mvpp2_prs_hw_read(priv, &pe);
+			ri_bits = mvpp2_prs_sram_ri_get(&pe);
 			if ((ri_bits & MVPP2_PRS_RI_VLAN_MASK) ==
 			    MVPP2_PRS_RI_VLAN_DOUBLE)
 				break;
 		}
 
-		if (tid <= tid_aux) {
-			ret = -EINVAL;
-			goto free_pe;
-		}
+		if (tid <= tid_aux)
+			return -EINVAL;
 
-		memset(pe, 0, sizeof(*pe));
-		mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_VLAN);
-		pe->index = tid;
+		memset(&pe, 0, sizeof(pe));
+		mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_VLAN);
+		pe.index = tid;
 
-		mvpp2_prs_match_etype(pe, 0, tpid);
+		mvpp2_prs_match_etype(&pe, 0, tpid);
 
 		/* VLAN tag detected, proceed with VID filtering */
-		mvpp2_prs_sram_next_lu_set(pe, MVPP2_PRS_LU_VID);
+		mvpp2_prs_sram_next_lu_set(&pe, MVPP2_PRS_LU_VID);
 
 		/* Clear all ai bits for next iteration */
-		mvpp2_prs_sram_ai_update(pe, 0, MVPP2_PRS_SRAM_AI_MASK);
+		mvpp2_prs_sram_ai_update(&pe, 0, MVPP2_PRS_SRAM_AI_MASK);
 
 		if (ai == MVPP2_PRS_SINGLE_VLAN_AI) {
-			mvpp2_prs_sram_ri_update(pe, MVPP2_PRS_RI_VLAN_SINGLE,
+			mvpp2_prs_sram_ri_update(&pe, MVPP2_PRS_RI_VLAN_SINGLE,
 						 MVPP2_PRS_RI_VLAN_MASK);
 		} else {
 			ai |= MVPP2_PRS_DBL_VLAN_AI_BIT;
-			mvpp2_prs_sram_ri_update(pe, MVPP2_PRS_RI_VLAN_TRIPLE,
+			mvpp2_prs_sram_ri_update(&pe, MVPP2_PRS_RI_VLAN_TRIPLE,
 						 MVPP2_PRS_RI_VLAN_MASK);
 		}
-		mvpp2_prs_tcam_ai_update(pe, ai, MVPP2_PRS_SRAM_AI_MASK);
+		mvpp2_prs_tcam_ai_update(&pe, ai, MVPP2_PRS_SRAM_AI_MASK);
 
-		mvpp2_prs_shadow_set(priv, pe->index, MVPP2_PRS_LU_VLAN);
+		mvpp2_prs_shadow_set(priv, pe.index, MVPP2_PRS_LU_VLAN);
+	} else {
+		mvpp2_prs_hw_read(priv, &pe);
 	}
 	/* Update ports' mask */
-	mvpp2_prs_tcam_port_map_set(pe, port_map);
+	mvpp2_prs_tcam_port_map_set(&pe, port_map);
 
-	mvpp2_prs_hw_write(priv, pe);
-free_pe:
-	kfree(pe);
+	mvpp2_prs_hw_write(priv, &pe);
 
 	return ret;
 }
@@ -2329,17 +2314,14 @@ static int mvpp2_prs_double_vlan_ai_free_get(struct mvpp2 *priv)
 }
 
 /* Search for existing double vlan entry */
-static struct mvpp2_prs_entry *mvpp2_prs_double_vlan_find(struct mvpp2 *priv,
-							  unsigned short tpid1,
-							  unsigned short tpid2)
+static int mvpp2_prs_double_vlan_find(struct mvpp2 *priv, unsigned short tpid1,
+				      unsigned short tpid2)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-	if (!pe)
-		return NULL;
-	mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_VLAN);
+	memset(&pe, 0, sizeof(pe));
+	mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_VLAN);
 
 	/* Go through the all entries with MVPP2_PRS_LU_VLAN */
 	for (tid = MVPP2_PE_FIRST_FREE_TID;
@@ -2351,22 +2333,21 @@ static struct mvpp2_prs_entry *mvpp2_prs_double_vlan_find(struct mvpp2 *priv,
 		    priv->prs_shadow[tid].lu != MVPP2_PRS_LU_VLAN)
 			continue;
 
-		pe->index = tid;
-		mvpp2_prs_hw_read(priv, pe);
+		pe.index = tid;
+		mvpp2_prs_hw_read(priv, &pe);
 
-		match = mvpp2_prs_tcam_data_cmp(pe, 0, swab16(tpid1))
-			&& mvpp2_prs_tcam_data_cmp(pe, 4, swab16(tpid2));
+		match = mvpp2_prs_tcam_data_cmp(&pe, 0, swab16(tpid1)) &&
+			mvpp2_prs_tcam_data_cmp(&pe, 4, swab16(tpid2));
 
 		if (!match)
 			continue;
 
-		ri_mask = mvpp2_prs_sram_ri_get(pe) & MVPP2_PRS_RI_VLAN_MASK;
+		ri_mask = mvpp2_prs_sram_ri_get(&pe) & MVPP2_PRS_RI_VLAN_MASK;
 		if (ri_mask == MVPP2_PRS_RI_VLAN_DOUBLE)
-			return pe;
+			return tid;
 	}
-	kfree(pe);
 
-	return NULL;
+	return -ENOENT;
 }
 
 /* Add or update double vlan entry */
@@ -2374,28 +2355,24 @@ static int mvpp2_prs_double_vlan_add(struct mvpp2 *priv, unsigned short tpid1,
 				     unsigned short tpid2,
 				     unsigned int port_map)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid_aux, tid, ai, ret = 0;
 
-	pe = mvpp2_prs_double_vlan_find(priv, tpid1, tpid2);
+	tid = mvpp2_prs_double_vlan_find(priv, tpid1, tpid2);
 
-	if (!pe) {
+	if (tid < 0) {
 		/* Create new tcam entry */
 		tid = mvpp2_prs_tcam_first_free(priv, MVPP2_PE_FIRST_FREE_TID,
 				MVPP2_PE_LAST_FREE_TID);
 		if (tid < 0)
 			return tid;
 
-		pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-		if (!pe)
-			return -ENOMEM;
+		memset(&pe, 0, sizeof(pe));
 
 		/* Set ai value for new double vlan entry */
 		ai = mvpp2_prs_double_vlan_ai_free_get(priv);
-		if (ai < 0) {
-			ret = ai;
-			goto free_pe;
-		}
+		if (ai < 0)
+			return ai;
 
 		/* Get first single/triple vlan tid */
 		for (tid_aux = MVPP2_PE_FIRST_FREE_TID;
@@ -2406,46 +2383,45 @@ static int mvpp2_prs_double_vlan_add(struct mvpp2 *priv, unsigned short tpid1,
 			    priv->prs_shadow[tid_aux].lu != MVPP2_PRS_LU_VLAN)
 				continue;
 
-			pe->index = tid_aux;
-			mvpp2_prs_hw_read(priv, pe);
-			ri_bits = mvpp2_prs_sram_ri_get(pe);
+			pe.index = tid_aux;
+			mvpp2_prs_hw_read(priv, &pe);
+			ri_bits = mvpp2_prs_sram_ri_get(&pe);
 			ri_bits &= MVPP2_PRS_RI_VLAN_MASK;
 			if (ri_bits == MVPP2_PRS_RI_VLAN_SINGLE ||
 			    ri_bits == MVPP2_PRS_RI_VLAN_TRIPLE)
 				break;
 		}
 
-		if (tid >= tid_aux) {
-			ret = -ERANGE;
-			goto free_pe;
-		}
+		if (tid >= tid_aux)
+			return -ERANGE;
 
-		memset(pe, 0, sizeof(*pe));
-		mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_VLAN);
-		pe->index = tid;
+		memset(&pe, 0, sizeof(pe));
+		mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_VLAN);
+		pe.index = tid;
 
 		priv->prs_double_vlans[ai] = true;
 
-		mvpp2_prs_match_etype(pe, 0, tpid1);
-		mvpp2_prs_match_etype(pe, 4, tpid2);
+		mvpp2_prs_match_etype(&pe, 0, tpid1);
+		mvpp2_prs_match_etype(&pe, 4, tpid2);
 
-		mvpp2_prs_sram_next_lu_set(pe, MVPP2_PRS_LU_VLAN);
+		mvpp2_prs_sram_next_lu_set(&pe, MVPP2_PRS_LU_VLAN);
 		/* Shift 4 bytes - skip outer vlan tag */
-		mvpp2_prs_sram_shift_set(pe, MVPP2_VLAN_TAG_LEN,
+		mvpp2_prs_sram_shift_set(&pe, MVPP2_VLAN_TAG_LEN,
 					 MVPP2_PRS_SRAM_OP_SEL_SHIFT_ADD);
-		mvpp2_prs_sram_ri_update(pe, MVPP2_PRS_RI_VLAN_DOUBLE,
+		mvpp2_prs_sram_ri_update(&pe, MVPP2_PRS_RI_VLAN_DOUBLE,
 					 MVPP2_PRS_RI_VLAN_MASK);
-		mvpp2_prs_sram_ai_update(pe, ai | MVPP2_PRS_DBL_VLAN_AI_BIT,
+		mvpp2_prs_sram_ai_update(&pe, ai | MVPP2_PRS_DBL_VLAN_AI_BIT,
 					 MVPP2_PRS_SRAM_AI_MASK);
 
-		mvpp2_prs_shadow_set(priv, pe->index, MVPP2_PRS_LU_VLAN);
+		mvpp2_prs_shadow_set(priv, pe.index, MVPP2_PRS_LU_VLAN);
+	} else {
+		mvpp2_prs_hw_read(priv, &pe);
 	}
 
 	/* Update ports' mask */
-	mvpp2_prs_tcam_port_map_set(pe, port_map);
-	mvpp2_prs_hw_write(priv, pe);
-free_pe:
-	kfree(pe);
+	mvpp2_prs_tcam_port_map_set(&pe, port_map);
+	mvpp2_prs_hw_write(priv, &pe);
+
 	return ret;
 }
 
@@ -3528,7 +3504,7 @@ static int mvpp2_prs_vid_range_find(struct mvpp2 *priv, int pmap, u16 vid,
 		return tid;
 	}
 
-	return 0;
+	return -ENOENT;
 }
 
 /* Write parser entry for VID filtering */
@@ -3551,7 +3527,7 @@ static int mvpp2_prs_vid_entry_add(struct mvpp2_port *port, u16 vid)
 		shift = MVPP2_VLAN_TAG_LEN;
 
 	/* No such entry */
-	if (!tid) {
+	if (tid < 0) {
 		memset(&pe, 0, sizeof(pe));
 
 		/* Go through all entries from first to last in vlan range */
@@ -3604,7 +3580,7 @@ static void mvpp2_prs_vid_entry_remove(struct mvpp2_port *port, u16 vid)
 	tid = mvpp2_prs_vid_range_find(priv, (1 << port->id), vid, 0xfff);
 
 	/* No such entry */
-	if (tid)
+	if (tid < 0)
 		return;
 
 	mvpp2_prs_hw_inv(priv, tid);
@@ -3771,18 +3747,13 @@ static bool mvpp2_prs_mac_range_equals(struct mvpp2_prs_entry *pe,
 }
 
 /* Find tcam entry with matched pair <MAC DA, port> */
-static struct mvpp2_prs_entry *
+static int
 mvpp2_prs_mac_da_range_find(struct mvpp2 *priv, int pmap, const u8 *da,
 			    unsigned char *mask, int udf_type)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = kzalloc(sizeof(*pe), GFP_ATOMIC);
-	if (!pe)
-		return NULL;
-	mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_MAC);
-
 	/* Go through the all entires with MVPP2_PRS_LU_MAC */
 	for (tid = MVPP2_PE_MAC_RANGE_START;
 	     tid <= MVPP2_PE_MAC_RANGE_END; tid++) {
@@ -3793,17 +3764,16 @@ mvpp2_prs_mac_da_range_find(struct mvpp2 *priv, int pmap, const u8 *da,
 		    (priv->prs_shadow[tid].udf != udf_type))
 			continue;
 
-		pe->index = tid;
-		mvpp2_prs_hw_read(priv, pe);
-		entry_pmap = mvpp2_prs_tcam_port_map_get(pe);
+		pe.index = tid;
+		mvpp2_prs_hw_read(priv, &pe);
+		entry_pmap = mvpp2_prs_tcam_port_map_get(&pe);
 
-		if (mvpp2_prs_mac_range_equals(pe, da, mask) &&
+		if (mvpp2_prs_mac_range_equals(&pe, da, mask) &&
 		    entry_pmap == pmap)
-			return pe;
+			return tid;
 	}
-	kfree(pe);
 
-	return NULL;
+	return -ENOENT;
 }
 
 /* Update parser's mac da entry */
@@ -3813,15 +3783,15 @@ static int mvpp2_prs_mac_da_accept(struct mvpp2_port *port, const u8 *da,
 	unsigned char mask[ETH_ALEN] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
 	struct mvpp2 *priv = port->priv;
 	unsigned int pmap, len, ri;
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
 	/* Scan TCAM and see if entry with this <MAC DA, port> already exist */
-	pe = mvpp2_prs_mac_da_range_find(priv, BIT(port->id), da, mask,
-					 MVPP2_PRS_UDF_MAC_DEF);
+	tid = mvpp2_prs_mac_da_range_find(priv, BIT(port->id), da, mask,
+					  MVPP2_PRS_UDF_MAC_DEF);
 
 	/* No such entry */
-	if (!pe) {
+	if (tid < 0) {
 		if (!add)
 			return 0;
 
@@ -3833,39 +3803,39 @@ static int mvpp2_prs_mac_da_accept(struct mvpp2_port *port, const u8 *da,
 		if (tid < 0)
 			return tid;
 
-		pe = kzalloc(sizeof(*pe), GFP_ATOMIC);
-		if (!pe)
-			return -ENOMEM;
-		mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_MAC);
-		pe->index = tid;
+		memset(&pe, 0, sizeof(pe));
+
+		pe.index = tid;
 
 		/* Mask all ports */
-		mvpp2_prs_tcam_port_map_set(pe, 0);
+		mvpp2_prs_tcam_port_map_set(&pe, 0);
+	} else {
+		mvpp2_prs_hw_read(priv, &pe);
 	}
 
+	mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_MAC);
+
 	/* Update port mask */
-	mvpp2_prs_tcam_port_set(pe, port->id, add);
+	mvpp2_prs_tcam_port_set(&pe, port->id, add);
 
 	/* Invalidate the entry if no ports are left enabled */
-	pmap = mvpp2_prs_tcam_port_map_get(pe);
+	pmap = mvpp2_prs_tcam_port_map_get(&pe);
 	if (pmap == 0) {
-		if (add) {
-			kfree(pe);
+		if (add)
 			return -EINVAL;
-		}
-		mvpp2_prs_hw_inv(priv, pe->index);
-		priv->prs_shadow[pe->index].valid = false;
-		kfree(pe);
+
+		mvpp2_prs_hw_inv(priv, pe.index);
+		priv->prs_shadow[pe.index].valid = false;
 		return 0;
 	}
 
 	/* Continue - set next lookup */
-	mvpp2_prs_sram_next_lu_set(pe, MVPP2_PRS_LU_DSA);
+	mvpp2_prs_sram_next_lu_set(&pe, MVPP2_PRS_LU_DSA);
 
 	/* Set match on DA */
 	len = ETH_ALEN;
 	while (len--)
-		mvpp2_prs_tcam_data_byte_set(pe, len, da[len], 0xff);
+		mvpp2_prs_tcam_data_byte_set(&pe, len, da[len], 0xff);
 
 	/* Set result info bits */
 	if (is_broadcast_ether_addr(da)) {
@@ -3879,21 +3849,19 @@ static int mvpp2_prs_mac_da_accept(struct mvpp2_port *port, const u8 *da,
 			ri |= MVPP2_PRS_RI_MAC_ME_MASK;
 	}
 
-	mvpp2_prs_sram_ri_update(pe, ri, MVPP2_PRS_RI_L2_CAST_MASK |
+	mvpp2_prs_sram_ri_update(&pe, ri, MVPP2_PRS_RI_L2_CAST_MASK |
 				 MVPP2_PRS_RI_MAC_ME_MASK);
-	mvpp2_prs_shadow_ri_set(priv, pe->index, ri, MVPP2_PRS_RI_L2_CAST_MASK |
+	mvpp2_prs_shadow_ri_set(priv, pe.index, ri, MVPP2_PRS_RI_L2_CAST_MASK |
 				MVPP2_PRS_RI_MAC_ME_MASK);
 
 	/* Shift to ethertype */
-	mvpp2_prs_sram_shift_set(pe, 2 * ETH_ALEN,
+	mvpp2_prs_sram_shift_set(&pe, 2 * ETH_ALEN,
 				 MVPP2_PRS_SRAM_OP_SEL_SHIFT_ADD);
 
 	/* Update shadow table and hw entry */
-	priv->prs_shadow[pe->index].udf = MVPP2_PRS_UDF_MAC_DEF;
-	mvpp2_prs_shadow_set(priv, pe->index, MVPP2_PRS_LU_MAC);
-	mvpp2_prs_hw_write(priv, pe);
-
-	kfree(pe);
+	priv->prs_shadow[pe.index].udf = MVPP2_PRS_UDF_MAC_DEF;
+	mvpp2_prs_shadow_set(priv, pe.index, MVPP2_PRS_LU_MAC);
+	mvpp2_prs_hw_write(priv, &pe);
 
 	return 0;
 }
@@ -4014,13 +3982,13 @@ static int mvpp2_prs_tag_mode_set(struct mvpp2 *priv, int port, int type)
 /* Set prs flow for the port */
 static int mvpp2_prs_def_flow(struct mvpp2_port *port)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = mvpp2_prs_flow_find(port->priv, port->id);
+	tid = mvpp2_prs_flow_find(port->priv, port->id);
 
 	/* Such entry not exist */
-	if (!pe) {
+	if (tid < 0) {
 		/* Go through the all entires from last to first */
 		tid = mvpp2_prs_tcam_first_free(port->priv,
 						MVPP2_PE_LAST_FREE_TID,
@@ -4028,24 +3996,21 @@ static int mvpp2_prs_def_flow(struct mvpp2_port *port)
 		if (tid < 0)
 			return tid;
 
-		pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-		if (!pe)
-			return -ENOMEM;
-
-		mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_FLOWS);
-		pe->index = tid;
+		pe.index = tid;
 
 		/* Set flow ID*/
-		mvpp2_prs_sram_ai_update(pe, port->id, MVPP2_PRS_FLOW_ID_MASK);
-		mvpp2_prs_sram_bits_set(pe, MVPP2_PRS_SRAM_LU_DONE_BIT, 1);
+		mvpp2_prs_sram_ai_update(&pe, port->id, MVPP2_PRS_FLOW_ID_MASK);
+		mvpp2_prs_sram_bits_set(&pe, MVPP2_PRS_SRAM_LU_DONE_BIT, 1);
 
 		/* Update shadow table */
-		mvpp2_prs_shadow_set(port->priv, pe->index, MVPP2_PRS_LU_FLOWS);
+		mvpp2_prs_shadow_set(port->priv, pe.index, MVPP2_PRS_LU_FLOWS);
+	} else {
+		mvpp2_prs_hw_read(port->priv, &pe);
 	}
 
-	mvpp2_prs_tcam_port_map_set(pe, (1 << port->id));
-	mvpp2_prs_hw_write(port->priv, pe);
-	kfree(pe);
+	mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_FLOWS);
+	mvpp2_prs_tcam_port_map_set(&pe, (1 << port->id));
+	mvpp2_prs_hw_write(port->priv, &pe);
 
 	return 0;
 }
-- 
2.11.0

^ permalink raw reply related

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
From: Thomas Gleixner @ 2018-03-21 15:11 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Eric Dumazet, Jesus Sanchez-Palencia, netdev, jhs, xiyou.wangcong,
	jiri, vinicius.gomes, anna-maria, henrik, John Stultz,
	levi.pearson, edumazet, willemb, mlichvar
In-Reply-To: <20180321145930.4hlmihghljpv2wrq@localhost>

On Wed, 21 Mar 2018, Richard Cochran wrote:

@Intel: I removed intel-wired-lan@ as I have absolutely zero interest in
	the moderation spam from that list. Can you please either get rid
	of this moderation nonsense or stop CC'ing that list when posting
	to lkml/netdev?

> On Wed, Mar 21, 2018 at 01:58:51PM +0100, Thomas Gleixner wrote:
> > Errm. No. There is no way to support fd based clocks or one of the CPU
> > time/process time based clocks for this.
> 
> Why not?
>  
> If the we have HW offloading, then the transmit time had better be
> expressed in terms of the MAC's internal clock.  Otherwise we would
> need to translate between a kernel clock and the MAC clock, but that
> is expensive (eg over PCIe) and silly (because in a typical use case
> the MAC will already be synchronized to the network time).

Sure, but you CANNOT use a clockid for that because there is NONE.

The mac clock is exposed via a dynamic posix clock and can only be
referenced via a file descriptor.

The qdisc setup does fd = open(...) and hands that in as clockid. Later the
application does fd = open(...) and uses that as clockid for tagging the
messages.

What the heck guarantees that both the setup and the application will get
the same fd number?

Exactly nothing. So any attempt to use the filedescriptor number as clockid
is broken by definition.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH net-next 5/6] tls: RX path for ktls
From: Dave Watson @ 2018-03-21 15:10 UTC (permalink / raw)
  To: Boris Pismenny
  Cc: David S. Miller, Tom Herbert, Alexei Starovoitov, herbert,
	linux-crypto, netdev, Atul Gupta, Vakul Garg,
	Hannes Frederic Sowa, Steffen Klassert, John Fastabend,
	Daniel Borkmann
In-Reply-To: <95b3174b-c860-3c87-2c8e-0917ccb72f1d@mellanox.com>

On 03/21/18 07:20 AM, Boris Pismenny wrote:
> 
> 
> On 3/20/2018 7:54 PM, Dave Watson wrote:
> > +	ctx->control = header[0];
> > +
> > +	data_len = ((header[4] & 0xFF) | (header[3] << 8));
> > +
> > +	cipher_overhead = tls_ctx->rx.tag_size + tls_ctx->rx.iv_size;
> > +
> > +	if (data_len > TLS_MAX_PAYLOAD_SIZE + cipher_overhead) {
> > +		ret = -EMSGSIZE;
> > +		goto read_failure;
> > +	}
> > +	if (data_len < cipher_overhead) {
> > +		ret = -EMSGSIZE;
> 
> I think this should be considered EBADMSG, because this error is cipher
> dependent. At least, that's what happens within OpenSSL. Also, EMSGSIZE is
> usually used only for too long messages.

Ah, indeed.  Thanks, will send v2.

^ permalink raw reply

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
From: Dave Watson @ 2018-03-21 15:08 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, netdev, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <20180320024510.7408-7-saeedm@mellanox.com>

On 03/19/18 07:45 PM, Saeed Mahameed wrote:
> +#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
> +	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
> +	 TLS_DRIVER_STATE_SIZE)
> +
> +	pfrag = sk_page_frag(sk);
> +
> +	/* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and

I think the define is actually TLS_HEADER_SIZE, no KTLS_ prefix

> +	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
> +
> +	ctx->rec_seq_size = rec_seq_size;
> +	/* worst case is:
> +	 * MAX_SKB_FRAGS in tls_record_info
> +	 * MAX_SKB_FRAGS + 1 in SKB head an frags.

spelling

> +int tls_sw_fallback_init(struct sock *sk,
> +			 struct tls_offload_context *offload_ctx,
> +			 struct tls_crypto_info *crypto_info)
> +{
> +	int rc;
> +	const u8 *key;
> +
> +	offload_ctx->aead_send =
> +	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);

in tls_sw we went with async + crypto_wait_req, any reason to not do
that here?  Otherwise I think you still get the software gcm on x86
instead of aesni without additional changes.

> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index d824d548447e..e0dface33017 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -54,6 +54,9 @@ enum {
>  enum {
>  	TLS_BASE_TX,
>  	TLS_SW_TX,
> +#ifdef CONFIG_TLS_DEVICE
> +	TLS_HW_TX,
> +#endif
>  	TLS_NUM_CONFIG,
>  };

I have posted SW_RX patches, do you forsee any issues with SW_RX + HW_TX?

Thanks

^ permalink raw reply

* Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
From: Richard Cochran @ 2018-03-21 15:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jesus Sanchez-Palencia, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, intel-wired-lan, anna-maria, henrik, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar
In-Reply-To: <alpine.DEB.2.21.1803211448310.3754@nanos.tec.linutronix.de>

On Wed, Mar 21, 2018 at 03:22:11PM +0100, Thomas Gleixner wrote:
> Which clockid will be handed in from the application? The network adapter
> time has no fixed clockid. The only way you can get to it is via a fd based
> posix clock and that does not work at all because the qdisc setup might
> have a different FD than the application which queues packets.

Duh.  That explains it.  Please ignore my "why not?" Q in the other thread...

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH net-next v4 2/2] net: bpf: add a test for skb_segment in test_bpf module
From: Eric Dumazet @ 2018-03-21 15:03 UTC (permalink / raw)
  To: Yonghong Song, edumazet, ast, daniel, diptanu, netdev; +Cc: kernel-team
In-Reply-To: <20180321064722.1411857-3-yhs@fb.com>



On 03/20/2018 11:47 PM, Yonghong Song wrote:
...

+
> +static __init int test_skb_segment(void)
> +{
> +	netdev_features_t features;
> +	struct sk_buff *skb;
> +	int ret = -1;
> +
> +	features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
> +		   NETIF_F_IPV6_CSUM;
> +	features |= NETIF_F_RXCSUM;
> +	skb = build_test_skb();
> +	if (!skb) {
> +		pr_info("%s: failed to build_test_skb", __func__);
> +		goto done;
> +	}
> +
> +	if (skb_segment(skb, features)) {
> +		ret = 0;
> +		pr_info("%s: success in skb_segment!", __func__);
> +	} else {
> +		pr_info("%s: failed in skb_segment!", __func__);
> +	}

If skb_segmen() was successful skb was already freed.

kfree_skb(old_skb) should thus panic the box, if you run this code
on a kernel having some debugging features like KASAN

So if you do not store in a variable the return of skb_segment(),
you can not properly free memory.

> +	kfree_skb(skb);
> +done:
> +	return ret;
> +}
> +
> 
Please make sure to fully test this code.

^ permalink raw reply

* [PATCH 2/2] bpftool: Adjust to new print_bpf_insn interface
From: Jiri Olsa @ 2018-03-21 15:02 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: lkml, netdev, Quentin Monnet
In-Reply-To: <20180321150212.5586-1-jolsa@kernel.org>

Change bpftool to skip the removed struct bpf_verifier_env
argument in print_bpf_insn. It was passed as NULL anyway.

No functional change intended.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/bpf/bpftool/prog.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index e549e329be82..108001d974ee 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -489,7 +489,7 @@ static struct kernel_sym *kernel_syms_search(struct dump_data *dd,
 		       sizeof(*dd->sym_mapping), kernel_syms_cmp) : NULL;
 }
 
-static void print_insn(struct bpf_verifier_env *env, const char *fmt, ...)
+static void print_insn(void *private_data, const char *fmt, ...)
 {
 	va_list args;
 
@@ -576,7 +576,7 @@ static void dump_xlated_plain(struct dump_data *dd, void *buf,
 		double_insn = insn[i].code == (BPF_LD | BPF_IMM | BPF_DW);
 
 		printf("% 4d: ", i);
-		print_bpf_insn(&cbs, NULL, insn + i, true);
+		print_bpf_insn(&cbs, insn + i, true);
 
 		if (opcodes) {
 			printf("       ");
@@ -590,7 +590,7 @@ static void dump_xlated_plain(struct dump_data *dd, void *buf,
 	}
 }
 
-static void print_insn_json(struct bpf_verifier_env *env, const char *fmt, ...)
+static void print_insn_json(void *private_data, const char *fmt, ...)
 {
 	unsigned int l = strlen(fmt);
 	char chomped_fmt[l];
@@ -628,7 +628,7 @@ static void dump_xlated_json(struct dump_data *dd, void *buf,
 
 		jsonw_start_object(json_wtr);
 		jsonw_name(json_wtr, "disasm");
-		print_bpf_insn(&cbs, NULL, insn + i, true);
+		print_bpf_insn(&cbs, insn + i, true);
 
 		if (opcodes) {
 			jsonw_name(json_wtr, "opcodes");
-- 
2.13.6

^ permalink raw reply related

* [PATCH 1/2] bpf: Remove struct bpf_verifier_env argument from print_bpf_insn
From: Jiri Olsa @ 2018-03-21 15:02 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: lkml, netdev, Quentin Monnet

We use print_bpf_insn in user space (bpftool and soon perf),
so it'd be nice to keep it generic and strip it off the kernel
struct bpf_verifier_env argument.

This argument can be safely removed, because its users can
use the struct bpf_insn_cbs::private_data to pass it.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/bpf/disasm.c   | 52 +++++++++++++++++++++++++--------------------------
 kernel/bpf/disasm.h   |  5 +----
 kernel/bpf/verifier.c |  6 +++---
 3 files changed, 30 insertions(+), 33 deletions(-)

diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c
index 8740406df2cd..d6b76377cb6e 100644
--- a/kernel/bpf/disasm.c
+++ b/kernel/bpf/disasm.c
@@ -113,16 +113,16 @@ static const char *const bpf_jmp_string[16] = {
 };
 
 static void print_bpf_end_insn(bpf_insn_print_t verbose,
-			       struct bpf_verifier_env *env,
+			       void *private_data,
 			       const struct bpf_insn *insn)
 {
-	verbose(env, "(%02x) r%d = %s%d r%d\n", insn->code, insn->dst_reg,
+	verbose(private_data, "(%02x) r%d = %s%d r%d\n",
+		insn->code, insn->dst_reg,
 		BPF_SRC(insn->code) == BPF_TO_BE ? "be" : "le",
 		insn->imm, insn->dst_reg);
 }
 
 void print_bpf_insn(const struct bpf_insn_cbs *cbs,
-		    struct bpf_verifier_env *env,
 		    const struct bpf_insn *insn,
 		    bool allow_ptr_leaks)
 {
@@ -132,23 +132,23 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 	if (class == BPF_ALU || class == BPF_ALU64) {
 		if (BPF_OP(insn->code) == BPF_END) {
 			if (class == BPF_ALU64)
-				verbose(env, "BUG_alu64_%02x\n", insn->code);
+				verbose(cbs->private_data, "BUG_alu64_%02x\n", insn->code);
 			else
-				print_bpf_end_insn(verbose, env, insn);
+				print_bpf_end_insn(verbose, cbs->private_data, insn);
 		} else if (BPF_OP(insn->code) == BPF_NEG) {
-			verbose(env, "(%02x) r%d = %s-r%d\n",
+			verbose(cbs->private_data, "(%02x) r%d = %s-r%d\n",
 				insn->code, insn->dst_reg,
 				class == BPF_ALU ? "(u32) " : "",
 				insn->dst_reg);
 		} else if (BPF_SRC(insn->code) == BPF_X) {
-			verbose(env, "(%02x) %sr%d %s %sr%d\n",
+			verbose(cbs->private_data, "(%02x) %sr%d %s %sr%d\n",
 				insn->code, class == BPF_ALU ? "(u32) " : "",
 				insn->dst_reg,
 				bpf_alu_string[BPF_OP(insn->code) >> 4],
 				class == BPF_ALU ? "(u32) " : "",
 				insn->src_reg);
 		} else {
-			verbose(env, "(%02x) %sr%d %s %s%d\n",
+			verbose(cbs->private_data, "(%02x) %sr%d %s %s%d\n",
 				insn->code, class == BPF_ALU ? "(u32) " : "",
 				insn->dst_reg,
 				bpf_alu_string[BPF_OP(insn->code) >> 4],
@@ -157,46 +157,46 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 		}
 	} else if (class == BPF_STX) {
 		if (BPF_MODE(insn->code) == BPF_MEM)
-			verbose(env, "(%02x) *(%s *)(r%d %+d) = r%d\n",
+			verbose(cbs->private_data, "(%02x) *(%s *)(r%d %+d) = r%d\n",
 				insn->code,
 				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
 				insn->dst_reg,
 				insn->off, insn->src_reg);
 		else if (BPF_MODE(insn->code) == BPF_XADD)
-			verbose(env, "(%02x) lock *(%s *)(r%d %+d) += r%d\n",
+			verbose(cbs->private_data, "(%02x) lock *(%s *)(r%d %+d) += r%d\n",
 				insn->code,
 				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
 				insn->dst_reg, insn->off,
 				insn->src_reg);
 		else
-			verbose(env, "BUG_%02x\n", insn->code);
+			verbose(cbs->private_data, "BUG_%02x\n", insn->code);
 	} else if (class == BPF_ST) {
 		if (BPF_MODE(insn->code) != BPF_MEM) {
-			verbose(env, "BUG_st_%02x\n", insn->code);
+			verbose(cbs->private_data, "BUG_st_%02x\n", insn->code);
 			return;
 		}
-		verbose(env, "(%02x) *(%s *)(r%d %+d) = %d\n",
+		verbose(cbs->private_data, "(%02x) *(%s *)(r%d %+d) = %d\n",
 			insn->code,
 			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
 			insn->dst_reg,
 			insn->off, insn->imm);
 	} else if (class == BPF_LDX) {
 		if (BPF_MODE(insn->code) != BPF_MEM) {
-			verbose(env, "BUG_ldx_%02x\n", insn->code);
+			verbose(cbs->private_data, "BUG_ldx_%02x\n", insn->code);
 			return;
 		}
-		verbose(env, "(%02x) r%d = *(%s *)(r%d %+d)\n",
+		verbose(cbs->private_data, "(%02x) r%d = *(%s *)(r%d %+d)\n",
 			insn->code, insn->dst_reg,
 			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
 			insn->src_reg, insn->off);
 	} else if (class == BPF_LD) {
 		if (BPF_MODE(insn->code) == BPF_ABS) {
-			verbose(env, "(%02x) r0 = *(%s *)skb[%d]\n",
+			verbose(cbs->private_data, "(%02x) r0 = *(%s *)skb[%d]\n",
 				insn->code,
 				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
 				insn->imm);
 		} else if (BPF_MODE(insn->code) == BPF_IND) {
-			verbose(env, "(%02x) r0 = *(%s *)skb[r%d + %d]\n",
+			verbose(cbs->private_data, "(%02x) r0 = *(%s *)skb[r%d + %d]\n",
 				insn->code,
 				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
 				insn->src_reg, insn->imm);
@@ -212,12 +212,12 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 			if (map_ptr && !allow_ptr_leaks)
 				imm = 0;
 
-			verbose(env, "(%02x) r%d = %s\n",
+			verbose(cbs->private_data, "(%02x) r%d = %s\n",
 				insn->code, insn->dst_reg,
 				__func_imm_name(cbs, insn, imm,
 						tmp, sizeof(tmp)));
 		} else {
-			verbose(env, "BUG_ld_%02x\n", insn->code);
+			verbose(cbs->private_data, "BUG_ld_%02x\n", insn->code);
 			return;
 		}
 	} else if (class == BPF_JMP) {
@@ -227,35 +227,35 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 			char tmp[64];
 
 			if (insn->src_reg == BPF_PSEUDO_CALL) {
-				verbose(env, "(%02x) call pc%s\n",
+				verbose(cbs->private_data, "(%02x) call pc%s\n",
 					insn->code,
 					__func_get_name(cbs, insn,
 							tmp, sizeof(tmp)));
 			} else {
 				strcpy(tmp, "unknown");
-				verbose(env, "(%02x) call %s#%d\n", insn->code,
+				verbose(cbs->private_data, "(%02x) call %s#%d\n", insn->code,
 					__func_get_name(cbs, insn,
 							tmp, sizeof(tmp)),
 					insn->imm);
 			}
 		} else if (insn->code == (BPF_JMP | BPF_JA)) {
-			verbose(env, "(%02x) goto pc%+d\n",
+			verbose(cbs->private_data, "(%02x) goto pc%+d\n",
 				insn->code, insn->off);
 		} else if (insn->code == (BPF_JMP | BPF_EXIT)) {
-			verbose(env, "(%02x) exit\n", insn->code);
+			verbose(cbs->private_data, "(%02x) exit\n", insn->code);
 		} else if (BPF_SRC(insn->code) == BPF_X) {
-			verbose(env, "(%02x) if r%d %s r%d goto pc%+d\n",
+			verbose(cbs->private_data, "(%02x) if r%d %s r%d goto pc%+d\n",
 				insn->code, insn->dst_reg,
 				bpf_jmp_string[BPF_OP(insn->code) >> 4],
 				insn->src_reg, insn->off);
 		} else {
-			verbose(env, "(%02x) if r%d %s 0x%x goto pc%+d\n",
+			verbose(cbs->private_data, "(%02x) if r%d %s 0x%x goto pc%+d\n",
 				insn->code, insn->dst_reg,
 				bpf_jmp_string[BPF_OP(insn->code) >> 4],
 				insn->imm, insn->off);
 		}
 	} else {
-		verbose(env, "(%02x) %s\n",
+		verbose(cbs->private_data, "(%02x) %s\n",
 			insn->code, bpf_class_string[class]);
 	}
 }
diff --git a/kernel/bpf/disasm.h b/kernel/bpf/disasm.h
index 266fe8ee542b..e1324a834a24 100644
--- a/kernel/bpf/disasm.h
+++ b/kernel/bpf/disasm.h
@@ -22,14 +22,12 @@
 #include <string.h>
 #endif
 
-struct bpf_verifier_env;
-
 extern const char *const bpf_alu_string[16];
 extern const char *const bpf_class_string[8];
 
 const char *func_id_name(int id);
 
-typedef __printf(2, 3) void (*bpf_insn_print_t)(struct bpf_verifier_env *env,
+typedef __printf(2, 3) void (*bpf_insn_print_t)(void *private_data,
 						const char *, ...);
 typedef const char *(*bpf_insn_revmap_call_t)(void *private_data,
 					      const struct bpf_insn *insn);
@@ -45,7 +43,6 @@ struct bpf_insn_cbs {
 };
 
 void print_bpf_insn(const struct bpf_insn_cbs *cbs,
-		    struct bpf_verifier_env *env,
 		    const struct bpf_insn *insn,
 		    bool allow_ptr_leaks);
 #endif
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c6eff108aa99..9f27d3fa7259 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -202,8 +202,7 @@ EXPORT_SYMBOL_GPL(bpf_verifier_log_write);
  * generic for symbol export. The function was renamed, but not the calls in
  * the verifier to avoid complicating backports. Hence the alias below.
  */
-static __printf(2, 3) void verbose(struct bpf_verifier_env *env,
-				   const char *fmt, ...)
+static __printf(2, 3) void verbose(void *private_data, const char *fmt, ...)
 	__attribute__((alias("bpf_verifier_log_write")));
 
 static bool type_is_pkt_pointer(enum bpf_reg_type type)
@@ -4601,10 +4600,11 @@ static int do_check(struct bpf_verifier_env *env)
 		if (env->log.level) {
 			const struct bpf_insn_cbs cbs = {
 				.cb_print	= verbose,
+				.private_data	= env,
 			};
 
 			verbose(env, "%d: ", insn_idx);
-			print_bpf_insn(&cbs, env, insn, env->allow_ptr_leaks);
+			print_bpf_insn(&cbs, insn, env->allow_ptr_leaks);
 		}
 
 		if (bpf_prog_is_dev_bound(env->prog->aux)) {
-- 
2.13.6

^ permalink raw reply related

* Re: [PATCH net-next] fib_rules: rename FRA_PROTOCOL to FRA_ORIGINATOR
From: David Ahern @ 2018-03-21 15:00 UTC (permalink / raw)
  To: nicolas.dichtel, davem; +Cc: netdev, Donald Sharp
In-Reply-To: <74299bf5-ea7f-eaa9-25de-fd485fd6f9ad@6wind.com>

On 3/21/18 3:24 AM, Nicolas Dichtel wrote:
> Le 20/03/2018 à 18:27, David Ahern a écrit :
>> On 3/20/18 11:04 AM, Nicolas Dichtel wrote:
>>> As the comment said, this attribute defines the originator of the rule,
>>> it's not really a (network) protocol.
>>> Let's rename it accordingly to avoid confusion (difference between
>>> FRA_PROTOCOL and FRA_IP_PROTO was not obvious).
>>>
>>> CC: Donald Sharp <sharpd@cumulusnetworks.com>
>>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>>> ---
>>>
>>> FRA_PROTOCOL exists only in net-next for now, thus it's still possible to
>>> rename it.
>>>
>>>  drivers/net/vrf.c              |  4 ++--
>>>  include/net/fib_rules.h        |  4 ++--
>>>  include/uapi/linux/fib_rules.h |  2 +-
>>>  net/core/fib_rules.c           | 14 +++++++-------
>>>  4 files changed, 12 insertions(+), 12 deletions(-)
>>>
>>>
>>
>> This protocol is meant to be analogous to rtm_protocol. Changing the
>> name to FRA_ORIGINATOR loses that connection.
> I understand your concerns. But I think the connection still exists after the
> patch because the values used for this field are RTPROT_*
> rtm_protocol is here from ages and we cannot change that. Moreover, FRA_*
> attributes are usually used as a selector or a target, which is not the case for
> a route. Thus I think it's important to carrefully choose the name.

The rule->proto value is not used as a selector. It is passed in, stored
on a rule and returned to userspace. It is book keeping only so an admin
has some idea of which program installed the rule.

^ permalink raw reply

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
From: Richard Cochran @ 2018-03-21 14:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Eric Dumazet, Jesus Sanchez-Palencia, netdev, jhs, xiyou.wangcong,
	jiri, vinicius.gomes, intel-wired-lan, anna-maria, henrik,
	john.stultz, levi.pearson, edumazet, willemb, mlichvar
In-Reply-To: <alpine.DEB.2.21.1803211244320.3754@nanos.tec.linutronix.de>

On Wed, Mar 21, 2018 at 01:58:51PM +0100, Thomas Gleixner wrote:
> Errm. No. There is no way to support fd based clocks or one of the CPU
> time/process time based clocks for this.

Why not?
 
If the we have HW offloading, then the transmit time had better be
expressed in terms of the MAC's internal clock.  Otherwise we would
need to translate between a kernel clock and the MAC clock, but that
is expensive (eg over PCIe) and silly (because in a typical use case
the MAC will already be synchronized to the network time).

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb
From: Alexander Duyck @ 2018-03-21 14:59 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Eric Dumazet, ast, Daniel Borkmann, diptanu, Netdev, Kernel Team
In-Reply-To: <9482a78e-f49d-c9b4-cbff-b1f287dee0d0@fb.com>

On Tue, Mar 20, 2018 at 10:02 PM, Yonghong Song <yhs@fb.com> wrote:
>
>
> On 3/20/18 4:50 PM, Alexander Duyck wrote:
>>
>> On Tue, Mar 20, 2018 at 4:21 PM, Yonghong Song <yhs@fb.com> wrote:
>>>
>>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>>> function skb_segment(), line 3667. The bpf program attaches to
>>> clsact ingress, calls bpf_skb_change_proto to change protocol
>>> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
>>> to send the changed packet out.
>>>
>>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>> 3473                             netdev_features_t features)
>>> 3474 {
>>> 3475         struct sk_buff *segs = NULL;
>>> 3476         struct sk_buff *tail = NULL;
>>> ...
>>> 3665                 while (pos < offset + len) {
>>> 3666                         if (i >= nfrags) {
>>> 3667                                 BUG_ON(skb_headlen(list_skb));
>>> 3668
>>> 3669                                 i = 0;
>>> 3670                                 nfrags =
>>> skb_shinfo(list_skb)->nr_frags;
>>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>>> 3672                                 frag_skb = list_skb;
>>> ...
>>>
>>> call stack:
>>> ...
>>>   #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>>   #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>>   #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>>   #4 [ffff883ffef03668] die at ffffffff8101deb2
>>>   #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>>   #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>>   #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>>   #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>>      [exception RIP: skb_segment+3044]
>>>      RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>>      RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>>      RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>>      RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>>      R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>>      R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>   #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>>> --- <IRQ stack> ---
>>> ...
>>>
>>> The triggering input skb has the following properties:
>>>      list_skb = skb->frag_list;
>>>      skb->nfrags != NULL && skb_headlen(list_skb) != 0
>>> and skb_segment() is not able to handle a frag_list skb
>>> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>>>
>>> This patch addressed the issue by handling skb_headlen(list_skb) != 0
>>> case properly if list_skb->head_frag is true, which is expected in
>>> most cases. The head frag is processed before list_skb->frags
>>> are processed.
>>>
>>> Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>> ---
>>>   net/core/skbuff.c | 51
>>> +++++++++++++++++++++++++++++++++++++--------------
>>>   1 file changed, 37 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index 715c134..59bbc06 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>          struct sk_buff *segs = NULL;
>>>          struct sk_buff *tail = NULL;
>>>          struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>>> -       skb_frag_t *frag = skb_shinfo(head_skb)->frags;
>>> +       skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag =
>>> NULL;
>>
>>
>> I think you misunderstood me. I wasn't saying you allocate head_frag.
>> I was saying you could move the declaration down.
>
>
> Sorry for my misunderstanding. I did understand your intention of moving
> the declaration down in order to save stack space. I thought that we cannot
> really move declaration down (although it works in C, but semantically it is
> not quite right, more later), so I moved on to
> use runtime allocation. But indeed skb_frag_t is not big (16 bytes), it
> could live on the stack.
>
>>
>>>          unsigned int mss = skb_shinfo(head_skb)->gso_size;
>>>          unsigned int doffset = head_skb->data -
>>> skb_mac_header(head_skb);
>>>          struct sk_buff *frag_skb = head_skb;
>>> @@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>
>>>                  while (pos < offset + len) {
>>
>>
>> So right here in the loop you could add a "skb_frag_t head_frag;" just
>> so we declare it here and save ourselves the stack space.
>
>
> I actually tried to move "skb_frag_t head_frag". The stack size remains the
> same, 0xc0. This is related to how C compiler allocates stack space.
> The declaration place won't decide the stack size as long as the declaration
> dictates the usage. The stack size is really determined by liveness
> analysis.
>
> Further, we have code like:
>         do {
>            ....
>            while (pos < offset + len) {
>                if (i >= nfrags) {
>                    ...
>                    head_frag = ...
>                }
>                ... = head_frag; // head_frag access guaranteed after
>                                 // above definition, but it may not
>                                 // be in the same outer do-while loop.
>            }
>            ...
>          } while (((offset += len) < head_skb->len);
>
> So the use of head_frag maybe in different outer loop iterations.
> So I feel the definition of head_frag should be outside the
> outer do-while loop, which is the main function scope. I will add some
> comments here.

So the point I had is that head_frag doesn't need to live that long.
All you are doing is arranging the data so that you can essentially
just dump it into *nskb_frag.

One alternative you could look at would be to rearrange the code so
that the MAX_SKB_FRAGS check occurs first, then perform the check for
(i >= nfrags). Doing it that way we end up bypassing the need for
head_frag entirely as you could just populate *nskb_frag directly. You
could probably just add an inline structure that would convert the
head frag into a skb_frag_t structure and return that. Something along
the lines of:
        *nskb_frag = (i < 0) ? skb_head_frag_to_page_desc(frag_skb) : *frag;

Doing it that way you could pull out the bits below that were
populating head frag and just have it all handled as an inline
function. An added advantage is that it should make the line wrapping
easier to deal with. :-)

>>
>>>                          if (i >= nfrags) {
>>> -                               BUG_ON(skb_headlen(list_skb));
>>> -
>>>                                  i = 0;
>>> +                               if (skb_headlen(list_skb)) {
>>> +                                       struct page *page;
>>> +
>>> +                                       BUG_ON(!list_skb->head_frag);
>>> +
>>> +                                       page =
>>> virt_to_head_page(list_skb->head);
>>> +                                       if (!head_frag) {
>>> +                                               head_frag =
>>> kmalloc(sizeof(skb_frag_t),
>>> +
>>> GFP_KERNEL);
>>> +                                               if (!head_frag)
>>> +                                                       goto err;
>>> +                                       }
>>
>>
>> Please no memory allocation. I just meant you could allocate it on the
>> stack later.
>>
>>> +                                       head_frag->page.p = page;
>>> +                                       head_frag->page_offset =
>>> list_skb->data -
>>> +                                               (unsigned char
>>> *)page_address(page);
>>> +                                       head_frag->size =
>>> skb_headlen(list_skb);
>>> +                                       /* set i = -1 so we will pick
>>> head_frag
>>> +                                        * instead of
>>> skb_shinfo(list_skb)->frags
>>> +                                        * when i == -1.
>>> +                                        */
>>> +                                       i = -1;
>>> +                               }
>>
>>
>> So it took me a bit to pick up on the fact that line below wasn't
>> removed. So we are basically trying to do this all in one pass now. Do
>> I have that right?
>>
>> One thing you could look at doing to save yourself the extra "if"
>> later would be to pull frag pointer before you go through skb_headlen
>> check above. Then if you are going to use a head_frag you could just
>> do a "i--; frag--;" combination just to rewind and make the room for
>> the increment to come later. That way you don't have an invalid frag
>> pointer floating around. That way you only have to do this once
>> instead of having to do a conditional check per fragment.
>
>
> Right. This indeed make code more cleaner.
>
>>
>>>                                  nfrags = skb_shinfo(list_skb)->nr_frags;
>>> -                               frag = skb_shinfo(list_skb)->frags;
>>
>>
>> This patch might be more readable if you were to just insert the
>> skb_headlen() bits down here and left the i=0 through frag = .. in one
>> piece.
>
>
> Right. Will implement as suggested.
>
>>
>>> -                               frag_skb = list_skb;
>>> -
>>> -                               BUG_ON(!nfrags);
>>> -
>>> -                               if (skb_orphan_frags(frag_skb,
>>> GFP_ATOMIC) ||
>>> -                                   skb_zerocopy_clone(nskb, frag_skb,
>>> -                                                      GFP_ATOMIC))
>>> -                                       goto err;
>>> +                               if (nfrags) {
>>> +                                       frag =
>>> skb_shinfo(list_skb)->frags;
>>> +                                       frag_skb = list_skb;
>>> +
>>> +                                       if (skb_orphan_frags(frag_skb,
>>> GFP_ATOMIC) ||
>>> +                                           skb_zerocopy_clone(nskb,
>>> frag_skb,
>>> +
>>> GFP_ATOMIC))
>>> +                                               goto err;
>>> +                               }
>>>
>>>                                  list_skb = list_skb->next;
>>>                          }
>>> @@ -3689,7 +3709,7 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>                                  goto err;
>>>                          }
>>>
>>> -                       *nskb_frag = *frag;
>>> +                       *nskb_frag = (i == -1) ? *head_frag : *frag;
>>
>>
>> So this would be better as "*nskb_frag = (i < 0) ? head_frag : *frag;".
>
>
> Good suggestion. Will implement as suggested.
>
>
>>
>>>                          __skb_frag_ref(nskb_frag);
>>>                          size = skb_frag_size(nskb_frag);
>>>
>>> @@ -3702,7 +3722,8 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>
>>>                          if (pos + size <= offset + len) {
>>>                                  i++;
>>> -                               frag++;
>>> +                               if (i != 0)
>>> +                                       frag++;
>>>                                  pos += size;
>>>                          } else {
>>>                                  skb_frag_size_sub(nskb_frag, pos + size
>>> - (offset + len));
>>> @@ -3774,10 +3795,12 @@ struct sk_buff *skb_segment(struct sk_buff
>>> *head_skb,
>>>                  swap(tail->destructor, head_skb->destructor);
>>>                  swap(tail->sk, head_skb->sk);
>>>          }
>>> +       kfree(head_frag);
>>>          return segs;
>>>
>>>   err:
>>>          kfree_skb_list(segs);
>>> +       kfree(head_frag);
>>>          return ERR_PTR(err);
>>>   }
>>>   EXPORT_SYMBOL_GPL(skb_segment);
>>> --
>>> 2.9.5
>>>
>

^ permalink raw reply

* Re: [PATCH net-next] Documentation/networking: Add net DIM documentation
From: Andy Gospodarek @ 2018-03-21 14:56 UTC (permalink / raw)
  To: Tal Gilboa; +Cc: David S. Miller, netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <1521624629-46666-1-git-send-email-talgi@mellanox.com>

On Wed, Mar 21, 2018 at 11:30:29AM +0200, Tal Gilboa wrote:
> Net DIM is a generic algorithm, purposed for dynamically
> optimizing network devices interrupt moderation. This
> document describes how it works and how to use it.
> 
> Signed-off-by: Tal Gilboa <talgi@mellanox.com>

Looks like a nice summary of how to integrate it with a driver.  Thanks
for documenting DIM.

Acked-by: Andy Gospodarek <gospo@broadcom.com>

> ---
>  Documentation/networking/net_dim.txt | 174 +++++++++++++++++++++++++++++++++++
>  1 file changed, 174 insertions(+)
>  create mode 100644 Documentation/networking/net_dim.txt
> 
> diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt
> new file mode 100644
> index 0000000..ef622c8
> --- /dev/null
> +++ b/Documentation/networking/net_dim.txt
> @@ -0,0 +1,174 @@
> +Net DIM - Generic Network Dynamic Interrupt Moderation
> +======================================================
> +
> +Author:
> +	Tal Gilboa <talgi@mellanox.com>
> +
> +
> +Contents
> +=========
> +
> +- Assumptions
> +- Introduction
> +- The Net DIM Algorithm
> +- Registering a Network Device to DIM
> +- Example
> +
> +Part 0: Assumptions
> +======================
> +
> +This document assumes the reader has basic knowledge in network drivers
> +and in general interrupt moderation.
> +
> +
> +Part I: Introduction
> +======================
> +
> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the interrupt
> +moderation configuration of a channel in order to optimize packet processing.
> +The mechanism includes an algorithm which decides if and how to change
> +moderation parameters for a channel, usually by performing an analysis on
> +runtime data sampled from the system. Net DIM is such a mechanism. In each
> +iteration of the algorithm, it analyses a given sample of the data, compares it to
> +the previous sample and if required, is can decide to change some of the interrupt moderation
> +configuration fields. The data sample is composed of data bandwidth, the number of
> +packets and the number of events. The time between samples is also measured. Net DIM
> +compares the current and the previous data and returns an adjusted interrupt
> +moderation configuration object. In some cases, the algorithm might decide not
> +to change anything. The configuration fields are the minimum duration
> +(microseconds) allowed between events and the maximum number of wanted packets
> +per event. The Net DIM algorithm ascribes importance to increase bandwidth over
> +reducing interrupt rate.
> +
> +
> +Part II: The Net DIM Algorithm
> +===============================
> +
> +Each iteration of the Net DIM algorithm follows these steps:
> +1. Calculates new data sample.
> +2. Compares it to previous sample.
> +3. Makes a decision - suggests interrupt moderation configuration fields.
> +4. Applies a schedule work function, which applies suggested configuration.
> +
> +The first two steps are straight forward, both the new and the previous data are
> +supplied by the driver registered to Net DIM. The previous data is the new data
> +supplied to the previous iteration. The comparison step checks the difference
> +between the new and previous data and decides on the result of the last step. A step
> +would result as "better" if bandwidth increases and as "worse" if bandwidth
> +reduces. If there is no change in bandwidth, the packet rate is compared in a similar
> +fashion - increase == "better" and decrease == "worse". In case there is no
> +change in the packet rate as well, the interrupt rate is compared. Here the
> +algorithm tries to optimize for lower interrupt rate so an increase in the
> +interrupt rate is considered "worse" and a decrease is considered "better".
> +Step #2 has an optimization for avoiding false results, it only considers a
> +difference between samples as valid if it is greater than a certain percentage.
> +Also, since Net DIM does not measure anything by itself, it assumes the data
> +provided by the driver is valid.
> +
> +Step #3 decides on the suggested configuration based on the result from step #2
> +and the internal state of the algorithm. The states reflect the "direction" of
> +the algorithm, is it going left (reducing moderation), right (increasing
> +moderation) or standing still. Another optimization is that if a decision
> +to stay still is made multiple times, the interval between iterations of the
> +algorithm would increase in order to reduce calculation overhead. Also, after
> +"parking" on one of the most left or most right decisions, the algorithm may
> +decide to verify this decision by taking a step on the other direction. This is
> +done in order to avoid getting stuck in a "deep sleep" scenario. Once a
> +decision is made, an interrupt moderation configuration is selected from
> +the predefined profiles.
> +
> +The last step is to notify the registered driver that it should apply the suggested
> +configuration. This is done by scheduling a work function, defined by the Net DIM
> +API and provided by the registered driver.
> +
> +As you can see, Net DIM itself does not actively interact with the system. It
> +would have trouble making the correct decisions if the wrong data is supplied to it
> +and it would be useless if the work function would not apply the suggested
> +configuration. This does, however, allows the registered driver some room for manoeuvre
> +as it may provide partial data or ignore the algorithm suggestion under some
> +conditions.
> +
> +
> +Part III: Registering a Network Device to DIM
> +==============================================
> +
> +Net DIM API exposes the main function net_dim(struct net_dim *dim,
> +struct net_dim_sample end_sample). This function is the entry point to the Net
> +DIM algorithm and has to be called every time the driver would like to check if
> +it should change interrupt moderation parameters. The driver should provide two
> +data structures: struct net_dim and struct net_dim_sample. Struct net_dim
> +describes the state of DIM for a specific object (RX queue, TX queue,
> +other queues, etc.). This includes the current selected profile, previous data
> +samples, the callback function provided by the driver and more.
> +Struct net_dim_sample describes a data sample, which will be compared to the
> +data sample stored in struct net_dim in order to decide on the algorithm's next
> +step. The sample should include bytes, packets and interrupts, measured by
> +the driver.
> +
> +In order to use Net DIM from a networking driver, the driver needs to call the
> +main net_dim() function. The recommended method is to call net_dim() on each
> +interrupt. Since Net DIM has a built-in moderation and it might decide to skip
> +iterations under certain conditions, there is no need to moderate the net_dim()
> +calls as well. As mentioned above, the driver needs to provide an object of type
> +struct net_dim to the net_dim() function call. It is advised for each entity
> +using Net DIM to hold a struct net_dim as part of its data structure and use it
> +as the main Net DIM API object. The struct net_dim_sample should hold the latest
> +bytes, packets and interrupts count. No need to perform any calculations, just
> +include the raw data.
> +
> +The net_dim() call itself does not return anything. Instead Net DIM relies on the
> +driver to provide a callback function, which is called when the algorithm
> +decides to make a change in the interrupt moderation parameters. This callback
> +will be scheduled and ran in a separate thread in order not to add overhead to
> +the data flow. After the work is done, Net DIM algorithm needs to be set to
> +the proper state in order to move to the next iteration.
> +
> +
> +Part IV: Example
> +=================
> +
> +The following code demonstrates how to register a driver to Net DIM. The actual
> +usage is not complete but it should make the outline if the usage clear.
> +
> +my_driver.c:
> +
> +#include <linux/net_dim.h>
> +
> +/* Callback for net DIM to schedule on a decision to change moderation */
> +void my_driver_do_dim_work(struct work_struct *work)
> +{
> +	/* Get struct net_dim from struct work_struct */
> +	struct net_dim *dim = container_of(work, struct net_dim,
> +					   work);
> +	/* Do interrupt moderation related stuff */
> +	...
> +
> +	/* Signal net DIM work is done and it should move to next iteration */
> +	dim->state = NET_DIM_START_MEASURE;
> +}
> +
> +/* My driver's interrupt handler */
> +int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* A struct to hold current measured data */
> +	struct net_dim_sample dim_sample;
> +	...
> +	/* Initiate data sample struct with current data */
> +	net_dim_sample(my_entity->events,
> +		       my_entity->packets,
> +		       my_entity->bytes,
> +		       &dim_sample);
> +	/* Call net DIM */
> +	net_dim(&my_entity->dim, dim_sample);
> +	...
> +}
> +
> +/* My entity's initialization function (my_entity was already allocated) */
> +int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* Initiate struct work_struct with my driver's callback function */
> +	INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
> +	...
> +}

^ permalink raw reply

* Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
From: Thomas Gleixner @ 2018-03-21 14:22 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, john.stultz, levi.pearson,
	edumazet, willemb, mlichvar
In-Reply-To: <20180307011230.24001-15-jesus.sanchez-palencia@intel.com>

On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
> 
> In this example, the Qdisc will use HW offload for the control of the
> transmission time through the network adapter. It's assumed the timestamp
> in skbuffs are in reference to the interface's PHC and setting any other
> valid clockid would be treated as an error. Because there is no
> scheduling being performed in the qdisc, setting a delta != 0 would also
> be considered an error.

Which clockid will be handed in from the application? The network adapter
time has no fixed clockid. The only way you can get to it is via a fd based
posix clock and that does not work at all because the qdisc setup might
have a different FD than the application which queues packets.

I think this should look like this:

    clock_adapter:	1 = clock of the network adapter
    			0 = system clock selected by clock_system

    clock_system:	0 = CLOCK_REALTIME
    			1 = CLOCK_MONOTONIC

or something like that.

> Example 2:
> 
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
> 	   clockid CLOCK_REALTIME sorting
> 
> Here, the Qdisc will use HW offload for the txtime control again,
> but now sorting will be enabled, and thus there will be scheduling being
> performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
> reference and packets leave the Qdisc "delta" (100000) nanoseconds before
> their transmission time. Because this will be using HW offload and
> since dynamic clocks are not supported by the hrtimer, the system clock
> and the PHC clock must be synchronized for this mode to behave as expected.

So what you do here is queueing the packets in the qdisk and then schedule
them at some point ahead of actual transmission time for delivery to the
hardware. That delivery uses the same txtime as used for qdisc scheduling
to tell the hardware when the packet should go on the wire. That's needed
when the network adapter does not support queueing of multiple packets.

Bah, and probably there you need CLOCK_TAI because that's what PTP is based
on, so clock_system needs to accomodate that as well. Dammit, there goes
the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
bits plus the adapter bit.

Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
don't see us adding new fixed clocks, so we really can reserve #15 for
selecting the adapter clock if sparing that extra bit is truly required.

Thanks,

	tglx

^ permalink raw reply

* Re: HW question: i210 vs. BCM5461S over SGMII: no response from PHY to MDIO requests?
From: Frantisek Rysanek @ 2018-03-21 14:07 UTC (permalink / raw)
  To: Andrew Lunn, netdev
In-Reply-To: <20180321125842.GA11206@lunn.ch>

...
yes I've just noticed Russell's patch mentioning mvneta,
and found the phylink patches to mvneta in net-next (until then I'd 
been reading the vanilla, where they haven't landed yet). 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tre
e/drivers/net/ethernet/marvell/mvneta.c

phylink_create() looks clear enough, but I got stuck once I looked at 
phylink_register_sfp(). Which IMO asks the device tree to enumerate 
driver "sfp" or some such. I'm lost again. Thanks for hinting that 
the existing style of the Intel driver and the Linux "device tree" 
don't mix well or readily.

It's sad. Mr. King's code is so obvious and easy to follow (except 
maybe for the indirections through the device tree).

As for Intel software and support... that's a bit of love and hate on 
my part :-) The more I get involved, the more intensive the mix.

On 21 Mar 2018 at 13:58, Andrew Lunn wrote:
> > I'd love to use existing code of the phylib to talk to the SFP
> > PHY's, maybe extend the phylib a bit (with the phy's I have), rather
> > than cobble together something crude and private on my own, inside
> > the igb driver.
> 
> So this is quite a big job, to do it cleanly. You probably need to
> retain the intel code for MDIO/PHY/I2C, but add an option to make use
> of the Linux common MDIO/PHY/I2C infrastructure. Then you need to
> extend PHYLINK with a non device tree way to configure it, and glue
> all the parts together.
> 
> You can make it a bit easier by just throwing away all the intel
> MDIO/PHY/I2C code, replacing it will Linux common code. But i expect
> the Intel maintainers would then reject your changes. There is too
> high a chance of introducing regressions.
>
thanks for that explanation :-) You have saved me some time.
 
Frankly I see no chance of my code getting accepted to upstream - I'm 
critical to my own capabilities here.

int phylink_callback(char* msg)
{
	printk(msg);
}
"Hello there, there's an SGMII SFP plugged into the port that you
have registered earlier, on the mdio-i2c bus you have set up,
we have configured the PHY for you, all you need to do is tuck this
leash up your MAC and tell it to try a handshake."

It would be lovely to add SFP hot-swap support and PHY auto-config to 
the Intel driver, and it would possibly result in quite a voluminous 
crapectomy in the Intel driver, but in my position the effort 
required is clearly over my head, and, as you correctly say, I'm not 
in the position to push this back upstream :-)

I'll see how far I can go, "low hanging fruit /halfway there" style. 
For instance, the "previous-generation" API starting from 
mdiobus_alloc() looks "bounded" in terms of effort.
(Example in drivers/net/ethernet/broadcom/tg3.c)
If I understand this correctly, it should allow me to tap into 
phylib, only I'd have to give up the SFP EEPROM understanding
and PHY hot-swap (available from the phylink with SFP support).

I'll probably become silent now, for a while.
Thanks for all your polite explanations Andrew :-)

Frank

^ permalink raw reply

* Re: [PATCH] mac80211: aes-cmac: remove VLA usage
From: Gustavo A. R. Silva @ 2018-03-21 14:02 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <1521640687.2645.30.camel@sipsolutions.net>



On 03/21/2018 08:58 AM, Johannes Berg wrote:
> On Wed, 2018-03-21 at 08:57 -0500, Gustavo A. R. Silva wrote:
>>
>> SHA_DESC_ON_STACK is currently being used in multiple places. But, yeah,
>> I think we can define multiple macros of the same kind and adjust to the
>> characteristics of each the component.
>>
>> How big do you think tfm can get?
> 
> I have no idea, I guess you'll have to take that with Herbert.
> 
> johannes
> 

I see. I'll contact him then.

Thanks for the feedback.
--
Gustavo

^ permalink raw reply

* Re: [PATCH] mac80211: aes-cmac: remove VLA usage
From: Johannes Berg @ 2018-03-21 13:58 UTC (permalink / raw)
  To: Gustavo A. R. Silva, David S. Miller; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <b367cf76-e6c5-3d6d-5d81-a68ae245e6b9@embeddedor.com>

On Wed, 2018-03-21 at 08:57 -0500, Gustavo A. R. Silva wrote:
> 
> SHA_DESC_ON_STACK is currently being used in multiple places. But, yeah, 
> I think we can define multiple macros of the same kind and adjust to the 
> characteristics of each the component.
> 
> How big do you think tfm can get?

I have no idea, I guess you'll have to take that with Herbert.

johannes

^ permalink raw reply

* Re: [PATCH] mac80211: aes-cmac: remove VLA usage
From: Gustavo A. R. Silva @ 2018-03-21 13:57 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <1521640094.2645.29.camel@sipsolutions.net>



On 03/21/2018 08:48 AM, Johannes Berg wrote:
> On Wed, 2018-03-21 at 08:42 -0500, Gustavo A. R. Silva wrote:
>> In preparation to enabling -Wvla, remove VLAs and replace them
>> with dynamic memory allocation instead.
>>
>> The use of stack Variable Length Arrays needs to be avoided, as they
>> can be a vector for stack exhaustion, which can be both a runtime bug
>> or a security flaw. Also, in general, as code evolves it is easy to
>> lose track of how big a VLA can get. Thus, we can end up having runtime
>> failures that are hard to debug.
>>
>> Also, fixed as part of the directive to remove all VLAs from
>> the kernel: https://lkml.org/lkml/2018/3/7/621
>>
>> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
>> ---
>>   net/mac80211/aes_cmac.c | 36 ++++++++++++++++++++++++------------
>>   1 file changed, 24 insertions(+), 12 deletions(-)
>>
>> diff --git a/net/mac80211/aes_cmac.c b/net/mac80211/aes_cmac.c
>> index 2fb6558..c9444bf 100644
>> --- a/net/mac80211/aes_cmac.c
>> +++ b/net/mac80211/aes_cmac.c
>> @@ -27,30 +27,42 @@ static const u8 zero[CMAC_TLEN_256];
>>   void ieee80211_aes_cmac(struct crypto_shash *tfm, const u8 *aad,
>>   			const u8 *data, size_t data_len, u8 *mic)
>>   {
>> -	SHASH_DESC_ON_STACK(desc, tfm);
>> +	struct shash_desc *shash;
>>   	u8 out[AES_BLOCK_SIZE];
>>   
>> -	desc->tfm = tfm;
>> +	shash = kmalloc(sizeof(*shash) + crypto_shash_descsize(tfm),
>> +			GFP_KERNEL);
>> +	if (!shash)
>> +		return;
> 
> Honestly, this seems like a really bad idea - you're now hitting
> kmalloc for every TX/RX frame here.
> 
> SHA_DESC_ON_STACK() should just be fixed to not need a VLA, but take
> some sort of maximum, I guess?
> 

SHA_DESC_ON_STACK is currently being used in multiple places. But, yeah, 
I think we can define multiple macros of the same kind and adjust to the 
characteristics of each the component.

How big do you think tfm can get?

Thanks
--
Gustavo

^ permalink raw reply

* [net-next  3/3] tipc: step sk->sk_drops when rcv buffer is full
From: GhantaKrishnamurthy MohanKrishna @ 2018-03-21 13:37 UTC (permalink / raw)
  To: tipc-discussion, jon.maloy, maloy, ying.xue,
	mohan.krishna.ghanta.krishnamurthy, netdev, davem
  Cc: Parthasarathy Bhuvaragan
In-Reply-To: <1521639465-3169-1-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>

Currently when tipc is unable to queue a received message on a
socket, the message is rejected back to the sender with error
TIPC_ERR_OVERLOAD. However, the application on this socket
has no knowledge about these discards.

In this commit, we try to step the sk_drops counter when tipc
is unable to queue a received message. Export sk_drops
using tipc socket diagnostics.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
---
 include/uapi/linux/tipc_netlink.h | 1 +
 net/tipc/socket.c                 | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/tipc_netlink.h b/include/uapi/linux/tipc_netlink.h
index d7cec0480d70..d896ded51bcb 100644
--- a/include/uapi/linux/tipc_netlink.h
+++ b/include/uapi/linux/tipc_netlink.h
@@ -251,6 +251,7 @@ enum {
 	TIPC_NLA_SOCK_STAT_SENDQ,	/* u32 */
 	TIPC_NLA_SOCK_STAT_LINK_CONG,	/* flag */
 	TIPC_NLA_SOCK_STAT_CONN_CONG,	/* flag */
+	TIPC_NLA_SOCK_STAT_DROP,	/* u32 */
 
 	__TIPC_NLA_SOCK_STAT_MAX,
 	TIPC_NLA_SOCK_STAT_MAX = __TIPC_NLA_SOCK_STAT_MAX - 1
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 31fdd13d444e..a084c78408fb 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2122,8 +2122,10 @@ static void tipc_sk_filter_rcv(struct sock *sk, struct sk_buff *skb,
 		    (!sk_conn && msg_connected(hdr)) ||
 		    (!grp && msg_in_group(hdr)))
 			err = TIPC_ERR_NO_PORT;
-		else if (sk_rmem_alloc_get(sk) + skb->truesize >= limit)
+		else if (sk_rmem_alloc_get(sk) + skb->truesize >= limit) {
+			atomic_inc(&sk->sk_drops);
 			err = TIPC_ERR_OVERLOAD;
+		}
 
 		if (unlikely(err)) {
 			tipc_skb_reject(net, err, skb, xmitq);
@@ -2202,6 +2204,7 @@ static void tipc_sk_enqueue(struct sk_buff_head *inputq, struct sock *sk,
 
 		/* Overload => reject message back to sender */
 		onode = tipc_own_addr(sock_net(sk));
+		atomic_inc(&sk->sk_drops);
 		if (tipc_msg_reverse(onode, &skb, TIPC_ERR_OVERLOAD))
 			__skb_queue_tail(xmitq, skb);
 		break;
@@ -3287,7 +3290,9 @@ int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct tipc_sock *tsk,
 	if (nla_put_u32(skb, TIPC_NLA_SOCK_STAT_RCVQ,
 			skb_queue_len(&sk->sk_receive_queue)) ||
 	    nla_put_u32(skb, TIPC_NLA_SOCK_STAT_SENDQ,
-			skb_queue_len(&sk->sk_write_queue)))
+			skb_queue_len(&sk->sk_write_queue)) ||
+	    nla_put_u32(skb, TIPC_NLA_SOCK_STAT_DROP,
+			atomic_read(&sk->sk_drops)))
 		goto stat_msg_cancel;
 
 	if (tsk->cong_link_cnt &&
-- 
2.1.4

^ permalink raw reply related

* [net-next  1/3] tipc: modify socket iterator for sock_diag
From: GhantaKrishnamurthy MohanKrishna @ 2018-03-21 13:37 UTC (permalink / raw)
  To: tipc-discussion, jon.maloy, maloy, ying.xue,
	mohan.krishna.ghanta.krishnamurthy, netdev, davem
  Cc: Parthasarathy Bhuvaragan
In-Reply-To: <1521639465-3169-1-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>

The current socket iterator function tipc_nl_sk_dump, handles socket
locks and calls __tipc_nl_add_sk for each socket.
To reuse this logic in sock_diag implementation, we do minor
modifications to make these functions generic as described below.

In this commit, we add a two new functions __tipc_nl_sk_walk,
__tipc_nl_add_sk_info and modify tipc_nl_sk_dump, __tipc_nl_add_sk
accordingly.

In __tipc_nl_sk_walk we:
1. acquire and release socket locks
2. for each socket, execute the specified callback function

In __tipc_nl_add_sk we:
- Move the netlink attribute insertion to __tipc_nl_add_sk_info.

tipc_nl_sk_dump calls tipc_nl_sk_walk with __tipc_nl_add_sk as argument.

sock_diag will use these generic functions in a later commit.

There is no functional change in this commit.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
---
 net/tipc/socket.c | 65 +++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 24 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index f93477187a90..8e83b0501d4e 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -3154,16 +3154,33 @@ static int __tipc_nl_add_sk_con(struct sk_buff *skb, struct tipc_sock *tsk)
 	return -EMSGSIZE;
 }
 
+static int __tipc_nl_add_sk_info(struct sk_buff *skb, struct tipc_sock
+			  *tsk)
+{
+	struct net *net = sock_net(skb->sk);
+	struct tipc_net *tn = tipc_net(net);
+	struct sock *sk = &tsk->sk;
+
+	if (nla_put_u32(skb, TIPC_NLA_SOCK_REF, tsk->portid) ||
+	    nla_put_u32(skb, TIPC_NLA_SOCK_ADDR, tn->own_addr))
+		return -EMSGSIZE;
+
+	if (tipc_sk_connected(sk)) {
+		if (__tipc_nl_add_sk_con(skb, tsk))
+			return -EMSGSIZE;
+	} else if (!list_empty(&tsk->publications)) {
+		if (nla_put_flag(skb, TIPC_NLA_SOCK_HAS_PUBL))
+			return -EMSGSIZE;
+	}
+	return 0;
+}
+
 /* Caller should hold socket lock for the passed tipc socket. */
 static int __tipc_nl_add_sk(struct sk_buff *skb, struct netlink_callback *cb,
 			    struct tipc_sock *tsk)
 {
-	int err;
-	void *hdr;
 	struct nlattr *attrs;
-	struct net *net = sock_net(skb->sk);
-	struct tipc_net *tn = net_generic(net, tipc_net_id);
-	struct sock *sk = &tsk->sk;
+	void *hdr;
 
 	hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
 			  &tipc_genl_family, NLM_F_MULTI, TIPC_NL_SOCK_GET);
@@ -3173,19 +3190,10 @@ static int __tipc_nl_add_sk(struct sk_buff *skb, struct netlink_callback *cb,
 	attrs = nla_nest_start(skb, TIPC_NLA_SOCK);
 	if (!attrs)
 		goto genlmsg_cancel;
-	if (nla_put_u32(skb, TIPC_NLA_SOCK_REF, tsk->portid))
-		goto attr_msg_cancel;
-	if (nla_put_u32(skb, TIPC_NLA_SOCK_ADDR, tn->own_addr))
+
+	if (__tipc_nl_add_sk_info(skb, tsk))
 		goto attr_msg_cancel;
 
-	if (tipc_sk_connected(sk)) {
-		err = __tipc_nl_add_sk_con(skb, tsk);
-		if (err)
-			goto attr_msg_cancel;
-	} else if (!list_empty(&tsk->publications)) {
-		if (nla_put_flag(skb, TIPC_NLA_SOCK_HAS_PUBL))
-			goto attr_msg_cancel;
-	}
 	nla_nest_end(skb, attrs);
 	genlmsg_end(skb, hdr);
 
@@ -3199,16 +3207,19 @@ static int __tipc_nl_add_sk(struct sk_buff *skb, struct netlink_callback *cb,
 	return -EMSGSIZE;
 }
 
-int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
+static int __tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
+			     int (*skb_handler)(struct sk_buff *skb,
+						struct netlink_callback *cb,
+						struct tipc_sock *tsk))
 {
-	int err;
-	struct tipc_sock *tsk;
-	const struct bucket_table *tbl;
-	struct rhash_head *pos;
 	struct net *net = sock_net(skb->sk);
-	struct tipc_net *tn = net_generic(net, tipc_net_id);
-	u32 tbl_id = cb->args[0];
+	struct tipc_net *tn = tipc_net(net);
+	const struct bucket_table *tbl;
 	u32 prev_portid = cb->args[1];
+	u32 tbl_id = cb->args[0];
+	struct rhash_head *pos;
+	struct tipc_sock *tsk;
+	int err;
 
 	rcu_read_lock();
 	tbl = rht_dereference_rcu((&tn->sk_rht)->tbl, &tn->sk_rht);
@@ -3220,12 +3231,13 @@ int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
 				continue;
 			}
 
-			err = __tipc_nl_add_sk(skb, cb, tsk);
+			err = skb_handler(skb, cb, tsk);
 			if (err) {
 				prev_portid = tsk->portid;
 				spin_unlock_bh(&tsk->sk.sk_lock.slock);
 				goto out;
 			}
+
 			prev_portid = 0;
 			spin_unlock_bh(&tsk->sk.sk_lock.slock);
 		}
@@ -3238,6 +3250,11 @@ int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	return __tipc_nl_sk_walk(skb, cb, __tipc_nl_add_sk);
+}
+
 /* Caller should hold socket lock for the passed tipc socket. */
 static int __tipc_nl_add_sk_publ(struct sk_buff *skb,
 				 struct netlink_callback *cb,
-- 
2.1.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox