Netdev List
 help / color / mirror / Atom feed
* [PATCH] ip6_vti: simplify stats handling in vti6_xmit
From: Haishuang Yan @ 2018-08-18 14:43 UTC (permalink / raw)
  To: Steffen Klassert, David S. Miller, Alexey Kuznetsov
  Cc: netdev, linux-kernel, Haishuang Yan

Same as ip_vti, use iptunnel_xmit_stats to updates stats in tunnel xmit
code path.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
---
 net/ipv6/ip6_vti.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index c72ae3a..65d4a80 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -503,17 +503,9 @@ static bool vti6_state_check(const struct xfrm_state *x,
 	skb->dev = skb_dst(skb)->dev;
 
 	err = dst_output(t->net, skb->sk, skb);
-	if (net_xmit_eval(err) == 0) {
-		struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);
-
-		u64_stats_update_begin(&tstats->syncp);
-		tstats->tx_bytes += pkt_len;
-		tstats->tx_packets++;
-		u64_stats_update_end(&tstats->syncp);
-	} else {
-		stats->tx_errors++;
-		stats->tx_aborted_errors++;
-	}
+	if (net_xmit_eval(err) == 0)
+		err = pkt_len;
+	iptunnel_xmit_stats(dev, err);
 
 	return 0;
 tx_err_link_failure:
-- 
1.8.3.1

^ permalink raw reply related

* [PATCHv2 1/2] ethernet: declance:  Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  6:32 UTC (permalink / raw)
  To: davem; +Cc: vz, slemieux.tyco, keescook, netdev, linux-kernel
In-Reply-To: <1534573949-17548-1-git-send-email-zhongjiang@huawei.com>

We should use NULL to compare with pointer-typed value rather than
0. The issue is detected with the help of Coccinelle.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
---
 drivers/net/ethernet/amd/declance.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/amd/declance.c b/drivers/net/ethernet/amd/declance.c
index 116997a..c636f02 100644
--- a/drivers/net/ethernet/amd/declance.c
+++ b/drivers/net/ethernet/amd/declance.c
@@ -606,8 +606,7 @@ static int lance_rx(struct net_device *dev)
 		} else {
 			len = (*rds_ptr(rd, mblength, lp->type) & 0xfff) - 4;
 			skb = netdev_alloc_skb(dev, len + 2);
-
-			if (skb == 0) {
+			if (!skb) {
 				dev->stats.rx_dropped++;
 				*rds_ptr(rd, mblength, lp->type) = 0;
 				*rds_ptr(rd, rmd1, lp->type) =
-- 
1.7.12.4

^ permalink raw reply related

* RE: [PATCH net-next v1] net/tls: Add support for async decryption of tls records
From: Vakul Garg @ 2018-08-18  5:55 UTC (permalink / raw)
  To: Dave Watson
  Cc: netdev@vger.kernel.org, borisp@mellanox.com, aviadye@mellanox.com,
	davem@davemloft.net
In-Reply-To: <20180817221238.b4napcwedbwup22q@davejwatson-mba.local.dhcp.thefacebook.com>



> -----Original Message-----
> From: Dave Watson <davejwatson@fb.com>
> Sent: Saturday, August 18, 2018 3:43 AM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: netdev@vger.kernel.org; borisp@mellanox.com;
> aviadye@mellanox.com; davem@davemloft.net
> Subject: Re: [PATCH net-next v1] net/tls: Add support for async decryption of
> tls records
> 
> On 08/16/18 08:49 PM, Vakul Garg wrote:
> > Changes since RFC version:
> > 	1) Improved commit message.
> > 	2) Fixed dequeued record offset handling because of which few of
> > 	   tls selftests 'recv_partial, recv_peek, recv_peek_multiple' were
> failing.
> 
> Thanks! Commit message much more clear, tests work great for me also,
> only minor comments on clarity
> 
> > -			if (tls_sw_advance_skb(sk, skb, chunk)) {
> > +			if (async) {
> > +				/* Finished with current record, pick up next
> */
> > +				ctx->recv_pkt = NULL;
> > +				__strp_unpause(&ctx->strp);
> > +				goto mark_eor_chk_ctrl;
> 
> Control flow is a little hard to follow here, maybe just pass an async flag to
> tls_sw_advance_skb?  It already does strp_unpause and recv_pkt = NULL.
> 

I improved it but in a slightly different way. Please see in v2.
As net-next is closed right now, I would send the patch to you privately &
later post it on list when David gives a green signal.
Is it ok?


> > +			} else if (tls_sw_advance_skb(sk, skb, chunk)) {
> >  				/* Return full control message to
> >  				 * userspace before trying to parse
> >  				 * another message type
> >  				 */
> > +mark_eor_chk_ctrl:
> >  				msg->msg_flags |= MSG_EOR;
> >  				if (control != TLS_RECORD_TYPE_DATA)
> >  					goto recv_end;
> > +			} else {
> > +				break;
> 
> I don't see the need for the else { break; }, isn't this already covered by
> while(len); below as before?
 
When tls_sw_advance_skb() returns false, it is certain that we cannot 
continue in the loop. So putting a break here avoids having to execute
'if' checks and while (len) checks down below.

^ permalink raw reply

* Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
From: Ard Biesheuvel @ 2018-08-18  8:13 UTC (permalink / raw)
  To: D. J. Bernstein
  Cc: Eric Biggers, Jason A. Donenfeld, Eric Biggers,
	Linux Crypto Mailing List, LKML, Netdev, David Miller,
	Andrew Lutomirski, Greg Kroah-Hartman, Samuel Neves, Tanja Lange,
	Jean-Philippe Aumasson, Karthikeyan Bhargavan
In-Reply-To: <20180817073120.12640.qmail@cr.yp.to>



> On 17 Aug 2018, at 10:31, D. J. Bernstein <djb@cr.yp.to> wrote:
> 
> Eric Biggers writes:
>> If (more likely) you're talking about things like "use this NEON implementation
>> on Cortex-A7 but this other NEON implementation on Cortex-A53", it's up the
>> developers and community to test different CPUs and make appropriate decisions,
>> and yes it can be very useful to have external benchmarks like SUPERCOP to refer
>> to, and I appreciate your work in that area.
> 
> You seem to be talking about a process that selects (e.g.) ChaCha20
> implementations as follows: manually inspect benchmarks of various
> implementations on various CPUs, manually write code to map CPUs to
> implementations, manually update the code as necessary for new CPUs, and
> of course manually do the same for every other primitive that can see
> differences between microarchitectures (which isn't something weird---
> it's the normal situation after enough optimization effort).
> 
> This is quite a bit of manual work, so the kernel often doesn't do it,
> so we end up with unhappy people talking about performance regressions.
> 
> For comparison, imagine one simple central piece of code in the kernel
> to automatically do the following:
> 
>   When a CPU core is booted:
>     For each primitive:
>       Benchmark all implementations of the primitive on the core.
>       Select the fastest for subsequent use on the core.
> 
> If this is a general-purpose mechanism (as in SUPERCOP, NaCl, and
> libpqcrypto) rather than something ad-hoc (as in raid6), then there's no
> manual work per primitive, and no work per implementation. Each CPU, old
> or new, automatically obtains the fastest available code for that CPU.
> 
> The only cost is a moment of benchmarking at boot time. _If_ this is a
> noticeable cost then there are many ways to speed it up: for example,
> automatically copy the results across identical cores, automatically
> copy the results across boots if the cores are unchanged, automatically
> copy results from a central database indexed by CPU identifiers, etc.
> The SUPERCOP database is evolving towards enabling this type of sharing.
> 

‘Fastest’ does not imply ‘preferred’. For instance, running the table based cache thrashing generic AES implementation may be fast, but may put a disproportionate load on, e.g., a hyperthreading system, and as you have pointed out yourself, it is time variant as well. Then, there is the power consumption aspect: NEON bit sliced AES may be faster, but does a lot more work, and does it on the SIMD unit which could potentially be turned off entirely otherwise. Only the implementations based on h/w instructions can generally be assumed optimal in all senses, and there is no real point in benchmarking those against pure software implementations.

Then, there is the aspect of accelerators: the kernel’s crypto API seamlessly supports crypto peripherals, which may be slower or faster, have more or fewer queues than the number of CPUs, may offer additional benefits such as protected AES keys etc etc.

In the linux kernel, we generally try to stay away from policy decisions, and offer the controls to allow userland to take charge of this. The modularized crypto code can be blacklisted per algo implementation if desired, and beyond that, we simply try to offer functionality that covers the common case.

>> A lot of code can be shared, but in practice different environments have
>> different constraints, and kernel programming in particular has some distinct
>> differences from userspace programming.  For example, you cannot just use the
>> FPU (including SSE, AVX, NEON, etc.) registers whenever you want to, since on
>> most architectures they can't be used in some contexts such as hardirq context,
>> and even when they *can* be used you have to run special code before and after
>> which does things like saving all the FPU registers to the task_struct,
>> disabling preemption, and/or enabling the FPU.
> 
> Is there some reason that each implementor is being pestered to handle
> all this? Detecting FPU usage is a simple static-analysis exercise, and
> the rest sounds like straightforward boilerplate that should be handled
> centrally.
> 

Detecting it is easy but that does not mean that you can use SIMD in any context, and whether a certain function may ever be called from such a context cannot be decided by static analysis. Also, there are performance and latency concerns which need to be taken into account.

In the kernel, we simply cannot write our algorithm as if our code is the only thing running on the system.

>> But disabling preemption for
>> long periods of time hurts responsiveness, so it's also desirable to yield the
>> processor occasionally, which means that assembly implementations should be
>> incremental rather than having a single entry point that does everything.
> 
> Doing this rewrite automatically is a bit more of a code-analysis
> challenge, but the alternative approach of doing it by hand is insanely
> error-prone. See, e.g., https://eprint.iacr.org/2017/891.
> 
>> Many people may have contributed to SUPERCOP already, but that doesn't mean
>> there aren't things you could do to make it more appealing to contributors and
>> more of a community project,
> 
> The logic in this sentence is impeccable, and is already illustrated by
> many SUPERCOP improvements through the years from an increasing number
> of contributors, as summarized in the 87 release announcements so far on
> the relevant public mailing list, which you're welcome to study in
> detail along with the 400 megabytes of current code and as many previous
> versions as you're interested in. That's also the mailing list where
> people are told to send patches, as you'll see if you RTFM.
> 
>> So Linux distributions may not want to take on the legal risk of
>> distributing it
> 
> This is a puzzling comment. A moment ago we were talking about the
> possibility of useful sharing of (e.g.) ChaCha20 implementations between
> SUPERCOP and the Linux kernel, avoiding pointless fracturing of the
> community's development process for these implementations. This doesn't
> mean that the kernel should be grabbing implementations willy-nilly from
> SUPERCOP---surely the kernel should be doing security audits, and the
> kernel already has various coding requirements, and the kernel requires
> GPL compatibility, while putting any of these requirements into SUPERCOP
> would be counterproductive.
> 
> If you mean having the entire SUPERCOP benchmarking package distributed
> through Linux distributions, I have no idea what your motivation is or
> how this is supposed to be connected to anything else we're discussing.
> Obviously SUPERCOP's broad code-inclusion policies make this idea a
> non-starter.
> 
>> nor may companies want to take on the risk of contributing.
> 
> RTFM. People who submit code are authorizing public redistribution for
> benchmarking. It's up to them to decide if they want to allow more.
> 
> ---Dan

^ permalink raw reply

* Re: [PATCH 0/2] ethernet: Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  6:45 UTC (permalink / raw)
  To: davem; +Cc: vz, slemieux.tyco, keescook, netdev, linux-kernel
In-Reply-To: <1534573773-17358-1-git-send-email-zhongjiang@huawei.com>

ingore the patchset.  should be change the title  from [patch] to [patchv2].
On 2018/8/18 14:29, zhong jiang wrote:
> v1->v2:
>  - According to Vladimir's suggestion. change a common 0 and NULL comparsion form.
>
> zhong jiang (2):
>   ethernet: declance:  Use NULL to compare with pointer-typed value
>     rather than 0
>   ethernet: lpc_eth: Use NULL to compare with pointer-typed value
>     rather than 0
>
>  drivers/net/ethernet/amd/declance.c | 3 +--
>  drivers/net/ethernet/nxp/lpc_eth.c  | 2 +-
>  2 files changed, 2 insertions(+), 3 deletions(-)
>

^ permalink raw reply

* [PATCHv2 2/2] ethernet: lpc_eth: Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  6:32 UTC (permalink / raw)
  To: davem; +Cc: vz, slemieux.tyco, keescook, netdev, linux-kernel
In-Reply-To: <1534573949-17548-1-git-send-email-zhongjiang@huawei.com>

We should use NULL to compare with pointer-typed value rather than 0.
The issue is detected with the help of Coccinelle.

Acked-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
---
 drivers/net/ethernet/nxp/lpc_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index 08381ef..1c41b07 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -1350,7 +1350,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 				"IRAM not big enough for net buffers, using SDRAM instead.\n");
 	}
 
-	if (pldat->dma_buff_base_v == 0) {
+	if (!pldat->dma_buff_base_v) {
 		ret = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
 		if (ret)
 			goto err_out_free_irq;
-- 
1.7.12.4

^ permalink raw reply related

* [PATCHv2 0/2] ethernet: Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  6:32 UTC (permalink / raw)
  To: davem; +Cc: vz, slemieux.tyco, keescook, netdev, linux-kernel

v1->v2:
 - According to Vladimir's suggestion. change a common 0 and NULL comparsion form.

zhong jiang (2):
  ethernet: declance:  Use NULL to compare with pointer-typed value
    rather than 0
  ethernet: lpc_eth: Use NULL to compare with pointer-typed value
    rather than 0

 drivers/net/ethernet/amd/declance.c | 3 +--
 drivers/net/ethernet/nxp/lpc_eth.c  | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

-- 
1.7.12.4

^ permalink raw reply

* [PATCH 2/2] ethernet: lpc_eth: Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  6:29 UTC (permalink / raw)
  To: davem; +Cc: vz, slemieux.tyco, keescook, netdev, linux-kernel
In-Reply-To: <1534573773-17358-1-git-send-email-zhongjiang@huawei.com>

We should use NULL to compare with pointer-typed value rather than 0.
The issue is detected with the help of Coccinelle.

Acked-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
---
 drivers/net/ethernet/nxp/lpc_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index 08381ef..1c41b07 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -1350,7 +1350,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 				"IRAM not big enough for net buffers, using SDRAM instead.\n");
 	}
 
-	if (pldat->dma_buff_base_v == 0) {
+	if (!pldat->dma_buff_base_v) {
 		ret = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
 		if (ret)
 			goto err_out_free_irq;
-- 
1.7.12.4

^ permalink raw reply related

* [PATCH 1/2] ethernet: declance:  Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  6:29 UTC (permalink / raw)
  To: davem; +Cc: vz, slemieux.tyco, keescook, netdev, linux-kernel
In-Reply-To: <1534573773-17358-1-git-send-email-zhongjiang@huawei.com>

We should use NULL to compare with pointer-typed value rather than
0. The issue is detected with the help of Coccinelle.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
---
 drivers/net/ethernet/amd/declance.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/amd/declance.c b/drivers/net/ethernet/amd/declance.c
index 116997a..c636f02 100644
--- a/drivers/net/ethernet/amd/declance.c
+++ b/drivers/net/ethernet/amd/declance.c
@@ -606,8 +606,7 @@ static int lance_rx(struct net_device *dev)
 		} else {
 			len = (*rds_ptr(rd, mblength, lp->type) & 0xfff) - 4;
 			skb = netdev_alloc_skb(dev, len + 2);
-
-			if (skb == 0) {
+			if (!skb) {
 				dev->stats.rx_dropped++;
 				*rds_ptr(rd, mblength, lp->type) = 0;
 				*rds_ptr(rd, rmd1, lp->type) =
-- 
1.7.12.4

^ permalink raw reply related

* [PATCH 0/2] ethernet: Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  6:29 UTC (permalink / raw)
  To: davem; +Cc: vz, slemieux.tyco, keescook, netdev, linux-kernel

v1->v2:
 - According to Vladimir's suggestion. change a common 0 and NULL comparsion form.

zhong jiang (2):
  ethernet: declance:  Use NULL to compare with pointer-typed value
    rather than 0
  ethernet: lpc_eth: Use NULL to compare with pointer-typed value
    rather than 0

 drivers/net/ethernet/amd/declance.c | 3 +--
 drivers/net/ethernet/nxp/lpc_eth.c  | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

-- 
1.7.12.4

^ permalink raw reply

* Re: [PATCH 2/2] ethernet: lpc_eth: Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  5:59 UTC (permalink / raw)
  To: Vladimir Zapolskiy; +Cc: davem, slemieux.tyco, keescook, netdev, linux-kernel
In-Reply-To: <edc2f14f-8694-3160-47b1-ccb5be8a3484@mleia.com>

On 2018/8/17 23:29, Vladimir Zapolskiy wrote:
> Hi zhong jiang,
>
> On 08/17/2018 04:18 PM, zhong jiang wrote:
>> We should use NULL to compare with pointer-typed value rather than 0.
>> The issue is detected with the help of Coccinelle.
>> ---
>>  drivers/net/ethernet/nxp/lpc_eth.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
>> index 08381ef..04d9e62 100644
>> --- a/drivers/net/ethernet/nxp/lpc_eth.c
>> +++ b/drivers/net/ethernet/nxp/lpc_eth.c
>> @@ -1350,7 +1350,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
>>  				"IRAM not big enough for net buffers, using SDRAM instead.\n");
>>  	}
>>  
>> -	if (pldat->dma_buff_base_v == 0) {
>> +	if (pldat->dma_buff_base_v == NULL) {
> That's a valid finding, but please use a common 0 and NULL comparison in form of
>
> 	if (!pldat->dma_buff_base_v)
>
> To the change above please feel free to add my
>
> Acked-by: Vladimir Zapolskiy <vz@mleia.com>
 Thanks, Will do in v2
>>  		ret = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
>>  		if (ret)
>>  			goto err_out_free_irq;
>>
>
> .
>

^ permalink raw reply

* [PATCH] fm10k_main.c: fix missing return value check of alloc_skb()
From: Jiecheng Wu @ 2018-08-18  2:27 UTC (permalink / raw)
  To: netdev

Function fm10k_init_module() defined in drivers/net/ethernet/intel/fm10k/fm10k_main.c
 calls alloc_workqueue() to allocate memory for struct workqueue_struct which is 
 dereferenced immediately. As alloc_workqueue() may return NULL on failure, 
 this code piece may cause NULL pointer dereference bug.
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 3f53654..78a43d6 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -41,6 +41,8 @@ static int __init fm10k_init_module(void)
 	/* create driver workqueue */
 	fm10k_workqueue = alloc_workqueue("%s", WQ_MEM_RECLAIM, 0,
 					  fm10k_driver_name);
+	if (!fm10k_workqueue)
+		return -ENOMEM;
 
 	fm10k_dbg_init();
 
-- 
2.6.4

^ permalink raw reply related

* Re: [PATCH] fm10k_main.c: fix missing return value check of alloc_skb()
From: Andrew Lunn @ 2018-08-18  2:14 UTC (permalink / raw)
  To: Jiecheng Wu; +Cc: netdev
In-Reply-To: <20180818020058.14652-1-jasonwood2031@gmail.com>

On Sat, Aug 18, 2018 at 10:00:58AM +0800, Jiecheng Wu wrote:
> Function fm10k_init_module() defined in drivers/net/ethernet/intel/fm10k/fm10k_main.c calls alloc_workqueue() to allocate memory for struct workqueue_struct which is dereferenced immediately. As alloc_workqueue() may return NULL on failure, this code piece may cause NULL pointer dereference bug.

Hi Jiecheng

Please wrap your commit message to around 80 character wide.

You also need to add a Signed-off-by to your patch. Please see 

https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html#developer-s-certificate-of-origin-1-1

scripts/checkpatch.pl will help you get these things rights.

	Andrew

^ permalink raw reply

* [PATCH] fm10k_main.c: fix missing return value check of alloc_skb()
From: Jiecheng Wu @ 2018-08-18  2:00 UTC (permalink / raw)
  To: netdev

Function fm10k_init_module() defined in drivers/net/ethernet/intel/fm10k/fm10k_main.c calls alloc_workqueue() to allocate memory for struct workqueue_struct which is dereferenced immediately. As alloc_workqueue() may return NULL on failure, this code piece may cause NULL pointer dereference bug.
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 3f53654..78a43d6 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -41,6 +41,8 @@ static int __init fm10k_init_module(void)
 	/* create driver workqueue */
 	fm10k_workqueue = alloc_workqueue("%s", WQ_MEM_RECLAIM, 0,
 					  fm10k_driver_name);
+	if (!fm10k_workqueue)
+		return -ENOMEM;
 
 	fm10k_dbg_init();
 
-- 
2.6.4

^ permalink raw reply related

* Re: Bug in FIB insert
From: David Ahern @ 2018-08-18  1:21 UTC (permalink / raw)
  To: Md. Islam, kuznet, David Miller, Netdev, Stephen Hemminger,
	Jesper Dangaard Brouer
In-Reply-To: <CAFgPn1D2MuaX0Y-_CTno7jxcaHdM2w2cR0Vbgdjr8RVgRc=YvA@mail.gmail.com>

On 8/16/18 6:59 PM, Md. Islam wrote:
> There is a bug in fib_table_insert(). If I add following routes,
> 
> 23.20.0.0/14 <http://23.20.0.0/14>    veth1
> 23.20.0.0/15 <http://23.20.0.0/15>    veth2
> 
> FIB lookup on 23.22.111.212  results veth1, not veth2.
> 

veth1 is the correct lookup response. '22' toggles the 15th bit which
means the compare to 23.20/15 fails.

^ permalink raw reply

* pull-request: bpf 2018-08-18
From: Daniel Borkmann @ 2018-08-17 23:29 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a BPF selftest failure in test_cgroup_storage due to rlimit
   restrictions, from Yonghong.

2) Fix a suspicious RCU rcu_dereference_check() warning triggered
   from removing a device's XDP memory allocator by using the correct
   rhashtable lookup function, from Tariq.

3) A batch of BPF sockmap and ULP fixes mainly fixing leaks and races
   as well as enforcing module aliases for ULPs. Another fix for BPF
   map redirect to make them work again with tail calls, from Daniel.

4) Fix XDP BPF samples to unload their programs upon SIGTERM, from Jesper.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 9a76aba02a37718242d7cdc294f0a3901928aa57:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next (2018-08-15 15:04:25 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to f6069b9aa9934ede26f41ac0781fce241279ad43:

  bpf: fix redirect to map under tail calls (2018-08-17 15:56:23 -0700)

----------------------------------------------------------------
Alexei Starovoitov (1):
      Merge branch 'sockmap-ulp-fixes'

Daniel Borkmann (6):
      tcp, ulp: add alias for all ulp modules
      tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach
      bpf, sockmap: fix leakage of smap_psock_map_entry
      bpf, sockmap: fix map elem deletion race with smap_stop_sock
      bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist
      bpf: fix redirect to map under tail calls

Jesper Dangaard Brouer (1):
      samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM

Tariq Toukan (1):
      net/xdp: Fix suspicious RCU usage warning

Yonghong Song (2):
      bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
      tools/bpf: fix bpf selftest test_cgroup_storage failure

 include/linux/filter.h                            |   3 +-
 include/net/tcp.h                                 |   4 +
 include/trace/events/xdp.h                        |   5 +-
 kernel/bpf/core.c                                 |   2 +-
 kernel/bpf/cpumap.c                               |   2 +
 kernel/bpf/devmap.c                               |   1 +
 kernel/bpf/sockmap.c                              | 120 ++++++++++++----------
 kernel/bpf/verifier.c                             |  21 ----
 kernel/bpf/xskmap.c                               |   1 +
 net/core/filter.c                                 |  68 ++++++------
 net/core/xdp.c                                    |  14 +--
 net/ipv4/tcp_ulp.c                                |   4 +-
 net/tls/tls_main.c                                |   1 +
 samples/bpf/xdp_redirect_cpu_user.c               |   3 +-
 samples/bpf/xdp_rxq_info_user.c                   |   3 +-
 tools/testing/selftests/bpf/test_cgroup_storage.c |   1 +
 16 files changed, 123 insertions(+), 130 deletions(-)

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH next-queue 0/8] ixgbe/ixgbevf: IPsec offload support for VFs
From: Alexander Duyck @ 2018-08-17 23:27 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: intel-wired-lan, Jeff Kirsher, Steffen Klassert, Netdev
In-Reply-To: <84477537-900a-fb3b-50fa-8142bba76869@oracle.com>

On Fri, Aug 17, 2018 at 4:19 PM Shannon Nelson
<shannon.nelson@oracle.com> wrote:
>
>
> On 8/16/2018 2:36 PM, Shannon Nelson wrote:
> > On 8/16/2018 2:15 PM, Alexander Duyck wrote:
> >> On Tue, Aug 14, 2018 at 10:10 AM Shannon Nelson
> >> <shannon.nelson@oracle.com> wrote:
> >>>
> >>> On 8/14/2018 8:30 AM, Alexander Duyck wrote:
> >>>> On Mon, Aug 13, 2018 at 11:43 AM Shannon Nelson
> >>>> <shannon.nelson@oracle.com> wrote:
> >>>>>
> >>>>> This set of patches implements IPsec hardware offload for VF
> >>>>> devices in
> >>>>> Intel's 10Gbe x540 family of Ethernet devices.
> >>>
> >>> [...]
> >>>
> >>>>
> >>>> So the one question I would have about this patch set is what happens
> >>>> if you are setting up a ipsec connection between the PF and one of the
> >>>> VFs on the same port/function? Do the ipsec offloads get translated
> >>>> across the Tx loopback or do they end up causing issues? Specifically
> >>>> I would be interested in seeing the results of a test either between
> >>>> two VFs, or the PF and one of the VFs on the same port.
> >>>>
> >>>> - Alex
> >>>>
> >>>
> >>> There is definitely something funky in the internal switch connection,
> >>> as messages going from PF to VF with an offloaded encryption don't seem
> >>> to get received by the VF, at least when in a VEB setup.  If I only set
> >>> up offloads on the Rx on both PF and VF, and don't offload the Tx, then
> >>> things work.
> >>>
> >>> I don't have a setup to test this, but I suspect that in a VEPA
> >>> configuration, with packets going out to the switch and turned around
> >>> back in, the Tx encryption offload would happen as expected.
> >>>
> >>> sln
> >>
> >> We should probably look at adding at least one patch to the set then
> >> that disables IPsec Tx offload if SR-IOV is enabled with VEB so that
> >> we don't end up breaking connections should a VF be migrated from a
> >> remote system to a local one that it is connected to.
> >>
> >> - Alex
> >>
> >
> > The problem with this is that someone could set up an IPsec connection
> > on the PF for Tx and Rx use, then set num_vfs, start some VFs, and we
> > still can end up in the same place.  I don't think we want to disallow
> > all Tx IPsec offload.
> >
> > Maybe we can catch it in ixgbe_ipsec_offload_ok()?  If it can find that
> > the dest mac is on the internal switch, perhaps it can NAK the Tx
> > offload?  That would force the XFRM xmit code to do a regular SW encrypt
> > before sending the packet.  I'll look into this.
> >
> > sln
>
> This would be a great idea, but the xdo_state_offload_ok() callback
> happens in the network stack before routing has happened, so there is no
> mac address yet in the skb.  We may be stuck with NAKing *all* Tx
> offloads when num_vfs != 0.  It works, and it is better than no offload
> at all, but it sure harshes the vibe.  Blech.
>
> sln

You can probably just think of the Tx offload as being lumped in with
all the other offloads that don't work when SR-IOV is enabled such as
ATR and RSC.

- Alex

^ permalink raw reply

* Re: [PATCH] mt76: Enable NL80211_EXT_FEATURE_CQM_RSSI_LIST
From: Andrew Zaborowski @ 2018-08-18  2:28 UTC (permalink / raw)
  To: Kristian Evensen
  Cc: lorenzo.bianconi-H+wXaHxf7aLQT0dZR+AlfA, Arend van Spriel,
	Kalle Valo, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	Network Development
In-Reply-To: <CAKfDRXia3zwW9Q=7RZGqK_DWCsyQzG4KEZHsyHzGcNb70Ge1kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi,

On 13 August 2018 at 16:25, Kristian Evensen <kristian.evensen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Aug 13, 2018 at 12:55 PM Lorenzo Bianconi
> <lorenzo.bianconi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> According to my understanding (please correct me if I am wrong)
>> BSS_CHANGED_CQM is only needed if CQM_RSSI is handled
>> by the driver/fw, while if it is not set mac80211 will take care of that
>> in ieee80211_handle_beacon_sig routine.
>> I am AFK at the moment, I will test that patch when I am back from vacations.
>
> That matches my understanding as well (base on for example the message
> for commit ae44b502669d0cd1f167cdb48994292aa20fd3dd).

This is right, mt7601u had this flag added in that commit but the mt76
drivers were added later.  It seems mt76_rx_convert always sets the
ieee80211_rx_status.signal field so mac80211 can check the rssi values
and call any CQM callbacks if needed.

Best regards

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH next-queue 0/8] ixgbe/ixgbevf: IPsec offload support for VFs
From: Shannon Nelson @ 2018-08-17 23:19 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: intel-wired-lan, Jeff Kirsher, Steffen Klassert, Netdev
In-Reply-To: <7796a29d-625a-4728-defd-cda73afc83ba@oracle.com>


On 8/16/2018 2:36 PM, Shannon Nelson wrote:
> On 8/16/2018 2:15 PM, Alexander Duyck wrote:
>> On Tue, Aug 14, 2018 at 10:10 AM Shannon Nelson
>> <shannon.nelson@oracle.com> wrote:
>>>
>>> On 8/14/2018 8:30 AM, Alexander Duyck wrote:
>>>> On Mon, Aug 13, 2018 at 11:43 AM Shannon Nelson
>>>> <shannon.nelson@oracle.com> wrote:
>>>>>
>>>>> This set of patches implements IPsec hardware offload for VF 
>>>>> devices in
>>>>> Intel's 10Gbe x540 family of Ethernet devices.
>>>
>>> [...]
>>>
>>>>
>>>> So the one question I would have about this patch set is what happens
>>>> if you are setting up a ipsec connection between the PF and one of the
>>>> VFs on the same port/function? Do the ipsec offloads get translated
>>>> across the Tx loopback or do they end up causing issues? Specifically
>>>> I would be interested in seeing the results of a test either between
>>>> two VFs, or the PF and one of the VFs on the same port.
>>>>
>>>> - Alex
>>>>
>>>
>>> There is definitely something funky in the internal switch connection,
>>> as messages going from PF to VF with an offloaded encryption don't seem
>>> to get received by the VF, at least when in a VEB setup.  If I only set
>>> up offloads on the Rx on both PF and VF, and don't offload the Tx, then
>>> things work.
>>>
>>> I don't have a setup to test this, but I suspect that in a VEPA
>>> configuration, with packets going out to the switch and turned around
>>> back in, the Tx encryption offload would happen as expected.
>>>
>>> sln
>>
>> We should probably look at adding at least one patch to the set then
>> that disables IPsec Tx offload if SR-IOV is enabled with VEB so that
>> we don't end up breaking connections should a VF be migrated from a
>> remote system to a local one that it is connected to.
>>
>> - Alex
>>
> 
> The problem with this is that someone could set up an IPsec connection 
> on the PF for Tx and Rx use, then set num_vfs, start some VFs, and we 
> still can end up in the same place.  I don't think we want to disallow 
> all Tx IPsec offload.
> 
> Maybe we can catch it in ixgbe_ipsec_offload_ok()?  If it can find that 
> the dest mac is on the internal switch, perhaps it can NAK the Tx 
> offload?  That would force the XFRM xmit code to do a regular SW encrypt 
> before sending the packet.  I'll look into this.
> 
> sln

This would be a great idea, but the xdo_state_offload_ok() callback 
happens in the network stack before routing has happened, so there is no 
mac address yet in the skb.  We may be stuck with NAKing *all* Tx 
offloads when num_vfs != 0.  It works, and it is better than no offload 
at all, but it sure harshes the vibe.  Blech.

sln

^ permalink raw reply

* [PATCH] RDMA/smc: Replace ib_query_gid with rdma_get_gid_attr
From: Jason Gunthorpe @ 2018-08-18  2:17 UTC (permalink / raw)
  To: Ursula Braun; +Cc: linux-rdma, Parav Pandit, netdev, linux-kernel

All RDMA ULPs should be using rdma_get_gid_attr instead of
ib_query_gid. Convert SMC to use the new API.

In the process correct some confusion with gid_type - if attr->ndev is
!NULL then gid_type can never be IB_GID_TYPE_IB by
definition. IB_GID_TYPE_ROCE shares the same enum value and is probably
what was intended here.

Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/rdma/ib_cache.h | 24 ---------------------
 net/smc/smc_ib.c        | 48 +++++++++++++++++++++--------------------
 2 files changed, 25 insertions(+), 47 deletions(-)

This is intended to go this merge window (ie in the next 4 days) to
remove the wrong-headed __deprecated warning. If there is no objection
I'll send it via the rdma tree.

Thanks,
Jason

diff --git a/include/rdma/ib_cache.h b/include/rdma/ib_cache.h
index 3e11e7cc60b745..62e990b620aaf7 100644
--- a/include/rdma/ib_cache.h
+++ b/include/rdma/ib_cache.h
@@ -133,28 +133,4 @@ const struct ib_gid_attr *rdma_get_gid_attr(struct ib_device *device,
 void rdma_put_gid_attr(const struct ib_gid_attr *attr);
 void rdma_hold_gid_attr(const struct ib_gid_attr *attr);
 
-/*
- * This is to be removed. It only exists to make merging rdma and smc simpler.
- */
-static inline __deprecated int ib_query_gid(struct ib_device *device,
-					    u8 port_num, int index,
-					    union ib_gid *gid,
-					    struct ib_gid_attr *attr_out)
-{
-	const struct ib_gid_attr *attr;
-
-	memset(attr_out, 0, sizeof(*attr_out));
-	attr = rdma_get_gid_attr(device, port_num, index);
-	if (IS_ERR(attr))
-		return PTR_ERR(attr);
-
-	if (attr->ndev)
-		dev_hold(attr->ndev);
-	*attr_out = *attr;
-
-	rdma_put_gid_attr(attr);
-
-	return 0;
-}
-
 #endif /* _IB_CACHE_H */
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 9bb5274a244e83..e519ef29c0ffcb 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -145,17 +145,21 @@ int smc_ib_ready_link(struct smc_link *lnk)
 
 static int smc_ib_fill_mac(struct smc_ib_device *smcibdev, u8 ibport)
 {
-	struct ib_gid_attr gattr;
-	union ib_gid gid;
-	int rc;
+	const struct ib_gid_attr *attr;
+	int rc = 0;
 
-	rc = ib_query_gid(smcibdev->ibdev, ibport, 0, &gid, &gattr);
-	if (rc || !gattr.ndev)
+	attr = rdma_get_gid_attr(smcibdev->ibdev, ibport, 0);
+	if (IS_ERR(attr))
 		return -ENODEV;
 
-	memcpy(smcibdev->mac[ibport - 1], gattr.ndev->dev_addr, ETH_ALEN);
-	dev_put(gattr.ndev);
-	return 0;
+	if (attr->ndev)
+		memcpy(smcibdev->mac[ibport - 1], attr->ndev->dev_addr,
+		       ETH_ALEN);
+	else
+		rc = -ENODEV;
+
+	rdma_put_gid_attr(attr);
+	return rc;
 }
 
 /* Create an identifier unique for this instance of SMC-R.
@@ -180,29 +184,27 @@ bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport)
 int smc_ib_determine_gid(struct smc_ib_device *smcibdev, u8 ibport,
 			 unsigned short vlan_id, u8 gid[], u8 *sgid_index)
 {
-	struct ib_gid_attr gattr;
-	union ib_gid _gid;
+	const struct ib_gid_attr *attr;
 	int i;
 
 	for (i = 0; i < smcibdev->pattr[ibport - 1].gid_tbl_len; i++) {
-		memset(&_gid, 0, SMC_GID_SIZE);
-		memset(&gattr, 0, sizeof(gattr));
-		if (ib_query_gid(smcibdev->ibdev, ibport, i, &_gid, &gattr))
+		attr = rdma_get_gid_attr(smcibdev->ibdev, ibport, i);
+		if (IS_ERR(attr))
 			continue;
-		if (!gattr.ndev)
-			continue;
-		if (((!vlan_id && !is_vlan_dev(gattr.ndev)) ||
-		     (vlan_id && is_vlan_dev(gattr.ndev) &&
-		      vlan_dev_vlan_id(gattr.ndev) == vlan_id)) &&
-		    gattr.gid_type == IB_GID_TYPE_IB) {
+
+		if (attr->ndev &&
+		    ((!vlan_id && !is_vlan_dev(attr->ndev)) ||
+		     (vlan_id && is_vlan_dev(attr->ndev) &&
+		      vlan_dev_vlan_id(attr->ndev) == vlan_id)) &&
+		    attr->gid_type == IB_GID_TYPE_ROCE) {
 			if (gid)
-				memcpy(gid, &_gid, SMC_GID_SIZE);
+				memcpy(gid, &attr->gid, SMC_GID_SIZE);
 			if (sgid_index)
-				*sgid_index = i;
-			dev_put(gattr.ndev);
+				*sgid_index = attr->index;
+			rdma_put_gid_attr(attr);
 			return 0;
 		}
-		dev_put(gattr.ndev);
+		rdma_put_gid_attr(attr);
 	}
 	return -ENODEV;
 }
-- 
2.18.0

^ permalink raw reply related

* [RFC v3 net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

Add a sample program that shows how socksg program is used and attached
to socket filter. The kernel sample program deals with struct
scatterlist that is passed as bpf context.

When run in server mode, the sample RDS program opens PF_RDS socket,
attaches eBPF program to RDS socket which then uses bpf_msg_pull_data
helper to inspect packet data contained in struct scatterlist and
returns appropriate action code back to kernel.

To ease testing, RDS client functionality is also added so that users
can generate RDS packet.

Server:
[root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71

Client:
[root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
transport tcp
client bound to address: 192.168.3.70 port 25278
client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...

Server output:
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
server listening on 192.168.3.71

[root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
          <idle>-0     [038] ..s.   146.947362: 0: 30 31 32
          <idle>-0     [038] ..s.   146.947364: 0: 33 34 35

Similarly specifying '-t ib' will run this on IB link.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 samples/bpf/Makefile          |   3 +
 samples/bpf/rds_filter_kern.c |  42 ++++++
 samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 384 insertions(+)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 36f9f41..dbb30d0 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
+hostprogs-y += rds_filter
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
 xdp_fwd-objs := xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+rds_filter-objs := bpf_load.o rds_filter_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -166,6 +168,7 @@ always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
 always += xdp_sample_pkts_kern.o
+always += rds_filter_kern.o
 
 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_filter_kern.c b/samples/bpf/rds_filter_kern.c
new file mode 100644
index 0000000..633e687
--- /dev/null
+++ b/samples/bpf/rds_filter_kern.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/rds.h>
+#include "bpf_helpers.h"
+
+#define bpf_printk(fmt, ...)				\
+({							\
+	char ____fmt[] = fmt;				\
+	bpf_trace_printk(____fmt, sizeof(____fmt),	\
+			##__VA_ARGS__);			\
+})
+
+SEC("socksg")
+int main_prog(struct sk_msg_md *msg)
+{
+	int start, end, err;
+	unsigned char *d;
+
+	start = 0;
+	end = 6;
+
+	err = bpf_msg_pull_data(msg, start, end, 0);
+	if (err) {
+		bpf_printk("socksg: pull_data err %i\n", err);
+		return SOCKSG_PASS;
+	}
+
+	if (msg->data + 6 > msg->data_end)
+		return SOCKSG_PASS;
+
+	d = (unsigned char *)msg->data;
+	bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+	bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+
+	return SOCKSG_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/rds_filter_user.c b/samples/bpf/rds_filter_user.c
new file mode 100644
index 0000000..1186345
--- /dev/null
+++ b/samples/bpf/rds_filter_user.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <assert.h>
+#include "bpf_load.h"
+#include <getopt.h>
+#include <errno.h>
+#include <netinet/in.h>
+#include <limits.h>
+#include <linux/sockios.h>
+#include <linux/rds.h>
+#include <linux/errqueue.h>
+#include <linux/bpf.h>
+#include <strings.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define TESTPORT	4000
+#define BUFSIZE		8192
+
+int transport = -1;
+
+static int str2trans(const char *trans)
+{
+	if (strcmp(trans, "tcp") == 0)
+		return RDS_TRANS_TCP;
+	if (strcmp(trans, "ib") == 0)
+		return RDS_TRANS_IB;
+	return (RDS_TRANS_NONE);
+}
+
+static const char *trans2str(int trans)
+{
+	switch (trans) {
+	case RDS_TRANS_TCP:
+		return ("tcp");
+	case RDS_TRANS_IB:
+		return ("ib");
+	case RDS_TRANS_NONE:
+		return ("none");
+	default:
+		return ("unknown");
+	}
+}
+
+static int gettransport(int sock)
+{
+	int err;
+	char val;
+	socklen_t len = sizeof(int);
+
+	err = getsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&val, &len);
+	if (err < 0) {
+		fprintf(stderr, "%s: getsockopt %s\n",
+			__func__, strerror(errno));
+		return err;
+	}
+	return (int)val;
+}
+
+static int settransport(int sock, int transport)
+{
+	int err;
+
+	err = setsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&transport, sizeof(transport));
+	if (err < 0) {
+		fprintf(stderr, "could not set transport %s, %s\n",
+			trans2str(transport), strerror(errno));
+	}
+	return err;
+}
+
+static void print_sock_local_info(int fd, char *str, struct sockaddr_in *ret)
+{
+	socklen_t sin_size = sizeof(struct sockaddr_in);
+	struct sockaddr_in sin;
+	int err;
+
+	err = getsockname(fd, (struct sockaddr *)&sin, &sin_size);
+	if (err < 0) {
+		fprintf(stderr, "%s getsockname %s\n",
+			__func__, strerror(errno));
+		return;
+	}
+	printf("%s address: %s port %d\n",
+		(str ? str : ""), inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
+
+	if (ret != NULL)
+		*ret = sin;
+}
+
+static void print_payload(char *buf)
+{
+	int i;
+
+	printf("payload contains:");
+	for (i = 0; i < 10; i++)
+		printf("%x ", buf[i]);
+	printf("...\n");
+}
+
+static void server(char *address, in_port_t port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	struct iovec *iov;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(address);
+	sin.sin_port = htons(port);
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	/* attach bpf prog */
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+			  sizeof(prog_fd[0])) == 0);
+
+	print_sock_local_info(sock, "server bound to", NULL);
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	while (1) {
+		memset(buf, 0, BUFSIZE);
+		iov[0].iov_base = buf;
+		iov[0].iov_len = BUFSIZE;
+
+		memset(&msg, 0, sizeof(msg));
+		msg.msg_name = &din;
+		msg.msg_namelen = sizeof(din);
+		msg.msg_iov = iov;
+		msg.msg_iovlen = 1;
+
+		printf("server listening on %s\n", inet_ntoa(sin.sin_addr));
+
+		rc = recvmsg(sock, &msg, 0);
+		if (rc < 0) {
+			fprintf(stderr, "%s: recvmsg %s\n",
+				__func__, strerror(errno));
+			break;
+		}
+
+		printf("%s received a packet from %s of len %d cmsg len %d, on port %d\n",
+			inet_ntoa(sin.sin_addr),
+			inet_ntoa(din.sin_addr),
+			(uint32_t) iov[0].iov_len,
+			(uint32_t) msg.msg_controllen,
+			ntohs(din.sin_port));
+
+		print_payload(buf);
+	}
+	free(iov);
+out:
+	free(buf);
+}
+
+static void create_message(char *buf)
+{
+	unsigned int i;
+
+	for (i = 0; i < BUFSIZE; i++) {
+		buf[i] = i + 0x30;
+	}
+}
+
+static int build_rds_packet(struct msghdr *msg, char *buf)
+{
+	struct iovec *iov;
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return -1;
+	}
+
+	msg->msg_iov = iov;
+	msg->msg_iovlen = 1;
+
+	iov[0].iov_base = buf;
+	iov[0].iov_len = BUFSIZE * sizeof(char);
+
+	return 0;
+}
+
+static void client(char *localaddr, char *remoteaddr, in_port_t server_port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	create_message(buf);
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(localaddr);
+	sin.sin_port = 0;
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	print_sock_local_info(sock, "client bound to",  &sin);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.msg_name = &din;
+	msg.msg_namelen = sizeof(din);
+
+	memset(&din, 0, sizeof(din));
+	din.sin_family = AF_INET;
+	din.sin_addr.s_addr = inet_addr(remoteaddr);
+	din.sin_port = htons(server_port);
+
+	rc = build_rds_packet(&msg, buf);
+	if (rc < 0)
+		goto out;
+
+	printf("client sending %d byte message from %s to %s on port %d\n",
+		(uint32_t) msg.msg_iov->iov_len, localaddr,
+		remoteaddr, ntohs(sin.sin_port));
+
+	rc = sendmsg(sock, &msg, 0);
+	if (rc < 0)
+		fprintf(stderr, "%s: sendmsg %s\n", __func__, strerror(errno));
+
+	print_payload(buf);
+
+	if (msg.msg_control)
+		free(msg.msg_control);
+	if (msg.msg_iov)
+		free(msg.msg_iov);
+out:
+	free(buf);
+
+	return;
+}
+
+static void usage(char *progname)
+{
+	fprintf(stderr, "Usage %s [-s srvaddr] [-c clientaddr] [-t transport]"
+		"\n", progname);
+}
+
+int main(int argc, char **argv)
+{
+	in_port_t server_port = TESTPORT;
+	char *serveraddr = NULL;
+	char *clientaddr = NULL;
+	char filename[256];
+	int opt;
+
+	while ((opt = getopt(argc, argv, "s:c:t:")) != -1) {
+		switch (opt) {
+		case 's':
+			serveraddr = optarg;
+			break;
+		case 'c':
+			clientaddr = optarg;
+			break;
+		case 't':
+			transport = str2trans(optarg);
+			if (transport == RDS_TRANS_NONE) {
+				fprintf(stderr,
+					"unknown transport %s\n", optarg);
+					usage(argv[0]);
+					return (-1);
+			}
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		fprintf(stderr, "Error: load_bpf_file %s", bpf_log_buf);
+		return 1;
+	}
+
+	if (serveraddr && !clientaddr) {
+		printf("running server in a loop\n");
+		server(serveraddr, server_port);
+	} else if (serveraddr && clientaddr) {
+		client(clientaddr, serveraddr, server_port);
+	}
+
+	return 0;
+}
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v3 net-next 4/5] rds: invoke socket sg filter attached to rds socket
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages
arrive in form of skb (over TCP) and scatterlist (over IB/RDMA).
However, because socket filter only deal with skb (e.g. struct skb as
bpf context) we can only use socket filter for rds_tcp and not for
rds_rdma.

Considering one filtering solution for RDS, it seems that the common
denominator between sk_buff and scatterlist is scatterlist. Therefore,
this patch converts skb to sgvec and invoke sg_filter_run for
rds_tcp and simply invoke sg_filter_run for IB/rds_rdma.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/ib.c       |  1 +
 net/rds/ib.h       |  1 +
 net/rds/ib_recv.c  | 12 ++++++++++++
 net/rds/rds.h      |  2 ++
 net/rds/recv.c     | 17 +++++++++++++++++
 net/rds/tcp.c      |  2 ++
 net/rds/tcp.h      |  2 ++
 net/rds/tcp_recv.c | 38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 75 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 89c6333..6ba1f75 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -532,6 +532,7 @@ struct rds_transport rds_ib_transport = {
 	.conn_path_shutdown	= rds_ib_conn_path_shutdown,
 	.inc_copy_to_user	= rds_ib_inc_copy_to_user,
 	.inc_free		= rds_ib_inc_free,
+	.inc_to_sg_get		= rds_ib_inc_to_sg_get,
 	.cm_initiate_connect	= rds_ib_cm_initiate_connect,
 	.cm_handle_connect	= rds_ib_cm_handle_connect,
 	.cm_connect_complete	= rds_ib_cm_connect_complete,
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 73427ff..0a12b41 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -404,6 +404,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev,
 void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
 void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
 			     struct rds_ib_ack_state *state);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index d300186..2f76a91 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -219,6 +219,18 @@ void rds_ib_inc_free(struct rds_incoming *inc)
 	rds_ib_recv_cache_put(&ibinc->ii_cache_entry, &ic->i_cache_incs);
 }
 
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	*sg =  &frag->f_sg;
+
+	return 0;
+}
+
 static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
 				  struct rds_ib_recv_work *recv)
 {
diff --git a/net/rds/rds.h b/net/rds/rds.h
index c4dcf65..abcd5ce 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -542,6 +542,8 @@ struct rds_transport {
 	int (*recv_path)(struct rds_conn_path *cp);
 	int (*inc_copy_to_user)(struct rds_incoming *inc, struct iov_iter *to);
 	void (*inc_free)(struct rds_incoming *inc);
+	int (*inc_to_sg_get)(struct rds_incoming *inc, struct scatterlist **sg);
+	void (*inc_to_sg_put)(struct scatterlist **sg);
 
 	int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
 				 struct rdma_cm_event *event, bool isv6);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 504cd6b..261904c 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -292,6 +292,8 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 	struct sock *sk;
 	unsigned long flags;
 	struct rds_conn_path *cp;
+	struct sk_filter *filter;
+	int result = __SOCKSG_PASS;
 
 	inc->i_conn = conn;
 	inc->i_rx_jiffies = jiffies;
@@ -376,6 +378,21 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 	/* We can be racing with rds_release() which marks the socket dead. */
 	sk = rds_rs_to_sk(rs);
 
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		if (conn->c_trans->inc_to_sg_get) {
+			struct scatterlist *sg;
+
+			if (conn->c_trans->inc_to_sg_get(inc, &sg) == 0) {
+				result = sg_filter_run(sk, sg);
+				if (conn->c_trans->inc_to_sg_put)
+					conn->c_trans->inc_to_sg_put(&sg);
+			}
+		}
+	}
+	rcu_read_unlock();
+
 	/* serialize with rds_release -> sock_orphan */
 	write_lock_irqsave(&rs->rs_recv_lock, flags);
 	if (!sock_flag(sk, SOCK_DEAD)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2c7b7c3..35454c7 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -465,6 +465,8 @@ struct rds_transport rds_tcp_transport = {
 	.conn_path_shutdown	= rds_tcp_conn_path_shutdown,
 	.inc_copy_to_user	= rds_tcp_inc_copy_to_user,
 	.inc_free		= rds_tcp_inc_free,
+	.inc_to_sg_get		= rds_tcp_inc_to_sg_get,
+	.inc_to_sg_put		= rds_tcp_inc_to_sg_put,
 	.stats_info_copy	= rds_tcp_stats_info_copy,
 	.exit			= rds_tcp_exit,
 	.t_owner		= THIS_MODULE,
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 3c69361..b2cc910 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -82,6 +82,8 @@ void rds_tcp_restore_callbacks(struct socket *sock,
 int rds_tcp_recv_path(struct rds_conn_path *cp);
 void rds_tcp_inc_free(struct rds_incoming *inc);
 int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg);
 
 /* tcp_send.c */
 void rds_tcp_xmit_path_prepare(struct rds_conn_path *cp);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 42c5ff1..b45e69b 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -56,6 +56,44 @@ void rds_tcp_inc_free(struct rds_incoming *inc)
 	kmem_cache_free(rds_tcp_incoming_slab, tinc);
 }
 
+#define MAX_SG MAX_SKB_FRAGS
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct scatterlist *sg_list;
+	struct rds_tcp_incoming *tinc;
+	struct sk_buff *skb;
+	int num_sg = 0;
+
+	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+
+	/* For now we are assuming that the max sg elements we need is MAX_SG.
+	 * To determine actual number of sg elements we need to traverse the
+	 * skb queue e.g.
+	 *
+	 * skb_queue_walk(&tinc->ti_skb_list, skb) {
+	 *	num_sg += skb_shinfo(skb)->nr_frags + 1;
+	 * }
+	 */
+	sg_list = kzalloc(sizeof(*sg_list) * MAX_SG, GFP_KERNEL);
+	if (!sg_list)
+		return -ENOMEM;
+
+	sg_init_table(sg_list, MAX_SG);
+	skb_queue_walk(&tinc->ti_skb_list, skb) {
+		num_sg += skb_to_sgvec_nomark(skb, &sg_list[num_sg], 0,
+					      skb->len);
+	}
+	sg_mark_end(&sg_list[num_sg - 1]);
+	*sg = sg_list;
+
+	return 0;
+}
+
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg)
+{
+	kfree(*sg);
+}
+
 /*
  * this is pretty lame, but, whatever.
  */
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v3 net-next 2/5] ebpf: Add sg_filter_run()
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

When sg_filter_run() is invoked it runs the attached eBPF
prog of type BPF_PROG_TYPE_SOCKET_SG_FILTER which deals with
struct scatterlist.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/filter.h         |  8 ++++++++
 include/uapi/linux/bpf.h       |  6 ++++++
 net/core/filter.c              | 24 ++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  6 ++++++
 4 files changed, 44 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5d565c5..9f1f7c1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1112,4 +1112,12 @@ struct bpf_sock_ops_kern {
 					 */
 };
 
+enum __socksg_action {
+	__SOCKSG_DROP = 0,
+	__SOCKSG_PASS,
+	__SOCKSG_REDIRECT,
+};
+
+int sg_filter_run(struct sock *sk, struct scatterlist *sg);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6ec1e32..d1d0ceb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+enum socksg_action {
+	SOCKSG_DROP = 0,
+	SOCKSG_PASS,
+	SOCKSG_REDIRECT,
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
diff --git a/net/core/filter.c b/net/core/filter.c
index cec3807..e427c8e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -121,6 +121,30 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 }
 EXPORT_SYMBOL(sk_filter_trim_cap);
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg)
+{
+	struct sk_filter *filter;
+	int result;
+
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		struct sk_msg_buff mb = {0};
+
+		memcpy(mb.sg_data, sg, sizeof(*sg) * MAX_SKB_FRAGS);
+		mb.sg_start = 0;
+		mb.sg_end = sg_nents(sg) - 1;
+		mb.data = sg_virt(sg);
+		mb.data_end = mb.data + sg->length;
+		mb.sg_copy[mb.sg_end] = true;
+		result = BPF_PROG_RUN(filter->prog, &mb);
+	}
+	rcu_read_unlock();
+
+	return result;
+}
+EXPORT_SYMBOL(sg_filter_run);
+
 BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
 {
 	return skb_get_poff(skb);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6ec1e32..d1d0ceb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+enum socksg_action {
+	SOCKSG_DROP = 0,
+	SOCKSG_PASS,
+	SOCKSG_REDIRECT,
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v3 net-next 3/5] ebpf: fix bpf_msg_pull_data
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

Like sockmap (sk_msg), socksg also deals with struct scatterlist
therefore socksg programs can use existing bpf helper bpf_msg_pull_data
to access packet data contained in struct scatterlist. While doing some
prelimnary testing, there are couple of issues found with
bpf_msg_pull_data that are fixed in this patch.

Also, there cannot be more than MAX_SKB_FRAGS entries in sg_data
therefore any checks for sg entry more than MAX_SKB_FRAGS in
bpf_msg_pull_data() is removed.

Besides that, I also ran into issues while put_page() is invoked.
e.g.
[ 450.568723] BUG: Bad page state in process swapper/10 pfn:2021540
[ 450.575632] page:ffffea0080855000 count:0 mapcount:0
mapping:ffff88103d006840 index:0xffff882021540000 compound_mapcount: 0
[ 450.588069] flags: 0x6fffff80008100(slab|head)
[ 450.593033] raw: 006fffff80008100 dead000000000100 dead000000000200
ffff88103d006840
[ 450.601683] raw: ffff882021540000 0000000080080007 00000000ffffffff
0000000000000000
[ 450.610337] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[ 450.617530] bad because of flags: 0x100(slab)

To avoid above issue, currently put_page() is disabled in this patch
temporarily. I am working on alternatives so that page allocated via
slab (in this case) can be freed without any issue.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/core/filter.c | 61 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 32 insertions(+), 29 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e427c8e..cc52baa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2316,7 +2316,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 BPF_CALL_4(bpf_msg_pull_data,
 	   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
 {
-	unsigned int len = 0, offset = 0, copy = 0;
+	unsigned int len = 0, offset = 0, copy = 0, off = 0;
 	struct scatterlist *sg = msg->sg_data;
 	int first_sg, last_sg, i, shift;
 	unsigned char *p, *to, *from;
@@ -2330,22 +2330,28 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	i = msg->sg_start;
 	do {
 		len = sg[i].length;
-		offset += len;
 		if (start < offset + len)
 			break;
+		offset += len;
 		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
-	} while (i != msg->sg_end);
+	} while (i <= msg->sg_end);
 
+	/* return error if start is out of range */
 	if (unlikely(start >= offset + len))
 		return -EINVAL;
 
-	if (!msg->sg_copy[i] && bytes <= len)
-		goto out;
+	/* return error if i is last entry in sglist and end is out of range */
+	if (msg->sg_copy[i] && end > offset + len)
+		return -EINVAL;
 
 	first_sg = i;
 
+	/* if i is not last entry in sg list and end (i.e start + bytes) is
+	 * within this sg[i] then goto out and calculate data and data_end
+	 */
+	if (!msg->sg_copy[i] && end <= offset + len)
+		goto out;
+
 	/* At this point we need to linearize multiple scatterlist
 	 * elements or a single shared page. Either way we need to
 	 * copy into a linear buffer exclusively owned by BPF. Then
@@ -2359,11 +2365,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	do {
 		copy += sg[i].length;
 		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
-		if (bytes < copy)
+		if (end < copy)
 			break;
-	} while (i != msg->sg_end);
+	} while (i <= msg->sg_end);
+
+	/* return error if i is last entry in sglist and end is out of range */
+	if (i > msg->sg_end && end > offset + copy)
+		return -EINVAL;
+
 	last_sg = i;
 
 	if (unlikely(copy < end - start))
@@ -2373,23 +2382,25 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	if (unlikely(!page))
 		return -ENOMEM;
 	p = page_address(page);
-	offset = 0;
 
 	i = first_sg;
 	do {
 		from = sg_virt(&sg[i]);
 		len = sg[i].length;
-		to = p + offset;
+		to = p + off;
 
 		memcpy(to, from, len);
-		offset += len;
+		off += len;
 		sg[i].length = 0;
-		put_page(sg_page(&sg[i]));
+		/* if original page is allocated via slab then put_page
+		 * causes error BUG: Bad page state in process. So temporarily
+		 * disabled put_page.
+		 * Todo: fix it
+		 */
+		//put_page(sg_page(&sg[i]));
 
 		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
-	} while (i != last_sg);
+	} while (i < last_sg);
 
 	sg[first_sg].length = copy;
 	sg_set_page(&sg[first_sg], page, copy, 0);
@@ -2406,12 +2417,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	do {
 		int move_from;
 
-		if (i + shift >= MAX_SKB_FRAGS)
-			move_from = i + shift - MAX_SKB_FRAGS;
-		else
-			move_from = i + shift;
-
-		if (move_from == msg->sg_end)
+		move_from = i + shift;
+		if (move_from > msg->sg_end)
 			break;
 
 		sg[i] = sg[move_from];
@@ -2420,14 +2427,10 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 		sg[move_from].offset = 0;
 
 		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
 	} while (1);
 	msg->sg_end -= shift;
-	if (msg->sg_end < 0)
-		msg->sg_end += MAX_SKB_FRAGS;
 out:
-	msg->data = sg_virt(&sg[i]) + start - offset;
+	msg->data = sg_virt(&sg[first_sg]) + start - offset;
 	msg->data_end = msg->data + bytes;
 
 	return 0;
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v3 net-next 1/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
existing socket filter infrastructure for bpf program attach and load.
SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
contrast to SOCKET_FILTER which deals with struct skb. This is useful
for kernel entities that don't have skb to represent packet data but
want to run eBPF socket filter on packet data that is in form of struct
scatterlist e.g. IB/RDMA

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/bpf_types.h      |  1 +
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           |  1 +
 kernel/bpf/verifier.c          |  1 +
 net/core/filter.c              | 55 ++++++++++++++++++++++++++++++++++++++++--
 samples/bpf/bpf_load.c         | 11 ++++++---
 tools/bpf/bpftool/prog.c       |  1 +
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/libbpf.c         |  3 +++
 tools/lib/bpf/libbpf.h         |  2 ++
 10 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c09..7dc1503 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -16,6 +16,7 @@
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8339d81..160cdb2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1362,6 +1362,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
+	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
 	    !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ca90679..5abc788 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		if (meta)
 			return meta->pkt_access;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index fd423ce..cec3807 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
+	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
+	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
 		bpf_prog_put(prog);
 	} else {
 		bpf_release_orig_filter(prog);
@@ -1539,10 +1540,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 
 static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
 {
+	struct bpf_prog *prog;
+
 	if (sock_flag(sk, SOCK_FILTER_LOCKED))
 		return ERR_PTR(-EPERM);
 
-	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	if (IS_ERR(prog))
+		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
+
+	return prog;
 }
 
 int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -4920,6 +4927,17 @@ bool bpf_helper_changes_pkt_data(void *func)
 }
 
 static const struct bpf_func_proto *
+socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_msg_pull_data:
+		return &bpf_msg_pull_data_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
@@ -6738,6 +6756,30 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 socksg_filter_convert_ctx_access(enum bpf_access_type type,
+					    const struct bpf_insn *si,
+					    struct bpf_insn *insn_buf,
+					    struct bpf_prog *prog,
+					    u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct sk_msg_md, data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data));
+		break;
+	case offsetof(struct sk_msg_md, data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 				     const struct bpf_insn *si,
 				     struct bpf_insn *insn_buf,
@@ -6876,6 +6918,15 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops socksg_filter_verifier_ops = {
+	.get_func_proto         = socksg_filter_func_proto,
+	.is_valid_access        = sk_msg_is_valid_access,
+	.convert_ctx_access     = socksg_filter_convert_ctx_access,
+};
+
+const struct bpf_prog_ops socksg_filter_prog_ops = {
+};
+
 const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
 	.get_func_proto		= tc_cls_act_func_proto,
 	.is_valid_access	= tc_cls_act_is_valid_access,
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 904e775..3b1697d 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -69,6 +69,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_sockops = strncmp(event, "sockops", 7) == 0;
 	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
 	bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+	bool is_socksg = strncmp(event, "socksg", 6) == 0;
+
 	size_t insns_cnt = size / sizeof(struct bpf_insn);
 	enum bpf_prog_type prog_type;
 	char buf[256];
@@ -102,6 +104,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SK_SKB;
 	} else if (is_sk_msg) {
 		prog_type = BPF_PROG_TYPE_SK_MSG;
+	} else if (is_socksg) {
+		prog_type = BPF_PROG_TYPE_SOCKET_SG_FILTER;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -122,8 +126,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
 		return 0;
 
-	if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-		if (is_socket)
+	if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_socksg) {
+		if (is_socket || is_socksg)
 			event += 6;
 		else
 			event += 7;
@@ -627,7 +631,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		    memcmp(shname, "cgroup/", 7) == 0 ||
 		    memcmp(shname, "sockops", 7) == 0 ||
 		    memcmp(shname, "sk_skb", 6) == 0 ||
-		    memcmp(shname, "sk_msg", 6) == 0) {
+		    memcmp(shname, "sk_msg", 6) == 0 ||
+		    memcmp(shname, "socksg", 6) == 0) {
 			ret = load_and_attach(shname, data->d_buf,
 					      data->d_size);
 			if (ret != 0)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d..9c57c4e 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_SOCKET_SG_FILTER] = "socket_sg_filter",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abd0f1..a7ac51c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -2077,6 +2078,7 @@ static bool bpf_program__is_type(struct bpf_program *prog,
 BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT);
 BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
 BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
+BPF_PROG_TYPE_FNS(socket_sg_filter, BPF_PROG_TYPE_SOCKET_SG_FILTER);
 
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type)
@@ -2129,6 +2131,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 	BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG),
 	BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND),
 	BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND),
+	BPF_PROG_SEC("socksg",          BPF_PROG_TYPE_SOCKET_SG_FILTER),
 };
 
 #undef BPF_PROG_SEC
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 96c55fa..7527ea4 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -208,6 +208,7 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance,
 void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type);
+int bpf_program__set_socket_sg_filter(struct bpf_program *prog);
 
 bool bpf_program__is_socket_filter(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
@@ -217,6 +218,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 bool bpf_program__is_sched_act(struct bpf_program *prog);
 bool bpf_program__is_xdp(struct bpf_program *prog);
 bool bpf_program__is_perf_event(struct bpf_program *prog);
+bool bpf_program__is_socket_sg_filter(struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
-- 
1.8.3.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox