Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 0/2] ethernet: Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  6:29 UTC (permalink / raw)
  To: davem; +Cc: vz, slemieux.tyco, keescook, netdev, linux-kernel

v1->v2:
 - According to Vladimir's suggestion. change a common 0 and NULL comparsion form.

zhong jiang (2):
  ethernet: declance:  Use NULL to compare with pointer-typed value
    rather than 0
  ethernet: lpc_eth: Use NULL to compare with pointer-typed value
    rather than 0

 drivers/net/ethernet/amd/declance.c | 3 +--
 drivers/net/ethernet/nxp/lpc_eth.c  | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

-- 
1.7.12.4

^ permalink raw reply

* Re: [PATCH 2/2] ethernet: lpc_eth: Use NULL to compare with pointer-typed value rather than 0
From: zhong jiang @ 2018-08-18  5:59 UTC (permalink / raw)
  To: Vladimir Zapolskiy; +Cc: davem, slemieux.tyco, keescook, netdev, linux-kernel
In-Reply-To: <edc2f14f-8694-3160-47b1-ccb5be8a3484@mleia.com>

On 2018/8/17 23:29, Vladimir Zapolskiy wrote:
> Hi zhong jiang,
>
> On 08/17/2018 04:18 PM, zhong jiang wrote:
>> We should use NULL to compare with pointer-typed value rather than 0.
>> The issue is detected with the help of Coccinelle.
>> ---
>>  drivers/net/ethernet/nxp/lpc_eth.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
>> index 08381ef..04d9e62 100644
>> --- a/drivers/net/ethernet/nxp/lpc_eth.c
>> +++ b/drivers/net/ethernet/nxp/lpc_eth.c
>> @@ -1350,7 +1350,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
>>  				"IRAM not big enough for net buffers, using SDRAM instead.\n");
>>  	}
>>  
>> -	if (pldat->dma_buff_base_v == 0) {
>> +	if (pldat->dma_buff_base_v == NULL) {
> That's a valid finding, but please use a common 0 and NULL comparison in form of
>
> 	if (!pldat->dma_buff_base_v)
>
> To the change above please feel free to add my
>
> Acked-by: Vladimir Zapolskiy <vz@mleia.com>
 Thanks, Will do in v2
>>  		ret = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
>>  		if (ret)
>>  			goto err_out_free_irq;
>>
>
> .
>

^ permalink raw reply

* [PATCH] fm10k_main.c: fix missing return value check of alloc_skb()
From: Jiecheng Wu @ 2018-08-18  2:27 UTC (permalink / raw)
  To: netdev

Function fm10k_init_module() defined in drivers/net/ethernet/intel/fm10k/fm10k_main.c
 calls alloc_workqueue() to allocate memory for struct workqueue_struct which is 
 dereferenced immediately. As alloc_workqueue() may return NULL on failure, 
 this code piece may cause NULL pointer dereference bug.
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 3f53654..78a43d6 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -41,6 +41,8 @@ static int __init fm10k_init_module(void)
 	/* create driver workqueue */
 	fm10k_workqueue = alloc_workqueue("%s", WQ_MEM_RECLAIM, 0,
 					  fm10k_driver_name);
+	if (!fm10k_workqueue)
+		return -ENOMEM;
 
 	fm10k_dbg_init();
 
-- 
2.6.4

^ permalink raw reply related

* Re: [PATCH] fm10k_main.c: fix missing return value check of alloc_skb()
From: Andrew Lunn @ 2018-08-18  2:14 UTC (permalink / raw)
  To: Jiecheng Wu; +Cc: netdev
In-Reply-To: <20180818020058.14652-1-jasonwood2031@gmail.com>

On Sat, Aug 18, 2018 at 10:00:58AM +0800, Jiecheng Wu wrote:
> Function fm10k_init_module() defined in drivers/net/ethernet/intel/fm10k/fm10k_main.c calls alloc_workqueue() to allocate memory for struct workqueue_struct which is dereferenced immediately. As alloc_workqueue() may return NULL on failure, this code piece may cause NULL pointer dereference bug.

Hi Jiecheng

Please wrap your commit message to around 80 character wide.

You also need to add a Signed-off-by to your patch. Please see 

https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html#developer-s-certificate-of-origin-1-1

scripts/checkpatch.pl will help you get these things rights.

	Andrew

^ permalink raw reply

* [PATCH] fm10k_main.c: fix missing return value check of alloc_skb()
From: Jiecheng Wu @ 2018-08-18  2:00 UTC (permalink / raw)
  To: netdev

Function fm10k_init_module() defined in drivers/net/ethernet/intel/fm10k/fm10k_main.c calls alloc_workqueue() to allocate memory for struct workqueue_struct which is dereferenced immediately. As alloc_workqueue() may return NULL on failure, this code piece may cause NULL pointer dereference bug.
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 3f53654..78a43d6 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -41,6 +41,8 @@ static int __init fm10k_init_module(void)
 	/* create driver workqueue */
 	fm10k_workqueue = alloc_workqueue("%s", WQ_MEM_RECLAIM, 0,
 					  fm10k_driver_name);
+	if (!fm10k_workqueue)
+		return -ENOMEM;
 
 	fm10k_dbg_init();
 
-- 
2.6.4

^ permalink raw reply related

* Re: Bug in FIB insert
From: David Ahern @ 2018-08-18  1:21 UTC (permalink / raw)
  To: Md. Islam, kuznet, David Miller, Netdev, Stephen Hemminger,
	Jesper Dangaard Brouer
In-Reply-To: <CAFgPn1D2MuaX0Y-_CTno7jxcaHdM2w2cR0Vbgdjr8RVgRc=YvA@mail.gmail.com>

On 8/16/18 6:59 PM, Md. Islam wrote:
> There is a bug in fib_table_insert(). If I add following routes,
> 
> 23.20.0.0/14 <http://23.20.0.0/14>    veth1
> 23.20.0.0/15 <http://23.20.0.0/15>    veth2
> 
> FIB lookup on 23.22.111.212  results veth1, not veth2.
> 

veth1 is the correct lookup response. '22' toggles the 15th bit which
means the compare to 23.20/15 fails.

^ permalink raw reply

* pull-request: bpf 2018-08-18
From: Daniel Borkmann @ 2018-08-17 23:29 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a BPF selftest failure in test_cgroup_storage due to rlimit
   restrictions, from Yonghong.

2) Fix a suspicious RCU rcu_dereference_check() warning triggered
   from removing a device's XDP memory allocator by using the correct
   rhashtable lookup function, from Tariq.

3) A batch of BPF sockmap and ULP fixes mainly fixing leaks and races
   as well as enforcing module aliases for ULPs. Another fix for BPF
   map redirect to make them work again with tail calls, from Daniel.

4) Fix XDP BPF samples to unload their programs upon SIGTERM, from Jesper.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 9a76aba02a37718242d7cdc294f0a3901928aa57:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next (2018-08-15 15:04:25 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to f6069b9aa9934ede26f41ac0781fce241279ad43:

  bpf: fix redirect to map under tail calls (2018-08-17 15:56:23 -0700)

----------------------------------------------------------------
Alexei Starovoitov (1):
      Merge branch 'sockmap-ulp-fixes'

Daniel Borkmann (6):
      tcp, ulp: add alias for all ulp modules
      tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach
      bpf, sockmap: fix leakage of smap_psock_map_entry
      bpf, sockmap: fix map elem deletion race with smap_stop_sock
      bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist
      bpf: fix redirect to map under tail calls

Jesper Dangaard Brouer (1):
      samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM

Tariq Toukan (1):
      net/xdp: Fix suspicious RCU usage warning

Yonghong Song (2):
      bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
      tools/bpf: fix bpf selftest test_cgroup_storage failure

 include/linux/filter.h                            |   3 +-
 include/net/tcp.h                                 |   4 +
 include/trace/events/xdp.h                        |   5 +-
 kernel/bpf/core.c                                 |   2 +-
 kernel/bpf/cpumap.c                               |   2 +
 kernel/bpf/devmap.c                               |   1 +
 kernel/bpf/sockmap.c                              | 120 ++++++++++++----------
 kernel/bpf/verifier.c                             |  21 ----
 kernel/bpf/xskmap.c                               |   1 +
 net/core/filter.c                                 |  68 ++++++------
 net/core/xdp.c                                    |  14 +--
 net/ipv4/tcp_ulp.c                                |   4 +-
 net/tls/tls_main.c                                |   1 +
 samples/bpf/xdp_redirect_cpu_user.c               |   3 +-
 samples/bpf/xdp_rxq_info_user.c                   |   3 +-
 tools/testing/selftests/bpf/test_cgroup_storage.c |   1 +
 16 files changed, 123 insertions(+), 130 deletions(-)

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH next-queue 0/8] ixgbe/ixgbevf: IPsec offload support for VFs
From: Alexander Duyck @ 2018-08-17 23:27 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: intel-wired-lan, Jeff Kirsher, Steffen Klassert, Netdev
In-Reply-To: <84477537-900a-fb3b-50fa-8142bba76869@oracle.com>

On Fri, Aug 17, 2018 at 4:19 PM Shannon Nelson
<shannon.nelson@oracle.com> wrote:
>
>
> On 8/16/2018 2:36 PM, Shannon Nelson wrote:
> > On 8/16/2018 2:15 PM, Alexander Duyck wrote:
> >> On Tue, Aug 14, 2018 at 10:10 AM Shannon Nelson
> >> <shannon.nelson@oracle.com> wrote:
> >>>
> >>> On 8/14/2018 8:30 AM, Alexander Duyck wrote:
> >>>> On Mon, Aug 13, 2018 at 11:43 AM Shannon Nelson
> >>>> <shannon.nelson@oracle.com> wrote:
> >>>>>
> >>>>> This set of patches implements IPsec hardware offload for VF
> >>>>> devices in
> >>>>> Intel's 10Gbe x540 family of Ethernet devices.
> >>>
> >>> [...]
> >>>
> >>>>
> >>>> So the one question I would have about this patch set is what happens
> >>>> if you are setting up a ipsec connection between the PF and one of the
> >>>> VFs on the same port/function? Do the ipsec offloads get translated
> >>>> across the Tx loopback or do they end up causing issues? Specifically
> >>>> I would be interested in seeing the results of a test either between
> >>>> two VFs, or the PF and one of the VFs on the same port.
> >>>>
> >>>> - Alex
> >>>>
> >>>
> >>> There is definitely something funky in the internal switch connection,
> >>> as messages going from PF to VF with an offloaded encryption don't seem
> >>> to get received by the VF, at least when in a VEB setup.  If I only set
> >>> up offloads on the Rx on both PF and VF, and don't offload the Tx, then
> >>> things work.
> >>>
> >>> I don't have a setup to test this, but I suspect that in a VEPA
> >>> configuration, with packets going out to the switch and turned around
> >>> back in, the Tx encryption offload would happen as expected.
> >>>
> >>> sln
> >>
> >> We should probably look at adding at least one patch to the set then
> >> that disables IPsec Tx offload if SR-IOV is enabled with VEB so that
> >> we don't end up breaking connections should a VF be migrated from a
> >> remote system to a local one that it is connected to.
> >>
> >> - Alex
> >>
> >
> > The problem with this is that someone could set up an IPsec connection
> > on the PF for Tx and Rx use, then set num_vfs, start some VFs, and we
> > still can end up in the same place.  I don't think we want to disallow
> > all Tx IPsec offload.
> >
> > Maybe we can catch it in ixgbe_ipsec_offload_ok()?  If it can find that
> > the dest mac is on the internal switch, perhaps it can NAK the Tx
> > offload?  That would force the XFRM xmit code to do a regular SW encrypt
> > before sending the packet.  I'll look into this.
> >
> > sln
>
> This would be a great idea, but the xdo_state_offload_ok() callback
> happens in the network stack before routing has happened, so there is no
> mac address yet in the skb.  We may be stuck with NAKing *all* Tx
> offloads when num_vfs != 0.  It works, and it is better than no offload
> at all, but it sure harshes the vibe.  Blech.
>
> sln

You can probably just think of the Tx offload as being lumped in with
all the other offloads that don't work when SR-IOV is enabled such as
ATR and RSC.

- Alex

^ permalink raw reply

* Re: [PATCH] mt76: Enable NL80211_EXT_FEATURE_CQM_RSSI_LIST
From: Andrew Zaborowski @ 2018-08-18  2:28 UTC (permalink / raw)
  To: Kristian Evensen
  Cc: lorenzo.bianconi-H+wXaHxf7aLQT0dZR+AlfA, Arend van Spriel,
	Kalle Valo, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	Network Development
In-Reply-To: <CAKfDRXia3zwW9Q=7RZGqK_DWCsyQzG4KEZHsyHzGcNb70Ge1kA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi,

On 13 August 2018 at 16:25, Kristian Evensen <kristian.evensen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Aug 13, 2018 at 12:55 PM Lorenzo Bianconi
> <lorenzo.bianconi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> According to my understanding (please correct me if I am wrong)
>> BSS_CHANGED_CQM is only needed if CQM_RSSI is handled
>> by the driver/fw, while if it is not set mac80211 will take care of that
>> in ieee80211_handle_beacon_sig routine.
>> I am AFK at the moment, I will test that patch when I am back from vacations.
>
> That matches my understanding as well (base on for example the message
> for commit ae44b502669d0cd1f167cdb48994292aa20fd3dd).

This is right, mt7601u had this flag added in that commit but the mt76
drivers were added later.  It seems mt76_rx_convert always sets the
ieee80211_rx_status.signal field so mac80211 can check the rssi values
and call any CQM callbacks if needed.

Best regards

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH next-queue 0/8] ixgbe/ixgbevf: IPsec offload support for VFs
From: Shannon Nelson @ 2018-08-17 23:19 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: intel-wired-lan, Jeff Kirsher, Steffen Klassert, Netdev
In-Reply-To: <7796a29d-625a-4728-defd-cda73afc83ba@oracle.com>


On 8/16/2018 2:36 PM, Shannon Nelson wrote:
> On 8/16/2018 2:15 PM, Alexander Duyck wrote:
>> On Tue, Aug 14, 2018 at 10:10 AM Shannon Nelson
>> <shannon.nelson@oracle.com> wrote:
>>>
>>> On 8/14/2018 8:30 AM, Alexander Duyck wrote:
>>>> On Mon, Aug 13, 2018 at 11:43 AM Shannon Nelson
>>>> <shannon.nelson@oracle.com> wrote:
>>>>>
>>>>> This set of patches implements IPsec hardware offload for VF 
>>>>> devices in
>>>>> Intel's 10Gbe x540 family of Ethernet devices.
>>>
>>> [...]
>>>
>>>>
>>>> So the one question I would have about this patch set is what happens
>>>> if you are setting up a ipsec connection between the PF and one of the
>>>> VFs on the same port/function? Do the ipsec offloads get translated
>>>> across the Tx loopback or do they end up causing issues? Specifically
>>>> I would be interested in seeing the results of a test either between
>>>> two VFs, or the PF and one of the VFs on the same port.
>>>>
>>>> - Alex
>>>>
>>>
>>> There is definitely something funky in the internal switch connection,
>>> as messages going from PF to VF with an offloaded encryption don't seem
>>> to get received by the VF, at least when in a VEB setup.  If I only set
>>> up offloads on the Rx on both PF and VF, and don't offload the Tx, then
>>> things work.
>>>
>>> I don't have a setup to test this, but I suspect that in a VEPA
>>> configuration, with packets going out to the switch and turned around
>>> back in, the Tx encryption offload would happen as expected.
>>>
>>> sln
>>
>> We should probably look at adding at least one patch to the set then
>> that disables IPsec Tx offload if SR-IOV is enabled with VEB so that
>> we don't end up breaking connections should a VF be migrated from a
>> remote system to a local one that it is connected to.
>>
>> - Alex
>>
> 
> The problem with this is that someone could set up an IPsec connection 
> on the PF for Tx and Rx use, then set num_vfs, start some VFs, and we 
> still can end up in the same place.  I don't think we want to disallow 
> all Tx IPsec offload.
> 
> Maybe we can catch it in ixgbe_ipsec_offload_ok()?  If it can find that 
> the dest mac is on the internal switch, perhaps it can NAK the Tx 
> offload?  That would force the XFRM xmit code to do a regular SW encrypt 
> before sending the packet.  I'll look into this.
> 
> sln

This would be a great idea, but the xdo_state_offload_ok() callback 
happens in the network stack before routing has happened, so there is no 
mac address yet in the skb.  We may be stuck with NAKing *all* Tx 
offloads when num_vfs != 0.  It works, and it is better than no offload 
at all, but it sure harshes the vibe.  Blech.

sln

^ permalink raw reply

* [PATCH] RDMA/smc: Replace ib_query_gid with rdma_get_gid_attr
From: Jason Gunthorpe @ 2018-08-18  2:17 UTC (permalink / raw)
  To: Ursula Braun; +Cc: linux-rdma, Parav Pandit, netdev, linux-kernel

All RDMA ULPs should be using rdma_get_gid_attr instead of
ib_query_gid. Convert SMC to use the new API.

In the process correct some confusion with gid_type - if attr->ndev is
!NULL then gid_type can never be IB_GID_TYPE_IB by
definition. IB_GID_TYPE_ROCE shares the same enum value and is probably
what was intended here.

Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/rdma/ib_cache.h | 24 ---------------------
 net/smc/smc_ib.c        | 48 +++++++++++++++++++++--------------------
 2 files changed, 25 insertions(+), 47 deletions(-)

This is intended to go this merge window (ie in the next 4 days) to
remove the wrong-headed __deprecated warning. If there is no objection
I'll send it via the rdma tree.

Thanks,
Jason

diff --git a/include/rdma/ib_cache.h b/include/rdma/ib_cache.h
index 3e11e7cc60b745..62e990b620aaf7 100644
--- a/include/rdma/ib_cache.h
+++ b/include/rdma/ib_cache.h
@@ -133,28 +133,4 @@ const struct ib_gid_attr *rdma_get_gid_attr(struct ib_device *device,
 void rdma_put_gid_attr(const struct ib_gid_attr *attr);
 void rdma_hold_gid_attr(const struct ib_gid_attr *attr);
 
-/*
- * This is to be removed. It only exists to make merging rdma and smc simpler.
- */
-static inline __deprecated int ib_query_gid(struct ib_device *device,
-					    u8 port_num, int index,
-					    union ib_gid *gid,
-					    struct ib_gid_attr *attr_out)
-{
-	const struct ib_gid_attr *attr;
-
-	memset(attr_out, 0, sizeof(*attr_out));
-	attr = rdma_get_gid_attr(device, port_num, index);
-	if (IS_ERR(attr))
-		return PTR_ERR(attr);
-
-	if (attr->ndev)
-		dev_hold(attr->ndev);
-	*attr_out = *attr;
-
-	rdma_put_gid_attr(attr);
-
-	return 0;
-}
-
 #endif /* _IB_CACHE_H */
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 9bb5274a244e83..e519ef29c0ffcb 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -145,17 +145,21 @@ int smc_ib_ready_link(struct smc_link *lnk)
 
 static int smc_ib_fill_mac(struct smc_ib_device *smcibdev, u8 ibport)
 {
-	struct ib_gid_attr gattr;
-	union ib_gid gid;
-	int rc;
+	const struct ib_gid_attr *attr;
+	int rc = 0;
 
-	rc = ib_query_gid(smcibdev->ibdev, ibport, 0, &gid, &gattr);
-	if (rc || !gattr.ndev)
+	attr = rdma_get_gid_attr(smcibdev->ibdev, ibport, 0);
+	if (IS_ERR(attr))
 		return -ENODEV;
 
-	memcpy(smcibdev->mac[ibport - 1], gattr.ndev->dev_addr, ETH_ALEN);
-	dev_put(gattr.ndev);
-	return 0;
+	if (attr->ndev)
+		memcpy(smcibdev->mac[ibport - 1], attr->ndev->dev_addr,
+		       ETH_ALEN);
+	else
+		rc = -ENODEV;
+
+	rdma_put_gid_attr(attr);
+	return rc;
 }
 
 /* Create an identifier unique for this instance of SMC-R.
@@ -180,29 +184,27 @@ bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport)
 int smc_ib_determine_gid(struct smc_ib_device *smcibdev, u8 ibport,
 			 unsigned short vlan_id, u8 gid[], u8 *sgid_index)
 {
-	struct ib_gid_attr gattr;
-	union ib_gid _gid;
+	const struct ib_gid_attr *attr;
 	int i;
 
 	for (i = 0; i < smcibdev->pattr[ibport - 1].gid_tbl_len; i++) {
-		memset(&_gid, 0, SMC_GID_SIZE);
-		memset(&gattr, 0, sizeof(gattr));
-		if (ib_query_gid(smcibdev->ibdev, ibport, i, &_gid, &gattr))
+		attr = rdma_get_gid_attr(smcibdev->ibdev, ibport, i);
+		if (IS_ERR(attr))
 			continue;
-		if (!gattr.ndev)
-			continue;
-		if (((!vlan_id && !is_vlan_dev(gattr.ndev)) ||
-		     (vlan_id && is_vlan_dev(gattr.ndev) &&
-		      vlan_dev_vlan_id(gattr.ndev) == vlan_id)) &&
-		    gattr.gid_type == IB_GID_TYPE_IB) {
+
+		if (attr->ndev &&
+		    ((!vlan_id && !is_vlan_dev(attr->ndev)) ||
+		     (vlan_id && is_vlan_dev(attr->ndev) &&
+		      vlan_dev_vlan_id(attr->ndev) == vlan_id)) &&
+		    attr->gid_type == IB_GID_TYPE_ROCE) {
 			if (gid)
-				memcpy(gid, &_gid, SMC_GID_SIZE);
+				memcpy(gid, &attr->gid, SMC_GID_SIZE);
 			if (sgid_index)
-				*sgid_index = i;
-			dev_put(gattr.ndev);
+				*sgid_index = attr->index;
+			rdma_put_gid_attr(attr);
 			return 0;
 		}
-		dev_put(gattr.ndev);
+		rdma_put_gid_attr(attr);
 	}
 	return -ENODEV;
 }
-- 
2.18.0

^ permalink raw reply related

* [RFC v3 net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

Add a sample program that shows how socksg program is used and attached
to socket filter. The kernel sample program deals with struct
scatterlist that is passed as bpf context.

When run in server mode, the sample RDS program opens PF_RDS socket,
attaches eBPF program to RDS socket which then uses bpf_msg_pull_data
helper to inspect packet data contained in struct scatterlist and
returns appropriate action code back to kernel.

To ease testing, RDS client functionality is also added so that users
can generate RDS packet.

Server:
[root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71

Client:
[root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
transport tcp
client bound to address: 192.168.3.70 port 25278
client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...

Server output:
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
server listening on 192.168.3.71

[root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
          <idle>-0     [038] ..s.   146.947362: 0: 30 31 32
          <idle>-0     [038] ..s.   146.947364: 0: 33 34 35

Similarly specifying '-t ib' will run this on IB link.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 samples/bpf/Makefile          |   3 +
 samples/bpf/rds_filter_kern.c |  42 ++++++
 samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 384 insertions(+)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 36f9f41..dbb30d0 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
+hostprogs-y += rds_filter
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
 xdp_fwd-objs := xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+rds_filter-objs := bpf_load.o rds_filter_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -166,6 +168,7 @@ always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
 always += xdp_sample_pkts_kern.o
+always += rds_filter_kern.o
 
 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_filter_kern.c b/samples/bpf/rds_filter_kern.c
new file mode 100644
index 0000000..633e687
--- /dev/null
+++ b/samples/bpf/rds_filter_kern.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/rds.h>
+#include "bpf_helpers.h"
+
+#define bpf_printk(fmt, ...)				\
+({							\
+	char ____fmt[] = fmt;				\
+	bpf_trace_printk(____fmt, sizeof(____fmt),	\
+			##__VA_ARGS__);			\
+})
+
+SEC("socksg")
+int main_prog(struct sk_msg_md *msg)
+{
+	int start, end, err;
+	unsigned char *d;
+
+	start = 0;
+	end = 6;
+
+	err = bpf_msg_pull_data(msg, start, end, 0);
+	if (err) {
+		bpf_printk("socksg: pull_data err %i\n", err);
+		return SOCKSG_PASS;
+	}
+
+	if (msg->data + 6 > msg->data_end)
+		return SOCKSG_PASS;
+
+	d = (unsigned char *)msg->data;
+	bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+	bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+
+	return SOCKSG_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/rds_filter_user.c b/samples/bpf/rds_filter_user.c
new file mode 100644
index 0000000..1186345
--- /dev/null
+++ b/samples/bpf/rds_filter_user.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <assert.h>
+#include "bpf_load.h"
+#include <getopt.h>
+#include <errno.h>
+#include <netinet/in.h>
+#include <limits.h>
+#include <linux/sockios.h>
+#include <linux/rds.h>
+#include <linux/errqueue.h>
+#include <linux/bpf.h>
+#include <strings.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define TESTPORT	4000
+#define BUFSIZE		8192
+
+int transport = -1;
+
+static int str2trans(const char *trans)
+{
+	if (strcmp(trans, "tcp") == 0)
+		return RDS_TRANS_TCP;
+	if (strcmp(trans, "ib") == 0)
+		return RDS_TRANS_IB;
+	return (RDS_TRANS_NONE);
+}
+
+static const char *trans2str(int trans)
+{
+	switch (trans) {
+	case RDS_TRANS_TCP:
+		return ("tcp");
+	case RDS_TRANS_IB:
+		return ("ib");
+	case RDS_TRANS_NONE:
+		return ("none");
+	default:
+		return ("unknown");
+	}
+}
+
+static int gettransport(int sock)
+{
+	int err;
+	char val;
+	socklen_t len = sizeof(int);
+
+	err = getsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&val, &len);
+	if (err < 0) {
+		fprintf(stderr, "%s: getsockopt %s\n",
+			__func__, strerror(errno));
+		return err;
+	}
+	return (int)val;
+}
+
+static int settransport(int sock, int transport)
+{
+	int err;
+
+	err = setsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&transport, sizeof(transport));
+	if (err < 0) {
+		fprintf(stderr, "could not set transport %s, %s\n",
+			trans2str(transport), strerror(errno));
+	}
+	return err;
+}
+
+static void print_sock_local_info(int fd, char *str, struct sockaddr_in *ret)
+{
+	socklen_t sin_size = sizeof(struct sockaddr_in);
+	struct sockaddr_in sin;
+	int err;
+
+	err = getsockname(fd, (struct sockaddr *)&sin, &sin_size);
+	if (err < 0) {
+		fprintf(stderr, "%s getsockname %s\n",
+			__func__, strerror(errno));
+		return;
+	}
+	printf("%s address: %s port %d\n",
+		(str ? str : ""), inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
+
+	if (ret != NULL)
+		*ret = sin;
+}
+
+static void print_payload(char *buf)
+{
+	int i;
+
+	printf("payload contains:");
+	for (i = 0; i < 10; i++)
+		printf("%x ", buf[i]);
+	printf("...\n");
+}
+
+static void server(char *address, in_port_t port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	struct iovec *iov;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(address);
+	sin.sin_port = htons(port);
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	/* attach bpf prog */
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+			  sizeof(prog_fd[0])) == 0);
+
+	print_sock_local_info(sock, "server bound to", NULL);
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	while (1) {
+		memset(buf, 0, BUFSIZE);
+		iov[0].iov_base = buf;
+		iov[0].iov_len = BUFSIZE;
+
+		memset(&msg, 0, sizeof(msg));
+		msg.msg_name = &din;
+		msg.msg_namelen = sizeof(din);
+		msg.msg_iov = iov;
+		msg.msg_iovlen = 1;
+
+		printf("server listening on %s\n", inet_ntoa(sin.sin_addr));
+
+		rc = recvmsg(sock, &msg, 0);
+		if (rc < 0) {
+			fprintf(stderr, "%s: recvmsg %s\n",
+				__func__, strerror(errno));
+			break;
+		}
+
+		printf("%s received a packet from %s of len %d cmsg len %d, on port %d\n",
+			inet_ntoa(sin.sin_addr),
+			inet_ntoa(din.sin_addr),
+			(uint32_t) iov[0].iov_len,
+			(uint32_t) msg.msg_controllen,
+			ntohs(din.sin_port));
+
+		print_payload(buf);
+	}
+	free(iov);
+out:
+	free(buf);
+}
+
+static void create_message(char *buf)
+{
+	unsigned int i;
+
+	for (i = 0; i < BUFSIZE; i++) {
+		buf[i] = i + 0x30;
+	}
+}
+
+static int build_rds_packet(struct msghdr *msg, char *buf)
+{
+	struct iovec *iov;
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return -1;
+	}
+
+	msg->msg_iov = iov;
+	msg->msg_iovlen = 1;
+
+	iov[0].iov_base = buf;
+	iov[0].iov_len = BUFSIZE * sizeof(char);
+
+	return 0;
+}
+
+static void client(char *localaddr, char *remoteaddr, in_port_t server_port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	create_message(buf);
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(localaddr);
+	sin.sin_port = 0;
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	print_sock_local_info(sock, "client bound to",  &sin);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.msg_name = &din;
+	msg.msg_namelen = sizeof(din);
+
+	memset(&din, 0, sizeof(din));
+	din.sin_family = AF_INET;
+	din.sin_addr.s_addr = inet_addr(remoteaddr);
+	din.sin_port = htons(server_port);
+
+	rc = build_rds_packet(&msg, buf);
+	if (rc < 0)
+		goto out;
+
+	printf("client sending %d byte message from %s to %s on port %d\n",
+		(uint32_t) msg.msg_iov->iov_len, localaddr,
+		remoteaddr, ntohs(sin.sin_port));
+
+	rc = sendmsg(sock, &msg, 0);
+	if (rc < 0)
+		fprintf(stderr, "%s: sendmsg %s\n", __func__, strerror(errno));
+
+	print_payload(buf);
+
+	if (msg.msg_control)
+		free(msg.msg_control);
+	if (msg.msg_iov)
+		free(msg.msg_iov);
+out:
+	free(buf);
+
+	return;
+}
+
+static void usage(char *progname)
+{
+	fprintf(stderr, "Usage %s [-s srvaddr] [-c clientaddr] [-t transport]"
+		"\n", progname);
+}
+
+int main(int argc, char **argv)
+{
+	in_port_t server_port = TESTPORT;
+	char *serveraddr = NULL;
+	char *clientaddr = NULL;
+	char filename[256];
+	int opt;
+
+	while ((opt = getopt(argc, argv, "s:c:t:")) != -1) {
+		switch (opt) {
+		case 's':
+			serveraddr = optarg;
+			break;
+		case 'c':
+			clientaddr = optarg;
+			break;
+		case 't':
+			transport = str2trans(optarg);
+			if (transport == RDS_TRANS_NONE) {
+				fprintf(stderr,
+					"unknown transport %s\n", optarg);
+					usage(argv[0]);
+					return (-1);
+			}
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		fprintf(stderr, "Error: load_bpf_file %s", bpf_log_buf);
+		return 1;
+	}
+
+	if (serveraddr && !clientaddr) {
+		printf("running server in a loop\n");
+		server(serveraddr, server_port);
+	} else if (serveraddr && clientaddr) {
+		client(clientaddr, serveraddr, server_port);
+	}
+
+	return 0;
+}
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v3 net-next 4/5] rds: invoke socket sg filter attached to rds socket
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages
arrive in form of skb (over TCP) and scatterlist (over IB/RDMA).
However, because socket filter only deal with skb (e.g. struct skb as
bpf context) we can only use socket filter for rds_tcp and not for
rds_rdma.

Considering one filtering solution for RDS, it seems that the common
denominator between sk_buff and scatterlist is scatterlist. Therefore,
this patch converts skb to sgvec and invoke sg_filter_run for
rds_tcp and simply invoke sg_filter_run for IB/rds_rdma.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/ib.c       |  1 +
 net/rds/ib.h       |  1 +
 net/rds/ib_recv.c  | 12 ++++++++++++
 net/rds/rds.h      |  2 ++
 net/rds/recv.c     | 17 +++++++++++++++++
 net/rds/tcp.c      |  2 ++
 net/rds/tcp.h      |  2 ++
 net/rds/tcp_recv.c | 38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 75 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 89c6333..6ba1f75 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -532,6 +532,7 @@ struct rds_transport rds_ib_transport = {
 	.conn_path_shutdown	= rds_ib_conn_path_shutdown,
 	.inc_copy_to_user	= rds_ib_inc_copy_to_user,
 	.inc_free		= rds_ib_inc_free,
+	.inc_to_sg_get		= rds_ib_inc_to_sg_get,
 	.cm_initiate_connect	= rds_ib_cm_initiate_connect,
 	.cm_handle_connect	= rds_ib_cm_handle_connect,
 	.cm_connect_complete	= rds_ib_cm_connect_complete,
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 73427ff..0a12b41 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -404,6 +404,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev,
 void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
 void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
 			     struct rds_ib_ack_state *state);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index d300186..2f76a91 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -219,6 +219,18 @@ void rds_ib_inc_free(struct rds_incoming *inc)
 	rds_ib_recv_cache_put(&ibinc->ii_cache_entry, &ic->i_cache_incs);
 }
 
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	*sg =  &frag->f_sg;
+
+	return 0;
+}
+
 static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
 				  struct rds_ib_recv_work *recv)
 {
diff --git a/net/rds/rds.h b/net/rds/rds.h
index c4dcf65..abcd5ce 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -542,6 +542,8 @@ struct rds_transport {
 	int (*recv_path)(struct rds_conn_path *cp);
 	int (*inc_copy_to_user)(struct rds_incoming *inc, struct iov_iter *to);
 	void (*inc_free)(struct rds_incoming *inc);
+	int (*inc_to_sg_get)(struct rds_incoming *inc, struct scatterlist **sg);
+	void (*inc_to_sg_put)(struct scatterlist **sg);
 
 	int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
 				 struct rdma_cm_event *event, bool isv6);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 504cd6b..261904c 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -292,6 +292,8 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 	struct sock *sk;
 	unsigned long flags;
 	struct rds_conn_path *cp;
+	struct sk_filter *filter;
+	int result = __SOCKSG_PASS;
 
 	inc->i_conn = conn;
 	inc->i_rx_jiffies = jiffies;
@@ -376,6 +378,21 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 	/* We can be racing with rds_release() which marks the socket dead. */
 	sk = rds_rs_to_sk(rs);
 
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		if (conn->c_trans->inc_to_sg_get) {
+			struct scatterlist *sg;
+
+			if (conn->c_trans->inc_to_sg_get(inc, &sg) == 0) {
+				result = sg_filter_run(sk, sg);
+				if (conn->c_trans->inc_to_sg_put)
+					conn->c_trans->inc_to_sg_put(&sg);
+			}
+		}
+	}
+	rcu_read_unlock();
+
 	/* serialize with rds_release -> sock_orphan */
 	write_lock_irqsave(&rs->rs_recv_lock, flags);
 	if (!sock_flag(sk, SOCK_DEAD)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2c7b7c3..35454c7 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -465,6 +465,8 @@ struct rds_transport rds_tcp_transport = {
 	.conn_path_shutdown	= rds_tcp_conn_path_shutdown,
 	.inc_copy_to_user	= rds_tcp_inc_copy_to_user,
 	.inc_free		= rds_tcp_inc_free,
+	.inc_to_sg_get		= rds_tcp_inc_to_sg_get,
+	.inc_to_sg_put		= rds_tcp_inc_to_sg_put,
 	.stats_info_copy	= rds_tcp_stats_info_copy,
 	.exit			= rds_tcp_exit,
 	.t_owner		= THIS_MODULE,
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 3c69361..b2cc910 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -82,6 +82,8 @@ void rds_tcp_restore_callbacks(struct socket *sock,
 int rds_tcp_recv_path(struct rds_conn_path *cp);
 void rds_tcp_inc_free(struct rds_incoming *inc);
 int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg);
 
 /* tcp_send.c */
 void rds_tcp_xmit_path_prepare(struct rds_conn_path *cp);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 42c5ff1..b45e69b 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -56,6 +56,44 @@ void rds_tcp_inc_free(struct rds_incoming *inc)
 	kmem_cache_free(rds_tcp_incoming_slab, tinc);
 }
 
+#define MAX_SG MAX_SKB_FRAGS
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct scatterlist *sg_list;
+	struct rds_tcp_incoming *tinc;
+	struct sk_buff *skb;
+	int num_sg = 0;
+
+	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+
+	/* For now we are assuming that the max sg elements we need is MAX_SG.
+	 * To determine actual number of sg elements we need to traverse the
+	 * skb queue e.g.
+	 *
+	 * skb_queue_walk(&tinc->ti_skb_list, skb) {
+	 *	num_sg += skb_shinfo(skb)->nr_frags + 1;
+	 * }
+	 */
+	sg_list = kzalloc(sizeof(*sg_list) * MAX_SG, GFP_KERNEL);
+	if (!sg_list)
+		return -ENOMEM;
+
+	sg_init_table(sg_list, MAX_SG);
+	skb_queue_walk(&tinc->ti_skb_list, skb) {
+		num_sg += skb_to_sgvec_nomark(skb, &sg_list[num_sg], 0,
+					      skb->len);
+	}
+	sg_mark_end(&sg_list[num_sg - 1]);
+	*sg = sg_list;
+
+	return 0;
+}
+
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg)
+{
+	kfree(*sg);
+}
+
 /*
  * this is pretty lame, but, whatever.
  */
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v3 net-next 2/5] ebpf: Add sg_filter_run()
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

When sg_filter_run() is invoked it runs the attached eBPF
prog of type BPF_PROG_TYPE_SOCKET_SG_FILTER which deals with
struct scatterlist.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/filter.h         |  8 ++++++++
 include/uapi/linux/bpf.h       |  6 ++++++
 net/core/filter.c              | 24 ++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  6 ++++++
 4 files changed, 44 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5d565c5..9f1f7c1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1112,4 +1112,12 @@ struct bpf_sock_ops_kern {
 					 */
 };
 
+enum __socksg_action {
+	__SOCKSG_DROP = 0,
+	__SOCKSG_PASS,
+	__SOCKSG_REDIRECT,
+};
+
+int sg_filter_run(struct sock *sk, struct scatterlist *sg);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6ec1e32..d1d0ceb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+enum socksg_action {
+	SOCKSG_DROP = 0,
+	SOCKSG_PASS,
+	SOCKSG_REDIRECT,
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
diff --git a/net/core/filter.c b/net/core/filter.c
index cec3807..e427c8e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -121,6 +121,30 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 }
 EXPORT_SYMBOL(sk_filter_trim_cap);
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg)
+{
+	struct sk_filter *filter;
+	int result;
+
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		struct sk_msg_buff mb = {0};
+
+		memcpy(mb.sg_data, sg, sizeof(*sg) * MAX_SKB_FRAGS);
+		mb.sg_start = 0;
+		mb.sg_end = sg_nents(sg) - 1;
+		mb.data = sg_virt(sg);
+		mb.data_end = mb.data + sg->length;
+		mb.sg_copy[mb.sg_end] = true;
+		result = BPF_PROG_RUN(filter->prog, &mb);
+	}
+	rcu_read_unlock();
+
+	return result;
+}
+EXPORT_SYMBOL(sg_filter_run);
+
 BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
 {
 	return skb_get_poff(skb);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6ec1e32..d1d0ceb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+enum socksg_action {
+	SOCKSG_DROP = 0,
+	SOCKSG_PASS,
+	SOCKSG_REDIRECT,
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v3 net-next 3/5] ebpf: fix bpf_msg_pull_data
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

Like sockmap (sk_msg), socksg also deals with struct scatterlist
therefore socksg programs can use existing bpf helper bpf_msg_pull_data
to access packet data contained in struct scatterlist. While doing some
prelimnary testing, there are couple of issues found with
bpf_msg_pull_data that are fixed in this patch.

Also, there cannot be more than MAX_SKB_FRAGS entries in sg_data
therefore any checks for sg entry more than MAX_SKB_FRAGS in
bpf_msg_pull_data() is removed.

Besides that, I also ran into issues while put_page() is invoked.
e.g.
[ 450.568723] BUG: Bad page state in process swapper/10 pfn:2021540
[ 450.575632] page:ffffea0080855000 count:0 mapcount:0
mapping:ffff88103d006840 index:0xffff882021540000 compound_mapcount: 0
[ 450.588069] flags: 0x6fffff80008100(slab|head)
[ 450.593033] raw: 006fffff80008100 dead000000000100 dead000000000200
ffff88103d006840
[ 450.601683] raw: ffff882021540000 0000000080080007 00000000ffffffff
0000000000000000
[ 450.610337] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[ 450.617530] bad because of flags: 0x100(slab)

To avoid above issue, currently put_page() is disabled in this patch
temporarily. I am working on alternatives so that page allocated via
slab (in this case) can be freed without any issue.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/core/filter.c | 61 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 32 insertions(+), 29 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e427c8e..cc52baa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2316,7 +2316,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 BPF_CALL_4(bpf_msg_pull_data,
 	   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
 {
-	unsigned int len = 0, offset = 0, copy = 0;
+	unsigned int len = 0, offset = 0, copy = 0, off = 0;
 	struct scatterlist *sg = msg->sg_data;
 	int first_sg, last_sg, i, shift;
 	unsigned char *p, *to, *from;
@@ -2330,22 +2330,28 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	i = msg->sg_start;
 	do {
 		len = sg[i].length;
-		offset += len;
 		if (start < offset + len)
 			break;
+		offset += len;
 		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
-	} while (i != msg->sg_end);
+	} while (i <= msg->sg_end);
 
+	/* return error if start is out of range */
 	if (unlikely(start >= offset + len))
 		return -EINVAL;
 
-	if (!msg->sg_copy[i] && bytes <= len)
-		goto out;
+	/* return error if i is last entry in sglist and end is out of range */
+	if (msg->sg_copy[i] && end > offset + len)
+		return -EINVAL;
 
 	first_sg = i;
 
+	/* if i is not last entry in sg list and end (i.e start + bytes) is
+	 * within this sg[i] then goto out and calculate data and data_end
+	 */
+	if (!msg->sg_copy[i] && end <= offset + len)
+		goto out;
+
 	/* At this point we need to linearize multiple scatterlist
 	 * elements or a single shared page. Either way we need to
 	 * copy into a linear buffer exclusively owned by BPF. Then
@@ -2359,11 +2365,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	do {
 		copy += sg[i].length;
 		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
-		if (bytes < copy)
+		if (end < copy)
 			break;
-	} while (i != msg->sg_end);
+	} while (i <= msg->sg_end);
+
+	/* return error if i is last entry in sglist and end is out of range */
+	if (i > msg->sg_end && end > offset + copy)
+		return -EINVAL;
+
 	last_sg = i;
 
 	if (unlikely(copy < end - start))
@@ -2373,23 +2382,25 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	if (unlikely(!page))
 		return -ENOMEM;
 	p = page_address(page);
-	offset = 0;
 
 	i = first_sg;
 	do {
 		from = sg_virt(&sg[i]);
 		len = sg[i].length;
-		to = p + offset;
+		to = p + off;
 
 		memcpy(to, from, len);
-		offset += len;
+		off += len;
 		sg[i].length = 0;
-		put_page(sg_page(&sg[i]));
+		/* if original page is allocated via slab then put_page
+		 * causes error BUG: Bad page state in process. So temporarily
+		 * disabled put_page.
+		 * Todo: fix it
+		 */
+		//put_page(sg_page(&sg[i]));
 
 		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
-	} while (i != last_sg);
+	} while (i < last_sg);
 
 	sg[first_sg].length = copy;
 	sg_set_page(&sg[first_sg], page, copy, 0);
@@ -2406,12 +2417,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	do {
 		int move_from;
 
-		if (i + shift >= MAX_SKB_FRAGS)
-			move_from = i + shift - MAX_SKB_FRAGS;
-		else
-			move_from = i + shift;
-
-		if (move_from == msg->sg_end)
+		move_from = i + shift;
+		if (move_from > msg->sg_end)
 			break;
 
 		sg[i] = sg[move_from];
@@ -2420,14 +2427,10 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 		sg[move_from].offset = 0;
 
 		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
 	} while (1);
 	msg->sg_end -= shift;
-	if (msg->sg_end < 0)
-		msg->sg_end += MAX_SKB_FRAGS;
 out:
-	msg->data = sg_virt(&sg[i]) + start - offset;
+	msg->data = sg_virt(&sg[first_sg]) + start - offset;
 	msg->data_end = msg->data + bytes;
 
 	return 0;
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v3 net-next 1/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev
In-Reply-To: <1534547305-25140-1-git-send-email-tushar.n.dave@oracle.com>

Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
existing socket filter infrastructure for bpf program attach and load.
SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
contrast to SOCKET_FILTER which deals with struct skb. This is useful
for kernel entities that don't have skb to represent packet data but
want to run eBPF socket filter on packet data that is in form of struct
scatterlist e.g. IB/RDMA

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/bpf_types.h      |  1 +
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           |  1 +
 kernel/bpf/verifier.c          |  1 +
 net/core/filter.c              | 55 ++++++++++++++++++++++++++++++++++++++++--
 samples/bpf/bpf_load.c         | 11 ++++++---
 tools/bpf/bpftool/prog.c       |  1 +
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/libbpf.c         |  3 +++
 tools/lib/bpf/libbpf.h         |  2 ++
 10 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c09..7dc1503 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -16,6 +16,7 @@
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8339d81..160cdb2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1362,6 +1362,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
+	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
 	    !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ca90679..5abc788 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		if (meta)
 			return meta->pkt_access;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index fd423ce..cec3807 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
+	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
+	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
 		bpf_prog_put(prog);
 	} else {
 		bpf_release_orig_filter(prog);
@@ -1539,10 +1540,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 
 static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
 {
+	struct bpf_prog *prog;
+
 	if (sock_flag(sk, SOCK_FILTER_LOCKED))
 		return ERR_PTR(-EPERM);
 
-	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	if (IS_ERR(prog))
+		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
+
+	return prog;
 }
 
 int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -4920,6 +4927,17 @@ bool bpf_helper_changes_pkt_data(void *func)
 }
 
 static const struct bpf_func_proto *
+socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_msg_pull_data:
+		return &bpf_msg_pull_data_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
@@ -6738,6 +6756,30 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 socksg_filter_convert_ctx_access(enum bpf_access_type type,
+					    const struct bpf_insn *si,
+					    struct bpf_insn *insn_buf,
+					    struct bpf_prog *prog,
+					    u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct sk_msg_md, data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data));
+		break;
+	case offsetof(struct sk_msg_md, data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 				     const struct bpf_insn *si,
 				     struct bpf_insn *insn_buf,
@@ -6876,6 +6918,15 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops socksg_filter_verifier_ops = {
+	.get_func_proto         = socksg_filter_func_proto,
+	.is_valid_access        = sk_msg_is_valid_access,
+	.convert_ctx_access     = socksg_filter_convert_ctx_access,
+};
+
+const struct bpf_prog_ops socksg_filter_prog_ops = {
+};
+
 const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
 	.get_func_proto		= tc_cls_act_func_proto,
 	.is_valid_access	= tc_cls_act_is_valid_access,
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 904e775..3b1697d 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -69,6 +69,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_sockops = strncmp(event, "sockops", 7) == 0;
 	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
 	bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+	bool is_socksg = strncmp(event, "socksg", 6) == 0;
+
 	size_t insns_cnt = size / sizeof(struct bpf_insn);
 	enum bpf_prog_type prog_type;
 	char buf[256];
@@ -102,6 +104,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SK_SKB;
 	} else if (is_sk_msg) {
 		prog_type = BPF_PROG_TYPE_SK_MSG;
+	} else if (is_socksg) {
+		prog_type = BPF_PROG_TYPE_SOCKET_SG_FILTER;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -122,8 +126,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
 		return 0;
 
-	if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-		if (is_socket)
+	if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_socksg) {
+		if (is_socket || is_socksg)
 			event += 6;
 		else
 			event += 7;
@@ -627,7 +631,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		    memcmp(shname, "cgroup/", 7) == 0 ||
 		    memcmp(shname, "sockops", 7) == 0 ||
 		    memcmp(shname, "sk_skb", 6) == 0 ||
-		    memcmp(shname, "sk_msg", 6) == 0) {
+		    memcmp(shname, "sk_msg", 6) == 0 ||
+		    memcmp(shname, "socksg", 6) == 0) {
 			ret = load_and_attach(shname, data->d_buf,
 					      data->d_size);
 			if (ret != 0)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d..9c57c4e 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_SOCKET_SG_FILTER] = "socket_sg_filter",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abd0f1..a7ac51c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -2077,6 +2078,7 @@ static bool bpf_program__is_type(struct bpf_program *prog,
 BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT);
 BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
 BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
+BPF_PROG_TYPE_FNS(socket_sg_filter, BPF_PROG_TYPE_SOCKET_SG_FILTER);
 
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type)
@@ -2129,6 +2131,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 	BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG),
 	BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND),
 	BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND),
+	BPF_PROG_SEC("socksg",          BPF_PROG_TYPE_SOCKET_SG_FILTER),
 };
 
 #undef BPF_PROG_SEC
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 96c55fa..7527ea4 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -208,6 +208,7 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance,
 void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type);
+int bpf_program__set_socket_sg_filter(struct bpf_program *prog);
 
 bool bpf_program__is_socket_filter(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
@@ -217,6 +218,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 bool bpf_program__is_sched_act(struct bpf_program *prog);
 bool bpf_program__is_xdp(struct bpf_program *prog);
 bool bpf_program__is_perf_event(struct bpf_program *prog);
+bool bpf_program__is_socket_sg_filter(struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v3 net-next 0/5] eBPF and struct scatterlist
From: Tushar Dave @ 2018-08-17 23:08 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
	santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
	sandipan, kafai, rdna, yhs, netdev

This is v3 of the RFC sent earlier,
(https://patchwork.ozlabs.org/cover/931785/).

v2->v3:
- As per the review feedback received, this patchset reuses as much code
as possible from sockmap/sk_msg. e.g. it uses existing struct
sk_msg_buff, struct sk_msg_md, sk_msg_convert_ctx_access and part of
code from sk_msg_convert_ctx_access.

- bpf helper bpf_msg_pull_data() is used to access packet data. Some
issues found with bpf_msg_pull_data() are therefore fixed in patch 3.

- A feedback was given that unprivileged user can attach a new
BPF_PROG_TYPE_SOCKET_SG_FILTER to a non-rds socket e.g. normal tcp/udp
through the SO_ATTACH_BPF sockopt, where input context is skb instead of
sg list and can cause issues. However, I found that as an unprivileged,
user can attach any kind of eBPF program to socket using SO_ATTACH_BPF,
not only socksg. But if eBPF program is faulty, kernel BPF verifier take
care of it and invalidate any access to kernel data, doesn't let eBPF
program to run.

- socksg programs now returns action code (e.g. SOCKSG_PASS etc,.).

Background:
The motivation for this work is to allow eBPF based firewalling for
kernel modules that do not always get their packet as an sk_buff from
their downlink drivers. One such instance of this use-case is RDS, which
can be run both over IB (driver RDMA's a scatterlist to the RDS module)
or over TCP (TCP passes an sk_buff to the RDS module).

This patchset uses exiting socket filter infrastructure and extend it
with new eBPF program type that deals with struct scatterlist.
Existing bpf helper bpf_msg_pull_data() is used to inspect packet data
that are in form struct scatterlist. For RDS, the integrated approach
treats the scatterlist as the common denominator, and allows the
application to write a filter for processing a scatterlist.

Details:
Patch 1 adds new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which
uses the existing socket filter infrastructure for bpf program attach
and load. eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER deals with
struct scatterlist as bpf context contrast to
BPF_PROG_TYPE_SOCKET_FILTER which deals with struct skb. This new eBPF
program type allow socket filter to run on packet data that is in form
of struct scatterlist.

Patch 2 adds sg_filter_run() that runs BPF_PROG_TYPE_SOCKET_SG_FILTER.

Patch 3 fixes bpf_msg_pull_data() for the bugs that were found while
doing some experiment with different size of packets.

patch 4 allows rds_recv_incoming to invoke socket filter program which
deals with struct scatterlist.

Patch 5 adds socket filter eBPF sample program that uses patches 1 to 4.
The sample program opens an rds socket, attach ebpf program
(socksg i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to rds socket and uses
bpf_msg_pull_data() helper to inspect RDS packet data. For a test,
current sample program only prints first few bytes of packet data.

Testing:
To confirm data accuracy and results, RDS packets of various sizes has
been tested with socksg program along with various start and end values
for bpf_msg_pull_data(). All such tests shows accurate results.

Thanks.

-Tushar

Tushar Dave (5):
  eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  ebpf: Add sg_filter_run()
  ebpf: fix bpf_msg_pull_data
  rds: invoke socket sg filter attached to rds socket
  ebpf: Add sample ebpf program for SOCKET_SG_FILTER

 include/linux/bpf_types.h      |   1 +
 include/linux/filter.h         |   8 +
 include/uapi/linux/bpf.h       |   7 +
 kernel/bpf/syscall.c           |   1 +
 kernel/bpf/verifier.c          |   1 +
 net/core/filter.c              | 140 +++++++++++++----
 net/rds/ib.c                   |   1 +
 net/rds/ib.h                   |   1 +
 net/rds/ib_recv.c              |  12 ++
 net/rds/rds.h                  |   2 +
 net/rds/recv.c                 |  17 +++
 net/rds/tcp.c                  |   2 +
 net/rds/tcp.h                  |   2 +
 net/rds/tcp_recv.c             |  38 +++++
 samples/bpf/Makefile           |   3 +
 samples/bpf/bpf_load.c         |  11 +-
 samples/bpf/rds_filter_kern.c  |  42 +++++
 samples/bpf/rds_filter_user.c  | 339 +++++++++++++++++++++++++++++++++++++++++
 tools/bpf/bpftool/prog.c       |   1 +
 tools/include/uapi/linux/bpf.h |   7 +
 tools/lib/bpf/libbpf.c         |   3 +
 tools/lib/bpf/libbpf.h         |   2 +
 22 files changed, 607 insertions(+), 34 deletions(-)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

-- 
1.8.3.1

^ permalink raw reply

* Re: [PATCH bpf] bpf: fix redirect to map under tail calls
From: Alexei Starovoitov @ 2018-08-17 22:59 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: john.fastabend, netdev
In-Reply-To: <20180817212614.2423-1-daniel@iogearbox.net>

On Fri, Aug 17, 2018 at 11:26:14PM +0200, Daniel Borkmann wrote:
> Commits 109980b894e9 ("bpf: don't select potentially stale ri->map
> from buggy xdp progs") and 7c3001313396 ("bpf: fix ri->map_owner
> pointer on bpf_prog_realloc") tried to mitigate that buggy programs
> using bpf_redirect_map() helper call do not leave stale maps behind.
> Idea was to add a map_owner cookie into the per CPU struct redirect_info
> which was set to prog->aux by the prog making the helper call as a
> proof that the map is not stale since the prog is implicitly holding
> a reference to it. This owner cookie could later on get compared with
> the program calling into BPF whether they match and therefore the
> redirect could proceed with processing the map safely.
> 
> In (obvious) hindsight, this approach breaks down when tail calls are
> involved since the original caller's prog->aux pointer does not have
> to match the one from one of the progs out of the tail call chain,
> and therefore the xdp buffer will be dropped instead of redirected.
> A way around that would be to fix the issue differently (which also
> allows to remove related work in fast path at the same time): once
> the life-time of a redirect map has come to its end we use it's map
> free callback where we need to wait on synchronize_rcu() for current
> outstanding xdp buffers and remove such a map pointer from the
> redirect info if found to be present. At that time no program is
> using this map anymore so we simply invalidate the map pointers to
> NULL iff they previously pointed to that instance while making sure
> that the redirect path only reads out the map once.
> 
> Fixes: 97f91a7cf04f ("bpf: add bpf_redirect_map helper routine")
> Fixes: 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
> Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

Applied, Thanks

^ permalink raw reply

* Re: [PATCH net-next v1] net/tls: Add support for async decryption of tls records
From: Dave Watson @ 2018-08-17 22:12 UTC (permalink / raw)
  To: Vakul Garg; +Cc: netdev, borisp, aviadye, davem
In-Reply-To: <20180816151906.17624-1-vakul.garg@nxp.com>

On 08/16/18 08:49 PM, Vakul Garg wrote:
> Changes since RFC version:
> 	1) Improved commit message.
> 	2) Fixed dequeued record offset handling because of which few of
> 	   tls selftests 'recv_partial, recv_peek, recv_peek_multiple' were failing.

Thanks! Commit message much more clear, tests work great for me also,
only minor comments on clarity

> -			if (tls_sw_advance_skb(sk, skb, chunk)) {
> +			if (async) {
> +				/* Finished with current record, pick up next */
> +				ctx->recv_pkt = NULL;
> +				__strp_unpause(&ctx->strp);
> +				goto mark_eor_chk_ctrl;

Control flow is a little hard to follow here, maybe just pass an async
flag to tls_sw_advance_skb?  It already does strp_unpause and recv_pkt
= NULL.  

> +			} else if (tls_sw_advance_skb(sk, skb, chunk)) {
>  				/* Return full control message to
>  				 * userspace before trying to parse
>  				 * another message type
>  				 */
> +mark_eor_chk_ctrl:
>  				msg->msg_flags |= MSG_EOR;
>  				if (control != TLS_RECORD_TYPE_DATA)
>  					goto recv_end;
> +			} else {
> +				break;

I don't see the need for the else { break; }, isn't this already
covered by while(len); below as before?

^ permalink raw reply

* Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Daniel Borkmann @ 2018-08-17 22:06 UTC (permalink / raw)
  To: Peter Robinson
  Cc: Russell King - ARM Linux, Marc Haber, linux-arm-kernel, netdev,
	labbott, Eric Dumazet
In-Reply-To: <CALeDE9N=EjaHAqOrw8Hi=EK15f65j0bKFOSbFZaLtjSG20Aytg@mail.gmail.com>

On 08/17/2018 11:13 PM, Peter Robinson wrote:
> On Fri, Aug 17, 2018 at 7:30 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 08/17/2018 06:17 PM, Russell King - ARM Linux wrote:
>>> On Fri, Aug 17, 2018 at 02:40:19PM +0200, Daniel Borkmann wrote:
>>>> I'd have one potential bug suspicion, for the 4.18 one you were trying,
>>>> could you run with the below patch to see whether it would help?
>>>
>>> I think this is almost certainly the problem - looking at the history,
>>> it seems that the "-4" was assumed to be part of the scratch stuff in
>>> commit 38ca93060163 ("bpf, arm32: save 4 bytes of unneeded stack space")
>>> but it isn't - it's because "off" of zero refers to the top word in the
>>> stack (iow at STACK_SIZE-4).
>>
>> Yeah agree, my thinking as well (albeit bit late, sigh, sorry about that).
>> Waiting for Peter to get back with results for definite confirmation. Your
>> rework in 1c35ba122d4a ("ARM: net: bpf: use negative numbers for stacked
>> registers") and 96cced4e774a ("ARM: net: bpf: access eBPF scratch space using
>> ARM FP register") fixes this in mainline, so unless I'm missing something this
>> would only need a stand-alone fix for 4.18/stable which I can cook up and
>> submit then.
> 
> I can confirm that fixes the problems I was seeing on Fedora 29.
> 
> Feel free to add a tested by from me:
> 
> Tested-by: Peter Robinson <pbrobinson@gmail.com>

Great, thanks everyone! Will get it out asap.

^ permalink raw reply

* RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
From: Steve Wise @ 2018-08-17 21:28 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Max Gurtovoy',
	'Jason Gunthorpe'
  Cc: 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'
In-Reply-To: <ed1ea546-e7aa-3c17-710b-8c95fe90deea@grimberg.me>

> 
> 
> > Hey Sagi,
> >
> > The patch works allowing connections for the various affinity mappings
> below:
> >
> > One comp_vector per core across all cores, starting with numa-local cores:
> 
> Thanks Steve, is this your "Tested by:" tag?

Sure:

Tested-by: Steve Wise <swise@opengridcomputing.com>

^ permalink raw reply

* [PATCH bpf] bpf: fix redirect to map under tail calls
From: Daniel Borkmann @ 2018-08-17 21:26 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: john.fastabend, netdev, Daniel Borkmann

Commits 109980b894e9 ("bpf: don't select potentially stale ri->map
from buggy xdp progs") and 7c3001313396 ("bpf: fix ri->map_owner
pointer on bpf_prog_realloc") tried to mitigate that buggy programs
using bpf_redirect_map() helper call do not leave stale maps behind.
Idea was to add a map_owner cookie into the per CPU struct redirect_info
which was set to prog->aux by the prog making the helper call as a
proof that the map is not stale since the prog is implicitly holding
a reference to it. This owner cookie could later on get compared with
the program calling into BPF whether they match and therefore the
redirect could proceed with processing the map safely.

In (obvious) hindsight, this approach breaks down when tail calls are
involved since the original caller's prog->aux pointer does not have
to match the one from one of the progs out of the tail call chain,
and therefore the xdp buffer will be dropped instead of redirected.
A way around that would be to fix the issue differently (which also
allows to remove related work in fast path at the same time): once
the life-time of a redirect map has come to its end we use it's map
free callback where we need to wait on synchronize_rcu() for current
outstanding xdp buffers and remove such a map pointer from the
redirect info if found to be present. At that time no program is
using this map anymore so we simply invalidate the map pointers to
NULL iff they previously pointed to that instance while making sure
that the redirect path only reads out the map once.

Fixes: 97f91a7cf04f ("bpf: add bpf_redirect_map helper routine")
Fixes: 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/filter.h     |  3 +-
 include/trace/events/xdp.h |  5 ++--
 kernel/bpf/cpumap.c        |  2 ++
 kernel/bpf/devmap.c        |  1 +
 kernel/bpf/verifier.c      | 21 --------------
 kernel/bpf/xskmap.c        |  1 +
 net/core/filter.c          | 68 ++++++++++++++++++++--------------------------
 7 files changed, 38 insertions(+), 63 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5d565c5..6791a0a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -543,7 +543,6 @@ struct bpf_redirect_info {
 	u32 flags;
 	struct bpf_map *map;
 	struct bpf_map *map_to_flush;
-	unsigned long   map_owner;
 	u32 kern_flags;
 };
 
@@ -781,6 +780,8 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
+void bpf_clear_redirect_map(struct bpf_map *map);
+
 static inline bool xdp_return_frame_no_direct(void)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 1ecf4c6..e95cb86 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -147,9 +147,8 @@ struct _bpf_dtab_netdev {
 
 #define devmap_ifindex(fwd, map)				\
 	(!fwd ? 0 :						\
-	 (!map ? 0 :						\
-	  ((map->map_type == BPF_MAP_TYPE_DEVMAP) ?		\
-	   ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0)))
+	 ((map->map_type == BPF_MAP_TYPE_DEVMAP) ?		\
+	  ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0))
 
 #define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx)		\
 	 trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map),	\
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 620bc50..24aac0d 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -479,6 +479,8 @@ static void cpu_map_free(struct bpf_map *map)
 	 * It does __not__ ensure pending flush operations (if any) are
 	 * complete.
 	 */
+
+	bpf_clear_redirect_map(map);
 	synchronize_rcu();
 
 	/* To ensure all pending flush operations have completed wait for flush
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index ac1df79..141710b 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -161,6 +161,7 @@ static void dev_map_free(struct bpf_map *map)
 	list_del_rcu(&dtab->list);
 	spin_unlock(&dev_map_lock);
 
+	bpf_clear_redirect_map(map);
 	synchronize_rcu();
 
 	/* To ensure all pending flush operations have completed wait for flush
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ca90679..9224611 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5844,27 +5844,6 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			goto patch_call_imm;
 		}
 
-		if (insn->imm == BPF_FUNC_redirect_map) {
-			/* Note, we cannot use prog directly as imm as subsequent
-			 * rewrites would still change the prog pointer. The only
-			 * stable address we can use is aux, which also works with
-			 * prog clones during blinding.
-			 */
-			u64 addr = (unsigned long)prog->aux;
-			struct bpf_insn r4_ld[] = {
-				BPF_LD_IMM64(BPF_REG_4, addr),
-				*insn,
-			};
-			cnt = ARRAY_SIZE(r4_ld);
-
-			new_prog = bpf_patch_insn_data(env, i + delta, r4_ld, cnt);
-			if (!new_prog)
-				return -ENOMEM;
-
-			delta    += cnt - 1;
-			env->prog = prog = new_prog;
-			insn      = new_prog->insnsi + i + delta;
-		}
 patch_call_imm:
 		fn = env->ops->get_func_proto(insn->imm, env->prog);
 		/* all functions that have prototype and verifier allowed
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 4ddf61e1..9f8463a 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -75,6 +75,7 @@ static void xsk_map_free(struct bpf_map *map)
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
 	int i;
 
+	bpf_clear_redirect_map(map);
 	synchronize_net();
 
 	for (i = 0; i < map->max_entries; i++) {
diff --git a/net/core/filter.c b/net/core/filter.c
index fd423ce..c25eb36 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3246,31 +3246,33 @@ static void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
 	}
 }
 
-static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
-				   unsigned long aux)
+void bpf_clear_redirect_map(struct bpf_map *map)
 {
-	return (unsigned long)xdp_prog->aux != aux;
+	struct bpf_redirect_info *ri;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
+		/* Avoid polluting remote cacheline due to writes if
+		 * not needed. Once we pass this test, we need the
+		 * cmpxchg() to make sure it hasn't been changed in
+		 * the meantime by remote CPU.
+		 */
+		if (unlikely(READ_ONCE(ri->map) == map))
+			cmpxchg(&ri->map, map, NULL);
+	}
 }
 
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
-			       struct bpf_prog *xdp_prog)
+			       struct bpf_prog *xdp_prog, struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	unsigned long map_owner = ri->map_owner;
-	struct bpf_map *map = ri->map;
 	u32 index = ri->ifindex;
 	void *fwd = NULL;
 	int err;
 
 	ri->ifindex = 0;
-	ri->map = NULL;
-	ri->map_owner = 0;
-
-	if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
-		err = -EFAULT;
-		map = NULL;
-		goto err;
-	}
+	WRITE_ONCE(ri->map, NULL);
 
 	fwd = __xdp_map_lookup_elem(map, index);
 	if (!fwd) {
@@ -3296,12 +3298,13 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_map *map = READ_ONCE(ri->map);
 	struct net_device *fwd;
 	u32 index = ri->ifindex;
 	int err;
 
-	if (ri->map)
-		return xdp_do_redirect_map(dev, xdp, xdp_prog);
+	if (map)
+		return xdp_do_redirect_map(dev, xdp, xdp_prog, map);
 
 	fwd = dev_get_by_index_rcu(dev_net(dev), index);
 	ri->ifindex = 0;
@@ -3325,24 +3328,17 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
-				       struct bpf_prog *xdp_prog)
+				       struct bpf_prog *xdp_prog,
+				       struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	unsigned long map_owner = ri->map_owner;
-	struct bpf_map *map = ri->map;
 	u32 index = ri->ifindex;
 	void *fwd = NULL;
 	int err = 0;
 
 	ri->ifindex = 0;
-	ri->map = NULL;
-	ri->map_owner = 0;
+	WRITE_ONCE(ri->map, NULL);
 
-	if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
-		err = -EFAULT;
-		map = NULL;
-		goto err;
-	}
 	fwd = __xdp_map_lookup_elem(map, index);
 	if (unlikely(!fwd)) {
 		err = -EINVAL;
@@ -3379,13 +3375,14 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_map *map = READ_ONCE(ri->map);
 	u32 index = ri->ifindex;
 	struct net_device *fwd;
 	int err = 0;
 
-	if (ri->map)
-		return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog);
-
+	if (map)
+		return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog,
+						   map);
 	ri->ifindex = 0;
 	fwd = dev_get_by_index_rcu(dev_net(dev), index);
 	if (unlikely(!fwd)) {
@@ -3416,8 +3413,7 @@ BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 
 	ri->ifindex = ifindex;
 	ri->flags = flags;
-	ri->map = NULL;
-	ri->map_owner = 0;
+	WRITE_ONCE(ri->map, NULL);
 
 	return XDP_REDIRECT;
 }
@@ -3430,8 +3426,8 @@ static const struct bpf_func_proto bpf_xdp_redirect_proto = {
 	.arg2_type      = ARG_ANYTHING,
 };
 
-BPF_CALL_4(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex, u64, flags,
-	   unsigned long, map_owner)
+BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex,
+	   u64, flags)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
@@ -3440,15 +3436,11 @@ BPF_CALL_4(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex, u64, flags
 
 	ri->ifindex = ifindex;
 	ri->flags = flags;
-	ri->map = map;
-	ri->map_owner = map_owner;
+	WRITE_ONCE(ri->map, map);
 
 	return XDP_REDIRECT;
 }
 
-/* Note, arg4 is hidden from users and populated by the verifier
- * with the right pointer.
- */
 static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.func           = bpf_xdp_redirect_map,
 	.gpl_only       = false,
-- 
2.9.5

^ permalink raw reply related

* Re: [GIT PULL] 9p updates for 4.19
From: Linus Torvalds @ 2018-08-18  0:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dominique Martinet, V9FS Developers, Linux Kernel Mailing List,
	Network Development
In-Reply-To: <CA+55aFxzuLnK7E3hz_YZSNjjJ7TeMTmPrpn4rhHGjgrY2DgTtQ@mail.gmail.com>

On Fri, Aug 17, 2018 at 4:41 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Aug 17, 2018 at 3:41 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Please do.  I'll actually be sending you the MAINTAINERS update in
> > about 55 seconds.
>
> Heh. Ok, that resolves my biggest issue with the pull request.

.. and with that patch-bomb from Andrew out of the way, I've now
merged the 9p pull request. It's going through my basic build tests
before I push it out, but expect that in minutes.

                   Linus

^ permalink raw reply

* Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Peter Robinson @ 2018-08-17 21:15 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Daniel Borkmann, Marc Haber, labbott, Eric Dumazet, netdev,
	Russell King - ARM Linux, linux-arm-kernel
In-Reply-To: <1425110527.374417.1534531889127@email.1und1.de>

Hi Stefan,

>> On 08/17/2018 06:17 PM, Russell King - ARM Linux wrote:
>> > On Fri, Aug 17, 2018 at 02:40:19PM +0200, Daniel Borkmann wrote:
>> >> I'd have one potential bug suspicion, for the 4.18 one you were trying,
>> >> could you run with the below patch to see whether it would help?
>> >
>> > I think this is almost certainly the problem - looking at the history,
>> > it seems that the "-4" was assumed to be part of the scratch stuff in
>> > commit 38ca93060163 ("bpf, arm32: save 4 bytes of unneeded stack space")
>> > but it isn't - it's because "off" of zero refers to the top word in the
>> > stack (iow at STACK_SIZE-4).
>>
>> Yeah agree, my thinking as well (albeit bit late, sigh, sorry about that).
>> Waiting for Peter to get back with results for definite confirmation. Your
>> rework in 1c35ba122d4a ("ARM: net: bpf: use negative numbers for stacked
>> registers") and 96cced4e774a ("ARM: net: bpf: access eBPF scratch space using
>> ARM FP register") fixes this in mainline, so unless I'm missing something this
>> would only need a stand-alone fix for 4.18/stable which I can cook up and
>> submit then.
>
> i was able to reproduce this issue on RPi 3 with Linux 4.18.1 + multi_v7_defconfig and the following  config changes:
>
>  --- a/arch/arm/configs/multi_v7_defconfig
> +++ b/arch/arm/configs/multi_v7_defconfig
> @@ -2,7 +2,10 @@ CONFIG_SYSVIPC=y
>  CONFIG_NO_HZ=y
>  CONFIG_HIGH_RES_TIMERS=y
>  CONFIG_CGROUPS=y
> +CONFIG_CGROUP_BPF=y
>  CONFIG_BLK_DEV_INITRD=y
> +CONFIG_BPF_SYSCALL=y
> +CONFIG_BPF_JIT_ALWAYS_ON=y
>  CONFIG_EMBEDDED=y
>  CONFIG_PERF_EVENTS=y
>  CONFIG_MODULES=y
> @@ -153,6 +156,8 @@ CONFIG_IPV6_MIP6=m
>  CONFIG_IPV6_TUNNEL=m
>  CONFIG_IPV6_MULTIPLE_TABLES=y
>  CONFIG_NET_DSA=m
> +CONFIG_BPF_JIT=y
> +CONFIG_BPF_STREAM_PARSER=y
>  CONFIG_CAN=y
>  CONFIG_CAN_AT91=m
>  CONFIG_CAN_FLEXCAN=m
>
> After applying the "-4" patch the oopses doesn't appear during boot anymore.

Would be fab to get that into the kernel so this is widely tested
moving forward.

Peter

^ permalink raw reply

* Re: [offlist] Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Peter Robinson @ 2018-08-17 21:13 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Russell King - ARM Linux, Marc Haber, linux-arm-kernel, netdev,
	labbott, Eric Dumazet
In-Reply-To: <adf49ea9-09a3-80f8-8c85-a62d028e21a3@iogearbox.net>

On Fri, Aug 17, 2018 at 7:30 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 08/17/2018 06:17 PM, Russell King - ARM Linux wrote:
>> On Fri, Aug 17, 2018 at 02:40:19PM +0200, Daniel Borkmann wrote:
>>> I'd have one potential bug suspicion, for the 4.18 one you were trying,
>>> could you run with the below patch to see whether it would help?
>>
>> I think this is almost certainly the problem - looking at the history,
>> it seems that the "-4" was assumed to be part of the scratch stuff in
>> commit 38ca93060163 ("bpf, arm32: save 4 bytes of unneeded stack space")
>> but it isn't - it's because "off" of zero refers to the top word in the
>> stack (iow at STACK_SIZE-4).
>
> Yeah agree, my thinking as well (albeit bit late, sigh, sorry about that).
> Waiting for Peter to get back with results for definite confirmation. Your
> rework in 1c35ba122d4a ("ARM: net: bpf: use negative numbers for stacked
> registers") and 96cced4e774a ("ARM: net: bpf: access eBPF scratch space using
> ARM FP register") fixes this in mainline, so unless I'm missing something this
> would only need a stand-alone fix for 4.18/stable which I can cook up and
> submit then.

I can confirm that fixes the problems I was seeing on Fedora 29.

Feel free to add a tested by from me:

Tested-by: Peter Robinson <pbrobinson@gmail.com>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox