Netdev List

* Re: [PATCH v2 net-next 1/2] net: dsa: lan9303: Move tag setup to new lan9303_setup_tagging
From: Egil Hjelmeland @ 2017-10-10 15:30 UTC
  To: Woojung.Huh, andrew, vivien.didelot, f.fainelli, netdev,
	linux-kernel
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D40B48C3D@CHN-SV-EXMX02.mchp-main.com>

On 10. okt. 2017 17:14, Woojung.Huh@microchip.com wrote:
>> +/* forward special tagged packets from port 0 to port 1 *or* port 2 */
>> +static int lan9303_setup_tagging(struct lan9303 *chip)
>> +{
>> +	int ret;
>> +	u32 val;
>> +	/* enable defining the destination port via special VLAN tagging
>> +	 * for port 0
>> +	 */
>> +	ret = lan9303_write_switch_reg(chip, LAN9303_SWE_INGRESS_PORT_TYPE,
>> +				       LAN9303_SWE_INGRESS_PORT_TYPE_VLAN);
>> +	if (ret)
>> +		return ret;
>> +
>> +	/* tag incoming packets at port 1 and 2 on their way to port 0 to be
>> +	 * able to discover their source port
>> +	 */
>> +	val = LAN9303_BM_EGRSS_PORT_TYPE_SPECIAL_TAG_PORT0;
>> +	return lan9303_write_switch_reg(chip, LAN9303_BM_EGRSS_PORT_TYPE, val);
> Is there a specific reason to use val rather than passing
> LAN9303_BM_EGRSS_PORT_TYPE_SPECIAL_TAG_PORT0 directly, like the previous call?
> 
The specific reason was to please a reviewer who did not like my
indentation in the first version. I did not agree with him, but since
nobody else spoke up, I changed the code.

>> @@ -644,6 +648,10 @@ static int lan9303_setup(struct dsa_switch *ds)
>>   		return -EINVAL;
>>   	}
>>
>> +	ret = lan9303_setup_tagging(chip);
>> +	if (ret)
>> +		dev_err(chip->dev, "failed to setup port tagging %d\n", ret);
>> +
> Should setup still continue when an error happens?
> 
Good question. I just followed the pattern from the original function,
which was not made by me. Actually I did once reflect on whether this 
was the correct way. Perhaps it could be argued that it is better to 
allow the device to come up, so the problem can be investigated?
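
The two options under discussion can be put side by side. This is only a
minimal userspace sketch: fake_setup_tagging(), setup_log_and_continue() and
setup_propagate() are hypothetical stand-ins, not the driver's functions.

```c
#include <stdio.h>

/* Hypothetical stand-in for one setup step; returns 0 or a negative code. */
static int fake_setup_tagging(int fail)
{
	return fail ? -5 : 0;	/* -5 mimics -EIO */
}

/* Pattern A: log the failure and keep going, so the device still comes up. */
static int setup_log_and_continue(int fail)
{
	int ret = fake_setup_tagging(fail);

	if (ret)
		fprintf(stderr, "failed to setup port tagging %d\n", ret);

	return 0;
}

/* Pattern B: stop at the failing step and hand the error to the caller. */
static int setup_propagate(int fail)
{
	int ret = fake_setup_tagging(fail);

	if (ret) {
		fprintf(stderr, "failed to setup port tagging %d\n", ret);
		return ret;
	}

	return 0;
}
```

Pattern A leaves a possibly misconfigured device available for investigation;
pattern B makes the failure visible to the caller.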

>>   	ret = lan9303_separate_ports(chip);
>>   	if (ret)
>>   		dev_err(chip->dev, "failed to separate ports %d\n", ret);
>> --
>> 2.11.0
> 
> - Woojung
> 

* Re: [PATCH RFC 0/3] tun zerocopy stats
From: Willem de Bruijn @ 2017-10-10 15:29 UTC
  To: David Miller
  Cc: Network Development, Michael S. Tsirkin, Jason Wang,
	Willem de Bruijn
In-Reply-To: <20171009.205228.714368596112967819.davem@davemloft.net>

On Mon, Oct 9, 2017 at 11:52 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Fri,  6 Oct 2017 18:25:13 -0400
>
>> From: Willem de Bruijn <willemb@google.com>
>>
>> Add zerocopy transfer statistics to the vhost_net/tun zerocopy path.
>>
>> I've been using this to verify recent changes to zerocopy tuning [1].
>> Sharing more widely, as it may be useful in similar future work.
>>
>> Use ethtool stats as interface, as these are defined per device
>> driver and can easily be extended.
>>
>> Make the zerocopy release callback take an extra hop through the tun
>> driver to allow the driver to increment its counters.
>>
>> Care must be taken to avoid adding an alloc/free to this hot path.
>> Since the caller already must allocate a ubuf_info, make it allocate
>> two at a time and grant one to the tun device.
>>
>>  1/3: introduce ethtool stats (`ethtool -S $DEV`) for tun devices
>>  2/3: add zerocopy tx and tx_err counters
>>  3/3: convert vhost_net to pass a pair of ubuf_info to tun
>>
>> [1] http://patchwork.ozlabs.org/patch/822613/
>
> This looks mostly fine to me, but I don't know enough about how vhost
> and tap interact to tell whether this makes sense to upstream.

Thanks for taking a look. The need for monitoring these stats has
come up in a couple of patch evaluation discussions, so I wanted
to share at least one implementation to get the data.

Because the choice to use zerocopy is based on heuristics and
there is a cost if it mispredicts, I think we even want to be able
to continuously monitor this in production.

The implementation is probably not ready for that as is.

> What are the runtime costs for these new statistics?

It currently doubles the size of the ubuf_info memory pool. That can be
fixed, as the current size is UIO_MAXIOV (1024), but the number of
zerocopy packets in flight is bound by VHOST_MAX_PEND (128).

It also adds an indirect function call to each zerocopy skb free
path, though.

If there is a way to expose these stats through vhost_net directly,
instead of through tun, that may be better. But I did not see a
suitable interface. Perhaps debugfs.
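
To make the extra hop and its indirect call concrete, here is a rough
userspace sketch. All names (zc_buf, zc_pair, zc_stats and the callbacks) are
hypothetical and much simpler than the kernel's ubuf_info; it only shows the
shape of the idea, not the patch itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified completion object; not struct ubuf_info. */
struct zc_buf {
	void (*callback)(struct zc_buf *buf, bool success);
	void *ctx;
};

/* Hypothetical per-device counters, in the spirit of tx and tx_err. */
struct zc_stats {
	unsigned long tx_zerocopy;
	unsigned long tx_zerocopy_err;
};

/* Allocated together: one buffer for the device, one for the owner. */
struct zc_pair {
	struct zc_buf dev;
	struct zc_buf owner;
};

/* Extra hop: count the outcome, then forward to the owner's callback. */
static void zc_dev_callback(struct zc_buf *buf, bool success)
{
	struct zc_pair *pair =
		(struct zc_pair *)((char *)buf - offsetof(struct zc_pair, dev));
	struct zc_stats *stats = buf->ctx;

	if (success)
		stats->tx_zerocopy++;
	else
		stats->tx_zerocopy_err++;

	/* This is the added indirect call on the free path. */
	pair->owner.callback(&pair->owner, success);
}

/* Owner side: here it only counts completions through its context. */
static void zc_owner_callback(struct zc_buf *buf, bool success)
{
	(void)success;
	(*(int *)buf->ctx)++;
}

static void zc_pair_init(struct zc_pair *pair, struct zc_stats *stats,
			 int *owner_done)
{
	pair->dev.callback = zc_dev_callback;
	pair->dev.ctx = stats;
	pair->owner.callback = zc_owner_callback;
	pair->owner.ctx = owner_done;
}
```

Because both halves live in one allocation, no alloc/free is added per packet;
the cost is the larger pool and the second indirect call.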

* Re: RIF/VRF overflow in spectrum and reporting errors back to user
From: David Ahern @ 2017-10-10 15:23 UTC
  To: Ido Schimmel; +Cc: Ido Schimmel, Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <20171009093110.GA5193@shredder.mtl.com>

On 10/9/17 3:31 AM, Ido Schimmel wrote:
> Hi David,
> 
> On Sun, Oct 08, 2017 at 02:10:33PM -0600, David Ahern wrote:
>> Jiri / Ido:
>>
>> I am looking at adding user messages for spectrum failures related to
>> RIF and VRF overflow coming from the inetaddr and inet6addr notifier
>> paths. The key is that if the notifiers fail the address add needs to
>> fail and an error reported to the user as to what happened.
> 
> Thanks for working on this. Very nice idea!
> 
>> Earlier this year 3ad7d2468f79f added in_validator_info and
>> in6_validator_info as a way for the notifiers to fail adding an address.
>> Adding support to spectrum for that notifier is complicated by the fact
>> that the validator notifier and address notifiers will come in back to
>> back for the NETDEV_UP case. Ignoring NETDEV_UP in
>> mlxsw_sp_inetaddr_event seems ok for IPv6 but not clear for IPv4 since
>> the NETDEV_UP case is emitted on an address delete that involves a
>> promotion. Handling the back to back NETDEV_UP is complicated since
>> functions invoked by __mlxsw_sp_inetaddr_event can take multiple
>> references. Specifically, in mlxsw_sp_port_vlan_router_join():
>>     fid = rif->ops->fid_get(rif);
>>
>> Can NETDEV_UP be ignored for the inetaddr notifier if it is handled by
>> the validator notifier?
> 
> Yes. The case where we get a NETDEV_DOWN for an address delete and then
> a NETDEV_UP for a promotion is basically a NOP from the driver's
> perspective. When the NETDEV_DOWN is received, the RIF isn't destroyed
> because the address list isn't empty (there's an address to be
> promoted). When the NETDEV_UP is received, it's ignored because we
> already have a RIF.

You lost me on the RIF. Looking at the chain:

mlxsw_sp_inet6addr_event_work or mlxsw_sp_inetaddr_event
- __mlxsw_sp_inetaddr_event
  + mlxsw_sp_inetaddr_vlan_event
    * mlxsw_sp_inetaddr_port_vlan_event
      - NETDEV_UP: mlxsw_sp_port_vlan_router_join

mlxsw_sp_port_vlan_router_join does the rif lookup and if it exists
calls fid_get() which takes a reference. I read that to mean
back-to-back NETDEV_UP notifiers (the address validator and then the
address notifier) would lead to a reference count leak.

Based on your address delete comment, I take the IPv4 solution to be
adding the validator notifier to spectrum and then ignoring NETDEV_UP in
mlxsw_sp_inetaddr_event. That means IPv4 inetaddr work is done for the
validator notifier while NETDEV_DOWN is done through the inetaddr notifier.

> 
> Regarding IPv6, it's a bit more complicated actually, since we do the
> actual work in a workqueue, as the notification chain is atomic. I
> believe this is because the notifier can be called from softirq in
> response to RA packets.
> 
> However, this case isn't interesting for mlxsw, as the fact that you
> process an RA packet suggests you already have a link-local address and
> thus a RIF. Plus, the kernel won't even process such packets in our case
> as you most likely have forwarding enabled (unless you tweaked accept_ra
> for some reason).
> 
> Looking at ipvlan (the only user of inet6addr_validator_chain), I see
> that it ignores this specific case and returns NOTIFY_DONE. Maybe we can
> move this notification chain to be blocking and not call it in response
> to RA packets seeing that all its users ignore it?

Seems reasonable to me.

I have it coded. Let me test and send an rfc.
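
For reference, the reference count concern can be modelled in a few lines.
This is a toy userspace sketch with hypothetical names (fake_fid and friends),
assuming one get per NETDEV_UP handler and one put per NETDEV_DOWN; it is not
the mlxsw code.

```c
/* Hypothetical refcounted object standing in for the FID behind a RIF. */
struct fake_fid {
	int refcount;
};

/* Models fid_get(): each NETDEV_UP handler that runs takes a reference. */
static void fake_fid_get(struct fake_fid *fid)
{
	fid->refcount++;
}

/* Models the put done when NETDEV_DOWN is handled. */
static void fake_fid_put(struct fake_fid *fid)
{
	fid->refcount--;
}

/*
 * One address add followed by one delete. 'up_events' is how many
 * NETDEV_UP handlers run for the add: 1 if only one notifier acts on it,
 * 2 if both the validator and the address notifier do.
 * Returns the references still held afterwards.
 */
static int add_then_delete(struct fake_fid *fid, int up_events)
{
	int i;

	for (i = 0; i < up_events; i++)
		fake_fid_get(fid);

	fake_fid_put(fid);	/* single NETDEV_DOWN for the delete */
	return fid->refcount;
}
```

With two handlers acting on NETDEV_UP, one reference is left over after the
delete, which is why NETDEV_UP would be handled in only one of the notifiers.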

* [net 2/2] i40e: Fix memory leak related filter programming status
From: Jeff Kirsher @ 2017-10-10 15:14 UTC
  To: davem; +Cc: Alexander Duyck, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <20171010151416.43149-1-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

It looks like we weren't correctly placing the pages from buffers that had
been used to return a filter programming status back on the ring. As a
result they were being overwritten and tracking of the pages was lost.

This change works to correct that by incorporating part of
i40e_put_rx_buffer into the programming status handler code. As a result we
should now be correctly placing the pages for those buffers on the
re-allocation list instead of letting them stay in place.

Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status")
Reported-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Anders K Pedersen <akp@cohaesio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 63 ++++++++++++++++-------------
 1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 1519dfb851d0..2756131495f0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1037,6 +1037,32 @@ static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc)
 	return false;
 }
 
+/**
+ * i40e_reuse_rx_page - page flip buffer and store it back on the ring
+ * @rx_ring: rx descriptor ring to store buffers on
+ * @old_buff: donor buffer to have page reused
+ *
+ * Synchronizes page for reuse by the adapter
+ **/
+static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
+			       struct i40e_rx_buffer *old_buff)
+{
+	struct i40e_rx_buffer *new_buff;
+	u16 nta = rx_ring->next_to_alloc;
+
+	new_buff = &rx_ring->rx_bi[nta];
+
+	/* update, and store next to alloc */
+	nta++;
+	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
+
+	/* transfer page from old buffer to new buffer */
+	new_buff->dma		= old_buff->dma;
+	new_buff->page		= old_buff->page;
+	new_buff->page_offset	= old_buff->page_offset;
+	new_buff->pagecnt_bias	= old_buff->pagecnt_bias;
+}
+
 /**
  * i40e_rx_is_programming_status - check for programming status descriptor
  * @qw: qword representing status_error_len in CPU ordering
@@ -1071,15 +1097,24 @@ static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
 					  union i40e_rx_desc *rx_desc,
 					  u64 qw)
 {
-	u32 ntc = rx_ring->next_to_clean + 1;
+	struct i40e_rx_buffer *rx_buffer;
+	u32 ntc = rx_ring->next_to_clean;
 	u8 id;
 
 	/* fetch, update, and store next to clean */
+	rx_buffer = &rx_ring->rx_bi[ntc++];
 	ntc = (ntc < rx_ring->count) ? ntc : 0;
 	rx_ring->next_to_clean = ntc;
 
 	prefetch(I40E_RX_DESC(rx_ring, ntc));
 
+	/* place unused page back on the ring */
+	i40e_reuse_rx_page(rx_ring, rx_buffer);
+	rx_ring->rx_stats.page_reuse_count++;
+
+	/* clear contents of buffer_info */
+	rx_buffer->page = NULL;
+
 	id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
 		  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
 
@@ -1638,32 +1673,6 @@ static bool i40e_cleanup_headers(struct i40e_ring *rx_ring, struct sk_buff *skb,
 	return false;
 }
 
-/**
- * i40e_reuse_rx_page - page flip buffer and store it back on the ring
- * @rx_ring: rx descriptor ring to store buffers on
- * @old_buff: donor buffer to have page reused
- *
- * Synchronizes page for reuse by the adapter
- **/
-static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
-			       struct i40e_rx_buffer *old_buff)
-{
-	struct i40e_rx_buffer *new_buff;
-	u16 nta = rx_ring->next_to_alloc;
-
-	new_buff = &rx_ring->rx_bi[nta];
-
-	/* update, and store next to alloc */
-	nta++;
-	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
-
-	/* transfer page from old buffer to new buffer */
-	new_buff->dma		= old_buff->dma;
-	new_buff->page		= old_buff->page;
-	new_buff->page_offset	= old_buff->page_offset;
-	new_buff->pagecnt_bias	= old_buff->pagecnt_bias;
-}
-
 /**
  * i40e_page_is_reusable - check if any reuse is possible
  * @page: page struct to check
-- 
2.14.2
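
The index handling in the patch above can be tried out in isolation. The
following is a simplified userspace model with hypothetical names (fake_ring,
fake_rx_buffer, RING_COUNT); it keeps only the page pointer and assumes
next_to_clean and next_to_alloc refer to different slots.

```c
#include <stddef.h>

#define RING_COUNT 4	/* hypothetical ring size for the sketch */

struct fake_rx_buffer {
	void *page;
};

struct fake_ring {
	struct fake_rx_buffer bi[RING_COUNT];
	unsigned int next_to_clean;
	unsigned int next_to_alloc;
};

/* Move the page from 'old' into the next_to_alloc slot, wrapping the index. */
static void fake_reuse_rx_page(struct fake_ring *ring,
			       struct fake_rx_buffer *old)
{
	unsigned int nta = ring->next_to_alloc;

	ring->bi[nta].page = old->page;

	nta++;
	ring->next_to_alloc = (nta < RING_COUNT) ? nta : 0;
}

/* Consume a status-only descriptor: advance next_to_clean, recycle the page. */
static void fake_clean_programming_status(struct fake_ring *ring)
{
	unsigned int ntc = ring->next_to_clean;
	struct fake_rx_buffer *buf = &ring->bi[ntc++];

	ring->next_to_clean = (ntc < RING_COUNT) ? ntc : 0;

	/* Without this step the page would stay behind and later be lost. */
	fake_reuse_rx_page(ring, buf);
	buf->page = NULL;
}
```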

* [net 1/2] i40e: Fix comment about locking for __i40e_read_nvm_word()
From: Jeff Kirsher @ 2017-10-10 15:14 UTC
  To: davem; +Cc: Stefano Brivio, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20171010151416.43149-1-jeffrey.t.kirsher@intel.com>

From: Stefano Brivio <sbrivio@redhat.com>

Caller needs to acquire the lock. Called functions will not.

Fixes: 09f79fd49d94 ("i40e: avoid NVM acquire deadlock during NVM update")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 57505b1df98d..d591b3e6bd7c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -298,7 +298,7 @@ static i40e_status i40e_read_nvm_word_aq(struct i40e_hw *hw, u16 offset,
 }
 
 /**
- * __i40e_read_nvm_word - Reads nvm word, assumes called does the locking
+ * __i40e_read_nvm_word - Reads nvm word, assumes caller does the locking
  * @hw: pointer to the HW structure
  * @offset: offset of the Shadow RAM word to read (0x000000 - 0x001FFF)
  * @data: word read from the Shadow RAM
-- 
2.14.2

* [net 0/2][pull request] Intel Wired LAN Driver Updates 2017-10-10
From: Jeff Kirsher @ 2017-10-10 15:14 UTC
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene

This series contains updates to i40e only.

Stefano Brivio fixes the grammar in a function header comment.

Alex fixes a memory leak where we were not correctly placing the pages
from buffers that had been used to return a filter programming status
back on the ring.

The following are changes since commit 529a86e063e9ff625c4ff247d8aa17d8072444fb:
  Merge branch 'ppc-bundle' (bundle from Michael Ellerman)
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 40GbE

Alexander Duyck (1):
  i40e: Fix memory leak related filter programming status

Stefano Brivio (1):
  i40e: Fix comment about locking for __i40e_read_nvm_word()

 drivers/net/ethernet/intel/i40e/i40e_nvm.c  |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 63 ++++++++++++++++-------------
 2 files changed, 37 insertions(+), 28 deletions(-)

-- 
2.14.2

* RE: [PATCH v2 net-next 1/2] net: dsa: lan9303: Move tag setup to new lan9303_setup_tagging
From: Woojung.Huh @ 2017-10-10 15:14 UTC
  To: privat, andrew, vivien.didelot, f.fainelli, netdev, linux-kernel
In-Reply-To: <20171010124953.386-2-privat@egil-hjelmeland.no>

> +/* forward special tagged packets from port 0 to port 1 *or* port 2 */
> +static int lan9303_setup_tagging(struct lan9303 *chip)
> +{
> +	int ret;
> +	u32 val;
> +	/* enable defining the destination port via special VLAN tagging
> +	 * for port 0
> +	 */
> +	ret = lan9303_write_switch_reg(chip, LAN9303_SWE_INGRESS_PORT_TYPE,
> +				       LAN9303_SWE_INGRESS_PORT_TYPE_VLAN);
> +	if (ret)
> +		return ret;
> +
> +	/* tag incoming packets at port 1 and 2 on their way to port 0 to be
> +	 * able to discover their source port
> +	 */
> +	val = LAN9303_BM_EGRSS_PORT_TYPE_SPECIAL_TAG_PORT0;
> +	return lan9303_write_switch_reg(chip, LAN9303_BM_EGRSS_PORT_TYPE, val);
Is there a specific reason to use val rather than passing
LAN9303_BM_EGRSS_PORT_TYPE_SPECIAL_TAG_PORT0 directly, like the previous call?

> @@ -644,6 +648,10 @@ static int lan9303_setup(struct dsa_switch *ds)
>  		return -EINVAL;
>  	}
> 
> +	ret = lan9303_setup_tagging(chip);
> +	if (ret)
> +		dev_err(chip->dev, "failed to setup port tagging %d\n", ret);
> +
Should setup still continue when an error happens?

>  	ret = lan9303_separate_ports(chip);
>  	if (ret)
>  		dev_err(chip->dev, "failed to separate ports %d\n", ret);
> --
> 2.11.0

- Woojung

* RE: [patch net-next 2/4] net: sched: introduce per-egress action device callbacks
From: David Laight @ 2017-10-10 15:12 UTC
  To: 'Jiri Pirko'
  Cc: netdev@vger.kernel.org, davem@davemloft.net, jhs@mojatatu.com,
	xiyou.wangcong@gmail.com, saeedm@mellanox.com,
	matanb@mellanox.com, leonro@mellanox.com, mlxsw@mellanox.com
In-Reply-To: <20171010143152.GG2033@nanopsycho>

From: Jiri Pirko
> Sent: 10 October 2017 15:32
> To: David Laight
> Cc: netdev@vger.kernel.org; davem@davemloft.net; jhs@mojatatu.com; xiyou.wangcong@gmail.com;
> saeedm@mellanox.com; matanb@mellanox.com; leonro@mellanox.com; mlxsw@mellanox.com
> Subject: Re: [patch net-next 2/4] net: sched: introduce per-egress action device callbacks
> 
> Tue, Oct 10, 2017 at 03:31:59PM CEST, David.Laight@ACULAB.COM wrote:
> >From: Jiri Pirko
> >> Sent: 10 October 2017 08:30
> >> Introduce infrastructure that allows drivers to register callbacks that
> >> are called whenever tc would offload inserted rule and specified device
> >> acts as tc action egress device.
> >
> >How does a driver safely unregister a callback?
> >(to avoid a race with the callback being called.)
> >
> >Usually this requires a callback in the context that makes the
> >notification callbacks indicating that no more such callbacks
> >will be made.
> 
> rtnl is your answer. It is being held during register/unregister/cb

Do you mean 'acquired during register/unregister' and 'held across the
callback' ?

So the unregister sleeps (or spins?) until any callbacks complete?
So the driver mustn't hold any locks (etc) across the unregister that
it acquires in the callback.
That ought to be noted somewhere.

	David

* Re: [PATCH v4 net-next] rtnetlink: bridge: use ext_ack instead of printk
From: David Ahern @ 2017-10-10 15:11 UTC
  To: Florian Westphal, netdev
In-Reply-To: <20171010151004.20056-1-fw@strlen.de>

On 10/10/17 9:10 AM, Florian Westphal wrote:
> We can now piggyback error strings to userspace via extended acks
> rather than using printk.
> 
> Before:
> bridge fdb add 01:02:03:04:05:06 dev br0 vlan 4095
> RTNETLINK answers: Invalid argument
> 
> After:
> bridge fdb add 01:02:03:04:05:06 dev br0 vlan 4095
> Error: invalid vlan id.
> 
> v3: drop 'RTM_' prefixes, suggested by David Ahern, they
> are not useful, the add/del in bridge command line is enough.
> 
> Also reword error in response to malformed/bad vlan id attribute
> size.
> 
> Cc: David Ahern <dsahern@gmail.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  change since v3: forgot to remove "RTM_SETLINK:" prefix in error message.
> 
>  net/core/rtnetlink.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)

Reviewed-by: David Ahern <dsahern@gmail.com>

* [PATCH v4 net-next] rtnetlink: bridge: use ext_ack instead of printk
From: Florian Westphal @ 2017-10-10 15:10 UTC
  To: netdev; +Cc: Florian Westphal, David Ahern

We can now piggyback error strings to userspace via extended acks
rather than using printk.

Before:
bridge fdb add 01:02:03:04:05:06 dev br0 vlan 4095
RTNETLINK answers: Invalid argument

After:
bridge fdb add 01:02:03:04:05:06 dev br0 vlan 4095
Error: invalid vlan id.

v3: drop 'RTM_' prefixes, suggested by David Ahern, they
are not useful, the add/del in bridge command line is enough.

Also reword error in response to malformed/bad vlan id attribute
size.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 change since v3: forgot to remove "RTM_SETLINK:" prefix in error message.

 net/core/rtnetlink.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e84d108cfee4..6a09f3d575af 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3066,21 +3066,21 @@ int ndo_dflt_fdb_add(struct ndmsg *ndm,
 }
 EXPORT_SYMBOL(ndo_dflt_fdb_add);
 
-static int fdb_vid_parse(struct nlattr *vlan_attr, u16 *p_vid)
+static int fdb_vid_parse(struct nlattr *vlan_attr, u16 *p_vid,
+			 struct netlink_ext_ack *extack)
 {
 	u16 vid = 0;
 
 	if (vlan_attr) {
 		if (nla_len(vlan_attr) != sizeof(u16)) {
-			pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid vlan\n");
+			NL_SET_ERR_MSG(extack, "invalid vlan attribute size");
 			return -EINVAL;
 		}
 
 		vid = nla_get_u16(vlan_attr);
 
 		if (!vid || vid >= VLAN_VID_MASK) {
-			pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid vlan id %d\n",
-				vid);
+			NL_SET_ERR_MSG(extack, "invalid vlan id");
 			return -EINVAL;
 		}
 	}
@@ -3105,24 +3105,24 @@ static int rtnl_fdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	ndm = nlmsg_data(nlh);
 	if (ndm->ndm_ifindex == 0) {
-		pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid ifindex\n");
+		NL_SET_ERR_MSG(extack, "invalid ifindex");
 		return -EINVAL;
 	}
 
 	dev = __dev_get_by_index(net, ndm->ndm_ifindex);
 	if (dev == NULL) {
-		pr_info("PF_BRIDGE: RTM_NEWNEIGH with unknown ifindex\n");
+		NL_SET_ERR_MSG(extack, "unknown ifindex");
 		return -ENODEV;
 	}
 
 	if (!tb[NDA_LLADDR] || nla_len(tb[NDA_LLADDR]) != ETH_ALEN) {
-		pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid address\n");
+		NL_SET_ERR_MSG(extack, "invalid address");
 		return -EINVAL;
 	}
 
 	addr = nla_data(tb[NDA_LLADDR]);
 
-	err = fdb_vid_parse(tb[NDA_VLAN], &vid);
+	err = fdb_vid_parse(tb[NDA_VLAN], &vid, extack);
 	if (err)
 		return err;
 
@@ -3209,24 +3209,24 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	ndm = nlmsg_data(nlh);
 	if (ndm->ndm_ifindex == 0) {
-		pr_info("PF_BRIDGE: RTM_DELNEIGH with invalid ifindex\n");
+		NL_SET_ERR_MSG(extack, "invalid ifindex");
 		return -EINVAL;
 	}
 
 	dev = __dev_get_by_index(net, ndm->ndm_ifindex);
 	if (dev == NULL) {
-		pr_info("PF_BRIDGE: RTM_DELNEIGH with unknown ifindex\n");
+		NL_SET_ERR_MSG(extack, "unknown ifindex");
 		return -ENODEV;
 	}
 
 	if (!tb[NDA_LLADDR] || nla_len(tb[NDA_LLADDR]) != ETH_ALEN) {
-		pr_info("PF_BRIDGE: RTM_DELNEIGH with invalid address\n");
+		NL_SET_ERR_MSG(extack, "invalid address");
 		return -EINVAL;
 	}
 
 	addr = nla_data(tb[NDA_LLADDR]);
 
-	err = fdb_vid_parse(tb[NDA_VLAN], &vid);
+	err = fdb_vid_parse(tb[NDA_VLAN], &vid, extack);
 	if (err)
 		return err;
 
@@ -3666,7 +3666,7 @@ static int rtnl_bridge_setlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	dev = __dev_get_by_index(net, ifm->ifi_index);
 	if (!dev) {
-		pr_info("PF_BRIDGE: RTM_SETLINK with unknown ifindex\n");
+		NL_SET_ERR_MSG(extack, "unknown ifindex");
 		return -ENODEV;
 	}
 
@@ -3741,7 +3741,7 @@ static int rtnl_bridge_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	dev = __dev_get_by_index(net, ifm->ifi_index);
 	if (!dev) {
-		pr_info("PF_BRIDGE: RTM_SETLINK with unknown ifindex\n");
+		NL_SET_ERR_MSG(extack, "unknown ifindex");
 		return -ENODEV;
 	}
 
-- 
2.13.6
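
The pattern in this patch, attaching a message to the reply instead of
logging it, can be shown outside the kernel. This is a minimal userspace
sketch: fake_ext_ack and fake_vid_check() are hypothetical stand-ins for
struct netlink_ext_ack and fdb_vid_parse(), and 22 is used in place of EINVAL.

```c
#include <stddef.h>
#include <stdint.h>

#define FAKE_EINVAL		22
#define FAKE_VLAN_VID_MASK	0x0fff	/* same value as VLAN_VID_MASK */

/* Hypothetical stand-in for struct netlink_ext_ack. */
struct fake_ext_ack {
	const char *msg;
};

/* Validate a VLAN id; on failure attach a message for the caller. */
static int fake_vid_check(uint16_t vid, struct fake_ext_ack *extack)
{
	if (!vid || vid >= FAKE_VLAN_VID_MASK) {
		if (extack)
			extack->msg = "invalid vlan id";
		return -FAKE_EINVAL;
	}

	return 0;
}
```

With vid 4095, as in the bridge command above, the check fails and the
message travels back with the error code instead of going to the kernel log.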

* Re: [PATCH net-next] openvswitch: add ct_clear action
From: Eric Garver @ 2017-10-10 15:09 UTC
  To: Joe Stringer; +Cc: Pravin Shelar, Linux Kernel Network Developers, ovs dev
In-Reply-To: <CAPWQB7FgKYe4Ax08NzW97-WGmriC7j9YhxQF9QtuQZwMjA00bQ@mail.gmail.com>

On Tue, Oct 10, 2017 at 05:33:48AM -0700, Joe Stringer wrote:
> On 9 October 2017 at 21:41, Pravin Shelar <pshelar@ovn.org> wrote:
> > On Fri, Oct 6, 2017 at 9:44 AM, Eric Garver <e@erig.me> wrote:
> >> This adds a ct_clear action for clearing conntrack state. ct_clear is
> >> currently implemented in OVS userspace, but is not backed by an action
> >> in the kernel datapath. This is useful for flows that may modify a
> >> packet tuple after a ct lookup has already occurred.
> >>
> >> Signed-off-by: Eric Garver <e@erig.me>
> > Patch mostly looks good. I have following comments.
> >
> >> ---
> >>  include/uapi/linux/openvswitch.h |  2 ++
> >>  net/openvswitch/actions.c        |  5 +++++
> >>  net/openvswitch/conntrack.c      | 12 ++++++++++++
> >>  net/openvswitch/conntrack.h      |  7 +++++++
> >>  net/openvswitch/flow_netlink.c   |  5 +++++
> >>  5 files changed, 31 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
> >> index 156ee4cab82e..1b6e510e2cc6 100644
> >> --- a/include/uapi/linux/openvswitch.h
> >> +++ b/include/uapi/linux/openvswitch.h
> >> @@ -806,6 +806,7 @@ struct ovs_action_push_eth {
> >>   * packet.
> >>   * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
> >>   * packet.
> >> + * @OVS_ACTION_ATTR_CT_CLEAR: Clear conntrack state from the packet.
> >>   *
> >>   * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
> >>   * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
> >> @@ -835,6 +836,7 @@ enum ovs_action_attr {
> >>         OVS_ACTION_ATTR_TRUNC,        /* u32 struct ovs_action_trunc. */
> >>         OVS_ACTION_ATTR_PUSH_ETH,     /* struct ovs_action_push_eth. */
> >>         OVS_ACTION_ATTR_POP_ETH,      /* No argument. */
> >> +       OVS_ACTION_ATTR_CT_CLEAR,     /* No argument. */
> >>
> >>         __OVS_ACTION_ATTR_MAX,        /* Nothing past this will be accepted
> >>                                        * from userspace. */
> >> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> >> index a54a556fcdb5..db9c7f2e662b 100644
> >> --- a/net/openvswitch/actions.c
> >> +++ b/net/openvswitch/actions.c
> >> @@ -1203,6 +1203,10 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> >>                                 return err == -EINPROGRESS ? 0 : err;
> >>                         break;
> >>
> >> +               case OVS_ACTION_ATTR_CT_CLEAR:
> >> +                       err = ovs_ct_clear(skb, key);
> >> +                       break;
> >> +
> >>                 case OVS_ACTION_ATTR_PUSH_ETH:
> >>                         err = push_eth(skb, key, nla_data(a));
> >>                         break;
> >> @@ -1210,6 +1214,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> >>                 case OVS_ACTION_ATTR_POP_ETH:
> >>                         err = pop_eth(skb, key);
> >>                         break;
> >> +
> >>                 }
> > Unrelated change.
> >
> >>
> >>                 if (unlikely(err)) {
> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> >> index d558e882ca0c..f9b73c726ad7 100644
> >> --- a/net/openvswitch/conntrack.c
> >> +++ b/net/openvswitch/conntrack.c
> >> @@ -1129,6 +1129,18 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
> >>         return err;
> >>  }
> >>
> >> +int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
> >> +{
> >> +       if (skb_nfct(skb)) {
> >> +               nf_conntrack_put(skb_nfct(skb));
> >> +               nf_ct_set(skb, NULL, 0);
> > > Should the new conntrack state be something more appropriate, maybe IP_CT_UNTRACKED?
> >
> >> +       }
> >> +
> >> +       ovs_ct_fill_key(skb, key);
> >> +
> > > I do not see a need to refill the key if there is no skb nfct.
> 
> Really this is trying to just zero the CT key fields, but reuses
> existing functions, right? This means that subsequent upcalls, for

Right.

> instance, won't have the outdated view of the CT state from the
> previous lookup (that was prior to the ct_clear). I'd expect these key
> fields to be cleared.

I assumed Pravin was saying that we don't need to clear them if there is
no conntrack state. They should already be zero.
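
That reading would look roughly like the sketch below. It is a userspace
model with hypothetical types (fake_conn, fake_skb, fake_key), and it assumes
the key's CT fields are already zero whenever the packet carries no conntrack
entry.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the conntrack entry, the skb and the flow key. */
struct fake_conn {
	int refcount;
};

struct fake_skb {
	struct fake_conn *nfct;
};

struct fake_key {
	unsigned int ct_state;
	unsigned int ct_zone;
	unsigned int ct_mark;
};

/*
 * Drop the packet's conntrack reference and zero the CT key fields,
 * but only when there is conntrack state to clear.
 */
static void fake_ct_clear(struct fake_skb *skb, struct fake_key *key)
{
	if (skb->nfct) {
		skb->nfct->refcount--;
		skb->nfct = NULL;

		/* Later upcalls must not see the stale lookup result. */
		memset(key, 0, sizeof(*key));
	}
}
```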

* [PATCH net] macsec: fix memory leaks when skb_to_sgvec fails
From: Sabrina Dubroca @ 2017-10-10 15:07 UTC
  To: netdev; +Cc: Sabrina Dubroca, Jason A . Donenfeld

Fixes: cda7ea690350 ("macsec: check return value of skb_to_sgvec always")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
 drivers/net/macsec.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 98e4deaa3a6a..5ab1b8849c30 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -742,6 +742,7 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
 	sg_init_table(sg, ret);
 	ret = skb_to_sgvec(skb, sg, 0, skb->len);
 	if (unlikely(ret < 0)) {
+		aead_request_free(req);
 		macsec_txsa_put(tx_sa);
 		kfree_skb(skb);
 		return ERR_PTR(ret);
@@ -954,6 +955,7 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
 	sg_init_table(sg, ret);
 	ret = skb_to_sgvec(skb, sg, 0, skb->len);
 	if (unlikely(ret < 0)) {
+		aead_request_free(req);
 		kfree_skb(skb);
 		return ERR_PTR(ret);
 	}
-- 
2.14.2
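
The class of bug fixed here, an early return that skips a release, is easy to
reproduce in a toy. A userspace sketch with hypothetical helpers (fake_alloc,
fake_free, fake_to_sgvec, fake_encrypt), using a counter to check that every
allocation is released on both paths:

```c
#include <stdlib.h>

static int live_allocs;	/* outstanding allocations in this sketch */

static void *fake_alloc(void)
{
	void *p = malloc(16);

	if (p)
		live_allocs++;
	return p;
}

static void fake_free(void *p)
{
	if (p) {
		free(p);
		live_allocs--;
	}
}

/* Hypothetical step that can fail after the request was allocated. */
static int fake_to_sgvec(int fail)
{
	return fail ? -90 : 1;	/* -90 mimics -EMSGSIZE */
}

static int fake_encrypt(int fail)
{
	void *req = fake_alloc();
	int ret;

	if (!req)
		return -12;	/* mimics -ENOMEM */

	ret = fake_to_sgvec(fail);
	if (ret < 0) {
		fake_free(req);	/* the release the error path must not skip */
		return ret;
	}

	/* ... the request would be used here ... */
	fake_free(req);
	return 0;
}
```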

* Re: [PATCH net-next v2 4/5] selinux: bpf: Add selinux check for eBPF syscall operations
From: Stephen Smalley @ 2017-10-10 14:52 UTC
  To: Chenbo Feng, linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, SELinux
  Cc: Chenbo Feng, Alexei Starovoitov, Daniel Borkmann,
	lorenzo-hpIqsD4AKlfQT0dZR+AlfA
In-Reply-To: <1507645097.30616.6.camel-+05T5uksL2qpZYMLLGbcSA@public.gmane.org>

On Tue, 2017-10-10 at 10:18 -0400, Stephen Smalley wrote:
> On Mon, 2017-10-09 at 15:20 -0700, Chenbo Feng wrote:
> > From: Chenbo Feng <fengc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > 
> > Implement the actual checks introduced to eBPF related syscalls. This
> > implementation uses the security field inside the bpf object to store a
> > sid that identifies the bpf object. When processes try to access the
> > object, selinux checks whether they have the right privileges. The
> > creation of eBPF objects is also checked at the general bpf check hook,
> > and new cmds introduced to the eBPF domain can also be checked there.
> > 
> > Signed-off-by: Chenbo Feng <fengc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > Acked-by: Alexei Starovoitov <ast-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > ---
> >  security/selinux/hooks.c            | 111 ++++++++++++++++++++++++++++++++++++
> >  security/selinux/include/classmap.h |   2 +
> >  security/selinux/include/objsec.h   |   4 ++
> >  3 files changed, 117 insertions(+)
> > 
> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > index f5d304736852..41aba4e3d57c 100644
> > --- a/security/selinux/hooks.c
> > +++ b/security/selinux/hooks.c
> > @@ -85,6 +85,7 @@
> >  #include <linux/export.h>
> >  #include <linux/msg.h>
> >  #include <linux/shm.h>
> > +#include <linux/bpf.h>
> >  
> >  #include "avc.h"
> >  #include "objsec.h"
> > @@ -6252,6 +6253,106 @@ static void selinux_ib_free_security(void *ib_sec)
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_BPF_SYSCALL
> > +static int selinux_bpf(int cmd, union bpf_attr *attr,
> > +				     unsigned int size)
> > +{
> > +	u32 sid = current_sid();
> > +	int ret;
> > +
> > +	switch (cmd) {
> > +	case BPF_MAP_CREATE:
> > +		ret = avc_has_perm(sid, sid, SECCLASS_BPF_MAP, BPF_MAP__CREATE,
> > +				   NULL);
> > +		break;
> > +	case BPF_PROG_LOAD:
> > +		ret = avc_has_perm(sid, sid, SECCLASS_BPF_PROG, BPF_PROG__LOAD,
> > +				   NULL);
> > +		break;
> > +	default:
> > +		ret = 0;
> > +		break;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static u32 bpf_map_fmode_to_av(fmode_t fmode)
> > +{
> > +	u32 av = 0;
> > +
> > +	if (fmode & FMODE_READ)
> > +		av |= BPF_MAP__READ;
> > +	if (fmode & FMODE_WRITE)
> > +		av |= BPF_MAP__WRITE;
> > +	return av;
> > +}
> > +
> > +static int selinux_bpf_map(struct bpf_map *map, fmode_t fmode)
> > +{
> > +	u32 sid = current_sid();
> > +	struct bpf_security_struct *bpfsec;
> > +
> > +	bpfsec = map->security;
> > +	return avc_has_perm(sid, bpfsec->sid, SECCLASS_BPF_MAP,
> > +			    bpf_map_fmode_to_av(fmode), NULL);
> > +}
> > +
> > +static int selinux_bpf_prog(struct bpf_prog *prog)
> > +{
> > +	u32 sid = current_sid();
> > +	struct bpf_security_struct *bpfsec;
> > +
> > +	bpfsec = prog->aux->security;
> > +	return avc_has_perm(sid, bpfsec->sid, SECCLASS_BPF_PROG,
> > +			    BPF_PROG__USE, NULL);
> > +}
> > +
> > +static int selinux_bpf_map_alloc(struct bpf_map *map)
> > +{
> > +	struct bpf_security_struct *bpfsec;
> > +
> > +	bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL);
> > +	if (!bpfsec)
> > +		return -ENOMEM;
> > +
> > +	bpfsec->sid = current_sid();
> > +	map->security = bpfsec;
> > +
> > +	return 0;
> > +}
> > +
> > +static void selinux_bpf_map_free(struct bpf_map *map)
> > +{
> > +	struct bpf_security_struct *bpfsec = map->security;
> > +
> > +	map->security = NULL;
> > +	kfree(bpfsec);
> > +}
> > +
> > +static int selinux_bpf_prog_alloc(struct bpf_prog_aux *aux)
> > +{
> > +	struct bpf_security_struct *bpfsec;
> > +
> > +	bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL);
> > +	if (!bpfsec)
> > +		return -ENOMEM;
> > +
> > +	bpfsec->sid = current_sid();
> > +	aux->security = bpfsec;
> > +
> > +	return 0;
> > +}
> > +
> > +static void selinux_bpf_prog_free(struct bpf_prog_aux *aux)
> > +{
> > +	struct bpf_security_struct *bpfsec = aux->security;
> > +
> > +	aux->security = NULL;
> > +	kfree(bpfsec);
> > +}
> > +#endif
> > +
> >  static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
> >  	LSM_HOOK_INIT(binder_set_context_mgr, selinux_binder_set_context_mgr),
> >  	LSM_HOOK_INIT(binder_transaction, selinux_binder_transaction),
> > @@ -6471,6 +6572,16 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
> >  	LSM_HOOK_INIT(audit_rule_match, selinux_audit_rule_match),
> >  	LSM_HOOK_INIT(audit_rule_free, selinux_audit_rule_free),
> >  #endif
> > +
> > +#ifdef CONFIG_BPF_SYSCALL
> > +	LSM_HOOK_INIT(bpf, selinux_bpf),
> > +	LSM_HOOK_INIT(bpf_map, selinux_bpf_map),
> > +	LSM_HOOK_INIT(bpf_prog, selinux_bpf_prog),
> > +	LSM_HOOK_INIT(bpf_map_alloc_security, selinux_bpf_map_alloc),
> > +	LSM_HOOK_INIT(bpf_prog_alloc_security, selinux_bpf_prog_alloc),
> > +	LSM_HOOK_INIT(bpf_map_free_security, selinux_bpf_map_free),
> > +	LSM_HOOK_INIT(bpf_prog_free_security, selinux_bpf_prog_free),
> > +#endif
> >  };
> >  
> >  static __init int selinux_init(void)
> > diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> > index 35ffb29a69cb..7253c5eea59c 100644
> > --- a/security/selinux/include/classmap.h
> > +++ b/security/selinux/include/classmap.h
> > @@ -237,6 +237,8 @@ struct security_class_mapping secclass_map[] = {
> >  	  { "access", NULL } },
> >  	{ "infiniband_endport",
> >  	  { "manage_subnet", NULL } },
> > +	{ "bpf_map", {"create", "read", "write"} },
> > +	{ "bpf_prog", {"load", "use"} },
> 
> Again I have to ask: do you truly need/want two separate classes, or
> would a single class with distinct permissions suffice, ala:
>         { "bpf", { "create_map", "read_map", "write_map", "prog_load", "prog_use" } },
> 
> and then allow A self:bpf { create_map read_map write_map prog_load
> prog_use }; would be stored in a single policy avtab rule, and be
> cached in a single AVC entry.
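
To make the single-class suggestion above concrete, here is a rough userspace sketch of how one class's permission list maps onto bits of a single access vector. The sk_* names are hypothetical and this is not SELinux code; it only assumes that each permission in a class gets the bit matching its position in the list.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical permission list for a single "bpf" class, in the order
 * used in the suggestion above. */
static const char *const sk_bpf_perms[] = {
	"create_map", "read_map", "write_map", "prog_load", "prog_use", NULL
};

/* Return the access-vector bit for a permission name, or 0 if the
 * name is not in the list. */
static uint32_t sk_perm_bit(const char *name)
{
	for (int i = 0; sk_bpf_perms[i]; i++)
		if (strcmp(sk_bpf_perms[i], name) == 0)
			return 1u << i;
	return 0;
}
```

With one class, all five permissions fit into one 32-bit vector, so a single lookup can answer any combination of them; with two classes the map and prog permissions would live in two separate vectors.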

I guess for consistency in naming it should be either:

        { "bpf", { "map_create", "map_read", "map_write", "prog_load", "prog_use" } },

or:

        { "bpf", { "create_map", "read_map", "write_map", "load_prog", "use_prog" } },
 

> >  	{ NULL }
> >    };
> >  
> > diff --git a/security/selinux/include/objsec.h b/security/selinux/include/objsec.h
> > index 1649cd18eb0b..3d54468ce334 100644
> > --- a/security/selinux/include/objsec.h
> > +++ b/security/selinux/include/objsec.h
> > @@ -150,6 +150,10 @@ struct pkey_security_struct {
> >  	u32	sid;	/* SID of pkey */
> >  };
> >  
> > +struct bpf_security_struct {
> > +	u32 sid;  /* SID of bpf obj creator */
> > +};
> > +
> >  extern unsigned int selinux_checkreqprot;
> >  
> >  #endif /* _SELINUX_OBJSEC_H_ */

* Re: [PATCH net v2] net: enable interface alias removal via rtnl
From: David Ahern @ 2017-10-10 14:50 UTC (permalink / raw)
  To: Nicolas Dichtel, davem; +Cc: netdev, oliver, Stephen Hemminger
In-Reply-To: <20171010124138.27342-1-nicolas.dichtel@6wind.com>

On 10/10/17 6:41 AM, Nicolas Dichtel wrote:
> IFLA_IFALIAS is defined as NLA_STRING. It means that the minimal length of
> the attribute is 1 ("\0"). However, to remove an alias, the attribute
> length must be 0 (see dev_set_alias()).
> 
> Let's define the type to NLA_BINARY, so that the alias can be removed.

not to be pedantic, but we need to be clear that the type is changed
only for policy validation.

> 
> Example:
> $ ip l s dummy0 alias foo
> $ ip l l dev dummy0
> 5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/ether ae:20:30:4f:a7:f3 brd ff:ff:ff:ff:ff:ff
>     alias foo
> 
> Before the patch:
> $ ip l s dummy0 alias ""
> RTNETLINK answers: Numerical result out of range
> 
> After the patch:
> $ ip l s dummy0 alias ""
> $ ip l l dev dummy0
> 5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/ether ae:20:30:4f:a7:f3 brd ff:ff:ff:ff:ff:ff
> 
> CC: Oliver Hartkopp <oliver@hartkopp.net>
> CC: Stephen Hemminger <stephen@networkplumber.org>
> Fixes: 96ca4a2cc145 ("net: remove ifalias on empty given alias")
> Reported-by: Julien FLoret <julien.floret@6wind.com>
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> ---
> 
> v1 -> v2: add the comment
> 
>  net/core/rtnetlink.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index d4bcdcc68e92..5343565d88b7 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1483,7 +1483,10 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
>  	[IFLA_LINKINFO]		= { .type = NLA_NESTED },
>  	[IFLA_NET_NS_PID]	= { .type = NLA_U32 },
>  	[IFLA_NET_NS_FD]	= { .type = NLA_U32 },
> -	[IFLA_IFALIAS]	        = { .type = NLA_STRING, .len = IFALIASZ-1 },
> +	/* IFLA_IFALIAS is a string, but policy is set to NLA_BINARY to
> +	 * allow 0-length string (needed to remove an alias).
> +	 */
> +	[IFLA_IFALIAS]	        = { .type = NLA_BINARY, .len = IFALIASZ - 1 },
>  	[IFLA_VFINFO_LIST]	= {. type = NLA_NESTED },
>  	[IFLA_VF_PORTS]		= { .type = NLA_NESTED },
>  	[IFLA_PORT_SELF]	= { .type = NLA_NESTED },
> 

Seems like a reasonable solution.

Acked-by: David Ahern <dsahern@gmail.com>
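
For readers following the NLA_STRING vs NLA_BINARY point, here is a minimal userspace sketch of the length checks as described in the commit message. The model_* names are made up for illustration and this is not the kernel's validation code; the limit of 255 used in the note below assumes IFALIASZ is 256.

```c
#include <stddef.h>
#include <errno.h>

/* Hypothetical, simplified stand-ins for the kernel's policy types. */
enum model_nla_type { MODEL_NLA_STRING, MODEL_NLA_BINARY };

struct model_nla_policy {
	enum model_nla_type type;
	size_t len;	/* maximum payload length, 0 means no limit */
};

/*
 * Return 0 if a payload of attrlen bytes passes the policy,
 * -ERANGE otherwise.
 */
static int model_validate_len(const struct model_nla_policy *pt,
			      const char *data, size_t attrlen)
{
	switch (pt->type) {
	case MODEL_NLA_STRING:
		/* A string needs at least one byte (e.g. the "\0"). */
		if (attrlen < 1)
			return -ERANGE;
		if (pt->len) {
			/* A trailing NUL does not count against the limit. */
			if (data[attrlen - 1] == '\0')
				attrlen--;
			if (attrlen > pt->len)
				return -ERANGE;
		}
		return 0;
	case MODEL_NLA_BINARY:
		/* Only the upper bound is checked, so 0 bytes is accepted. */
		if (pt->len && attrlen > pt->len)
			return -ERANGE;
		return 0;
	}
	return -ERANGE;
}
```

With a limit of 255, a zero-length payload is rejected under the string model and accepted under the binary model, which matches the "Numerical result out of range" answer in the before case of the example.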

* Re: [PATCH v3 net-next] rtnetlink: bridge: use ext_ack instead of printk
From: David Ahern @ 2017-10-10 14:47 UTC (permalink / raw)
  To: Florian Westphal, netdev
In-Reply-To: <20171010144427.8341-1-fw@strlen.de>

On 10/10/17 8:44 AM, Florian Westphal wrote:
> @@ -3666,7 +3666,7 @@ static int rtnl_bridge_setlink(struct sk_buff *skb, struct nlmsghdr *nlh,
>  
>  	dev = __dev_get_by_index(net, ifm->ifi_index);
>  	if (!dev) {
> -		pr_info("PF_BRIDGE: RTM_SETLINK with unknown ifindex\n");
> +		NL_SET_ERR_MSG(extack, "RTM_SETLINK with unknown ifindex");
>  		return -ENODEV;
>  	}
>  
> @@ -3741,7 +3741,7 @@ static int rtnl_bridge_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
>  
>  	dev = __dev_get_by_index(net, ifm->ifi_index);
>  	if (!dev) {
> -		pr_info("PF_BRIDGE: RTM_SETLINK with unknown ifindex\n");
> +		NL_SET_ERR_MSG(extack, "RTM_SETLINK with unknown ifindex");
>  		return -ENODEV;
>  	}

missed a couple of 'RTM_* with' strings

* Re: [PATCH v4 1/2] net: phy: DP83822 initial driver submission
From: Dan Murphy @ 2017-10-10 14:45 UTC (permalink / raw)
  To: Andrew F. Davis, andrew, f.fainelli; +Cc: netdev, Woojung.Huh
In-Reply-To: <17264c76-92d5-9dc8-b5ec-9dc09cf38ec0@ti.com>

Andrew

Thanks for the review

On 10/09/2017 02:12 PM, Andrew F. Davis wrote:
> On 10/09/2017 11:59 AM, Dan Murphy wrote:
>> Add support for the TI DP83822 10/100Mbit ethernet phy.
>>
>> The DP83822 provides flexibility to connect to a MAC through a
>> standard MII, RMII or RGMII interface.
>>
>> In addition the DP83822 needs to be removed from the DP83848 driver
>> as the WoL support is added here for this device.
>>
>> Datasheet:
>> http://www.ti.com/product/DP83822I/datasheet
>>
>> Signed-off-by: Dan Murphy <dmurphy@ti.com>
>> ---
>>
>> v4 - Squash DP83822 removal patch into this patch -
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg192424.html
>>
>> v3 - Fixed WoL indication bit and removed mutex for suspend/resume - 
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg191891.html and
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg191665.html
>>
>> v2 - Updated per comments.  Removed unnecessary parentheses, called genphy_suspend/
>> resume routines and then performed WoL changes, reworked sopass storage and reduced
>> the number of phy reads, and moved WOL_SECURE_ON - 
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg191392.html
>>
>>  drivers/net/phy/Kconfig   |   5 +
>>  drivers/net/phy/Makefile  |   1 +
>>  drivers/net/phy/dp83822.c | 302 ++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/net/phy/dp83848.c |   3 -
>>  4 files changed, 308 insertions(+), 3 deletions(-)
>>  create mode 100644 drivers/net/phy/dp83822.c
>>
>> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
>> index cd931cf9dcc2..8e78a482e09e 100644
>> --- a/drivers/net/phy/Kconfig
>> +++ b/drivers/net/phy/Kconfig
>> @@ -277,6 +277,11 @@ config DAVICOM_PHY
>>  	---help---
>>  	  Currently supports dm9161e and dm9131
>>  
>> +config DP83822_PHY
>> +	tristate "Texas Instruments DP83822 PHY"
>> +	---help---
>> +	  Supports the DP83822 PHY.
>> +
>>  config DP83848_PHY
>>  	tristate "Texas Instruments DP83848 PHY"
>>  	---help---
>> diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
>> index 416df92fbf4f..df3b82ba8550 100644
>> --- a/drivers/net/phy/Makefile
>> +++ b/drivers/net/phy/Makefile
>> @@ -55,6 +55,7 @@ obj-$(CONFIG_CICADA_PHY)	+= cicada.o
>>  obj-$(CONFIG_CORTINA_PHY)	+= cortina.o
>>  obj-$(CONFIG_DAVICOM_PHY)	+= davicom.o
>>  obj-$(CONFIG_DP83640_PHY)	+= dp83640.o
>> +obj-$(CONFIG_DP83822_PHY)	+= dp83822.o
>>  obj-$(CONFIG_DP83848_PHY)	+= dp83848.o
>>  obj-$(CONFIG_DP83867_PHY)	+= dp83867.o
>>  obj-$(CONFIG_FIXED_PHY)		+= fixed_phy.o
>> diff --git a/drivers/net/phy/dp83822.c b/drivers/net/phy/dp83822.c
>> new file mode 100644
>> index 000000000000..de196dbc46cd
>> --- /dev/null
>> +++ b/drivers/net/phy/dp83822.c
>> @@ -0,0 +1,302 @@
>> +/*
>> + * Driver for the Texas Instruments DP83822 PHY
>> + *
>> + * Copyright (C) 2017 Texas Instruments Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/ethtool.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/kernel.h>
>> +#include <linux/mii.h>
>> +#include <linux/module.h>
>> +#include <linux/of.h>
>> +#include <linux/phy.h>
>> +#include <linux/netdevice.h>
>> +
>> +#define DP83822_PHY_ID	        0x2000a240
>> +#define DP83822_DEVADDR		0x1f
>> +
>> +#define MII_DP83822_MISR1	0x12
>> +#define MII_DP83822_MISR2	0x13
>> +#define MII_DP83822_RESET_CTRL	0x1f
>> +
>> +#define DP83822_HW_RESET	BIT(15)
>> +#define DP83822_SW_RESET	BIT(14)
>> +
>> +/* MISR1 bits */
>> +#define DP83822_RX_ERR_HF_INT_EN	BIT(0)
>> +#define DP83822_FALSE_CARRIER_HF_INT_EN	BIT(1)
>> +#define DP83822_ANEG_COMPLETE_INT_EN	BIT(2)
>> +#define DP83822_DUP_MODE_CHANGE_INT_EN	BIT(3)
>> +#define DP83822_SPEED_CHANGED_INT_EN	BIT(4)
>> +#define DP83822_LINK_STAT_INT_EN	BIT(5)
>> +#define DP83822_ENERGY_DET_INT_EN	BIT(6)
>> +#define DP83822_LINK_QUAL_INT_EN	BIT(7)
>> +
>> +/* MISR2 bits */
>> +#define DP83822_JABBER_DET_INT_EN	BIT(0)
>> +#define DP83822_WOL_PKT_INT_EN		BIT(1)
>> +#define DP83822_SLEEP_MODE_INT_EN	BIT(2)
>> +#define DP83822_MDI_XOVER_INT_EN	BIT(3)
>> +#define DP83822_LB_FIFO_INT_EN		BIT(4)
>> +#define DP83822_PAGE_RX_INT_EN		BIT(5)
>> +#define DP83822_ANEG_ERR_INT_EN		BIT(6)
>> +#define DP83822_EEE_ERROR_CHANGE_INT_EN	BIT(7)
>> +
>> +/* INT_STAT1 bits */
>> +#define DP83822_WOL_INT_EN	BIT(4)
>> +#define DP83822_WOL_INT_STAT	BIT(12)
>> +
>> +#define MII_DP83822_RXSOP1	0x04a5
>> +#define	MII_DP83822_RXSOP2	0x04a6
>> +#define	MII_DP83822_RXSOP3	0x04a7
>> +
>> +/* WoL Registers */
>> +#define	MII_DP83822_WOL_CFG	0x04a0
>> +#define	MII_DP83822_WOL_STAT	0x04a1
>> +#define	MII_DP83822_WOL_DA1	0x04a2
>> +#define	MII_DP83822_WOL_DA2	0x04a3
>> +#define	MII_DP83822_WOL_DA3	0x04a4
>> +
>> +/* WoL bits */
>> +#define DP83822_WOL_MAGIC_EN	BIT(1)
> 
> Datasheet seems to indicate MAGIC_EN is bit 0, not 1.

OK

> 
>> +#define DP83822_WOL_SECURE_ON	BIT(5)
>> +#define DP83822_WOL_EN		BIT(7)
>> +#define DP83822_WOL_INDICATION_SEL BIT(8)
>> +#define DP83822_WOL_CLR_INDICATION BIT(11)
>> +
>> +static int dp83822_ack_interrupt(struct phy_device *phydev)
>> +{
>> +	int err = phy_read(phydev, MII_DP83822_MISR1);
>> +
>> +	if (err < 0)
>> +		return err;
>> +
> 
> The above could also be written:
> 
> int err;
> 
> err = phy_read(phydev, MII_DP83822_MISR1);
> if (err < 0)
> 	return err;
> 
> This matches the below better and is more clear to me.

OK

> 
>> +	err = phy_read(phydev, MII_DP83822_MISR2);
>> +	if (err < 0)
>> +		return err;
>> +
>> +	return 0;
>> +}
>> +
>> +static int dp83822_set_wol(struct phy_device *phydev,
>> +			   struct ethtool_wolinfo *wol)
>> +{
>> +	struct net_device *ndev = phydev->attached_dev;
>> +	u16 value;
>> +	const u8 *mac;
>> +
>> +	if (wol->wolopts & (WAKE_MAGIC | WAKE_MAGICSECURE)) {
>> +		mac = (const u8 *)ndev->dev_addr;
>> +
>> +		if (!is_valid_ether_addr(mac))
>> +			return -EINVAL;
>> +
>> +		/* MAC addresses start with byte 5, but stored in mac[0].
>> +		 * 822 PHYs store bytes 4|5, 2|3, 0|1
>> +		 */
>> +		phy_write_mmd(phydev, DP83822_DEVADDR,
>> +			      MII_DP83822_WOL_DA1, (mac[1] << 8) | mac[0]);
>> +		phy_write_mmd(phydev, DP83822_DEVADDR,
>> +			      MII_DP83822_WOL_DA2, (mac[3] << 8) | mac[2]);
>> +		phy_write_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_DA3,
>> +			      (mac[5] << 8) | mac[4]);
> 
> This 'phy_write_mmd' doesn't match the others, 'MII_DP83822_WOL_DAx'
> should be on the next line, or all others should be on same.

ok
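
As an aside on the byte packing in the quoted hunk, here is a small standalone sketch of how a pair of bytes goes into one 16-bit register value, low byte first, and how it comes back out. The sk_* helpers are hypothetical and do no PHY access; they only mirror the (mac[1] << 8) | mac[0] pattern from the patch.

```c
#include <stdint.h>

/* Pack bytes 2*i and 2*i+1 into one 16-bit value, low byte first. */
static uint16_t sk_pack_pair(const uint8_t *bytes, int i)
{
	return (uint16_t)((bytes[2 * i + 1] << 8) | bytes[2 * i]);
}

/* Inverse of sk_pack_pair(): split a 16-bit value back into two bytes. */
static void sk_unpack_pair(uint16_t reg, uint8_t *bytes, int i)
{
	bytes[2 * i] = reg & 0xff;
	bytes[2 * i + 1] = reg >> 8;
}
```

The same pair of helpers would cover both the MAC address registers and the SecureOn password registers, since the patch packs and unpacks them the same way.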

> 
>> +
>> +		value = phy_read_mmd(phydev, DP83822_DEVADDR,
>> +				     MII_DP83822_WOL_CFG);
>> +		if (wol->wolopts & WAKE_MAGIC)
>> +			value |= DP83822_WOL_MAGIC_EN;
>> +		else
>> +			value &= ~DP83822_WOL_MAGIC_EN;
>> +
>> +		if (wol->wolopts & WAKE_MAGICSECURE) {
>> +			phy_write_mmd(phydev, DP83822_DEVADDR,
>> +				      MII_DP83822_RXSOP1,
>> +				      (wol->sopass[1] << 8) | wol->sopass[0]);
>> +			phy_write_mmd(phydev, DP83822_DEVADDR,
>> +				      MII_DP83822_RXSOP2,
>> +				      (wol->sopass[3] << 8) | wol->sopass[2]);
>> +			phy_write_mmd(phydev, DP83822_DEVADDR,
>> +				      MII_DP83822_RXSOP3,
>> +				      (wol->sopass[5] << 8) | wol->sopass[4]);
>> +			value |= DP83822_WOL_SECURE_ON;
>> +		} else {
>> +			value &= ~DP83822_WOL_SECURE_ON;
>> +		}
>> +
>> +		value |= (DP83822_WOL_EN | DP83822_WOL_INDICATION_SEL |
>> +			  DP83822_WOL_CLR_INDICATION);
>> +		phy_write_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG,
>> +			      value);
>> +	} else {
>> +		value = phy_read_mmd(phydev, DP83822_DEVADDR,
>> +				     MII_DP83822_WOL_CFG);
>> +		value &= ~DP83822_WOL_EN;
>> +		phy_write_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG,
>> +			      value);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void dp83822_get_wol(struct phy_device *phydev,
>> +			    struct ethtool_wolinfo *wol)
>> +{
>> +	int value;
>> +	u16 sopass_val;
>> +
>> +	wol->supported = (WAKE_MAGIC | WAKE_MAGICSECURE);
>> +	wol->wolopts = 0;
>> +
>> +	value = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG);
>> +	if (value & DP83822_WOL_MAGIC_EN)
>> +		wol->wolopts |= WAKE_MAGIC;
>> +
>> +	if (~value & DP83822_WOL_CLR_INDICATION)
>> +		wol->wolopts = 0;
> 
> I'm not sure I understand the logic here, why do we clear all other
> wolopts if this is not set?

Actually this needs to be a check of the WOL_ENABLE bit: if the WoL enable bit
is not set, it should just return to indicate that WoL is disabled, and the
rest of the opts should not matter.

> 
>> +
>> +	if (value & DP83822_WOL_SECURE_ON) {
>> +		wol->wolopts |= WAKE_MAGICSECURE;
>> +	} else {
>> +		wol->wolopts &= ~WAKE_MAGICSECURE;
> 
> wol->wolopts is set to 0 at the start, and nothing else sets it, why
> clear it here?

The above should fix this

> 
>> +		return;
>> +	}
>> +
>> +	sopass_val = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_RXSOP1);
>> +	wol->sopass[0] = (sopass_val & 0xff);
>> +	wol->sopass[1] = (sopass_val >> 8);
>> +
>> +	sopass_val = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_RXSOP2);
>> +	wol->sopass[2] = (sopass_val & 0xff);
>> +	wol->sopass[3] = (sopass_val >> 8);
>> +
>> +	sopass_val = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_RXSOP3);
>> +	wol->sopass[4] = (sopass_val & 0xff);
>> +	wol->sopass[5] = (sopass_val >> 8);
> 
> Why not encase the above password lines in the 'if (value &
> DP83822_WOL_SECURE_ON)' block above, then you can drop the whole else block.

Moved
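
Putting the get_wol comments together, the decision logic could look roughly like the sketch below. It is a userspace model, not the driver: the sk_* names are hypothetical, the register bits are copied from the patch except MAGIC_EN, which uses bit 0 as the review suggests, and all of them should be checked against the datasheet. The SK_WAKE_* values mirror WAKE_MAGIC and WAKE_MAGICSECURE from <linux/ethtool.h>.

```c
#include <stdint.h>

/* Assumed WOL_CFG bits (see the note above about the datasheet). */
#define SK_WOL_MAGIC_EN		(1u << 0)
#define SK_WOL_SECURE_ON	(1u << 5)
#define SK_WOL_EN		(1u << 7)

/* Local copies of the ethtool wake option flags. */
#define SK_WAKE_MAGIC		(1u << 5)
#define SK_WAKE_MAGICSECURE	(1u << 6)

/* Translate a WOL_CFG register value into ethtool wolopts. */
static uint32_t sk_wolopts_from_cfg(uint16_t cfg)
{
	uint32_t wolopts = 0;

	/* WoL disabled: report nothing, the other bits do not matter. */
	if (!(cfg & SK_WOL_EN))
		return 0;

	if (cfg & SK_WOL_MAGIC_EN)
		wolopts |= SK_WAKE_MAGIC;

	/* In a driver, the SecureOn password would be read in this branch. */
	if (cfg & SK_WOL_SECURE_ON)
		wolopts |= SK_WAKE_MAGICSECURE;

	return wolopts;
}
```

Checking the enable bit first removes the need for the else branch that clears WAKE_MAGICSECURE, since wolopts starts at 0.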

> 
>> +}
>> +
>> +static int dp83822_config_intr(struct phy_device *phydev)
>> +{
>> +	int misr_status;
>> +	int err;
>> +
>> +	if (phydev->interrupts == PHY_INTERRUPT_ENABLED) {
>> +		misr_status = phy_read(phydev, MII_DP83822_MISR1);
>> +		if (misr_status < 0)
>> +			return misr_status;
>> +
>> +		misr_status |= (DP83822_RX_ERR_HF_INT_EN |
>> +				DP83822_FALSE_CARRIER_HF_INT_EN |
>> +				DP83822_ANEG_COMPLETE_INT_EN |
>> +				DP83822_DUP_MODE_CHANGE_INT_EN |
>> +				DP83822_SPEED_CHANGED_INT_EN |
>> +				DP83822_LINK_STAT_INT_EN |
>> +				DP83822_ENERGY_DET_INT_EN |
>> +				DP83822_LINK_QUAL_INT_EN);
>> +
>> +		err = phy_write(phydev, MII_DP83822_MISR1, misr_status);
>> +		if (err < 0)
>> +			return err;
>> +
>> +		misr_status = phy_read(phydev, MII_DP83822_MISR2);
>> +		if (misr_status < 0)
>> +			return misr_status;
>> +
>> +		misr_status |= (DP83822_JABBER_DET_INT_EN |
>> +				DP83822_WOL_PKT_INT_EN |
>> +				DP83822_SLEEP_MODE_INT_EN |
>> +				DP83822_MDI_XOVER_INT_EN |
>> +				DP83822_LB_FIFO_INT_EN |
>> +				DP83822_PAGE_RX_INT_EN |
>> +				DP83822_ANEG_ERR_INT_EN |
>> +				DP83822_EEE_ERROR_CHANGE_INT_EN);
>> +
>> +		err = phy_write(phydev, MII_DP83822_MISR2, misr_status);
>> +	} else {
>> +		err = phy_write(phydev, MII_DP83822_MISR1, 0);
> 
> You should only clear the ones you set, I know it is all of them plus
> the other registers are read-only, but for clarity you could have a
> define with the mask you are using for each register and then ~MASK when
> clearing, like the dp83848.c driver.

The dp83848 driver only creates a define for setting the interrupts in the MISR register.
In that driver's ack_interrupt routine it just reads the MISR register and returns.  The mask is
not used anywhere else.  IMO it's a little overkill to create a define that is used once.
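
For comparison, the mask approach from the review comment would be something like the sketch below. It is only an illustration: SK_MISR1_INT_MASK and sk_misr_update() are hypothetical names, and the mask value assumes the eight enable bits (0 to 7) that the patch sets in MISR1.

```c
#include <stdint.h>

/* Hypothetical mask covering enable bits 0..7 of MISR1. */
#define SK_MISR1_INT_MASK	0x00ffu

/* Set the masked bits when enabling, clear only those bits when
 * disabling, and leave every other bit of the register untouched. */
static uint16_t sk_misr_update(uint16_t reg, uint16_t mask, int enable)
{
	return enable ? (uint16_t)(reg | mask) : (uint16_t)(reg & ~mask);
}
```

The disable path then becomes read, sk_misr_update(..., 0), write, once per register, presumably for both MISR1 and MISR2.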

> 
>> +		if (err < 0)
>> +			return err;
>> +
>> +		err = phy_write(phydev, MII_DP83822_MISR2, 0);
>> +	}
>> +
>> +	return err;
>> +}
>> +
>> +static int dp83822_phy_reset(struct phy_device *phydev)
>> +{
>> +	int err;
>> +
>> +	err = phy_write(phydev, MII_DP83822_RESET_CTRL, DP83822_HW_RESET);
>> +	if (err < 0)
>> +		return err;
>> +
>> +	return 0;
>> +}
>> +
>> +static int dp83822_suspend(struct phy_device *phydev)
>> +{
>> +	int value;
>> +
>> +	value = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG);
>> +
>> +	if (!(value & DP83822_WOL_EN))
>> +		genphy_suspend(phydev);
>> +
>> +	return 0;
>> +}
>> +
>> +static int dp83822_resume(struct phy_device *phydev)
>> +{
>> +	int value;
>> +
>> +	genphy_resume(phydev);
>> +
>> +	value = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG);
>> +
>> +	phy_write_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG, value |
>> +		      DP83822_WOL_CLR_INDICATION);
>> +
>> +
> 
> Extra newline.

Removed

> 
>> +	return 0;
>> +}
>> +
>> +static struct phy_driver dp83822_driver[] = {
>> +	{
>> +	 .phy_id = DP83822_PHY_ID,
>> +	 .phy_id_mask = 0xfffffff0,
>> +	 .name = "TI DP83822",
>> +	 .features = PHY_BASIC_FEATURES,
>> +	 .flags = PHY_HAS_INTERRUPT,
>> +	 .config_init = genphy_config_init,
>> +	 .soft_reset = dp83822_phy_reset,
>> +	 .get_wol = dp83822_get_wol,
>> +	 .set_wol = dp83822_set_wol,
>> +	 .ack_interrupt = dp83822_ack_interrupt,
>> +	 .config_intr = dp83822_config_intr,
>> +	 .config_aneg = genphy_config_aneg,
>> +	 .read_status = genphy_read_status,
>> +	 .suspend = dp83822_suspend,
>> +	 .resume = dp83822_resume,
>> +	 },
> 
> Something is not right about the indenting here, tab then space?
> 

Fixed

>> +};
>> +module_phy_driver(dp83822_driver);
>> +
>> +static struct mdio_device_id __maybe_unused dp83822_tbl[] = {
>> +	{ DP83822_PHY_ID, 0xfffffff0 },
>> +	{ },
>> +};
>> +MODULE_DEVICE_TABLE(mdio, dp83822_tbl);
>> +
>> +MODULE_DESCRIPTION("Texas Instruments DP83822 PHY driver");
>> +MODULE_AUTHOR("Dan Murphy <dmurphy@ti.com>");
>> +MODULE_LICENSE("GPL");
>> diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c
>> index 3de4fe4dda77..3966d43c5146 100644
>> --- a/drivers/net/phy/dp83848.c
>> +++ b/drivers/net/phy/dp83848.c
>> @@ -20,7 +20,6 @@
>>  #define TI_DP83620_PHY_ID		0x20005ce0
>>  #define NS_DP83848C_PHY_ID		0x20005c90
>>  #define TLK10X_PHY_ID			0x2000a210
>> -#define TI_DP83822_PHY_ID		0x2000a240
>>  
>>  /* Registers */
>>  #define DP83848_MICR			0x11 /* MII Interrupt Control Register */
>> @@ -80,7 +79,6 @@ static struct mdio_device_id __maybe_unused dp83848_tbl[] = {
>>  	{ NS_DP83848C_PHY_ID, 0xfffffff0 },
>>  	{ TI_DP83620_PHY_ID, 0xfffffff0 },
>>  	{ TLK10X_PHY_ID, 0xfffffff0 },
>> -	{ TI_DP83822_PHY_ID, 0xfffffff0 },
>>  	{ }
>>  };
>>  MODULE_DEVICE_TABLE(mdio, dp83848_tbl);
>> @@ -110,7 +108,6 @@ static struct phy_driver dp83848_driver[] = {
>>  	DP83848_PHY_DRIVER(NS_DP83848C_PHY_ID, "NS DP83848C 10/100 Mbps PHY"),
>>  	DP83848_PHY_DRIVER(TI_DP83620_PHY_ID, "TI DP83620 10/100 Mbps PHY"),
>>  	DP83848_PHY_DRIVER(TLK10X_PHY_ID, "TI TLK10X 10/100 Mbps PHY"),
>> -	DP83848_PHY_DRIVER(TI_DP83822_PHY_ID, "TI DP83822 10/100 Mbps PHY"),
>>  };
>>  module_phy_driver(dp83848_driver);
>>  
>>


-- 
------------------
Dan Murphy

* [PATCH v3 net-next] rtnetlink: bridge: use ext_ack instead of printk
From: Florian Westphal @ 2017-10-10 14:44 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, David Ahern

We can now piggyback error strings to userspace via extended acks
rather than using printk.

Before:
bridge fdb add 01:02:03:04:05:06 dev br0 vlan 4095
RTNETLINK answers: Invalid argument

After:
bridge fdb add 01:02:03:04:05:06 dev br0 vlan 4095
Error: invalid vlan id.

v3: drop 'RTM_' prefixes, suggested by David Ahern, they
are not useful, the add/del in bridge command line is enough.

Also reword error in response to malformed/bad vlan id attribute
size.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/core/rtnetlink.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e84d108cfee4..af2dea45df33 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3066,21 +3066,21 @@ int ndo_dflt_fdb_add(struct ndmsg *ndm,
 }
 EXPORT_SYMBOL(ndo_dflt_fdb_add);
 
-static int fdb_vid_parse(struct nlattr *vlan_attr, u16 *p_vid)
+static int fdb_vid_parse(struct nlattr *vlan_attr, u16 *p_vid,
+			 struct netlink_ext_ack *extack)
 {
 	u16 vid = 0;
 
 	if (vlan_attr) {
 		if (nla_len(vlan_attr) != sizeof(u16)) {
-			pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid vlan\n");
+			NL_SET_ERR_MSG(extack, "invalid vlan attribute size");
 			return -EINVAL;
 		}
 
 		vid = nla_get_u16(vlan_attr);
 
 		if (!vid || vid >= VLAN_VID_MASK) {
-			pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid vlan id %d\n",
-				vid);
+			NL_SET_ERR_MSG(extack, "invalid vlan id");
 			return -EINVAL;
 		}
 	}
@@ -3105,24 +3105,24 @@ static int rtnl_fdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	ndm = nlmsg_data(nlh);
 	if (ndm->ndm_ifindex == 0) {
-		pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid ifindex\n");
+		NL_SET_ERR_MSG(extack, "invalid ifindex");
 		return -EINVAL;
 	}
 
 	dev = __dev_get_by_index(net, ndm->ndm_ifindex);
 	if (dev == NULL) {
-		pr_info("PF_BRIDGE: RTM_NEWNEIGH with unknown ifindex\n");
+		NL_SET_ERR_MSG(extack, "unknown ifindex");
 		return -ENODEV;
 	}
 
 	if (!tb[NDA_LLADDR] || nla_len(tb[NDA_LLADDR]) != ETH_ALEN) {
-		pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid address\n");
+		NL_SET_ERR_MSG(extack, "invalid address");
 		return -EINVAL;
 	}
 
 	addr = nla_data(tb[NDA_LLADDR]);
 
-	err = fdb_vid_parse(tb[NDA_VLAN], &vid);
+	err = fdb_vid_parse(tb[NDA_VLAN], &vid, extack);
 	if (err)
 		return err;
 
@@ -3209,24 +3209,24 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	ndm = nlmsg_data(nlh);
 	if (ndm->ndm_ifindex == 0) {
-		pr_info("PF_BRIDGE: RTM_DELNEIGH with invalid ifindex\n");
+		NL_SET_ERR_MSG(extack, "invalid ifindex");
 		return -EINVAL;
 	}
 
 	dev = __dev_get_by_index(net, ndm->ndm_ifindex);
 	if (dev == NULL) {
-		pr_info("PF_BRIDGE: RTM_DELNEIGH with unknown ifindex\n");
+		NL_SET_ERR_MSG(extack, "unknown ifindex");
 		return -ENODEV;
 	}
 
 	if (!tb[NDA_LLADDR] || nla_len(tb[NDA_LLADDR]) != ETH_ALEN) {
-		pr_info("PF_BRIDGE: RTM_DELNEIGH with invalid address\n");
+		NL_SET_ERR_MSG(extack, "invalid address");
 		return -EINVAL;
 	}
 
 	addr = nla_data(tb[NDA_LLADDR]);
 
-	err = fdb_vid_parse(tb[NDA_VLAN], &vid);
+	err = fdb_vid_parse(tb[NDA_VLAN], &vid, extack);
 	if (err)
 		return err;
 
@@ -3666,7 +3666,7 @@ static int rtnl_bridge_setlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	dev = __dev_get_by_index(net, ifm->ifi_index);
 	if (!dev) {
-		pr_info("PF_BRIDGE: RTM_SETLINK with unknown ifindex\n");
+		NL_SET_ERR_MSG(extack, "RTM_SETLINK with unknown ifindex");
 		return -ENODEV;
 	}
 
@@ -3741,7 +3741,7 @@ static int rtnl_bridge_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	dev = __dev_get_by_index(net, ifm->ifi_index);
 	if (!dev) {
-		pr_info("PF_BRIDGE: RTM_SETLINK with unknown ifindex\n");
+		NL_SET_ERR_MSG(extack, "RTM_SETLINK with unknown ifindex");
 		return -ENODEV;
 	}
 
-- 
2.13.6

* Re: [patch net-next 2/4] net: sched: introduce per-egress action device callbacks
From: Jiri Pirko @ 2017-10-10 14:31 UTC (permalink / raw)
  To: David Laight
  Cc: netdev@vger.kernel.org, davem@davemloft.net, jhs@mojatatu.com,
	xiyou.wangcong@gmail.com, saeedm@mellanox.com,
	matanb@mellanox.com, leonro@mellanox.com, mlxsw@mellanox.com
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DD008ED67@AcuExch.aculab.com>

Tue, Oct 10, 2017 at 03:31:59PM CEST, David.Laight@ACULAB.COM wrote:
>From: Jiri Pirko
>> Sent: 10 October 2017 08:30
>> Introduce infrastructure that allows drivers to register callbacks that
>> are called whenever tc would offload inserted rule and specified device
>> acts as tc action egress device.
>
>How does a driver safely unregister a callback?
>(to avoid a race with the callback being called.)
>
>Usually this requires a callback in the context that makes the
>notification callbacks indicating that no more such callbacks
>will be made.

rtnl is your answer. It is held during register, unregister and the callback.

* Re: [PATCH v2 net-next] rtnetlink: bridge: use ext_ack instead of printk
From: Florian Westphal @ 2017-10-10 14:28 UTC (permalink / raw)
  To: David Ahern; +Cc: Florian Westphal, netdev
In-Reply-To: <9c01905a-92e1-1246-35ee-2a60ac11733e@gmail.com>

David Ahern <dsahern@gmail.com> wrote:
> On 10/10/17 5:32 AM, Florian Westphal wrote:
> > diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> > index e84d108cfee4..19ea53a5210f 100644
> > --- a/net/core/rtnetlink.c
> > +++ b/net/core/rtnetlink.c
> > @@ -3066,21 +3066,22 @@ int ndo_dflt_fdb_add(struct ndmsg *ndm,
> >  }
> >  EXPORT_SYMBOL(ndo_dflt_fdb_add);
> >  
> > -static int fdb_vid_parse(struct nlattr *vlan_attr, u16 *p_vid)
> > +static int fdb_vid_parse(struct nlattr *vlan_attr, u16 *p_vid,
> > +			 struct netlink_ext_ack *exta)
> >  {
> >  	u16 vid = 0;
> >  
> >  	if (vlan_attr) {
> >  		if (nla_len(vlan_attr) != sizeof(u16)) {
> > -			pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid vlan\n");
> > +			NL_SET_ERR_MSG(exta, "RTM_NEWNEIGH with invalid vlan");
> 
> I realize you are keeping the existing wording, but the messages are
> moving from out of line pr_info to inline message in response to a user
> command.  From a user's perspective the RTM_NEWNEIGH and DELNEIGH do not
> add much value, and the add and del in the bridge command tells which it
> is. So in this case just emit "Invalid vlan id".

Right, makes sense.

> Although this failure is an invalid vlan attribute as opposed to an
> invalid vlan id which is what the next message checks. So the message
> needs to be updated as well.

Indeed, I'll send a v3, thanks!
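
To illustrate the two failure cases discussed here (bad attribute size vs bad vlan id), a userspace sketch of the parse step with a stand-in for the extended ack follows. struct sk_ext_ack and sk_fdb_vid_parse() are hypothetical; the real code uses struct netlink_ext_ack and NL_SET_ERR_MSG(). The message strings are the ones from the v3 posting, and SK_VLAN_VID_MASK mirrors the kernel's VLAN_VID_MASK value of 0x0fff.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <errno.h>

#define SK_VLAN_VID_MASK	0x0fff

/* Hypothetical stand-in for the extended ack: just the message. */
struct sk_ext_ack {
	const char *msg;
};

/* Parse an optional vlan id payload; on error, record a message that
 * would be reported back to the caller. */
static int sk_fdb_vid_parse(const void *payload, size_t len, uint16_t *p_vid,
			    struct sk_ext_ack *extack)
{
	uint16_t vid = 0;

	if (payload) {
		/* Malformed attribute: wrong payload size. */
		if (len != sizeof(uint16_t)) {
			extack->msg = "invalid vlan attribute size";
			return -EINVAL;
		}

		memcpy(&vid, payload, sizeof(vid));

		/* Well-formed attribute, but the id is out of range. */
		if (!vid || vid >= SK_VLAN_VID_MASK) {
			extack->msg = "invalid vlan id";
			return -EINVAL;
		}
	}

	*p_vid = vid;
	return 0;
}
```

The point is only that each failure sets its own message, so userspace can tell a malformed attribute from an out-of-range id such as 4095.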

* Re: net/wireless/ray_cs: Convert timers to use
From: Kees Cook @ 2017-10-10 14:22 UTC (permalink / raw)
  To: Kalle Valo; +Cc: LKML, linux-wireless, Network Development, Thomas Gleixner
In-Reply-To: <20171010082603.8E283607C5@smtp.codeaurora.org>

On Tue, Oct 10, 2017 at 1:26 AM, Kalle Valo <kvalo@codeaurora.org> wrote:
> Kees Cook <keescook@chromium.org> wrote:
>
>> In preparation for unconditionally passing the struct timer_list pointer to
>> all timer callbacks, switch to using the new timer_setup() and from_timer()
>> to pass the timer pointer explicitly.
>>
>> Cc: Kalle Valo <kvalo@codeaurora.org>
>> Cc: linux-wireless@vger.kernel.org
>> Cc: netdev@vger.kernel.org
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Signed-off-by: Kees Cook <keescook@chromium.org>
>
> I'll apply this once I have fast forwarded wireless-drivers-next to
> -rc3. I'll also fix the title, what was it supposed to say?

It was truncated from "net/wireless/ray_cs: Convert timers to use
timer_setup()"; I've fixed that glitch in my workflow now.

> Patch set to Awaiting Upstream.

Thanks!

-Kees

-- 
Kees Cook
Pixel Security

* Re: [PATCH net-next v2 5/5] selinux: bpf: Add addtional check for bpf object file receive
From: Stephen Smalley @ 2017-10-10 14:24 UTC (permalink / raw)
  To: Chenbo Feng, linux-security-module, netdev, SELinux
  Cc: Daniel Borkmann, Chenbo Feng, Alexei Starovoitov, lorenzo
In-Reply-To: <20171009222028.13096-6-chenbofeng.kernel@gmail.com>

On Mon, 2017-10-09 at 15:20 -0700, Chenbo Feng wrote:
> From: Chenbo Feng <fengc@google.com>
> 
> Introduce a bpf object related check when sending and receiving files
> through unix domain socket as well as binder. It checks if the receiving
> process has the privilege to read/write the bpf map or use the bpf program.
> This check is necessary because the bpf maps and programs are using an
> anonymous inode as their shared inode, so the normal way of checking the
> files and sockets when passing between processes cannot work properly on
> eBPF objects. This check only works when the BPF_SYSCALL is configured.
> The information stored inside the file security struct is the same as
> the information in bpf object security struct.
> 
> Signed-off-by: Chenbo Feng <fengc@google.com>
> ---
>  include/linux/bpf.h       |  3 +++
>  include/linux/lsm_hooks.h | 17 +++++++++++++
>  include/linux/security.h  |  9 +++++++
>  kernel/bpf/syscall.c      |  4 ++--
>  security/security.c       |  8 +++++++
>  security/selinux/hooks.c  | 61
> +++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 100 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 225740688ab7..81d6c01b8825 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -285,6 +285,9 @@ int bpf_prog_array_copy_to_user(struct
> bpf_prog_array __rcu *progs,
>  #ifdef CONFIG_BPF_SYSCALL
>  DECLARE_PER_CPU(int, bpf_prog_active);
>  
> +extern const struct file_operations bpf_map_fops;
> +extern const struct file_operations bpf_prog_fops;
> +
>  #define BPF_PROG_TYPE(_id, _ops) \
>  	extern const struct bpf_verifier_ops _ops;
>  #define BPF_MAP_TYPE(_id, _ops) \
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 7161d8e7ee79..517dea60b87b 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1385,6 +1385,19 @@
>   * @bpf_prog_free_security:
>   *	Clean up the security information stored inside bpf prog.
>   *
> + * @bpf_map_file:
> + *	When creating a bpf map fd, set up the file security
> information with
> + *	the bpf security information stored in the map struct. So
> when the map
> + *	fd is passed between processes, the security module can
> directly read
> + *	the security information from file security struct rather
> than the bpf
> + *	security struct.
> + *
> + * @bpf_prog_file:
> + *	When creating a bpf prog fd, set up the file security
> information with
> + *	the bpf security information stored in the prog struct. So
> when the prog
> + *	fd is passed between processes, the security module can
> directly read
> + *	the security information from file security struct rather
> than the bpf
> + *	security struct.
>   */
>  union security_list_options {
>  	int (*binder_set_context_mgr)(struct task_struct *mgr);
> @@ -1726,6 +1739,8 @@ union security_list_options {
>  	void (*bpf_map_free_security)(struct bpf_map *map);
>  	int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux);
>  	void (*bpf_prog_free_security)(struct bpf_prog_aux *aux);
> +	void (*bpf_map_file)(struct bpf_map *map, struct file
> *file);
> +	void (*bpf_prog_file)(struct bpf_prog_aux *aux, struct file
> *file);
>  #endif /* CONFIG_BPF_SYSCALL */
>  };
>  
> @@ -1954,6 +1969,8 @@ struct security_hook_heads {
>  	struct list_head bpf_map_free_security;
>  	struct list_head bpf_prog_alloc_security;
>  	struct list_head bpf_prog_free_security;
> +	struct list_head bpf_map_file;
> +	struct list_head bpf_prog_file;
>  #endif /* CONFIG_BPF_SYSCALL */
>  } __randomize_layout;
>  
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 18800b0911e5..57573b794e2d 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -1740,6 +1740,8 @@ extern int security_bpf_map_alloc(struct
> bpf_map *map);
>  extern void security_bpf_map_free(struct bpf_map *map);
>  extern int security_bpf_prog_alloc(struct bpf_prog_aux *aux);
>  extern void security_bpf_prog_free(struct bpf_prog_aux *aux);
> +extern void security_bpf_map_file(struct bpf_map *map, struct file
> *file);
> +extern void security_bpf_prog_file(struct bpf_prog_aux *aux, struct
> file *file);
>  #else
>  static inline int security_bpf(int cmd, union bpf_attr *attr,
>  					     unsigned int size)
> @@ -1772,6 +1774,13 @@ static inline int
> security_bpf_prog_alloc(struct bpf_prog_aux *aux)
>  
>  static inline void security_bpf_prog_free(struct bpf_prog_aux *aux)
>  { }
> +
> +static inline void security_bpf_map_file(struct bpf_map *map, struct
> file *file)
> +{ }
> +
> +static inline void security_bpf_prog_file(struct bpf_prog_aux *aux,
> +					  struct file *file)
> +{ }
>  #endif /* CONFIG_SECURITY */
>  #endif /* CONFIG_BPF_SYSCALL */
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 1cf31ddd7616..b144181d3f3a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -313,7 +313,7 @@ static ssize_t bpf_dummy_write(struct file *filp,
> const char __user *buf,
>  	return -EINVAL;
>  }
>  
> -static const struct file_operations bpf_map_fops = {
> +const struct file_operations bpf_map_fops = {
>  #ifdef CONFIG_PROC_FS
>  	.show_fdinfo	= bpf_map_show_fdinfo,
>  #endif
> @@ -964,7 +964,7 @@ static void bpf_prog_show_fdinfo(struct seq_file
> *m, struct file *filp)
>  }
>  #endif
>  
> -static const struct file_operations bpf_prog_fops = {
> +const struct file_operations bpf_prog_fops = {
>  #ifdef CONFIG_PROC_FS
>  	.show_fdinfo	= bpf_prog_show_fdinfo,
>  #endif
> diff --git a/security/security.c b/security/security.c
> index 1cd8526cb0b7..dacf649b8cfa 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1734,4 +1734,12 @@ void security_bpf_prog_free(struct
> bpf_prog_aux *aux)
>  {
>  	call_void_hook(bpf_prog_free_security, aux);
>  }
> +void security_bpf_map_file(struct bpf_map *map, struct file *file)
> +{
> +	call_void_hook(bpf_map_file, map, file);
> +}
> +void security_bpf_prog_file(struct bpf_prog_aux *aux, struct file
> *file)
> +{
> +	call_void_hook(bpf_prog_file, aux, file);
> +}
>  #endif /* CONFIG_BPF_SYSCALL */
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 41aba4e3d57c..fea88655e0ee 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -1815,6 +1815,10 @@ static inline int file_path_has_perm(const
> struct cred *cred,
>  	return inode_has_perm(cred, file_inode(file), av, &ad);
>  }
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +static int bpf_file_check(struct file *file, u32 sid);
> +#endif
> +
>  /* Check whether a task can use an open file descriptor to
>     access an inode in a given way.  Check access to the
>     descriptor itself, and then use dentry_has_perm to
> @@ -1845,6 +1849,12 @@ static int file_has_perm(const struct cred
> *cred,
>  			goto out;
>  	}
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +	rc = bpf_file_check(file, cred_sid(cred));
> +	if (rc)
> +		goto out;
> +#endif
> +
>  	/* av is zero if only checking access to the descriptor. */
>  	rc = 0;
>  	if (av)
> @@ -2165,6 +2175,12 @@ static int selinux_binder_transfer_file(struct
> task_struct *from,
>  			return rc;
>  	}
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +	rc = bpf_file_check(file, sid);
> +	if (rc)
> +		return rc;
> +#endif
> +
>  	if (unlikely(IS_PRIVATE(d_backing_inode(dentry))))
>  		return 0;
>  
> @@ -6288,6 +6304,33 @@ static u32 bpf_map_fmode_to_av(fmode_t fmode)
>  	return av;
>  }
>  
> +/* This function checks a file passed through a unix socket or
> binder to see
> + * if it is a bpf related object, and applies the corresponding checks on
> the bpf
> + * object based on its type. The bpf maps and programs, unlike
> other files and
> + * sockets, use a shared anonymous inode inside the kernel as
> their inode.
> + * So checking that inode cannot identify whether the process has the
> privilege to
> + * access the bpf object, which is why we have to add this
> additional check in
> + * selinux_file_receive and selinux_binder_transfer_file.
> + */
> +static int bpf_file_check(struct file *file, u32 sid)
> +{
> +	struct file_security_struct *fsec = file->f_security;
> +	int ret;
> +
> +	if (file->f_op == &bpf_map_fops) {
> +		ret = avc_has_perm(sid, fsec->sid, SECCLASS_BPF_MAP,
> +				   bpf_map_fmode_to_av(file-
> >f_mode), NULL);
> +		if (ret)
> +			return ret;
> +	} else if (file->f_op == &bpf_prog_fops) {
> +		ret = avc_has_perm(sid, fsec->sid,
> SECCLASS_BPF_PROG,
> +				   BPF_PROG__USE, NULL);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
>  static int selinux_bpf_map(struct bpf_map *map, fmode_t fmode)
>  {
>  	u32 sid = current_sid();
> @@ -6351,6 +6394,22 @@ static void selinux_bpf_prog_free(struct
> bpf_prog_aux *aux)
>  	aux->security = NULL;
>  	kfree(bpfsec);
>  }
> +
> +static void selinux_bpf_map_file(struct bpf_map *map, struct file
> *file)
> +{
> +	struct bpf_security_struct *bpfsec = map->security;
> +	struct file_security_struct *fsec = file->f_security;
> +
> +	fsec->sid = bpfsec->sid;
> +}
> +
> +static void selinux_bpf_prog_file(struct bpf_prog_aux *aux, struct
> file *file)
> +{
> +	struct bpf_security_struct *bpfsec = aux->security;
> +	struct file_security_struct *fsec = file->f_security;
> +
> +	fsec->sid = bpfsec->sid;

I could be wrong, but isn't it the case that fsec->sid already will
equal bpfsec->sid, because they are both created by the same thread
during the same system call, and they each inherit the SID of the
current task?

What I expected you to do was to add and set a flags field in the
file_security_struct to indicate that this is a bpf map or prog, and
then test for that in your bpf_file_check() function instead of having
to export and test the fops structures.
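
To make the suggestion concrete, here is a rough userspace sketch of
the flag-based dispatch. It is not kernel code: struct file_security,
enum bpf_file_kind, bpf_file_check_sketch() and the toy policy
same_sid_only() are all hypothetical stand-ins invented for
illustration, and the callback only models the role avc_has_perm()
plays in the patch.

```c
#include <stdint.h>

/* Hypothetical flag values; the real field name and encoding are open. */
enum bpf_file_kind {
	FILE_NOT_BPF = 0,
	FILE_BPF_MAP = 1,
	FILE_BPF_PROG = 2,
};

/* Models file_security_struct with the proposed extra field. */
struct file_security {
	uint32_t sid;            /* SID inherited from the creating task */
	enum bpf_file_kind kind; /* set once when the bpf fd is created */
};

/* Stand-in for the permission check: returns 0 if access is allowed. */
typedef int (*perm_check_fn)(uint32_t ssid, uint32_t tsid,
			     enum bpf_file_kind kind);

/* Models bpf_file_check(): dispatch on the flag, no f_op comparison. */
static int bpf_file_check_sketch(const struct file_security *fsec,
				 uint32_t sid, perm_check_fn check)
{
	if (fsec->kind == FILE_NOT_BPF)
		return 0; /* ordinary file: nothing extra to check */
	return check(sid, fsec->sid, fsec->kind);
}

/* Toy policy: a task may only use bpf objects labeled with its own SID. */
static int same_sid_only(uint32_t ssid, uint32_t tsid,
			 enum bpf_file_kind kind)
{
	(void)kind;
	return ssid == tsid ? 0 : -13; /* -13 is -EACCES on Linux */
}
```

With something along these lines, the fops structures could stay
static in kernel/bpf/syscall.c, since the check no longer needs their
addresses.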


> +}
>  #endif
>  
>  static struct security_hook_list selinux_hooks[] __lsm_ro_after_init
> = {
> @@ -6581,6 +6640,8 @@ static struct security_hook_list
> selinux_hooks[] __lsm_ro_after_init = {
>  	LSM_HOOK_INIT(bpf_prog_alloc_security,
> selinux_bpf_prog_alloc),
>  	LSM_HOOK_INIT(bpf_map_free_security, selinux_bpf_map_free),
>  	LSM_HOOK_INIT(bpf_prog_free_security,
> selinux_bpf_prog_free),
> +	LSM_HOOK_INIT(bpf_map_file, selinux_bpf_map_file),
> +	LSM_HOOK_INIT(bpf_prog_file, selinux_bpf_prog_file),
>  #endif
>  };
>  

^ permalink raw reply

* Re: [PATCH v2 net-next] rtnetlink: bridge: use ext_ack instead of printk
From: David Ahern @ 2017-10-10 14:18 UTC (permalink / raw)
  To: Florian Westphal, netdev
In-Reply-To: <20171010113236.8889-1-fw@strlen.de>

On 10/10/17 5:32 AM, Florian Westphal wrote:
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index e84d108cfee4..19ea53a5210f 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -3066,21 +3066,22 @@ int ndo_dflt_fdb_add(struct ndmsg *ndm,
>  }
>  EXPORT_SYMBOL(ndo_dflt_fdb_add);
>  
> -static int fdb_vid_parse(struct nlattr *vlan_attr, u16 *p_vid)
> +static int fdb_vid_parse(struct nlattr *vlan_attr, u16 *p_vid,
> +			 struct netlink_ext_ack *exta)
>  {
>  	u16 vid = 0;
>  
>  	if (vlan_attr) {
>  		if (nla_len(vlan_attr) != sizeof(u16)) {
> -			pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid vlan\n");
> +			NL_SET_ERR_MSG(exta, "RTM_NEWNEIGH with invalid vlan");

I realize you are keeping the existing wording, but the messages are
moving from an out-of-line pr_info to an inline message returned in
response to a user command. From a user's perspective, RTM_NEWNEIGH and
RTM_DELNEIGH do not add much value, and the add or del in the bridge
command already tells which it is. So in this case just emit
"Invalid vlan id".

However, this failure is an invalid vlan attribute, as opposed to an
invalid vlan id, which is what the next check reports. So this message
needs to be updated as well.
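
A small userspace model of the two checks, with one distinct message
per failure, might look like the following. This is only a sketch:
vid_parse_sketch() is a made-up name, attr_len stands in for
nla_len(vlan_attr), raw_vid for nla_get_u16(vlan_attr), and the exact
message wording is an assumption rather than a proposal for the final
strings.

```c
#include <stddef.h>
#include <stdint.h>

#define VLAN_VID_MASK_SKETCH 0x0fff /* same value as the kernel's VLAN_VID_MASK */

/*
 * Returns 0 and stores the VLAN id on success, or -22 (-EINVAL) with
 * *msg pointing at a message that names the specific failure.
 */
static int vid_parse_sketch(size_t attr_len, uint16_t raw_vid,
			    uint16_t *p_vid, const char **msg)
{
	*msg = NULL;

	/* First check: the attribute payload must be exactly a u16. */
	if (attr_len != sizeof(uint16_t)) {
		*msg = "invalid vlan attribute size"; /* assumed wording */
		return -22;
	}

	/* Second check: 0 and values >= the mask are not usable ids. */
	if (!raw_vid || raw_vid >= VLAN_VID_MASK_SKETCH) {
		*msg = "invalid vlan id";
		return -22;
	}

	*p_vid = raw_vid;
	return 0;
}
```

Both paths still return -EINVAL, so only the text seen by the user
differs between them.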


>  			return -EINVAL;
>  		}
>  
>  		vid = nla_get_u16(vlan_attr);
>  
>  		if (!vid || vid >= VLAN_VID_MASK) {
> -			pr_info("PF_BRIDGE: RTM_NEWNEIGH with invalid vlan id %d\n",
> -				vid);
> +			NL_SET_ERR_MSG(exta,
> +				       "RTM_NEWNEIGH with invalid vlan id");
>  			return -EINVAL;
>  		}
>  	}

^ permalink raw reply

* Re: [PATCH net-next v2 4/5] selinux: bpf: Add selinux check for eBPF syscall operations
From: Stephen Smalley @ 2017-10-10 14:18 UTC (permalink / raw)
  To: Chenbo Feng, linux-security-module, netdev, SELinux
  Cc: Jeffrey Vander Stoep, Alexei Starovoitov, lorenzo,
	Daniel Borkmann, Chenbo Feng
In-Reply-To: <20171009222028.13096-5-chenbofeng.kernel@gmail.com>

On Mon, 2017-10-09 at 15:20 -0700, Chenbo Feng wrote:
> From: Chenbo Feng <fengc@google.com>
> 
> Implement the actual checks introduced to eBPF related syscalls. This
> implementation uses the security field inside the bpf object to store a
> sid that identifies the bpf object. When processes try to access the
> object, selinux checks whether they have the right privileges. The
> creation of eBPF objects is also checked at the general bpf check hook,
> and new cmds introduced to the eBPF domain can also be checked there.
> 
> Signed-off-by: Chenbo Feng <fengc@google.com>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  security/selinux/hooks.c            | 111
> ++++++++++++++++++++++++++++++++++++
>  security/selinux/include/classmap.h |   2 +
>  security/selinux/include/objsec.h   |   4 ++
>  3 files changed, 117 insertions(+)
> 
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index f5d304736852..41aba4e3d57c 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -85,6 +85,7 @@
>  #include <linux/export.h>
>  #include <linux/msg.h>
>  #include <linux/shm.h>
> +#include <linux/bpf.h>
>  
>  #include "avc.h"
>  #include "objsec.h"
> @@ -6252,6 +6253,106 @@ static void selinux_ib_free_security(void
> *ib_sec)
>  }
>  #endif
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +static int selinux_bpf(int cmd, union bpf_attr *attr,
> +				     unsigned int size)
> +{
> +	u32 sid = current_sid();
> +	int ret;
> +
> +	switch (cmd) {
> +	case BPF_MAP_CREATE:
> +		ret = avc_has_perm(sid, sid, SECCLASS_BPF_MAP,
> BPF_MAP__CREATE,
> +				   NULL);
> +		break;
> +	case BPF_PROG_LOAD:
> +		ret = avc_has_perm(sid, sid, SECCLASS_BPF_PROG,
> BPF_PROG__LOAD,
> +				   NULL);
> +		break;
> +	default:
> +		ret = 0;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static u32 bpf_map_fmode_to_av(fmode_t fmode)
> +{
> +	u32 av = 0;
> +
> +	if (fmode & FMODE_READ)
> +		av |= BPF_MAP__READ;
> +	if (fmode & FMODE_WRITE)
> +		av |= BPF_MAP__WRITE;
> +	return av;
> +}
> +
> +static int selinux_bpf_map(struct bpf_map *map, fmode_t fmode)
> +{
> +	u32 sid = current_sid();
> +	struct bpf_security_struct *bpfsec;
> +
> +	bpfsec = map->security;
> +	return avc_has_perm(sid, bpfsec->sid, SECCLASS_BPF_MAP,
> +			    bpf_map_fmode_to_av(fmode), NULL);
> +}
> +
> +static int selinux_bpf_prog(struct bpf_prog *prog)
> +{
> +	u32 sid = current_sid();
> +	struct bpf_security_struct *bpfsec;
> +
> +	bpfsec = prog->aux->security;
> +	return avc_has_perm(sid, bpfsec->sid, SECCLASS_BPF_PROG,
> +			    BPF_PROG__USE, NULL);
> +}
> +
> +static int selinux_bpf_map_alloc(struct bpf_map *map)
> +{
> +	struct bpf_security_struct *bpfsec;
> +
> +	bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL);
> +	if (!bpfsec)
> +		return -ENOMEM;
> +
> +	bpfsec->sid = current_sid();
> +	map->security = bpfsec;
> +
> +	return 0;
> +}
> +
> +static void selinux_bpf_map_free(struct bpf_map *map)
> +{
> +	struct bpf_security_struct *bpfsec = map->security;
> +
> +	map->security = NULL;
> +	kfree(bpfsec);
> +}
> +
> +static int selinux_bpf_prog_alloc(struct bpf_prog_aux *aux)
> +{
> +	struct bpf_security_struct *bpfsec;
> +
> +	bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL);
> +	if (!bpfsec)
> +		return -ENOMEM;
> +
> +	bpfsec->sid = current_sid();
> +	aux->security = bpfsec;
> +
> +	return 0;
> +}
> +
> +static void selinux_bpf_prog_free(struct bpf_prog_aux *aux)
> +{
> +	struct bpf_security_struct *bpfsec = aux->security;
> +
> +	aux->security = NULL;
> +	kfree(bpfsec);
> +}
> +#endif
> +
>  static struct security_hook_list selinux_hooks[] __lsm_ro_after_init
> = {
>  	LSM_HOOK_INIT(binder_set_context_mgr,
> selinux_binder_set_context_mgr),
>  	LSM_HOOK_INIT(binder_transaction,
> selinux_binder_transaction),
> @@ -6471,6 +6572,16 @@ static struct security_hook_list
> selinux_hooks[] __lsm_ro_after_init = {
>  	LSM_HOOK_INIT(audit_rule_match, selinux_audit_rule_match),
>  	LSM_HOOK_INIT(audit_rule_free, selinux_audit_rule_free),
>  #endif
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +	LSM_HOOK_INIT(bpf, selinux_bpf),
> +	LSM_HOOK_INIT(bpf_map, selinux_bpf_map),
> +	LSM_HOOK_INIT(bpf_prog, selinux_bpf_prog),
> +	LSM_HOOK_INIT(bpf_map_alloc_security,
> selinux_bpf_map_alloc),
> +	LSM_HOOK_INIT(bpf_prog_alloc_security,
> selinux_bpf_prog_alloc),
> +	LSM_HOOK_INIT(bpf_map_free_security, selinux_bpf_map_free),
> +	LSM_HOOK_INIT(bpf_prog_free_security,
> selinux_bpf_prog_free),
> +#endif
>  };
>  
>  static __init int selinux_init(void)
> diff --git a/security/selinux/include/classmap.h
> b/security/selinux/include/classmap.h
> index 35ffb29a69cb..7253c5eea59c 100644
> --- a/security/selinux/include/classmap.h
> +++ b/security/selinux/include/classmap.h
> @@ -237,6 +237,8 @@ struct security_class_mapping secclass_map[] = {
>  	  { "access", NULL } },
>  	{ "infiniband_endport",
>  	  { "manage_subnet", NULL } },
> +	{ "bpf_map", {"create", "read", "write"} },
> +	{ "bpf_prog", {"load", "use"} },

Again I have to ask: do you truly need/want two separate classes, or
would a single class with distinct permissions suffice, à la:
        { "bpf", { "create_map", "read_map", "write_map", "prog_load",
                   "prog_use" } },

Then allow A self:bpf { create_map read_map write_map prog_load
prog_use }; would be stored in a single policy avtab rule and be
cached in a single AVC entry.
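
As a rough illustration of why one class is enough, the five
permissions fit into a single access vector, so one cached decision
answers every bpf check for a given source/target pair. The bit values,
struct av_decision_sketch and has_perm_sketch() below are hypothetical
and only model the idea; they are not the SELinux AVC API.

```c
#include <stdint.h>

/* Hypothetical permission bits for a single "bpf" class. */
enum {
	BPF_SKETCH_CREATE_MAP = 1u << 0,
	BPF_SKETCH_READ_MAP   = 1u << 1,
	BPF_SKETCH_WRITE_MAP  = 1u << 2,
	BPF_SKETCH_PROG_LOAD  = 1u << 3,
	BPF_SKETCH_PROG_USE   = 1u << 4,
};

/* One cached decision per (source SID, target SID, class) triple. */
struct av_decision_sketch {
	uint32_t allowed; /* bitmask of granted permissions */
};

/* Returns 0 if every requested bit is granted, -13 (-EACCES) otherwise. */
static int has_perm_sketch(const struct av_decision_sketch *avd,
			   uint32_t requested)
{
	return (avd->allowed & requested) == requested ? 0 : -13;
}
```

With two classes, the same set of permissions would need two such
entries per source/target pair.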

>  	{ NULL }
>    };
>  
> diff --git a/security/selinux/include/objsec.h
> b/security/selinux/include/objsec.h
> index 1649cd18eb0b..3d54468ce334 100644
> --- a/security/selinux/include/objsec.h
> +++ b/security/selinux/include/objsec.h
> @@ -150,6 +150,10 @@ struct pkey_security_struct {
>  	u32	sid;	/* SID of pkey */
>  };
>  
> +struct bpf_security_struct {
> +	u32 sid;  /* SID of bpf obj creator */
> +};
> +
>  extern unsigned int selinux_checkreqprot;
>  
>  #endif /* _SELINUX_OBJSEC_H_ */

^ permalink raw reply

* [PATCH net-next] selftests: rtnetlink: test RTM_GETNETCONF
From: Florian Westphal @ 2017-10-10 14:18 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

Exercise the RTM_GETNETCONF call path for the unspec, inet and inet6
families; they are DOIT_UNLOCKED candidates.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 tools/testing/selftests/net/rtnetlink.sh | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index e8c86c416ed0..a8a8cdf726b2 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -37,6 +37,26 @@ kci_del_dummy()
 	check_err $?
 }
 
+kci_test_netconf()
+{
+	dev="$1"
+	r=$ret
+
+	ip netconf show dev "$dev" > /dev/null
+	check_err $?
+
+	for f in 4 6; do
+		ip -$f netconf show dev "$dev" > /dev/null
+		check_err $?
+	done
+
+	if [ $ret -ne 0 ] ;then
+		echo "FAIL: ip netconf show $dev"
+		test $r -eq 0 && ret=0
+		return 1
+	fi
+}
+
 # add a bridge with vlans on top
 kci_test_bridge()
 {
@@ -63,6 +83,11 @@ kci_test_bridge()
 	check_err $?
 	ip r s t all > /dev/null
 	check_err $?
+
+	for name in "$devbr" "$vlandev" "$devdummy" ; do
+		kci_test_netconf "$name"
+	done
+
 	ip -6 addr del dev "$vlandev" dead:42::1234/64
 	check_err $?
 
@@ -100,6 +125,9 @@ kci_test_gre()
 	check_err $?
 	ip addr > /dev/null
 	check_err $?
+
+	kci_test_netconf "$gredev"
+
 	ip addr del dev "$devdummy" 10.23.7.11/24
 	check_err $?
 
-- 
2.13.6

^ permalink raw reply related

* [PATCH net-next 1/1] net/smc: add SMC rendezvous protocol
From: Ursula Braun @ 2017-10-10 14:14 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, jwi, schwidefsky, heiko.carstens, raspl,
	hwippel, ubraun

From: Ursula Braun <ubraun@linux.vnet.ibm.com>

The SMC protocol [1] uses a rendezvous protocol to negotiate SMC
capability between peers. The current Linux implementation does not use
this rendezvous protocol and thus is not compliant with RFC 7609 and is
incompatible with other SMC implementations such as the one in z/OS.
This patch adds support for the SMC rendezvous protocol.

Details:

The SMC rendezvous protocol relies on the use of a new TCP experimental
option. With this option, SMC capabilities are exchanged between the
peers during the TCP three-way handshake.

The goal of this patch is to leave common TCP code unmodified. Thus,
it uses netfilter hooks to intercept TCP SYN and SYN/ACK packets. For
outgoing packets originating from SMC sockets, the experimental option
is added. For inbound packets destined for SMC sockets, the experimental
option is checked.

Another goal was to minimize the performance impact on non-SMC traffic
(when SMC is enabled). The netfilter hooks used for SMC client
connections are active only during TCP connection establishment.
The netfilter hooks used for SMC servers are active as long as there are
listening SMC sockets.

When the hooks are active, the following additional operations are
performed on incoming and outgoing packets:
  (1) call SMC netfilter hook (all IPv4 packets)
  (2) check if TCP SYN or SYN/ACK packet (all IPv4 packets)
  (3) check if packet goes to/comes from SMC socket (SYN & SYN/ACK
      packets only)
  (4) check/add SMC experimental option (SMC sockets' SYN & SYN/ACK
      packets only)
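
Step (4) can be sketched in plain userspace C as follows. The sketch
assumes the option layout used in smc_rv.c below (two NOPs for
alignment, then experimental option kind 254 with length 6 and the
magic bytes 0xe2 0xd4 0xc3 0xd9). The helper names smc_opt_append() and
smc_opt_present() are invented for illustration, and the code works on
a bare option buffer rather than an skb, so it leaves out the header
length and checksum updates the patch performs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define OPT_EOL 0     /* TCPOPT_EOL */
#define OPT_NOP 1     /* TCPOPT_NOP */
#define OPT_EXP 254   /* TCPOPT_EXP, experimental option kind */
#define SMC_OPT_LEN 6 /* kind + length + 4 magic bytes */

static const unsigned char smc_magic[4] = { 0xe2, 0xd4, 0xc3, 0xd9 };

/*
 * Append NOP, NOP and the 6-byte SMC option to an option buffer.
 * Returns the new option length, or 0 if the 8 bytes do not fit.
 */
static size_t smc_opt_append(unsigned char *opts, size_t len, size_t cap)
{
	if (len + 8 > cap)
		return 0;
	opts[len++] = OPT_NOP;
	opts[len++] = OPT_NOP;
	opts[len++] = OPT_EXP;
	opts[len++] = SMC_OPT_LEN;
	memcpy(opts + len, smc_magic, sizeof(smc_magic));
	return len + sizeof(smc_magic);
}

/* Walk the options and report whether the SMC option is present. */
static bool smc_opt_present(const unsigned char *opts, size_t len)
{
	size_t i = 0;

	while (i < len) {
		size_t olen;

		if (opts[i] == OPT_EOL)
			return false;
		if (opts[i] == OPT_NOP) {
			i++;
			continue;
		}
		if (i + 1 >= len)
			return false; /* length byte is missing */
		olen = opts[i + 1];
		if (olen < 2 || i + olen > len)
			return false; /* malformed option */
		if (opts[i] == OPT_EXP && olen == SMC_OPT_LEN &&
		    !memcmp(opts + i + 2, smc_magic, sizeof(smc_magic)))
			return true;
		i += olen;
	}
	return false;
}
```

The parser checks the length byte against the remaining buffer before
reading the option body, so a truncated option is rejected instead of
being read past the end.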

References:
  [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609

Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
---
 net/smc/Kconfig  |   2 +-
 net/smc/Makefile |   2 +-
 net/smc/af_smc.c |  66 ++++++-
 net/smc/smc.h    |  10 +-
 net/smc/smc_rv.c | 543 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_rv.h |  31 ++++
 6 files changed, 646 insertions(+), 8 deletions(-)
 create mode 100644 net/smc/smc_rv.c
 create mode 100644 net/smc/smc_rv.h

diff --git a/net/smc/Kconfig b/net/smc/Kconfig
index c717ef0896aa..ad49086e8ed7 100644
--- a/net/smc/Kconfig
+++ b/net/smc/Kconfig
@@ -1,6 +1,6 @@
 config SMC
 	tristate "SMC socket protocol family"
-	depends on INET && INFINIBAND
+	depends on INET && INFINIBAND && NETFILTER
 	---help---
 	  SMC-R provides a "sockets over RDMA" solution making use of
 	  RDMA over Converged Ethernet (RoCE) technology to upgrade
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 188104654b54..2155a7eff41d 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -1,4 +1,4 @@
 obj-$(CONFIG_SMC)	+= smc.o
 obj-$(CONFIG_SMC_DIAG)	+= smc_diag.o
 smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
-smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o
+smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_rv.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 745f145d4c4d..290b9ff06e01 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -34,6 +34,7 @@
 #include <net/smc.h>
 
 #include "smc.h"
+#include "smc_rv.h"
 #include "smc_clc.h"
 #include "smc_llc.h"
 #include "smc_cdc.h"
@@ -109,6 +110,7 @@ static int smc_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
 	struct smc_sock *smc;
+	int old_state;
 	int rc = 0;
 
 	if (!sk)
@@ -123,6 +125,7 @@ static int smc_release(struct socket *sock)
 		lock_sock_nested(sk, SINGLE_DEPTH_NESTING);
 	else
 		lock_sock(sk);
+	old_state = sk->sk_state;
 
 	if (smc->use_fallback) {
 		sk->sk_state = SMC_CLOSED;
@@ -132,6 +135,10 @@ static int smc_release(struct socket *sock)
 		sock_set_flag(sk, SOCK_DEAD);
 		sk->sk_shutdown |= SHUTDOWN_MASK;
 	}
+	if (old_state == SMC_LISTEN) {
+		smc_rv_nf_unregister_hook(sock_net(sk), &smc_nfho_serv);
+		kfree(smc->listen_pends);
+	}
 	if (smc->clcsock) {
 		sock_release(smc->clcsock);
 		smc->clcsock = NULL;
@@ -178,6 +185,7 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock)
 	sk->sk_destruct = smc_destruct;
 	sk->sk_protocol = SMCPROTO_SMC;
 	smc = smc_sk(sk);
+	smc->use_fallback = true; /* default: not SMC-capable */
 	INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work);
 	INIT_LIST_HEAD(&smc->accept_q);
 	spin_lock_init(&smc->accept_q_lock);
@@ -390,6 +398,10 @@ static int smc_connect_rdma(struct smc_sock *smc)
 	int rc = 0;
 	u8 ibport;
 
+	if (smc->use_fallback)
+		/* peer has not signalled SMC-capability */
+		goto out_connected;
+
 	/* IPSec connections opt out of SMC-R optimizations */
 	if (using_ipsec(smc)) {
 		reason_code = SMC_CLC_DECL_IPSEC;
@@ -500,7 +512,6 @@ static int smc_connect_rdma(struct smc_sock *smc)
 	smc_tx_init(smc);
 
 out_connected:
-	smc_copy_sock_settings_to_clc(smc);
 	if (smc->sk.sk_state == SMC_INIT)
 		smc->sk.sk_state = SMC_ACTIVE;
 
@@ -555,7 +566,11 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr,
 	}
 
 	smc_copy_sock_settings_to_clc(smc);
+	smc_rv_nf_register_hook(sock_net(sk), &smc_nfho_clnt);
+
 	rc = kernel_connect(smc->clcsock, addr, alen, flags);
+	if (rc != -EINPROGRESS)
+		smc_rv_nf_unregister_hook(sock_net(sk), &smc_nfho_clnt);
 	if (rc)
 		goto out;
 
@@ -574,10 +589,12 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr,
 
 static int smc_clcsock_accept(struct smc_sock *lsmc, struct smc_sock **new_smc)
 {
+	struct smc_listen_pending *pnd;
 	struct sock *sk = &lsmc->sk;
 	struct socket *new_clcsock;
 	struct sock *new_sk;
-	int rc;
+	unsigned long flags;
+	int i, rc;
 
 	release_sock(&lsmc->sk);
 	new_sk = smc_sock_alloc(sock_net(sk), NULL);
@@ -613,6 +630,25 @@ static int smc_clcsock_accept(struct smc_sock *lsmc, struct smc_sock **new_smc)
 	}
 
 	(*new_smc)->clcsock = new_clcsock;
+
+	/* enable SMC-capability if an SMC-capable connecting socket is
+	 * contained in listen_pends; invalidate this entry
+	 */
+	spin_lock_irqsave(&lsmc->listen_pends_lock, flags);
+	for (i = 0; i < 2 * lsmc->sk.sk_max_ack_backlog; i++) {
+		pnd = lsmc->listen_pends + i;
+		if (pnd->used &&
+		    pnd->addr == new_clcsock->sk->sk_daddr &&
+		    pnd->port == new_clcsock->sk->sk_dport &&
+		    jiffies_to_msecs(get_jiffies_64() - pnd->time) <=
+						SMC_LISTEN_PEND_VALID_TIME) {
+			(*new_smc)->use_fallback = false;
+			pnd->used = false;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&lsmc->listen_pends_lock, flags);
+
 out:
 	return rc;
 }
@@ -759,6 +795,10 @@ static void smc_listen_work(struct work_struct *work)
 	u8 prefix_len;
 	u8 ibport;
 
+	if (new_smc->use_fallback)
+		/* peer has not signalled SMC-capability */
+		goto out_connected;
+
 	/* do inband token exchange -
 	 *wait for and receive SMC Proposal CLC message
 	 */
@@ -929,7 +969,6 @@ static void smc_tcp_listen_work(struct work_struct *work)
 			continue;
 
 		new_smc->listen_smc = lsmc;
-		new_smc->use_fallback = false; /* assume rdma capability first*/
 		sock_hold(&lsmc->sk); /* sock_put in smc_listen_work */
 		INIT_WORK(&new_smc->smc_listen_work, smc_listen_work);
 		smc_copy_sock_settings_to_smc(new_smc);
@@ -954,16 +993,32 @@ static int smc_listen(struct socket *sock, int backlog)
 	if ((sk->sk_state != SMC_INIT) && (sk->sk_state != SMC_LISTEN))
 		goto out;
 
+	rc = -ENOMEM;
+	/* Addresses and ports of incoming SYN packets with experimental option
+	 * SMC are saved, but TCP might decide to drop them. Thus more slots
+	 * than the backlog value are allocated for pending connecting sockets
+	 */
+	smc->listen_pends = kzalloc(
+			2 * backlog * sizeof(struct smc_listen_pending),
+			GFP_KERNEL);
+	if (!smc->listen_pends)
+		goto out;
+	spin_lock_init(&smc->listen_pends_lock);
+
 	rc = 0;
 	if (sk->sk_state == SMC_LISTEN) {
 		sk->sk_max_ack_backlog = backlog;
 		goto out;
 	}
+
+	smc->use_fallback = false; /* listen sockets are SMC-capable */
 	/* some socket options are handled in core, so we could not apply
 	 * them to the clc socket -- copy smc socket options to clc socket
 	 */
 	smc_copy_sock_settings_to_clc(smc);
 
+	smc_rv_nf_register_hook(sock_net(sk), &smc_nfho_serv);
+
 	rc = kernel_listen(smc->clcsock, backlog);
 	if (rc)
 		goto out;
@@ -1114,7 +1169,7 @@ static unsigned int smc_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk;
 	unsigned int mask = 0;
 	struct smc_sock *smc;
-	int rc;
+	int rc = 0;
 
 	smc = smc_sk(sock->sk);
 	if ((sk->sk_state == SMC_INIT) || smc->use_fallback) {
@@ -1123,6 +1178,7 @@ static unsigned int smc_poll(struct file *file, struct socket *sock,
 		/* if non-blocking connect finished ... */
 		lock_sock(sk);
 		if ((sk->sk_state == SMC_INIT) && (mask & POLLOUT)) {
+			smc_rv_nf_unregister_hook(sock_net(sk), &smc_nfho_clnt);
 			sk->sk_err = smc->clcsock->sk->sk_err;
 			if (sk->sk_err) {
 				mask |= POLLERR;
@@ -1348,7 +1404,6 @@ static int smc_create(struct net *net, struct socket *sock, int protocol,
 
 	/* create internal TCP socket for CLC handshake and fallback */
 	smc = smc_sk(sk);
-	smc->use_fallback = false; /* assume rdma capability first */
 	rc = sock_create_kern(net, PF_INET, SOCK_STREAM,
 			      IPPROTO_TCP, &smc->clcsock);
 	if (rc)
@@ -1370,6 +1425,7 @@ static int __init smc_init(void)
 {
 	int rc;
 
+	smc_rv_init();
 	rc = smc_pnet_init();
 	if (rc)
 		return rc;
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 0ccd6fa387ad..96d7a20ba7db 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -167,6 +167,13 @@ struct smc_connection {
 	struct work_struct	close_work;	/* peer sent some closing */
 };
 
+struct smc_listen_pending {
+	u64		time;			/* time when entry was created*/
+	bool		used;			/* true if entry is in use */
+	__be32		addr;			/* address of a listen socket */
+	__be16		port;			/* port of a listen socket */
+};
+
 struct smc_sock {				/* smc sock container */
 	struct sock		sk;
 	struct socket		*clcsock;	/* internal tcp socket */
@@ -175,6 +182,8 @@ struct smc_sock {				/* smc sock container */
 	struct smc_sock		*listen_smc;	/* listen parent */
 	struct work_struct	tcp_listen_work;/* handle tcp socket accepts */
 	struct work_struct	smc_listen_work;/* prepare new accept socket */
+	struct smc_listen_pending *listen_pends;/* listen pending SYNs */
+	spinlock_t		listen_pends_lock; /* protects listen_pends */
 	struct list_head	accept_q;	/* sockets to be accepted */
 	spinlock_t		accept_q_lock;	/* protects accept_q */
 	struct delayed_work	sock_put_work;	/* final socket freeing */
@@ -271,5 +280,4 @@ int smc_conn_create(struct smc_sock *smc, __be32 peer_in_addr,
 		    struct smc_clc_msg_local *lcl, int srv_first_contact);
 struct sock *smc_accept_dequeue(struct sock *parent, struct socket *new_sock);
 void smc_close_non_accepted(struct sock *sk);
-
 #endif	/* __SMC_H */
diff --git a/net/smc/smc_rv.c b/net/smc/smc_rv.c
new file mode 100644
index 000000000000..4ce01dec808f
--- /dev/null
+++ b/net/smc/smc_rv.c
@@ -0,0 +1,543 @@
+/*
+ *  Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ *  SMC Rendezvous to determine SMC-capability of the peer
+ *
+ *  Copyright IBM Corp. 2017
+ *
+ *  Author(s):	Hans Wippel <hwippel@linux.vnet.ibm.com>
+ *		Ursula Braun <ubraun@linux.vnet.ibm.com>
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/netdevice.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter_ipv4.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <net/tcp.h>
+
+#include "smc.h"
+#include "smc_rv.h"
+
+#define TCPOLEN_SMC			8
+#define TCPOLEN_SMC_BASE		6
+#define TCPOLEN_SMC_ALIGNED		2
+
+static const char TCPOPT_SMC_MAGIC[4] = {'\xe2', '\xd4', '\xc3', '\xd9'};
+
+/* in TCP header, replace EOL option and remaining header bytes with NOPs */
+static bool smc_rv_replace_eol_option(struct sk_buff *skb)
+{
+	struct tcphdr *tcph = tcp_hdr(skb);
+	int opt_bytes = tcp_optlen(skb);
+	unsigned char *buf;
+	int i = 0;
+
+	buf = (unsigned char *)(tcph + 1);
+	/* Parse TCP options. Based on tcp_parse_options in tcp_input.c */
+	while (i < opt_bytes) {
+		switch (buf[i]) {
+		/* one byte options */
+		case TCPOPT_EOL:
+			/* replace remaining bytes with NOPs */
+			while (i < opt_bytes) {
+				buf[i] = TCPOPT_NOP;
+				i++;
+			}
+			return true;
+		case TCPOPT_NOP:
+			i++;
+			continue;
+		default:
+			/* multi-byte options */
+			if (buf[i + 1] < 2 || i + buf[i + 1] > opt_bytes)
+				return false; /* bad option */
+			i += buf[i + 1];
+			continue;
+		}
+	}
+	return true;
+}
+
+/* check if TCP header contains SMC option */
+static bool smc_rv_has_smc_option(struct sk_buff *skb)
+{
+	struct tcphdr *tcph = tcp_hdr(skb);
+	int opt_bytes = tcp_optlen(skb);
+	unsigned char *buf;
+	int i = 0;
+
+	buf = (unsigned char *)(tcph + 1);
+	/* Parse TCP options. Based on tcp_parse_options in tcp_input.c */
+	while (i < opt_bytes) {
+		switch (buf[i]) {
+		/* one byte options */
+		case TCPOPT_EOL:
+			return false;
+		case TCPOPT_NOP:
+			i++;
+			continue;
+		default:
+			/* multi-byte options */
+			if (i + 1 >= opt_bytes || buf[i + 1] < 2)
+				return false; /* bad option */
+			/* check for SMC rendezvous option */
+			if (buf[i] == TCPOPT_EXP &&
+			    buf[i + 1] == TCPOLEN_SMC_BASE &&
+			    (opt_bytes - i >= TCPOLEN_SMC_BASE) &&
+			    !memcmp(&buf[i + 2], TCPOPT_SMC_MAGIC,
+						sizeof(TCPOPT_SMC_MAGIC)))
+				return true;
+			i += buf[i + 1];
+			continue;
+		}
+	}
+
+	return false;
+}
+
+/* Add SMC option to TCP header */
+static int smc_rv_add_smc_option(struct sk_buff *skb)
+{
+	unsigned char smc_opt[] = {TCPOPT_NOP, TCPOPT_NOP,
+				   TCPOPT_EXP, TCPOLEN_SMC_BASE,
+				   TCPOPT_SMC_MAGIC[0], TCPOPT_SMC_MAGIC[1],
+				   TCPOPT_SMC_MAGIC[2], TCPOPT_SMC_MAGIC[3]};
+	struct tcphdr *tcph = tcp_hdr(skb);
+	struct iphdr *iph = ip_hdr(skb);
+	int tcplen = 0;
+
+	if (skb_availroom(skb) < TCPOLEN_SMC)
+		return -EFAULT;
+
+	if (tcp_optlen(skb) + TCPOLEN_SMC > MAX_TCP_OPTION_SPACE)
+		return -EFAULT;
+
+	/* give up if there is data after the TCP header */
+	if (skb_headlen(skb) > ip_hdrlen(skb) + tcp_hdrlen(skb))
+		return -EFAULT;
+
+	if (smc_rv_has_smc_option(skb))
+		return -EFAULT;
+
+	if (!smc_rv_replace_eol_option(skb))
+		return -EFAULT;
+
+	iph->tot_len = cpu_to_be16(be16_to_cpu(iph->tot_len) + TCPOLEN_SMC);
+	iph->check = 0;
+	iph->check = ip_fast_csum(iph, iph->ihl);
+	skb_put_data(skb, smc_opt, TCPOLEN_SMC);
+	tcph->doff += TCPOLEN_SMC_ALIGNED;
+	tcplen = (skb->len - ip_hdrlen(skb));
+	tcph->check = 0;
+	tcph->check = tcp_v4_check(tcplen, iph->saddr, iph->daddr,
+				   csum_partial(tcph, tcplen, 0));
+	skb->ip_summed = CHECKSUM_NONE;
+	return 0;
+}
+
+/* return an smc socket with certain source and destination */
+static struct smc_sock *smc_rv_lookup_connecting_smc(struct net *net,
+						     __be32 dest_addr,
+						     __be16 dest_port,
+						     __be32 source_addr,
+						     __be16 source_port)
+{
+	struct smc_sock *smc = NULL;
+	struct hlist_head *head;
+	struct socket *clcsock;
+	struct sock *sk;
+
+	read_lock(&smc_proto.h.smc_hash->lock);
+	head = &smc_proto.h.smc_hash->ht;
+
+	if (hlist_empty(head))
+		goto out;
+
+	sk_for_each(sk, head) {
+		if (!net_eq(sock_net(sk), net))
+			continue;
+		if (sk->sk_state != SMC_INIT)
+			continue;
+		clcsock = smc_sk(sk)->clcsock;
+		if (!clcsock)
+			continue;
+		if (source_port != htons(clcsock->sk->sk_num))
+			continue;
+		if (source_addr != clcsock->sk->sk_rcv_saddr)
+			continue;
+		if (dest_port != clcsock->sk->sk_dport)
+			continue;
+		if (dest_addr == clcsock->sk->sk_daddr) {
+			smc = smc_sk(sk);
+			break;
+		}
+	}
+
+out:
+	read_unlock(&smc_proto.h.smc_hash->lock);
+	return smc;
+}
+
+/* for netfilter smc_rv_hook_out_clnt (outgoing SYN):
+ * check if there exists a connecting smc socket with certain source and
+ * destination
+ */
+static bool smc_rv_exists_connecting_smc(struct net *net,
+					 __be32 dest_addr,
+					 __be16 dest_port,
+					 __be32 source_addr,
+					 __be16 source_port)
+{
+	return smc_rv_lookup_connecting_smc(net, dest_addr, dest_port,
+					    source_addr,
+					    source_port) != NULL;
+}
+
+/* for netfilter smc_rv_hook_in_clnt (incoming SYN ACK):
+ * enable SMC-capability for the corresponding socket
+ */
+static void smc_rv_accepting_smc_peer(struct net *net,
+				      __be32 dest_addr,
+				      __be16 dest_port,
+				      __be32 source_addr,
+				      __be16 source_port)
+{
+	struct smc_sock *smc;
+
+	smc = smc_rv_lookup_connecting_smc(net, dest_addr, dest_port,
+					   source_addr, source_port);
+	if (smc)
+		/* connection is SMC-capable */
+		smc->use_fallback = false;
+}
+
+/* return an smc socket listening on a certain port */
+static struct smc_sock *smc_rv_lookup_listen_socket(struct net *net,
+						    __be32 listen_addr,
+						    __be16 listen_port)
+{
+	struct smc_sock *smc = NULL;
+	struct hlist_head *head;
+	struct socket *clcsock;
+	struct sock *sk;
+
+	read_lock(&smc_proto.h.smc_hash->lock);
+	head = &smc_proto.h.smc_hash->ht;
+
+	if (hlist_empty(head))
+		goto out;
+
+	sk_for_each(sk, head) {
+		if (!net_eq(sock_net(sk), net))
+			continue;
+		if (sk->sk_state != SMC_LISTEN)
+			continue;
+		clcsock = smc_sk(sk)->clcsock;
+		if (!clcsock || listen_port != htons(clcsock->sk->sk_num))
+			continue;
+		if (!listen_addr || !clcsock->sk->sk_rcv_saddr ||
+		    listen_addr == clcsock->sk->sk_rcv_saddr) {
+			smc = smc_sk(sk);
+			break;
+		}
+	}
+
+out:
+	read_unlock(&smc_proto.h.smc_hash->lock);
+	return smc;
+}
+
+/* for netfilter smc_rv_hook_in_serv (incoming SYN):
+ * save addr and port of connecting smc peer
+ */
+static void smc_rv_connecting_smc_peer(struct net *net,
+				       __be32 listen_addr,
+				       __be16 listen_port,
+				       __be32 peer_addr,
+				       __be16 peer_port)
+{
+	struct smc_listen_pending *pnd;
+	struct smc_sock *lsmc;
+	unsigned long flags;
+	int i;
+
+	lsmc = smc_rv_lookup_listen_socket(net, listen_addr, listen_port);
+	if (!lsmc)
+		return;
+
+	spin_lock_irqsave(&lsmc->listen_pends_lock, flags);
+	for (i = 0; i < 2 * lsmc->sk.sk_max_ack_backlog; i++) {
+		pnd = lsmc->listen_pends + i;
+		/* either use an unused entry or reuse an outdated entry */
+		if (!pnd->used ||
+		    jiffies_to_msecs(get_jiffies_64() - pnd->time) >
+						SMC_LISTEN_PEND_VALID_TIME) {
+			pnd->used = true;
+			pnd->addr = peer_addr;
+			pnd->port = peer_port;
+			pnd->time = get_jiffies_64();
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&lsmc->listen_pends_lock, flags);
+}
+
+/* for netfilter smc_rv_hook_out_serv (outgoing SYN ACK):
+ * remove the peer's listen_pends entry if adding the SMC option failed
+ */
+static void smc_rv_remove_smc_peer(struct net *net,
+				   __be32 listen_addr,
+				   __be16 listen_port,
+				   __be32 peer_addr,
+				   __be16 peer_port)
+{
+	struct smc_listen_pending *pnd;
+	struct smc_sock *lsmc;
+	unsigned long flags;
+	int i;
+
+	lsmc = smc_rv_lookup_listen_socket(net, listen_addr, listen_port);
+	if (!lsmc)
+		return;
+
+	spin_lock_irqsave(&lsmc->listen_pends_lock, flags);
+	for (i = 0; i < 2 * lsmc->sk.sk_max_ack_backlog; i++) {
+		pnd = lsmc->listen_pends + i;
+		if (pnd->used &&
+		    pnd->addr == peer_addr &&
+		    pnd->port == peer_port &&
+		    jiffies_to_msecs(get_jiffies_64() - pnd->time) <=
+						SMC_LISTEN_PEND_VALID_TIME) {
+			pnd->used = false;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&lsmc->listen_pends_lock, flags);
+}
+
+/* for netfilter smc_rv_hook_out_serv (outgoing SYN ACK):
+ * check if there has been a connecting smc peer
+ */
+static bool smc_rv_exists_connecting_smc_peer(struct net *net,
+					      __be32 listen_addr,
+					      __be16 listen_port,
+					      __be32 peer_addr,
+					      __be16 peer_port)
+{
+	struct smc_listen_pending *pnd;
+	struct smc_sock *lsmc;
+	unsigned long flags;
+	int i;
+
+	lsmc = smc_rv_lookup_listen_socket(net, listen_addr, listen_port);
+	if (!lsmc)
+		return false;
+
+	spin_lock_irqsave(&lsmc->listen_pends_lock, flags);
+	for (i = 0; i < 2 * lsmc->sk.sk_max_ack_backlog; i++) {
+		pnd = lsmc->listen_pends + i;
+		if (pnd->used &&
+		    pnd->addr == peer_addr &&
+		    pnd->port == peer_port &&
+		    jiffies_to_msecs(get_jiffies_64() - pnd->time) <=
+						SMC_LISTEN_PEND_VALID_TIME) {
+			spin_unlock_irqrestore(&lsmc->listen_pends_lock, flags);
+			return true;
+		}
+	}
+	spin_unlock_irqrestore(&lsmc->listen_pends_lock, flags);
+	return false;
+}
+
+/* Netfilter hooks */
+
+/* netfilter hook for incoming packets (client) */
+static unsigned int smc_rv_hook_in_clnt(void *priv, struct sk_buff *skb,
+					const struct nf_hook_state *state)
+{
+	struct tcphdr *tcph = tcp_hdr(skb);
+	struct iphdr *iph;
+
+	if (skb_headlen(skb) < sizeof(*iph) + sizeof(*tcph))
+		return NF_ACCEPT;
+
+	iph = ip_hdr(skb);
+	if (iph->protocol != IPPROTO_TCP)
+		return NF_ACCEPT;
+
+	/* Local SMC client, incoming SYN,ACK from server
+	 * check if there really is a local SMC client
+	 * and tell the client connection if the server is SMC capable
+	 */
+	if (tcph->syn == 1 && tcph->ack == 1) {
+		/* check for experimental option */
+		if (!smc_rv_has_smc_option(skb))
+			return NF_ACCEPT;
+		/* add info about server SMC capability */
+		smc_rv_accepting_smc_peer(state->net, iph->saddr, tcph->source,
+					  iph->daddr, tcph->dest);
+	}
+	return NF_ACCEPT;
+}
+
+/* netfilter hook for incoming packets (server) */
+static unsigned int smc_rv_hook_in_serv(void *priv, struct sk_buff *skb,
+					const struct nf_hook_state *state)
+{
+	struct tcphdr *tcph = tcp_hdr(skb);
+	struct iphdr *iph;
+
+	if (skb_headlen(skb) < sizeof(*iph) + sizeof(*tcph))
+		return NF_ACCEPT;
+
+	iph = ip_hdr(skb);
+	if (iph->protocol != IPPROTO_TCP)
+		return NF_ACCEPT;
+
+	/* Local SMC Server, incoming SYN request from client
+	 * check if there is a local SMC server
+	 * and tell the server if there is a new SMC capable client
+	 */
+	if (tcph->syn == 1 && tcph->ack == 0) {
+		/* check for experimental option */
+		if (!smc_rv_has_smc_option(skb))
+			return NF_ACCEPT;
+		/* add info about new client SMC capability */
+		smc_rv_connecting_smc_peer(state->net, iph->daddr, tcph->dest,
+					   iph->saddr, tcph->source);
+	}
+	return NF_ACCEPT;
+}
+
+/* netfilter hook for outgoing packets (client) */
+static unsigned int smc_rv_hook_out_clnt(void *priv, struct sk_buff *skb,
+					 const struct nf_hook_state *state)
+{
+	struct tcphdr *tcph = tcp_hdr(skb);
+	struct iphdr *iph;
+
+	if (skb_headlen(skb) < sizeof(*iph) + sizeof(*tcph))
+		return NF_ACCEPT;
+
+	iph = ip_hdr(skb);
+	if (iph->protocol != IPPROTO_TCP)
+		return NF_ACCEPT;
+
+	/* Local SMC client, outgoing SYN request to server
+	 * add TCP experimental option if there really is a local SMC client
+	 */
+	if (tcph->syn == 1 && tcph->ack == 0) {
+		/* check for local SMC client */
+		if (!smc_rv_exists_connecting_smc(state->net,
+						  iph->daddr, tcph->dest,
+						  iph->saddr, tcph->source))
+			return NF_ACCEPT;
+		/* add experimental option */
+		smc_rv_add_smc_option(skb);
+	}
+	return NF_ACCEPT;
+}
+
+/* netfilter hook for outgoing packets (server) */
+static unsigned int smc_rv_hook_out_serv(void *priv, struct sk_buff *skb,
+					 const struct nf_hook_state *state)
+{
+	struct tcphdr *tcph = tcp_hdr(skb);
+	struct iphdr *iph;
+
+	if (skb_headlen(skb) < sizeof(*iph) + sizeof(*tcph))
+		return NF_ACCEPT;
+
+	iph = ip_hdr(skb);
+	if (iph->protocol != IPPROTO_TCP)
+		return NF_ACCEPT;
+
+	/* Local SMC server, outgoing SYN,ACK to client
+	 * add TCP experimental option if there really is a local SMC server
+	 */
+	if (tcph->syn == 1 && tcph->ack == 1) {
+		/* check if client's SYN contained the experimental option */
+		if (!smc_rv_exists_connecting_smc_peer(state->net,
+						       iph->saddr, tcph->source,
+						       iph->daddr, tcph->dest))
+			return NF_ACCEPT;
+		/* add experimental option */
+		if (smc_rv_add_smc_option(skb) < 0)
+			smc_rv_remove_smc_peer(state->net,
+					       iph->saddr, tcph->source,
+					       iph->daddr, tcph->dest);
+	}
+	return NF_ACCEPT;
+}
+
+static struct nf_hook_ops smc_nfho_ops_clnt[] = {
+	{
+		.hook = smc_rv_hook_in_clnt,
+		.hooknum = NF_INET_PRE_ROUTING,
+		.pf = PF_INET,
+		.priority = NF_IP_PRI_FIRST,
+	},
+	{
+		.hook = smc_rv_hook_out_clnt,
+		.hooknum = NF_INET_POST_ROUTING,
+		.pf = PF_INET,
+		.priority = NF_IP_PRI_FIRST,
+	},
+};
+
+static struct nf_hook_ops smc_nfho_ops_serv[] = {
+	{
+		.hook = smc_rv_hook_in_serv,
+		.hooknum = NF_INET_PRE_ROUTING,
+		.pf = PF_INET,
+		.priority = NF_IP_PRI_FIRST,
+	},
+	{
+		.hook = smc_rv_hook_out_serv,
+		.hooknum = NF_INET_POST_ROUTING,
+		.pf = PF_INET,
+		.priority = NF_IP_PRI_FIRST,
+	},
+};
+
+struct smc_nf_hook smc_nfho_clnt = {
+	.refcount = 0,
+	.hook = &smc_nfho_ops_clnt[0],
+};
+
+struct smc_nf_hook smc_nfho_serv = {
+	.refcount = 0,
+	.hook = &smc_nfho_ops_serv[0],
+};
+
+int smc_rv_nf_register_hook(struct net *net, struct smc_nf_hook *nfho)
+{
+	int rc = 0;
+
+	mutex_lock(&nfho->nf_hook_mutex);
+	if (!(nfho->refcount++)) {
+		rc = nf_register_net_hooks(net, nfho->hook, 2);
+		if (rc)
+			nfho->refcount--;
+	}
+	mutex_unlock(&nfho->nf_hook_mutex);
+	return rc;
+}
+
+void smc_rv_nf_unregister_hook(struct net *net, struct smc_nf_hook *nfho)
+{
+	mutex_lock(&nfho->nf_hook_mutex);
+	if (!(--nfho->refcount))
+		nf_unregister_net_hooks(net, nfho->hook, 2);
+	mutex_unlock(&nfho->nf_hook_mutex);
+}
+
+void __init smc_rv_init(void)
+{
+	mutex_init(&smc_nfho_clnt.nf_hook_mutex);
+	mutex_init(&smc_nfho_serv.nf_hook_mutex);
+}
diff --git a/net/smc/smc_rv.h b/net/smc/smc_rv.h
new file mode 100644
index 000000000000..c3bdf4c0a5cb
--- /dev/null
+++ b/net/smc/smc_rv.h
@@ -0,0 +1,31 @@
+/*
+ * Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ *  Definitions for SMC Rendezvous - SMC capability checking
+ *
+ *  Copyright IBM Corp. 2017
+ *
+ *  Author(s):	Hans Wippel <hwippel@linux.vnet.ibm.com>
+ *		Ursula Braun <ubraun@linux.vnet.ibm.com>
+ */
+
+#ifndef _SMC_RV_H
+#define _SMC_RV_H
+
+#include <linux/netfilter.h>
+
+#define SMC_LISTEN_PEND_VALID_TIME	(600 * MSEC_PER_SEC)	/* in ms */
+
+struct smc_nf_hook {
+	struct mutex		nf_hook_mutex;	/* serialize nf register ops */
+	int			refcount;
+	struct nf_hook_ops	*hook;
+};
+
+extern struct smc_nf_hook smc_nfho_clnt;
+extern struct smc_nf_hook smc_nfho_serv;
+
+int smc_rv_nf_register_hook(struct net *net, struct smc_nf_hook *nfho);
+void smc_rv_nf_unregister_hook(struct net *net, struct smc_nf_hook *nfho);
+void smc_rv_init(void) __init;
+#endif
-- 
2.13.5
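
The option scan done by smc_rv_has_smc_option() in the patch above can be
tried outside the kernel. The following is only a minimal user-space sketch,
not the patch's code: has_smc_option(), the OPT_* constants and SMC_OPT_LEN
are names made up for this illustration, and the option area is passed as a
plain byte buffer instead of an skb. The sketch also rejects a multi-byte
option whose length byte would lie outside the option area.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* TCP option kinds; the names are local to this sketch */
enum { OPT_EOL = 0, OPT_NOP = 1, OPT_EXP = 254 };

#define SMC_OPT_LEN 6 /* kind + length + 4 magic bytes */

/* same four bytes as TCPOPT_SMC_MAGIC in the patch */
static const unsigned char smc_magic[4] = { 0xe2, 0xd4, 0xc3, 0xd9 };

/* Return true if the option area buf[0..len) contains the SMC option. */
static bool has_smc_option(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		size_t optlen;

		if (buf[i] == OPT_EOL)
			return false;	/* end of option list */
		if (buf[i] == OPT_NOP) {
			i++;		/* one-byte padding option */
			continue;
		}
		/* a multi-byte option needs its length byte inside the area */
		if (i + 1 >= len)
			return false;
		optlen = buf[i + 1];
		/* reject options that are too short or run past the end */
		if (optlen < 2 || optlen > len - i)
			return false;
		if (buf[i] == OPT_EXP && optlen == SMC_OPT_LEN &&
		    !memcmp(&buf[i + 2], smc_magic, sizeof(smc_magic)))
			return true;
		i += optlen;
	}
	return false;
}
```

Feeding it an MSS option followed by two NOPs and the experimental option
reports a match; the same area without the experimental option, or a
truncated one, does not.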

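The four netfilter hooks in smc_rv.c guard against short packets by comparing
skb_headlen() with the IP and TCP header sizes. Both sides of that comparison
are unsigned, so the way the test is written matters. A small sketch, with
hypothetical helper names and assumed fixed 20-byte minimum headers, of a
subtracting form next to an adding form:

```c
#include <stdbool.h>

/* assumed minimum header sizes, standing in for sizeof(struct iphdr)
 * and sizeof(struct tcphdr)
 */
#define IP_HDR_MIN	20u
#define TCP_HDR_MIN	20u

/* Subtracting form: wraps around when headlen < IP_HDR_MIN. */
static bool too_short_subtract(unsigned int headlen)
{
	return headlen - IP_HDR_MIN < TCP_HDR_MIN;
}

/* Adding form: no wrap-around for any realistic header length. */
static bool too_short_add(unsigned int headlen)
{
	return headlen < IP_HDR_MIN + TCP_HDR_MIN;
}
```

For a 10-byte linear area the subtracting form wraps to a very large value
and reports the packet as long enough, while the adding form reports it as
too short. For lengths of at least IP_HDR_MIN both forms agree.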