Netdev List
 help / color / mirror / Atom feed
* [PATCH] bnx2x: Prevent load reordering in tx completion processing
From: Brian King @ 2019-07-15 21:41 UTC (permalink / raw)
  To: GR-everest-linux-l2; +Cc: skalluru, aelior, netdev, Brian King

This patch fixes an issue seen on Power systems with bnx2x which results
in the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb
pointer getting loaded in bnx2x_free_tx_pkt prior to the hw_cons
load in bnx2x_tx_int. Adding a read memory barrier resolves the issue.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 656ed80..e2be5a6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -285,6 +285,9 @@ int bnx2x_tx_int(struct bnx2x *bp, struct bnx2x_fp_txdata *txdata)
 	hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
 	sw_cons = txdata->tx_pkt_cons;
 
+	/* Ensure subsequent loads occur after hw_cons */
+	smp_rmb();
+
 	while (sw_cons != hw_cons) {
 		u16 pkt_cons;
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-15 21:09 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Tycho Andersen, Serge E. Hallyn, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <20190708181237.5poheliito7zpvmc@madcap2.tricolour.ca>

On Mon, Jul 8, 2019 at 2:12 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 19:26, Paul Moore wrote:

...

> > I like the creativity, but I worry that at some point these
> > limitations are going to be raised (limits have a funny way of doing
> > that over time) and we will be in trouble.  I say "trouble" because I
> > want to be able to quickly do an audit container ID comparison and
> > we're going to pay a penalty for these larger values (we'll need this
> > when we add multiple auditd support and the requisite record routing).
> >
> > Thinking about this makes me also realize we probably need to think a
> > bit longer about audit container ID conflicts between orchestrators.
> > Right now we just take the value that is given to us by the
> > orchestrator, but if we want to allow multiple container orchestrators
> > to work without some form of cooperation in userspace (I think we have
> > to assume the orchestrators will not talk to each other) we likely
> > need to have some way to block reuse of an audit container ID.  We
> > would either need to prevent the orchestrator from explicitly setting
> > an audit container ID to a currently in use value, or instead generate
> > the audit container ID in the kernel upon an event triggered by the
> > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > start looking at the idr code, I think we will need to make use of it.
>
> To address this, I'd suggest that it is enforced to only allow the
> setting of descendants and to maintain a master list of audit container
> identifiers (with a hash table if necessary later) that includes the
> container owner.

We're discussing the audit container ID management policy elsewhere in
this thread so I won't comment on that here, but I did want to say
that we will likely need something better than a simple list of audit
container IDs from the start.  It's common for systems to have
thousands of containers now (or multiple thousands), which tells me
that a list is a poor choice.  You mentioned a hash table, so I would
suggest starting with that over the list for the initial patchset.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH v1] net: fec: optionally reset PHY via a reset-controller
From: Sven Van Asbroeck @ 2019-07-15 21:05 UTC (permalink / raw)
  To: Fugang Duan; +Cc: David S . Miller, netdev, linux-kernel

The current fec driver allows the PHY to be reset via a gpio,
specified in the devicetree. However, some PHYs need to be reset
in a more complex way.

To accommodate such PHYs, allow an optional reset controller
in the fec devicetree. If no reset controller is found, the
fec will fall back to the legacy reset behaviour.

Example:
&fec {
	phy-mode = "rgmii";
	resets = <&phy_reset>;
	reset-names = "phy";
	status = "okay";
};

Signed-off-by: Sven Van Asbroeck <TheSven73@gmail.com>
---

Will send a Documentation patch if this receives a positive review.

 drivers/net/ethernet/freescale/fec_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 38f10f7dcbc3..5a5f3ed6f16d 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -61,6 +61,7 @@
 #include <linux/regulator/consumer.h>
 #include <linux/if_vlan.h>
 #include <linux/pinctrl/consumer.h>
+#include <linux/reset.h>
 #include <linux/prefetch.h>
 #include <soc/imx/cpuidle.h>
 
@@ -3335,6 +3336,7 @@ static int fec_enet_get_irq_cnt(struct platform_device *pdev)
 static int
 fec_probe(struct platform_device *pdev)
 {
+	struct reset_control *phy_reset;
 	struct fec_enet_private *fep;
 	struct fec_platform_data *pdata;
 	struct net_device *ndev;
@@ -3490,7 +3492,9 @@ fec_probe(struct platform_device *pdev)
 	pm_runtime_set_active(&pdev->dev);
 	pm_runtime_enable(&pdev->dev);
 
-	ret = fec_reset_phy(pdev);
+	phy_reset = devm_reset_control_get_exclusive(&pdev->dev, "phy");
+	ret = IS_ERR(phy_reset) ? fec_reset_phy(pdev) :
+			reset_control_reset(phy_reset);
 	if (ret)
 		goto failed_reset;
 
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-15 21:04 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Serge E. Hallyn, Tycho Andersen, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <20190708180558.5bar6ripag3sdadl@madcap2.tricolour.ca>

On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 15:29, Paul Moore wrote:

...

> > [REMINDER: It is an "*audit* container ID" and not a general
> > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> >
> > I'm not interested in supporting/merging something that isn't useful;
> > if this doesn't work for your use case then we need to figure out what
> > would work.  It sounds like nested containers are much more common in
> > the lxc world, can you elaborate a bit more on this?
> >
> > As far as the possible solutions you mention above, I'm not sure I
> > like the per-userns audit container IDs, I'd much rather just emit the
> > necessary tracking information via the audit record stream and let the
> > log analysis tools figure it out.  However, the bigger question is how
> > to limit (re)setting the audit container ID when you are in a non-init
> > userns.  For reasons already mentioned, using capable() is a non
> > starter for everything but the initial userns, and using ns_capable()
> > is equally poor as it essentially allows any userns the ability to
> > munge it's audit container ID (obviously not good).  It appears we
> > need a different method for controlling access to the audit container
> > ID.
>
> We're not quite ready yet for multiple audit daemons and possibly not
> yet for audit namespaces, but this is starting to look a lot like the
> latter.

A few quick comments on audit namespaces: the audit container ID is
not envisioned as a new namespace (even in nested form) and neither do
I consider running multiple audit daemons to be a new namespace.

From my perspective we create namespaces to allow us to redefine a
global resource for some subset of the system, e.g. providing a unique
/tmp for some number of processes on the system.  While it may be
tempting to think of the audit container ID as something we could
"namespace", especially when multiple audit daemons are concerned, in
some ways this would be counter productive; the audit container ID is
intended to be a global ID that can be used to associate audit event
records with a "container" where the "container" is defined by an
orchestrator outside the audit subsystem.  The global nature of the
audit container ID allows us to maintain a sane(ish) view of the
system in the audit log, if we were to "namespace" the audit container
ID such that the value was no longer guaranteed to be unique
throughout the system, we would need to additionally track the audit
namespace along with the audit container ID which starts to border on
insanity IMHO.

> If we can't trust ns_capable() then why are we passing on
> CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
> by the orchestrator/engine.  If ns_capable() isn't inherited how is it
> gained otherwise?  Can it be inserted by cotainer image?  I think the
> answer is "no".  Either we trust ns_capable() or we have audit
> namespaces (recommend based on user namespace) (or both).

My thinking is that since ns_capable() checks the credentials with
respect to the current user namespace we can't rely on it to control
access since it would be possible for a privileged process running
inside an unprivileged container to manipulate the audit container ID
(containerized process has CAP_AUDIT_CONTROL, e.g. running as root in
the container, while the container itself does not).

> At this point I would say we are at an impasse unless we trust
> ns_capable() or we implement audit namespaces.

I'm not sure how we can trust ns_capable(), but if you can think of a
way I would love to hear it.  I'm also not sure how namespacing audit
is helpful (see my above comments), but if you think it is please
explain.

> I don't think another mechanism to trust nested orchestrators/engines
> will buy us anything.
>
> Am I missing something?

Based on your questions/comments above it looks like your
understanding of ns_capable() does not match mine; if I'm thinking
about ns_capable() incorrectly, please educate me.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [bpf PATCH v2 2/6] bpf: tls fix transition through disconnect with close
From: John Fastabend @ 2019-07-15 20:58 UTC (permalink / raw)
  To: Jakub Kicinski, John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <20190711201633.552292e6@cakuba.netronome.com>

Jakub Kicinski wrote:
> On Thu, 11 Jul 2019 14:25:54 -0700, John Fastabend wrote:
> > Jakub Kicinski wrote:
> > > On Thu, 11 Jul 2019 09:47:16 -0700, John Fastabend wrote:  
> > > > Jakub Kicinski wrote:  
> > > > > On Wed, 10 Jul 2019 12:34:17 -0700, Jakub Kicinski wrote:    
> > > > > > > > > +		if (sk->sk_prot->unhash)
> > > > > > > > > +			sk->sk_prot->unhash(sk);
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	ctx = tls_get_ctx(sk);
> > > > > > > > > +	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
> > > > > > > > > +		tls_sk_proto_cleanup(sk, ctx, timeo);    
> > > > > 
> > > > > Do we still need to hook into unhash? With patch 6 in place perhaps we
> > > > > can just do disconnect 🥺    
> > > > 
> > > > ?? "can just do a disconnect", not sure I folow. We still need unhash
> > > > in cases where we have a TLS socket transition from ESTABLISHED
> > > > to LISTEN state without calling close(). This is independent of if
> > > > sockmap is running or not.
> > > > 
> > > > Originally, I thought this would be extremely rare but I did see it
> > > > in real applications on the sockmap side so presumably it is possible
> > > > here as well.  
> > > 
> > > Ugh, sorry, I meant shutdown. Instead of replacing the unhash callback
> > > replace the shutdown callback. We probably shouldn't release the socket
> > > lock either there, but we can sleep, so I'll be able to run the device
> > > connection remove callback (which sleep).
> > 
> > ah OK seems doable to me. Do you want to write that on top of this
> > series? Or would you like to push it onto your branch and I can pull
> > it in push the rest of the patches on top and send it out? I think
> > if you can get to it in the next few days then it makes sense to wait.
> 
> Mm.. perhaps its easiest if we forget about HW for now and get SW 
> to work? Once you get the SW to 100% I can probably figure out what 
> to do for HW, but I feel like we got too many moving parts ATM.

Hi Jack,

I went ahead and pushed a v3 with your patches at the front. This resolves
a set of issues for me so I think it makes sense to push now and look
to resolve any further issues later. I'll look into the close with pending
data potential issue to see if it is/is-not a real issue.

Thanks,
John

^ permalink raw reply

* Re: [PATCH iproute2-rc 1/8] rdma: Update uapi headers to add statistic counter support
From: Stephen Hemminger @ 2019-07-15 20:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-2-leon@kernel.org>

On Wed, 10 Jul 2019 10:24:48 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> From: Mark Zhang <markz@mellanox.com>
> 
> Update rdma_netlink.h to kernel commit 6e7be47a5345 ("RDMA/nldev:
> Allow get default counter statistics through RDMA netlink").
> 
> Signed-off-by: Mark Zhang <markz@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

I am waiting on this until it gets to Linus's tree.

^ permalink raw reply

* Re: [PATCH iproute2 master 0/3] devlink dumpit fixes
From: Stephen Hemminger @ 2019-07-15 20:51 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: netdev, moshe, ayal
In-Reply-To: <1562756601-19171-1-git-send-email-tariqt@mellanox.com>

On Wed, 10 Jul 2019 14:03:18 +0300
Tariq Toukan <tariqt@mellanox.com> wrote:

> Hi,
> 
> This series from Aya contains several fixes for devlink health
> dump show command with binary data.
> 
> In patch 1 we replace the usage of doit with a dumpit, which
> is non-blocking and allows transferring larger amount of data.
> 
> Patches 2 and 3 fix the output for binary data prints, for both
> json and non-json.
> 
> Series generated against master commit:
> 2eb23f3e7aaf devlink: Show devlink port number
> 
> Regards,
> Tariq
> 
> Aya Levin (3):
>   devlink: Change devlink health dump show command to dumpit
>   devlink: Fix binary values print
>   devlink: Remove enclosing array brackets binary print with json format
> 
>  devlink/devlink.c | 41 +++++++++++++++++++++--------------------
>  1 file changed, 21 insertions(+), 20 deletions(-)
> 

Applied

^ permalink raw reply

* [bpf PATCH v3 8/8] bpf: sockmap/tls, close can race with map free
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

When a map free is called and in parallel a socket is closed we
have two paths that can potentially reset the socket prot ops, the
bpf close() path and the map free path. This creates a problem
with which prot ops should be used from the socket closed side.

If the map_free side completes first then we want to call the
original lowest level ops. However, if the tls path runs first
we want to call the sockmap ops. Additionally there was no locking
around prot updates in TLS code paths so the prot ops could
be changed multiple times once from TLS path and again from sockmap
side potentially leaving ops pointed at either TLS or sockmap
when psock and/or tls context have already been destroyed.

To fix this race first only update ops inside callback lock
so that TLS, sockmap and lowest level all agree on prot state.
Second and a ULP callback update() so that lower layers can
inform the upper layer when they are being removed allowing the
upper layer to reset prot ops.

This gets us close to allowing sockmap and tls to be stacked
in arbitrary order but will save that patch for *next trees.

Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |    8 +++++++-
 include/net/tcp.h     |    3 +++
 net/core/skmsg.c      |    4 ++--
 net/ipv4/tcp_ulp.c    |   13 +++++++++++++
 net/tls/tls_main.c    |   48 ++++++++++++++++++++++++++++++++++++------------
 5 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 50ced8aba9db..e4b3fb4bb77c 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -354,7 +354,13 @@ static inline void sk_psock_restore_proto(struct sock *sk,
 	sk->sk_write_space = psock->saved_write_space;
 
 	if (psock->sk_proto) {
-		sk->sk_prot = psock->sk_proto;
+		struct inet_connection_sock *icsk = inet_csk(sk);
+		bool has_ulp = !!icsk->icsk_ulp_data;
+
+		if (has_ulp)
+			tcp_update_ulp(sk, psock->sk_proto);
+		else
+			sk->sk_prot = psock->sk_proto;
 		psock->sk_proto = NULL;
 	}
 }
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cca3c59b98bf..f4702c8b9b8c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2102,6 +2102,8 @@ struct tcp_ulp_ops {
 
 	/* initialize ulp */
 	int (*init)(struct sock *sk);
+	/* update ulp */
+	void (*update)(struct sock *sk, struct proto *p);
 	/* cleanup ulp */
 	void (*release)(struct sock *sk);
 
@@ -2113,6 +2115,7 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
+void tcp_update_ulp(struct sock *sk, struct proto *p);
 
 #define MODULE_ALIAS_TCP_ULP(name)				\
 	__MODULE_INFO(alias, alias_userspace, name);		\
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 93bffaad2135..6832eeb4b785 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -585,12 +585,12 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
-	rcu_assign_sk_user_data(sk, NULL);
 	sk_psock_cork_free(psock);
 	sk_psock_zap_ingress(psock);
-	sk_psock_restore_proto(sk, psock);
 
 	write_lock_bh(&sk->sk_callback_lock);
+	sk_psock_restore_proto(sk, psock);
+	rcu_assign_sk_user_data(sk, NULL);
 	if (psock->progs.skb_parser)
 		sk_psock_stop_strp(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 3d8a1d835471..4849edb62d52 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -96,6 +96,19 @@ void tcp_get_available_ulp(char *buf, size_t maxlen)
 	rcu_read_unlock();
 }
 
+void tcp_update_ulp(struct sock *sk, struct proto *proto)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (!icsk->icsk_ulp_ops) {
+		sk->sk_prot = proto;
+		return;
+	}
+
+	if (icsk->icsk_ulp_ops->update)
+		icsk->icsk_ulp_ops->update(sk, proto);
+}
+
 void tcp_cleanup_ulp(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index f4cb0522fa95..e67e687f79a2 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -323,15 +323,16 @@ static void tls_sk_proto_unhash(struct sock *sk)
 	long timeo = sock_sndtimeo(sk, 0);
 	struct tls_context *ctx;
 
-	if (unlikely(!icsk->icsk_ulp_data)) {
-		if (sk->sk_prot->unhash)
-			sk->sk_prot->unhash(sk);
-	}
-
 	ctx = tls_get_ctx(sk);
 	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
 		tls_sk_proto_cleanup(sk, ctx, timeo);
+
+	write_lock_bh(&sk->sk_callback_lock);
 	icsk->icsk_ulp_data = NULL;
+	if (sk->sk_prot->unhash == tls_sk_proto_unhash)
+		sk->sk_prot = ctx->sk_proto;
+	write_unlock_bh(&sk->sk_callback_lock);
+
 	tls_ctx_free_wq(ctx);
 
 	if (ctx->unhash)
@@ -340,15 +341,17 @@ static void tls_sk_proto_unhash(struct sock *sk)
 
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
-	void (*sk_proto_close)(struct sock *sk, long timeout);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tls_context *ctx = tls_get_ctx(sk);
 	long timeo = sock_sndtimeo(sk, 0);
 
+	if (unlikely(!ctx))
+		return;
+
 	if (ctx->tx_conf == TLS_SW)
 		tls_sw_cancel_work_tx(ctx);
 
 	lock_sock(sk);
-	sk_proto_close = ctx->sk_proto_close;
 
 	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
 		goto skip_tx_cleanup;
@@ -356,17 +359,20 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
 		goto skip_tx_cleanup;
 
-	sk->sk_prot = ctx->sk_proto;
 	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
+	write_lock_bh(&sk->sk_callback_lock);
+	icsk->icsk_ulp_data = NULL;
+	if (sk->sk_prot->close == tls_sk_proto_close)
+		sk->sk_prot = ctx->sk_proto;
+	write_unlock_bh(&sk->sk_callback_lock);
 	release_sock(sk);
 	if (ctx->rx_conf == TLS_SW || ctx->rx_conf == TLS_HW)
 		tls_sw_strparser_done(ctx);
 	if (ctx->rx_conf == TLS_SW)
 		tls_sw_free_ctx_rx(ctx);
-	sk_proto_close(sk, timeout);
-
+	ctx->sk_proto_close(sk, timeout);
 	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
 	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
 		tls_ctx_free(ctx);
@@ -833,7 +839,7 @@ static int tls_init(struct sock *sk)
 	int rc = 0;
 
 	if (tls_hw_prot(sk))
-		goto out;
+		return 0;
 
 	/* The TLS ulp is currently supported only for TCP sockets
 	 * in ESTABLISHED state.
@@ -844,22 +850,39 @@ static int tls_init(struct sock *sk)
 	if (sk->sk_state != TCP_ESTABLISHED)
 		return -ENOTSUPP;
 
+	tls_build_proto(sk);
+
 	/* allocate tls context */
+	write_lock_bh(&sk->sk_callback_lock);
 	ctx = create_ctx(sk);
 	if (!ctx) {
 		rc = -ENOMEM;
 		goto out;
 	}
 
-	tls_build_proto(sk);
 	ctx->tx_conf = TLS_BASE;
 	ctx->rx_conf = TLS_BASE;
 	ctx->sk_proto = sk->sk_prot;
 	update_sk_prot(sk, ctx);
 out:
+	write_unlock_bh(&sk->sk_callback_lock);
 	return rc;
 }
 
+static void tls_update(struct sock *sk, struct proto *p)
+{
+	struct tls_context *ctx;
+
+	ctx = tls_get_ctx(sk);
+	if (likely(ctx)) {
+		ctx->sk_proto_close = p->close;
+		ctx->unhash = p->unhash;
+		ctx->sk_proto = p;
+	} else {
+		sk->sk_prot = p;
+	}
+}
+
 void tls_register_device(struct tls_device *device)
 {
 	spin_lock_bh(&device_spinlock);
@@ -880,6 +903,7 @@ static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
 	.name			= "tls",
 	.owner			= THIS_MODULE,
 	.init			= tls_init,
+	.update			= tls_update,
 };
 
 static int __init tls_register(void)


^ permalink raw reply related

* Re: [PATCH iproute2 v2] utils: don't match empty strings as prefixes
From: Stephen Hemminger @ 2019-07-15 20:49 UTC (permalink / raw)
  To: Matteo Croce; +Cc: netdev, David Ahern
In-Reply-To: <20190715180430.19902-1-mcroce@redhat.com>

On Mon, 15 Jul 2019 20:04:30 +0200
Matteo Croce <mcroce@redhat.com> wrote:

> iproute has an utility function which checks if a string is a prefix for
> another one, to allow use of abbreviated commands, e.g. 'addr' or 'a'
> instead of 'address'.
> 
> This routine unfortunately considers an empty string as prefix
> of any pattern, leading to undefined behaviour when an empty
> argument is passed to ip:
> 
>     # ip ''
>     1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>         link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>         inet 127.0.0.1/8 scope host lo
>            valid_lft forever preferred_lft forever
>         inet6 ::1/128 scope host
>            valid_lft forever preferred_lft forever
> 
>     # tc ''
>     qdisc noqueue 0: dev lo root refcnt 2
> 
>     # ip address add 192.0.2.0/24 '' 198.51.100.1 dev dummy0
>     # ip addr show dev dummy0
>     6: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
>         link/ether 02:9d:5e:e9:3f:c0 brd ff:ff:ff:ff:ff:ff
>         inet 192.0.2.0/24 brd 198.51.100.1 scope global dummy0
>            valid_lft forever preferred_lft forever
> 
> Rewrite matches() so it takes care of an empty input, and doesn't
> scan the input strings three times: the actual implementation
> does 2 strlen and a memcpy to accomplish the same task.
> 
> Signed-off-by: Matteo Croce <mcroce@redhat.com>

Thanks for following up. Applied

^ permalink raw reply

* [bpf PATCH v3 7/8] bpf: sockmap, only create entry if ulp is not already enabled
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

Sockmap does not currently support adding sockets after TLS has been
enabled. There never was a real use case for this so it was never
added. But, we lost the test for ULP at some point so add it here
and fail the socket insert if TLS is enabled. Future work could
make sockmap support this use case but fixup the bug here.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 56bcabe7c2f2..1330a7442e5b 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -334,6 +334,7 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 				  struct sock *sk, u64 flags)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_psock_link *link;
 	struct sk_psock *psock;
 	struct sock *osk;
@@ -344,6 +345,8 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 		return -EINVAL;
 	if (unlikely(idx >= map->max_entries))
 		return -E2BIG;
+	if (unlikely(icsk->icsk_ulp_data))
+		return -EINVAL;
 
 	link = sk_psock_init_link();
 	if (!link)


^ permalink raw reply related

* [bpf PATCH v3 6/8] bpf: sockmap, synchronize_rcu before free'ing map
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

We need to have a synchronize_rcu before free'ing the sockmap because
any outstanding psock references will have a pointer to the map and
when they use this could trigger a use after free.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 28702f2e9a4a..56bcabe7c2f2 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -247,6 +247,8 @@ static void sock_map_free(struct bpf_map *map)
 	raw_spin_unlock_bh(&stab->lock);
 	rcu_read_unlock();
 
+	synchronize_rcu();
+
 	bpf_map_area_free(stab->sks);
 	kfree(stab);
 }


^ permalink raw reply related

* [bpf PATCH v3 5/8] bpf: sockmap, sock_map_delete needs to use xchg
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

__sock_map_delete() may be called from a tcp event such as unhash or
close from the following trace,

  tcp_bpf_close()
    tcp_bpf_remove()
      sk_psock_unlink()
        sock_map_delete_from_link()
          __sock_map_delete()

In this case the sock lock is held but this only protects against
duplicate removals on the TCP side. If the map is free'd then we have
this trace,

  sock_map_free
    xchg()                  <- replaces map entry
    sock_map_unref()
      sk_psock_put()
        sock_map_del_link()

The __sock_map_delete() call however uses a read, test, null over the
map entry which can result in both paths trying to free the map
entry.

To fix use xchg in TCP paths as well so we avoid having two references
to the same map entry.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 52d4faeee18b..28702f2e9a4a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -276,16 +276,20 @@ static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
 			     struct sock **psk)
 {
 	struct sock *sk;
+	int err = 0;
 
 	raw_spin_lock_bh(&stab->lock);
 	sk = *psk;
 	if (!sk_test || sk_test == sk)
-		*psk = NULL;
+		sk = xchg(psk, NULL);
+
+	if (likely(sk))
+		sock_map_unref(sk, psk);
+	else
+		err = -EINVAL;
+
 	raw_spin_unlock_bh(&stab->lock);
-	if (unlikely(!sk))
-		return -EINVAL;
-	sock_map_unref(sk, psk);
-	return 0;
+	return err;
 }
 
 static void sock_map_delete_from_link(struct bpf_map *map, struct sock *sk,


^ permalink raw reply related

* [bpf PATCH v3 4/8] bpf: tls fix transition through disconnect with close
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_dosconnect() without actually calling tcp_close which
would then call the tls close callback. Because of this a user could
disconnect a socket then put it in a LISTEN state which would break
our assumptions about sockets always being ESTABLISHED state.

More directly because close() can call unhash() and unhash is
implemented by sockmap if a sockmap socket has TLS enabled we can
incorrectly destroy the psock from unhash() and then call its close
handler again. But because the psock (sockmap socket representation)
is already destroyed we call close handler in sk->prot. However,
in some cases (TLS BASE/BASE case) this will still point at the
sockmap close handler resulting in a circular call and crash reported
by syzbot.

To fix both above issues implement the unhash() routine for TLS.

Fixes: 3c4d7559159bf ("tls: kernel TLS support")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/tls.h  |    5 ++++-
 net/tls/tls_main.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 72ddd16de056..79ef7049375d 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -251,6 +251,8 @@ struct tls_context {
 	u8 tx_conf:3;
 	u8 rx_conf:3;
 
+	struct proto *sk_proto;
+
 	int (*push_pending_record)(struct sock *sk, int flags);
 	void (*sk_write_space)(struct sock *sk);
 
@@ -288,6 +290,8 @@ struct tls_context {
 
 	struct list_head list;
 	refcount_t refcount;
+
+	struct work_struct gc;
 };
 
 enum tls_offload_ctx_dir {
@@ -359,7 +363,6 @@ void tls_sw_strparser_done(struct tls_context *tls_ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
-void tls_sw_close(struct sock *sk, long timeout);
 void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 9f4a9da182ae..f4cb0522fa95 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -251,6 +251,35 @@ static void tls_write_space(struct sock *sk)
 	ctx->sk_write_space(sk);
 }
 
+static void tls_ctx_free_deferred(struct work_struct *gc)
+{
+	struct tls_context *ctx = container_of(gc, struct tls_context, gc);
+
+	if (ctx->rx_conf == TLS_SW || ctx->rx_conf == TLS_HW)
+		tls_sw_strparser_done(ctx);
+
+	if (ctx->rx_conf == TLS_SW)
+		tls_sw_free_ctx_rx(ctx);
+
+	/* Ensure any remaining work items are completed. The sk will
+	 * already have lost its tls_ctx reference by the time we get
+	 * here so no xmit operation will actually be performed.
+	 */
+	tls_sw_cancel_work_tx(ctx);
+	kfree(ctx);
+}
+
+static void tls_ctx_free_wq(struct tls_context *ctx)
+{
+	if (!ctx)
+		return;
+
+	memzero_explicit(&ctx->crypto_send, sizeof(ctx->crypto_send));
+	memzero_explicit(&ctx->crypto_recv, sizeof(ctx->crypto_recv));
+	INIT_WORK(&ctx->gc, tls_ctx_free_deferred);
+	schedule_work(&ctx->gc);
+}
+
 void tls_ctx_free(struct tls_context *ctx)
 {
 	if (!ctx)
@@ -288,6 +317,27 @@ static void tls_sk_proto_cleanup(struct sock *sk,
 #endif
 }
 
+static void tls_sk_proto_unhash(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	long timeo = sock_sndtimeo(sk, 0);
+	struct tls_context *ctx;
+
+	if (unlikely(!icsk->icsk_ulp_data)) {
+		if (sk->sk_prot->unhash)
+			sk->sk_prot->unhash(sk);
+	}
+
+	ctx = tls_get_ctx(sk);
+	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
+		tls_sk_proto_cleanup(sk, ctx, timeo);
+	icsk->icsk_ulp_data = NULL;
+	tls_ctx_free_wq(ctx);
+
+	if (ctx->unhash)
+		ctx->unhash(sk);
+}
+
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
 	void (*sk_proto_close)(struct sock *sk, long timeout);
@@ -306,6 +356,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
 		goto skip_tx_cleanup;
 
+	sk->sk_prot = ctx->sk_proto;
 	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
@@ -611,6 +662,7 @@ static struct tls_context *create_ctx(struct sock *sk)
 	ctx->setsockopt = sk->sk_prot->setsockopt;
 	ctx->getsockopt = sk->sk_prot->getsockopt;
 	ctx->sk_proto_close = sk->sk_prot->close;
+	ctx->unhash = sk->sk_prot->unhash;
 	return ctx;
 }
 
@@ -734,20 +786,24 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
 	prot[TLS_BASE][TLS_BASE].setsockopt	= tls_setsockopt;
 	prot[TLS_BASE][TLS_BASE].getsockopt	= tls_getsockopt;
 	prot[TLS_BASE][TLS_BASE].close		= tls_sk_proto_close;
+	prot[TLS_BASE][TLS_BASE].unhash		= tls_sk_proto_unhash;
 
 	prot[TLS_SW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_SW][TLS_BASE].sendmsg		= tls_sw_sendmsg;
 	prot[TLS_SW][TLS_BASE].sendpage		= tls_sw_sendpage;
+	prot[TLS_SW][TLS_BASE].unhash		= tls_sk_proto_unhash;
 
 	prot[TLS_BASE][TLS_SW] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_BASE][TLS_SW].recvmsg		  = tls_sw_recvmsg;
 	prot[TLS_BASE][TLS_SW].stream_memory_read = tls_sw_stream_read;
 	prot[TLS_BASE][TLS_SW].close		  = tls_sk_proto_close;
+	prot[TLS_BASE][TLS_SW].unhash		  = tls_sk_proto_unhash;
 
 	prot[TLS_SW][TLS_SW] = prot[TLS_SW][TLS_BASE];
 	prot[TLS_SW][TLS_SW].recvmsg		= tls_sw_recvmsg;
 	prot[TLS_SW][TLS_SW].stream_memory_read	= tls_sw_stream_read;
 	prot[TLS_SW][TLS_SW].close		= tls_sk_proto_close;
+	prot[TLS_SW][TLS_SW].unhash		= tls_sk_proto_unhash;
 
 #ifdef CONFIG_TLS_DEVICE
 	prot[TLS_HW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
@@ -798,6 +854,7 @@ static int tls_init(struct sock *sk)
 	tls_build_proto(sk);
 	ctx->tx_conf = TLS_BASE;
 	ctx->rx_conf = TLS_BASE;
+	ctx->sk_proto = sk->sk_prot;
 	update_sk_prot(sk, ctx);
 out:
 	return rc;


^ permalink raw reply related

* [bpf PATCH v3 3/8] tls: remove sock unlock/lock around strp_done()
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

The tls close() callback currently drops the sock lock to call
strp_done(). Split up the RX cleanup into stopping the strparser
and releasing most resources, syncing strparser and finally
freeing the context.

To avoid the need for a strp_done() call on the cleanup path
of device offload make sure we don't arm the strparser until
we are sure init will be successful.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h    |    4 ++-
 net/tls/tls_device.c |    1 -
 net/tls/tls_main.c   |   65 +++++++++++++++++++++++++-------------------------
 net/tls/tls_sw.c     |   33 ++++++++++++++++++-------
 4 files changed, 58 insertions(+), 45 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index d4276cb6de53..72ddd16de056 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -107,9 +107,7 @@ struct tls_device {
 enum {
 	TLS_BASE,
 	TLS_SW,
-#ifdef CONFIG_TLS_DEVICE
 	TLS_HW,
-#endif
 	TLS_HW_RECORD,
 	TLS_NUM_CONFIG,
 };
@@ -357,6 +355,7 @@ int tls_sk_attach(struct sock *sk, int optname, char __user *optval,
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx);
 void tls_sw_strparser_arm(struct sock *sk, struct tls_context *ctx);
+void tls_sw_strparser_done(struct tls_context *tls_ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
@@ -365,6 +364,7 @@ void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 void tls_sw_release_resources_rx(struct sock *sk);
+void tls_sw_free_ctx_rx(struct tls_context *tls_ctx);
 int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		   int nonblock, int flags, int *addr_len);
 bool tls_sw_stream_read(const struct sock *sk);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 4d67d72f007c..7c0b2b778703 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1045,7 +1045,6 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 	rc = tls_set_sw_offload(sk, ctx, 0);
 	if (rc)
 		goto release_ctx;
-	tls_sw_strparser_arm(sk, ctx);
 
 	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX,
 					     &ctx->crypto_recv.info,
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index ddda422498aa..9f4a9da182ae 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -261,27 +261,9 @@ void tls_ctx_free(struct tls_context *ctx)
 	kfree(ctx);
 }
 
-static void tls_sk_proto_close(struct sock *sk, long timeout)
+static void tls_sk_proto_cleanup(struct sock *sk,
+				 struct tls_context *ctx, long timeo)
 {
-	struct tls_context *ctx = tls_get_ctx(sk);
-	long timeo = sock_sndtimeo(sk, 0);
-	void (*sk_proto_close)(struct sock *sk, long timeout);
-	bool free_ctx = false;
-
-	if (ctx->tx_conf == TLS_SW)
-		tls_sw_cancel_work_tx(ctx);
-
-	lock_sock(sk);
-	sk_proto_close = ctx->sk_proto_close;
-
-	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
-		goto skip_tx_cleanup;
-
-	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE) {
-		free_ctx = true;
-		goto skip_tx_cleanup;
-	}
-
 	if (unlikely(sk->sk_write_pending) &&
 	    !wait_on_pending_writer(sk, &timeo))
 		tls_handle_open_record(sk, 0);
@@ -298,27 +280,44 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	}
 
 	if (ctx->rx_conf == TLS_SW)
-		tls_sw_free_resources_rx(sk);
+		tls_sw_release_resources_rx(sk);
 
 #ifdef CONFIG_TLS_DEVICE
 	if (ctx->rx_conf == TLS_HW)
 		tls_device_offload_cleanup_rx(sk);
-
-	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW) {
-#else
-	{
 #endif
-		tls_ctx_free(ctx);
-		ctx = NULL;
-	}
+}
+
+static void tls_sk_proto_close(struct sock *sk, long timeout)
+{
+	void (*sk_proto_close)(struct sock *sk, long timeout);
+	struct tls_context *ctx = tls_get_ctx(sk);
+	long timeo = sock_sndtimeo(sk, 0);
+
+	if (ctx->tx_conf == TLS_SW)
+		tls_sw_cancel_work_tx(ctx);
+
+	lock_sock(sk);
+	sk_proto_close = ctx->sk_proto_close;
+
+	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
+		goto skip_tx_cleanup;
+
+	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
+		goto skip_tx_cleanup;
+
+	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
 	release_sock(sk);
+	if (ctx->rx_conf == TLS_SW || ctx->rx_conf == TLS_HW)
+		tls_sw_strparser_done(ctx);
+	if (ctx->rx_conf == TLS_SW)
+		tls_sw_free_ctx_rx(ctx);
 	sk_proto_close(sk, timeout);
-	/* free ctx for TLS_HW_RECORD, used by tcp_set_state
-	 * for sk->sk_prot->unhash [tls_hw_unhash]
-	 */
-	if (free_ctx)
+
+	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
+	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
 		tls_ctx_free(ctx);
 }
 
@@ -544,9 +543,9 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 			rc = tls_set_sw_offload(sk, ctx, 0);
 			if (rc)
 				goto err_crypto_info;
-			tls_sw_strparser_arm(sk, ctx);
 			conf = TLS_SW;
 		}
+		tls_sw_strparser_arm(sk, ctx);
 	}
 
 	if (tx)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 38c0e53c727d..ee8fef312475 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2114,25 +2114,40 @@ void tls_sw_release_resources_rx(struct sock *sk)
 		skb_queue_purge(&ctx->rx_list);
 		crypto_free_aead(ctx->aead_recv);
 		strp_stop(&ctx->strp);
-		write_lock_bh(&sk->sk_callback_lock);
-		sk->sk_data_ready = ctx->saved_data_ready;
-		write_unlock_bh(&sk->sk_callback_lock);
-		release_sock(sk);
-		strp_done(&ctx->strp);
-		lock_sock(sk);
+		/* If tls_sw_strparser_arm() was not called (cleanup paths)
+		 * we still want to strp_stop(), but sk->sk_data_ready was
+		 * never swapped.
+		 */
+		if (ctx->saved_data_ready) {
+			write_lock_bh(&sk->sk_callback_lock);
+			sk->sk_data_ready = ctx->saved_data_ready;
+			write_unlock_bh(&sk->sk_callback_lock);
+		}
 	}
 }
 
-void tls_sw_free_resources_rx(struct sock *sk)
+void tls_sw_strparser_done(struct tls_context *tls_ctx)
 {
-	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
-	tls_sw_release_resources_rx(sk);
+	strp_done(&ctx->strp);
+}
+
+void tls_sw_free_ctx_rx(struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
 	kfree(ctx);
 }
 
+void tls_sw_free_resources_rx(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+
+	tls_sw_release_resources_rx(sk);
+	tls_sw_free_ctx_rx(tls_ctx);
+}
+
 /* The work handler to transmitt the encrypted records in tx_list */
 static void tx_work_handler(struct work_struct *work)
 {


^ permalink raw reply related

* [bpf PATCH v3 2/8] tls: remove close callback sock unlock/lock around TX work flush
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

The tls close() callback currently drops the sock lock, makes a
cancel_delayed_work_sync() call, and then relocks the sock.

By restructuring the code we can avoid droping lock and then
reclaiming it. To simplify this we do the following,

 tls_sk_proto_close
 set_bit(CLOSING)
 set_bit(SCHEDULE)
 cancel_delay_work_sync() <- cancel workqueue
 lock_sock(sk)
 ...
 release_sock(sk)
 strp_done()

Setting the CLOSING bit prevents the SCHEDULE bit from being
cleared by any workqueue items e.g. if one happens to be
scheduled and run between when we set SCHEDULE bit and cancel
work. Then because SCHEDULE bit is set now no new work will
be scheduled.

Tested with net selftests and bpf selftests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h  |    2 ++
 net/tls/tls_main.c |    3 +++
 net/tls/tls_sw.c   |   24 +++++++++++++++++-------
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 43f551cd508b..d4276cb6de53 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -162,6 +162,7 @@ struct tls_sw_context_tx {
 	int async_capable;
 
 #define BIT_TX_SCHEDULED	0
+#define BIT_TX_CLOSING		1
 	unsigned long tx_bitmask;
 };
 
@@ -360,6 +361,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
 void tls_sw_close(struct sock *sk, long timeout);
+void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 void tls_sw_release_resources_rx(struct sock *sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 85a9d7d57b32..ddda422498aa 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -268,6 +268,9 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	void (*sk_proto_close)(struct sock *sk, long timeout);
 	bool free_ctx = false;
 
+	if (ctx->tx_conf == TLS_SW)
+		tls_sw_cancel_work_tx(ctx);
+
 	lock_sock(sk);
 	sk_proto_close = ctx->sk_proto_close;
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f58a8ffc2a9c..38c0e53c727d 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2054,6 +2054,15 @@ static void tls_data_ready(struct sock *sk)
 	}
 }
 
+void tls_sw_cancel_work_tx(struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+
+	set_bit(BIT_TX_CLOSING, &ctx->tx_bitmask);
+	set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask);
+	cancel_delayed_work_sync(&ctx->tx_work.work);
+}
+
 void tls_sw_free_resources_tx(struct sock *sk)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -2065,11 +2074,6 @@ void tls_sw_free_resources_tx(struct sock *sk)
 	if (atomic_read(&ctx->encrypt_pending))
 		crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
 
-	release_sock(sk);
-	cancel_delayed_work_sync(&ctx->tx_work.work);
-	lock_sock(sk);
-
-	/* Tx whatever records we can transmit and abandon the rest */
 	tls_tx_records(sk, -1);
 
 	/* Free up un-sent records in tx_list. First, free
@@ -2137,11 +2141,17 @@ static void tx_work_handler(struct work_struct *work)
 					       struct tx_work, work);
 	struct sock *sk = tx_work->sk;
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+	struct tls_sw_context_tx *ctx;
 
-	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
+	if (unlikely(!tls_ctx))
 		return;
 
+	ctx = tls_sw_ctx_tx(tls_ctx);
+	if (test_bit(BIT_TX_CLOSING, &ctx->tx_bitmask))
+		return;
+
+	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
+		return;
 	lock_sock(sk);
 	tls_tx_records(sk, -1);
 	release_sock(sk);


^ permalink raw reply related

* [bpf PATCH v3 1/8] net/tls: don't arm strparser immediately in tls_set_sw_offload()
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

From: Jakub Kicinski <jakub.kicinski@netronome.com>

In tls_set_device_offload_rx() we prepare the software context
for RX fallback and proceed to add the connection to the device.
Unfortunately, software context prep includes arming strparser
so in case of a later error we have to release the socket lock
to call strp_done().

In preparation for not releasing the socket lock half way through
callbacks move arming strparser into a separate function.
Following patches will make use of that.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h    |    1 +
 net/tls/tls_device.c |    1 +
 net/tls/tls_main.c   |    8 +++++---
 net/tls/tls_sw.c     |   19 ++++++++++++-------
 4 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 584609174fe0..43f551cd508b 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -355,6 +355,7 @@ int tls_sk_attach(struct sock *sk, int optname, char __user *optval,
 		  unsigned int optlen);
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx);
+void tls_sw_strparser_arm(struct sock *sk, struct tls_context *ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 7c0b2b778703..4d67d72f007c 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1045,6 +1045,7 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 	rc = tls_set_sw_offload(sk, ctx, 0);
 	if (rc)
 		goto release_ctx;
+	tls_sw_strparser_arm(sk, ctx);
 
 	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX,
 					     &ctx->crypto_recv.info,
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 4674e57e66b0..85a9d7d57b32 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -526,6 +526,8 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 		{
 #endif
 			rc = tls_set_sw_offload(sk, ctx, 1);
+			if (rc)
+				goto err_crypto_info;
 			conf = TLS_SW;
 		}
 	} else {
@@ -537,13 +539,13 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 		{
 #endif
 			rc = tls_set_sw_offload(sk, ctx, 0);
+			if (rc)
+				goto err_crypto_info;
+			tls_sw_strparser_arm(sk, ctx);
 			conf = TLS_SW;
 		}
 	}
 
-	if (rc)
-		goto err_crypto_info;
-
 	if (tx)
 		ctx->tx_conf = conf;
 	else
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 53b4ad94e74a..f58a8ffc2a9c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2160,6 +2160,18 @@ void tls_sw_write_space(struct sock *sk, struct tls_context *ctx)
 	}
 }
 
+void tls_sw_strparser_arm(struct sock *sk, struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_rx *rx_ctx = tls_sw_ctx_rx(tls_ctx);
+
+	write_lock_bh(&sk->sk_callback_lock);
+	rx_ctx->saved_data_ready = sk->sk_data_ready;
+	sk->sk_data_ready = tls_data_ready;
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	strp_check_rcv(&rx_ctx->strp);
+}
+
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -2357,13 +2369,6 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		cb.parse_msg = tls_read_size;
 
 		strp_init(&sw_ctx_rx->strp, sk, &cb);
-
-		write_lock_bh(&sk->sk_callback_lock);
-		sw_ctx_rx->saved_data_ready = sk->sk_data_ready;
-		sk->sk_data_ready = tls_data_ready;
-		write_unlock_bh(&sk->sk_callback_lock);
-
-		strp_check_rcv(&sw_ctx_rx->strp);
 	}
 
 	goto out;


^ permalink raw reply related

* [bpf PATCH v3 0/8] sockmap/tls fixes
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf

Resolve a series of splats discovered by syzbot and an unhash
TLS issue noted by Eric Dumazet.

The main issues revolved around interaction between TLS and
sockmap tear down. TLS and sockmap could both reset sk->prot
ops creating a condition where a close or unhash op could be
called forever. A rare race condition resulting from a missing
rcu sync operation was causing a use after free. Then on the
TLS side dropping the sock lock and re-acquiring it during the
close op could hang. Finally, sockmap must be deployed before
tls for current stack assumptions to be met. This is enforced
now. A feature series can enable it.

To fix this first refactor TLS code so the lock is held for the
entire teardown operation. Then add an unhash callback to ensure
TLS can not transition from ESTABLISHED to LISTEN state. This
transition is a similar bug to the one found and fixed previously
in sockmap. Then apply three fixes to sockmap to fix up races
on tear down around map free and close. Finally, if sockmap
is destroyed before TLS we add a new ULP op update to inform
the TLS stack it should not call sockmap ops. This last one
appears to be the most commonly found issue from syzbot.

---

Jakub Kicinski (1):
      net/tls: don't arm strparser immediately in tls_set_sw_offload()

John Fastabend (7):
      tls: remove close callback sock unlock/lock around TX work flush
      tls: remove sock unlock/lock around strp_done()
      bpf: tls fix transition through disconnect with close
      bpf: sockmap, sock_map_delete needs to use xchg
      bpf: sockmap, synchronize_rcu before free'ing map
      bpf: sockmap, only create entry if ulp is not already enabled
      bpf: sockmap/tls, close can race with map free


 include/linux/skmsg.h |    8 ++-
 include/net/tcp.h     |    3 +
 include/net/tls.h     |   12 +++-
 net/core/skmsg.c      |    4 +
 net/core/sock_map.c   |   19 ++++--
 net/ipv4/tcp_ulp.c    |   13 ++++
 net/tls/tls_main.c    |  155 ++++++++++++++++++++++++++++++++++++++-----------
 net/tls/tls_sw.c      |   76 +++++++++++++++++-------
 8 files changed, 221 insertions(+), 69 deletions(-)

--
Signature

^ permalink raw reply

* Re: [PATCH iproute2] tc: util: constrain percentage in 0-100 interval
From: Stephen Hemminger @ 2019-07-15 20:48 UTC (permalink / raw)
  To: Andrea Claudi; +Cc: netdev, dsahern
In-Reply-To: <c0a9b4ce15d5389ac59fbf572f5f1b3030ec4c90.1563011008.git.aclaudi@redhat.com>

On Sat, 13 Jul 2019 11:44:07 +0200
Andrea Claudi <aclaudi@redhat.com> wrote:

> parse_percent() currently allows to specify negative percentages
> or value above 100%. However this does not seems to make sense,
> as the function is used for probabilities or bandiwidth rates.
> 
> Moreover, using negative values leads to erroneous results
> (using Bernoulli loss model as example):
> 
> $ ip link add test type dummy
> $ ip link set test up
> $ tc qdisc add dev test root netem loss gemodel -10% limit 10
> $ tc qdisc show dev test
> qdisc netem 800c: root refcnt 2 limit 10 loss gemodel p 90% r 10% 1-h 100% 1-k 0%
> 
> Using values above 100% we have instead:
> 
> $ ip link add test type dummy
> $ ip link set test up
> $ tc qdisc add dev test root netem loss gemodel 140% limit 10
> $ tc qdisc show dev test
> qdisc netem 800f: root refcnt 2 limit 10 loss gemodel p 40% r 60% 1-h 100% 1-k 0%
> 
> This commit changes parse_percent() with a check to ensure
> percentage values stay between 1.0 and 0.0.
> parse_percent_rate() function, which already employs a similar
> check, is adjusted accordingly.
> 
> With this check in place, we have:
> 
> $ ip link add test type dummy
> $ ip link set test up
> $ tc qdisc add dev test root netem loss gemodel -10% limit 10
> Illegal "loss gemodel p"
> 
> Fixes: 927e3cfb52b58 ("tc: B.W limits can now be specified in %.")
> Signed-off-by: Andrea Claudi <aclaudi@redhat.com>

Looks good. Applied

^ permalink raw reply

* Re: [net-next 1/2] ipvs: batch __ip_vs_cleanup
From: Julian Anastasov @ 2019-07-15 20:39 UTC (permalink / raw)
  To: Haishuang Yan
  Cc: David S. Miller, Pablo Neira Ayuso, Simon Horman, netdev,
	lvs-devel, linux-kernel, netfilter-devel
In-Reply-To: <1563031186-2101-2-git-send-email-yanhaishuang@cmss.chinamobile.com>


	Hello,

On Sat, 13 Jul 2019, Haishuang Yan wrote:

> It's better to batch __ip_vs_cleanup to speedup ipvs
> connections dismantle.
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> ---
>  include/net/ip_vs.h             |  2 +-
>  net/netfilter/ipvs/ip_vs_core.c | 29 +++++++++++++++++------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 13 ++++++++++---
>  3 files changed, 28 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 3759167..93e7a25 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -1324,7 +1324,7 @@ static inline void ip_vs_control_del(struct ip_vs_conn *cp)
>  void ip_vs_control_net_cleanup(struct netns_ipvs *ipvs);
>  void ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs);
>  void ip_vs_sync_net_cleanup(struct netns_ipvs *ipvs);
> -void ip_vs_service_net_cleanup(struct netns_ipvs *ipvs);
> +void ip_vs_service_nets_cleanup(struct list_head *net_list);
>  
>  /* IPVS application functions
>   * (from ip_vs_app.c)
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 46f06f9..b4d79b7 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -2402,18 +2402,23 @@ static int __net_init __ip_vs_init(struct net *net)
>  	return -ENOMEM;
>  }
>  
> -static void __net_exit __ip_vs_cleanup(struct net *net)
> +static void __net_exit __ip_vs_cleanup_batch(struct list_head *net_list)
>  {
> -	struct netns_ipvs *ipvs = net_ipvs(net);
> -
> -	ip_vs_service_net_cleanup(ipvs);	/* ip_vs_flush() with locks */
> -	ip_vs_conn_net_cleanup(ipvs);
> -	ip_vs_app_net_cleanup(ipvs);
> -	ip_vs_protocol_net_cleanup(ipvs);
> -	ip_vs_control_net_cleanup(ipvs);
> -	ip_vs_estimator_net_cleanup(ipvs);
> -	IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> -	net->ipvs = NULL;
> +	struct netns_ipvs *ipvs;
> +	struct net *net;
> +	LIST_HEAD(list);
> +
> +	ip_vs_service_nets_cleanup(net_list);	/* ip_vs_flush() with locks */
> +	list_for_each_entry(net, net_list, exit_list) {

	How much faster is to replace list_for_each_entry in
ops_exit_list() with this one. IPVS can waste time in calls
such as kthread_stop() and del_timer_sync() but I'm not sure
we can solve it easily. What gain do you see in benchmarks?

> +		ipvs = net_ipvs(net);
> +		ip_vs_conn_net_cleanup(ipvs);
> +		ip_vs_app_net_cleanup(ipvs);
> +		ip_vs_protocol_net_cleanup(ipvs);
> +		ip_vs_control_net_cleanup(ipvs);
> +		ip_vs_estimator_net_cleanup(ipvs);
> +		IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> +		net->ipvs = NULL;
> +	}
>  }

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-15 20:38 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Tycho Andersen, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, netfilter-devel, sgrubb, omosnace,
	dhowells, simo, Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190708175105.7zb6mikjw2wmnwln@madcap2.tricolour.ca>

On Mon, Jul 8, 2019 at 1:51 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-29 11:29, Paul Moore wrote:

...

> > The idea is that only container orchestrators should be able to
> > set/modify the audit container ID, and since setting the audit
> > container ID can have a significant effect on the records captured
> > (and their routing to multiple daemons when we get there) modifying
> > the audit container ID is akin to modifying the audit configuration
> > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > is that you would only change the audit container ID from one
> > set/inherited value to another if you were nesting containers, in
> > which case the nested container orchestrator would need to be granted
> > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > compromise).  We did consider allowing for a chain of nested audit
> > container IDs, but the implications of doing so are significant
> > (implementation mess, runtime cost, etc.) so we are leaving that out
> > of this effort.
>
> We had previously discussed the idea of restricting
> orchestrators/engines from only being able to set the audit container
> identifier on their own descendants, but it was discarded.  I've added a
> check to ensure this is now enforced.

When we weren't allowing nested orchestrators it wasn't necessary, but
with the move to support nesting I believe this will be a requirement.
We might also need/want to restrict audit container ID changes if a
descendant is acting as a container orchestrator and managing one or
more audit container IDs; although I'm less certain of the need for
this.

> I've also added a check to ensure that a process can't set its own audit
> container identifier ...

What does this protect against, or what problem does this solve?
Considering how easy it is to fork/exec, it seems like this could be
trivially bypassed.

> ... and that if the identifier is already set, then the
> orchestrator/engine must be in a descendant user namespace from the
> orchestrator that set the previously inherited audit container
> identifier.

You lost me here ... although I don't like the idea of relying on X
namespace inheritance for a hard coded policy on setting the audit
container ID; we've worked hard to keep this independent of any
definition of a "container" and it would sadden me greatly if we had
to go back on that.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH iproute2 net-next v2 1/6] Kernel header update for hardware offloading changes.
From: David Ahern @ 2019-07-15 20:16 UTC (permalink / raw)
  To: Stephen Hemminger, Patel, Vedang
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Gomes, Vinicius,
	netdev@vger.kernel.org, Dorileo, Leandro, Jakub Kicinski,
	Murali Karicheri
In-Reply-To: <20190715125059.70470f9e@hermes.lan>

On 7/15/19 1:50 PM, Stephen Hemminger wrote:
> On Mon, 15 Jul 2019 19:40:19 +0000
> "Patel, Vedang" <vedang.patel@intel.com> wrote:
> 
>> Hi Stephen, 
>>
>> The kernel patches corresponding to this series have been merged. I just wanted to check whether these iproute2 related patches are on your TODO list.
>>
>> Let me know if you need any information from me on these patches.
>>
>> Thanks,
>> Vedang Patel
> 
> 
> David Ahern handles iproute2 next
> 
> https://patchwork.ozlabs.org/patch/1111466/
> 

given the long time delay between when the iproute2 patches were posted
and when the kernel side was accepted you will need to re-send the
iproute2 patches.

^ permalink raw reply

* [net 2/3] net/mlx5e: Rely on filter_dev instead of dissector keys for tunnels
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Vlad Buslov, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>

Currently, tunnel attributes are parsed and inner header matching is used
only when flow dissector specifies match on some of the supported
encapsulation fields. When user tries to offload tc filter that doesn't
match any encapsulation fields on tunnel device, mlx5 tc layer incorrectly
sets to match packet header keys on encap header (outer header) and
firmware rejects the rule with syndrome 0x7e1579 when creating new flow
group.

Change __parse_cls_flower() to determine whether tunnel is used based on
fitler_dev tunnel info, instead of determining it indirectly by checking
flow dissector enc keys.

Fixes: bbd00f7e2349 ("net/mlx5e: Add TC tunnel release action for SRIOV offloads")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 018709a4343f..b95e0ae4d7fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1522,11 +1522,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 		return -EOPNOTSUPP;
 	}
 
-	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_KEYID) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_PORTS) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_OPTS)) {
+	if (mlx5e_get_tc_tun(filter_dev)) {
 		if (parse_tunnel_attr(priv, spec, f, filter_dev, tunnel_match_level))
 			return -EOPNOTSUPP;
 
-- 
2.21.0


^ permalink raw reply related

* [net 3/3] net/mlx5e: Allow dissector meta key in tc flower
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Vlad Buslov, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>

Recently, fl_flow_key->indev_ifindex int field was refactored into
flow_dissector_key_meta field. With this, flower classifier also sets
FLOW_DISSECTOR_KEY_META flow dissector key. However, mlx5 flower dissector
validation code rejects filters that use flow dissector keys that are not
supported. Add FLOW_DISSECTOR_KEY_META to the list of allowed dissector
keys in __parse_cls_flower() to prevent following error when offloading
flower classifier to mlx5:

Error: mlx5_core: Unsupported key.

Fixes: 8212ed777f40 ("net: sched: cls_flower: use flow_dissector for ingress ifindex")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index b95e0ae4d7fd..cc096f6011d9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1499,7 +1499,8 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 	*match_level = MLX5_MATCH_NONE;
 
 	if (dissector->used_keys &
-	    ~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+	    ~(BIT(FLOW_DISSECTOR_KEY_META) |
+	      BIT(FLOW_DISSECTOR_KEY_CONTROL) |
 	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
 	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
 	      BIT(FLOW_DISSECTOR_KEY_VLAN) |
-- 
2.21.0


^ permalink raw reply related

* [net 1/3] net/mlx5e: Verify encapsulation is supported
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Eli Cohen, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Eli Cohen <eli@mellanox.com>

When mlx5e_attach_encap() calls mlx5e_get_tc_tun() to get the tunnel
info data struct, check that returned value is not NULL, as would be in
the case of unsupported encapsulation.

Fixes: d386939a327d2 ("net/mlx5e: Rearrange tc tunnel code in a modular way")
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 2d6436257f9d..018709a4343f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2647,6 +2647,10 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
 	family = ip_tunnel_info_af(tun_info);
 	key.ip_tun_key = &tun_info->key;
 	key.tc_tunnel = mlx5e_get_tc_tun(mirred_dev);
+	if (!key.tc_tunnel) {
+		NL_SET_ERR_MSG_MOD(extack, "Unsupported tunnel");
+		return -EOPNOTSUPP;
+	}
 
 	hash_key = hash_encap_info(&key);
 
-- 
2.21.0


^ permalink raw reply related

* [pull request][net 0/3] Mellanox, mlx5 fixes 2019-07-15
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed

Hi Dave,

This pull request provides mlx5 TC flower and tunnel fixes for kernel 5.2
from Eli and Vlad.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit f384e62a82ba5d85408405fdd6aeff89354deaa9:

  ISDN: hfcsusb: checking idx of ep configuration (2019-07-15 11:10:31 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-fixes-2019-07-15

for you to fetch changes up to 3d144578c91a2db417923ba905ce7a84ce0c274b:

  net/mlx5e: Allow dissector meta key in tc flower (2019-07-15 13:04:04 -0700)

----------------------------------------------------------------
mlx5-fixes-2019-07-15

----------------------------------------------------------------
Eli Cohen (1):
      net/mlx5e: Verify encapsulation is supported

Vlad Buslov (2):
      net/mlx5e: Rely on filter_dev instead of dissector keys for tunnels
      net/mlx5e: Allow dissector meta key in tc flower

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox