Netdev List


* Re: [PATCH net-next v2] net: vhost: improve performance when enable busyloop
From: Michael S. Tsirkin @ 2018-06-27 15:58 UTC (permalink / raw)
  To: Jason Wang; +Cc: xiangxia.m.yue, virtualization, netdev, Tonghao Zhang
In-Reply-To: <369bea44-6ebd-337a-b20b-a28a604fa2e9@redhat.com>

On Wed, Jun 27, 2018 at 10:24:43PM +0800, Jason Wang wrote:
> 
> 
> On 2018-06-26 13:17, xiangxia.m.yue@gmail.com wrote:
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> > 
> > This patch improves guest receive performance from the
> > host. On the handle_tx side, we also poll the sock receive
> > queue at the same time. handle_rx does the same.
> > 
> > To avoid deadlock, change the code to lock the vqs one
> > by one and use VHOST_NET_VQ_XX as the subclass for
> > mutex_lock_nested. With this patch, qemu can set a different
> > busyloop_timeout for the rx and tx queues.
> > 
> > We set poll-us=100us and used iperf3 to test
> > throughput. The iperf3 commands are shown below.
> > 
> > on the guest:
> > iperf3  -s -D
> > 
> > on the host:
> > iperf3  -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400
> > 
> > * With the patch:     23.1 Gbits/sec
> > * Without the patch:  12.7 Gbits/sec
> > 
> > Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
> 
> Thanks a lot for the patch. Looks good generally, but please split this big
> patch into separate ones like:
> 
> patch 1: lock vqs one by one
> patch 2: replace magic number of lock annotation
> patch 3: factor out generic busy polling logic to vhost_net_busy_poll()
> patch 4: add rx busy polling in tx path.
> 
> And please cc Michael in v3.
> 
> Thanks

Pls include host CPU utilization numbers. You can get them e.g. using
vmstat. I suspect we also want the polling controllable e.g. through
an ioctl.

-- 
MST

^ permalink raw reply

* Re: brcmsmac: make function wlc_phy_workarounds_nphy_rev1 static
From: Kalle Valo @ 2018-06-27 15:58 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Arend van Spriel, Franky Lin, Hante Meuleman, Chi-Hsien Lin,
	Wright Feng, David S . Miller, linux-wireless,
	brcm80211-dev-list.pdl, brcm80211-dev-list, netdev,
	kernel-janitors, linux-kernel
In-Reply-To: <20180623221531.6396-2-colin.king@canonical.com>

Colin Ian King <colin.king@canonical.com> wrote:

> From: Colin Ian King <colin.king@canonical.com>
> 
> The function wlc_phy_workarounds_nphy_rev1 is local to the source and
> does not need to be in global scope, so make it static.
> 
> Cleans up sparse warning:
> symbol 'wlc_phy_workarounds_nphy_rev1' was not declared. Should it
> be static?
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Patch applied to wireless-drivers-next.git, thanks.

ab8d904654e2 brcmsmac: make function wlc_phy_workarounds_nphy_rev1 static

-- 
https://patchwork.kernel.org/patch/10483927/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* Re: brcmsmac: fix wrap around in conversion from constant to s16
From: Kalle Valo @ 2018-06-27 15:57 UTC (permalink / raw)
  To: Stefan Agner
  Cc: Stefan Agner, Tobias Regnery, Arend van Spriel, Franky Lin,
	Hante Meuleman, Chi-Hsien Lin, Wright Feng, David S. Miller,
	linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list,
	netdev, linux-kernel
In-Reply-To: <20180617103407.27819-1-stefan@agner.ch>

Stefan Agner <stefan@agner.ch> wrote:

> The last value in the log_table wraps around to a negative value
> since s16 has a value range of -32768 to 32767. This is not what
> the table intends to represent. Use the closest positive value
> 32767.
> 
> This fixes a warning seen with clang:
> drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_qmath.c:216:2: warning:
>       implicit conversion from 'int' to 's16' (aka 'short') changes
> value from 32768
>       to -32768 [-Wconstant-conversion]
>         32768
>         ^~~~~
> 1 warning generated.
> 
> Fixes: 4c0bfeaae9f9 ("brcmsmac: fix array out-of-bounds access in qm_log10")
> Cc: Tobias Regnery <tobias.regnery@gmail.com>
> Signed-off-by: Stefan Agner <stefan@agner.ch>

Patch applied to wireless-drivers-next.git, thanks.

c9a61469fc97 brcmsmac: fix wrap around in conversion from constant to s16
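
The wrap-around described in the quoted commit message can be reproduced outside the kernel. The following is a minimal sketch, assuming `int16_t` as a user-space stand-in for the kernel's `s16`. `clamp_to_s16` is a hypothetical helper for illustration only and is not part of phy_qmath.c; the applied patch simply changes the table constant to 32767.

```c
#include <stdint.h>

/* Hypothetical helper: saturate an int into the s16 range instead of
 * letting an out-of-range constant wrap to a negative value.
 */
static int16_t clamp_to_s16(int value)
{
	if (value > INT16_MAX)
		return INT16_MAX;	/* 32768 and above become 32767 */
	if (value < INT16_MIN)
		return INT16_MIN;
	return (int16_t)value;
}
```

With this helper, 32768 maps to 32767, the closest representable positive value, which matches the choice made in the patch.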

-- 
https://patchwork.kernel.org/patch/10468755/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* Re: brcmsmac: Remove unnecessary parentheses
From: Kalle Valo @ 2018-06-27 15:55 UTC (permalink / raw)
  To: Varsha Rao
  Cc: Nicholas Mc Guire, Lukas Bulwahn, Arend van Spriel, Franky Lin,
	Hante Meuleman, Chi-Hsien Lin, Wright Feng, David S. Miller,
	linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list,
	netdev, linux-kernel, Varsha Rao
In-Reply-To: <20180601021413.4031-1-rvarsha016@gmail.com>

Varsha Rao <rvarsha016@gmail.com> wrote:

> This patch fixes the clang warning of extraneous parentheses, with the
> following coccinelle script.
> 
> @@
> identifier i;
> expression e;
> statement s;
> @@
> if (
> -(i == e)
> +i == e
>  )
> s
> 
> Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Signed-off-by: Varsha Rao <rvarsha016@gmail.com>

Patch applied to wireless-drivers-next.git, thanks.

eb5d2f3afc0f brcmsmac: Remove unnecessary parentheses

-- 
https://patchwork.kernel.org/patch/10442401/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* Re: [PATCH bpf 4/4] xsk: fix potential race in SKB TX completion code
From: Eric Dumazet @ 2018-06-27 15:55 UTC (permalink / raw)
  To: Magnus Karlsson, bjorn.topel, ast, daniel, netdev; +Cc: qi.z.zhang, pavel
In-Reply-To: <1530108136-4984-5-git-send-email-magnus.karlsson@intel.com>



On 06/27/2018 07:02 AM, Magnus Karlsson wrote:
> There was a potential race in the TX completion code for
> the SKB case when the TX napi thread and the error path
> of the sendmsg code could both call the SKB destructor
> at the same time. Fixed by introducing a spin_lock in the
> destructor.


Wow, what is the impact on performance ?

Please describe a bit more what is the problem.

^ permalink raw reply

* Re: [PATCH net-next] tcp: replace LINUX_MIB_TCPOFODROP with LINUX_MIB_TCPRMEMFULLDROP for drops due to receive buffer full
From: Eric Dumazet @ 2018-06-27 15:52 UTC (permalink / raw)
  To: Yafang Shao, Eric Dumazet; +Cc: Eric Dumazet, David Miller, netdev, LKML
In-Reply-To: <CALOAHbCRJWThE6oLH0m9ZV8xRMMCaZFM68eSGcm8awRbCLqOsw@mail.gmail.com>



On 06/27/2018 08:38 AM, Yafang Shao wrote:
> On Wed, Jun 27, 2018 at 11:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>> On 06/27/2018 08:14 AM, Yafang Shao wrote:
>>
>>> Got it!
>>>
>>> What about introducing a new counter, e.g. TCPRcvQFullDrop ?
>>
>> tcp_try_rmem_schedule() can fail for many different reasons,
>> not related to how occupied the socket receive queue is.
> 
> Yes. So TCPRcvQDrop would be more specific ?

Yes, this looks better.

^ permalink raw reply

* Re: [PATCH 1/1] esp6: fix memleak on error path in esp6_input
From: Steffen Klassert @ 2018-06-27 15:51 UTC (permalink / raw)
  To: Zhen Lei
  Cc: Herbert Xu, David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	netdev, linux-kernel, Hanjun Guo, Libin, YueHaibing
In-Reply-To: <1530071368-15156-1-git-send-email-thunder.leizhen@huawei.com>

On Wed, Jun 27, 2018 at 11:49:28AM +0800, Zhen Lei wrote:
> This ought to be an omission in e6194923237 ("esp: Fix memleaks on error
> paths."). The memleak on error path in esp6_input is similar to esp_input
> of esp4.
> 
> Fixes: e6194923237 ("esp: Fix memleaks on error paths.")
> Fixes: 3f29770723f ("ipsec: check return value of skb_to_sgvec always")
> 
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>

Applied, thanks a lot for the fix!

^ permalink raw reply

* Re: [PATCH ipsec-next 1/1] xfrm: don't check offload_handle for nonzero
From: Steffen Klassert @ 2018-06-27 15:49 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: netdev
In-Reply-To: <1530047950-3967-1-git-send-email-shannon.nelson@oracle.com>

On Tue, Jun 26, 2018 at 02:19:10PM -0700, Shannon Nelson wrote:
> The offload_handle should be an opaque data cookie for the driver
> to use, much like the data cookie for a timer or alarm callback.
> Thus, the XFRM stack should not be checking for non-zero, because
> the driver might use that to store an array reference, which could
> be zero, or some other zero but meaningful value.
> 
> We can remove the checks for non-zero because there are plenty of
> other attributes also being checked to see if there is an offload
> in place for the SA in question.
> 
> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

I don't have access to my hw offload testlab currently,
so I've queued this one until I'm back from my vacation.
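
The argument in the quoted commit message is that a zero cookie can be a valid value, so presence has to be decided from some other field. A minimal sketch of that idea follows. The struct and field names are hypothetical stand-ins and do not reproduce the real XFRM offload state, and which attributes the stack actually checks is not shown in this message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-SA offload state. */
struct offload_state {
	void *dev;		/* assumed: non-NULL when an offload is bound */
	unsigned long handle;	/* opaque driver cookie, 0 is a valid value */
};

/* Decide whether an offload is in place without looking at the cookie. */
static bool offload_active(const struct offload_state *s)
{
	return s->dev != NULL;
}
```

A driver that stores an array index in `handle` would use 0 for its first entry, and `offload_active()` still reports the offload as present.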

^ permalink raw reply

* [PATCH net] tcp: add one more quick ack after ECN events
From: Eric Dumazet @ 2018-06-27 15:47 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet, Lawrence Brakmo

Larry Brakmo's proposal ( https://patchwork.ozlabs.org/patch/935233/
tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
our recent patch removing ~16 quick acks after ECN events.

tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
but in the case the sender cwnd was lowered to 1, we do not want
to have a delayed ack for the next packet we will receive.

Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 355d3dffd021ccad0f30891994289d916f7d276c..045d930d01a92c2b0b7535062b32c27f6c150460 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -265,7 +265,7 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
 		 * it is probably a retransmit.
 		 */
 		if (tp->ecn_flags & TCP_ECN_SEEN)
-			tcp_enter_quickack_mode(sk, 1);
+			tcp_enter_quickack_mode(sk, 2);
 		break;
 	case INET_ECN_CE:
 		if (tcp_ca_needs_ecn(sk))
@@ -273,7 +273,7 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)

 		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
 			/* Better not delay acks, sender can have a very low cwnd */
-			tcp_enter_quickack_mode(sk, 1);
+			tcp_enter_quickack_mode(sk, 2);
 			tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 		}
 		tp->ecn_flags |= TCP_ECN_SEEN;
-- 
2.18.0.rc2.346.g013aa6912e-goog
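
The effect of raising the quick-ack budget from 1 to 2 can be pictured with a toy model. This is only a sketch under a strong simplifying assumption: each arriving packet consumes one unit of budget and is acked immediately, and once the budget is gone the remaining packets get delayed acks. `immediate_acks` is a hypothetical function and does not reproduce the real quick-ack accounting in tcp_input.c.

```c
/* Toy receiver: count how many of the next `packets` arrivals are acked
 * immediately, given a quick-ack budget.
 */
static unsigned int immediate_acks(unsigned int budget, unsigned int packets)
{
	unsigned int acked_now = 0;

	for (unsigned int i = 0; i < packets; i++) {
		if (budget == 0)
			break;	/* remaining packets get delayed acks */
		budget--;
		acked_now++;
	}
	return acked_now;
}
```

In this model a budget of 1 leaves the second packet waiting for a delayed ack, which is the situation the commit message wants to avoid when the sender's cwnd has dropped to 1.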

^ permalink raw reply related

* Re: [PATCH net-next] tcp: replace LINUX_MIB_TCPOFODROP with LINUX_MIB_TCPRMEMFULLDROP for drops due to receive buffer full
From: Yafang Shao @ 2018-06-27 15:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, David Miller, netdev, LKML
In-Reply-To: <72f02552-9375-bf81-6a03-42656168b583@gmail.com>

On Wed, Jun 27, 2018 at 11:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 06/27/2018 08:14 AM, Yafang Shao wrote:
>
>> Got it!
>>
>> What about introducing a new counter, e.g. TCPRcvQFullDrop ?
>
> tcp_try_rmem_schedule() can fail for many different reasons,
> not related to how occupied the socket receive queue is.

Yes. So TCPRcvQDrop would be more specific ?

Thanks
Yafang

^ permalink raw reply

* Re: [PATCH net-next] tcp: replace LINUX_MIB_TCPOFODROP with LINUX_MIB_TCPRMEMFULLDROP for drops due to receive buffer full
From: Eric Dumazet @ 2018-06-27 15:27 UTC (permalink / raw)
  To: Yafang Shao; +Cc: Eric Dumazet, David Miller, netdev, LKML
In-Reply-To: <CALOAHbCdRjq2n37GpeBdorcbxXMDX2vNDLftypViJd5hRTA28A@mail.gmail.com>



On 06/27/2018 08:14 AM, Yafang Shao wrote:
 
> Got it!
> 
> What about introducing a new counter, e.g. TCPRcvQFullDrop ?

tcp_try_rmem_schedule() can fail for many different reasons,
not related to how occupied the socket receive queue is.

^ permalink raw reply

* Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: Neal Cardwell @ 2018-06-27 15:24 UTC (permalink / raw)
  To: Lawrence Brakmo, Yuchung Cheng, Matt Mathis
  Cc: Netdev, Kernel Team, bmatheny, ast, Eric Dumazet
In-Reply-To: <20180627023403.3395818-1-brakmo@fb.com>

On Tue, Jun 26, 2018 at 10:34 PM Lawrence Brakmo <brakmo@fb.com> wrote:
> The only issue is if it is safe to always use 2 or if it is better to
> use min(2, snd_ssthresh) (which could still trigger the problem).

Always using 2 SGTM. I don't think we need min(2, snd_ssthresh), as
that should be the same as just 2, since:

(a) RFCs mandate ssthresh should not be below 2, e.g.
https://tools.ietf.org/html/rfc5681 page 7:

 ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

(b) The main loss-based CCs used in Linux (CUBIC, Reno, DCTCP) respect
that constraint, and always have an ssthresh of at least 2.

And if some CC misbehaves and uses a lower ssthresh, then taking
min(2, snd_ssthresh) will trigger problems, as you note.
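
The equivalence argued in (a) and (b) can be checked with a small sketch. It assumes packet units, with SMSS taken as one packet so that equation (4) becomes max(flight / 2, 2). Both function names are hypothetical and are not kernel code.

```c
/* Packet-count version of RFC 5681 equation (4):
 * ssthresh = max(FlightSize / 2, 2 * SMSS), with SMSS == 1 packet.
 */
static unsigned int ssthresh_pkts(unsigned int flight_pkts)
{
	unsigned int half = flight_pkts / 2;

	return half > 2 ? half : 2;
}

/* The alternative floor from the quoted question: min(2, ssthresh). */
static unsigned int cwnd_floor(unsigned int ssthresh)
{
	return ssthresh < 2 ? ssthresh : 2;
}
```

For any ssthresh produced by `ssthresh_pkts()`, `cwnd_floor()` returns 2, so the min() adds nothing. It only differs from 2 when a congestion control module reports an ssthresh below 2, which is the misbehaving case.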

> +       tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);

AFAICT this does seem like it will make the sender behavior more
aggressive in cases with high loss and/or a very low per-flow
fair-share.

Old:

o send N packets
o receive SACKs for last 3 packets
o fast retransmit packet 1
o using ACKs, slow-start upward

New:

o send N packets
o receive SACKs for last 3 packets
o fast retransmit packets 1 and 2
o using ACKs, slow-start upward

In the extreme case, if the available fair share is less than 2
packets, whereas inflight would have oscillated between 1 packet and 2
packets with the existing code, it now seems like with this commit the
inflight will now hover at 2. It seems like this would have
significantly higher losses than we had with the existing code.

This may or may not be OK in practice, but IMHO it is worth mentioning
and discussing.

neal

^ permalink raw reply

* [PATCH 6/6] netfilter: nf_conncount: fix garbage collection confirm race
From: Pablo Neira Ayuso @ 2018-06-27 15:22 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180627152223.3633-1-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

Yi-Hung Wei and Justin Pettit found a race in the garbage collection scheme
used by nf_conncount.

When doing list walk, we lookup the tuple in the conntrack table.
If the lookup fails we remove this tuple from our list because
the conntrack entry is gone.

This is the common cause, but it turns out it's not the only one.
The list entry could have been created just before by another cpu, i.e. the
conntrack entry might not yet have been inserted into the global hash.

To avoid this, we introduce a timestamp and the owning cpu.
If the entry appears to be stale, evict only if:
 1. The current cpu is the one that added the entry, or,
 2. The timestamp is older than two jiffies

The second constraint allows GC to be taken over by other
cpu too (e.g. because a cpu was offlined or napi got moved to another
cpu).

We can't pretend the 'doubtful' entry wasn't in our list.
Instead, when we don't find an entry, indicate via IS_ERR
whether the entry was removed ('did not exist') or withheld
('might-be-unconfirmed').

This most likely also fixes an xt_connlimit imbalance earlier reported by
Dmitry Andrianov.

Cc: Dmitry Andrianov <dmitry.andrianov@alertme.com>
Reported-by: Justin Pettit <jpettit@vmware.com>
Reported-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conncount.c | 52 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 47 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index d8383609fe28..510039862aa9 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -47,6 +47,8 @@ struct nf_conncount_tuple {
 	struct hlist_node		node;
 	struct nf_conntrack_tuple	tuple;
 	struct nf_conntrack_zone	zone;
+	int				cpu;
+	u32				jiffies32;
 };
 
 struct nf_conncount_rb {
@@ -91,11 +93,42 @@ bool nf_conncount_add(struct hlist_head *head,
 		return false;
 	conn->tuple = *tuple;
 	conn->zone = *zone;
+	conn->cpu = raw_smp_processor_id();
+	conn->jiffies32 = (u32)jiffies;
 	hlist_add_head(&conn->node, head);
 	return true;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_add);
 
+static const struct nf_conntrack_tuple_hash *
+find_or_evict(struct net *net, struct nf_conncount_tuple *conn)
+{
+	const struct nf_conntrack_tuple_hash *found;
+	unsigned long a, b;
+	int cpu = raw_smp_processor_id();
+	__s32 age;
+
+	found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
+	if (found)
+		return found;
+	b = conn->jiffies32;
+	a = (u32)jiffies;
+
+	/* conn might have been added just before by another cpu and
+	 * might still be unconfirmed.  In this case, nf_conntrack_find()
+	 * returns no result.  Thus only evict if this cpu added the
+	 * stale entry or if the entry is older than two jiffies.
+	 */
+	age = a - b;
+	if (conn->cpu == cpu || age >= 2) {
+		hlist_del(&conn->node);
+		kmem_cache_free(conncount_conn_cachep, conn);
+		return ERR_PTR(-ENOENT);
+	}
+
+	return ERR_PTR(-EAGAIN);
+}
+
 unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 				 const struct nf_conntrack_tuple *tuple,
 				 const struct nf_conntrack_zone *zone,
@@ -103,18 +136,27 @@ unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 {
 	const struct nf_conntrack_tuple_hash *found;
 	struct nf_conncount_tuple *conn;
-	struct hlist_node *n;
 	struct nf_conn *found_ct;
+	struct hlist_node *n;
 	unsigned int length = 0;
 
 	*addit = tuple ? true : false;
 
 	/* check the saved connections */
 	hlist_for_each_entry_safe(conn, n, head, node) {
-		found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
-		if (found == NULL) {
-			hlist_del(&conn->node);
-			kmem_cache_free(conncount_conn_cachep, conn);
+		found = find_or_evict(net, conn);
+		if (IS_ERR(found)) {
+			/* Not found, but might be about to be confirmed */
+			if (PTR_ERR(found) == -EAGAIN) {
+				length++;
+				if (!tuple)
+					continue;
+
+				if (nf_ct_tuple_equal(&conn->tuple, tuple) &&
+				    nf_ct_zone_id(&conn->zone, conn->zone.dir) ==
+				    nf_ct_zone_id(zone, zone->dir))
+					*addit = false;
+			}
 			continue;
 		}
 
-- 
2.11.0
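
The eviction rule used by find_or_evict() in the patch above can be isolated into a small user-space sketch. The assumptions: `uint32_t` stands in for the 32-bit jiffies snapshot, the cpu ids are plain integers, and the unsigned-to-signed conversion behaves as on the usual two's-complement compilers. `stale_entry_decision` and the enum are hypothetical names, not part of nf_conncount.c.

```c
#include <stdint.h>

enum evict_decision {
	KEEP_ENTRY,	/* might still be unconfirmed: count it, do not free it */
	EVICT_ENTRY	/* safe to remove from the list */
};

/* Decide what to do with a list entry whose conntrack lookup failed. */
static enum evict_decision stale_entry_decision(int entry_cpu,
						uint32_t entry_stamp,
						int this_cpu, uint32_t now)
{
	/* Unsigned subtraction keeps the age correct across counter wrap. */
	int32_t age = (int32_t)(now - entry_stamp);

	if (entry_cpu == this_cpu || age >= 2)
		return EVICT_ENTRY;
	return KEEP_ENTRY;
}
```

Because the subtraction is done on unsigned 32-bit values, an entry stamped at 0xFFFFFFFF and examined when the counter reads 1 has age 2 and is evicted, just as it would be without the wrap.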

^ permalink raw reply related

* [PATCH 5/6] netfilter: nf_log: don't hold nf_log_mutex during user access
From: Pablo Neira Ayuso @ 2018-06-27 15:22 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180627152223.3633-1-pablo@netfilter.org>

From: Jann Horn <jannh@google.com>

The old code would indefinitely block other users of nf_log_mutex if
a userspace access in proc_dostring() blocked e.g. due to a userfaultfd
region. Fix it by moving proc_dostring() out of the locked region.

This is a followup to commit 266d07cb1c9a ("netfilter: nf_log: fix
sleeping function called from invalid context"), which changed this code
from using rcu_read_lock() to taking nf_log_mutex.

Fixes: 266d07cb1c9a ("netfilter: nf_log: fix sleeping function calle[...]")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_log.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index 2c47f9ec3511..a61d6df6e5f6 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -446,14 +446,17 @@ static int nf_log_proc_dostring(struct ctl_table *table, int write,
 		rcu_assign_pointer(net->nf.nf_loggers[tindex], logger);
 		mutex_unlock(&nf_log_mutex);
 	} else {
+		struct ctl_table tmp = *table;
+
+		tmp.data = buf;
 		mutex_lock(&nf_log_mutex);
 		logger = nft_log_dereference(net->nf.nf_loggers[tindex]);
 		if (!logger)
-			table->data = "NONE";
+			strlcpy(buf, "NONE", sizeof(buf));
 		else
-			table->data = logger->name;
-		r = proc_dostring(table, write, buffer, lenp, ppos);
+			strlcpy(buf, logger->name, sizeof(buf));
 		mutex_unlock(&nf_log_mutex);
+		r = proc_dostring(&tmp, write, buffer, lenp, ppos);
 	}
 
 	return r;
-- 
2.11.0

^ permalink raw reply related

* [PATCH 4/6] netfilter: nf_log: fix uninit read in nf_log_proc_dostring
From: Pablo Neira Ayuso @ 2018-06-27 15:22 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180627152223.3633-1-pablo@netfilter.org>

From: Jann Horn <jannh@google.com>

When proc_dostring() is called with a non-zero offset in strict mode, it
doesn't just write to the ->data buffer, it also reads. Make sure it
doesn't read uninitialized data.

Fixes: c6ac37d8d884 ("netfilter: nf_log: fix error on write NONE to [...]")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_log.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index 426457047578..2c47f9ec3511 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -424,6 +424,10 @@ static int nf_log_proc_dostring(struct ctl_table *table, int write,
 	if (write) {
 		struct ctl_table tmp = *table;
 
+		/* proc_dostring() can append to existing strings, so we need to
+		 * initialize it as an empty string.
+		 */
+		buf[0] = '\0';
 		tmp.data = buf;
 		r = proc_dostring(&tmp, write, buffer, lenp, ppos);
 		if (r)
-- 
2.11.0

^ permalink raw reply related

* [PATCH 3/6] netfilter: nf_ct_helper: Fix possible panic after nf_conntrack_helper_unregister
From: Pablo Neira Ayuso @ 2018-06-27 15:22 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180627152223.3633-1-pablo@netfilter.org>

From: Gao Feng <gfree.wind@vip.163.com>

The helper module is unloaded after nf_conntrack_helper_unregister()
returns, so a race can cause a panic.

nf_ct_iterate_destroy(unhelp, me) resets the conntrack helper to NULL,
but someone may already have fetched the helper pointer in the meantime.
Accessing the helper after the module has been unloaded then panics.

For example:
CPU0                                                   CPU1
ctnetlink_dump_helpinfo
helper = rcu_dereference(help->helper);
                                                       unhelp
                                                       set helper as NULL
                                                       unload helper module
helper->to_nlattr(skb, ct);

As shown above, CPU0 accesses the helper after its module has been
unloaded, and the panic happens.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conntrack_helper.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 551a1eddf0fa..a75b11c39312 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -465,6 +465,11 @@ void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me)
 
 	nf_ct_expect_iterate_destroy(expect_iter_me, NULL);
 	nf_ct_iterate_destroy(unhelp, me);
+
+	/* Maybe someone has gotten the helper already when unhelp above.
+	 * So need to wait it.
+	 */
+	synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_helper_unregister);
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH 2/6] netfilter: ipv6: nf_defrag: reduce struct net memory waste
From: Pablo Neira Ayuso @ 2018-06-27 15:22 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180627152223.3633-1-pablo@netfilter.org>

From: Eric Dumazet <edumazet@google.com>

It is a waste of memory to use a full "struct netns_sysctl_ipv6"
while only one pointer is really used, considering netns_sysctl_ipv6
keeps growing.

Also, since "struct netns_frags" has cache line alignment,
it is better to move the frags_hdr pointer outside, otherwise
we spend a full cache line for this pointer.

This saves 192 bytes of memory per netns.

Fixes: c038a767cd69 ("ipv6: add a new namespace for nf_conntrack_reasm")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/net_namespace.h             | 1 +
 include/net/netns/ipv6.h                | 1 -
 net/ipv6/netfilter/nf_conntrack_reasm.c | 6 +++---
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 47e35cce3b64..a71264d75d7f 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -128,6 +128,7 @@ struct net {
 #endif
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
 	struct netns_nf_frag	nf_frag;
+	struct ctl_table_header *nf_frag_frags_hdr;
 #endif
 	struct sock		*nfnl;
 	struct sock		*nfnl_stash;
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index c978a31b0f84..762ac9931b62 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -109,7 +109,6 @@ struct netns_ipv6 {
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
 struct netns_nf_frag {
-	struct netns_sysctl_ipv6 sysctl;
 	struct netns_frags	frags;
 };
 #endif
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 5e0332014c17..a452d99c9f52 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -107,7 +107,7 @@ static int nf_ct_frag6_sysctl_register(struct net *net)
 	if (hdr == NULL)
 		goto err_reg;
 
-	net->nf_frag.sysctl.frags_hdr = hdr;
+	net->nf_frag_frags_hdr = hdr;
 	return 0;
 
 err_reg:
@@ -121,8 +121,8 @@ static void __net_exit nf_ct_frags6_sysctl_unregister(struct net *net)
 {
 	struct ctl_table *table;
 
-	table = net->nf_frag.sysctl.frags_hdr->ctl_table_arg;
-	unregister_net_sysctl_table(net->nf_frag.sysctl.frags_hdr);
+	table = net->nf_frag_frags_hdr->ctl_table_arg;
+	unregister_net_sysctl_table(net->nf_frag_frags_hdr);
 	if (!net_eq(net, &init_net))
 		kfree(table);
 }
-- 
2.11.0
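
The cache-line point in the commit message above can be illustrated with C11 alignment. This is a sketch only: it assumes a 64-byte cache line, and the struct names are hypothetical stand-ins, not the kernel's netns_frags or netns_nf_frag. It does not attempt to reproduce the 192-byte figure, which also depends on the size of netns_sysctl_ipv6.

```c
#include <stddef.h>

#define CACHE_LINE 64	/* assumed cache line size */

/* Stand-in for a cache-line-aligned structure. */
struct aligned_frags {
	_Alignas(CACHE_LINE) char state[CACHE_LINE];
};

/* A pointer placed in front of the aligned member forces padding. */
struct ptr_inside {
	void *hdr;
	struct aligned_frags frags;
};

/* Same data with the pointer moved elsewhere. */
struct ptr_outside {
	struct aligned_frags frags;
};
```

Here `frags` in `struct ptr_inside` starts at offset 64, so the single pointer costs a whole cache line, which is the reason the patch moves the pointer out of the aligned structure.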

^ permalink raw reply related

* [PATCH 1/6] netfilter: nf_queue: augment nfqa_cfg_policy
From: Pablo Neira Ayuso @ 2018-06-27 15:22 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180627152223.3633-1-pablo@netfilter.org>

From: Eric Dumazet <edumazet@google.com>

Three attributes are currently not verified, thus can trigger KMSAN
warnings such as :

BUG: KMSAN: uninit-value in __arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
BUG: KMSAN: uninit-value in __fswab32 include/uapi/linux/swab.h:59 [inline]
BUG: KMSAN: uninit-value in nfqnl_recv_config+0x939/0x17d0 net/netfilter/nfnetlink_queue.c:1268
CPU: 1 PID: 4521 Comm: syz-executor120 Not tainted 4.17.0+ #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1117
 __msan_warning_32+0x70/0xc0 mm/kmsan/kmsan_instr.c:620
 __arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
 __fswab32 include/uapi/linux/swab.h:59 [inline]
 nfqnl_recv_config+0x939/0x17d0 net/netfilter/nfnetlink_queue.c:1268
 nfnetlink_rcv_msg+0xb2e/0xc80 net/netfilter/nfnetlink.c:212
 netlink_rcv_skb+0x37e/0x600 net/netlink/af_netlink.c:2448
 nfnetlink_rcv+0x2fe/0x680 net/netfilter/nfnetlink.c:513
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x1680/0x1750 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg net/socket.c:639 [inline]
 ___sys_sendmsg+0xec8/0x1320 net/socket.c:2117
 __sys_sendmsg net/socket.c:2155 [inline]
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
 do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x43fd59
RSP: 002b:00007ffde0e30d28 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fd59
RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401680
R13: 0000000000401710 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:189
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:315
 kmsan_slab_alloc+0x10/0x20 mm/kmsan/kmsan.c:322
 slab_post_alloc_hook mm/slab.h:446 [inline]
 slab_alloc_node mm/slub.c:2753 [inline]
 __kmalloc_node_track_caller+0xb35/0x11b0 mm/slub.c:4395
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cb/0x9e0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:988 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
 netlink_sendmsg+0x76e/0x1350 net/netlink/af_netlink.c:1876
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg net/socket.c:639 [inline]
 ___sys_sendmsg+0xec8/0x1320 net/socket.c:2117
 __sys_sendmsg net/socket.c:2155 [inline]
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
 do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: fdb694a01f1f ("netfilter: Add fail-open support")
Fixes: 829e17a1a602 ("[NETFILTER]: nfnetlink_queue: allow changing queue length through netlink")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nfnetlink_queue.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 4ccd2988f9db..ea4ba551abb2 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -1243,6 +1243,9 @@ static int nfqnl_recv_unsupp(struct net *net, struct sock *ctnl,
 static const struct nla_policy nfqa_cfg_policy[NFQA_CFG_MAX+1] = {
 	[NFQA_CFG_CMD]		= { .len = sizeof(struct nfqnl_msg_config_cmd) },
 	[NFQA_CFG_PARAMS]	= { .len = sizeof(struct nfqnl_msg_config_params) },
+	[NFQA_CFG_QUEUE_MAXLEN]	= { .type = NLA_U32 },
+	[NFQA_CFG_MASK]		= { .type = NLA_U32 },
+	[NFQA_CFG_FLAGS]	= { .type = NLA_U32 },
 };
 
 static const struct nf_queue_handler nfqh = {
-- 
2.11.0
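
The kind of check that a policy entry enables can be sketched in user space. Everything here is hypothetical: `attr_policy`, `ATTR_U32`, and `attr_ok` are illustration names and do not reproduce the kernel's nla_policy machinery. The sketch assumes that an attribute with no policy entry is accepted at any length, and that a 32-bit attribute must carry at least four bytes of payload.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { ATTR_UNSPEC, ATTR_U32 };	/* hypothetical type tags */

struct attr_policy {
	int type;	/* ATTR_UNSPEC means "no check" */
};

/* Return true if a payload of payload_len bytes is acceptable. */
static bool attr_ok(const struct attr_policy *p, size_t payload_len)
{
	if (p->type == ATTR_U32)
		return payload_len >= sizeof(uint32_t);
	return true;	/* unlisted attributes pass unchecked */
}
```

Without an entry, a two-byte attribute would pass and a later 32-bit read would touch bytes the sender never wrote, which is the class of uninit-value report shown in the trace above.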

^ permalink raw reply related

* [PATCH 0/6] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2018-06-27 15:22 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

The following patchset contains Netfilter fixes for your net tree:

1) Missing netlink attribute validation in nf_queue, uncovered by KMSAN,
   from Eric Dumazet.

2) Use a pointer to the sysctl table, saving 192 bytes of memory per netns.
   Also from Eric.

3) Possible use-after-free when removing conntrack helper modules due
   to a missing synchronize_rcu() call. From Gao Feng.

4) Fix corner case in sysctl writes to nf_log that leads to appending
   data to an uninitialized buffer, from Jann Horn.

5) Jann Horn says we may indefinitely block other users of nf_log_mutex
   if a userspace access in proc_dostring() blocked e.g. due to a
   userfaultfd.

6) Fix garbage collection race for unconfirmed conntrack entries,
   from Florian Westphal.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Thanks.

----------------------------------------------------------------

The following changes since commit 7e85dc8cb35abf16455f1511f0670b57c1a84608:

  net_sched: blackhole: tell upper qdisc about dropped packets (2018-06-17 08:42:33 +0900)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD

for you to fetch changes up to b36e4523d4d56e2595e28f16f6ccf1cd6a9fc452:

  netfilter: nf_conncount: fix garbage collection confirm race (2018-06-26 18:28:57 +0200)

----------------------------------------------------------------
Eric Dumazet (2):
      netfilter: nf_queue: augment nfqa_cfg_policy
      netfilter: ipv6: nf_defrag: reduce struct net memory waste

Florian Westphal (1):
      netfilter: nf_conncount: fix garbage collection confirm race

Gao Feng (1):
      netfilter: nf_ct_helper: Fix possible panic after nf_conntrack_helper_unregister

Jann Horn (2):
      netfilter: nf_log: fix uninit read in nf_log_proc_dostring
      netfilter: nf_log: don't hold nf_log_mutex during user access

 include/net/net_namespace.h             |  1 +
 include/net/netns/ipv6.h                |  1 -
 net/ipv6/netfilter/nf_conntrack_reasm.c |  6 ++--
 net/netfilter/nf_conncount.c            | 52 +++++++++++++++++++++++++++++----
 net/netfilter/nf_conntrack_helper.c     |  5 ++++
 net/netfilter/nf_log.c                  | 13 +++++++--
 net/netfilter/nfnetlink_queue.c         |  3 ++
 7 files changed, 69 insertions(+), 12 deletions(-)


* [PATCH] test_bpf: flag tests that cannot be jited on s390
From: Kleber Sacilotto de Souza @ 2018-06-27 15:19 UTC (permalink / raw)
  To: linux-s390, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann

Flag with FLAG_EXPECTED_FAIL the BPF_MAXINSNS tests that cannot be jited
on s390 because they exceed BPF_SIZE_MAX and fail when
CONFIG_BPF_JIT_ALWAYS_ON is set. Also set .expected_errcode to -ENOTSUPP
so the tests pass in that case.

Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
---
 lib/test_bpf.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 60aedc879361..08d3d59dca17 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -5282,21 +5282,31 @@ static struct bpf_test tests[] = {
 	{	/* Mainly checking JIT here. */
 		"BPF_MAXINSNS: Ctx heavy transformations",
 		{ },
+#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
+		CLASSIC | FLAG_EXPECTED_FAIL,
+#else
 		CLASSIC,
+#endif
 		{ },
 		{
 			{  1, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) },
 			{ 10, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) }
 		},
 		.fill_helper = bpf_fill_maxinsns6,
+		.expected_errcode = -ENOTSUPP,
 	},
 	{	/* Mainly checking JIT here. */
 		"BPF_MAXINSNS: Call heavy transformations",
 		{ },
+#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
+		CLASSIC | FLAG_NO_DATA | FLAG_EXPECTED_FAIL,
+#else
 		CLASSIC | FLAG_NO_DATA,
+#endif
 		{ },
 		{ { 1, 0 }, { 10, 0 } },
 		.fill_helper = bpf_fill_maxinsns7,
+		.expected_errcode = -ENOTSUPP,
 	},
 	{	/* Mainly checking JIT here. */
 		"BPF_MAXINSNS: Jump heavy test",
@@ -5347,18 +5357,28 @@ static struct bpf_test tests[] = {
 	{
 		"BPF_MAXINSNS: exec all MSH",
 		{ },
+#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
+		CLASSIC | FLAG_EXPECTED_FAIL,
+#else
 		CLASSIC,
+#endif
 		{ 0xfa, 0xfb, 0xfc, 0xfd, },
 		{ { 4, 0xababab83 } },
 		.fill_helper = bpf_fill_maxinsns13,
+		.expected_errcode = -ENOTSUPP,
 	},
 	{
 		"BPF_MAXINSNS: ld_abs+get_processor_id",
 		{ },
+#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
+		CLASSIC | FLAG_EXPECTED_FAIL,
+#else
 		CLASSIC,
+#endif
 		{ },
 		{ { 1, 0xbee } },
 		.fill_helper = bpf_fill_ld_abs_get_processor_id,
+		.expected_errcode = -ENOTSUPP,
 	},
 	/*
 	 * LD_IND / LD_ABS on fragmented SKBs
-- 
2.17.1
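
For readers unfamiliar with the flagging scheme in the patch above, here
is a rough sketch of how an "expected failure" flag and an expected error
code might be evaluated by a test harness. It is not the code in
lib/test_bpf.c; struct demo_test, check_setup() and the DEMO_* constants
are hypothetical. ENOTSUPP is a kernel-internal errno (524) that
userspace <errno.h> does not provide, so the sketch defines its own copy.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_ENOTSUPP		524	/* kernel-internal ENOTSUPP value */
#define DEMO_FLAG_EXPECTED_FAIL	(1u << 0)

struct demo_test {
	unsigned int flags;
	int expected_errcode;	/* only consulted when failure is expected */
};

/*
 * setup_err is the result of preparing (e.g. JITing) the program:
 * 0 on success or a negative errno. Returns true if the test passes.
 */
bool check_setup(const struct demo_test *t, int setup_err)
{
	if (t->flags & DEMO_FLAG_EXPECTED_FAIL)
		return setup_err == t->expected_errcode;
	return setup_err == 0;
}

void example_expected_fail(void)
{
	struct demo_test flagged = { DEMO_FLAG_EXPECTED_FAIL, -DEMO_ENOTSUPP };
	struct demo_test plain = { 0, -DEMO_ENOTSUPP };

	assert(check_setup(&flagged, -DEMO_ENOTSUPP));	/* failed as expected */
	assert(!check_setup(&flagged, 0));		/* unexpected success */
	assert(check_setup(&plain, 0));			/* errcode is ignored */
}
```

In this sketch the error code can be set unconditionally, as the patch
does, because it only matters when the expected-fail flag is present.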


* Re: [PATCH net-next] tcp: replace LINUX_MIB_TCPOFODROP with LINUX_MIB_TCPRMEMFULLDROP for drops due to receive buffer full
From: Yafang Shao @ 2018-06-27 15:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, David Miller, netdev, LKML
In-Reply-To: <c89b70f9-e70d-35a5-b9a7-7e3ed7d34edf@gmail.com>

On Wed, Jun 27, 2018 at 10:48 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 06/27/2018 04:50 AM, Yafang Shao wrote:
>> When sk_rmem_alloc is larger than the receive buffer and we can't
>> schedule more memory for it, the skb will be dropped.
>>
>> In above situation, if this skb is put into the ofo queue,
>> LINUX_MIB_TCPOFODROP is incremented to track it,
>> while if this skb is put into the receive queue, there's no record.
>>
>> So LINUX_MIB_TCPOFODROP is replaced with LINUX_MIB_TCPRMEMFULLDROP to track
>> this behavior.
>
>
> Hi Yafang
>
> I do not want to remove TCPOFODrop and mix multiple causes in one single counter.
>
> Please take a look at commit a6df1ae9383697c to have the reasoning.
>

Got it!

What about introducing a new counter, e.g. TCPRcvQFullDrop?

Thanks
Yafang


* Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: Eric Dumazet @ 2018-06-27 15:06 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev; +Cc: Kernel Team, Blake Matheny, Alexei Starovoitov
In-Reply-To: <d07bfdf1-2060-bdaf-8df0-7e33a8f017b7@gmail.com>



On 06/27/2018 08:04 AM, Eric Dumazet wrote:
> 
> 
> On 06/26/2018 07:34 PM, Lawrence Brakmo wrote:
>> When using dctcp and doing RPCs, if the last packet of a request is
>> ECN marked as having seen congestion (CE), the sender can decrease its
>> cwnd to 1. As a result, it will only send one packet when a new request
>> is sent. In some instances this results in high tail latencies.
>>
> 
>>  	}
>>  	/* Force a fast retransmit upon entering fast recovery */
>>  	sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
>> -	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
>> +	tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
> 
> Canonical way is to use min_t(), please respin (no need to explain this trivia in changelog)


Well, max_t() here, obviously :)
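
For illustration only, this is roughly what the clamp could look like
with max_t(). It is a userspace sketch: the macro below is a simplified
stand-in for the kernel's max_t() (it evaluates its arguments more than
once, which the kernel macro avoids), and clamp_cwnd() is a hypothetical
helper, not a function in the TCP stack.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's max_t(): compare after casting. */
#define max_t(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/* Hypothetical helper: packets in flight plus sndcnt, never below 2. */
unsigned int clamp_cwnd(unsigned int in_flight, int sndcnt)
{
	return max_t(int, in_flight + sndcnt, 2);
}

void example_clamp(void)
{
	assert(clamp_cwnd(0, 1) == 2);	/* would have been 1 without the clamp */
	assert(clamp_cwnd(10, 1) == 11);	/* larger values pass through */
}
```

Naming the comparison type in the macro removes the need for the explicit
(int) cast on one operand.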


* Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: Eric Dumazet @ 2018-06-27 15:04 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev; +Cc: Kernel Team, Blake Matheny, Alexei Starovoitov
In-Reply-To: <20180627023403.3395818-1-brakmo@fb.com>



On 06/26/2018 07:34 PM, Lawrence Brakmo wrote:
> When using dctcp and doing RPCs, if the last packet of a request is
> ECN marked as having seen congestion (CE), the sender can decrease its
> cwnd to 1. As a result, it will only send one packet when a new request
> is sent. In some instances this results in high tail latencies.
> 

>  	}
>  	/* Force a fast retransmit upon entering fast recovery */
>  	sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
> -	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
> +	tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);

Canonical way is to use min_t(), please respin (no need to explain this trivia in changelog)

Thanks.


* Re: [PATCH nf-next v2] openvswitch: use nf_ct_get_tuplepr, invert_tuplepr
From: Pablo Neira Ayuso @ 2018-06-27 15:03 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: ovs dev, netfilter-devel-u79uwXL29TY76Z2rM5mHXA, Florian Westphal,
	Linux Kernel Network Developers
In-Reply-To: <CAOrHB_A_S_+f_weEW-PhNkhKTLXM3NKvAXh82XRTgntMSyNAQA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Jun 25, 2018 at 08:46:45PM -0700, Pravin Shelar wrote:
> On Mon, Jun 25, 2018 at 8:55 AM, Florian Westphal <fw-HFFVJYpyMKqzQB+pC5nmwQ@public.gmane.org> wrote:
> > These versions deal with the l3proto/l4proto details internally.
> > It removes only caller of nf_ct_get_tuple, so make it static.
> >
> > After this, l3proto->get_l4proto() can be removed in a followup patch.
> >
> > Signed-off-by: Florian Westphal <fw-HFFVJYpyMKqzQB+pC5nmwQ@public.gmane.org>
> Acked-by: Pravin B Shelar <pshelar-LZ6Gd1LRuIk@public.gmane.org>

Applied, thanks.


* [PATCH net-next v2] net: stmmac: Add support for CBS QDISC
From: Jose Abreu @ 2018-06-27 14:57 UTC (permalink / raw)
  To: netdev
  Cc: Jose Abreu, David S. Miller, Joao Pinto, Vitor Soares,
	Giuseppe Cavallaro, Alexandre Torgue

This adds support for CBS reconfiguration using the TC application.

A new callback was added to TC ops struct and another one to DMA ops to
reconfigure the channel mode.
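
As a rough picture of what reconfiguring the channel mode amounts to, the
sketch below performs the same kind of read-modify-write on a plain
integer instead of a device register. The DEMO_* bit values and
demo_apply_qmode() are illustrative assumptions only; the real field
layout is defined in the driver's headers.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative field layout, assumed for this example only. */
#define DEMO_TXQEN_MASK	0x0000000cu	/* two-bit queue-enable field */
#define DEMO_TXQEN_AV	0x00000004u	/* field value: AV mode */
#define DEMO_TXQEN	0x00000008u	/* field value: generic mode */

enum demo_qmode { DEMO_QUEUE_DCB, DEMO_QUEUE_AVB };

/*
 * Clear the queue-enable field, then set the encoding for the requested
 * mode. All bits outside the field are left untouched.
 */
uint32_t demo_apply_qmode(uint32_t reg, enum demo_qmode mode)
{
	reg &= ~DEMO_TXQEN_MASK;
	reg |= (mode == DEMO_QUEUE_AVB) ? DEMO_TXQEN_AV : DEMO_TXQEN;
	return reg;
}

void example_qmode(void)
{
	assert(demo_apply_qmode(0xffffffffu, DEMO_QUEUE_AVB) == 0xfffffff7u);
	assert(demo_apply_qmode(0xffffffffu, DEMO_QUEUE_DCB) == 0xfffffffbu);
}
```

Clearing the whole field before setting the new value keeps a stale bit
from the previous mode from surviving the switch.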

Tested in GMAC5.10.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Vitor Soares <soares@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
Changes from v1:
	- Fixup kbuild warning
---
 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c  | 15 ++++++
 drivers/net/ethernet/stmicro/stmmac/hwif.h        |  8 +++
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |  2 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c   | 62 +++++++++++++++++++++++
 4 files changed, 87 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
index d37f17ca62fe..6e32f8a3710b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
@@ -407,6 +407,19 @@ static void dwmac4_enable_tso(void __iomem *ioaddr, bool en, u32 chan)
 	}
 }
 
+static void dwmac4_qmode(void __iomem *ioaddr, u32 channel, u8 qmode)
+{
+	u32 mtl_tx_op = readl(ioaddr + MTL_CHAN_TX_OP_MODE(channel));
+
+	mtl_tx_op &= ~MTL_OP_MODE_TXQEN_MASK;
+	if (qmode != MTL_QUEUE_AVB)
+		mtl_tx_op |= MTL_OP_MODE_TXQEN;
+	else
+		mtl_tx_op |= MTL_OP_MODE_TXQEN_AV;
+
+	writel(mtl_tx_op, ioaddr +  MTL_CHAN_TX_OP_MODE(channel));
+}
+
 const struct stmmac_dma_ops dwmac4_dma_ops = {
 	.reset = dwmac4_dma_reset,
 	.init = dwmac4_dma_init,
@@ -431,6 +444,7 @@ const struct stmmac_dma_ops dwmac4_dma_ops = {
 	.set_rx_tail_ptr = dwmac4_set_rx_tail_ptr,
 	.set_tx_tail_ptr = dwmac4_set_tx_tail_ptr,
 	.enable_tso = dwmac4_enable_tso,
+	.qmode = dwmac4_qmode,
 };
 
 const struct stmmac_dma_ops dwmac410_dma_ops = {
@@ -457,4 +471,5 @@ const struct stmmac_dma_ops dwmac410_dma_ops = {
 	.set_rx_tail_ptr = dwmac4_set_rx_tail_ptr,
 	.set_tx_tail_ptr = dwmac4_set_tx_tail_ptr,
 	.enable_tso = dwmac4_enable_tso,
+	.qmode = dwmac4_qmode,
 };
diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.h b/drivers/net/ethernet/stmicro/stmmac/hwif.h
index e44e7b26ce82..e2a965790648 100644
--- a/drivers/net/ethernet/stmicro/stmmac/hwif.h
+++ b/drivers/net/ethernet/stmicro/stmmac/hwif.h
@@ -183,6 +183,7 @@ struct stmmac_dma_ops {
 	void (*set_rx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan);
 	void (*set_tx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan);
 	void (*enable_tso)(void __iomem *ioaddr, bool en, u32 chan);
+	void (*qmode)(void __iomem *ioaddr, u32 channel, u8 qmode);
 };
 
 #define stmmac_reset(__priv, __args...) \
@@ -235,6 +236,8 @@ struct stmmac_dma_ops {
 	stmmac_do_void_callback(__priv, dma, set_tx_tail_ptr, __args)
 #define stmmac_enable_tso(__priv, __args...) \
 	stmmac_do_void_callback(__priv, dma, enable_tso, __args)
+#define stmmac_dma_qmode(__priv, __args...) \
+	stmmac_do_void_callback(__priv, dma, qmode, __args)
 
 struct mac_device_info;
 struct net_device;
@@ -441,17 +444,22 @@ struct stmmac_mode_ops {
 
 struct stmmac_priv;
 struct tc_cls_u32_offload;
+struct tc_cbs_qopt_offload;
 
 struct stmmac_tc_ops {
 	int (*init)(struct stmmac_priv *priv);
 	int (*setup_cls_u32)(struct stmmac_priv *priv,
 			     struct tc_cls_u32_offload *cls);
+	int (*setup_cbs)(struct stmmac_priv *priv,
+			 struct tc_cbs_qopt_offload *qopt);
 };
 
 #define stmmac_tc_init(__priv, __args...) \
 	stmmac_do_callback(__priv, tc, init, __args)
 #define stmmac_tc_setup_cls_u32(__priv, __args...) \
 	stmmac_do_callback(__priv, tc, setup_cls_u32, __args)
+#define stmmac_tc_setup_cbs(__priv, __args...) \
+	stmmac_do_callback(__priv, tc, setup_cbs, __args)
 
 struct stmmac_regs_off {
 	u32 ptp_off;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 2354e30caa78..93a3bea8576e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3793,6 +3793,8 @@ static int stmmac_setup_tc(struct net_device *ndev, enum tc_setup_type type,
 	switch (type) {
 	case TC_SETUP_BLOCK:
 		return stmmac_setup_tc_block(priv, type_data);
+	case TC_SETUP_QDISC_CBS:
+		return stmmac_tc_setup_cbs(priv, priv, type_data);
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
index 2258cd8cc844..0b0fca0200b2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
@@ -289,7 +289,69 @@ static int tc_init(struct stmmac_priv *priv)
 	return 0;
 }
 
+static int tc_setup_cbs(struct stmmac_priv *priv,
+			struct tc_cbs_qopt_offload *qopt)
+{
+	u32 tx_queues_count = priv->plat->tx_queues_to_use;
+	u32 queue = qopt->queue;
+	u32 ptr, speed_div;
+	u32 mode_to_use;
+	u64 value;
+	int ret;
+
+	/* Queue 0 is not AVB capable */
+	if (queue <= 0 || queue >= tx_queues_count)
+		return -EINVAL;
+	if (priv->speed != SPEED_100 && priv->speed != SPEED_1000)
+		return -EOPNOTSUPP;
+
+	mode_to_use = priv->plat->tx_queues_cfg[queue].mode_to_use;
+	if (mode_to_use == MTL_QUEUE_DCB && qopt->enable) {
+		ret = stmmac_dma_qmode(priv, priv->ioaddr, queue, MTL_QUEUE_AVB);
+		if (ret)
+			return ret;
+
+		priv->plat->tx_queues_cfg[queue].mode_to_use = MTL_QUEUE_AVB;
+	} else if (!qopt->enable) {
+		return stmmac_dma_qmode(priv, priv->ioaddr, queue, MTL_QUEUE_DCB);
+	}
+
+	/* Port Transmit Rate and Speed Divider */
+	ptr = (priv->speed == SPEED_100) ? 4 : 8;
+	speed_div = (priv->speed == SPEED_100) ? 100000 : 1000000;
+
+	/* Final adjustments for HW */
+	value = qopt->idleslope * 1024 * ptr;
+	do_div(value, speed_div);
+	priv->plat->tx_queues_cfg[queue].idle_slope = value & GENMASK(31, 0);
+
+	value = -qopt->sendslope * 1024UL * ptr;
+	do_div(value, speed_div);
+	priv->plat->tx_queues_cfg[queue].send_slope = value & GENMASK(31, 0);
+
+	value = qopt->hicredit * 1024 * 8;
+	priv->plat->tx_queues_cfg[queue].high_credit = value & GENMASK(31, 0);
+
+	value = qopt->locredit * 1024 * 8;
+	priv->plat->tx_queues_cfg[queue].low_credit = value & GENMASK(31, 0);
+
+	ret = stmmac_config_cbs(priv, priv->hw,
+				priv->plat->tx_queues_cfg[queue].send_slope,
+				priv->plat->tx_queues_cfg[queue].idle_slope,
+				priv->plat->tx_queues_cfg[queue].high_credit,
+				priv->plat->tx_queues_cfg[queue].low_credit,
+				queue);
+	if (ret)
+		return ret;
+
+	dev_info(priv->device, "CBS queue %d: send %d, idle %d, hi %d, lo %d\n",
+			queue, qopt->sendslope, qopt->idleslope,
+			qopt->hicredit, qopt->locredit);
+	return 0;
+}
+
 const struct stmmac_tc_ops dwmac510_tc_ops = {
 	.init = tc_init,
 	.setup_cls_u32 = tc_setup_cls_u32,
+	.setup_cbs = tc_setup_cbs,
 };
-- 
2.7.4
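
The slope scaling in tc_setup_cbs() above can be tried out in userspace
with the sketch below. It mirrors the arithmetic (slope * 1024 * port
transmit rate / speed divider) but is not driver code: struct
demo_cbs_regs and demo_cbs_scale() are hypothetical names, slopes are
assumed to be in kbit/s as tc passes them, and the sketch deliberately
uses 64-bit intermediates because the product can exceed 32 bits for
large slopes.

```c
#include <assert.h>
#include <stdint.h>

struct demo_cbs_regs {
	uint32_t idle_slope;
	uint32_t send_slope;
};

/*
 * speed_mbps must be 100 or 1000, the two speeds the patch accepts.
 * sendslope is negative, so it is negated before scaling.
 * Returns 0 on success, -1 for an unsupported speed.
 */
int demo_cbs_scale(int speed_mbps, int32_t idleslope, int32_t sendslope,
		   struct demo_cbs_regs *out)
{
	int64_t ptr, speed_div;

	if (speed_mbps == 100) {
		ptr = 4;
		speed_div = 100000;
	} else if (speed_mbps == 1000) {
		ptr = 8;
		speed_div = 1000000;
	} else {
		return -1;
	}

	out->idle_slope = (uint32_t)((int64_t)idleslope * 1024 * ptr / speed_div);
	out->send_slope = (uint32_t)(-(int64_t)sendslope * 1024 * ptr / speed_div);
	return 0;
}

void example_cbs_scale(void)
{
	struct demo_cbs_regs r;

	/* 20 Mbit/s reserved on a 1 Gbit/s link: sendslope = 20000 - 1000000 */
	assert(demo_cbs_scale(1000, 20000, -980000, &r) == 0);
	assert(r.idle_slope == 163);
	assert(r.send_slope == 8028);
}
```

The division truncates, so 163.84 becomes 163 and 8028.16 becomes 8028 in
the example.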
