Netdev List


* [PATCH net-next] bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller
From: Daniel Borkmann @ 2016-11-30 21:16 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, saeedm, kubakici, Yuval.Mintz, netdev,
	Daniel Borkmann

After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
need to hold rcu_read_lock() already to make sure BPF program doesn't
get released in the background.

Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
in XDP supported drivers and to keep the typecheck on the context intact.
For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
out of the helper. When the driver gets atomic replace support, this will
move to call-sites eventually.

mlx5 needs actual fixing as it has the same issue as described already in
326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
that is, we're under RCU bh at this time, BPF programs are released via
call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
reset.
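
The calling convention described above can be sketched in userspace. This
is only a mock under stated assumptions: a nesting counter stands in for
rcu_read_lock()/rcu_read_unlock(), and every mock_* name and the
MOCK_PASS value are hypothetical, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins, NOT the kernel API: a counter replaces RCU. */
int mock_rcu_depth;
void mock_rcu_read_lock(void)   { mock_rcu_depth++; }
void mock_rcu_read_unlock(void) { mock_rcu_depth--; }

#define MOCK_PASS 2 /* arbitrary verdict value used by this sketch */

struct mock_prog { int (*run)(void *ctx); };

int mock_prog_pass(void *ctx) { (void)ctx; return MOCK_PASS; }

/* Like the patched helper: takes no lock itself, the caller must hold it.
 * The -1 return is a mock-only diagnostic; a real kernel would not detect
 * the missing lock and would risk a use-after-free instead. */
int mock_run_xdp(const struct mock_prog *prog, void *ctx)
{
	if (mock_rcu_depth == 0)
		return -1;
	return prog->run(ctx);
}

/* Driver-side pattern: fetch the program pointer and run it inside one
 * read-side critical section. */
int mock_rx_one(struct mock_prog **slot, void *ctx)
{
	const struct mock_prog *prog;
	int verdict;

	mock_rcu_read_lock();
	prog = *slot; /* READ_ONCE() in a real driver */
	verdict = prog ? mock_run_xdp(prog, ctx) : MOCK_PASS;
	mock_rcu_read_unlock();
	return verdict;
}
```

The point of the sketch is that the fetch of the program pointer and the
run share a single read-side section, which is why a second lock inside
the helper adds nothing.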

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 ( Also here net-next is just fine, imho. )

 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c     |  8 ++++++--
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_main.c        |  7 +++++++
 include/linux/filter.h                              | 18 +++++++++---------
 4 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b036710..42cd687 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -737,10 +737,10 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			     u16 wqe_counter, u32 cqe_bcnt)
 {
-	struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
 	struct mlx5e_dma_info *di;
 	struct sk_buff *skb;
 	void *va, *data;
+	bool consumed;
 
 	di             = &rq->dma_info[wqe_counter];
 	va             = page_address(di->page);
@@ -759,7 +759,11 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 		return NULL;
 	}
 
-	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
+	rcu_read_lock();
+	consumed = mlx5e_xdp_handle(rq, READ_ONCE(rq->xdp_prog), di, data,
+				    cqe_bcnt);
+	rcu_read_unlock();
+	if (consumed)
 		return NULL; /* page/packet was consumed by XDP */
 
 	skb = build_skb(va, RQ_PAGE_SIZE(rq));
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 876ab3a..00d9a03 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1518,7 +1518,7 @@ static int nfp_net_run_xdp(struct bpf_prog *prog, void *data, unsigned int len)
 	xdp.data = data;
 	xdp.data_end = data + len;
 
-	return BPF_PROG_RUN(prog, &xdp);
+	return bpf_prog_run_xdp(prog, &xdp);
 }
 
 /**
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 172ff6d..faeaa9f 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -1497,7 +1497,14 @@ static bool qede_rx_xdp(struct qede_dev *edev,
 
 	xdp.data = page_address(bd->data) + cqe->placement_offset;
 	xdp.data_end = xdp.data + len;
+
+	/* Queues always have a full reset currently, so for the time
+	 * being until there's atomic program replace just mark read
+	 * side for map helpers.
+	 */
+	rcu_read_lock();
 	act = bpf_prog_run_xdp(prog, &xdp);
+	rcu_read_unlock();
 
 	if (act == XDP_PASS)
 		return true;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7f246a2..45bd83e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -498,16 +498,16 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 	return BPF_PROG_RUN(prog, skb);
 }
 
-static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
-				   struct xdp_buff *xdp)
+static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
+					    struct xdp_buff *xdp)
 {
-	u32 ret;
-
-	rcu_read_lock();
-	ret = BPF_PROG_RUN(prog, xdp);
-	rcu_read_unlock();
-
-	return ret;
+	/* Caller needs to hold rcu_read_lock() (!), otherwise program
+	 * can be released while still running, or map elements could be
+	 * freed early while still having concurrent users. XDP fastpath
+	 * already takes rcu_read_lock() when fetching the program, so
+	 * it's not necessary here anymore.
+	 */
+	return BPF_PROG_RUN(prog, xdp);
 }
 
 static inline unsigned int bpf_prog_size(unsigned int proglen)
-- 
1.9.3

^ permalink raw reply related

* Re: [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: Jeff Kirsher @ 2016-11-30 21:15 UTC (permalink / raw)
  To: David Miller, alexander.h.duyck; +Cc: netdev, intel-wired-lan, sfr
In-Reply-To: <20161130.094746.724735454244491985.davem@davemloft.net>

On Wed, 2016-11-30 at 09:47 -0500, David Miller wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Mon, 28 Nov 2016 10:42:18 -0500
> 
> > When I implemented the GSO partial support in the Intel drivers I was
> > using lco_csum to compute the checksum that we needed to plug into the
> > IPv4 checksum field in order to cancel out the data that was not a part
> > of the IPv4 header.  However this didn't take into account that the
> > transport offset might be pointing to the inner transport header.
> > 
> > Instead of using lco_csum I have just coded around it so that we can
> > use the outer IP header plus the IP header length to determine where we
> > need to start our checksum and then just call csum_partial ourselves.
> > 
> > This should fix the SIT issue reported on igb interfaces as well as
> > similar issues that would pop up on other Intel NICs.
> 
> Jeff, are you going to send me a pull request with this stuff or would
> you be OK with my applying these directly to 'net'?

Go ahead and apply those to your net tree, I do not want to hold this up.
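
For readers following the quoted description, the "outer IP header plus
the IP header length" approach might look roughly like the userspace
sketch below. It is an illustration only: the demo_* names are
hypothetical, the checksum start is passed in as a plain offset, and the
loop is a simple stand-in for csum_partial() plus folding, not the driver
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum of big-endian 16-bit words over [start, end),
 * folded to 16 bits and inverted. */
uint16_t demo_csum_fold_range(const uint8_t *start, const uint8_t *end)
{
	uint32_t sum = 0;
	const uint8_t *p;

	for (p = start; p + 1 < end; p += 2)
		sum += (uint32_t)(p[0] << 8 | p[1]);
	if (p < end) /* odd trailing byte is padded with zero */
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Start summing right after the outer IPv4 header, located via its IHL
 * field, and stop at the caller-supplied checksum start offset. */
uint16_t demo_outer_csum(const uint8_t *outer_ip, size_t csum_start_off)
{
	size_t ihl_bytes = (size_t)(outer_ip[0] & 0x0f) * 4;

	return demo_csum_fold_range(outer_ip + ihl_bytes,
				    outer_ip + csum_start_off);
}
```

Because the start is derived from the outer header itself, the result
does not depend on whether the transport offset points at the inner or
the outer transport header.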

^ permalink raw reply

* Re: [PATCH] netns: avoid disabling irq for netns id
From: David Miller @ 2016-11-30 21:12 UTC (permalink / raw)
  To: pmoore; +Cc: netdev, linux-audit, xiyou.wangcong
In-Reply-To: <CAGH-Kgv0UpmDdaW=z8pa1VvmrcJeaA57uMneqNEgex6Xa8NSQw@mail.gmail.com>

From: Paul Moore <pmoore@redhat.com>
Date: Wed, 30 Nov 2016 15:35:46 -0500

> On Wed, Nov 30, 2016 at 2:58 PM, David Miller <davem@davemloft.net> wrote:
>> From: Paul Moore <pmoore@redhat.com>
>> Date: Tue, 29 Nov 2016 17:11:29 -0500
>>
>>> From: Paul Moore <paul@paul-moore.com>
>>>
>>> Bring back commit bc51dddf98c9 ("netns: avoid disabling irq for netns
>>> id") now that we've fixed some audit multicast issues that caused
>>> problems with original attempt.  Additional information, and history,
>>> can be found in the links below:
>>>
>>>  * https://github.com/linux-audit/audit-kernel/issues/22
>>>  * https://github.com/linux-audit/audit-kernel/issues/23
>>>
>>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>>> Signed-off-by: Paul Moore <paul@paul-moore.com>
>>
>> This doesn't apply cleanly to the net-next tree, could you please
>> respin?
> 
> As I mentioned in a reply to the patch posting, because this relies on
> a number of patches in the audit tree I've gone ahead and merged this
> patch into the audit#next branch.  Unless you have any objections,
> I'll send this to Linus with the rest of the v4.10 audit patches.

That's fine with me.

^ permalink raw reply

* [PATCH net-next] net/mlx5e: skip loopback selftest with !CONFIG_INET
From: Arnd Bergmann @ 2016-11-30 21:05 UTC (permalink / raw)
  To: Saeed Mahameed, Matan Barak, Leon Romanovsky
  Cc: Arnd Bergmann, David S. Miller, Kamal Heib, netdev, linux-rdma,
	linux-kernel

When CONFIG_INET is disabled, the new selftest results in a link
error:

drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: In function `mlx5e_test_loopback':
en_selftest.c:(.text.mlx5e_test_loopback+0x2ec): undefined reference to `ip_send_check'
en_selftest.c:(.text.mlx5e_test_loopback+0x34c): undefined reference to `udp4_hwcsum'

This hides the specific test in that configuration.
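
The pattern used here, keeping an enum, a name table and a function table
in step under one preprocessor guard, can be sketched standalone as
follows. DEMO_HAVE_INET is a hypothetical stand-in for CONFIG_INET, and
all demo_* names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical switch standing in for CONFIG_INET; uncomment to build
 * the optional test in. */
/* #define DEMO_HAVE_INET 1 */

enum {
	DEMO_ST_LINK,
	DEMO_ST_SPEED,
#ifdef DEMO_HAVE_INET
	DEMO_ST_LOOPBACK,
#endif
	DEMO_ST_NUM,
};

const char *const demo_names[DEMO_ST_NUM] = {
	"Link Test",
	"Speed Test",
#ifdef DEMO_HAVE_INET
	"Loopback Test",
#endif
};

int demo_link(void)  { return 0; }
int demo_speed(void) { return 0; }
#ifdef DEMO_HAVE_INET
int demo_loopback(void) { return 0; }
#endif

int (*const demo_funcs[DEMO_ST_NUM])(void) = {
	demo_link,
	demo_speed,
#ifdef DEMO_HAVE_INET
	demo_loopback,
#endif
};

/* Runs every compiled-in test; returns the number of failures. */
int demo_run_all(void)
{
	int i, failed = 0;

	for (i = 0; i < DEMO_ST_NUM; i++)
		if (demo_funcs[i]() != 0)
			failed++;
	return failed;
}
```

All three tables have to be guarded together: guarding only the function
would leave an enum slot with no name and no callback.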

Fixes: 0952da791c97 ("net/mlx5e: Add support for loopback selftest")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
index c32af7daf3ff..65442c36a6e1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
@@ -39,7 +39,9 @@ enum {
 	MLX5E_ST_LINK_STATE,
 	MLX5E_ST_LINK_SPEED,
 	MLX5E_ST_HEALTH_INFO,
+#ifdef CONFIG_INET
 	MLX5E_ST_LOOPBACK,
+#endif
 	MLX5E_ST_NUM,
 };
 
@@ -47,7 +49,9 @@ const char mlx5e_self_tests[MLX5E_ST_NUM][ETH_GSTRING_LEN] = {
 	"Link Test",
 	"Speed Test",
 	"Health Test",
+#ifdef CONFIG_INET
 	"Loopback Test",
+#endif
 };
 
 int mlx5e_self_test_num(struct mlx5e_priv *priv)
@@ -93,6 +97,7 @@ static int mlx5e_test_link_speed(struct mlx5e_priv *priv)
 	return 1;
 }
 
+#ifdef CONFIG_INET
 /* loopback test */
 #define MLX5E_TEST_PKT_SIZE (MLX5_MPWRQ_SMALL_PACKET_THRESHOLD - NET_IP_ALIGN)
 static const char mlx5e_test_text[ETH_GSTRING_LEN] = "MLX5E SELF TEST";
@@ -304,12 +309,15 @@ static int mlx5e_test_loopback(struct mlx5e_priv *priv)
 	kfree(lbtp);
 	return err;
 }
+#endif
 
 static int (*mlx5e_st_func[MLX5E_ST_NUM])(struct mlx5e_priv *) = {
 	mlx5e_test_link_state,
 	mlx5e_test_link_speed,
 	mlx5e_test_health_info,
-	mlx5e_test_loopback
+#ifdef CONFIG_INET
+	mlx5e_test_loopback,
+#endif
 };
 
 void mlx5e_self_test(struct net_device *ndev, struct ethtool_test *etest,
-- 
2.9.0

^ permalink raw reply related

* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Eric Dumazet @ 2016-11-30 21:00 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <CALzJLG_aT1O1ergGRu8Z0u4nszKYao5RbPfb=1USptwFY1d7PQ@mail.gmail.com>

On Wed, 2016-11-30 at 22:42 +0200, Saeed Mahameed wrote:
> On Wed, Nov 30, 2016 at 7:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2016-11-30 at 18:46 +0200, Saeed Mahameed wrote:
> >
> >> we had/still have the proper stats they are the ones that
> >> mlx4_en_fold_software_stats is trying to cache into  (they always
> >> exist),
> >> but the ones that you are trying to read from (the mlx4 rings) are gone !
> >>
> >> This bug is totally new and as i warned, this is another symptom of
> >> the real root cause (can't sleep while reading stats).
> >>
> >> Eric what do you suggest ? Keep pre-allocated MAX_RINGS stats  and
> >> always iterate over all of them to query stats ?
> >> what if you have one ring/none/1K ? how would you know how many to query ?
> >
> > I am suggesting I will fix the bug I introduced.
> >
> > Do not panic.
> >
> >
> 
> Not at all, I trust you are the only one who is capable of providing
> the best solution.
> I am just trying to read your mind :-).
> 
> As i said i like the solution and i want to adapt it to mlx5, so I am
> a little bit enthusiastic :)

What about the following fix guys ?

As a bonus we update the stats right before they are sent to monitors
via rtnetlink ;)


diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 12ea3405f442717478bf0e8882edaf0de77986cb..091b904262bc7932d3edf99cf850affb23b9ce6e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1809,8 +1809,12 @@ void mlx4_en_stop_port(struct net_device *dev, int detach)
 
 	netif_tx_disable(dev);
 
+	spin_lock_bh(&priv->stats_lock);
+	mlx4_en_fold_software_stats(dev);
 	/* Set port as not active */
 	priv->port_up = false;
+	spin_unlock_bh(&priv->stats_lock);
+
 	priv->counter_index = MLX4_SINK_COUNTER_INDEX(mdev->dev);
 
 	/* Promsicuous mode */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index c6c4f1238923e09eced547454b86c68720292859..9166d90e732858610b1407fe85cbf6cbe27f5e0b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -154,7 +154,7 @@ void mlx4_en_fold_software_stats(struct net_device *dev)
 	unsigned long packets, bytes;
 	int i;
 
-	if (mlx4_is_master(mdev->dev))
+	if (!priv->port_up || mlx4_is_master(mdev->dev))
 		return;
 
 	packets = 0;
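
The intent of the fix above, as I read it, is "take a last snapshot, then
mark the port down, and never read the rings again afterwards". A
single-threaded userspace sketch of that ordering follows; the demo_*
names are hypothetical, and the locking the driver does with stats_lock
is only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>

struct demo_ring { unsigned long packets; };

struct demo_port {
	int port_up;                  /* cleared once rings may go away */
	struct demo_ring *rings;      /* only valid while port_up is set */
	int num_rings;
	unsigned long cached_packets; /* always valid, read by stats queries */
};

/* Refresh the cache from the rings, but only while the rings exist. */
void demo_fold_stats(struct demo_port *p)
{
	unsigned long packets = 0;
	int i;

	if (!p->port_up)
		return; /* keep the last folded totals */
	for (i = 0; i < p->num_rings; i++)
		packets += p->rings[i].packets;
	p->cached_packets = packets;
}

/* Stop path: final snapshot, then mark the port down.  In the driver
 * both steps sit under stats_lock, so a reader cannot observe port_up
 * still set after the rings are gone. */
void demo_stop_port(struct demo_port *p)
{
	demo_fold_stats(p);
	p->port_up = 0;
	p->rings = NULL; /* stands in for freeing the rings */
}
```

After demo_stop_port() a later demo_fold_stats() call returns early and
the cached totals stay at their last value instead of touching freed
rings.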

^ permalink raw reply related

* Re: [PATCH net-next] sock: reset sk_err for ICMP packets read from error queue
From: Maciej Żenczykowski @ 2016-11-30 20:49 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: David Miller, Linux NetDev, Eric Dumazet, Willem de Bruijn,
	Hannes Frederic Sowa, Soheil Hassas Yeganeh
In-Reply-To: <1480532468-1610-1-git-send-email-soheil.kdev@gmail.com>

On Wed, Nov 30, 2016 at 8:01 PM, Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> Only when ICMP packets are enqueued onto the error queue,
> sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
> in sock_dequeue_err_skb), a subsequent error queue read
> would set sk_err to the next error on the queue, or 0 if empty.
> As no error types other than ICMP set this field, sk_err should
> not be modified upon dequeuing them.
>
> Only for ICMP errors, reset the (racy) sk_err. Some applications,
> like traceroute, rely on it and go into a futile busy POLLERR
> loop otherwise.
>
> In principle, sk_err has to be set while an ICMP error is queued.
> Testing is_icmp_err_skb(skb_next) approximates this without
> requiring a full queue walk. Applications that receive both ICMP
> and other errors cannot rely on this legacy behavior, as other
> errors do not set sk_err in the first place.
>
> Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Acked-by: Maciej Żenczykowski <maze@google.com>
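
One way to read the quoted rule ("only for ICMP errors, reset sk_err,
and peek at the next entry instead of walking the queue") is the
userspace sketch below. The list type and the demo_* names are
hypothetical; this is not the kernel's sock_dequeue_err_skb().

```c
#include <assert.h>
#include <stddef.h>

struct demo_err { int is_icmp; int err; struct demo_err *next; };
struct demo_sock { int sk_err; struct demo_err *head; };

/* Pop the head of the error queue.  sk_err is cleared only when an ICMP
 * error is dequeued and the entry now at the head is not an ICMP error. */
struct demo_err *demo_dequeue_err(struct demo_sock *sk)
{
	struct demo_err *e = sk->head;
	int icmp_next;

	if (!e)
		return NULL;
	sk->head = e->next;
	icmp_next = sk->head && sk->head->is_icmp;
	if (e->is_icmp && !icmp_next)
		sk->sk_err = 0;
	return e;
}
```

Dequeuing a non-ICMP entry leaves sk_err untouched in this sketch, which
matches the statement that other error types never set it in the first
place.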

^ permalink raw reply

* Re: [PATCH 4/6] net: ethernet: ti: cpts: add ptp pps support
From: Grygorii Strashko @ 2016-11-30 20:43 UTC (permalink / raw)
  To: Richard Cochran, Murali Karicheri, Wingman Kwok
  Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
	linux-omap, Rob Herring, devicetree
In-Reply-To: <20161130184511.GB8209@netboy>



On 11/30/2016 12:45 PM, Richard Cochran wrote:
> On Mon, Nov 28, 2016 at 05:04:26PM -0600, Grygorii Strashko wrote:
>> +static cycle_t cpts_cc_ns2cyc(struct cpts *cpts, u64 nsecs)
>> +{
>> +	cycle_t cyc = (nsecs << cpts->cc.shift) + nsecs;
>> +
>> +	do_div(cyc, cpts->cc.mult);
>> +
>> +	return cyc;
>> +}
> 
> So you set the comparison value once per second, based on cc.mult.
> But when the clock is being actively synchronized, user space calls to
> clock_adjtimex() will change cc.mult.  This can happen several times
> per second, depending on the PTP Sync rate.
> 

Right.

> In order to produce the PPS edge correctly, you would have to adjust
> the comparison value whenever cc.mult changes, 

yes. And that is done in cpts_ptp_adjfreq()
	if (cpts->ts_comp_enabled)
		cpts->ts_comp_one_sec_cycs = cpts_cc_ns2cyc(cpts, NSEC_PER_SEC);
	^^^ re-calculate reload value for 
 
	cpts_ts_comp_settime(cpts, ns);
	^^^ adjust the ts_comp
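
To make the dependency on cc.mult concrete, here is the same arithmetic
as the quoted cpts_cc_ns2cyc() in plain userspace C. The function name
and the mult/shift values are assumptions chosen for round numbers, not
values taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* cycles = ((ns << shift) + ns) / mult, with ordinary 64-bit division
 * in place of do_div(). */
uint64_t demo_ns2cyc(uint64_t nsecs, uint32_t mult, uint32_t shift)
{
	uint64_t cyc = (nsecs << shift) + nsecs;

	return cyc / mult;
}
```

With shift = 29 and mult = 1 << 30 (2 ns per cycle in this made-up
setup), one second is 500000000 cycles. Any increase of mult makes the
result smaller, so the one-second reload value has to be recomputed each
time the frequency is adjusted.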

> but of course this is unworkable.
> 

Sorry, but this is questionable - the PPS code comes from TI internal
branches (SDK releases), where it has survived for a pretty long time.
I do, of course, agree that without HW support for freq adjustment
this PPS feature is not super precise and has some limitations,
but that is what we agree to live with.

Murali, do you have any comments regarding the usability of the SW
freq adjustment approach?

> So I'll have to say NAK for this patch.
> 

:) 


-- 
regards,
-grygorii

^ permalink raw reply

* Re: DSA vs. SWTICHDEV ?
From: Jiri Pirko @ 2016-11-30 20:43 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Joakim Tjernlund, Florian Fainelli, netdev@vger.kernel.org
In-Reply-To: <20161130180927.GK21645@lunn.ch>

Wed, Nov 30, 2016 at 07:09:27PM CET, andrew@lunn.ch wrote:
>> Something like that. I need to run routing protocols on the switch I/Fs and egress
>> pkgs on selected switch I/Fs bypassing ARP, just like DSA does with its vendor
>> tags.
>
>Does the switch have an equivalent tagging protocol? If you are
>building a tree of switches you need something like this for frames
>going from the host via intermediate switches and out a specific port
>on a remote switch.
>
>> We might have a tree as well so now I really wonder: Given we write a
>> proper switchdev driver, can it support switchtrees without touching
>> switchdev infra structure?
>
>Jiri Pirko <jiri@resnulli.us> is probably the best person to ask about
>this. DSA hides the knowledge that there is multiple switches. To
>switchdev, a tree of switches looks like one switch. This is not
>because of switchdev, it is just the existing DSA code worked when
>switchdev came along.

Looks like the hw is DSA-ish. If I'm not mistaken about that, it should be
handled as a part of DSA.


>
> If not I guess we will attach a physical
>> eth I/F to the switch and use both DSA and switchdev to support both trees
>> and HW offload. 
>
>This only works if the switch has the necessary tagging protocol to
>pass through multiple switches.
>
>> We have on an existing board with a BCM ROBO switch with lots of ports(>24),
>> managed over SPI. Looking at BCM DSA tag code it looks like it only supports
>> some 8 ports or so. I still have to find out if this is a limitation in BCM tagging
>> protocol or if just not impl. in DSA yet.
>
>Hi Florian, care to comment?
>
>As far as i understand, the tag used for SF2 and B53 does not support
>a tree of switches. But the big ROBO switches might have a different
>tagging protocol.
>
>  Andrew

^ permalink raw reply

* Re: Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Saeed Mahameed @ 2016-11-30 20:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480527321.18162.196.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 7:35 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 18:46 +0200, Saeed Mahameed wrote:
>
>> we had/still have the proper stats they are the ones that
>> mlx4_en_fold_software_stats is trying to cache into  (they always
>> exist),
>> but the ones that you are trying to read from (the mlx4 rings) are gone !
>>
>> This bug is totally new and as i warned, this is another symptom of
>> the real root cause (can't sleep while reading stats).
>>
>> Eric what do you suggest ? Keep pre-allocated MAX_RINGS stats  and
>> always iterate over all of them to query stats ?
>> what if you have one ring/none/1K ? how would you know how many to query ?
>
> I am suggesting I will fix the bug I introduced.
>
> Do not panic.
>
>

Not at all, I trust you are the only one who is capable of providing
the best solution.
I am just trying to read your mind :-).

As i said i like the solution and i want to adapt it to mlx5, so I am
a little bit enthusiastic :)

Thanks.

^ permalink raw reply

* Re: [PATCH 1/4] bindings: net: stmmac: correct note about TSO
From: Rob Herring @ 2016-11-30 20:41 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Mark Rutland, David S. Miller, Giuseppe CAVALLARO,
	Alexandre TORGUE, Phil Reid, Niklas Cassel, Eric Engestrom,
	netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1479911066-19752-1-git-send-email-niklass-VrBV9hrLPhE@public.gmane.org>

On Wed, Nov 23, 2016 at 03:24:25PM +0100, Niklas Cassel wrote:
> From: Niklas Cassel <niklas.cassel-VrBV9hrLPhE@public.gmane.org>
> 
> snps,tso was previously placed under AXI BUS Mode parameters,
> suggesting that the property should be in the stmmac-axi-config node.
> 
> TSO (TCP Segmentation Offloading) has nothing to do with AXI BUS Mode
> parameters, and the parser actually expects it to be in the root node,
> not in the stmmac-axi-config.
> 
> Also added a note about snps,tso only being available on GMAC4 and newer.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel-VrBV9hrLPhE@public.gmane.org>
> ---
>  Documentation/devicetree/bindings/net/stmmac.txt | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Acked-by: Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] can: rcar_canfd: Correct order of interrupt specifiers
From: Rob Herring @ 2016-11-30 20:38 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Wolfgang Grandegger, Marc Kleine-Budde, Ramesh Shanmugasundaram,
	Chris Paterson, linux-can, netdev, devicetree, linux-renesas-soc
In-Reply-To: <1479908686-14028-1-git-send-email-geert+renesas@glider.be>

On Wed, Nov 23, 2016 at 02:44:46PM +0100, Geert Uytterhoeven wrote:
> According to both DTS (example and actual files), and Linux driver code,
> the first interrupt specifier should be the Channel interrupt, while the
> second interrupt specifier should be the Global interrupt.
> 
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> ---
>  Documentation/devicetree/bindings/net/can/rcar_canfd.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Acked-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH] netns: avoid disabling irq for netns id
From: Paul Moore @ 2016-11-30 20:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-audit, xiyou.wangcong
In-Reply-To: <20161130.145822.727604546507312208.davem@davemloft.net>

On Wed, Nov 30, 2016 at 2:58 PM, David Miller <davem@davemloft.net> wrote:
> From: Paul Moore <pmoore@redhat.com>
> Date: Tue, 29 Nov 2016 17:11:29 -0500
>
>> From: Paul Moore <paul@paul-moore.com>
>>
>> Bring back commit bc51dddf98c9 ("netns: avoid disabling irq for netns
>> id") now that we've fixed some audit multicast issues that caused
>> problems with original attempt.  Additional information, and history,
>> can be found in the links below:
>>
>>  * https://github.com/linux-audit/audit-kernel/issues/22
>>  * https://github.com/linux-audit/audit-kernel/issues/23
>>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> Signed-off-by: Paul Moore <paul@paul-moore.com>
>
> This doesn't apply cleanly to the net-next tree, could you please
> respin?

As I mentioned in a reply to the patch posting, because this relies on
a number of patches in the audit tree I've gone ahead and merged this
patch into the audit#next branch.  Unless you have any objections,
I'll send this to Linus with the rest of the v4.10 audit patches.

-- 
paul moore
security @ redhat

^ permalink raw reply

* Re: [PATCH 3/6] net: ethernet: ti: cpts: add support of cpts HW_TS_PUSH
From: Grygorii Strashko @ 2016-11-30 20:15 UTC (permalink / raw)
  To: Jan Lübbe, Murali Karicheri
  Cc: David S. Miller, netdev, Mugunthan V N, Richard Cochran,
	Sekhar Nori, linux-kernel, linux-omap, Rob Herring, devicetree,
	Wingman Kwok
In-Reply-To: <1480504100.9183.72.camel@pengutronix.de>



On 11/30/2016 05:08 AM, Jan Lübbe wrote:
> On Mo, 2016-11-28 at 17:04 -0600, Grygorii Strashko wrote:
>> This patch adds support of the CPTS HW_TS_PUSH events which are generated
>> by external low frequency time stamp channels on TI's OMAP CPSW and
>> Keystone 2 platforms. It supports up to 8 external time stamp channels for
>> HW_TS_PUSH input pins (the number of supported channel is different for
> different SoCs and CPTS versions, check the corresponding data manual before
>> enabling it). Therefore, new DT property "cpts-ext-ts-inputs" is introduced
>> for specifying number of available external timestamp channels.
> 
> If this only depends on SoC and CPTS, it should be possible to derive
> the correct value from the compatible value and possibly a CPTS version
> register? If the existing compatible strings are not specific enough,
> possibly a new one should be added.
> 

In general, I can try to add and use new compat strings
"ti,netcp-k2hk"
"ti,netcp-k2l"
"ti,netcp-k2e"
"ti,netcp-k2g" 
for determining CPTS capabilities.

CPTS version is not an option due to very poor documentation,
which does not allow identifying the relation between CPTS version and
supported features :(

Murali, what do you think?


-- 
regards,
-grygorii

^ permalink raw reply

* Re: [PATCH v2] ethernet :mellanox :mlx5: Replace pci_pool_alloc by pci_pool_zalloc
From: David Miller @ 2016-11-30 19:57 UTC (permalink / raw)
  To: jrdr.linux
  Cc: sergei.shtylyov, saeedm, matanb, leonro, netdev, linux-rdma,
	sahu.rameshwar73
In-Reply-To: <20161129212018.GA5419@jordon-HP-15-Notebook-PC>

From: Souptick Joarder <jrdr.linux@gmail.com>
Date: Wed, 30 Nov 2016 02:50:18 +0530

> In alloc_cmd_box(), pci_pool_alloc() followed by memset will be
> replaced by pci_pool_zalloc()
> 
> Signed-off-by: Souptick joarder <jrdr.linux@gmail.com>

Applied.
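
As a userspace analogy only (malloc/memset/calloc here, not the
pci_pool_* API, and with hypothetical demo_* names), the change replaces
a two-step "allocate, then clear" with a single zeroing allocation:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Before: allocate, then clear in a second step. */
void *demo_alloc_then_clear(size_t size)
{
	void *p = malloc(size);

	if (p)
		memset(p, 0, size);
	return p;
}

/* After: one call that returns zeroed memory. */
void *demo_zalloc(size_t size)
{
	return calloc(1, size);
}
```

Both variants hand back zero-filled memory; the second just has one
fewer place to forget the clear or get the size wrong.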

^ permalink raw reply

* Re: [PATCH net-next 1/1] driver: ipvlan: Remove useless member mtu_adj of struct ipvl_dev
From: David Miller @ 2016-11-30 20:02 UTC (permalink / raw)
  To: fgao; +Cc: maheshb, edumazet, netdev, gfree.wind
In-Reply-To: <1480466924-2436-1-git-send-email-fgao@ikuai8.com>

From: fgao@ikuai8.com
Date: Wed, 30 Nov 2016 08:48:44 +0800

> From: Gao Feng <fgao@ikuai8.com>
> 
> The mtu_adj is initialized to zero when the memory is allocated, and
> nothing assigns to it afterwards. It is only read in ipvlan_adjust_mtu
> as an rvalue.
> So it is a useless member of struct ipvl_dev; remove it.
> 
> Signed-off-by: Gao Feng <fgao@ikuai8.com>

Applied, thank you.

^ permalink raw reply

* [PATCH] sh_eth: add missing checks for status bits
From: Chris Brandt @ 2016-11-30 20:01 UTC (permalink / raw)
  To: David Miller
  Cc: Simon Horman, Geert Uytterhoeven, netdev, linux-renesas-soc,
	Chris Brandt

When streaming a lot of data and the RZ can't keep up, some status bits
get set that are neither checked nor cleared. This causes the following
messages, and the Ethernet driver stops working. This patch fixes that
issue by adding the missing bits to the error check mask.

irq 21: nobody cared (try booting with the "irqpoll" option)
handlers:
[<c036b71c>] sh_eth_interrupt
Disabling IRQ #21
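
A simplified model of why an unchecked status bit ends in "nobody cared":
if the only bit raised is outside the mask the handler tests, the handler
reports the interrupt as not its own. The bit positions and demo_* names
below are made up for illustration and are NOT the real EESR layout.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only. */
#define DEMO_RX_DONE (1u << 0)
#define DEMO_TX_DONE (1u << 1)
#define DEMO_ERR_A   (1u << 2)
#define DEMO_TUC     (1u << 3)
#define DEMO_ROC     (1u << 4)

/* Returns 1 if the handler recognises at least one raised bit, 0 if it
 * would have to report the interrupt as unhandled. */
int demo_irq_handled(uint32_t status, uint32_t err_check)
{
	uint32_t known = DEMO_RX_DONE | DEMO_TX_DONE | err_check;

	return (status & known) != 0;
}
```

With the overflow/underflow bits added to err_check, the same status word
is recognised and can be acknowledged instead of being left pending.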

Fixes: db893473d313a4ad ("sh_eth: Add support for r7s72100")
Signed-off-by: Chris Brandt <chris.brandt@renesas.com>
---
 drivers/net/ethernet/renesas/sh_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
index 05b0dc5..079f10e 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -523,7 +523,7 @@ static struct sh_eth_cpu_data r7s72100_data = {
 	.tx_check	= EESR_TC1 | EESR_FTC,
 	.eesr_err_check	= EESR_TWB1 | EESR_TWB | EESR_TABT | EESR_RABT |
 			  EESR_RFE | EESR_RDE | EESR_RFRMER | EESR_TFE |
-			  EESR_TDE | EESR_ECI,
+			  EESR_TDE | EESR_ECI | EESR_TUC | EESR_ROC,
 	.fdr_value	= 0x0000070f,
 
 	.no_psr		= 1,
-- 
2.10.1

^ permalink raw reply related

* Re: [PATCH] net: ethernet: ti: cpsw: fix ASSERT_RTNL() warning during resume
From: David Miller @ 2016-11-30 19:59 UTC (permalink / raw)
  To: grygorii.strashko
  Cc: netdev, mugunthanvnm, nsekhar, linux-kernel, linux-omap,
	ivan.khoronzhuk, d-gerlach
In-Reply-To: <20161129222703.10908-1-grygorii.strashko@ti.com>

From: Grygorii Strashko <grygorii.strashko@ti.com>
Date: Tue, 29 Nov 2016 16:27:03 -0600

> netif_set_real_num_tx/rx_queues() are required to be called with rtnl_lock
> taken, otherwise ASSERT_RTNL() warning will be triggered - which happens
> now during System resume from suspend:
> cpsw_resume()
> |- cpsw_ndo_open()
>   |- netif_set_real_num_tx/rx_queues()
>      |- ASSERT_RTNL();
> 
> Hence, fix it by surrounding cpsw_ndo_open() by rtnl_lock/unlock() calls.
> 
> Cc: Dave Gerlach <d-gerlach@ti.com>
> Cc: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
> Fixes: commit e05107e6b747 ("net: ethernet: ti: cpsw: add multi queue support")
> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>

Applied.

^ permalink raw reply

* Re: [PATCH] netns: avoid disabling irq for netns id
From: David Miller @ 2016-11-30 19:58 UTC (permalink / raw)
  To: pmoore; +Cc: netdev, linux-audit, xiyou.wangcong
In-Reply-To: <148045748887.22539.3188295553967836703.stgit@sifl>

From: Paul Moore <pmoore@redhat.com>
Date: Tue, 29 Nov 2016 17:11:29 -0500

> From: Paul Moore <paul@paul-moore.com>
> 
> Bring back commit bc51dddf98c9 ("netns: avoid disabling irq for netns
> id") now that we've fixed some audit multicast issues that caused
> problems with original attempt.  Additional information, and history,
> can be found in the links below:
> 
>  * https://github.com/linux-audit/audit-kernel/issues/22
>  * https://github.com/linux-audit/audit-kernel/issues/23
> 
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> Signed-off-by: Paul Moore <paul@paul-moore.com>

This doesn't apply cleanly to the net-next tree, could you please
respin?

Thanks.

^ permalink raw reply

* Re: [PATCH v3] ethernet :mellanox :mlx4: Replace pci_pool_alloc by pci_pool_zalloc
From: David Miller @ 2016-11-30 19:57 UTC (permalink / raw)
  To: jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w
  Cc: sergei.shtylyov-M4DtvfQ/ZS1MRgGoP+s0PdBPR1lH4CV8,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sahu.rameshwar73-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20161129194611.GA4088@jordon-HP-15-Notebook-PC>

From: Souptick Joarder <jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Wed, 30 Nov 2016 01:16:12 +0530

> In mlx4_alloc_cmd_mailbox(), pci_pool_alloc() followed by memset will be
> replaced by pci_pool_zalloc()
> 
> Signed-off-by: Souptick joarder <jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Applied.

^ permalink raw reply

* Re: [PATCH] net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.
From: David Miller @ 2016-11-30 19:54 UTC (permalink / raw)
  To: lorenzo; +Cc: netdev, erezsh
In-Reply-To: <1480442207-43618-1-git-send-email-lorenzo@google.com>

From: Lorenzo Colitti <lorenzo@google.com>
Date: Wed, 30 Nov 2016 02:56:47 +0900

> Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP
> protocols.") made __build_flow_key call sock_net(sk) to determine
> the network namespace of the passed-in socket. This crashes if sk
> is NULL.
> 
> Fix this by getting the network namespace from the skb instead.
> 
> Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il>
> Signed-off-by: Lorenzo Colitti <lorenzo@google.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] bpf: add test for the verifier equal logic bug
From: David Miller @ 2016-11-30 19:52 UTC (permalink / raw)
  To: jbacik; +Cc: netdev, ast, jannh, daniel, kernel-team
In-Reply-To: <1480440919-3252-1-git-send-email-jbacik@fb.com>

From: Josef Bacik <jbacik@fb.com>
Date: Tue, 29 Nov 2016 12:35:19 -0500

> This is a test to verify that
> 
> bpf: fix states equal logic for varlen access
> 
> actually fixed the problem.  The problem was if the register we added to our map
> register was UNKNOWN in both the false and true branches and the only thing that
> changed was the range then we'd incorrectly assume that the true branch was
> valid, which it really wasn't.  This tests this case and properly fails without
> my fix in place and passes with it in place.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Applied.

^ permalink raw reply

* Re: [PATCH net][v2] bpf: fix states equal logic for varlen access
From: David Miller @ 2016-11-30 19:51 UTC (permalink / raw)
  To: jbacik; +Cc: netdev, ast, jannh, daniel, kernel-team
In-Reply-To: <1480440429-2531-1-git-send-email-jbacik@fb.com>

From: Josef Bacik <jbacik@fb.com>
Date: Tue, 29 Nov 2016 12:27:09 -0500

> If we have a branch that looks something like this
> 
> int foo = map->value;
> if (condition) {
>   foo += blah;
> } else {
>   foo = bar;
> }
> map->array[foo] = baz;
> 
> We will incorrectly assume that the !condition branch is equal to the condition
> branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
> adjust this logic to only do this if we didn't do a varlen access after we
> processed the !condition branch, otherwise we have different ranges and need to
> check the other branch as well.
> 
> Fixes: 484611357c19 ("bpf: allow access into map value arrays")
> Reported-by: Jann Horn <jannh@google.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
> v1->v2:
> - renamed and moved varlen_map_access variable.
> - dropped the extra () in the second if statement.
> - added the Fixes and Reported-by tag.

Applied, thanks.
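
To make the quoted description concrete, here is a toy userspace model of
the comparison. It is not the verifier code: struct reg_state, its fields and
both helpers are simplified stand-ins assumed for illustration, and only the
varlen_map_access name comes from the changelog above. It shows why comparing
register types alone wrongly treats the two branches as equal once a
variable-length map access depends on the range of foo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy subset of register types; the real verifier has many more. */
enum reg_type { NOT_INIT, UNKNOWN_VALUE };

/* Simplified register state: a type plus the tracked value range. */
struct reg_state {
	enum reg_type type;
	int64_t min_value;
	uint64_t max_value;
};

/* Pre-fix behaviour in this model: matching types are "equal". */
static bool regs_equal_type_only(const struct reg_state *old,
				 const struct reg_state *cur)
{
	return old->type == cur->type;
}

/* Post-fix behaviour in this model: once a varlen map access has been
 * seen, the ranges must match too, otherwise the other branch has to
 * be walked as well. */
static bool regs_equal(const struct reg_state *old,
		       const struct reg_state *cur,
		       bool varlen_map_access)
{
	if (old->type != cur->type)
		return false;
	if (!varlen_map_access)
		return true;
	return old->min_value == cur->min_value &&
	       old->max_value == cur->max_value;
}
```

With foo bounded to [0, 10] on one path and [0, 100] on the other, the
type-only check reports equality while the range-aware check does not.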

^ permalink raw reply

* Re: [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Miller @ 2016-11-30 19:49 UTC (permalink / raw)
  To: david.lebrun; +Cc: netdev
In-Reply-To: <1480439718-18019-1-git-send-email-david.lebrun@uclouvain.be>

From: David Lebrun <david.lebrun@uclouvain.be>
Date: Tue, 29 Nov 2016 18:15:18 +0100

> When multiple nexthops are available for a given route, the routing engine
> chooses a nexthop by computing the flow hash through get_hash_from_flowi6
> and by taking that value modulo the number of nexthops. The resulting value
> indexes the nexthop to select. This method causes issues when a new nexthop
> is added or one is removed (e.g. link failure). In that case, the number
> of nexthops changes and potentially all the flows get re-routed to another
> nexthop.
> 
> This patch implements a consistent hash method to select the nexthop in
> case of ECMP. The idea is to generate K slices (or intervals) for each
> route with multiple nexthops. The nexthops are randomly assigned to those
> slices, in a uniform manner. The number K is configurable through a sysctl
> net.ipv6.route.ecmp_slices and is always a power of 2. To select the
> nexthop, the algorithm takes the flow hash and computes an index which is
> the flow hash modulo K. As K = 2^x, the modulo can be computed using a
> simple binary AND operation (idx = hash & (K - 1)). The resulting index
> references the selected nexthop. The lookup time complexity is thus O(1).
> 
> When a nexthop is added, it steals K/N slices from the other nexthops,
> where N is the new number of nexthops. The slices are stolen randomly and
> uniformly from the other nexthops. When a nexthop is removed, the orphan
> slices are randomly reassigned to the other nexthops.
> 
> The number of slices for a route also fixes the maximum number of nexthops
> possible for that route.
> 
> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>

Interesting approach, but like Hannes I worry about the memory consumption
bounds.

Limiting to 1<<16 is interesting, but if you can limit to 1<<8 (256
nexthops) maybe the state requirement can be compressed even further?

We can always increase this if necessary in the future if someone
reports a reasonable use case that really needs it.  Let's start
simple and small first.
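
The slice scheme in the quoted changelog can be sketched in userspace C as
follows. This is a minimal illustration under stated assumptions, not the
code from the patch: struct ecmp_table, ecmp_select(), ecmp_add_nexthop() and
the xorshift generator are hypothetical names. One byte per slice is used
here on the assumption that a route has at most 256 nexthops, which is one
possible reading of the 1<<8 suggestion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical slice table: slice[i] holds the index of the nexthop
 * that owns slice i. nslices (K) must be a power of two. */
struct ecmp_table {
	uint32_t nslices;
	uint8_t *slice;		/* one byte per slice: at most 256 nexthops */
};

/* Tiny xorshift PRNG, a stand-in for whatever random source is used. */
static uint32_t prng_state = 2463534242u;

static uint32_t prng_next(void)
{
	prng_state ^= prng_state << 13;
	prng_state ^= prng_state >> 17;
	prng_state ^= prng_state << 5;
	return prng_state;
}

/* O(1) lookup: idx = hash & (K - 1), valid because K = 2^x. */
static unsigned int ecmp_select(const struct ecmp_table *t, uint32_t hash)
{
	assert(t->nslices && !(t->nslices & (t->nslices - 1)));
	return t->slice[hash & (t->nslices - 1)];
}

/* A new nexthop steals K/N randomly chosen slices from the others,
 * where N is the nexthop count after the addition. Returns the number
 * of slices stolen. Assumes new_id does not own any slice yet. */
static unsigned int ecmp_add_nexthop(struct ecmp_table *t, uint8_t new_id,
				     unsigned int new_count)
{
	unsigned int want;
	unsigned int stolen = 0;

	assert(new_count > 0);
	want = t->nslices / new_count;

	while (stolen < want) {
		uint32_t i = prng_next() & (t->nslices - 1);

		if (t->slice[i] != new_id) {
			t->slice[i] = new_id;
			stolen++;
		}
	}
	return stolen;
}
```

Only the stolen slices change owner, so flows hashing to the remaining
slices keep their nexthop. With hash modulo N, changing N can move almost
every flow.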

^ permalink raw reply

* Re: [PATCH v3 net-next 1/2] net: ethernet: slicoss: add slicoss gigabit ethernet driver
From: Lino Sanfilippo @ 2016-11-30 19:48 UTC (permalink / raw)
  To: Florian Fainelli, davem, charrer, liodot, gregkh, andrew
  Cc: devel, netdev, linux-kernel
In-Reply-To: <34c8e4b3-ccbc-082e-1dd0-3893b56e475a@gmail.com>

On 29.11.2016 18:14, Florian Fainelli wrote:
> On 11/28/2016 01:41 PM, Lino Sanfilippo wrote:
>> The problem is that the HW does not provide a tx completion index. Instead we have to 
>> iterate the status descriptors until we get an invalid idx which indicates that there 
>> are no further tx descriptors done for now. I am afraid that if we do not limit the 
>> number of descriptors processed in the tx completion handler, a continuous transmission 
>> of frames could keep the loop in xmit_complete() running endlessly. I don't know if this 
>> can actually happen but I wanted to make sure that this is avoided.
> 
> OK, it might be a good idea to put that comment somewhere around the tx
> completion handler to understand why it is bounded with a specific value.
> 

Agreed, I will add such a comment.
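
For reference, a bounded completion loop of the kind discussed above might
look like the sketch below. It is a userspace model, not the driver code:
the status ring layout, TX_DONE_BUDGET and TX_IDX_INVALID are assumptions
made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TX_DONE_BUDGET	64u	/* hypothetical per-call limit */
#define TX_IDX_INVALID	0xffffu	/* hypothetical "no more completions" marker */

/* Walk the status ring starting at 'start' and count completed tx
 * descriptors. The HW provides no completion index, so the only natural
 * stop condition is an invalid idx. Under continuous transmission that
 * condition might never be met, hence the loop is additionally bounded
 * by TX_DONE_BUDGET. */
static unsigned int xmit_complete_bounded(const uint16_t *status,
					  unsigned int ring_len,
					  unsigned int start)
{
	unsigned int done = 0;
	unsigned int i = start;

	while (done < TX_DONE_BUDGET) {
		if (status[i] == TX_IDX_INVALID)
			break;
		/* a real handler would unmap and free the tx buffer here */
		done++;
		i = (i + 1) % ring_len;
	}
	return done;
}
```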

>> 
>>> [snip]
>>>
>>>> +	while (slic_get_free_rx_descs(rxq) > SLIC_MAX_REQ_RX_DESCS) {
>>>> +		skb = alloc_skb(maplen + ALIGN_MASK, gfp);
>>>> +		if (!skb)
>>>> +			break;
>>>> +
>>>> +		paddr = dma_map_single(&sdev->pdev->dev, skb->data, maplen,
>>>> +				       DMA_FROM_DEVICE);
>>>> +		if (dma_mapping_error(&sdev->pdev->dev, paddr)) {
>>>> +			netdev_err(dev, "mapping rx packet failed\n");
>>>> +			/* drop skb */
>>>> +			dev_kfree_skb_any(skb);
>>>> +			break;
>>>> +		}
>>>> +		/* ensure head buffer descriptors are 256 byte aligned */
>>>> +		offset = 0;
>>>> +		misalign = paddr & ALIGN_MASK;
>>>> +		if (misalign) {
>>>> +			offset = SLIC_RX_BUFF_ALIGN - misalign;
>>>> +			skb_reserve(skb, offset);
>>>> +		}
>>>> +		/* the HW expects dma chunks for descriptor + frame data */
>>>> +		desc = (struct slic_rx_desc *)skb->data;
>>>> +		memset(desc, 0, sizeof(*desc));
>>>
>>> Do you really need to zero-out the prepending RX descriptor? Are not you
>>> missing a write barrier here?
>> 
>> Indeed, it should be sufficient to make sure that the bit SLIC_IRHDDR_SVALID is not set.
>> I will adjust it. 
>> Concerning the write barrier: You mean a wmb() before slic_write() to ensure that the zeroing
>>  of the status desc is done before the descriptor is passed to the HW, right?
> 
> Correct, that's what I meant here.
> 

Ok, will fix this. Good catch BTW!
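
As an aside, the alignment step in the quoted refill loop reduces to the
arithmetic below. This is a standalone sketch: the value 256 is taken from
the quoted code comment, and deriving ALIGN_MASK as SLIC_RX_BUFF_ALIGN - 1
is an assumption about the driver's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define SLIC_RX_BUFF_ALIGN	256u			/* assumed from the comment */
#define ALIGN_MASK		(SLIC_RX_BUFF_ALIGN - 1)	/* assumed definition */

/* Number of bytes to skip so that paddr + offset is 256 byte aligned. */
static uint32_t rx_align_offset(uint64_t paddr)
{
	uint32_t misalign = (uint32_t)(paddr & ALIGN_MASK);

	return misalign ? SLIC_RX_BUFF_ALIGN - misalign : 0;
}
```

The offset is at most ALIGN_MASK bytes, which would explain the extra
ALIGN_MASK bytes requested in the alloc_skb() call.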

>> 
>>> [snip]
>>>
>>>> +
>>>> +		dma_sync_single_for_cpu(&sdev->pdev->dev,
>>>> +					dma_unmap_addr(buff, map_addr),
>>>> +					buff->addr_offset + sizeof(*desc),
>>>> +					DMA_FROM_DEVICE);
>>>> +
>>>> +		status = le32_to_cpu(desc->status);
>>>> +		if (!(status & SLIC_IRHDDR_SVALID))
>>>> +			break;
>>>> +
>>>> +		buff->skb = NULL;
>>>> +
>>>> +		dma_unmap_single(&sdev->pdev->dev,
>>>> +				 dma_unmap_addr(buff, map_addr),
>>>> +				 dma_unmap_len(buff, map_len),
>>>> +				 DMA_FROM_DEVICE);
>>>
>>> This is potentially inefficient, you already did a cache invalidation
>>> for the RX descriptor here, you could be more efficient with just
>>> invalidating the packet length, minus the descriptor length.
>>>
>> 
>> I am not sure I understand: We have to unmap the complete dma area, no matter if we synced
>> part of it before, don't we? AFAIK a dma sync is different from unmapping dma, or am I missing
>> something?
> 
> Sorry, I was not very clear, what I meant is that you can allocate and
> do the initial dma_map_single() of your RX skbs during ndo_open(), and
> then, in your RX path, you only need to do dma_sync_single_for_cpu() twice
> (once for the RX descriptor status, second time for the actual packet
> contents), and when you return the SKB to the HW, do a
> dma_sync_single_for_device(). The advantage of doing that, is that if
> your cache operations are slow, you only do them on exactly packet
> length, and not the actual RX buffer size (e.g: 2KB).

Um. In the rx path the SKB will be consumed (by napi_gro_receive()).
AFAIK we _have_ to unmap it before this call. Doing only a dma_sync_single_for_cpu()
for the packet contents does IMHO only make sense if the corresponding SKB is
reused somehow. But this is not the case. The rx buffers are refilled with newly
allocated SKBs each time, and thus we need to create a new dma mapping for each of them.

Or do I still misunderstand when to call the dma sync functions?


BTW: I just realized that if the descriptor has not been used by the HW yet, see:

+		dma_sync_single_for_cpu(&sdev->pdev->dev,
+					dma_unmap_addr(buff, map_addr),
+					buff->addr_offset + sizeof(*desc),
+					DMA_FROM_DEVICE);
+
+		status = le32_to_cpu(desc->status);
+		if (!(status & SLIC_IRHDDR_SVALID))
+			break;
+		  
		^^^^^^^^^^^^^^^^^^^^^^^^^^^  dma_sync_single_for_device missing


there has to be a dma_sync_single_for_device to undo the sync for cpu (since the
HW will write to this descr when the next rx packet arrives), right? But this is racy:
What if the HW writes to that descr after we synced it for cpu but before we synced
it for the HW again? Any ideas?

Regards,
Lino

^ permalink raw reply

* [PATCH] netfilter: avoid warn and OOM on vmalloc call
From: Marcelo Ricardo Leitner @ 2016-11-30 19:42 UTC (permalink / raw)
  To: andreyknvl; +Cc: fw, nhorman, netdev, netfilter-devel, linux-kernel
In-Reply-To: <20161130192145.GB13169@localhost.localdomain>

Hi Andrey,

Please let me know how this works for you. It seems good here, though
your poc may still trigger OOM through other means.

Thanks,
Marcelo

---8<---

Andrey Konovalov reported that this vmalloc call is based on a
userspace request and that it's spewing traces, which may flood the logs
and cause DoS if abused.

Florian Westphal also mentioned that this call should not trigger OOM,
as the kmalloc one is already not triggering it.

This patch brings the vmalloc call in sync with the kmalloc one: it
disables the warn trace on allocation failure and also disables OOM
invocation.

Note, however, that under such a stress situation, other places may
trigger OOM invocation.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/netfilter/x_tables.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index fc4977456c30e098197b4f987b758072c9cf60d9..dece525bf83a0098dad607fce665cd0bde228362 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -958,7 +958,9 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
 	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
 		info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
 	if (!info) {
-		info = vmalloc(sz);
+		info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
+				     __GFP_NORETRY | __GFP_HIGHMEM,
+				 PAGE_KERNEL);
 		if (!info)
 			return NULL;
 	}
-- 
2.9.3

^ permalink raw reply related
