Netdev List

* [RFC net-next 0/4] net: Improving network scheduling latencies
From: Yannick Vignon @ 2021-12-10 19:35 UTC
  To: Giuseppe Cavallaro, Alexandre Torgue, netdev, Ong Boon Leong,
	David S. Miller, Jakub Kicinski, Jose Abreu, Eric Dumazet,
	Wei Wang, Alexander Lobakin, Vladimir Oltean, Xiaoliang Yang,
	mingkai.hu, Joakim Zhang, sebastien.laveze

I am working on an application to showcase TSN use cases. That
application wakes up periodically, reads packet(s) from the network,
sends packet(s), then goes back to sleep. Endpoints are synchronized
through gPTP, and an 802.1Qbv schedule is in place to ensure packets are
sent at a fixed time. Right now, we achieve an overall period of 2ms,
with 500µs between the time the application is supposed to wake up and
the time the last packet is sent. We use an NXP 5.10.x kernel
with PREEMPT_RT patches.

I've been focusing lately on reducing the period, to see how close a
Linux-based system could get to a micro-controller with a "real-time"
OS. I've been able to achieve 500µs overall (125µs for the app itself)
by using AF_XDP sockets, but this also led to identifying several
sources of "scheduling" latencies, which I've tried to resolve with the
patches attached. The main culprit so far has been
local_bh_disable/local_bh_enable sections running in lower prio tasks,
requiring costly context switches along with priority inheritance. I've
removed the offending sections without significant problems so far, but
I'm not entirely clear on why local_bh_disable/local_bh_enable were
used in those places: is it some simple oversight, an excess of caution,
or am I missing something more fundamental in the way those locks are
used?

Thanks,
Yannick

Yannick Vignon (4)
  net: stmmac: remove unnecessary locking around PTP clock reads
  net: stmmac: do not use __netif_tx_lock_bh when in NAPI threaded mode
  net: stmmac: move to threaded IRQ
  net: napi threaded: remove unnecessary locking

 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 44
+++++++++++++++++++++++++-------------------
 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c  |  2 --
 net/core/dev.c                                    |  2 --
 3 files changed, 25 insertions(+), 23 deletions(-)
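
To illustrate the periodic wake-up pattern described in the cover letter,
here is a minimal user-space sketch. It is not taken from the application
itself: CLOCK_MONOTONIC stands in for the gPTP-synchronized clock, and
deadline_advance(), do_cycle_work() and run_cyclic() are hypothetical names.
The loop sleeps until absolute deadlines, so a late wake-up in one cycle
does not shift the following ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance an absolute deadline by period_ns, keeping tv_nsec below 1s. */
static void deadline_advance(struct timespec *t, long period_ns)
{
	/* Assumption of this sketch: the period is shorter than one second. */
	assert(period_ns >= 0 && period_ns < NSEC_PER_SEC);

	t->tv_nsec += period_ns;
	if (t->tv_nsec >= NSEC_PER_SEC) {
		t->tv_nsec -= NSEC_PER_SEC;
		t->tv_sec++;
	}
}

/* Hypothetical placeholder for the "read packets, send packets" work. */
static void do_cycle_work(void)
{
}

/* Run `cycles` iterations, one per period; returns 0 or an errno value. */
static int run_cyclic(long period_ns, unsigned int cycles)
{
	struct timespec next;
	int err;

	if (clock_gettime(CLOCK_MONOTONIC, &next))
		return errno;

	while (cycles--) {
		deadline_advance(&next, period_ns);

		/* Absolute sleep: restart it if a signal interrupts it. */
		do {
			err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
					      &next, NULL);
		} while (err == EINTR);
		if (err)
			return err;

		do_cycle_work();
	}

	return 0;
}
```

With the 500µs period mentioned above, a caller would use something like
run_cyclic(500000L, n). A real-time setup would additionally need a
SCHED_FIFO priority and locked memory, which this sketch leaves out.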


* [RFC net-next 1/4] net: napi threaded: remove unnecessary locking
From: Yannick Vignon @ 2021-12-10 19:35 UTC
  To: Giuseppe Cavallaro, Alexandre Torgue, netdev, Ong Boon Leong,
	David S. Miller, Jakub Kicinski, Jose Abreu, Eric Dumazet,
	Wei Wang, Alexander Lobakin, Vladimir Oltean, Xiaoliang Yang,
	mingkai.hu, Joakim Zhang, sebastien.laveze
  Cc: Yannick Vignon
In-Reply-To: <20211210193556.1349090-1-yannick.vignon@oss.nxp.com>

From: Yannick Vignon <yannick.vignon@nxp.com>

NAPI polling is normally protected by local_bh_disable()/local_bh_enable()
calls, to prevent that code from being executed concurrently due to the
softirq design. When NAPI instances are assigned their own dedicated kernel
thread, however, that concurrent code execution can no longer happen.

Removing the lock helps lower latencies when handling real-time traffic
(whose processing could still be delayed because of on-going processing of
best-effort traffic), and should also have a positive effect on overall
performance.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
---
 net/core/dev.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 15ac064b5562..e35d90e70c75 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7131,13 +7131,11 @@ static int napi_threaded_poll(void *data)
 		for (;;) {
 			bool repoll = false;

-			local_bh_disable();

 			have = netpoll_poll_lock(napi);
 			__napi_poll(napi, &repoll);
 			netpoll_poll_unlock(have);

-			local_bh_enable();

 			if (!repoll)
 				break;
-- 
2.25.1


* [RFC net-next 2/4] net: stmmac: move to threaded IRQ
From: Yannick Vignon @ 2021-12-10 19:35 UTC
  To: Giuseppe Cavallaro, Alexandre Torgue, netdev, Ong Boon Leong,
	David S. Miller, Jakub Kicinski, Jose Abreu, Eric Dumazet,
	Wei Wang, Alexander Lobakin, Vladimir Oltean, Xiaoliang Yang,
	mingkai.hu, Joakim Zhang, sebastien.laveze
  Cc: Yannick Vignon
In-Reply-To: <20211210193556.1349090-1-yannick.vignon@oss.nxp.com>

From: Yannick Vignon <yannick.vignon@nxp.com>

WIP (seems to generate warnings/errors on startup)

When an IRQ is forced threaded, execution of the handler remains protected
by local_bh_disable()/local_bh_enable() calls to keep the semantics of the
IRQ context and avoid deadlocks. However, this also creates a contention
point where a higher prio interrupt handler gets blocked by a lower prio
task already holding the lock. Even though priority inheritance kicks in in
such a case, the lower prio task can still execute for an indefinite time.

Move the stmmac interrupts to be explicitly threaded, so that high
priority traffic can be processed without delay even if another piece of
code was already running with BH disabled.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
---
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 28 +++++++++----------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 748195697e5a..8bf24902be3c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3460,8 +3460,8 @@ static int stmmac_request_irq_multi_msi(struct net_device *dev)
 	/* For common interrupt */
 	int_name = priv->int_name_mac;
 	sprintf(int_name, "%s:%s", dev->name, "mac");
-	ret = request_irq(dev->irq, stmmac_mac_interrupt,
-			  0, int_name, dev);
+	ret = request_threaded_irq(dev->irq, NULL, stmmac_mac_interrupt,
+			  IRQF_ONESHOT, int_name, dev);
 	if (unlikely(ret < 0)) {
 		netdev_err(priv->dev,
 			   "%s: alloc mac MSI %d (error: %d)\n",
@@ -3476,9 +3476,9 @@ static int stmmac_request_irq_multi_msi(struct net_device *dev)
 	if (priv->wol_irq > 0 && priv->wol_irq != dev->irq) {
 		int_name = priv->int_name_wol;
 		sprintf(int_name, "%s:%s", dev->name, "wol");
-		ret = request_irq(priv->wol_irq,
-				  stmmac_mac_interrupt,
-				  0, int_name, dev);
+		ret = request_threaded_irq(priv->wol_irq,
+				  NULL, stmmac_mac_interrupt,
+				  IRQF_ONESHOT, int_name, dev);
 		if (unlikely(ret < 0)) {
 			netdev_err(priv->dev,
 				   "%s: alloc wol MSI %d (error: %d)\n",
@@ -3494,9 +3494,9 @@ static int stmmac_request_irq_multi_msi(struct net_device *dev)
 	if (priv->lpi_irq > 0 && priv->lpi_irq != dev->irq) {
 		int_name = priv->int_name_lpi;
 		sprintf(int_name, "%s:%s", dev->name, "lpi");
-		ret = request_irq(priv->lpi_irq,
-				  stmmac_mac_interrupt,
-				  0, int_name, dev);
+		ret = request_threaded_irq(priv->lpi_irq,
+				  NULL, stmmac_mac_interrupt,
+				  IRQF_ONESHOT, int_name, dev);
 		if (unlikely(ret < 0)) {
 			netdev_err(priv->dev,
 				   "%s: alloc lpi MSI %d (error: %d)\n",
@@ -3605,8 +3605,8 @@ static int stmmac_request_irq_single(struct net_device *dev)
 	enum request_irq_err irq_err;
 	int ret;
 
-	ret = request_irq(dev->irq, stmmac_interrupt,
-			  IRQF_SHARED, dev->name, dev);
+	ret = request_threaded_irq(dev->irq, NULL, stmmac_interrupt,
+			  IRQF_SHARED | IRQF_ONESHOT, dev->name, dev);
 	if (unlikely(ret < 0)) {
 		netdev_err(priv->dev,
 			   "%s: ERROR: allocating the IRQ %d (error: %d)\n",
@@ -3619,8 +3619,8 @@ static int stmmac_request_irq_single(struct net_device *dev)
 	 * is used for WoL
 	 */
 	if (priv->wol_irq > 0 && priv->wol_irq != dev->irq) {
-		ret = request_irq(priv->wol_irq, stmmac_interrupt,
-				  IRQF_SHARED, dev->name, dev);
+		ret = request_threaded_irq(priv->wol_irq, NULL, stmmac_interrupt,
+				  IRQF_SHARED | IRQF_ONESHOT, dev->name, dev);
 		if (unlikely(ret < 0)) {
 			netdev_err(priv->dev,
 				   "%s: ERROR: allocating the WoL IRQ %d (%d)\n",
@@ -3632,8 +3632,8 @@ static int stmmac_request_irq_single(struct net_device *dev)
 
 	/* Request the IRQ lines */
 	if (priv->lpi_irq > 0 && priv->lpi_irq != dev->irq) {
-		ret = request_irq(priv->lpi_irq, stmmac_interrupt,
-				  IRQF_SHARED, dev->name, dev);
+		ret = request_threaded_irq(priv->lpi_irq, NULL, stmmac_interrupt,
+				  IRQF_SHARED | IRQF_ONESHOT, dev->name, dev);
 		if (unlikely(ret < 0)) {
 			netdev_err(priv->dev,
 				   "%s: ERROR: allocating the LPI IRQ %d (%d)\n",
-- 
2.25.1



* [RFC net-next 3/4] net: stmmac: do not use __netif_tx_lock_bh when in NAPI threaded mode
From: Yannick Vignon @ 2021-12-10 19:35 UTC
  To: Giuseppe Cavallaro, Alexandre Torgue, netdev, Ong Boon Leong,
	David S. Miller, Jakub Kicinski, Jose Abreu, Eric Dumazet,
	Wei Wang, Alexander Lobakin, Vladimir Oltean, Xiaoliang Yang,
	mingkai.hu, Joakim Zhang, sebastien.laveze
  Cc: Yannick Vignon
In-Reply-To: <20211210193556.1349090-1-yannick.vignon@oss.nxp.com>

From: Yannick Vignon <yannick.vignon@nxp.com>

In threaded mode, a NAPI instance cannot execute concurrently in a
separate context but only in its assigned kernel thread.

Replace the calls to __netif_tx_lock_bh/__netif_tx_unlock_bh with their
non-bh versions to avoid disabling BH in that case. This prevents high
priority traffic from being blocked by another piece of code already
running with BH disabled.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
---
 .../net/ethernet/stmicro/stmmac/stmmac_main.c    | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 8bf24902be3c..2190b40fa92e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2481,13 +2481,16 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
  * @queue: TX queue index
  * Description: it reclaims the transmit resources after transmission completes.
  */
-static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue)
+static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue, bool is_threaded)
 {
 	struct stmmac_tx_queue *tx_q = &priv->tx_queue[queue];
 	unsigned int bytes_compl = 0, pkts_compl = 0;
 	unsigned int entry, xmits = 0, count = 0;
 
-	__netif_tx_lock_bh(netdev_get_tx_queue(priv->dev, queue));
+	if (is_threaded)
+		__netif_tx_lock(netdev_get_tx_queue(priv->dev, queue), smp_processor_id());
+	else
+		__netif_tx_lock_bh(netdev_get_tx_queue(priv->dev, queue));
 
 	priv->xstats.tx_clean++;
 
@@ -2646,7 +2649,10 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue)
 			      STMMAC_COAL_TIMER(priv->tx_coal_timer[queue]),
 			      HRTIMER_MODE_REL);
 
-	__netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue));
+	if (is_threaded)
+		__netif_tx_unlock(netdev_get_tx_queue(priv->dev, queue));
+	else
+		__netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue));
 
 	/* Combine decisions from TX clean and XSK TX */
 	return max(count, xmits);
@@ -5377,7 +5383,7 @@ static int stmmac_napi_poll_tx(struct napi_struct *napi, int budget)
 
 	priv->xstats.napi_poll++;
 
-	work_done = stmmac_tx_clean(priv, budget, chan);
+	work_done = stmmac_tx_clean(priv, budget, chan, !!napi->thread);
 	work_done = min(work_done, budget);
 
 	if (work_done < budget && napi_complete_done(napi, work_done)) {
@@ -5401,7 +5407,7 @@ static int stmmac_napi_poll_rxtx(struct napi_struct *napi, int budget)
 
 	priv->xstats.napi_poll++;
 
-	tx_done = stmmac_tx_clean(priv, budget, chan);
+	tx_done = stmmac_tx_clean(priv, budget, chan, !!napi->thread);
 	tx_done = min(tx_done, budget);
 
 	rx_done = stmmac_rx_zc(priv, budget, chan);
-- 
2.25.1



* [RFC net-next 4/4] net: stmmac: remove unnecessary locking around PTP clock reads
From: Yannick Vignon @ 2021-12-10 19:35 UTC
  To: Giuseppe Cavallaro, Alexandre Torgue, netdev, Ong Boon Leong,
	David S. Miller, Jakub Kicinski, Jose Abreu, Eric Dumazet,
	Wei Wang, Alexander Lobakin, Vladimir Oltean, Xiaoliang Yang,
	mingkai.hu, Joakim Zhang, sebastien.laveze
  Cc: Yannick Vignon
In-Reply-To: <20211210193556.1349090-1-yannick.vignon@oss.nxp.com>

From: Yannick Vignon <yannick.vignon@nxp.com>

Reading the PTP clock is a simple operation requiring only 2 register reads,
while the time can be set atomically through another set of registers. Under
a PREEMPT_RT kernel, protecting the reads with a spin_lock is
counter-productive:
 * if the task is preempted in between the 2 reads, the returned time value
could become inconsistent,
 * if a 2nd task preempting the 1st has a higher prio but needs to
read the time as well, it will require 2 costly context switches, which
will pretty much always be more costly than disabling preemption with a real
spin_lock.

Remove the unneeded locking around the gettime call.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
index 580cc035536b..8f0dcac23b1e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
@@ -137,9 +137,7 @@ static int stmmac_get_time(struct ptp_clock_info *ptp, struct timespec64 *ts)
 	unsigned long flags;
 	u64 ns = 0;

-	spin_lock_irqsave(&priv->ptp_lock, flags);
 	stmmac_get_systime(priv, priv->ptpaddr, &ns);
-	spin_unlock_irqrestore(&priv->ptp_lock, flags);

 	*ts = ns_to_timespec64(ns);

-- 
2.25.1
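
The commit message above notes that a time value assembled from two
register reads can become inconsistent. One common lock-free way to guard
against that, shown here only as a user-space sketch and not as what this
patch or the stmmac driver does, is to re-read the seconds register and
retry if it changed. The struct fake_clock type and the read_sec_reg(),
read_nsec_reg() and read_time_ns() helpers are hypothetical stand-ins for
the hardware registers and their accessors.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Simulated hardware: a seconds register and a nanoseconds register. */
struct fake_clock {
	uint32_t sec;
	uint32_t nsec;
	int rollover_pending;	/* test hook: wrap right after the next nsec read */
};

static uint32_t read_sec_reg(struct fake_clock *c)
{
	return c->sec;
}

static uint32_t read_nsec_reg(struct fake_clock *c)
{
	uint32_t v = c->nsec;

	if (c->rollover_pending) {
		/* Emulate the counter wrapping between the two reads. */
		c->sec++;
		c->nsec = 0;
		c->rollover_pending = 0;
	}

	return v;
}

/* Read seconds, nanoseconds, then seconds again; retry on a mismatch. */
static uint64_t read_time_ns(struct fake_clock *c)
{
	uint32_t sec_before, sec_after, nsec;

	sec_after = read_sec_reg(c);
	do {
		sec_before = sec_after;
		nsec = read_nsec_reg(c);
		sec_after = read_sec_reg(c);
	} while (sec_before != sec_after);

	/* Sanity check on the simulated register contents. */
	assert(nsec < NSEC_PER_SEC);

	return (uint64_t)sec_after * NSEC_PER_SEC + nsec;
}
```

The loop only repeats when a second boundary falls between the reads, so
the common case costs three register reads and takes no lock.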


* Re: [RFC PATCH v2 net-next 0/4] DSA master state tracking
From: Ansuel Smith @ 2021-12-10 20:02 UTC
  To: Vladimir Oltean
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <20211210195441.6drqtckl2m6rbmk6@skbuf>

On Fri, Dec 10, 2021 at 07:54:42PM +0000, Vladimir Oltean wrote:
> On Fri, Dec 10, 2021 at 08:45:43PM +0100, Ansuel Smith wrote:
> > > Anyway the reason why I didn't say anything about this is because I
> > > don't yet understand how it is supposed to work. Specifically:
> > > 
> > > rtnl_lock
> > > 
> > > dev_open()
> > > -> __dev_open()
> > >    -> dev->flags |= IFF_UP;
> > >    -> dev_activate()
> > >       -> transition_one_qdisc()
> > > -> call_netdevice_notifiers(NETDEV_UP, dev);
> > > 
> > > rtnl_unlock
> > > 
> > > so the qdisc should have already transitioned by the time NETDEV_UP is
> > > emitted.
> > > 
> > > and since we already require a NETDEV_UP to have occurred, or dev->flags
> > > to contain IFF_UP, I simply don't understand the following
> > > (a) why would the qdisc be noop when we catch NETDEV_UP
> > > (b) who calls netdev_state_change() (or __dev_notify_flags ?!) after the
> > >     qdisc changes on a TX queue? If no one, then I'm not sure how we can
> > >     reliably check for the state of the qdisc if we aren't notified
> > >     about changes to it.
> > 
> > The ipv6 check is just a hint. The real clue was the second
> > NETDEV_CHANGE called by linkwatch_do_dev in link_watch.c
> > That is the one that calls the CHANGE event before the ready stuff.
> > 
> > I had problems tracking this, as the change logic is "emit CHANGE when
> > flags change", but netdev_state_change is also called for other reasons;
> > one example is dev_activate/dev_deactivate from linkwatch_do_dev.
> > It seems a bit confusing that a generic state change is emitted even when
> > flags are not changed, and because of this it is a bit problematic to
> > track why the CHANGE event was emitted.
> > 
> > I wonder if linkwatch_do_dev should be changed to introduce a flag? But
> > that seems problematic if for whatever reason a driver uses the CHANGE
> > event to track exactly dev_activate/deactivate.
> 
> Yes, I had my own "aha" moment just minutes before you sent this email
> about linkwatch_do_dev. So indeed that's the source of both the
> dev_activate(), as well as the netdev_state_change() notifier.
> 
> As to my previous question (why would the qdisc be noop when we catch
> NETDEV_UP): the answer is of course in the code as well:
> 
> dev_activate() has:
> 	if (!netif_carrier_ok(dev))
> 		/* Delay activation until next carrier-on event */
> 		return;
> 
> which is then actually picked up from linkwatch_do_dev().
> 
> Let's not change linkwatch_do_dev(), I just wanted to understand why it
> works. Please confirm that it also works for you to make master_admin_up
> depend on qdisc_tx_is_noop() instead of the current ingress_queue check,
> then add a comment stating the mechanism through which we are tracking
> the dev_activate() calls, and then this should be good to go.
> I'd like you to pick up the patches and post them together with your
> driver changes. I can't post the patches on my own since I don't have
> any use for them. I'll leave a few more "review" comments on them in a
> minute.

OK, I will run the test, but I'm confident it works.
So the idea is to send a v3 RFC that depends on the tagger-owned
private data, and to add your series, with this extra check, to mine.
(When I post v3, feel free to tell me if I messed up the code credits.)

Are the additional bool and function correct, or should we merge them and
assume a link up only when we have both the flag and the qdisc?

-- 
	Ansuel


* Re: [RFC PATCH v2 net-next 1/4] net: dsa: provide switch operations for tracking the master state
From: Vladimir Oltean @ 2021-12-10 20:10 UTC
  To: netdev@vger.kernel.org
  Cc: David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Ansuel Smith
In-Reply-To: <20211209173927.4179375-2-vladimir.oltean@nxp.com>

On Thu, Dec 09, 2021 at 07:39:24PM +0200, Vladimir Oltean wrote:
> Certain drivers may need to send management traffic to the switch for
> things like register access, FDB dump, etc, to accelerate what their
> slow bus (SPI, I2C, MDIO) can already do.
> 
> Ethernet is faster (especially in bulk transactions) but is also more
> unreliable, since the user may decide to bring the DSA master down (or
> not bring it up), therefore severing the link between the host and the
> attached switch.
> 
> Drivers needing Ethernet-based register access already should have
> fallback logic to the slow bus if the Ethernet method fails, but that
> fallback may be based on a timeout, and the I/O to the switch may slow
> down to a halt if the master is down, because every Ethernet packet will
> have to time out. The driver also doesn't have the option to turn off
> Ethernet-based I/O momentarily, because it wouldn't know when to turn it
> back on.
> 
> Which is where this change comes in. By tracking NETDEV_CHANGE,
> NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know
> the exact interval of time during which this interface is reliably
> available for traffic. Provide this information to switches so they can
> use it as they wish.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  include/net/dsa.h  | 11 +++++++++++
>  net/dsa/dsa2.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  net/dsa/dsa_priv.h | 13 +++++++++++++
>  net/dsa/slave.c    | 27 +++++++++++++++++++++++++++
>  net/dsa/switch.c   | 15 +++++++++++++++
>  5 files changed, 112 insertions(+)
> 
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index bdf308a5c55e..8690b9c6d674 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -296,6 +296,10 @@ struct dsa_port {
>  	struct list_head	fdbs;
>  	struct list_head	mdbs;
>  
> +	/* Master state bits, valid only on CPU ports */
> +	u8 master_admin_up:1,
> +	   master_oper_up:1;
> +
>  	bool setup;
>  };
>  
> @@ -1011,6 +1015,13 @@ struct dsa_switch_ops {
>  	int	(*tag_8021q_vlan_add)(struct dsa_switch *ds, int port, u16 vid,
>  				      u16 flags);
>  	int	(*tag_8021q_vlan_del)(struct dsa_switch *ds, int port, u16 vid);
> +
> +	/*
> +	 * DSA master tracking operations
> +	 */
> +	void	(*master_state_change)(struct dsa_switch *ds,
> +				       const struct net_device *master,
> +				       bool operational);
>  };
>  
>  #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> index 8814fa0e44c8..a6cb3470face 100644
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -1187,6 +1187,52 @@ int dsa_tree_change_tag_proto(struct dsa_switch_tree *dst,
>  	return err;
>  }
>  
> +static void dsa_tree_master_state_change(struct dsa_switch_tree *dst,
> +					 struct net_device *master)
> +{
> +	struct dsa_notifier_master_state_info info;
> +	struct dsa_port *cpu_dp = master->dsa_ptr;
> +
> +	info.master = master;
> +	info.operational = cpu_dp->master_admin_up && cpu_dp->master_oper_up;
> +
> +	dsa_tree_notify(dst, DSA_NOTIFIER_MASTER_STATE_CHANGE, &info);
> +}
> +
> +void dsa_tree_master_admin_state_change(struct dsa_switch_tree *dst,
> +					struct net_device *master,
> +					bool up)
> +{
> +	struct dsa_port *cpu_dp = master->dsa_ptr;
> +	bool notify = false;
> +
> +	if ((cpu_dp->master_admin_up && cpu_dp->master_oper_up) !=
> +	    (up && cpu_dp->master_oper_up))
> +		notify = true;
> +
> +	cpu_dp->master_admin_up = up;
> +
> +	if (notify)
> +		dsa_tree_master_state_change(dst, master);
> +}
> +
> +void dsa_tree_master_oper_state_change(struct dsa_switch_tree *dst,
> +				       struct net_device *master,
> +				       bool up)
> +{
> +	struct dsa_port *cpu_dp = master->dsa_ptr;
> +	bool notify = false;
> +
> +	if ((cpu_dp->master_admin_up && cpu_dp->master_oper_up) !=
> +	    (cpu_dp->master_admin_up && up))
> +		notify = true;
> +
> +	cpu_dp->master_oper_up = up;
> +
> +	if (notify)
> +		dsa_tree_master_state_change(dst, master);
> +}
> +
>  static struct dsa_port *dsa_port_touch(struct dsa_switch *ds, int index)
>  {
>  	struct dsa_switch_tree *dst = ds->dst;
> diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
> index 38ce5129a33d..c47864446adc 100644
> --- a/net/dsa/dsa_priv.h
> +++ b/net/dsa/dsa_priv.h
> @@ -43,6 +43,7 @@ enum {
>  	DSA_NOTIFIER_MRP_DEL_RING_ROLE,
>  	DSA_NOTIFIER_TAG_8021Q_VLAN_ADD,
>  	DSA_NOTIFIER_TAG_8021Q_VLAN_DEL,
> +	DSA_NOTIFIER_MASTER_STATE_CHANGE,
>  };
>  
>  /* DSA_NOTIFIER_AGEING_TIME */
> @@ -126,6 +127,12 @@ struct dsa_notifier_tag_8021q_vlan_info {
>  	u16 vid;
>  };
>  
> +/* DSA_NOTIFIER_MASTER_STATE_CHANGE */
> +struct dsa_notifier_master_state_info {
> +	const struct net_device *master;
> +	bool operational;
> +};
> +
>  struct dsa_switchdev_event_work {
>  	struct dsa_switch *ds;
>  	int port;
> @@ -506,6 +513,12 @@ int dsa_tree_change_tag_proto(struct dsa_switch_tree *dst,
>  			      struct net_device *master,
>  			      const struct dsa_device_ops *tag_ops,
>  			      const struct dsa_device_ops *old_tag_ops);
> +void dsa_tree_master_admin_state_change(struct dsa_switch_tree *dst,
> +					struct net_device *master,
> +					bool up);
> +void dsa_tree_master_oper_state_change(struct dsa_switch_tree *dst,
> +				       struct net_device *master,
> +				       bool up);
>  unsigned int dsa_bridge_num_get(const struct net_device *bridge_dev, int max);
>  void dsa_bridge_num_put(const struct net_device *bridge_dev,
>  			unsigned int bridge_num);
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 2b153b366118..9f3b25c08c13 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -2349,6 +2349,31 @@ static int dsa_slave_netdevice_event(struct notifier_block *nb,
>  		err = dsa_port_lag_change(dp, info->lower_state_info);
>  		return notifier_from_errno(err);
>  	}
> +	case NETDEV_CHANGE: {
> +		if (netdev_uses_dsa(dev)) {
> +			struct dsa_port *cpu_dp = dev->dsa_ptr;
> +			struct dsa_switch_tree *dst = cpu_dp->ds->dst;
> +
> +			dsa_tree_master_oper_state_change(dst, dev,
> +							  netif_oper_up(dev));

We must also add a call here to change the admin state, because
linkwatch_do_dev may call netdev_state_change() after dev_activate().
So it seems that "case NETDEV_CHANGE" and "case NETDEV_UP" may share the
same implementation, like this:

	case NETDEV_CHANGE:
	case NETDEV_UP: {
		if (netdev_uses_dsa(dev)) {
			struct dsa_port *cpu_dp = dev->dsa_ptr;
			struct dsa_switch_tree *dst = cpu_dp->ds->dst;

			/* admin state is "up" once the TX qdisc is no longer noop */
			dsa_tree_master_admin_state_change(dst, dev,
							   !qdisc_tx_is_noop(dev));
			dsa_tree_master_oper_state_change(dst, dev,
							  netif_oper_up(dev));

			return NOTIFY_OK;
		}

		return NOTIFY_DONE;
	}

Would be good to also add some comments.

> +
> +			return NOTIFY_OK;
> +		}
> +
> +		return NOTIFY_DONE;
> +	}
> +	case NETDEV_UP: {
> +		if (netdev_uses_dsa(dev)) {
> +			struct dsa_port *cpu_dp = dev->dsa_ptr;
> +			struct dsa_switch_tree *dst = cpu_dp->ds->dst;
> +
> +			dsa_tree_master_admin_state_change(dst, dev, true);

s/true/!qdisc_tx_is_noop(dev)/

> +
> +			return NOTIFY_OK;
> +		}
> +
> +		return NOTIFY_DONE;
> +	}
>  	case NETDEV_GOING_DOWN: {
>  		struct dsa_port *dp, *cpu_dp;
>  		struct dsa_switch_tree *dst;
> @@ -2360,6 +2385,8 @@ static int dsa_slave_netdevice_event(struct notifier_block *nb,
>  		cpu_dp = dev->dsa_ptr;
>  		dst = cpu_dp->ds->dst;
>  
> +		dsa_tree_master_admin_state_change(dst, dev, false);
> +
>  		list_for_each_entry(dp, &dst->ports, list) {
>  			if (!dsa_port_is_user(dp))
>  				continue;
> diff --git a/net/dsa/switch.c b/net/dsa/switch.c
> index 9c92edd96961..78816a6805c8 100644
> --- a/net/dsa/switch.c
> +++ b/net/dsa/switch.c
> @@ -699,6 +699,18 @@ dsa_switch_mrp_del_ring_role(struct dsa_switch *ds,
>  	return 0;
>  }
>  
> +static int
> +dsa_switch_master_state_change(struct dsa_switch *ds,
> +			       struct dsa_notifier_master_state_info *info)
> +{
> +	if (!ds->ops->master_state_change)
> +		return 0;
> +
> +	ds->ops->master_state_change(ds, info->master, info->operational);
> +
> +	return 0;
> +}
> +
>  static int dsa_switch_event(struct notifier_block *nb,
>  			    unsigned long event, void *info)
>  {
> @@ -784,6 +796,9 @@ static int dsa_switch_event(struct notifier_block *nb,
>  	case DSA_NOTIFIER_TAG_8021Q_VLAN_DEL:
>  		err = dsa_switch_tag_8021q_vlan_del(ds, info);
>  		break;
> +	case DSA_NOTIFIER_MASTER_STATE_CHANGE:
> +		err = dsa_switch_master_state_change(ds, info);
> +		break;
>  	default:
>  		err = -EOPNOTSUPP;
>  		break;
> -- 
> 2.25.1
>
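
The notification rule in the quoted dsa2.c hunk (call into the driver only
when "admin up AND oper up" actually changes) can be modeled in user space.
This is a sketch under stated assumptions, not the kernel code: struct
master_state, master_update(), master_admin_change() and
master_oper_change() are hypothetical names, and the counter stands in for
the ->master_state_change() callback.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of the two state bits kept per CPU port. */
struct master_state {
	bool admin_up;
	bool oper_up;
	unsigned int notifications;	/* modeled master_state_change() calls */
	bool last_operational;		/* value passed in the last call */
};

static bool master_operational(const struct master_state *s)
{
	return s->admin_up && s->oper_up;
}

/* Store the new bits; notify only if the combined state flipped. */
static void master_update(struct master_state *s, bool admin_up, bool oper_up)
{
	bool was_operational;

	assert(s);
	was_operational = master_operational(s);

	s->admin_up = admin_up;
	s->oper_up = oper_up;

	if (master_operational(s) != was_operational) {
		s->notifications++;
		s->last_operational = !was_operational;
	}
}

static void master_admin_change(struct master_state *s, bool up)
{
	master_update(s, up, s->oper_up);
}

static void master_oper_change(struct master_state *s, bool up)
{
	master_update(s, s->admin_up, up);
}
```

Starting from both bits clear, setting only the admin bit produces no
notification; the first notification arrives once the oper bit is set as
well, and repeating the same event does not produce a second one.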


* Re: [net v5 2/3] net: sched: add check tc_skip_classify in sch egress
From: Daniel Borkmann @ 2021-12-10 20:11 UTC
  To: Tonghao Zhang, John Fastabend
  Cc: Linux Kernel Network Developers, David S. Miller, Jakub Kicinski,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Eric Dumazet, Antoine Tenart,
	Alexander Lobakin, Wei Wang, Arnd Bergmann
In-Reply-To: <CAMDZJNUyOELOcf0dtxktCTRKv1sUrp5Z17mW+4so7tt6DFnJsw@mail.gmail.com>

On 12/10/21 8:54 PM, Tonghao Zhang wrote:
> On Sat, Dec 11, 2021 at 1:46 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>> On Sat, Dec 11, 2021 at 1:37 AM Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>>> On Sat, Dec 11, 2021 at 12:43 AM John Fastabend
>>> <john.fastabend@gmail.com> wrote:
>>>> xiangxia.m.yue@ wrote:
>>>>> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
>>>>>
>>>>> Try to resolve the issues as below:
>>>>> * We look up and then check tc_skip_classify flag in net
>>>>>    sched layer, even though skb doesn't want to be classified.
>>>>>    That case may consume a lot of cpu cycles. This patch
>>>>>    is useful when there are a lot of filters with different
>>>>>    prio. There are ~5 prios in production, ~1% improvement.
>>>>>
>>>>>    Rules as below:
>>>>>    $ for id in $(seq 1 5); do
>>>>>    $       tc filter add ... egress prio $id ... action mirred egress redirect dev ifb0
>>>>>    $ done
>>>>>
>>>>> * bpf_redirect may be invoked in egress path. If we don't
>>>>>    check the flags and then return immediately, the packets
>>>>>    will loopback.
>>>>
>>>> This would be the naive case right? Meaning the BPF program is
>>>> doing a redirect without any logic or is buggy?
>>>>
>>>> Can you map out how this happens for me, I'm not fully sure I
>>>> understand the exact concern. Is it possible for BPF programs
>>>> that used to see packets no longer see the packet as expected?
>>>>
>>>> Is this the path you are talking about?
>>> Hi John
>>> Tx ethx -> __dev_queue_xmit -> sch_handle_egress
>>> ->  execute BPF program on ethx with bpf_redirect(ifb0) ->
>>> -> ifb_xmit -> ifb_ri_tasklet -> dev_queue_xmit -> __dev_queue_xmit
>>> the packets loop back; that means bpf_redirect doesn't work with ifb
>>> netdev, right?
>>> so in sch_handle_egress, I add the check skb_skip_tc_classify().

But why would you do that? Usage like this is just broken by design.
If you need to loop anything back to RX, just use bpf_redirect() with
BPF_F_INGRESS? What is the concrete/actual rationale for ifb here?


* Re: [RFC PATCH v2 net-next 3/4] net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
From: Vladimir Oltean @ 2021-12-10 20:17 UTC
  To: netdev@vger.kernel.org
  Cc: David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Ansuel Smith
In-Reply-To: <20211209173927.4179375-4-vladimir.oltean@nxp.com>

On Thu, Dec 09, 2021 at 07:39:26PM +0200, Vladimir Oltean wrote:
> DSA needs to simulate master tracking events when a binding with a DSA
> master is first established and torn down, in order to give drivers
> the simplifying guarantee that ->master_up and ->master_going_down calls
> are made in exactly this order. To avoid races, we need to block the
> reception of NETDEV_UP/NETDEV_GOING_DOWN events in the netdev notifier
> chain while we are changing the master's dev->dsa_ptr (this changes what
> netdev_uses_dsa(dev) reports).

This paragraph needs to be updated. For one, "->master_up and
->master_going_down calls are made in exactly this order" needs to
become something like "->master_change calls are made only when the
master's readiness state to pass traffic changes". Then, "block the
reception of NETDEV_UP/NETDEV_GOING_DOWN" must also mention
"NETDEV_CHANGE".

> 
> The dsa_master_setup() and dsa_master_teardown() functions require the
> rtnl_mutex to be held if the tagger needs the master to be promiscuous,
> because these functions then call dev_set_promiscuity(). Move the
> rtnl_lock() from that function and make it top-level.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  net/dsa/dsa2.c   | 8 ++++++++
>  net/dsa/master.c | 4 ++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> index a6cb3470face..6d4422c9e334 100644
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -1015,6 +1015,8 @@ static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
>  	struct dsa_port *dp;
>  	int err;
>  
> +	rtnl_lock();
> +
>  	list_for_each_entry(dp, &dst->ports, list) {
>  		if (dsa_port_is_cpu(dp)) {
>  			err = dsa_master_setup(dp->master, dp);
> @@ -1023,6 +1025,8 @@ static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
>  		}
>  	}
>  
> +	rtnl_unlock();
> +
>  	return 0;
>  }
>  
> @@ -1030,9 +1034,13 @@ static void dsa_tree_teardown_master(struct dsa_switch_tree *dst)
>  {
>  	struct dsa_port *dp;
>  
> +	rtnl_lock();
> +
>  	list_for_each_entry(dp, &dst->ports, list)
>  		if (dsa_port_is_cpu(dp))
>  			dsa_master_teardown(dp->master);
> +
> +	rtnl_unlock();
>  }
>  
>  static int dsa_tree_setup_lags(struct dsa_switch_tree *dst)
> diff --git a/net/dsa/master.c b/net/dsa/master.c
> index f4efb244f91d..2199104ca7df 100644
> --- a/net/dsa/master.c
> +++ b/net/dsa/master.c
> @@ -267,9 +267,9 @@ static void dsa_master_set_promiscuity(struct net_device *dev, int inc)
>  	if (!ops->promisc_on_master)
>  		return;
>  
> -	rtnl_lock();
> +	ASSERT_RTNL();
> +
>  	dev_set_promiscuity(dev, inc);
> -	rtnl_unlock();
>  }
>  
>  static ssize_t tagging_show(struct device *d, struct device_attribute *attr,
> -- 
> 2.25.1
>


* Re: [RFC PATCH v2 net-next 4/4] net: dsa: replay master state events in dsa_tree_{setup,teardown}_master
From: Vladimir Oltean @ 2021-12-10 20:22 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Ansuel Smith
In-Reply-To: <20211209173927.4179375-5-vladimir.oltean@nxp.com>

On Thu, Dec 09, 2021 at 07:39:27PM +0200, Vladimir Oltean wrote:
> In order for switch driver to be able to make simple and reliable use of
> the master tracking operations, they must also be notified of the
> initial state of the DSA master, not just of the changes. This is
> because they might enable certain features only during the time when
> they know that the DSA master is up and running.
> 
> Therefore, this change explicitly checks the state of the DSA master
> under the same rtnl_mutex as we were holding during the
> dsa_master_setup() and dsa_master_teardown() call. The idea being that
> if the DSA master became operational in between the moment in which it
> became a DSA master (dsa_master_setup set dev->dsa_ptr) and the moment
> when we checked for master->flags & IFF_UP, there is a chance that we

s/master->flags & IFF_UP/the master being up/ (the condition will be
more complex, no need to spell it out)

> would emit a ->master_up() event twice. We need to avoid that by

s/master_up() event twice/master_state_change() call with no actual
state change/

> serializing the concurrent netdevice event with us. If the netdevice
> event started before, we force it to finish before we begin, because we
> take rtnl_lock before making netdev_uses_dsa() return true. So we also
> handle that early event and do nothing on it. Similarly, if the
> dev_open() attempt is concurrent with us, it will attempt to take the
> rtnl_mutex, but we're holding it. We'll see that the master flag IFF_UP
> isn't set, then when we release the rtnl_mutex we'll process the
> NETDEV_UP notifier.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  net/dsa/dsa2.c | 26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> index 6d4422c9e334..c86c9688e8cc 100644
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -1019,9 +1019,17 @@ static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
>  
>  	list_for_each_entry(dp, &dst->ports, list) {
>  		if (dsa_port_is_cpu(dp)) {
> -			err = dsa_master_setup(dp->master, dp);
> +			struct net_device *master = dp->master;
> +
> +			err = dsa_master_setup(master, dp);
>  			if (err)
>  				return err;
> +
> +			/* Replay master state event */
> +			dsa_tree_master_admin_state_change(dst, master,
> +							   master->flags & IFF_UP);

It would be good to add a "bool admin_up = (master->flags & IFF_UP) && !qdisc_tx_is_noop(master)",
to avoid the line getting too long.

> +			dsa_tree_master_oper_state_change(dst, master,
> +							  netif_oper_up(master));
>  		}
>  	}
>  
> @@ -1036,9 +1044,19 @@ static void dsa_tree_teardown_master(struct dsa_switch_tree *dst)
>  
>  	rtnl_lock();
>  
> -	list_for_each_entry(dp, &dst->ports, list)
> -		if (dsa_port_is_cpu(dp))
> -			dsa_master_teardown(dp->master);
> +	list_for_each_entry(dp, &dst->ports, list) {
> +		if (dsa_port_is_cpu(dp)) {
> +			struct net_device *master = dp->master;
> +
> +			/* Synthesizing an "admin down" state is sufficient for
> +			 * the switches to get a notification if the master is
> +			 * currently up and running.
> +			 */
> +			dsa_tree_master_admin_state_change(dst, master, false);
> +
> +			dsa_master_teardown(master);
> +		}
> +	}
>  
>  	rtnl_unlock();
>  }
> -- 
> 2.25.1
>

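The review comments above ask that the switch never see a state-change call when nothing actually changed. A minimal userspace sketch of that idea follows. It assumes readiness is simply "admin up AND oper up"; struct toy_master and both helpers are hypothetical names for illustration, not the DSA API.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-master state a DSA tree would track. */
struct toy_master {
	bool admin_up;		/* administratively up */
	bool oper_up;		/* operationally up */
	int notifications;	/* how many times the switch was notified */
};

/* Assumption: the master can pass traffic only when both states are up. */
static bool toy_master_ready(const struct toy_master *m)
{
	return m->admin_up && m->oper_up;
}

/* Record the new admin/oper state and notify only on a real transition. */
static void toy_master_update(struct toy_master *m, bool admin_up, bool oper_up)
{
	bool was_ready = toy_master_ready(m);

	m->admin_up = admin_up;
	m->oper_up = oper_up;

	/* The counter stands in for a ->master_state_change() call. */
	if (toy_master_ready(m) != was_ready)
		m->notifications++;
}
```

With this shape, replaying the current state during setup is harmless: a replay that matches what a concurrent netdevice event already reported produces no second notification.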

* [PATCH net-next] u64_stats: Disable preemption on 32bit UP+SMP PREEMPT_RT during updates.
From: Sebastian Andrzej Siewior @ 2021-12-10 20:29 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: Peter Zijlstra, Thomas Gleixner, Jakub Kicinski, David S. Miller

On PREEMPT_RT the seqcount_t for synchronisation is required on 32bit
architectures even on UP because the softirq (and the threaded IRQ handler) can
be preempted.

With the seqcount_t for synchronisation, a reader with higher priority can
preempt the writer and then spin endlessly in read_seqcount_begin() while the
writer can't make progress.

To avoid such a lockup on PREEMPT_RT the writer must disable preemption during
the update. There is no need to disable interrupts because no writer is using
this API in hard-IRQ context on PREEMPT_RT.

Disable preemption on 32bit-RT within the u64_stats write section.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/u64_stats_sync.h |   42 +++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

--- a/include/linux/u64_stats_sync.h
+++ b/include/linux/u64_stats_sync.h
@@ -66,7 +66,7 @@
 #include <linux/seqlock.h>
 
 struct u64_stats_sync {
-#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT))
 	seqcount_t	seq;
 #endif
 };
@@ -125,7 +125,7 @@ static inline void u64_stats_inc(u64_sta
 }
 #endif
 
-#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT))
 #define u64_stats_init(syncp)	seqcount_init(&(syncp)->seq)
 #else
 static inline void u64_stats_init(struct u64_stats_sync *syncp)
@@ -135,15 +135,19 @@ static inline void u64_stats_init(struct
 
 static inline void u64_stats_update_begin(struct u64_stats_sync *syncp)
 {
-#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT))
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
 	write_seqcount_begin(&syncp->seq);
 #endif
 }
 
 static inline void u64_stats_update_end(struct u64_stats_sync *syncp)
 {
-#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT))
 	write_seqcount_end(&syncp->seq);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 #endif
 }
 
@@ -152,8 +156,11 @@ u64_stats_update_begin_irqsave(struct u6
 {
 	unsigned long flags = 0;
 
-#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
-	local_irq_save(flags);
+#if BITS_PER_LONG == 32 && (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT))
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+	else
+		local_irq_save(flags);
 	write_seqcount_begin(&syncp->seq);
 #endif
 	return flags;
@@ -163,15 +170,18 @@ static inline void
 u64_stats_update_end_irqrestore(struct u64_stats_sync *syncp,
 				unsigned long flags)
 {
-#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT))
 	write_seqcount_end(&syncp->seq);
-	local_irq_restore(flags);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
+	else
+		local_irq_restore(flags);
 #endif
 }
 
 static inline unsigned int __u64_stats_fetch_begin(const struct u64_stats_sync *syncp)
 {
-#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT))
 	return read_seqcount_begin(&syncp->seq);
 #else
 	return 0;
@@ -180,7 +190,7 @@ static inline unsigned int __u64_stats_f
 
 static inline unsigned int u64_stats_fetch_begin(const struct u64_stats_sync *syncp)
 {
-#if BITS_PER_LONG==32 && !defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && (!defined(CONFIG_SMP) && !defined(CONFIG_PREEMPT_RT))
 	preempt_disable();
 #endif
 	return __u64_stats_fetch_begin(syncp);
@@ -189,7 +199,7 @@ static inline unsigned int u64_stats_fet
 static inline bool __u64_stats_fetch_retry(const struct u64_stats_sync *syncp,
 					 unsigned int start)
 {
-#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT))
 	return read_seqcount_retry(&syncp->seq, start);
 #else
 	return false;
@@ -199,7 +209,7 @@ static inline bool __u64_stats_fetch_ret
 static inline bool u64_stats_fetch_retry(const struct u64_stats_sync *syncp,
 					 unsigned int start)
 {
-#if BITS_PER_LONG==32 && !defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && (!defined(CONFIG_SMP) && !defined(CONFIG_PREEMPT_RT))
 	preempt_enable();
 #endif
 	return __u64_stats_fetch_retry(syncp, start);
@@ -213,7 +223,9 @@ static inline bool u64_stats_fetch_retry
  */
 static inline unsigned int u64_stats_fetch_begin_irq(const struct u64_stats_sync *syncp)
 {
-#if BITS_PER_LONG==32 && !defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && defined(CONFIG_PREEMPT_RT)
+	preempt_disable();
+#elif BITS_PER_LONG == 32 && !defined(CONFIG_SMP)
 	local_irq_disable();
 #endif
 	return __u64_stats_fetch_begin(syncp);
@@ -222,7 +234,9 @@ static inline unsigned int u64_stats_fet
 static inline bool u64_stats_fetch_retry_irq(const struct u64_stats_sync *syncp,
 					     unsigned int start)
 {
-#if BITS_PER_LONG==32 && !defined(CONFIG_SMP)
+#if BITS_PER_LONG == 32 && defined(CONFIG_PREEMPT_RT)
+	preempt_enable();
+#elif BITS_PER_LONG == 32 && !defined(CONFIG_SMP)
 	local_irq_enable();
 #endif
 	return __u64_stats_fetch_retry(syncp, start);

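The lockup described in the u64_stats patch can be illustrated with a small single-threaded model of a sequence counter. This is only a sketch under stated assumptions: it is not the kernel's seqcount_t, it has no memory barriers, and struct toy_stats and the toy_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a sequence counter protecting a 64-bit statistic that a
 * 32-bit CPU has to update in two halves. All names are hypothetical.
 */
struct toy_stats {
	unsigned int seq;	/* odd while an update is in progress */
	uint32_t lo, hi;	/* two halves of the 64-bit counter */
};

static void toy_write_begin(struct toy_stats *s)
{
	/* The patch disables preemption on PREEMPT_RT before this point. */
	s->seq++;
	assert(s->seq & 1);
}

static void toy_write_end(struct toy_stats *s)
{
	s->seq++;
	assert(!(s->seq & 1));
}

/* One reader attempt. Returns false when the reader would have to spin
 * or retry. A higher-priority reader that preempted the writer would
 * get stuck here forever, because the writer never gets to finish.
 */
static bool toy_read_once(const struct toy_stats *s, uint64_t *val)
{
	unsigned int start = s->seq;

	if (start & 1)
		return false;	/* writer in progress: real readers spin */

	*val = ((uint64_t)s->hi << 32) | s->lo;

	return s->seq == start;	/* false would mean a torn snapshot */
}
```

A reader that runs between toy_write_begin() and toy_write_end() can never succeed. Keeping the writer from being preempted inside that window is what guarantees the window closes.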

* [PATCH net-next v2] net: Enable max_dgram_qlen unix sysctl to be configurable by non-init user namespaces
From: Joanne Koong @ 2021-12-10 20:40 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, kuba, davem, Kernel-team, kafai, Joanne Koong

This patch enables the "/proc/sys/net/unix/max_dgram_qlen" sysctl to be
exposed to non-init user namespaces. max_dgram_qlen is used as the default
"sk_max_ack_backlog" value when a unix socket is created.

Currently, when a networking namespace is initialized, its unix sysctls
are exposed only if the user namespace that "owns" it is the init user
namespace. If there is a non-init user namespace that "owns" a networking
namespace (for example, in the case after we call clone() with both
CLONE_NEWUSER and CLONE_NEWNET set), the sysctls are hidden from view
and not configurable.

Exposing the unix sysctl is safe because any changes made to it will be
limited in scope to the networking namespace the non-init user namespace
"owns" and has privileges over (changes won't affect any other net
namespace). There is also no possibility of a non-privileged user namespace
messing up the net namespace sysctls it shares with its parent user namespace.
When a new user namespace is created without unsharing the network namespace
(e.g. calling clone() with CLONE_NEWUSER), the new user namespace shares its
parent's network namespace. Write access is protected by the mode set
in the sysctl's ctl_table (and enforced by procfs). Here in the case of
"max_dgram_qlen", 0644 is set; only the user owner has write access.

v1 -> v2:
* Add more detail to commit message, specify the
"/proc/sys/net/unix/max_dgram_qlen" sysctl in commit message.

Signed-off-by: Joanne Koong <joannekoong@fb.com>
---
 net/unix/sysctl_net_unix.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/unix/sysctl_net_unix.c b/net/unix/sysctl_net_unix.c
index c09bea89151b..01d44e2598e2 100644
--- a/net/unix/sysctl_net_unix.c
+++ b/net/unix/sysctl_net_unix.c
@@ -30,10 +30,6 @@ int __net_init unix_sysctl_register(struct net *net)
 	if (table == NULL)
 		goto err_alloc;

-	/* Don't export sysctls to unprivileged users */
-	if (net->user_ns != &init_user_ns)
-		table[0].procname = NULL;
-
 	table[0].data = &net->unx.sysctl_max_dgram_qlen;
 	net->unx.ctl = register_net_sysctl(net, "net/unix", table);
 	if (net->unx.ctl == NULL)
-- 
2.30.2

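The commit message above relies on the 0644 mode meaning that only the owner may write. As a rough illustration, the sketch below decodes classic Unix permission bits. It is a simplified model, not the kernel's sysctl permission check (which also takes capabilities and namespace ownership into account), and every toy_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_who { TOY_OWNER, TOY_GROUP, TOY_OTHER };

#define TOY_READ	04
#define TOY_WRITE	02

/* Simplified model of classic Unix permission bits. */
static bool toy_mode_allows(unsigned int mode, enum toy_who who,
			    unsigned int want)
{
	unsigned int shift = (who == TOY_OWNER) ? 6 :
			     (who == TOY_GROUP) ? 3 : 0;
	unsigned int bits = (mode >> shift) & 07;

	assert((want & ~07u) == 0);	/* only rwx bits may be requested */

	return (bits & want) == want;
}
```

Under this model, mode 0644 grants read and write to the owner and read only to group and others, which matches the "only the user owner has write access" statement.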

* Re: [PATCH intel-net 2/5] ice: xsk: allocate separate memory for XDP SW ring
From: Nguyen, Anthony L @ 2021-12-10 21:05 UTC (permalink / raw)
  To: Fijalkowski, Maciej, intel-wired-lan@lists.osuosl.org
  Cc: Mathew, Elza, Karlsson, Magnus, netdev@vger.kernel.org,
	kuba@kernel.org, bpf@vger.kernel.org, davem@davemloft.net
In-Reply-To: <20211210145941.5865-3-maciej.fijalkowski@intel.com>

On Fri, 2021-12-10 at 15:59 +0100, Maciej Fijalkowski wrote:
> @@ -425,9 +430,8 @@ static void ice_bump_ntc(struct ice_rx_ring
> *rx_ring)
>   * Returns the skb on success, NULL on failure.
>   */
>  static struct sk_buff *
> -ice_construct_skb_zc(struct ice_rx_ring *rx_ring, struct xdp_buff
> **xdp_arr)
> +ice_construct_skb_zc(struct ice_rx_ring *rx_ring, struct xdp_buff
> *xdp)

There's a kdoc issue here.

drivers/net/ethernet/intel/ice/ice_xsk.c:436: warning: Function
parameter or member 'xdp' not described in 'ice_construct_skb_zc'
drivers/net/ethernet/intel/ice/ice_xsk.c:436: warning: Excess function
parameter 'xdp_arr' description in 'ice_construct_skb_zc'

Thanks,
Tony



* Re: [PATCH net v2] tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.
From: Martin KaFai Lau @ 2021-12-10 21:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Jakub Kicinski, Kuniyuki Iwashima, eric.dumazet, davem, dsahern,
	efault, netdev, tglx, yoshfuji
In-Reply-To: <YbN+edCO3DGMDPmj@linutronix.de>

On Fri, Dec 10, 2021 at 05:21:13PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-12-09 12:06:32 [-0800], Martin KaFai Lau wrote:
> > > local_bh_disable() + spin_lock(&ilb->lock):
> > >   inet_listen()
> > >     inet_csk_listen_start()
> > >       sk->sk_prot->hash() := inet_hash()
> > > 	local_bh_disable()
> > > 	__inet_hash()
> > > 	  spin_lock(&ilb->lock);
> > > 	    acquire(&ilb->lock);
> > > 
> > > Reverse order: spin_lock(&ilb->lock) + local_bh_disable():
> > >   tcp_seq_next()
> > >     listening_get_next()
> > >       spin_lock(&ilb->lock);
> > The net tree has already been using ilb2 instead of ilb.
> > It does not change the problem though but updating
> > the commit log will be useful to avoid future confusion.
> 
> You think so? But having ilb2 and ilb might suggest that these two are
> different locks while they are the same. I could repost it early next
> week if you this actually confuses more…
Yes, they are different locks.  ilb2->lock is also taken
in the inet_listen() path.  ilb->lock is not even taken
in the listening_get_next() side.


* Re: [PATCH net-next v4 1/2] devicetree: bindings: net: Add bindings doc for Sunplus SP7021.
From: Rob Herring @ 2021-12-10 21:33 UTC (permalink / raw)
  To: Wells Lu
  Cc: davem, kuba, netdev, devicetree, linux-kernel, p.zabel, wells.lu,
	vincent.shih
In-Reply-To: <1638864419-17501-2-git-send-email-wellslutw@gmail.com>

On Tue, Dec 07, 2021 at 04:06:58PM +0800, Wells Lu wrote:
> Add bindings documentation for Sunplus SP7021.
> 
> Signed-off-by: Wells Lu <wellslutw@gmail.com>
> ---
> Changes in v4
>   - Addressed all comments from Mr. Andrew Lunn.
>     - Moved properties 'nvmem-cells' and 'nvmem-cell-names' to port of ethernet-ports.
>     - Changed value of property 'nvmem-cell-names' to "mac-address".
> 
>  .../bindings/net/sunplus,sp7021-emac.yaml          | 172 +++++++++++++++++++++
>  MAINTAINERS                                        |   7 +
>  2 files changed, 179 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/sunplus,sp7021-emac.yaml
> 
> diff --git a/Documentation/devicetree/bindings/net/sunplus,sp7021-emac.yaml b/Documentation/devicetree/bindings/net/sunplus,sp7021-emac.yaml
> new file mode 100644
> index 0000000..efc987f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/sunplus,sp7021-emac.yaml
> @@ -0,0 +1,172 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +# Copyright (C) Sunplus Co., Ltd. 2021
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/net/sunplus,sp7021-emac.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Sunplus SP7021 Dual Ethernet MAC Device Tree Bindings
> +
> +maintainers:
> +  - Wells Lu <wellslutw@gmail.com>
> +
> +description: |
> +  Sunplus SP7021 dual 10M/100M Ethernet MAC controller.
> +  Device node of the controller has following properties.
> +
> +properties:
> +  compatible:
> +    const: sunplus,sp7021-emac
> +
> +  reg:
> +    items:
> +      - description: the EMAC registers
> +      - description: the MOON5 registers
> +
> +  reg-names:
> +    items:
> +      - const: emac
> +      - const: moon5
> +
> +  interrupts:
> +    description: |
> +      Contains number and type of interrupt. Number should be 66.

Drop. That's every 'interrupts' and the exact number is outside the 
scope of the binding.

> +      Type should be high-level trigger.
> +    maxItems: 1
> +
> +  clocks:
> +    maxItems: 1
> +
> +  resets:
> +    maxItems: 1
> +
> +  ethernet-ports:
> +    type: object
> +    description: Ethernet ports to PHY
> +
> +    properties:
> +      "#address-cells":
> +        const: 1
> +
> +      "#size-cells":
> +        const: 0
> +
> +    patternProperties:
> +      "^port@[0-1]$":
> +        type: object
> +        description: Port to PHY
> +
> +        properties:
> +          reg:
> +            minimum: 0
> +            maximum: 1
> +
> +          phy-handle:
> +            maxItems: 1
> +
> +          phy-mode:
> +            maxItems: 1
> +
> +          nvmem-cells:
> +            items:
> +              - description: nvmem cell address of MAC address
> +
> +          nvmem-cell-names:
> +            description: names corresponding to the nvmem cells
> +            items:
> +              - const: mac-address
> +
> +        required:
> +          - reg
> +          - phy-handle
> +          - phy-mode
> +          - nvmem-cells
> +          - nvmem-cell-names
> +
> +  mdio:

Just need:

       $ref: mdio.yaml#
       unevaluatedProperties: false

and drop the rest.

> +    type: object
> +    description: external MDIO Bus
> +
> +    properties:
> +      "#address-cells":
> +        const: 1
> +
> +      "#size-cells":
> +        const: 0
> +
> +    patternProperties:
> +      "^ethernet-phy@[0-9a-f]+$":
> +        type: object
> +        description: external PHY node
> +
> +        properties:
> +          reg:
> +            minimum: 0
> +            maximum: 30
> +
> +        required:
> +          - reg
> +
> +additionalProperties: false
> +
> +required:
> +  - compatible
> +  - reg
> +  - reg-names
> +  - interrupts
> +  - clocks
> +  - resets
> +  - pinctrl-0
> +  - pinctrl-names
> +  - ethernet-ports
> +  - mdio
> +
> +examples:
> +  - |
> +    #include <dt-bindings/interrupt-controller/irq.h>
> +    emac: emac@9c108000 {

ethernet@9c108000 {

> +        compatible = "sunplus,sp7021-emac";
> +        reg = <0x9c108000 0x400>, <0x9c000280 0x80>;
> +        reg-names = "emac", "moon5";
> +        interrupt-parent = <&intc>;
> +        interrupts = <66 IRQ_TYPE_LEVEL_HIGH>;
> +        clocks = <&clkc 0xa7>;
> +        resets = <&rstc 0x97>;
> +        pinctrl-0 = <&emac_demo_board_v3_pins>;
> +        pinctrl-names = "default";
> +
> +        ethernet-ports {
> +            #address-cells = <1>;
> +            #size-cells = <0>;
> +
> +            port@0 {
> +                reg = <0>;
> +                phy-handle = <&eth_phy0>;
> +                phy-mode = "rmii";
> +                nvmem-cells = <&mac_addr0>;
> +                nvmem-cell-names = "mac-address";
> +            };
> +
> +            port@1 {
> +                reg = <1>;
> +                phy-handle = <&eth_phy1>;
> +                phy-mode = "rmii";
> +                nvmem-cells = <&mac_addr1>;
> +                nvmem-cell-names = "mac-address";
> +            };
> +        };
> +
> +        mdio {
> +            #address-cells = <1>;
> +            #size-cells = <0>;
> +
> +            eth_phy0: ethernet-phy@0 {
> +                reg = <0>;
> +            };
> +
> +            eth_phy1: ethernet-phy@1 {
> +                reg = <1>;
> +            };
> +        };
> +    };
> +...
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 0dc08cd..5b1ef9d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -18245,6 +18245,13 @@ L:	netdev@vger.kernel.org
>  S:	Maintained
>  F:	drivers/net/ethernet/dlink/sundance.c
>  
> +SUNPLUS ETHERNET DRIVER
> +M:	Wells Lu <wellslutw@gmail.com>
> +L:	netdev@vger.kernel.org
> +S:	Maintained
> +W:	https://sunplus-tibbo.atlassian.net/wiki/spaces/doc/overview
> +F:	Documentation/devicetree/bindings/net/sunplus,sp7021-emac.yaml
> +
>  SUPERH
>  M:	Yoshinori Sato <ysato@users.sourceforge.jp>
>  M:	Rich Felker <dalias@libc.org>
> -- 
> 2.7.4
> 
> 


* Re: [PATCH v4 7/7] dt-bindings: net: phy: Add 10-baseT1L 2.4 Vpp
From: Rob Herring @ 2021-12-10 22:09 UTC (permalink / raw)
  To: alexandru.tachici
  Cc: o.rempel, kuba, netdev, hkallweit1, davem, robh+dt, linux-kernel,
	linux, devicetree, andrew
In-Reply-To: <20211210110509.20970-8-alexandru.tachici@analog.com>

On Fri, 10 Dec 2021 13:05:09 +0200, alexandru.tachici@analog.com wrote:
> From: Alexandru Tachici <alexandru.tachici@analog.com>
> 
> Add a tristate property to advertise desired transmit level.
> 
> If the device supports the 2.4 Vpp operating mode for 10BASE-T1L,
> as defined in 802.3gc, and the 2.4 Vpp transmit voltage operation
> is desired, property should be set to 1. This property is used
> to select whether Auto-Negotiation advertises a request to
> operate the 10BASE-T1L PHY in increased transmit level mode.
> 
> If property is set to 1, the PHY shall advertise a request
> to operate the 10BASE-T1L PHY in increased transmit level mode.
> If property is set to zero, the PHY shall not advertise
> a request to operate the 10BASE-T1L PHY in increased transmit level mode.
> 
> Signed-off-by: Alexandru Tachici <alexandru.tachici@analog.com>
> ---
>  Documentation/devicetree/bindings/net/ethernet-phy.yaml | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 

Reviewed-by: Rob Herring <robh@kernel.org>


* [PATCH net-next v2 0/2] net: ipa: fix SDX55 interconnects
From: Alex Elder @ 2021-12-10 22:31 UTC (permalink / raw)
  To: davem, kuba
  Cc: david, manivannan.sadhasivam, jponduru, avuyyuru, bjorn.andersson,
	agross, robh+dt, cpratapa, subashab, mka, evgreen, elder, netdev,
	devicetree, linux-arm-msm, linux-kernel

The SDX55 SoC has IPA v4.5.  It currently represents the path
between IPA and main memory using two consecutive interconnects.
This was an optimization--not required for correct operation--and
complicates things unnecessarily.  It also does not conform to the
IPA binding (as pointed out by David Heidelberg).

This series fixes this by combining the two interconnects into one.

Version 2 simply adds a few missed e-mail addressees; there is no
change to the patch content.

					-Alex
Alex Elder (2):
  ARM: dts: qcom: sdx55: fix IPA interconnect definitions
  net: ipa: fix IPA v4.5 interconnect data

 arch/arm/boot/dts/qcom-sdx55.dtsi | 6 ++----
 drivers/net/ipa/ipa_data-v4.5.c   | 7 +------
 2 files changed, 3 insertions(+), 10 deletions(-)

-- 
2.32.0


* [PATCH net-next v2 1/2] ARM: dts: qcom: sdx55: fix IPA interconnect definitions
From: Alex Elder @ 2021-12-10 22:31 UTC (permalink / raw)
  To: davem, kuba, robh+dt
  Cc: david, manivannan.sadhasivam, jponduru, avuyyuru, bjorn.andersson,
	agross, cpratapa, subashab, mka, evgreen, elder, netdev,
	devicetree, linux-arm-msm, linux-kernel
In-Reply-To: <20211210223123.98586-1-elder@linaro.org>

The first two interconnects defined for IPA on the SDX55 SoC are
really two parts of what should be represented as a single path
between IPA and system memory.

Fix this by combining the "memory-a" and "memory-b" interconnects
into a single "memory" interconnect.

Reported-by: David Heidelberg <david@ixit.cz>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Alex Elder <elder@linaro.org>
---
 arch/arm/boot/dts/qcom-sdx55.dtsi | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/arm/boot/dts/qcom-sdx55.dtsi b/arch/arm/boot/dts/qcom-sdx55.dtsi
index 44526ad9d210b..eee2f63b9bbab 100644
--- a/arch/arm/boot/dts/qcom-sdx55.dtsi
+++ b/arch/arm/boot/dts/qcom-sdx55.dtsi
@@ -333,12 +333,10 @@ ipa: ipa@1e40000 {
 			clocks = <&rpmhcc RPMH_IPA_CLK>;
 			clock-names = "core";
 
-			interconnects = <&system_noc MASTER_IPA &system_noc SLAVE_SNOC_MEM_NOC_GC>,
-					<&mem_noc MASTER_SNOC_GC_MEM_NOC &mc_virt SLAVE_EBI_CH0>,
+			interconnects = <&system_noc MASTER_IPA &mc_virt SLAVE_EBI_CH0>,
 					<&system_noc MASTER_IPA &system_noc SLAVE_OCIMEM>,
 					<&mem_noc MASTER_AMPSS_M0 &system_noc SLAVE_IPA_CFG>;
-			interconnect-names = "memory-a",
-					     "memory-b",
+			interconnect-names = "memory",
 					     "imem",
 					     "config";
 
-- 
2.32.0



* [PATCH net-next v2 2/2] net: ipa: fix IPA v4.5 interconnect data
From: Alex Elder @ 2021-12-10 22:31 UTC (permalink / raw)
  To: davem, kuba
  Cc: david, manivannan.sadhasivam, jponduru, avuyyuru, bjorn.andersson,
	agross, robh+dt, cpratapa, subashab, mka, evgreen, elder, netdev,
	devicetree, linux-arm-msm, linux-kernel
In-Reply-To: <20211210223123.98586-1-elder@linaro.org>

Update the definition of the IPA interconnects for IPA v4.5 so
the path between IPA and system memory is represented by a single
"memory" interconnect.

Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/ipa_data-v4.5.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/ipa/ipa_data-v4.5.c b/drivers/net/ipa/ipa_data-v4.5.c
index e62ab9c3ac672..2da2c4194f2e6 100644
--- a/drivers/net/ipa/ipa_data-v4.5.c
+++ b/drivers/net/ipa/ipa_data-v4.5.c
@@ -420,15 +420,10 @@ static const struct ipa_mem_data ipa_mem_data = {
 /* Interconnect rates are in 1000 byte/second units */
 static const struct ipa_interconnect_data ipa_interconnect_data[] = {
 	{
-		.name			= "memory-a",
+		.name			= "memory",
 		.peak_bandwidth		= 600000,	/* 600 MBps */
 		.average_bandwidth	= 150000,	/* 150 MBps */
 	},
-	{
-		.name			= "memory-b",
-		.peak_bandwidth		= 1804000,	/* 1.804 GBps */
-		.average_bandwidth	= 150000,	/* 150 MBps */
-	},
 	/* Average rate is unused for the next two interconnects */
 	{
 		.name			= "imem",
-- 
2.32.0


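The interconnect table in the patch above states its rates in units of 1000 bytes per second. The arithmetic behind the "600 MBps" and "150 MBps" comments can be sketched as follows; both helper names are hypothetical and are not part of the IPA driver.

```c
#include <stdint.h>

/* Rates are given in units of 1000 bytes per second, so multiplying by
 * 1000 yields bytes per second.
 */
static uint64_t toy_icc_rate_to_bytes_per_sec(uint32_t rate)
{
	return (uint64_t)rate * 1000;
}

/* Whole megabytes per second, using decimal megabytes (10^6 bytes). */
static uint32_t toy_icc_rate_to_mbps(uint32_t rate)
{
	return rate / 1000;
}
```

For example, a peak_bandwidth of 600000 corresponds to 600,000,000 bytes per second, or 600 MBps, and an average_bandwidth of 150000 corresponds to 150 MBps.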

* Re: [syzbot] KASAN: use-after-free Read in remove_wait_queue (3)
From: syzbot @ 2021-12-10 22:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, netdev, syzkaller-bugs, viro
In-Reply-To: <000000000000e8f8f505d0e479a5@google.com>

syzbot has found a reproducer for the following issue on:

HEAD commit:    e5d75fc20b92 sh_eth: Use dev_err_probe() helper
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1540cdceb00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=24fd48984584829b
dashboard link: https://syzkaller.appspot.com/bug?extid=cdb5dd11c97cc532efad
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=15de00bab00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15ad646db00000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897
Read of size 8 at addr ffff888015be3740 by task syz-executor161/3598

CPU: 1 PID: 3598 Comm: syz-executor161 Not tainted 5.16.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
 __kasan_report mm/kasan/report.c:433 [inline]
 kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
 __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897
 lock_acquire kernel/locking/lockdep.c:5637 [inline]
 lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
 remove_wait_queue+0x1d/0x180 kernel/sched/wait.c:55
 ep_remove_wait_queue+0x88/0x1a0 fs/eventpoll.c:545
 ep_unregister_pollwait fs/eventpoll.c:561 [inline]
 ep_remove+0x106/0x9c0 fs/eventpoll.c:690
 eventpoll_release_file+0xe1/0x130 fs/eventpoll.c:923
 eventpoll_release include/linux/eventpoll.h:53 [inline]
 __fput+0x87b/0x9f0 fs/file_table.c:271
 task_work_run+0xdd/0x1a0 kernel/task_work.c:164
 tracehook_notify_resume include/linux/tracehook.h:189 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
 exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
 syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f3167c0def3
Code: c7 c2 c0 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8
RSP: 002b:00007ffddef2e488 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007f3167c0def3
RDX: 000000000000002f RSI: 0000000020001340 RDI: 0000000000000004
RBP: 0000000000000000 R08: 0000000000000014 R09: 00007ffddef2e4b0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffddef2e4ac
R13: 00007ffddef2e4c0 R14: 00007ffddef2e500 R15: 0000000000000000
 </TASK>

Allocated by task 3598:
 kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
 kasan_set_track mm/kasan/common.c:46 [inline]
 set_alloc_info mm/kasan/common.c:434 [inline]
 ____kasan_kmalloc mm/kasan/common.c:513 [inline]
 ____kasan_kmalloc mm/kasan/common.c:472 [inline]
 __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:522
 kmalloc include/linux/slab.h:590 [inline]
 psi_trigger_create.part.0+0x15e/0x7f0 kernel/sched/psi.c:1141
 cgroup_pressure_write+0x15d/0x6b0 kernel/cgroup/cgroup.c:3645
 cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852
 kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
 call_write_iter include/linux/fs.h:2162 [inline]
 new_sync_write+0x429/0x660 fs/read_write.c:503
 vfs_write+0x7cd/0xae0 fs/read_write.c:590
 ksys_write+0x12d/0x250 fs/read_write.c:643
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 3598:
 kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
 kasan_set_track+0x21/0x30 mm/kasan/common.c:46
 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
 ____kasan_slab_free mm/kasan/common.c:366 [inline]
 ____kasan_slab_free mm/kasan/common.c:328 [inline]
 __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:1723 [inline]
 slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
 slab_free mm/slub.c:3513 [inline]
 kfree+0xf6/0x560 mm/slub.c:4561
 cgroup_pressure_write+0x18d/0x6b0 kernel/cgroup/cgroup.c:3651
 cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852
 kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
 call_write_iter include/linux/fs.h:2162 [inline]
 new_sync_write+0x429/0x660 fs/read_write.c:503
 vfs_write+0x7cd/0xae0 fs/read_write.c:590
 ksys_write+0x12d/0x250 fs/read_write.c:643
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff888015be3700
 which belongs to the cache kmalloc-192 of size 192
The buggy address is located 64 bytes inside of
 192-byte region [ffff888015be3700, ffff888015be37c0)
The buggy address belongs to the page:
page:ffffea000056f8c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x15be3
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 0000000000000000 dead000000000001 ffff888010c41a00
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 1983850449, free_ts 0
 prep_new_page mm/page_alloc.c:2418 [inline]
 get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
 alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2036
 alloc_pages+0x29f/0x300 mm/mempolicy.c:2186
 alloc_slab_page mm/slub.c:1793 [inline]
 allocate_slab mm/slub.c:1930 [inline]
 new_slab+0x32d/0x4a0 mm/slub.c:1993
 ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
 slab_alloc_node mm/slub.c:3200 [inline]
 slab_alloc mm/slub.c:3242 [inline]
 kmem_cache_alloc_trace+0x289/0x2c0 mm/slub.c:3259
 kmalloc include/linux/slab.h:590 [inline]
 kzalloc include/linux/slab.h:724 [inline]
 call_usermodehelper_setup+0x97/0x340 kernel/umh.c:365
 kobject_uevent_env+0xf73/0x1650 lib/kobject_uevent.c:614
 version_sysfs_builtin kernel/params.c:878 [inline]
 param_sysfs_init+0x146/0x43b kernel/params.c:969
 do_one_initcall+0x103/0x650 init/main.c:1297
 do_initcall_level init/main.c:1370 [inline]
 do_initcalls init/main.c:1386 [inline]
 do_basic_setup init/main.c:1405 [inline]
 kernel_init_freeable+0x6b1/0x73a init/main.c:1610
 kernel_init+0x1a/0x1d0 init/main.c:1499
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
page_owner free stack trace missing

Memory state around the buggy address:
 ffff888015be3600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888015be3680: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
>ffff888015be3700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
 ffff888015be3780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff888015be3800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
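
For readers decoding the memory-state dump above: in generic KASAN each shadow byte describes 8 bytes of memory, so each dump row covers 128 bytes, and an offset of 64 bytes into the object lands on the ninth shadow byte of its row, which is where the caret points. The sketch below is only an illustration, not kernel code. The enum and function names are made up, and the shadow values (0xfc for a slab redzone, 0xfa/0xfb for freed slab memory) are assumed from mm/kasan/kasan.h of roughly this kernel version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic KASAN: one shadow byte tracks 8 bytes of memory. */
#define SHADOW_GRANULE 8u

/* Hypothetical classification of shadow values seen in slab dumps. */
enum shadow_kind {
	SHADOW_OK,       /* 0x00: all 8 bytes accessible */
	SHADOW_PARTIAL,  /* 0x01..0x07: only the first N bytes accessible */
	SHADOW_REDZONE,  /* 0xfc: slab redzone (assumed value) */
	SHADOW_FREED,    /* 0xfa / 0xfb: freed slab object (assumed values) */
	SHADOW_OTHER
};

static enum shadow_kind classify_shadow(uint8_t s)
{
	if (s == 0x00)
		return SHADOW_OK;
	if (s < SHADOW_GRANULE)
		return SHADOW_PARTIAL;
	if (s == 0xfc)
		return SHADOW_REDZONE;
	if (s == 0xfa || s == 0xfb)
		return SHADOW_FREED;
	return SHADOW_OTHER;
}

/* Index of the shadow byte covering addr in the dump row that starts at row_base. */
static size_t shadow_index(uint64_t row_base, uint64_t addr)
{
	assert(addr >= row_base);
	return (size_t)((addr - row_base) / SHADOW_GRANULE);
}
```

With the values from this report, shadow_index(0xffff888015be3700, 0xffff888015be3700 + 64) is 8, and the shadow byte at that position in the marked row is 0xfb, which classifies as freed. That is consistent with the "Freed by task" stack above.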



* Re: [PATCH bpf-next v3 7/9] net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF
From: Pablo Neira Ayuso @ 2021-12-10 23:03 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211210153129.srb6p2ebzhl5yyzh@apollo.legion>

On Fri, Dec 10, 2021 at 09:01:29PM +0530, Kumar Kartikeya Dwivedi wrote:
> On Fri, Dec 10, 2021 at 08:39:14PM IST, Pablo Neira Ayuso wrote:
> > On Fri, Dec 10, 2021 at 06:32:28PM +0530, Kumar Kartikeya Dwivedi wrote:
> > [...]
> > >  net/netfilter/nf_conntrack_core.c | 252 ++++++++++++++++++++++++++++++
> > >  7 files changed, 497 insertions(+), 1 deletion(-)
> > >
> > [...]
> > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > > index 770a63103c7a..85042cb6f82e 100644
> > > --- a/net/netfilter/nf_conntrack_core.c
> > > +++ b/net/netfilter/nf_conntrack_core.c
> >
> > Please, keep this new code away from net/netfilter/nf_conntrack_core.c
> 
> Ok. Can it be a new file under net/netfilter, or should it live elsewhere?

IPVS and OVS have been using conntrack for quite some time already, and
they keep their code in their respective folders.

Thanks.


* pull-request v2: bpf-next 2021-12-10
From: Andrii Nakryiko @ 2021-12-10 23:47 UTC (permalink / raw)
  To: davem; +Cc: kuba, daniel, ast, netdev, bpf, andrii, kernel-team

Hi David, hi Jakub,

The following pull-request contains BPF updates for your *net-next* tree.

There are three merge conflicts between bpf and bpf-next:

1. Documentation/bpf/index.rst. Please just drop the libbpf and BTF sections,
   so that the resulting content is like this:

  [...]

  This kernel side documentation is still work in progress.
  The Cilium project also maintains a `BPF and XDP Reference Guide`_
  that goes into great technical depth about the BPF Architecture.

  .. toctree::
     :maxdepth: 1
  
     instruction-set
     verifier
  [...]

2. kernel/bpf/btf.c. There was a big chunk of code added at the end, but git
   is confused about #endif. Please keep the original #endif (corresponding to
   #ifdef CONFIG_DEBUG_INFO_BTF_MODULES) and all the newly added code goes to
   the end of the file:

  --- a/kernel/bpf/btf.c
  +++ b/kernel/bpf/btf.c
  @@@ -6418,384 -6390,4 +6409,386 @@@ bool bpf_check_mod_kfunc_call(struct kf
    DEFINE_KFUNC_BTF_ID_LIST(bpf_tcp_ca_kfunc_list);
    DEFINE_KFUNC_BTF_ID_LIST(prog_test_kfunc_list);

  + #endif
  ++
   +int bpf_core_types_are_compat(const struct btf *local_btf, __u32 local_id,
   +                            const struct btf *targ_btf, __u32 targ_id)
   +{
   +      return -EOPNOTSUPP;
   +}
  [...]

3. tools/lib/bpf/libbpf.c, attr->log_level should be replaced with
   extra_log_level, but otherwise the 4-parameter invocation of bpf_gen__init()
   wins:

  --- a/tools/lib/bpf/libbpf.c
  +++ b/tools/lib/bpf/libbpf.c
  @@@ -7477,7 -7258,7 +7477,7 @@@ static int bpf_object_load(struct bpf_o
          }

          if (obj->gen_loader)
  -               bpf_gen__init(obj->gen_loader, extra_log_level);
   -              bpf_gen__init(obj->gen_loader, attr->log_level, obj->nr_programs, obj->nr_maps);
  ++              bpf_gen__init(obj->gen_loader, extra_log_level, obj->nr_programs, obj->nr_maps);

          err = bpf_object__probe_loading(obj);
          err = err ? : bpf_object__load_vmlinux_btf(obj, false);

We've added 115 non-merge commits during the last 26 day(s) which contain
a total of 182 files changed, 5747 insertions(+), 2564 deletions(-).

The main changes are:

1) Various samples fixes, from Alexander Lobakin.

2) BPF CO-RE support in kernel and light skeleton, from Alexei Starovoitov.

3) A batch of new unified APIs for libbpf, logging improvements, version
   querying, etc. Also a batch of old deprecations for old APIs and various
   bug fixes, in preparation for libbpf 1.0, from Andrii Nakryiko.

4) BPF documentation reorganization and improvements, from Christoph Hellwig
   and Dave Tucker.

5) Support for declarative initialization of BPF_MAP_TYPE_PROG_ARRAY in
   libbpf, from Hengqi Chen.

6) Verifier log fixes, from Hou Tao.

7) Runtime-bounded loops support with bpf_loop() helper, from Joanne Koong.

8) Extend branch record capturing to all platforms that support it,
   from Kajol Jain.

9) Light skeleton codegen improvements, from Kumar Kartikeya Dwivedi.

10) bpftool doc-generating script improvements, from Quentin Monnet.

11) Two libbpf v0.6 bug fixes, from Shuyi Cheng and Vincent Minet.

12) Deprecation warning fix for perf/bpf_counter, from Song Liu.

13) MAX_TAIL_CALL_CNT unification and MIPS build fix for libbpf,
    from Tiezhu Yang.

14) BTF_KIND_TYPE_TAG follow-up fixes, from Yonghong Song.

15) Selftests fixes and improvements, from Ilya Leoshkevich, Jean-Philippe
    Brucker, Jiri Olsa, Maxim Mikityanskiy, Tirthendu Sarkar, Yucong Sun,
    and others.
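
Item 7 above mentions bpf_loop(). As far as I understand the helper, it calls a callback up to nr_loops times, stops early when the callback returns non-zero, and returns the number of iterations performed. The following is a plain user-space model of that contract, not BPF code and not the kernel implementation. model_bpf_loop, MODEL_MAX_LOOPS and the example callback are names invented for this sketch, and the error codes and the iteration cap are assumptions based on the helper as added in this series.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed cap on iterations, modeled on the kernel-side limit. */
#define MODEL_MAX_LOOPS (8u * 1024u * 1024u)

/* Callback shape: return 0 to continue, non-zero to stop early. */
typedef int (*loop_cb_t)(uint32_t index, void *ctx);

/* Hypothetical user-space model of the bpf_loop() contract. */
static long model_bpf_loop(uint32_t nr_loops, loop_cb_t cb, void *ctx,
			   uint64_t flags)
{
	uint32_t i;

	if (flags)                       /* no flags are defined in this model */
		return -EINVAL;
	if (nr_loops > MODEL_MAX_LOOPS)
		return -E2BIG;

	for (i = 0; i < nr_loops; i++) {
		if (cb(i, ctx))          /* callback asked to stop */
			return (long)i + 1;
	}
	return (long)i;                  /* all iterations ran */
}

/* Example callback: accumulate indices, stop once index 3 is reached. */
static int sum_until_three(uint32_t index, void *ctx)
{
	uint32_t *sum = ctx;

	*sum += index;
	return index == 3;
}
```

The point of the design is that the iteration count is bounded at run time by nr_loops rather than by unrolling, so the loop body only needs to be checked once as a callback.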

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!

Also thanks to reporters, reviewers and testers of commits in this pull-request:

Andrii Nakryiko, Björn Töpel, Evgeny Vereshchagin, Gustavo A. R. Silva, 
Ilya Leoshkevich, Jiri Olsa, Johan Almbladh, John Fastabend, KP Singh, 
Kumar Kartikeya Dwivedi, Maciej Fijalkowski, Martin KaFai Lau, Quentin 
Monnet, Song Liu, Toke Høiland-Jørgensen, Yonghong Song, Zeal Robot

----------------------------------------------------------------

The following changes since commit a5bdc36354cbf1a1a91396f4da548ff484686305:

  Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (2021-11-15 08:49:23 -0800)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 229fae38d0fc0d6ff58d57cbeb1432da55e58d4f:

  libbpf: Add "bool skipped" to struct bpf_map (2021-12-10 15:35:30 -0800)

----------------------------------------------------------------
Alan Maguire (1):
      libbpf: Silence uninitialized warning/error in btf_dump_dump_type_data

Alexander Lobakin (3):
      samples: bpf: Fix conflicting types in fds_example
      samples: bpf: Fix xdp_sample_user.o linking with Clang
      samples: bpf: Fix 'unknown warning group' build warning on Clang

Alexei Starovoitov (22):
      Merge branch 'Add bpf_loop helper'
      libbpf: Replace btf__type_by_id() with btf_type_by_id().
      bpf: Rename btf_member accessors.
      bpf: Prepare relo_core.c for kernel duty.
      bpf: Define enum bpf_core_relo_kind as uapi.
      bpf: Pass a set of bpf_core_relo-s to prog_load command.
      bpf: Adjust BTF log size limit.
      bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().
      libbpf: Use CO-RE in the kernel in light skeleton.
      libbpf: Support init of inner maps in light skeleton.
      libbpf: Clean gen_loader's attach kind.
      selftests/bpf: Add lskel version of kfunc test.
      selftests/bpf: Improve inner_map test coverage.
      selftests/bpf: Convert map_ptr_kern test to use light skeleton.
      selftests/bpf: Additional test for CO-RE in the kernel.
      selftests/bpf: Revert CO-RE removal in test_ksyms_weak.
      selftests/bpf: Add CO-RE relocations to verifier scale test.
      Merge branch 'Deprecate bpf_prog_load_xattr() API'
      libbpf: Reduce bpf_core_apply_relo_insn() stack usage.
      bpftool: Add debug mode for gen_loader.
      bpf: Silence purge_cand_cache build warning.
      Merge branch 'Enhance and rework logging controls in libbpf'

Andrii Nakryiko (48):
      selftests/bpf: Add uprobe triggering overhead benchmarks
      libbpf: Add runtime APIs to query libbpf version
      libbpf: Accommodate DWARF/compiler bug with duplicated structs
      libbpf: Load global data maps lazily on legacy kernels
      selftests/bpf: Mix legacy (maps) and modern (vars) BPF in one test
      libbpf: Unify low-level map creation APIs w/ new bpf_map_create()
      libbpf: Use bpf_map_create() consistently internally
      libbpf: Prevent deprecation warnings in xsk.c
      selftests/bpf: Migrate selftests to bpf_map_create()
      tools/resolve_btf_ids: Close ELF file on error
      libbpf: Fix potential misaligned memory access in btf_ext__new()
      libbpf: Don't call libc APIs with NULL pointers
      libbpf: Fix glob_syms memory leak in bpf_linker
      libbpf: Fix using invalidated memory in bpf_linker
      selftests/bpf: Fix UBSan complaint about signed __int128 overflow
      selftests/bpf: Fix possible NULL passed to memcpy() with zero size
      selftests/bpf: Prevent misaligned memory access in get_stack_raw_tp test
      selftests/bpf: Fix misaligned memory access in queue_stack_map test
      selftests/bpf: Prevent out-of-bounds stack access in test_bpffs
      selftests/bpf: Fix misaligned memory accesses in xdp_bonding test
      selftests/bpf: Fix misaligned accesses in xdp and xdp_bpf2bpf tests
      Merge branch 'Support static initialization of BPF_MAP_TYPE_PROG_ARRAY'
      Merge branch 'Apply suggestions for typeless/weak ksym series'
      libbpf: Cleanup struct bpf_core_cand.
      Merge branch 'bpf: CO-RE support in the kernel'
      libbpf: Use __u32 fields in bpf_map_create_opts
      libbpf: Add API to get/set log_level at per-program level
      bpftool: Migrate off of deprecated bpf_create_map_xattr() API
      selftests/bpf: Remove recently reintroduced legacy btf__dedup() use
      selftests/bpf: Mute xdpxceiver.c's deprecation warnings
      selftests/bpf: Remove all the uses of deprecated bpf_prog_load_xattr()
      samples/bpf: Clean up samples/bpf build failes
      samples/bpf: Get rid of deprecated libbpf API uses
      libbpf: Deprecate bpf_prog_load_xattr() API
      perf: Mute libbpf API deprecations temporarily
      Merge branch 'samples: bpf: fix build issues with Clang/LLVM'
      libbpf: Fix bpf_prog_load() log_buf logic for log_level 0
      libbpf: Add OPTS-based bpf_btf_load() API
      libbpf: Allow passing preallocated log_buf when loading BTF into kernel
      libbpf: Allow passing user log setting through bpf_object_open_opts
      libbpf: Improve logging around BPF program loading
      libbpf: Preserve kernel error code and remove kprobe prog type guessing
      libbpf: Add per-program log buffer setter and getter
      libbpf: Deprecate bpf_object__load_xattr()
      selftests/bpf: Replace all uses of bpf_load_btf() with bpf_btf_load()
      selftests/bpf: Add test for libbpf's custom log_buf behavior
      selftests/bpf: Remove the only use of deprecated bpf_object__load_xattr()
      bpftool: Switch bpf_object__load_xattr() to bpf_object__load()

Christoph Hellwig (5):
      x86, bpf: Cleanup the top of file header in bpf_jit_comp.c
      bpf: Remove a redundant comment on bpf_prog_free
      bpf, docs: Prune all references to "internal BPF"
      bpf, docs: Move handling of maps to Documentation/bpf/maps.rst
      bpf, docs: Split general purpose eBPF documentation out of filter.rst

Colin Ian King (1):
      bpf: Remove redundant assignment to pointer t

Dave Tucker (3):
      bpf, docs: Change underline in btf to match style guide
      bpf, docs: Rename bpf_lsm.rst to prog_lsm.rst
      bpf, docs: Fix ordering of bpf documentation

Drew Fustini (1):
      selftests/bpf: Fix trivial typo

Florent Revest (1):
      libbpf: Change bpf_program__set_extra_flags to bpf_program__set_flags

Grant Seltzer (1):
      libbpf: Add doc comments in libbpf.h

Hengqi Chen (2):
      libbpf: Support static initialization of BPF_MAP_TYPE_PROG_ARRAY
      selftests/bpf: Test BPF_MAP_TYPE_PROG_ARRAY static initialization

Hou Tao (2):
      bpf: Clean-up bpf_verifier_vlog() for BPF_LOG_KERNEL log level
      bpf: Disallow BPF_LOG_KERNEL log level for bpf(BPF_BTF_LOAD)

Ilya Leoshkevich (1):
      selfetests/bpf: Adapt vmtest.sh to s390 libbpf CI changes

Jean-Philippe Brucker (1):
      selftests/bpf: Build testing_helpers.o out of tree

Jiri Olsa (1):
      selftests/bpf: Add btf_dedup case with duplicated structs within CU

Joanne Koong (4):
      bpf: Add bpf_loop helper
      selftests/bpf: Add bpf_loop test
      selftests/bpf: Measure bpf_loop verifier performance
      selftest/bpf/benchs: Add bpf_loop benchmark

Kajol Jain (1):
      bpf: Remove config check to enable bpf support for branch records

Kumar Kartikeya Dwivedi (3):
      bpf: Change bpf_kallsyms_lookup_name size type to ARG_CONST_SIZE_OR_ZERO
      libbpf: Avoid double stores for success/failure case of ksym relocations
      libbpf: Avoid reload of imm for weak, unresolved, repeating ksym

Maxim Mikityanskiy (1):
      bpf: Fix the test_task_vma selftest to support output shorter than 1 kB

Mehrdad Arshad Rad (1):
      libbpf: Remove duplicate assignments

Minghao Chi (1):
      samples/bpf: Remove unneeded variable

Paul E. McKenney (1):
      selftests/bpf: Update test names for xchg and cmpxchg

Quentin Monnet (3):
      bpftool: Add SPDX tags to RST documentation files
      bpftool: Update doc (use susbtitutions) and test_bpftool_synctypes.py
      selftests/bpf: Configure dir paths via env in test_bpftool_synctypes.py

Shuyi Cheng (1):
      libbpf: Add "bool skipped" to struct bpf_map

Song Liu (1):
      perf/bpf_counter: Use bpf_map_create instead of bpf_create_map

Stanislav Fomichev (1):
      bpftool: Add current libbpf_strict mode to version output

Tiezhu Yang (2):
      bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33
      bpf, mips: Fix build errors about __NR_bpf undeclared

Tirthendu Sarkar (1):
      selftests/bpf: Fix xdpxceiver failures for no hugepages

Vincent Minet (1):
      libbpf: Fix typo in btf__dedup@LIBBPF_0.0.2 definition

Yonghong Song (3):
      libbpf: Fix a couple of missed btf_type_tag handling in btf.c
      selftests/bpf: Add a dedup selftest with equivalent structure types
      selftests/bpf: Fix a compilation warning

Yucong Sun (3):
      selftests/bpf: Move summary line after the error logs
      selftests/bpf: Variable naming fix
      selftests/bpf: Mark variable as static

huangxuesen (1):
      libbpf: Fix trivial typo

 Documentation/bpf/btf.rst                          |   44 +-
 Documentation/bpf/faq.rst                          |   11 +
 Documentation/bpf/helpers.rst                      |    7 +
 Documentation/bpf/index.rst                        |  102 +-
 Documentation/bpf/instruction-set.rst              |  467 +++++++++
 Documentation/bpf/libbpf/index.rst                 |    4 +-
 Documentation/bpf/maps.rst                         |   52 +
 Documentation/bpf/other.rst                        |    9 +
 Documentation/bpf/{bpf_lsm.rst => prog_lsm.rst}    |    0
 Documentation/bpf/programs.rst                     |    9 +
 Documentation/bpf/syscall_api.rst                  |   11 +
 Documentation/bpf/test_debug.rst                   |    9 +
 Documentation/bpf/verifier.rst                     |  529 ++++++++++
 Documentation/networking/filter.rst                | 1036 +-------------------
 MAINTAINERS                                        |    2 +-
 arch/arm/net/bpf_jit_32.c                          |    7 +-
 arch/arm64/net/bpf_jit_comp.c                      |    7 +-
 arch/mips/net/bpf_jit_comp32.c                     |    3 +-
 arch/mips/net/bpf_jit_comp64.c                     |    2 +-
 arch/powerpc/net/bpf_jit_comp32.c                  |    4 +-
 arch/powerpc/net/bpf_jit_comp64.c                  |    4 +-
 arch/riscv/net/bpf_jit_comp32.c                    |    6 +-
 arch/riscv/net/bpf_jit_comp64.c                    |    7 +-
 arch/s390/net/bpf_jit_comp.c                       |    6 +-
 arch/sparc/net/bpf_jit_comp_64.c                   |    4 +-
 arch/x86/net/bpf_jit_comp.c                        |   14 +-
 arch/x86/net/bpf_jit_comp32.c                      |    4 +-
 include/linux/bpf.h                                |   11 +-
 include/linux/bpf_verifier.h                       |    7 +
 include/linux/btf.h                                |   89 +-
 include/uapi/linux/bpf.h                           |  105 +-
 kernel/bpf/Makefile                                |    4 +
 kernel/bpf/bpf_iter.c                              |   35 +
 kernel/bpf/bpf_struct_ops.c                        |    6 +-
 kernel/bpf/btf.c                                   |  410 +++++++-
 kernel/bpf/core.c                                  |    6 +-
 kernel/bpf/helpers.c                               |    2 +
 kernel/bpf/syscall.c                               |    4 +-
 kernel/bpf/verifier.c                              |  180 +++-
 kernel/trace/bpf_trace.c                           |    6 +-
 lib/test_bpf.c                                     |    4 +-
 net/core/filter.c                                  |   11 +-
 net/ipv4/bpf_tcp_ca.c                              |    6 +-
 samples/bpf/Makefile                               |   18 +-
 samples/bpf/Makefile.target                        |   11 -
 samples/bpf/cookie_uid_helper_example.c            |   14 +-
 samples/bpf/fds_example.c                          |   29 +-
 samples/bpf/hbm_kern.h                             |    2 -
 samples/bpf/lwt_len_hist_kern.c                    |    7 -
 samples/bpf/map_perf_test_user.c                   |   15 +-
 samples/bpf/sock_example.c                         |   12 +-
 samples/bpf/sockex1_user.c                         |   15 +-
 samples/bpf/sockex2_user.c                         |   14 +-
 samples/bpf/test_cgrp2_array_pin.c                 |    4 +-
 samples/bpf/test_cgrp2_attach.c                    |   13 +-
 samples/bpf/test_cgrp2_sock.c                      |    8 +-
 samples/bpf/test_lru_dist.c                        |   11 +-
 samples/bpf/trace_output_user.c                    |    4 +-
 samples/bpf/xdp_redirect_cpu.bpf.c                 |    4 +-
 samples/bpf/xdp_sample_pkts_user.c                 |   22 +-
 samples/bpf/xdp_sample_user.h                      |    2 +
 samples/bpf/xdpsock_ctrl_proc.c                    |    3 +
 samples/bpf/xdpsock_user.c                         |    3 +
 samples/bpf/xsk_fwd.c                              |    3 +
 tools/bpf/bpftool/Documentation/Makefile           |    2 +-
 tools/bpf/bpftool/Documentation/bpftool-btf.rst    |    7 +-
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst |    7 +-
 .../bpf/bpftool/Documentation/bpftool-feature.rst  |    6 +-
 tools/bpf/bpftool/Documentation/bpftool-gen.rst    |    7 +-
 tools/bpf/bpftool/Documentation/bpftool-iter.rst   |    6 +-
 tools/bpf/bpftool/Documentation/bpftool-link.rst   |    7 +-
 tools/bpf/bpftool/Documentation/bpftool-map.rst    |    7 +-
 tools/bpf/bpftool/Documentation/bpftool-net.rst    |    6 +-
 tools/bpf/bpftool/Documentation/bpftool-perf.rst   |    6 +-
 tools/bpf/bpftool/Documentation/bpftool-prog.rst   |    6 +-
 .../bpftool/Documentation/bpftool-struct_ops.rst   |    6 +-
 tools/bpf/bpftool/Documentation/bpftool.rst        |    7 +-
 tools/bpf/bpftool/Documentation/common_options.rst |    2 +
 tools/bpf/bpftool/Documentation/substitutions.rst  |    3 +
 tools/bpf/bpftool/gen.c                            |   11 +-
 tools/bpf/bpftool/main.c                           |   12 +-
 tools/bpf/bpftool/map.c                            |   23 +-
 tools/bpf/bpftool/prog.c                           |   44 +-
 tools/bpf/bpftool/struct_ops.c                     |   15 +-
 tools/bpf/resolve_btfids/main.c                    |    5 +-
 tools/build/feature/test-bpf.c                     |    6 +
 tools/include/uapi/linux/bpf.h                     |  105 +-
 tools/lib/bpf/bpf.c                                |  234 +++--
 tools/lib/bpf/bpf.h                                |   55 +-
 tools/lib/bpf/bpf_gen_internal.h                   |    9 +-
 tools/lib/bpf/btf.c                                |  139 ++-
 tools/lib/bpf/btf.h                                |    2 +-
 tools/lib/bpf/btf_dump.c                           |    2 +-
 tools/lib/bpf/gen_loader.c                         |  160 ++-
 tools/lib/bpf/libbpf.c                             |  649 ++++++++----
 tools/lib/bpf/libbpf.h                             |  115 ++-
 tools/lib/bpf/libbpf.map                           |   15 +-
 tools/lib/bpf/libbpf_common.h                      |    5 +
 tools/lib/bpf/libbpf_internal.h                    |   24 +-
 tools/lib/bpf/libbpf_probes.c                      |   32 +-
 tools/lib/bpf/libbpf_version.h                     |    2 +-
 tools/lib/bpf/linker.c                             |    6 +-
 tools/lib/bpf/relo_core.c                          |  231 +++--
 tools/lib/bpf/relo_core.h                          |  103 +-
 tools/lib/bpf/skel_internal.h                      |   13 +-
 tools/lib/bpf/xsk.c                                |   18 +-
 tools/perf/tests/bpf.c                             |    4 +
 tools/perf/util/bpf-loader.c                       |    3 +
 tools/perf/util/bpf_counter.c                      |   18 +-
 tools/testing/selftests/bpf/Makefile               |   49 +-
 tools/testing/selftests/bpf/bench.c                |   47 +
 tools/testing/selftests/bpf/bench.h                |    2 +
 .../testing/selftests/bpf/benchs/bench_bpf_loop.c  |  105 ++
 tools/testing/selftests/bpf/benchs/bench_trigger.c |  146 +++
 .../selftests/bpf/benchs/run_bench_bpf_loop.sh     |   15 +
 tools/testing/selftests/bpf/benchs/run_common.sh   |   15 +
 .../selftests/bpf/map_tests/array_map_batch_ops.c  |   13 +-
 .../selftests/bpf/map_tests/htab_map_batch_ops.c   |   13 +-
 .../bpf/map_tests/lpm_trie_map_batch_ops.c         |   15 +-
 .../selftests/bpf/map_tests/sk_storage_map.c       |   52 +-
 tools/testing/selftests/bpf/prog_tests/atomics.c   |    4 +-
 .../selftests/bpf/prog_tests/bloom_filter_map.c    |   36 +-
 tools/testing/selftests/bpf/prog_tests/bpf_iter.c  |   13 +-
 tools/testing/selftests/bpf/prog_tests/bpf_loop.c  |  145 +++
 .../testing/selftests/bpf/prog_tests/bpf_tcp_ca.c  |    6 +-
 .../selftests/bpf/prog_tests/bpf_verif_scale.c     |   42 +-
 tools/testing/selftests/bpf/prog_tests/btf.c       |  127 ++-
 .../selftests/bpf/prog_tests/btf_dedup_split.c     |  113 +++
 tools/testing/selftests/bpf/prog_tests/btf_dump.c  |    4 +-
 .../selftests/bpf/prog_tests/cgroup_attach_multi.c |   12 +-
 .../selftests/bpf/prog_tests/connect_force_port.c  |   17 +-
 tools/testing/selftests/bpf/prog_tests/core_kern.c |   14 +
 .../testing/selftests/bpf/prog_tests/core_reloc.c  |    3 +-
 .../selftests/bpf/prog_tests/get_stack_raw_tp.c    |   14 +-
 tools/testing/selftests/bpf/prog_tests/kfree_skb.c |   58 +-
 .../testing/selftests/bpf/prog_tests/kfunc_call.c  |   24 +
 .../selftests/bpf/prog_tests/legacy_printk.c       |   65 ++
 tools/testing/selftests/bpf/prog_tests/log_buf.c   |  276 ++++++
 tools/testing/selftests/bpf/prog_tests/map_ptr.c   |   16 +-
 tools/testing/selftests/bpf/prog_tests/pinning.c   |    4 +-
 .../selftests/bpf/prog_tests/prog_array_init.c     |   32 +
 .../selftests/bpf/prog_tests/queue_stack_map.c     |   12 +-
 .../selftests/bpf/prog_tests/ringbuf_multi.c       |    4 +-
 .../selftests/bpf/prog_tests/select_reuseport.c    |   21 +-
 .../selftests/bpf/prog_tests/sockmap_basic.c       |    4 +-
 .../selftests/bpf/prog_tests/sockmap_ktls.c        |    2 +-
 .../selftests/bpf/prog_tests/sockmap_listen.c      |    4 +-
 .../selftests/bpf/prog_tests/sockopt_inherit.c     |   12 +-
 .../selftests/bpf/prog_tests/sockopt_multi.c       |   12 +-
 tools/testing/selftests/bpf/prog_tests/tcp_rtt.c   |   21 +-
 .../testing/selftests/bpf/prog_tests/test_bpffs.c  |    6 +-
 .../selftests/bpf/prog_tests/test_global_funcs.c   |   28 +-
 tools/testing/selftests/bpf/prog_tests/xdp.c       |   11 +-
 .../testing/selftests/bpf/prog_tests/xdp_bonding.c |   36 +-
 .../testing/selftests/bpf/prog_tests/xdp_bpf2bpf.c |    6 +-
 tools/testing/selftests/bpf/progs/bpf_loop.c       |  112 +++
 tools/testing/selftests/bpf/progs/bpf_loop_bench.c |   26 +
 tools/testing/selftests/bpf/progs/core_kern.c      |  104 ++
 tools/testing/selftests/bpf/progs/map_ptr_kern.c   |   16 +-
 tools/testing/selftests/bpf/progs/pyperf.h         |   71 +-
 .../selftests/bpf/progs/pyperf600_bpf_loop.c       |    6 +
 tools/testing/selftests/bpf/progs/strobemeta.h     |   75 +-
 .../selftests/bpf/progs/strobemeta_bpf_loop.c      |    9 +
 .../testing/selftests/bpf/progs/test_ksyms_weak.c  |    2 +-
 .../selftests/bpf/progs/test_legacy_printk.c       |   73 ++
 tools/testing/selftests/bpf/progs/test_log_buf.c   |   24 +
 .../selftests/bpf/progs/test_prog_array_init.c     |   39 +
 .../selftests/bpf/progs/test_verif_scale2.c        |    4 +-
 tools/testing/selftests/bpf/progs/trigger_bench.c  |    7 +
 .../selftests/bpf/test_bpftool_synctypes.py        |   94 +-
 tools/testing/selftests/bpf/test_cgroup_storage.c  |    8 +-
 tools/testing/selftests/bpf/test_lpm_map.c         |   27 +-
 tools/testing/selftests/bpf/test_lru_map.c         |   16 +-
 tools/testing/selftests/bpf/test_maps.c            |  110 ++-
 tools/testing/selftests/bpf/test_progs.c           |   28 +-
 tools/testing/selftests/bpf/test_sock_addr.c       |   33 +-
 tools/testing/selftests/bpf/test_tag.c             |    5 +-
 tools/testing/selftests/bpf/test_verifier.c        |   54 +-
 tools/testing/selftests/bpf/testing_helpers.c      |   14 +-
 tools/testing/selftests/bpf/vmtest.sh              |   46 +-
 tools/testing/selftests/bpf/xdp_redirect_multi.c   |   15 +-
 tools/testing/selftests/bpf/xdpxceiver.c           |   12 +-
 182 files changed, 5747 insertions(+), 2564 deletions(-)
 create mode 100644 Documentation/bpf/faq.rst
 create mode 100644 Documentation/bpf/helpers.rst
 create mode 100644 Documentation/bpf/instruction-set.rst
 create mode 100644 Documentation/bpf/maps.rst
 create mode 100644 Documentation/bpf/other.rst
 rename Documentation/bpf/{bpf_lsm.rst => prog_lsm.rst} (100%)
 create mode 100644 Documentation/bpf/programs.rst
 create mode 100644 Documentation/bpf/syscall_api.rst
 create mode 100644 Documentation/bpf/test_debug.rst
 create mode 100644 Documentation/bpf/verifier.rst
 create mode 100644 tools/bpf/bpftool/Documentation/substitutions.rst
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_loop.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_bpf_loop.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_loop.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/core_kern.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/legacy_printk.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/log_buf.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/prog_array_init.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_loop.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_loop_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/core_kern.c
 create mode 100644 tools/testing/selftests/bpf/progs/pyperf600_bpf_loop.c
 create mode 100644 tools/testing/selftests/bpf/progs/strobemeta_bpf_loop.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_legacy_printk.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_log_buf.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_prog_array_init.c


* Re: [PATCH v2] samples/bpf: xdpsock: fix swap.cocci warning
From: Andrii Nakryiko @ 2021-12-10 23:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Yihao Han, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Networking, bpf, open list, kernel
In-Reply-To: <877dccwn6x.fsf@toke.dk>

On Fri, Dec 10, 2021 at 6:26 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Yihao Han <hanyihao@vivo.com> writes:
>
> > Fix following swap.cocci warning:
> > ./samples/bpf/xdpsock_user.c:528:22-23:
> > WARNING opportunity for swap()
> >
> > Signed-off-by: Yihao Han <hanyihao@vivo.com>
>
> Erm, did this get applied without anyone actually trying to compile
> samples? I'm getting build errors as:

Good news: I actually do build samples/bpf nowadays after fixing a
bunch of compilation issues recently.

Bad news: seems like I didn't pay too much attention after building
samples/bpf for this particular patch, sorry about that. I've dropped
this patch, samples/bpf builds for me. We should be good now.


>
>   CC  /home/build/linux/samples/bpf/xsk_fwd.o
> /home/build/linux/samples/bpf/xsk_fwd.c: In function ‘swap_mac_addresses’:
> /home/build/linux/samples/bpf/xsk_fwd.c:658:9: warning: implicit declaration of function ‘swap’; did you mean ‘swab’? [-Wimplicit-function-declaration]
>   658 |         swap(*src_addr, *dst_addr);
>       |         ^~~~
>       |         swab
>
> /usr/bin/ld: /home/build/linux/samples/bpf/xsk_fwd.o: in function `thread_func':
> xsk_fwd.c:(.text+0x440): undefined reference to `swap'
> collect2: error: ld returned 1 exit status
>
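
For context on the error quoted above: swap() is a kernel-internal macro, so user-space code under samples/bpf does not get it from the headers it includes, and the compiler treats the call as an undeclared function. A minimal sketch of a local equivalent follows, assuming GNU C (for __typeof__) and using made-up names (swap_sketch, struct mac_addr_sketch, swap_mac_sketch). It is not what xsk_fwd.c or xdpsock_user.c actually contain.

```c
/* Ethernet addresses are 6 bytes; the struct name is hypothetical. */
struct mac_addr_sketch {
	unsigned char b[6];
};

/* Local stand-in for the kernel's swap(); relies on the GNU __typeof__ extension. */
#define swap_sketch(a, b)                  \
	do {                               \
		__typeof__(a) tmp_ = (a);  \
		(a) = (b);                 \
		(b) = tmp_;                \
	} while (0)

/* Exchange source and destination addresses in place. */
static void swap_mac_sketch(struct mac_addr_sketch *src,
			    struct mac_addr_sketch *dst)
{
	swap_sketch(*src, *dst);
}
```

Either a local definition along these lines or keeping the original open-coded temporary avoids the link failure shown above.
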
>
> Could we maybe get samples/bpf added to the BPF CI builds? :)

Maybe we could, if someone dedicated their effort towards making this happen.

>
> -Toke
>


* RE: [PATCH net-next 0/2] net: stmmac: add EthType Rx Frame steering
From: Nambiar, Amritha @ 2021-12-10 23:57 UTC (permalink / raw)
  To: Jakub Kicinski, Vladimir Oltean
  Cc: Ong, Boon Leong, David S . Miller, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, Maxime Coquelin,
	alexandre.torgue@foss.st.com, Kanzenbach, Kurt,
	netdev@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com,
	linux-arm-kernel@lists.infradead.org, Samudrala, Sridhar
In-Reply-To: <20211210113821.522b7c00@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>

> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, December 10, 2021 11:38 AM
> To: Vladimir Oltean <olteanv@gmail.com>
> Cc: Ong, Boon Leong <boon.leong.ong@intel.com>; David S . Miller
> <davem@davemloft.net>; Giuseppe Cavallaro <peppe.cavallaro@st.com>;
> Alexandre Torgue <alexandre.torgue@st.com>; Jose Abreu
> <joabreu@synopsys.com>; Maxime Coquelin
> <mcoquelin.stm32@gmail.com>; alexandre.torgue@foss.st.com;
> Kanzenbach, Kurt <kurt.kanzenbach@linutronix.de>;
> netdev@vger.kernel.org; linux-stm32@st-md-mailman.stormreply.com;
> linux-arm-kernel@lists.infradead.org; Nambiar, Amritha
> <amritha.nambiar@intel.com>
> Subject: Re: [PATCH net-next 0/2] net: stmmac: add EthType Rx Frame
> steering
> 
> On Fri, 10 Dec 2021 13:57:30 +0200 Vladimir Oltean wrote:
> > Is it the canonical approach to perform flow steering via tc-flower hw_tc,
> > as opposed to ethtool --config-nfc? My understanding from reading the
> > documentation is that tc-flower hw_tc only selects the hardware traffic
> > class for a packet, and that this has to do with prioritization
> > (although the concept in itself is a bit ill-defined as far as I
> > understand it, how does it relate to things like offloaded skbedit priority?).
> > But selecting a traffic class, in itself, doesn't (directly or
> > necessarily) select a ring per se, as ethtool does? Just like ethtool
> > doesn't select packet priority, just RX queue. When the RX queue
> > priority is configurable (see the "snps,priority" device tree property
> > in stmmac_mtl_setup) and more RX queues have the same priority, I'm not
> > sure what hw_tc is supposed to do in terms of RX queue selection?
> 
> You didn't mention the mqprio, but I think that's the piece that maps
> TCs to queue pairs. You can have multiple queues in a TC.
> 
> Obviously that's still pretty weird what the flow rules should select
> is an RSS context. mqprio is a qdisc, which means Tx, not Rx.
> 
> Adding Amritha who I believe added the concept of selecting Rx queues
> via hw_tc. Can you comment?

So tc-mqprio is the piece that is needed to set up the queue-groups. The offload
mode "hw 2" in mqprio will offload the TCs, the queue configurations and
bandwidth rate limits. The prio-tc map in mqprio will map a user priority to the
TC/queue-group. The priority to traffic class mapping and the user specified
queue ranges are used to configure the traffic class when the 'hw' option is set to 2.
Drivers can then configure queue-pairs based on the offsets and queue ranges
in mqprio.

The hw_tc option in tc-flower for ingress filter is used to direct Rx traffic to the
queue-group (configured via mqprio). Queue selection within the queue group can
be achieved using RSS.
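
To make the mapping above concrete, here is a minimal Python sketch of the
model only; it is not kernel or iproute2 code. All names (QueueGroup,
MqprioConfig, select_rx_queue) are hypothetical, the 8-priority / 2-TC layout
is an assumed example, and the CRC32 hash is merely a stand-in for the RSS
hash a real NIC would compute.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QueueGroup:
    """One traffic class as a contiguous queue range (count@offset)."""
    offset: int
    count: int

    def queues(self) -> range:
        return range(self.offset, self.offset + self.count)


@dataclass
class MqprioConfig:
    """Hypothetical model of an mqprio setup."""
    prio_tc_map: List[int]    # index = user priority, value = traffic class
    groups: List[QueueGroup]  # index = traffic class

    def tc_for_priority(self, prio: int) -> int:
        return self.prio_tc_map[prio]

    def group_for_tc(self, tc: int) -> QueueGroup:
        return self.groups[tc]


def select_rx_queue(cfg: MqprioConfig, hw_tc: int, flow_key: bytes) -> int:
    """Model of an ingress flower rule with hw_tc: the rule picks the
    queue-group, then a hash spreads flows over the queues inside it."""
    group = cfg.group_for_tc(hw_tc)
    # Assumption: CRC32 stands in for RSS (real hardware typically uses a
    # Toeplitz hash plus an indirection table).
    return group.offset + zlib.crc32(flow_key) % group.count


# Assumed example: priorities 0-3 -> TC0 (queues 0-3), 4-7 -> TC1 (queues 4-7).
cfg = MqprioConfig(
    prio_tc_map=[0, 0, 0, 0, 1, 1, 1, 1],
    groups=[QueueGroup(offset=0, count=4), QueueGroup(offset=4, count=4)],
)
```

In this model a filter with hw_tc 1 can only ever land a flow on queues 4-7;
which of those four is chosen is left to the hash, mirroring the split between
queue-group selection (flower) and queue selection within the group (RSS).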

I agree the mqprio qdisc should be used to set up Tx queues only, but the limitation was the
absence of a single interface that could configure both Tx and Rx queue-groups/queue-sets:
ethtool did not support directing flows to a queue-group, only to a specific individual
queue, and TC does not support Rx queue-group configuration either. The hw_tc in mqprio is a
range of class ids reserved to identify hardware traffic classes normally reported
via dev->num_tc. For Rx queue-group configuration, the gap is that the ingress/clsact qdisc
does not expose a set of virtual qdiscs similar to HW traffic classes in mqprio.
This was discussed on slide 20 of this Netdev 0x14 presentation:
(https://legacy.netdevconf.info/0x14/pub/slides/28/Application%20Device%20Queues%20for%20system-level%20network%20IO%20performance%20improvements.pdf)

-Amritha


* RE: [PATCH v20 bpf-next 03/23] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
From: John Fastabend @ 2021-12-11  0:09 UTC (permalink / raw)
  To: Lorenzo Bianconi, bpf, netdev
  Cc: lorenzo.bianconi, davem, kuba, ast, daniel, shayagr,
	john.fastabend, dsahern, brouer, echaudro, jasowang,
	alexander.duyck, saeed, maciej.fijalkowski, magnus.karlsson,
	tirthendu.sarkar, toke
In-Reply-To: <8a7a7972107efa6e2ba8e603bd5f054c79de3431.1639162845.git.lorenzo@kernel.org>

Lorenzo Bianconi wrote:
> Update multi-buffer bit (mb) in xdp_buff to notify XDP/eBPF layer and
> XDP remote drivers if this is a "non-linear" XDP buffer. Access
> skb_shared_info only if xdp_buff mb is set in order to avoid possible
> cache-misses.
> 
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

