Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] net: sfp: add debugfs support
From: Russell King - ARM Linux admin @ 2020-11-24  9:49 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, netdev,
	Jakub Kicinski
In-Reply-To: <20201124084151.GA722671@shredder.lan>

On Tue, Nov 24, 2020 at 10:41:51AM +0200, Ido Schimmel wrote:
> On Tue, Nov 24, 2020 at 01:14:31AM +0100, Andrew Lunn wrote:
> > On Mon, Nov 23, 2020 at 10:06:16PM +0000, Russell King wrote:
> > > Add debugfs support to SFP so that the internal state of the SFP state
> > > machines and hardware signal state can be viewed from userspace, rather
> > > than having to compile a debug kernel to view state state transitions
> > > in the kernel log.  The 'state' output looks like:
> > > 
> > > Module state: empty
> > > Module probe attempts: 0 0
> > > Device state: up
> > > Main state: down
> > > Fault recovery remaining retries: 5
> > > PHY probe remaining retries: 12
> > > moddef0: 0
> > > rx_los: 1
> > > tx_fault: 1
> > > tx_disable: 1
> > > 
> > > Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
> > 
> > Hi Russell
> > 
> > This looks useful. I always seem to end up recompiling the kernel,
> > which as you said, this should avoid.
> 
> FWIW, another option is to use drgn [1]. Especially when the state is
> queried from the kernel and not hardware. We are using that in mlxsw
> [2][3].

Presumably that requires /proc/kcore support, which 32-bit ARM doesn't
have.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH net-next] net: sfp: VSOL V2801F / CarlitoxxPro CPGOS03-0490 v2.0 workaround
From: Russell King - ARM Linux admin @ 2020-11-24  9:43 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Heiner Kallweit, David S. Miller, netdev, Jakub Kicinski
In-Reply-To: <20201124002021.GB2031446@lunn.ch>

On Tue, Nov 24, 2020 at 01:20:21AM +0100, Andrew Lunn wrote:
> > @@ -335,10 +336,19 @@ static int sfp_i2c_read(struct sfp *sfp, bool a2, u8 dev_addr, void *buf,
> >  			size_t len)
> >  {
> >  	struct i2c_msg msgs[2];
> > -	u8 bus_addr = a2 ? 0x51 : 0x50;
> > +	size_t block_size;
> >  	size_t this_len;
> > +	u8 bus_addr;
> >  	int ret;
> >  
> > +	if (a2) {
> > +		block_size = 16;
> > +		bus_addr = 0x51;
> 
> Hi Russell, Thomas
> 
> Does this man the diagnostic page can be read 16 bytes at a time, even
> when the other page has to be 1 bytes at a time? That seems rather
> odd. Or is the diagnostic page not implemented in these SFPs?

SFF8472 requires that multibyte values are read using sequential
reads. So we can't use single byte reads to read a multibyte value -
it's just not atomic.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* [PATCH net-next v6 5/5] net/x25: remove x25_kill_by_device()
From: Martin Schiller @ 2020-11-24  9:39 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093938.22012-1-ms@dev.tdt.de>

Remove obsolete function x25_kill_by_device(). It's not used any more.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/x25/af_x25.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 313a6222ded9..1432a05805ab 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -199,22 +199,6 @@ static void x25_remove_socket(struct sock *sk)
 	write_unlock_bh(&x25_list_lock);
 }
 
-/*
- *	Kill all bound sockets on a dropped device.
- */
-static void x25_kill_by_device(struct net_device *dev)
-{
-	struct sock *s;
-
-	write_lock_bh(&x25_list_lock);
-
-	sk_for_each(s, &x25_list)
-		if (x25_sk(s)->neighbour && x25_sk(s)->neighbour->dev == dev)
-			x25_disconnect(s, ENETUNREACH, 0, 0);
-
-	write_unlock_bh(&x25_list_lock);
-}
-
 /*
  *	Handle device status changes.
  */
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v6 4/5] net/x25: fix restart request/confirm handling
From: Martin Schiller @ 2020-11-24  9:39 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093938.22012-1-ms@dev.tdt.de>

We have to take the actual link state into account to handle
restart requests/confirms well.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/x25/x25_link.c | 41 +++++++++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/net/x25/x25_link.c b/net/x25/x25_link.c
index 11e868aa625d..f92073f3cb11 100644
--- a/net/x25/x25_link.c
+++ b/net/x25/x25_link.c
@@ -74,16 +74,43 @@ void x25_link_control(struct sk_buff *skb, struct x25_neigh *nb,
 
 	switch (frametype) {
 	case X25_RESTART_REQUEST:
-		confirm = !x25_t20timer_pending(nb);
-		x25_stop_t20timer(nb);
-		nb->state = X25_LINK_STATE_3;
-		if (confirm)
+		switch (nb->state) {
+		case X25_LINK_STATE_2:
+			confirm = !x25_t20timer_pending(nb);
+			x25_stop_t20timer(nb);
+			nb->state = X25_LINK_STATE_3;
+			if (confirm)
+				x25_transmit_restart_confirmation(nb);
+			break;
+		case X25_LINK_STATE_3:
+			/* clear existing virtual calls */
+			x25_kill_by_neigh(nb);
+
 			x25_transmit_restart_confirmation(nb);
+			break;
+		}
 		break;
 
 	case X25_RESTART_CONFIRMATION:
-		x25_stop_t20timer(nb);
-		nb->state = X25_LINK_STATE_3;
+		switch (nb->state) {
+		case X25_LINK_STATE_2:
+			if (x25_t20timer_pending(nb)) {
+				x25_stop_t20timer(nb);
+				nb->state = X25_LINK_STATE_3;
+			} else {
+				x25_transmit_restart_request(nb);
+				x25_start_t20timer(nb);
+			}
+			break;
+		case X25_LINK_STATE_3:
+			/* clear existing virtual calls */
+			x25_kill_by_neigh(nb);
+
+			x25_transmit_restart_request(nb);
+			nb->state = X25_LINK_STATE_2;
+			x25_start_t20timer(nb);
+			break;
+		}
 		break;
 
 	case X25_DIAGNOSTIC:
@@ -214,8 +241,6 @@ void x25_link_established(struct x25_neigh *nb)
 {
 	switch (nb->state) {
 	case X25_LINK_STATE_0:
-		nb->state = X25_LINK_STATE_2;
-		break;
 	case X25_LINK_STATE_1:
 		x25_transmit_restart_request(nb);
 		nb->state = X25_LINK_STATE_2;
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v6 2/5] net/lapb: support netdev events
From: Martin Schiller @ 2020-11-24  9:39 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093938.22012-1-ms@dev.tdt.de>

This patch allows layer2 (LAPB) to react to netdev events itself and
avoids the detour via layer3 (X.25).

1. Establish layer2 on NETDEV_UP events, if the carrier is already up.

2. Call lapb_disconnect_request() on NETDEV_GOING_DOWN events to signal
   the peer that the connection will go down.
   (Only when the carrier is up.)

3. When a NETDEV_DOWN event occur, clear all queues, enter state
   LAPB_STATE_0 and stop all timers.

4. The NETDEV_CHANGE event makes it possible to handle carrier loss and
   detection.

   In case of Carrier Loss, clear all queues, enter state LAPB_STATE_0
   and stop all timers.

   In case of Carrier Detection, we start timer t1 on a DCE interface,
   and on a DTE interface we change to state LAPB_STATE_1 and start
   sending SABM(E).

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/lapb/lapb_iface.c | 94 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/net/lapb/lapb_iface.c b/net/lapb/lapb_iface.c
index 3c03f6512c5f..f226d354aaf0 100644
--- a/net/lapb/lapb_iface.c
+++ b/net/lapb/lapb_iface.c
@@ -418,14 +418,108 @@ int lapb_data_transmit(struct lapb_cb *lapb, struct sk_buff *skb)
 	return used;
 }
 
+/* Handle device status changes. */
+static int lapb_device_event(struct notifier_block *this, unsigned long event,
+			     void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct lapb_cb *lapb;
+
+	if (!net_eq(dev_net(dev), &init_net))
+		return NOTIFY_DONE;
+
+	if (dev->type == ARPHRD_X25) {
+		switch (event) {
+		case NETDEV_UP:
+			lapb_dbg(0, "(%p) Interface up: %s\n", dev,
+				 dev->name);
+
+			if (netif_carrier_ok(dev)) {
+				lapb = lapb_devtostruct(dev);
+				if (!lapb)
+					break;
+
+				lapb_dbg(0, "(%p): Carrier is already up: %s\n",
+					 dev, dev->name);
+				if (lapb->mode & LAPB_DCE) {
+					lapb_start_t1timer(lapb);
+				} else {
+					if (lapb->state == LAPB_STATE_0) {
+						lapb->state = LAPB_STATE_1;
+						lapb_establish_data_link(lapb);
+					}
+				}
+			}
+			break;
+		case NETDEV_GOING_DOWN:
+			if (netif_carrier_ok(dev))
+				lapb_disconnect_request(dev);
+			break;
+		case NETDEV_DOWN:
+			lapb = lapb_devtostruct(dev);
+			if (!lapb)
+				break;
+
+			lapb_dbg(0, "(%p) Interface down: %s\n", dev,
+				 dev->name);
+			lapb_dbg(0, "(%p) S%d -> S0\n", dev,
+				 lapb->state);
+			lapb_clear_queues(lapb);
+			lapb->state = LAPB_STATE_0;
+			lapb->n2count   = 0;
+			lapb_stop_t1timer(lapb);
+			lapb_stop_t2timer(lapb);
+			break;
+		case NETDEV_CHANGE:
+			lapb = lapb_devtostruct(dev);
+			if (!lapb)
+				break;
+
+			if (netif_carrier_ok(dev)) {
+				lapb_dbg(0, "(%p): Carrier detected: %s\n",
+					 dev, dev->name);
+				if (lapb->mode & LAPB_DCE) {
+					lapb_start_t1timer(lapb);
+				} else {
+					if (lapb->state == LAPB_STATE_0) {
+						lapb->state = LAPB_STATE_1;
+						lapb_establish_data_link(lapb);
+					}
+				}
+			} else {
+				lapb_dbg(0, "(%p) Carrier lost: %s\n", dev,
+					 dev->name);
+				lapb_dbg(0, "(%p) S%d -> S0\n", dev,
+					 lapb->state);
+				lapb_clear_queues(lapb);
+				lapb->state = LAPB_STATE_0;
+				lapb->n2count   = 0;
+				lapb_stop_t1timer(lapb);
+				lapb_stop_t2timer(lapb);
+			}
+			break;
+		}
+	}
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block lapb_dev_notifier = {
+	.notifier_call = lapb_device_event,
+};
+
 static int __init lapb_init(void)
 {
+	register_netdevice_notifier(&lapb_dev_notifier);
+
 	return 0;
 }
 
 static void __exit lapb_exit(void)
 {
 	WARN_ON(!list_empty(&lapb_list));
+
+	unregister_netdevice_notifier(&lapb_dev_notifier);
 }
 
 MODULE_AUTHOR("Jonathan Naylor <g4klx@g4klx.demon.co.uk>");
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v6 3/5] net/lapb: fix t1 timer handling for LAPB_STATE_0
From: Martin Schiller @ 2020-11-24  9:39 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093938.22012-1-ms@dev.tdt.de>

1. DTE interface changes immediately to LAPB_STATE_1 and start sending
   SABM(E).

2. DCE interface sends N2-times DM and changes to LAPB_STATE_1
   afterwards if there is no response in the meantime.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/lapb/lapb_timer.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/lapb/lapb_timer.c b/net/lapb/lapb_timer.c
index 8f5b17001a07..baa247fe4ed0 100644
--- a/net/lapb/lapb_timer.c
+++ b/net/lapb/lapb_timer.c
@@ -85,11 +85,18 @@ static void lapb_t1timer_expiry(struct timer_list *t)
 	switch (lapb->state) {
 
 		/*
-		 *	If we are a DCE, keep going DM .. DM .. DM
+		 *	If we are a DCE, send DM up to N2 times, then switch to
+		 *	STATE_1 and send SABM(E).
 		 */
 		case LAPB_STATE_0:
-			if (lapb->mode & LAPB_DCE)
+			if (lapb->mode & LAPB_DCE &&
+			    lapb->n2count != lapb->n2) {
+				lapb->n2count++;
 				lapb_send_control(lapb, LAPB_DM, LAPB_POLLOFF, LAPB_RESPONSE);
+			} else {
+				lapb->state = LAPB_STATE_1;
+				lapb_establish_data_link(lapb);
+			}
 			break;
 
 		/*
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH iproute2-next v2] tc flower: use right ethertype in icmp/arp parsing
From: Roi Dayan @ 2020-11-24  9:39 UTC (permalink / raw)
  To: David Ahern, Zahari Doychev, netdev; +Cc: simon.horman, jhs, jianbol
In-Reply-To: <c51abdae-6596-54ec-2b96-9b010c27cdb1@gmail.com>



On 2020-11-14 5:12 AM, David Ahern wrote:
> On 11/10/20 12:53 AM, Zahari Doychev wrote:
>> Currently the icmp and arp parsing functions are called with incorrect
>> ethtype in case of vlan or cvlan filter options. In this case either
>> cvlan_ethtype or vlan_ethtype has to be used. The ethtype is now updated
>> each time a vlan ethtype is matched during parsing.
>>
>> Signed-off-by: Zahari Doychev <zahari.doychev@linux.com>
>> ---
>>   tc/f_flower.c | 52 +++++++++++++++++++++++----------------------------
>>   1 file changed, 23 insertions(+), 29 deletions(-)
>>
> 
> Thanks, looks much better.
> 
> applied to iproute2-next.
> 

Hi,

I didn't debug yet but with this commit I am failing to add a tc
rule I always could before. also the error msg doesn't make sense.

Example:

# tc filter add dev enp8s0f0 protocol 802.1Q parent ffff: prio 1 flower\
  skip_hw dst_mac e4:11:22:11:4a:51 src_mac e4:11:22:11:4a:50\
  vlan_ethtype 0x800 vlan_id 100 vlan_prio 0 action vlan pop action\
  mirred egress redirect dev enp8s0f0_0


Can't set "vlan_id" if ethertype isn't 802.1Q or 802.1AD


I used protocol 802.1Q and vlan_ethtype 0x800.
am i missing something? the rule should look different now?

Thanks,
Roi

^ permalink raw reply

* [PATCH net-next v6 1/5] net/x25: handle additional netdev events
From: Martin Schiller @ 2020-11-24  9:39 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093938.22012-1-ms@dev.tdt.de>

1. Add / remove x25_link_device by NETDEV_REGISTER/UNREGISTER and also
   by NETDEV_POST_TYPE_CHANGE/NETDEV_PRE_TYPE_CHANGE.

   This change is needed so that the x25_neigh struct for an interface
   is already created when it shows up and is kept independently if the
   interface goes UP or DOWN.

   This is used in an upcomming commit, where x25 params of an neighbour
   will get configurable through ioctls.

2. NETDEV_CHANGE event makes it possible to handle carrier loss and
   detection. If carrier is lost, clean up everything related to this
   neighbour by calling x25_link_terminated().

3. Also call x25_link_terminated() for NETDEV_DOWN events and remove the
   call to x25_clear_forward_by_dev() in x25_route_device_down(), as
   this is already called by x25_kill_by_neigh() which gets called by
   x25_link_terminated().

4. Do nothing for NETDEV_UP and NETDEV_GOING_DOWN events, as these will
   be handled in layer 2 (LAPB) and layer3 (X.25) will be informed by
   layer2 when layer2 link is established and layer3 link should be
   initiated.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/x25/af_x25.c    | 22 ++++++++++++++++------
 net/x25/x25_link.c  |  6 +++---
 net/x25/x25_route.c |  3 ---
 3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 046d3fee66a9..313a6222ded9 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -233,21 +233,31 @@ static int x25_device_event(struct notifier_block *this, unsigned long event,
 #endif
 	 ) {
 		switch (event) {
-		case NETDEV_UP:
+		case NETDEV_REGISTER:
+		case NETDEV_POST_TYPE_CHANGE:
 			x25_link_device_up(dev);
 			break;
-		case NETDEV_GOING_DOWN:
+		case NETDEV_DOWN:
 			nb = x25_get_neigh(dev);
 			if (nb) {
-				x25_terminate_link(nb);
+				x25_link_terminated(nb);
 				x25_neigh_put(nb);
 			}
-			break;
-		case NETDEV_DOWN:
-			x25_kill_by_device(dev);
 			x25_route_device_down(dev);
+			break;
+		case NETDEV_PRE_TYPE_CHANGE:
+		case NETDEV_UNREGISTER:
 			x25_link_device_down(dev);
 			break;
+		case NETDEV_CHANGE:
+			if (!netif_carrier_ok(dev)) {
+				nb = x25_get_neigh(dev);
+				if (nb) {
+					x25_link_terminated(nb);
+					x25_neigh_put(nb);
+				}
+			}
+			break;
 		}
 	}
 
diff --git a/net/x25/x25_link.c b/net/x25/x25_link.c
index fdae054b7dc1..11e868aa625d 100644
--- a/net/x25/x25_link.c
+++ b/net/x25/x25_link.c
@@ -232,6 +232,9 @@ void x25_link_established(struct x25_neigh *nb)
 void x25_link_terminated(struct x25_neigh *nb)
 {
 	nb->state = X25_LINK_STATE_0;
+	skb_queue_purge(&nb->queue);
+	x25_stop_t20timer(nb);
+
 	/* Out of order: clear existing virtual calls (X.25 03/93 4.6.3) */
 	x25_kill_by_neigh(nb);
 }
@@ -277,9 +280,6 @@ void x25_link_device_up(struct net_device *dev)
  */
 static void __x25_remove_neigh(struct x25_neigh *nb)
 {
-	skb_queue_purge(&nb->queue);
-	x25_stop_t20timer(nb);
-
 	if (nb->node.next) {
 		list_del(&nb->node);
 		x25_neigh_put(nb);
diff --git a/net/x25/x25_route.c b/net/x25/x25_route.c
index 00e46c9a5280..ec2a39e9b3e6 100644
--- a/net/x25/x25_route.c
+++ b/net/x25/x25_route.c
@@ -115,9 +115,6 @@ void x25_route_device_down(struct net_device *dev)
 			__x25_remove_route(rt);
 	}
 	write_unlock_bh(&x25_route_list_lock);
-
-	/* Remove any related forwarding */
-	x25_clear_forward_by_dev(dev);
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v6 0/5] net/x25: netdev event handling
From: Martin Schiller @ 2020-11-24  9:39 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller

---

Changes to v5:
o fix numbering in commit message of patch 2/5.

Changes to v4:
o also establish layer2 (LAPB) on NETDEV_UP events, if the carrier is
  already UP.

Changes to v3:
o another complete rework of the patch-set to split event handling
  for layer2 (LAPB) and layer3 (X.25)

Changes to v2:
o restructure complete patch-set
o keep netdev event handling in layer3 (X.25)
o add patch to fix lapb_connect_request() for DCE
o add patch to handle carrier loss correctly in lapb
o drop patch for x25_neighbour param handling
  this may need fixes/cleanup and will be resubmitted later.

Changes to v1:
o fix 'subject_prefix' and 'checkpatch' warnings

---

Martin Schiller (5):
  net/x25: handle additional netdev events
  net/lapb: support netdev events
  net/lapb: fix t1 timer handling for LAPB_STATE_0
  net/x25: fix restart request/confirm handling
  net/x25: remove x25_kill_by_device()

 net/lapb/lapb_iface.c | 94 +++++++++++++++++++++++++++++++++++++++++++
 net/lapb/lapb_timer.c | 11 ++++-
 net/x25/af_x25.c      | 38 ++++++++---------
 net/x25/x25_link.c    | 47 +++++++++++++++++-----
 net/x25/x25_route.c   |  3 --
 5 files changed, 155 insertions(+), 38 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH net-next v5 5/5] net/x25: remove x25_kill_by_device()
From: Martin Schiller @ 2020-11-24  9:35 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093538.21177-1-ms@dev.tdt.de>

Remove obsolete function x25_kill_by_device(). It's not used any more.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/x25/af_x25.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 313a6222ded9..1432a05805ab 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -199,22 +199,6 @@ static void x25_remove_socket(struct sock *sk)
 	write_unlock_bh(&x25_list_lock);
 }
 
-/*
- *	Kill all bound sockets on a dropped device.
- */
-static void x25_kill_by_device(struct net_device *dev)
-{
-	struct sock *s;
-
-	write_lock_bh(&x25_list_lock);
-
-	sk_for_each(s, &x25_list)
-		if (x25_sk(s)->neighbour && x25_sk(s)->neighbour->dev == dev)
-			x25_disconnect(s, ENETUNREACH, 0, 0);
-
-	write_unlock_bh(&x25_list_lock);
-}
-
 /*
  *	Handle device status changes.
  */
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v5 4/5] net/x25: fix restart request/confirm handling
From: Martin Schiller @ 2020-11-24  9:35 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093538.21177-1-ms@dev.tdt.de>

We have to take the actual link state into account to handle
restart requests/confirms well.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/x25/x25_link.c | 41 +++++++++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/net/x25/x25_link.c b/net/x25/x25_link.c
index 11e868aa625d..f92073f3cb11 100644
--- a/net/x25/x25_link.c
+++ b/net/x25/x25_link.c
@@ -74,16 +74,43 @@ void x25_link_control(struct sk_buff *skb, struct x25_neigh *nb,
 
 	switch (frametype) {
 	case X25_RESTART_REQUEST:
-		confirm = !x25_t20timer_pending(nb);
-		x25_stop_t20timer(nb);
-		nb->state = X25_LINK_STATE_3;
-		if (confirm)
+		switch (nb->state) {
+		case X25_LINK_STATE_2:
+			confirm = !x25_t20timer_pending(nb);
+			x25_stop_t20timer(nb);
+			nb->state = X25_LINK_STATE_3;
+			if (confirm)
+				x25_transmit_restart_confirmation(nb);
+			break;
+		case X25_LINK_STATE_3:
+			/* clear existing virtual calls */
+			x25_kill_by_neigh(nb);
+
 			x25_transmit_restart_confirmation(nb);
+			break;
+		}
 		break;
 
 	case X25_RESTART_CONFIRMATION:
-		x25_stop_t20timer(nb);
-		nb->state = X25_LINK_STATE_3;
+		switch (nb->state) {
+		case X25_LINK_STATE_2:
+			if (x25_t20timer_pending(nb)) {
+				x25_stop_t20timer(nb);
+				nb->state = X25_LINK_STATE_3;
+			} else {
+				x25_transmit_restart_request(nb);
+				x25_start_t20timer(nb);
+			}
+			break;
+		case X25_LINK_STATE_3:
+			/* clear existing virtual calls */
+			x25_kill_by_neigh(nb);
+
+			x25_transmit_restart_request(nb);
+			nb->state = X25_LINK_STATE_2;
+			x25_start_t20timer(nb);
+			break;
+		}
 		break;
 
 	case X25_DIAGNOSTIC:
@@ -214,8 +241,6 @@ void x25_link_established(struct x25_neigh *nb)
 {
 	switch (nb->state) {
 	case X25_LINK_STATE_0:
-		nb->state = X25_LINK_STATE_2;
-		break;
 	case X25_LINK_STATE_1:
 		x25_transmit_restart_request(nb);
 		nb->state = X25_LINK_STATE_2;
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v5 3/5] net/lapb: fix t1 timer handling for LAPB_STATE_0
From: Martin Schiller @ 2020-11-24  9:35 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093538.21177-1-ms@dev.tdt.de>

1. DTE interface changes immediately to LAPB_STATE_1 and start sending
   SABM(E).

2. DCE interface sends N2-times DM and changes to LAPB_STATE_1
   afterwards if there is no response in the meantime.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/lapb/lapb_timer.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/lapb/lapb_timer.c b/net/lapb/lapb_timer.c
index 8f5b17001a07..baa247fe4ed0 100644
--- a/net/lapb/lapb_timer.c
+++ b/net/lapb/lapb_timer.c
@@ -85,11 +85,18 @@ static void lapb_t1timer_expiry(struct timer_list *t)
 	switch (lapb->state) {
 
 		/*
-		 *	If we are a DCE, keep going DM .. DM .. DM
+		 *	If we are a DCE, send DM up to N2 times, then switch to
+		 *	STATE_1 and send SABM(E).
 		 */
 		case LAPB_STATE_0:
-			if (lapb->mode & LAPB_DCE)
+			if (lapb->mode & LAPB_DCE &&
+			    lapb->n2count != lapb->n2) {
+				lapb->n2count++;
 				lapb_send_control(lapb, LAPB_DM, LAPB_POLLOFF, LAPB_RESPONSE);
+			} else {
+				lapb->state = LAPB_STATE_1;
+				lapb_establish_data_link(lapb);
+			}
 			break;
 
 		/*
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v5 2/5] net/lapb: support netdev events
From: Martin Schiller @ 2020-11-24  9:35 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093538.21177-1-ms@dev.tdt.de>

This patch allows layer2 (LAPB) to react to netdev events itself and
avoids the detour via layer3 (X.25).

1. Establish layer2 on NETDEV_UP events, if the carrier is already up.

2. Call lapb_disconnect_request() on NETDEV_GOING_DOWN events to signal
   the peer that the connection will go down.
   (Only when the carrier is up.)

2. When a NETDEV_DOWN event occur, clear all queues, enter state
   LAPB_STATE_0 and stop all timers.

3. The NETDEV_CHANGE event makes it possible to handle carrier loss and
   detection.

   In case of Carrier Loss, clear all queues, enter state LAPB_STATE_0
   and stop all timers.

   In case of Carrier Detection, we start timer t1 on a DCE interface,
   and on a DTE interface we change to state LAPB_STATE_1 and start
   sending SABM(E).

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/lapb/lapb_iface.c | 94 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/net/lapb/lapb_iface.c b/net/lapb/lapb_iface.c
index 3c03f6512c5f..f226d354aaf0 100644
--- a/net/lapb/lapb_iface.c
+++ b/net/lapb/lapb_iface.c
@@ -418,14 +418,108 @@ int lapb_data_transmit(struct lapb_cb *lapb, struct sk_buff *skb)
 	return used;
 }
 
+/* Handle device status changes. */
+static int lapb_device_event(struct notifier_block *this, unsigned long event,
+			     void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct lapb_cb *lapb;
+
+	if (!net_eq(dev_net(dev), &init_net))
+		return NOTIFY_DONE;
+
+	if (dev->type == ARPHRD_X25) {
+		switch (event) {
+		case NETDEV_UP:
+			lapb_dbg(0, "(%p) Interface up: %s\n", dev,
+				 dev->name);
+
+			if (netif_carrier_ok(dev)) {
+				lapb = lapb_devtostruct(dev);
+				if (!lapb)
+					break;
+
+				lapb_dbg(0, "(%p): Carrier is already up: %s\n",
+					 dev, dev->name);
+				if (lapb->mode & LAPB_DCE) {
+					lapb_start_t1timer(lapb);
+				} else {
+					if (lapb->state == LAPB_STATE_0) {
+						lapb->state = LAPB_STATE_1;
+						lapb_establish_data_link(lapb);
+					}
+				}
+			}
+			break;
+		case NETDEV_GOING_DOWN:
+			if (netif_carrier_ok(dev))
+				lapb_disconnect_request(dev);
+			break;
+		case NETDEV_DOWN:
+			lapb = lapb_devtostruct(dev);
+			if (!lapb)
+				break;
+
+			lapb_dbg(0, "(%p) Interface down: %s\n", dev,
+				 dev->name);
+			lapb_dbg(0, "(%p) S%d -> S0\n", dev,
+				 lapb->state);
+			lapb_clear_queues(lapb);
+			lapb->state = LAPB_STATE_0;
+			lapb->n2count   = 0;
+			lapb_stop_t1timer(lapb);
+			lapb_stop_t2timer(lapb);
+			break;
+		case NETDEV_CHANGE:
+			lapb = lapb_devtostruct(dev);
+			if (!lapb)
+				break;
+
+			if (netif_carrier_ok(dev)) {
+				lapb_dbg(0, "(%p): Carrier detected: %s\n",
+					 dev, dev->name);
+				if (lapb->mode & LAPB_DCE) {
+					lapb_start_t1timer(lapb);
+				} else {
+					if (lapb->state == LAPB_STATE_0) {
+						lapb->state = LAPB_STATE_1;
+						lapb_establish_data_link(lapb);
+					}
+				}
+			} else {
+				lapb_dbg(0, "(%p) Carrier lost: %s\n", dev,
+					 dev->name);
+				lapb_dbg(0, "(%p) S%d -> S0\n", dev,
+					 lapb->state);
+				lapb_clear_queues(lapb);
+				lapb->state = LAPB_STATE_0;
+				lapb->n2count   = 0;
+				lapb_stop_t1timer(lapb);
+				lapb_stop_t2timer(lapb);
+			}
+			break;
+		}
+	}
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block lapb_dev_notifier = {
+	.notifier_call = lapb_device_event,
+};
+
 static int __init lapb_init(void)
 {
+	register_netdevice_notifier(&lapb_dev_notifier);
+
 	return 0;
 }
 
 static void __exit lapb_exit(void)
 {
 	WARN_ON(!list_empty(&lapb_list));
+
+	unregister_netdevice_notifier(&lapb_dev_notifier);
 }
 
 MODULE_AUTHOR("Jonathan Naylor <g4klx@g4klx.demon.co.uk>");
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v5 1/5] net/x25: handle additional netdev events
From: Martin Schiller @ 2020-11-24  9:35 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller
In-Reply-To: <20201124093538.21177-1-ms@dev.tdt.de>

1. Add / remove x25_link_device by NETDEV_REGISTER/UNREGISTER and also
   by NETDEV_POST_TYPE_CHANGE/NETDEV_PRE_TYPE_CHANGE.

   This change is needed so that the x25_neigh struct for an interface
   is already created when it shows up and is kept independently if the
   interface goes UP or DOWN.

   This is used in an upcomming commit, where x25 params of an neighbour
   will get configurable through ioctls.

2. NETDEV_CHANGE event makes it possible to handle carrier loss and
   detection. If carrier is lost, clean up everything related to this
   neighbour by calling x25_link_terminated().

3. Also call x25_link_terminated() for NETDEV_DOWN events and remove the
   call to x25_clear_forward_by_dev() in x25_route_device_down(), as
   this is already called by x25_kill_by_neigh() which gets called by
   x25_link_terminated().

4. Do nothing for NETDEV_UP and NETDEV_GOING_DOWN events, as these will
   be handled in layer 2 (LAPB) and layer3 (X.25) will be informed by
   layer2 when layer2 link is established and layer3 link should be
   initiated.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
---
 net/x25/af_x25.c    | 22 ++++++++++++++++------
 net/x25/x25_link.c  |  6 +++---
 net/x25/x25_route.c |  3 ---
 3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 046d3fee66a9..313a6222ded9 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -233,21 +233,31 @@ static int x25_device_event(struct notifier_block *this, unsigned long event,
 #endif
 	 ) {
 		switch (event) {
-		case NETDEV_UP:
+		case NETDEV_REGISTER:
+		case NETDEV_POST_TYPE_CHANGE:
 			x25_link_device_up(dev);
 			break;
-		case NETDEV_GOING_DOWN:
+		case NETDEV_DOWN:
 			nb = x25_get_neigh(dev);
 			if (nb) {
-				x25_terminate_link(nb);
+				x25_link_terminated(nb);
 				x25_neigh_put(nb);
 			}
-			break;
-		case NETDEV_DOWN:
-			x25_kill_by_device(dev);
 			x25_route_device_down(dev);
+			break;
+		case NETDEV_PRE_TYPE_CHANGE:
+		case NETDEV_UNREGISTER:
 			x25_link_device_down(dev);
 			break;
+		case NETDEV_CHANGE:
+			if (!netif_carrier_ok(dev)) {
+				nb = x25_get_neigh(dev);
+				if (nb) {
+					x25_link_terminated(nb);
+					x25_neigh_put(nb);
+				}
+			}
+			break;
 		}
 	}
 
diff --git a/net/x25/x25_link.c b/net/x25/x25_link.c
index fdae054b7dc1..11e868aa625d 100644
--- a/net/x25/x25_link.c
+++ b/net/x25/x25_link.c
@@ -232,6 +232,9 @@ void x25_link_established(struct x25_neigh *nb)
 void x25_link_terminated(struct x25_neigh *nb)
 {
 	nb->state = X25_LINK_STATE_0;
+	skb_queue_purge(&nb->queue);
+	x25_stop_t20timer(nb);
+
 	/* Out of order: clear existing virtual calls (X.25 03/93 4.6.3) */
 	x25_kill_by_neigh(nb);
 }
@@ -277,9 +280,6 @@ void x25_link_device_up(struct net_device *dev)
  */
 static void __x25_remove_neigh(struct x25_neigh *nb)
 {
-	skb_queue_purge(&nb->queue);
-	x25_stop_t20timer(nb);
-
 	if (nb->node.next) {
 		list_del(&nb->node);
 		x25_neigh_put(nb);
diff --git a/net/x25/x25_route.c b/net/x25/x25_route.c
index 00e46c9a5280..ec2a39e9b3e6 100644
--- a/net/x25/x25_route.c
+++ b/net/x25/x25_route.c
@@ -115,9 +115,6 @@ void x25_route_device_down(struct net_device *dev)
 			__x25_remove_route(rt);
 	}
 	write_unlock_bh(&x25_route_list_lock);
-
-	/* Remove any related forwarding */
-	x25_clear_forward_by_dev(dev);
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v5 0/5] net/x25: netdev event handling
From: Martin Schiller @ 2020-11-24  9:35 UTC (permalink / raw)
  To: andrew.hendry, davem, kuba, xie.he.0141
  Cc: linux-x25, netdev, linux-kernel, Martin Schiller

---

Changes to v4:
o also establish layer2 (LAPB) on NETDEV_UP events, if the carrier is
  already UP.

Changes to v3:
o another complete rework of the patch-set to split event handling
  for layer2 (LAPB) and layer3 (X.25)

Changes to v2:
o restructure complete patch-set
o keep netdev event handling in layer3 (X.25)
o add patch to fix lapb_connect_request() for DCE
o add patch to handle carrier loss correctly in lapb
o drop patch for x25_neighbour param handling
  this may need fixes/cleanup and will be resubmitted later.

Changes to v1:
o fix 'subject_prefix' and 'checkpatch' warnings

---

Martin Schiller (5):
  net/x25: handle additional netdev events
  net/lapb: support netdev events
  net/lapb: fix t1 timer handling for LAPB_STATE_0
  net/x25: fix restart request/confirm handling
  net/x25: remove x25_kill_by_device()

 net/lapb/lapb_iface.c | 94 +++++++++++++++++++++++++++++++++++++++++++
 net/lapb/lapb_timer.c | 11 ++++-
 net/x25/af_x25.c      | 38 ++++++++---------
 net/x25/x25_link.c    | 47 +++++++++++++++++-----
 net/x25/x25_route.c   |  3 --
 5 files changed, 155 insertions(+), 38 deletions(-)

-- 
2.20.1


^ permalink raw reply

* Re: [PATCH 2/3] xsk: change the tx writeable condition
From: Magnus Karlsson @ 2020-11-24  9:28 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	David S. Miller, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Network Development, bpf, linux-kernel
In-Reply-To: <b7b0432d49ab05064efb85f1858b6e6f9e1274bd.1605686678.git.xuanzhuo@linux.alibaba.com>

On Wed, Nov 18, 2020 at 9:25 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> Modify the tx writeable condition from the queue is not full to the
> number of remaining tx queues is less than the half of the total number
> of queues. Because the tx queue not full is a very short time, this will
> cause a large number of EPOLLOUT events, and cause a large number of
> process wake up.
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  net/xdp/xsk.c       | 20 +++++++++++++++++---
>  net/xdp/xsk_queue.h |  6 ++++++
>  2 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 7f0353e..bc3d4ece 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -211,6 +211,17 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len,
>         return 0;
>  }
>
> +static bool xsk_writeable(struct xdp_sock *xs)

Not clear what this function does from the name. How about
xsk_tx_half_free() or maybe xsk_tx_writeable()?

> +{
> +       if (!xs->tx)
> +               return false;

Skip this test as it will slow down the code. It is only needed in one
place below.

> +       if (xskq_cons_left(xs->tx) > xs->tx->nentries / 2)
> +               return false;
> +
> +       return true;
> +}
> +
>  static bool xsk_is_bound(struct xdp_sock *xs)
>  {
>         if (READ_ONCE(xs->state) == XSK_BOUND) {
> @@ -296,7 +307,8 @@ void xsk_tx_release(struct xsk_buff_pool *pool)
>         rcu_read_lock();
>         list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
>                 __xskq_cons_release(xs->tx);
> -               xs->sk.sk_write_space(&xs->sk);
> +               if (xsk_writeable(xs))
> +                       xs->sk.sk_write_space(&xs->sk);
>         }
>         rcu_read_unlock();
>  }
> @@ -442,7 +454,8 @@ static int xsk_generic_xmit(struct sock *sk)
>
>  out:
>         if (sent_frame)
> -               sk->sk_write_space(sk);
> +               if (xsk_writeable(xs))
> +                       sk->sk_write_space(sk);
>
>         mutex_unlock(&xs->mutex);
>         return err;
> @@ -499,7 +512,8 @@ static __poll_t xsk_poll(struct file *file, struct socket *sock,
>
>         if (xs->rx && !xskq_prod_is_empty(xs->rx))
>                 mask |= EPOLLIN | EPOLLRDNORM;
> -       if (xs->tx && !xskq_cons_is_full(xs->tx))
> +

No reason to introduce a newline here.

> +       if (xsk_writeable(xs))

Add an explicit "xs->tx &&" in the if statement here as we removed the
test in xsk_writeable.

>                 mask |= EPOLLOUT | EPOLLWRNORM;
>
>         return mask;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index cdb9cf3..82a5228 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -264,6 +264,12 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
>                 q->nentries;
>  }
>
> +static inline __u64 xskq_cons_left(struct xsk_queue *q)

Let us call this xskq_cons_entries_present() or
xskq_cons_filled_entries(). The word "left" has the connotation that I
still have stuff left to do. While this is kind of true for this case,
it might not be for other cases that can use your function. The
function provides how many (filled) entries that are present in the
ring. Can you come up with a better name as I am not super fond of my
suggestions? It would have been nice to call it xskq_cons_nb_entries()
but there is already such a function that is lazy in nature and that
allows access to the entries.

> +{
> +       /* No barriers needed since data is not accessed */
> +       return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer);
> +}
> +
>  /* Functions for producers */
>
>  static inline bool xskq_prod_is_full(struct xsk_queue *q)
> --
> 1.8.3.1
>

^ permalink raw reply

* Re: [PATCH net-next] net: sched: alias action flags with TCA_ACT_ prefix
From: Vlad Buslov @ 2020-11-24  9:28 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, jhs, xiyou.wangcong, jiri
In-Reply-To: <20201123132244.55768678@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>

On Mon 23 Nov 2020 at 23:22, Jakub Kicinski <kuba@kernel.org> wrote:
> On Sat, 21 Nov 2020 18:09:02 +0200 Vlad Buslov wrote:
>> Currently both filter and action flags use same "TCA_" prefix which makes
>> them hard to distinguish to code and confusing for users. Create aliases
>> for existing action flags constants with "TCA_ACT_" prefix.
>> 
>> Signed-off-by: Vlad Buslov <vlad@buslov.dev>
>
> Are we expecting to add both aliases for all new flags?

I don't think it makes sense to have both aliases for any new flags.

>
> TCA_FLAG_TERSE_DUMP exists only in net-next, we could rename it, right?

You are right. I'll send a fix.


^ permalink raw reply

* Re: netconsole deadlock with virtnet
From: Leon Romanovsky @ 2020-11-24  9:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jakub Kicinski, Steven Rostedt, Sergey Senozhatsky,
	Michael S. Tsirkin, Petr Mladek, John Ogness, virtualization,
	Amit Shah, Itay Aveksis, Ran Rozenstein, netdev
In-Reply-To: <6f046c51-cdcc-77f9-4859-2508d08126f8@redhat.com>

On Tue, Nov 24, 2020 at 04:57:23PM +0800, Jason Wang wrote:
>
> On 2020/11/24 下午4:01, Leon Romanovsky wrote:
> > On Tue, Nov 24, 2020 at 11:22:03AM +0800, Jason Wang wrote:
> > > On 2020/11/24 上午3:21, Jakub Kicinski wrote:
> > > > On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote:
> > > > > On Mon, 23 Nov 2020 10:52:52 -0800
> > > > > Jakub Kicinski <kuba@kernel.org> wrote:
> > > > >
> > > > > > On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:
> > > > > > > On Mon, 23 Nov 2020 13:08:55 +0200
> > > > > > > Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > >
> > > > > > > >    [   10.028024] Chain exists of:
> > > > > > > >    [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2
> > > > > > > Note, the problem is that we have a location that grabs the xmit_lock while
> > > > > > > holding target_list_lock (and possibly console_owner).
> > > > > > Well, it try_locks the xmit_lock. Does lockdep understand try-locks?
> > > > > >
> > > > > > (not that I condone the shenanigans that are going on here)
> > > > > Does it?
> > > > >
> > > > > 	virtnet_poll_tx() {
> > > > > 		__netif_tx_lock() {
> > > > > 			spin_lock(&txq->_xmit_lock);
> > > > Umpf. Right. I was looking at virtnet_poll_cleantx()
> > > >
> > > > > That looks like we can have:
> > > > >
> > > > >
> > > > > 	CPU0		CPU1
> > > > > 	----		----
> > > > >      lock(xmit_lock)
> > > > >
> > > > > 		    lock(console)
> > > > > 		    lock(target_list_lock)
> > > > > 		    __netif_tx_lock()
> > > > > 		        lock(xmit_lock);
> > > > >
> > > > > 			[BLOCKED]
> > > > >
> > > > >      <interrupt>
> > > > >      lock(console)
> > > > >
> > > > >      [BLOCKED]
> > > > >
> > > > >
> > > > >
> > > > >    DEADLOCK.
> > > > >
> > > > >
> > > > > So where is the trylock here?
> > > > >
> > > > > Perhaps you need the trylock in virtnet_poll_tx()?
> > > > That could work. Best if we used normal lock if !!budget, and trylock
> > > > when budget is 0. But maybe that's too hairy.
> > >
> > > If we use trylock, we probably lose(or delay) tx notification that may have
> > > side effects to the stack.
> > >
> > >
> > > > I'm assuming all this trickiness comes from virtqueue_get_buf() needing
> > > > locking vs the TX path? It's pretty unusual for the completion path to
> > > > need locking vs xmit path.
> > >
> > > Two reasons for doing this:
> > >
> > > 1) For some historical reason, we try to free transmitted tx packets in xmit
> > > (see free_old_xmit_skbs() in start_xmit()), we can probably remove this if
> > > we remove the non tx interrupt mode.
> > > 2) virtio core requires virtqueue_get_buf() to be synchronized with
> > > virtqueue_add(), we probably can solve this but it requires some non trivial
> > > refactoring in the virtio core
> > So how will we solve our lockdep issues?
> >
> > Thanks
>
>
> It's not clear to me that whether it's a virtio-net specific issue. E.g the
> above deadlock looks like a generic issue so workaround it via virtio-net
> may not help for other drivers.

It is hard to say, no one else complained except me who is using virtio :).

Thanks

>
> Thanks
>
>
> >
> > > Btw, have a quick search, there are several other drivers that uses tx lock
> > > in the tx NAPI.
> > >
> > > Thanks
> > >
>

^ permalink raw reply

* Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
From: Kuniyuki Iwashima @ 2020-11-24  9:24 UTC (permalink / raw)
  To: kafai
  Cc: ast, benh, bpf, daniel, davem, edumazet, kuba, kuni1840, kuniyu,
	linux-kernel, netdev
In-Reply-To: <20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com>

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Sun, 22 Nov 2020 16:40:20 -0800
> On Sat, Nov 21, 2020 at 07:13:22PM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Thu, 19 Nov 2020 17:53:46 -0800
> > > On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > > > From: Martin KaFai Lau <kafai@fb.com>
> > > > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > > sockets from the closing listener to the selected one.
> > > > > > 
> > > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > > > > requests from the closing listener's queue and relink them to the head of
> > > > > > the new listener's queue. We do not process each request, so the migration
> > > > > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > > > > sockets, we will take special care in the next commit.
> > > > > > 
> > > > > > By default, we select the last element of socks[] as the new listener.
> > > > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > > > > 
> > > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > > > > 
> > > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > > >   socks[2] : C   |      socks[2] : C --'
> > > > > >   socks[3] : D --'
> > > > > > 
> > > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > > requests evenly to new listeners.
> > > > > I don't think it should emphasize/claim there is a specific way that
> > > > > the kernel-pick here can redistribute the requests evenly.  It depends on
> > > > > how the application close/listen.  The userspace can not expect the
> > > > > ordering of socks[] will behave in a certain way.
> > > > 
> > > > I've expected replacing listeners by generations as a general use case.
> > > > But exactly. Users should not expect the undocumented kernel internal.
> > > > 
> > > > 
> > > > > The primary redistribution policy has to depend on BPF which is the
> > > > > policy defined by the user based on its application logic (e.g. how
> > > > > its binary restart work).  The application (and bpf) knows which one
> > > > > is a dying process and can avoid distributing to it.
> > > > > 
> > > > > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > > > > prog is attached, I would even go further to call bpf to redistribute
> > > > > regardless of the sysctl, so I think the sysctl is not necessary.
> > > > 
> > > > I also think it is just an optional fallback, but to pick out a different
> > > > listener everytime, choosing the moved socket was reasonable. So the even
> > > > redistribution for a specific use case is a side effect of such socket
> > > > selection.
> > > > 
> > > > But, users should decide to use either way:
> > > >   (1) let the kernel select a new listener randomly
> > > >   (2) select a particular listener by eBPF
> > > > 
> > > > I will update the commit message like:
> > > > The kernel selects a new listener randomly, but as the side effect, it can
> > > > redistribute packets evenly for a specific case where an application
> > > > replaces listeners by generations.
> > > Since there is no feedback on sysctl, so may be something missed
> > > in the lines.
> > 
> > I'm sorry, I have missed this point while thinking about each reply...
> > 
> > 
> > > I don't think this migration logic should depend on a sysctl.
> > > At least not when a bpf prog is attached that is capable of doing
> > > migration, it is too fragile to ask user to remember to turn on
> > > the sysctl before attaching the bpf prog.
> > > 
> > > Your use case is to primarily based on bpf prog to pick or only based
> > > on kernel to do a random pick?
> Again, what is your primarily use case?

We have so many services and components that I cannot grasp all of their
implementations, but I have started this series because a service component
based on the random pick by the kernel suffered from the issue.


> > I think we have to care about both cases.
> > 
> > I think we can always enable the migration feature if eBPF prog is not
> > attached. On the other hand, if BPF_PROG_TYPE_SK_REUSEPORT prog is attached
> > to select a listener by some rules, along updating the kernel,
> > redistributing requests without user intention can break the application.
> > So, there is something needed to confirm user intension at least if eBPF
> > prog is attached.
> Right, something being able to tell if the bpf prog can do migration
> can confirm the user intention here.  However, this will not be a
> sysctl.
> 
> A new bpf_attach_type "BPF_SK_REUSEPORT_SELECT_OR_MIGRATE" can be added.
> "prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE"
> can be used to decide if migration can be done by the bpf prog.
> Although the prog->expected_attach_type has not been checked for
> BPF_PROG_TYPE_SK_REUSEPORT, there was an earlier discussion
> that the risk of breaking is very small and is acceptable.
> 
> Instead of depending on !reuse_md->data to decide if it
> is doing migration or not, a clearer signal should be given
> to the bpf prog.  A "u8 migration" can be added to "struct sk_reuseport_kern"
> (and to "struct sk_reuseport_md" accordingly).  It can tell
> the bpf prog that it is doing migration.  It should also tell if it is
> migrating a list of established sk(s) or an individual req_sk.
> Accessing "reuse_md->migration" should only be allowed for
> BPF_SK_REUSEPORT_SELECT_OR_MIGRATE during is_valid_access().
> 
> During migration, if skb is not available, an empty skb can be used.
> Migration is a slow path and does not happen very often, so it will
> be fine even it has to create a temp skb (or may be a static const skb
> can be used, not sure but this is implementation details).

I greatly appreciate your detailed idea and explanation!
I will try to implement this.


> > But honestly, I believe such eBPF users can follow this change and
> > implement migration eBPF prog if we introduce such a breaking change.
> > 
> > 
> > > Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
> > > change it until all the listening sockets are closed which is exactly
> > > the service disruption problem this series is trying to solve here.
> > 
> > Oh, exactly...
> > If we apply this series by live patching, we cannot enable the feature
> > without service disruption.
> > 
> > To enable the migration feature dynamically, how about this logic?
> > In this logic, we do not save the sysctl value and check it at each time.
> > 
> >   1. no eBPF prog attached -> ON
> >   2. eBPF prog attached and sysctl is 0 -> OFF
> No.  When bpf prog is attached and it clearly signals (expected_attach_type
> here) it can do migration, it should not depend on anything else.  It is very
> confusing to use.  When a prog is successfully loaded, verified
> and attached, it is expected to run.
> 
> This sysctl essentially only disables the bpf prog with
> type == BPF_PROG_TYPE_SK_REUSEPORT running at a particular point.
> This is going down a path that having another sysctl in the future
> to disable another bpf prog type.  If there would be a need to disable
> bpf prog on a type-by-type bases, it would need a more
> generic solution on the bpf side and do it in a consistent way
> for all prog types.  It needs a separate and longer discussion.
> 
> All behaviors of the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE bpf prog
> should not depend on this sysctl at all .
> 
> /* Pseudo code to show the idea only.
>  * Actual implementation should try to fit
>  * better into the current code and should look
>  * quite different from here.
>  */
> 
> if ((prog && prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)) {
> 	/* call bpf to migrate */
> 	action = BPF_PROG_RUN(prog, &reuse_kern);
> 
> 	if (action == SK_PASS) {
> 		if (!reuse_kern.selected_sk)
> 			/* fallback to kernel random pick */
> 		else
> 			/* migrate to reuse_kern.selected_sk */
> 	} else {
> 		/* action == SK_DROP. don't do migration at all and
> 		 * don't fallback to kernel random pick.
> 		 */ 
> 	}
> }
> 
> Going back to the sysctl, with BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> do you still have a need on adding sysctl_tcp_migrate_req?

No, now I do not think the option should be sysctl.
It will be BPF_SK_REUSEPORT_SELECT_OR_MIGRATE in the next series.
Thank you!


> Regardless, if there is still a need,
> the document for sysctl_tcp_migrate_req should be something like:
> "the kernel will do a random pick when there is no bpf prog
>  attached to the reuseport group...."
> 
> [ ps, my reply will be slow in this week. ]

^ permalink raw reply

* RE: [PATCH v2 1/1] ath10k: add option for chip-id based BDF selection
From: Rakesh Pillai @ 2020-11-24  9:18 UTC (permalink / raw)
  To: 'Doug Anderson', 'Abhishek Kumar'
  Cc: 'Kalle Valo', 'LKML', 'ath10k',
	'Brian Norris', 'linux-wireless',
	'David S. Miller', 'Jakub Kicinski',
	'netdev'
In-Reply-To: <CAD=FV=XKCLgL6Bt+3KfqKByyP5fpwXOh6TNHXAoXkaQJRzjKjQ@mail.gmail.com>



> -----Original Message-----
> From: Doug Anderson <dianders@chromium.org>
> Sent: Tuesday, November 24, 2020 6:27 AM
> To: Abhishek Kumar <kuabhs@chromium.org>
> Cc: Kalle Valo <kvalo@codeaurora.org>; Rakesh Pillai
> <pillair@codeaurora.org>; LKML <linux-kernel@vger.kernel.org>; ath10k
> <ath10k@lists.infradead.org>; Brian Norris <briannorris@chromium.org>;
> linux-wireless <linux-wireless@vger.kernel.org>; David S. Miller
> <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; netdev
> <netdev@vger.kernel.org>
> Subject: Re: [PATCH v2 1/1] ath10k: add option for chip-id based BDF
> selection
> 
> Hi,
> 
> On Thu, Nov 12, 2020 at 12:09 PM Abhishek Kumar <kuabhs@chromium.org>
> wrote:
> >
> > In some devices difference in chip-id should be enough to pick
> > the right BDF. Add another support for chip-id based BDF selection.
> > With this new option, ath10k supports 2 fallback options.
> >
> > The board name with chip-id as option looks as follows
> > board name 'bus=snoc,qmi-board-id=ff,qmi-chip-id=320'
> >
> > Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.2.2-00696-QCAHLSWMTPL-1
> > Tested-on: QCA6174 HW3.2 WLAN.RM.4.4.1-00157-QCARMSWPZ-1
> > Signed-off-by: Abhishek Kumar <kuabhs@chromium.org>
> > ---
> >
> > (no changes since v1)
> 
> I think you need to work on the method you're using to generate your
> patches.  There are most definitely changes since v1.  You described
> them in your cover letter (which you don't really need for a singleton
> patch) instead of here.
> 
> 
> > @@ -1438,12 +1439,17 @@ static int
> ath10k_core_create_board_name(struct ath10k *ar, char *name,
> >         }
> >
> >         if (ar->id.qmi_ids_valid) {
> > -               if (with_variant && ar->id.bdf_ext[0] != '\0')
> > +               if (with_additional_params && ar->id.bdf_ext[0] != '\0')
> >                         scnprintf(name, name_len,
> >                                   "bus=%s,qmi-board-id=%x,qmi-chip-id=%x%s",
> >                                   ath10k_bus_str(ar->hif.bus),
> >                                   ar->id.qmi_board_id, ar->id.qmi_chip_id,
> >                                   variant);
> > +               else if (with_additional_params)
> > +                       scnprintf(name, name_len,
> > +                                 "bus=%s,qmi-board-id=%x,qmi-chip-id=%x",
> > +                                 ath10k_bus_str(ar->hif.bus),
> > +                                 ar->id.qmi_board_id, ar->id.qmi_chip_id);
> 
> I believe this is exactly opposite of what Rakesh was requesting.
> Specifically, he was trying to eliminate the extra scnprintf() but I
> think he still agreed that it was a good idea to generate 3 different
> strings.  I believe the proper diff to apply to v1 is:
> 
> https://crrev.com/c/255643
> 
> -Doug

Hi Abhishek/Doug,

I missed on reviewing this change. Also I agree with Doug that this is not the change I was looking for.

The argument "with_variant" can be renamed to "with_extra_params". There is no need for any new argument to this function.
Case 1: with_extra_params=0,  ar->id.bdf_ext[0] = 0             ->   The default name will be used (bus=snoc,qmi_board_id=0xab)
Case 2: with_extra_params=1,  ar->id.bdf_ext[0] = 0             ->   bus=snoc,qmi_board_id=0xab,qmi_chip_id=0xcd
Case 3: with_extra_params=1,  ar->id.bdf_ext[0] = "xyz"      ->   bus=snoc,qmi_board_id=0xab,qmi_chip_id=0xcd,variant=xyz

ar->id.bdf_ext[0] depends on the DT entry for variant field.

Thanks,
Rakesh Pillai.


^ permalink raw reply

* Re: [PATCH v15 6/9] arm64/kvm: Add hypercall service for kvm ptp.
From: Marc Zyngier @ 2020-11-24  9:07 UTC (permalink / raw)
  To: Jianyong Wu
  Cc: netdev, yangbo.lu, john.stultz, tglx, pbonzini,
	sean.j.christopherson, richardcochran, Mark Rutland, will,
	Suzuki Poulose, Andre Przywara, Steven Price, linux-kernel,
	linux-arm-kernel, kvmarm, kvm, Steve Capper, Justin He, nd
In-Reply-To: <HE1PR0802MB255534CF7A04FB5A6CE99A67F4FB0@HE1PR0802MB2555.eurprd08.prod.outlook.com>

On 2020-11-24 05:20, Jianyong Wu wrote:
> Hi Marc,

[...]

>> > +/* ptp_kvm counter type ID */
>> > +#define ARM_PTP_VIRT_COUNTER			0
>> > +#define ARM_PTP_PHY_COUNTER			1
>> > +#define ARM_PTP_NONE_COUNTER			2
>> 
>> The architecture definitely doesn't have this last counter.
> 
> Yeah, this is just represent no counter data needed from guest.
> Some annotation should be added here.

I'd rather you remove it entirely, or explain why you really cannot
do without a fake counter.

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply

* Re: [PATCH v15 7/9] ptp: arm/arm64: Enable ptp_kvm for arm/arm64
From: Marc Zyngier @ 2020-11-24  9:04 UTC (permalink / raw)
  To: Jianyong Wu
  Cc: netdev, yangbo.lu, john.stultz, tglx, pbonzini,
	sean.j.christopherson, richardcochran, Mark Rutland, will,
	Suzuki Poulose, Andre Przywara, Steven Price, linux-kernel,
	linux-arm-kernel, kvmarm, kvm, Steve Capper, Justin He, nd
In-Reply-To: <HE1PR0802MB2555C5D09EA2BF0BA369BE37F4FB0@HE1PR0802MB2555.eurprd08.prod.outlook.com>

Jianyong,

On 2020-11-24 05:37, Jianyong Wu wrote:
> Hi Marc,

[...]

>> > +
>> 	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_PTP_FU
>> NC_ID,
>> > +			     ARM_PTP_NONE_COUNTER, &hvc_res);
>> 
>> I really don't see the need to use a non-architectural counter ID.
>> Using the virtual counter ID should just be fine, and shouldn't lead 
>> to any
>> issue.
>> 
> 
>> Am I missing something?
> 
> In this function, no counter data is needed. If virtual counter ID is
> used here, user may be confused with why we get virtual counter
> data and do nothing with it. So I explicitly add a new "NONE" counter
> ID to make it clear.
> 
> WDYT?

ITIABI (I Think It's A Bad Idea). There are two counters, and the API
allows to retrieve the data for any of these two. If the "user" doesn't
want to do anything with the data, that's their problem.

Here, you can just sue the virtual counter, and that will give you the
exact same semantic, without inventing non-architectural state.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply

* [PATCH bpf-next v3 7/7] samples: bpf: remove bpf_load loader completely
From: Daniel T. Lee @ 2020-11-24  9:03 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, brakmo,
	Jesper Dangaard Brouer, Andrii Nakryiko, Lorenzo Bianconi,
	David Ahern, Yonghong Song, Toke Høiland-Jørgensen,
	Ira Weiny, Thomas Graf, Jakub Kicinski, Martin KaFai Lau,
	John Fastabend
  Cc: bpf, netdev, Xdp
In-Reply-To: <20201124090310.24374-1-danieltimlee@gmail.com>

Numerous refactoring that rewrites BPF programs written with bpf_load
to use the libbpf loader was finally completed, resulting in BPF
programs using bpf_load within the kernel being completely no longer
present.

This commit removes bpf_load, an outdated bpf loader that is difficult
to keep up with the latest kernel BPF and causes confusion.

Also, this commit removes the unused trace_helper and bpf_load from
samples/bpf target objects from Makefile.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
Changes in v2:
 - merge commit with changing Makefile
 
 samples/bpf/Makefile            |  10 +-
 samples/bpf/bpf_load.c          | 667 --------------------------------
 samples/bpf/bpf_load.h          |  57 ---
 samples/bpf/xdp2skb_meta_kern.c |   2 +-
 4 files changed, 5 insertions(+), 731 deletions(-)
 delete mode 100644 samples/bpf/bpf_load.c
 delete mode 100644 samples/bpf/bpf_load.h

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 25380e04897e..05db041f8b18 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -73,7 +73,7 @@ tracex5-objs := tracex5_user.o $(TRACE_HELPERS)
 tracex6-objs := tracex6_user.o
 tracex7-objs := tracex7_user.o
 test_probe_write_user-objs := test_probe_write_user_user.o
-trace_output-objs := trace_output_user.o $(TRACE_HELPERS)
+trace_output-objs := trace_output_user.o
 lathist-objs := lathist_user.o
 offwaketime-objs := offwaketime_user.o $(TRACE_HELPERS)
 spintest-objs := spintest_user.o $(TRACE_HELPERS)
@@ -91,8 +91,8 @@ test_current_task_under_cgroup-objs := $(CGROUP_HELPERS) \
 				       test_current_task_under_cgroup_user.o
 trace_event-objs := trace_event_user.o $(TRACE_HELPERS)
 sampleip-objs := sampleip_user.o $(TRACE_HELPERS)
-tc_l2_redirect-objs := bpf_load.o tc_l2_redirect_user.o
-lwt_len_hist-objs := bpf_load.o lwt_len_hist_user.o
+tc_l2_redirect-objs := tc_l2_redirect_user.o
+lwt_len_hist-objs := lwt_len_hist_user.o
 xdp_tx_iptunnel-objs := xdp_tx_iptunnel_user.o
 test_map_in_map-objs := test_map_in_map_user.o
 per_socket_stats_example-objs := cookie_uid_helper_example.o
@@ -108,7 +108,7 @@ xdpsock-objs := xdpsock_user.o
 xsk_fwd-objs := xsk_fwd.o
 xdp_fwd-objs := xdp_fwd_user.o
 task_fd_query-objs := task_fd_query_user.o $(TRACE_HELPERS)
-xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+xdp_sample_pkts-objs := xdp_sample_pkts_user.o
 ibumad-objs := ibumad_user.o
 hbm-objs := hbm.o $(CGROUP_HELPERS)
 
@@ -197,8 +197,6 @@ TPROGS_CFLAGS += --sysroot=$(SYSROOT)
 TPROGS_LDFLAGS := -L$(SYSROOT)/usr/lib
 endif
 
-TPROGCFLAGS_bpf_load.o += -Wno-unused-variable
-
 TPROGS_LDLIBS			+= $(LIBBPF) -lelf -lz
 TPROGLDLIBS_tracex4		+= -lrt
 TPROGLDLIBS_trace_output	+= -lrt
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
deleted file mode 100644
index c5ad528f046e..000000000000
--- a/samples/bpf/bpf_load.c
+++ /dev/null
@@ -1,667 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <stdio.h>
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <fcntl.h>
-#include <libelf.h>
-#include <gelf.h>
-#include <errno.h>
-#include <unistd.h>
-#include <string.h>
-#include <stdbool.h>
-#include <stdlib.h>
-#include <linux/bpf.h>
-#include <linux/filter.h>
-#include <linux/perf_event.h>
-#include <linux/netlink.h>
-#include <linux/rtnetlink.h>
-#include <linux/types.h>
-#include <sys/socket.h>
-#include <sys/syscall.h>
-#include <sys/ioctl.h>
-#include <sys/mman.h>
-#include <poll.h>
-#include <ctype.h>
-#include <assert.h>
-#include <bpf/bpf.h>
-#include "bpf_load.h"
-#include "perf-sys.h"
-
-#define DEBUGFS "/sys/kernel/debug/tracing/"
-
-static char license[128];
-static int kern_version;
-static bool processed_sec[128];
-char bpf_log_buf[BPF_LOG_BUF_SIZE];
-int map_fd[MAX_MAPS];
-int prog_fd[MAX_PROGS];
-int event_fd[MAX_PROGS];
-int prog_cnt;
-int prog_array_fd = -1;
-
-struct bpf_map_data map_data[MAX_MAPS];
-int map_data_count;
-
-static int populate_prog_array(const char *event, int prog_fd)
-{
-	int ind = atoi(event), err;
-
-	err = bpf_map_update_elem(prog_array_fd, &ind, &prog_fd, BPF_ANY);
-	if (err < 0) {
-		printf("failed to store prog_fd in prog_array\n");
-		return -1;
-	}
-	return 0;
-}
-
-static int write_kprobe_events(const char *val)
-{
-	int fd, ret, flags;
-
-	if (val == NULL)
-		return -1;
-	else if (val[0] == '\0')
-		flags = O_WRONLY | O_TRUNC;
-	else
-		flags = O_WRONLY | O_APPEND;
-
-	fd = open(DEBUGFS "kprobe_events", flags);
-
-	ret = write(fd, val, strlen(val));
-	close(fd);
-
-	return ret;
-}
-
-static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
-{
-	bool is_socket = strncmp(event, "socket", 6) == 0;
-	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
-	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
-	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
-	bool is_raw_tracepoint = strncmp(event, "raw_tracepoint/", 15) == 0;
-	bool is_xdp = strncmp(event, "xdp", 3) == 0;
-	bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
-	bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
-	bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
-	bool is_sockops = strncmp(event, "sockops", 7) == 0;
-	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
-	bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
-	size_t insns_cnt = size / sizeof(struct bpf_insn);
-	enum bpf_prog_type prog_type;
-	char buf[256];
-	int fd, efd, err, id;
-	struct perf_event_attr attr = {};
-
-	attr.type = PERF_TYPE_TRACEPOINT;
-	attr.sample_type = PERF_SAMPLE_RAW;
-	attr.sample_period = 1;
-	attr.wakeup_events = 1;
-
-	if (is_socket) {
-		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
-	} else if (is_kprobe || is_kretprobe) {
-		prog_type = BPF_PROG_TYPE_KPROBE;
-	} else if (is_tracepoint) {
-		prog_type = BPF_PROG_TYPE_TRACEPOINT;
-	} else if (is_raw_tracepoint) {
-		prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT;
-	} else if (is_xdp) {
-		prog_type = BPF_PROG_TYPE_XDP;
-	} else if (is_perf_event) {
-		prog_type = BPF_PROG_TYPE_PERF_EVENT;
-	} else if (is_cgroup_skb) {
-		prog_type = BPF_PROG_TYPE_CGROUP_SKB;
-	} else if (is_cgroup_sk) {
-		prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
-	} else if (is_sockops) {
-		prog_type = BPF_PROG_TYPE_SOCK_OPS;
-	} else if (is_sk_skb) {
-		prog_type = BPF_PROG_TYPE_SK_SKB;
-	} else if (is_sk_msg) {
-		prog_type = BPF_PROG_TYPE_SK_MSG;
-	} else {
-		printf("Unknown event '%s'\n", event);
-		return -1;
-	}
-
-	if (prog_cnt == MAX_PROGS)
-		return -1;
-
-	fd = bpf_load_program(prog_type, prog, insns_cnt, license, kern_version,
-			      bpf_log_buf, BPF_LOG_BUF_SIZE);
-	if (fd < 0) {
-		printf("bpf_load_program() err=%d\n%s", errno, bpf_log_buf);
-		return -1;
-	}
-
-	prog_fd[prog_cnt++] = fd;
-
-	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
-		return 0;
-
-	if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-		if (is_socket)
-			event += 6;
-		else
-			event += 7;
-		if (*event != '/')
-			return 0;
-		event++;
-		if (!isdigit(*event)) {
-			printf("invalid prog number\n");
-			return -1;
-		}
-		return populate_prog_array(event, fd);
-	}
-
-	if (is_raw_tracepoint) {
-		efd = bpf_raw_tracepoint_open(event + 15, fd);
-		if (efd < 0) {
-			printf("tracepoint %s %s\n", event + 15, strerror(errno));
-			return -1;
-		}
-		event_fd[prog_cnt - 1] = efd;
-		return 0;
-	}
-
-	if (is_kprobe || is_kretprobe) {
-		bool need_normal_check = true;
-		const char *event_prefix = "";
-
-		if (is_kprobe)
-			event += 7;
-		else
-			event += 10;
-
-		if (*event == 0) {
-			printf("event name cannot be empty\n");
-			return -1;
-		}
-
-		if (isdigit(*event))
-			return populate_prog_array(event, fd);
-
-#ifdef __x86_64__
-		if (strncmp(event, "sys_", 4) == 0) {
-			snprintf(buf, sizeof(buf), "%c:__x64_%s __x64_%s",
-				is_kprobe ? 'p' : 'r', event, event);
-			err = write_kprobe_events(buf);
-			if (err >= 0) {
-				need_normal_check = false;
-				event_prefix = "__x64_";
-			}
-		}
-#endif
-		if (need_normal_check) {
-			snprintf(buf, sizeof(buf), "%c:%s %s",
-				is_kprobe ? 'p' : 'r', event, event);
-			err = write_kprobe_events(buf);
-			if (err < 0) {
-				printf("failed to create kprobe '%s' error '%s'\n",
-				       event, strerror(errno));
-				return -1;
-			}
-		}
-
-		strcpy(buf, DEBUGFS);
-		strcat(buf, "events/kprobes/");
-		strcat(buf, event_prefix);
-		strcat(buf, event);
-		strcat(buf, "/id");
-	} else if (is_tracepoint) {
-		event += 11;
-
-		if (*event == 0) {
-			printf("event name cannot be empty\n");
-			return -1;
-		}
-		strcpy(buf, DEBUGFS);
-		strcat(buf, "events/");
-		strcat(buf, event);
-		strcat(buf, "/id");
-	}
-
-	efd = open(buf, O_RDONLY, 0);
-	if (efd < 0) {
-		printf("failed to open event %s\n", event);
-		return -1;
-	}
-
-	err = read(efd, buf, sizeof(buf));
-	if (err < 0 || err >= sizeof(buf)) {
-		printf("read from '%s' failed '%s'\n", event, strerror(errno));
-		return -1;
-	}
-
-	close(efd);
-
-	buf[err] = 0;
-	id = atoi(buf);
-	attr.config = id;
-
-	efd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
-	if (efd < 0) {
-		printf("event %d fd %d err %s\n", id, efd, strerror(errno));
-		return -1;
-	}
-	event_fd[prog_cnt - 1] = efd;
-	err = ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
-	if (err < 0) {
-		printf("ioctl PERF_EVENT_IOC_ENABLE failed err %s\n",
-		       strerror(errno));
-		return -1;
-	}
-	err = ioctl(efd, PERF_EVENT_IOC_SET_BPF, fd);
-	if (err < 0) {
-		printf("ioctl PERF_EVENT_IOC_SET_BPF failed err %s\n",
-		       strerror(errno));
-		return -1;
-	}
-
-	return 0;
-}
-
-static int load_maps(struct bpf_map_data *maps, int nr_maps,
-		     fixup_map_cb fixup_map)
-{
-	int i, numa_node;
-
-	for (i = 0; i < nr_maps; i++) {
-		if (fixup_map) {
-			fixup_map(&maps[i], i);
-			/* Allow userspace to assign map FD prior to creation */
-			if (maps[i].fd != -1) {
-				map_fd[i] = maps[i].fd;
-				continue;
-			}
-		}
-
-		numa_node = maps[i].def.map_flags & BPF_F_NUMA_NODE ?
-			maps[i].def.numa_node : -1;
-
-		if (maps[i].def.type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
-		    maps[i].def.type == BPF_MAP_TYPE_HASH_OF_MAPS) {
-			int inner_map_fd = map_fd[maps[i].def.inner_map_idx];
-
-			map_fd[i] = bpf_create_map_in_map_node(maps[i].def.type,
-							maps[i].name,
-							maps[i].def.key_size,
-							inner_map_fd,
-							maps[i].def.max_entries,
-							maps[i].def.map_flags,
-							numa_node);
-		} else {
-			map_fd[i] = bpf_create_map_node(maps[i].def.type,
-							maps[i].name,
-							maps[i].def.key_size,
-							maps[i].def.value_size,
-							maps[i].def.max_entries,
-							maps[i].def.map_flags,
-							numa_node);
-		}
-		if (map_fd[i] < 0) {
-			printf("failed to create map %d (%s): %d %s\n",
-			       i, maps[i].name, errno, strerror(errno));
-			return 1;
-		}
-		maps[i].fd = map_fd[i];
-
-		if (maps[i].def.type == BPF_MAP_TYPE_PROG_ARRAY)
-			prog_array_fd = map_fd[i];
-	}
-	return 0;
-}
-
-static int get_sec(Elf *elf, int i, GElf_Ehdr *ehdr, char **shname,
-		   GElf_Shdr *shdr, Elf_Data **data)
-{
-	Elf_Scn *scn;
-
-	scn = elf_getscn(elf, i);
-	if (!scn)
-		return 1;
-
-	if (gelf_getshdr(scn, shdr) != shdr)
-		return 2;
-
-	*shname = elf_strptr(elf, ehdr->e_shstrndx, shdr->sh_name);
-	if (!*shname || !shdr->sh_size)
-		return 3;
-
-	*data = elf_getdata(scn, 0);
-	if (!*data || elf_getdata(scn, *data) != NULL)
-		return 4;
-
-	return 0;
-}
-
-static int parse_relo_and_apply(Elf_Data *data, Elf_Data *symbols,
-				GElf_Shdr *shdr, struct bpf_insn *insn,
-				struct bpf_map_data *maps, int nr_maps)
-{
-	int i, nrels;
-
-	nrels = shdr->sh_size / shdr->sh_entsize;
-
-	for (i = 0; i < nrels; i++) {
-		GElf_Sym sym;
-		GElf_Rel rel;
-		unsigned int insn_idx;
-		bool match = false;
-		int j, map_idx;
-
-		gelf_getrel(data, i, &rel);
-
-		insn_idx = rel.r_offset / sizeof(struct bpf_insn);
-
-		gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym);
-
-		if (insn[insn_idx].code != (BPF_LD | BPF_IMM | BPF_DW)) {
-			printf("invalid relo for insn[%d].code 0x%x\n",
-			       insn_idx, insn[insn_idx].code);
-			return 1;
-		}
-		insn[insn_idx].src_reg = BPF_PSEUDO_MAP_FD;
-
-		/* Match FD relocation against recorded map_data[] offset */
-		for (map_idx = 0; map_idx < nr_maps; map_idx++) {
-			if (maps[map_idx].elf_offset == sym.st_value) {
-				match = true;
-				break;
-			}
-		}
-		if (match) {
-			insn[insn_idx].imm = maps[map_idx].fd;
-		} else {
-			printf("invalid relo for insn[%d] no map_data match\n",
-			       insn_idx);
-			return 1;
-		}
-	}
-
-	return 0;
-}
-
-static int cmp_symbols(const void *l, const void *r)
-{
-	const GElf_Sym *lsym = (const GElf_Sym *)l;
-	const GElf_Sym *rsym = (const GElf_Sym *)r;
-
-	if (lsym->st_value < rsym->st_value)
-		return -1;
-	else if (lsym->st_value > rsym->st_value)
-		return 1;
-	else
-		return 0;
-}
-
-static int load_elf_maps_section(struct bpf_map_data *maps, int maps_shndx,
-				 Elf *elf, Elf_Data *symbols, int strtabidx)
-{
-	int map_sz_elf, map_sz_copy;
-	bool validate_zero = false;
-	Elf_Data *data_maps;
-	int i, nr_maps;
-	GElf_Sym *sym;
-	Elf_Scn *scn;
-	int copy_sz;
-
-	if (maps_shndx < 0)
-		return -EINVAL;
-	if (!symbols)
-		return -EINVAL;
-
-	/* Get data for maps section via elf index */
-	scn = elf_getscn(elf, maps_shndx);
-	if (scn)
-		data_maps = elf_getdata(scn, NULL);
-	if (!scn || !data_maps) {
-		printf("Failed to get Elf_Data from maps section %d\n",
-		       maps_shndx);
-		return -EINVAL;
-	}
-
-	/* For each map get corrosponding symbol table entry */
-	sym = calloc(MAX_MAPS+1, sizeof(GElf_Sym));
-	for (i = 0, nr_maps = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) {
-		assert(nr_maps < MAX_MAPS+1);
-		if (!gelf_getsym(symbols, i, &sym[nr_maps]))
-			continue;
-		if (sym[nr_maps].st_shndx != maps_shndx)
-			continue;
-		/* Only increment iif maps section */
-		nr_maps++;
-	}
-
-	/* Align to map_fd[] order, via sort on offset in sym.st_value */
-	qsort(sym, nr_maps, sizeof(GElf_Sym), cmp_symbols);
-
-	/* Keeping compatible with ELF maps section changes
-	 * ------------------------------------------------
-	 * The program size of struct bpf_load_map_def is known by loader
-	 * code, but struct stored in ELF file can be different.
-	 *
-	 * Unfortunately sym[i].st_size is zero.  To calculate the
-	 * struct size stored in the ELF file, assume all struct have
-	 * the same size, and simply divide with number of map
-	 * symbols.
-	 */
-	map_sz_elf = data_maps->d_size / nr_maps;
-	map_sz_copy = sizeof(struct bpf_load_map_def);
-	if (map_sz_elf < map_sz_copy) {
-		/*
-		 * Backward compat, loading older ELF file with
-		 * smaller struct, keeping remaining bytes zero.
-		 */
-		map_sz_copy = map_sz_elf;
-	} else if (map_sz_elf > map_sz_copy) {
-		/*
-		 * Forward compat, loading newer ELF file with larger
-		 * struct with unknown features. Assume zero means
-		 * feature not used.  Thus, validate rest of struct
-		 * data is zero.
-		 */
-		validate_zero = true;
-	}
-
-	/* Memcpy relevant part of ELF maps data to loader maps */
-	for (i = 0; i < nr_maps; i++) {
-		struct bpf_load_map_def *def;
-		unsigned char *addr, *end;
-		const char *map_name;
-		size_t offset;
-
-		map_name = elf_strptr(elf, strtabidx, sym[i].st_name);
-		maps[i].name = strdup(map_name);
-		if (!maps[i].name) {
-			printf("strdup(%s): %s(%d)\n", map_name,
-			       strerror(errno), errno);
-			free(sym);
-			return -errno;
-		}
-
-		/* Symbol value is offset into ELF maps section data area */
-		offset = sym[i].st_value;
-		def = (struct bpf_load_map_def *)(data_maps->d_buf + offset);
-		maps[i].elf_offset = offset;
-		memset(&maps[i].def, 0, sizeof(struct bpf_load_map_def));
-		memcpy(&maps[i].def, def, map_sz_copy);
-
-		/* Verify no newer features were requested */
-		if (validate_zero) {
-			addr = (unsigned char *) def + map_sz_copy;
-			end  = (unsigned char *) def + map_sz_elf;
-			for (; addr < end; addr++) {
-				if (*addr != 0) {
-					free(sym);
-					return -EFBIG;
-				}
-			}
-		}
-	}
-
-	free(sym);
-	return nr_maps;
-}
-
-static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
-{
-	int fd, i, ret, maps_shndx = -1, strtabidx = -1;
-	Elf *elf;
-	GElf_Ehdr ehdr;
-	GElf_Shdr shdr, shdr_prog;
-	Elf_Data *data, *data_prog, *data_maps = NULL, *symbols = NULL;
-	char *shname, *shname_prog;
-	int nr_maps = 0;
-
-	/* reset global variables */
-	kern_version = 0;
-	memset(license, 0, sizeof(license));
-	memset(processed_sec, 0, sizeof(processed_sec));
-
-	if (elf_version(EV_CURRENT) == EV_NONE)
-		return 1;
-
-	fd = open(path, O_RDONLY, 0);
-	if (fd < 0)
-		return 1;
-
-	elf = elf_begin(fd, ELF_C_READ, NULL);
-
-	if (!elf)
-		return 1;
-
-	if (gelf_getehdr(elf, &ehdr) != &ehdr)
-		return 1;
-
-	/* clear all kprobes */
-	i = write_kprobe_events("");
-
-	/* scan over all elf sections to get license and map info */
-	for (i = 1; i < ehdr.e_shnum; i++) {
-
-		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
-			continue;
-
-		if (0) /* helpful for llvm debugging */
-			printf("section %d:%s data %p size %zd link %d flags %d\n",
-			       i, shname, data->d_buf, data->d_size,
-			       shdr.sh_link, (int) shdr.sh_flags);
-
-		if (strcmp(shname, "license") == 0) {
-			processed_sec[i] = true;
-			memcpy(license, data->d_buf, data->d_size);
-		} else if (strcmp(shname, "version") == 0) {
-			processed_sec[i] = true;
-			if (data->d_size != sizeof(int)) {
-				printf("invalid size of version section %zd\n",
-				       data->d_size);
-				return 1;
-			}
-			memcpy(&kern_version, data->d_buf, sizeof(int));
-		} else if (strcmp(shname, "maps") == 0) {
-			int j;
-
-			maps_shndx = i;
-			data_maps = data;
-			for (j = 0; j < MAX_MAPS; j++)
-				map_data[j].fd = -1;
-		} else if (shdr.sh_type == SHT_SYMTAB) {
-			strtabidx = shdr.sh_link;
-			symbols = data;
-		}
-	}
-
-	ret = 1;
-
-	if (!symbols) {
-		printf("missing SHT_SYMTAB section\n");
-		goto done;
-	}
-
-	if (data_maps) {
-		nr_maps = load_elf_maps_section(map_data, maps_shndx,
-						elf, symbols, strtabidx);
-		if (nr_maps < 0) {
-			printf("Error: Failed loading ELF maps (errno:%d):%s\n",
-			       nr_maps, strerror(-nr_maps));
-			goto done;
-		}
-		if (load_maps(map_data, nr_maps, fixup_map))
-			goto done;
-		map_data_count = nr_maps;
-
-		processed_sec[maps_shndx] = true;
-	}
-
-	/* process all relo sections, and rewrite bpf insns for maps */
-	for (i = 1; i < ehdr.e_shnum; i++) {
-		if (processed_sec[i])
-			continue;
-
-		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
-			continue;
-
-		if (shdr.sh_type == SHT_REL) {
-			struct bpf_insn *insns;
-
-			/* locate prog sec that need map fixup (relocations) */
-			if (get_sec(elf, shdr.sh_info, &ehdr, &shname_prog,
-				    &shdr_prog, &data_prog))
-				continue;
-
-			if (shdr_prog.sh_type != SHT_PROGBITS ||
-			    !(shdr_prog.sh_flags & SHF_EXECINSTR))
-				continue;
-
-			insns = (struct bpf_insn *) data_prog->d_buf;
-			processed_sec[i] = true; /* relo section */
-
-			if (parse_relo_and_apply(data, symbols, &shdr, insns,
-						 map_data, nr_maps))
-				continue;
-		}
-	}
-
-	/* load programs */
-	for (i = 1; i < ehdr.e_shnum; i++) {
-
-		if (processed_sec[i])
-			continue;
-
-		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
-			continue;
-
-		if (memcmp(shname, "kprobe/", 7) == 0 ||
-		    memcmp(shname, "kretprobe/", 10) == 0 ||
-		    memcmp(shname, "tracepoint/", 11) == 0 ||
-		    memcmp(shname, "raw_tracepoint/", 15) == 0 ||
-		    memcmp(shname, "xdp", 3) == 0 ||
-		    memcmp(shname, "perf_event", 10) == 0 ||
-		    memcmp(shname, "socket", 6) == 0 ||
-		    memcmp(shname, "cgroup/", 7) == 0 ||
-		    memcmp(shname, "sockops", 7) == 0 ||
-		    memcmp(shname, "sk_skb", 6) == 0 ||
-		    memcmp(shname, "sk_msg", 6) == 0) {
-			ret = load_and_attach(shname, data->d_buf,
-					      data->d_size);
-			if (ret != 0)
-				goto done;
-		}
-	}
-
-done:
-	close(fd);
-	return ret;
-}
-
-int load_bpf_file(char *path)
-{
-	return do_load_bpf_file(path, NULL);
-}
-
-int load_bpf_file_fixup_map(const char *path, fixup_map_cb fixup_map)
-{
-	return do_load_bpf_file(path, fixup_map);
-}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
deleted file mode 100644
index 4fcd258c616f..000000000000
--- a/samples/bpf/bpf_load.h
+++ /dev/null
@@ -1,57 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __BPF_LOAD_H
-#define __BPF_LOAD_H
-
-#include <bpf/bpf.h>
-
-#define MAX_MAPS 32
-#define MAX_PROGS 32
-
-struct bpf_load_map_def {
-	unsigned int type;
-	unsigned int key_size;
-	unsigned int value_size;
-	unsigned int max_entries;
-	unsigned int map_flags;
-	unsigned int inner_map_idx;
-	unsigned int numa_node;
-};
-
-struct bpf_map_data {
-	int fd;
-	char *name;
-	size_t elf_offset;
-	struct bpf_load_map_def def;
-};
-
-typedef void (*fixup_map_cb)(struct bpf_map_data *map, int idx);
-
-extern int prog_fd[MAX_PROGS];
-extern int event_fd[MAX_PROGS];
-extern char bpf_log_buf[BPF_LOG_BUF_SIZE];
-extern int prog_cnt;
-
-/* There is a one-to-one mapping between map_fd[] and map_data[].
- * The map_data[] just contains more rich info on the given map.
- */
-extern int map_fd[MAX_MAPS];
-extern struct bpf_map_data map_data[MAX_MAPS];
-extern int map_data_count;
-
-/* parses elf file compiled by llvm .c->.o
- * . parses 'maps' section and creates maps via BPF syscall
- * . parses 'license' section and passes it to syscall
- * . parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns by
- *   storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
- * . loads eBPF programs via BPF syscall
- *
- * One ELF file can contain multiple BPF programs which will be loaded
- * and their FDs stored stored in prog_fd array
- *
- * returns zero on success
- */
-int load_bpf_file(char *path);
-int load_bpf_file_fixup_map(const char *path, fixup_map_cb fixup_map);
-
-int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
-#endif
diff --git a/samples/bpf/xdp2skb_meta_kern.c b/samples/bpf/xdp2skb_meta_kern.c
index 9b783316e860..d5631014a176 100644
--- a/samples/bpf/xdp2skb_meta_kern.c
+++ b/samples/bpf/xdp2skb_meta_kern.c
@@ -6,7 +6,7 @@
  * This uses the XDP data_meta infrastructure, and is a cooperation
  * between two bpf-programs (1) XDP and (2) clsact at TC-ingress hook.
  *
- * Notice: This example does not use the BPF C-loader (bpf_load.c),
+ * Notice: This example does not use the BPF C-loader,
  * but instead rely on the iproute2 TC tool for loading BPF-objects.
  */
 #include <uapi/linux/bpf.h>
-- 
2.25.1


^ permalink raw reply related

* [PATCH bpf-next v3 1/7] samples: bpf: refactor hbm program with libbpf
From: Daniel T. Lee @ 2020-11-24  9:03 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, brakmo,
	Jesper Dangaard Brouer, Andrii Nakryiko, Lorenzo Bianconi,
	David Ahern, Yonghong Song, Toke Høiland-Jørgensen,
	Ira Weiny, Thomas Graf, Jakub Kicinski, Martin KaFai Lau,
	John Fastabend
  Cc: bpf, netdev, Xdp
In-Reply-To: <20201124090310.24374-1-danieltimlee@gmail.com>

This commit refactors the existing cgroup programs with libbpf
bpf loader. Since bpf_program__attach doesn't support cgroup program
attachment, this explicitly attaches cgroup bpf program with
bpf_program__attach_cgroup(bpf_prog, cg1).

Also, to change attach_type of bpf program, this uses libbpf's
bpf_program__set_expected_attach_type helper to switch EGRESS to
INGRESS. To keep bpf program attached to the cgroup hierarchy even
after the exit, this commit uses the BPF_LINK_PINNING to pin the link
attachment even after it is closed.

Besides, this program was broken due to the typo of BPF MAP definition.
But this commit solves the problem by fixing this from 'queue_stats' map
struct hvm_queue_stats -> hbm_queue_stats.

Fixes: 36b5d471135c ("selftests/bpf: samples/bpf: Split off legacy stuff from bpf_helpers.h")
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
Changes in v2:
 - restore read_trace_pipe2
 - remove unnecessary return code and cgroup fd compare
 - add static at global variable and remove unused variable
 - change cgroup path with unified controller (/unified/)
 - add link pinning to prevent cleaning up on process exit

Changes in v3:
 - cleanup bpf_link, bpf_object and cgroup fd both on success and error
 - remove link NULL cleanup since __destroy() can handle
 - fix cgroup test on cgroup fd cleanup
 
 samples/bpf/.gitignore     |   3 +
 samples/bpf/Makefile       |   2 +-
 samples/bpf/do_hbm_test.sh |  32 +++++------
 samples/bpf/hbm.c          | 111 ++++++++++++++++++++-----------------
 samples/bpf/hbm_kern.h     |   2 +-
 5 files changed, 78 insertions(+), 72 deletions(-)

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index b2f29bc8dc43..0b9548ea8477 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -52,3 +52,6 @@ xdp_tx_iptunnel
 xdpsock
 xsk_fwd
 testfile.img
+hbm_out.log
+iperf.*
+*.out
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index aeebf5d12f32..7c61118525f7 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -110,7 +110,7 @@ xdp_fwd-objs := xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
 ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
-hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
+hbm-objs := hbm.o $(CGROUP_HELPERS)
 
 # Tell kbuild to always build the programs
 always-y := $(tprogs-y)
diff --git a/samples/bpf/do_hbm_test.sh b/samples/bpf/do_hbm_test.sh
index ffe4c0607341..21790ea5c460 100755
--- a/samples/bpf/do_hbm_test.sh
+++ b/samples/bpf/do_hbm_test.sh
@@ -91,6 +91,16 @@ qdisc=""
 flags=""
 do_stats=0
 
+BPFFS=/sys/fs/bpf
+function config_bpffs () {
+	if mount | grep $BPFFS > /dev/null; then
+		echo "bpffs already mounted"
+	else
+		echo "bpffs not mounted. Mounting..."
+		mount -t bpf none $BPFFS
+	fi
+}
+
 function start_hbm () {
   rm -f hbm.out
   echo "./hbm $dir -n $id -r $rate -t $dur $flags $dbg $prog" > hbm.out
@@ -192,6 +202,7 @@ processArgs () {
 }
 
 processArgs
+config_bpffs
 
 if [ $debug_flag -eq 1 ] ; then
   rm -f hbm_out.log
@@ -201,7 +212,7 @@ hbm_pid=$(start_hbm)
 usleep 100000
 
 host=`hostname`
-cg_base_dir=/sys/fs/cgroup
+cg_base_dir=/sys/fs/cgroup/unified
 cg_dir="$cg_base_dir/cgroup-test-work-dir/hbm$id"
 
 echo $$ >> $cg_dir/cgroup.procs
@@ -411,23 +422,8 @@ fi
 
 sleep 1
 
-# Detach any BPF programs that may have lingered
-ttx=`bpftool cgroup tree | grep hbm`
-v=2
-for x in $ttx ; do
-    if [ "${x:0:36}" == "/sys/fs/cgroup/cgroup-test-work-dir/" ] ; then
-	cg=$x ; v=0
-    else
-	if [ $v -eq 0 ] ; then
-	    id=$x ; v=1
-	else
-	    if [ $v -eq 1 ] ; then
-		type=$x ; bpftool cgroup detach $cg $type id $id
-		v=0
-	    fi
-	fi
-    fi
-done
+# Detach any pinned BPF programs that may have lingered
+rm -rf $BPFFS/hbm*
 
 if [ $use_netperf -ne 0 ] ; then
   if [ "$server" == "" ] ; then
diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
index 400e741a56eb..b0c18efe7928 100644
--- a/samples/bpf/hbm.c
+++ b/samples/bpf/hbm.c
@@ -46,7 +46,6 @@
 #include <bpf/bpf.h>
 #include <getopt.h>
 
-#include "bpf_load.h"
 #include "bpf_rlimit.h"
 #include "cgroup_helpers.h"
 #include "hbm.h"
@@ -70,9 +69,9 @@ static void do_error(char *msg, bool errno_flag);
 
 #define DEBUGFS "/sys/kernel/debug/tracing/"
 
-struct bpf_object *obj;
-int bpfprog_fd;
-int cgroup_storage_fd;
+static struct bpf_program *bpf_prog;
+static struct bpf_object *obj;
+static int queue_stats_fd;
 
 static void read_trace_pipe2(void)
 {
@@ -121,56 +120,50 @@ static void do_error(char *msg, bool errno_flag)
 
 static int prog_load(char *prog)
 {
-	struct bpf_prog_load_attr prog_load_attr = {
-		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
-		.file = prog,
-		.expected_attach_type = BPF_CGROUP_INET_EGRESS,
-	};
-	int map_fd;
-	struct bpf_map *map;
-
-	int ret = 0;
-
-	if (access(prog, O_RDONLY) < 0) {
-		printf("Error accessing file %s: %s\n", prog, strerror(errno));
+	obj = bpf_object__open_file(prog, NULL);
+	if (libbpf_get_error(obj)) {
+		printf("ERROR: opening BPF object file failed\n");
 		return 1;
 	}
-	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &bpfprog_fd))
-		ret = 1;
-	if (!ret) {
-		map = bpf_object__find_map_by_name(obj, "queue_stats");
-		map_fd = bpf_map__fd(map);
-		if (map_fd < 0) {
-			printf("Map not found: %s\n", strerror(map_fd));
-			ret = 1;
-		}
+
+	/* load BPF program */
+	if (bpf_object__load(obj)) {
+		printf("ERROR: loading BPF object file failed\n");
+		goto err;
 	}
 
-	if (ret) {
-		printf("ERROR: bpf_prog_load_xattr failed for: %s\n", prog);
-		printf("  Output from verifier:\n%s\n------\n", bpf_log_buf);
-		ret = -1;
-	} else {
-		ret = map_fd;
+	bpf_prog = bpf_object__find_program_by_title(obj, "cgroup_skb/egress");
+	if (!bpf_prog) {
+		printf("ERROR: finding a prog in obj file failed\n");
+		goto err;
+	}
+
+	queue_stats_fd = bpf_object__find_map_fd_by_name(obj, "queue_stats");
+	if (queue_stats_fd < 0) {
+		printf("ERROR: finding a map in obj file failed\n");
+		goto err;
 	}
 
-	return ret;
+	return 0;
+
+err:
+	bpf_object__close(obj);
+	return 1;
 }
 
 static int run_bpf_prog(char *prog, int cg_id)
 {
-	int map_fd;
-	int rc = 0;
+	struct hbm_queue_stats qstats = {0};
+	char cg_dir[100], cg_pin_path[100];
+	struct bpf_link *link = NULL;
 	int key = 0;
 	int cg1 = 0;
-	int type = BPF_CGROUP_INET_EGRESS;
-	char cg_dir[100];
-	struct hbm_queue_stats qstats = {0};
+	int rc = 0;
 
 	sprintf(cg_dir, "/hbm%d", cg_id);
-	map_fd = prog_load(prog);
-	if (map_fd  == -1)
-		return 1;
+	rc = prog_load(prog);
+	if (rc != 0)
+		return rc;
 
 	if (setup_cgroup_environment()) {
 		printf("ERROR: setting cgroup environment\n");
@@ -190,16 +183,24 @@ static int run_bpf_prog(char *prog, int cg_id)
 	qstats.stats = stats_flag ? 1 : 0;
 	qstats.loopback = loopback_flag ? 1 : 0;
 	qstats.no_cn = no_cn_flag ? 1 : 0;
-	if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY)) {
+	if (bpf_map_update_elem(queue_stats_fd, &key, &qstats, BPF_ANY)) {
 		printf("ERROR: Could not update map element\n");
 		goto err;
 	}
 
 	if (!outFlag)
-		type = BPF_CGROUP_INET_INGRESS;
-	if (bpf_prog_attach(bpfprog_fd, cg1, type, 0)) {
-		printf("ERROR: bpf_prog_attach fails!\n");
-		log_err("Attaching prog");
+		bpf_program__set_expected_attach_type(bpf_prog, BPF_CGROUP_INET_INGRESS);
+
+	link = bpf_program__attach_cgroup(bpf_prog, cg1);
+	if (libbpf_get_error(link)) {
+		fprintf(stderr, "ERROR: bpf_program__attach_cgroup failed\n");
+		goto err;
+	}
+
+	sprintf(cg_pin_path, "/sys/fs/bpf/hbm%d", cg_id);
+	rc = bpf_link__pin(link, cg_pin_path);
+	if (rc < 0) {
+		printf("ERROR: bpf_link__pin failed: %d\n", rc);
 		goto err;
 	}
 
@@ -213,7 +214,7 @@ static int run_bpf_prog(char *prog, int cg_id)
 #define DELTA_RATE_CHECK 10000		/* in us */
 #define RATE_THRESHOLD 9500000000	/* 9.5 Gbps */
 
-		bpf_map_lookup_elem(map_fd, &key, &qstats);
+		bpf_map_lookup_elem(queue_stats_fd, &key, &qstats);
 		if (gettimeofday(&t0, NULL) < 0)
 			do_error("gettimeofday failed", true);
 		t_last = t0;
@@ -242,7 +243,7 @@ static int run_bpf_prog(char *prog, int cg_id)
 			fclose(fin);
 			printf("  new_eth_tx_bytes:%llu\n",
 			       new_eth_tx_bytes);
-			bpf_map_lookup_elem(map_fd, &key, &qstats);
+			bpf_map_lookup_elem(queue_stats_fd, &key, &qstats);
 			new_cg_tx_bytes = qstats.bytes_total;
 			delta_bytes = new_eth_tx_bytes - last_eth_tx_bytes;
 			last_eth_tx_bytes = new_eth_tx_bytes;
@@ -289,14 +290,14 @@ static int run_bpf_prog(char *prog, int cg_id)
 					rate = minRate;
 				qstats.rate = rate;
 			}
-			if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY))
+			if (bpf_map_update_elem(queue_stats_fd, &key, &qstats, BPF_ANY))
 				do_error("update map element fails", false);
 		}
 	} else {
 		sleep(dur);
 	}
 	// Get stats!
-	if (stats_flag && bpf_map_lookup_elem(map_fd, &key, &qstats)) {
+	if (stats_flag && bpf_map_lookup_elem(queue_stats_fd, &key, &qstats)) {
 		char fname[100];
 		FILE *fout;
 
@@ -394,14 +395,20 @@ static int run_bpf_prog(char *prog, int cg_id)
 
 	if (debugFlag)
 		read_trace_pipe2();
-	return rc;
+	goto cleanup;
+
 err:
 	rc = 1;
 
-	if (cg1)
+cleanup:
+	bpf_link__destroy(link);
+	bpf_object__close(obj);
+
+	if (cg1 != -1)
 		close(cg1);
-	cleanup_cgroup_environment();
 
+	if (rc != 0)
+		cleanup_cgroup_environment();
 	return rc;
 }
 
diff --git a/samples/bpf/hbm_kern.h b/samples/bpf/hbm_kern.h
index e00f26f6afba..722b3fadb467 100644
--- a/samples/bpf/hbm_kern.h
+++ b/samples/bpf/hbm_kern.h
@@ -69,7 +69,7 @@ struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY);
 	__uint(max_entries, 1);
 	__type(key, u32);
-	__type(value, struct hvm_queue_stats);
+	__type(value, struct hbm_queue_stats);
 } queue_stats SEC(".maps");
 
 struct hbm_pkt_info {
-- 
2.25.1


^ permalink raw reply related

* [PATCH bpf-next v3 6/7] samples: bpf: fix lwt_len_hist reusing previous BPF map
From: Daniel T. Lee @ 2020-11-24  9:03 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, brakmo,
	Jesper Dangaard Brouer, Andrii Nakryiko, Lorenzo Bianconi,
	David Ahern, Yonghong Song, Toke Høiland-Jørgensen,
	Ira Weiny, Thomas Graf, Jakub Kicinski, Martin KaFai Lau,
	John Fastabend
  Cc: bpf, netdev, Xdp
In-Reply-To: <20201124090310.24374-1-danieltimlee@gmail.com>

Currently, lwt_len_hist's map lwt_len_hist_map is uses pinning, and the
map isn't cleared on test end. This leds to reuse of that map for
each test, which prevents the results of the test from being accurate.

This commit fixes the problem by removing of pinned map from bpffs.
Also, this commit add the executable permission to shell script
files.

Fixes: f74599f7c5309 ("bpf: Add tests and samples for LWT-BPF")
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
 samples/bpf/lwt_len_hist.sh | 2 ++
 samples/bpf/test_lwt_bpf.sh | 0
 2 files changed, 2 insertions(+)
 mode change 100644 => 100755 samples/bpf/lwt_len_hist.sh
 mode change 100644 => 100755 samples/bpf/test_lwt_bpf.sh

diff --git a/samples/bpf/lwt_len_hist.sh b/samples/bpf/lwt_len_hist.sh
old mode 100644
new mode 100755
index 090b96eaf7f7..0eda9754f50b
--- a/samples/bpf/lwt_len_hist.sh
+++ b/samples/bpf/lwt_len_hist.sh
@@ -8,6 +8,8 @@ VETH1=tst_lwt1b
 TRACE_ROOT=/sys/kernel/debug/tracing
 
 function cleanup {
+	# To reset saved histogram, remove pinned map
+	rm /sys/fs/bpf/tc/globals/lwt_len_hist_map
 	ip route del 192.168.253.2/32 dev $VETH0 2> /dev/null
 	ip link del $VETH0 2> /dev/null
 	ip link del $VETH1 2> /dev/null
diff --git a/samples/bpf/test_lwt_bpf.sh b/samples/bpf/test_lwt_bpf.sh
old mode 100644
new mode 100755
-- 
2.25.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox