Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next v1] net: dsa: mv88e6xxx: Trap PTP traffic
From: Kurt Kanzenbach @ 2021-12-09 17:33 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot
  Cc: Florian Fainelli, Vladimir Oltean, David S. Miller,
	Jakub Kicinski, netdev, Richard Cochran, Kurt Kanzenbach

A time aware switch should trap PTP traffic to the management CPU. User space
daemons such as ptp4l will process these messages to implement Boundary (or
Transparent) Clocks.

At the moment the mv88e6xxx driver for mv88e6341 doesn't trap these messages
which leads to confusion when multiple end devices are connected to the
switch. Therefore, setup PTP traps. Leverage the already implemented policy
mechanism based on destination addresses. Configure the traps only if
timestamping is enabled so that non time aware use case is still functioning.

Introduce an additional PTP operation in case other devices need special
handling in regards to trapping as well.

Tested on Marvell Topaz (mv88e6341) switch with multiple end devices connected
like this:

|# DSA setup
|$ ip link set eth0 up
|$ ip link set lan0 up
|$ ip link set lan1 up
|$ ip link set lan2 up
|$ ip link add name br0 type bridge
|$ ip link set dev lan0 master br0
|$ ip link set dev lan1 master br0
|$ ip link set dev lan2 master br0
|$ ip link set lan0 up
|$ ip link set lan1 up
|$ ip link set lan2 up
|$ ip link set br0 up
|$ dhclient br0
|# Configure bridge routing
|$ ebtables --table broute --append BROUTING --protocol 0x88F7 --jump DROP
|# Start linuxptp
|$ ptp4l -H -2 -i lan0 -i lan1 -i lan2 --tx_timestamp_timeout=40 -m

Verified added policies with mv88e6xxx_dump.

Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de>
---
 drivers/net/dsa/mv88e6xxx/chip.c     | 12 +++---
 drivers/net/dsa/mv88e6xxx/chip.h     |  5 +++
 drivers/net/dsa/mv88e6xxx/hwtstamp.c |  7 ++++
 drivers/net/dsa/mv88e6xxx/ptp.c      | 59 ++++++++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/ptp.h      |  2 +
 5 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 7fadbf987b23..ab50ebd85f1f 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1816,8 +1816,8 @@ static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
 	return mv88e6xxx_g1_atu_loadpurge(chip, fid, &entry);
 }
 
-static int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
-				  const struct mv88e6xxx_policy *policy)
+int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
+			   const struct mv88e6xxx_policy *policy)
 {
 	enum mv88e6xxx_policy_mapping mapping = policy->mapping;
 	enum mv88e6xxx_policy_action action = policy->action;
@@ -1835,10 +1835,12 @@ static int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
 	case MV88E6XXX_POLICY_MAPPING_SA:
 		if (action == MV88E6XXX_POLICY_ACTION_NORMAL)
 			state = 0; /* Dissociate the port and address */
-		else if (action == MV88E6XXX_POLICY_ACTION_DISCARD &&
+		else if ((action == MV88E6XXX_POLICY_ACTION_DISCARD ||
+			  action == MV88E6XXX_POLICY_ACTION_TRAP) &&
 			 is_multicast_ether_addr(addr))
 			state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC_POLICY;
-		else if (action == MV88E6XXX_POLICY_ACTION_DISCARD &&
+		else if ((action == MV88E6XXX_POLICY_ACTION_DISCARD ||
+			  action == MV88E6XXX_POLICY_ACTION_TRAP) &&
 			 is_unicast_ether_addr(addr))
 			state = MV88E6XXX_G1_ATU_DATA_STATE_UC_STATIC_POLICY;
 		else
@@ -4589,7 +4591,7 @@ static const struct mv88e6xxx_ops mv88e6341_ops = {
 	.serdes_irq_status = mv88e6390_serdes_irq_status,
 	.gpio_ops = &mv88e6352_gpio_ops,
 	.avb_ops = &mv88e6390_avb_ops,
-	.ptp_ops = &mv88e6352_ptp_ops,
+	.ptp_ops = &mv88e6341_ptp_ops,
 	.serdes_get_sset_count = mv88e6390_serdes_get_sset_count,
 	.serdes_get_strings = mv88e6390_serdes_get_strings,
 	.serdes_get_stats = mv88e6390_serdes_get_stats,
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 8271b8aa7b71..795ae5a56834 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -673,6 +673,8 @@ struct mv88e6xxx_ptp_ops {
 	int (*port_disable)(struct mv88e6xxx_chip *chip, int port);
 	int (*global_enable)(struct mv88e6xxx_chip *chip);
 	int (*global_disable)(struct mv88e6xxx_chip *chip);
+	int (*setup_ptp_traps)(struct mv88e6xxx_chip *chip, int port,
+			       bool enable);
 	int n_ext_ts;
 	int arr0_sts_reg;
 	int arr1_sts_reg;
@@ -760,4 +762,7 @@ static inline void mv88e6xxx_reg_unlock(struct mv88e6xxx_chip *chip)
 
 int mv88e6xxx_fid_map(struct mv88e6xxx_chip *chip, unsigned long *bitmap);
 
+int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
+			   const struct mv88e6xxx_policy *policy);
+
 #endif /* _MV88E6XXX_CHIP_H */
diff --git a/drivers/net/dsa/mv88e6xxx/hwtstamp.c b/drivers/net/dsa/mv88e6xxx/hwtstamp.c
index 8f74ffc7a279..617aeb6cbaac 100644
--- a/drivers/net/dsa/mv88e6xxx/hwtstamp.c
+++ b/drivers/net/dsa/mv88e6xxx/hwtstamp.c
@@ -94,6 +94,7 @@ static int mv88e6xxx_set_hwtstamp_config(struct mv88e6xxx_chip *chip, int port,
 	const struct mv88e6xxx_ptp_ops *ptp_ops = chip->info->ops->ptp_ops;
 	struct mv88e6xxx_port_hwtstamp *ps = &chip->port_hwtstamp[port];
 	bool tstamp_enable = false;
+	int ret;
 
 	/* Prevent the TX/RX paths from trying to interact with the
 	 * timestamp hardware while we reconfigure it.
@@ -161,6 +162,12 @@ static int mv88e6xxx_set_hwtstamp_config(struct mv88e6xxx_chip *chip, int port,
 		if (chip->enable_count == 0 && ptp_ops->global_disable)
 			ptp_ops->global_disable(chip);
 	}
+
+	if (ptp_ops->setup_ptp_traps) {
+		ret = ptp_ops->setup_ptp_traps(chip, port, tstamp_enable);
+		if (tstamp_enable && ret)
+			dev_warn(chip->dev, "Failed to setup PTP traps. PTP might not work as desired!\n");
+	}
 	mv88e6xxx_reg_unlock(chip);
 
 	/* Once hardware has been configured, enable timestamp checks
diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
index d838c174dc0d..8d6ff03d37c8 100644
--- a/drivers/net/dsa/mv88e6xxx/ptp.c
+++ b/drivers/net/dsa/mv88e6xxx/ptp.c
@@ -345,6 +345,37 @@ static int mv88e6352_ptp_verify(struct ptp_clock_info *ptp, unsigned int pin,
 	return 0;
 }
 
+static int mv88e6341_setup_ptp_traps(struct mv88e6xxx_chip *chip, int port,
+				     bool enable)
+{
+	static const u8 ptp_destinations[][ETH_ALEN] = {
+		{ 0x01, 0x1b, 0x19, 0x00, 0x00, 0x00 }, /* L2 PTP */
+		{ 0x01, 0x80, 0xc2, 0x00, 0x00, 0x0e }, /* L2 P2P */
+		{ 0x01, 0x00, 0x5e, 0x00, 0x01, 0x81 }, /* IPv4 PTP */
+		{ 0x01, 0x00, 0x5e, 0x00, 0x00, 0x6b }, /* IPv4 P2P */
+		{ 0x33, 0x33, 0x00, 0x00, 0x01, 0x81 }, /* IPv6 PTP */
+		{ 0x33, 0x33, 0x00, 0x00, 0x00, 0x6b }, /* IPv6 P2P */
+	};
+	int ret, i;
+
+	for (i = 0; i < ARRAY_SIZE(ptp_destinations); ++i) {
+		struct mv88e6xxx_policy policy = { };
+
+		policy.mapping	= MV88E6XXX_POLICY_MAPPING_DA;
+		policy.action	= enable ? MV88E6XXX_POLICY_ACTION_TRAP :
+			MV88E6XXX_POLICY_ACTION_NORMAL;
+		policy.port	= port;
+		policy.vid	= 0;
+		ether_addr_copy(policy.addr, ptp_destinations[i]);
+
+		ret = mv88e6xxx_policy_apply(chip, port, &policy);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 const struct mv88e6xxx_ptp_ops mv88e6165_ptp_ops = {
 	.clock_read = mv88e6165_ptp_clock_read,
 	.global_enable = mv88e6165_global_enable,
@@ -419,6 +450,34 @@ const struct mv88e6xxx_ptp_ops mv88e6352_ptp_ops = {
 	.cc_mult_dem = MV88E6XXX_CC_MULT_DEM,
 };
 
+const struct mv88e6xxx_ptp_ops mv88e6341_ptp_ops = {
+	.clock_read = mv88e6352_ptp_clock_read,
+	.ptp_enable = mv88e6352_ptp_enable,
+	.ptp_verify = mv88e6352_ptp_verify,
+	.event_work = mv88e6352_tai_event_work,
+	.port_enable = mv88e6352_hwtstamp_port_enable,
+	.port_disable = mv88e6352_hwtstamp_port_disable,
+	.setup_ptp_traps = mv88e6341_setup_ptp_traps,
+	.n_ext_ts = 1,
+	.arr0_sts_reg = MV88E6XXX_PORT_PTP_ARR0_STS,
+	.arr1_sts_reg = MV88E6XXX_PORT_PTP_ARR1_STS,
+	.dep_sts_reg = MV88E6XXX_PORT_PTP_DEP_STS,
+	.rx_filters = (1 << HWTSTAMP_FILTER_NONE) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_EVENT) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_SYNC) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_DELAY_REQ),
+	.cc_shift = MV88E6XXX_CC_SHIFT,
+	.cc_mult = MV88E6XXX_CC_MULT,
+	.cc_mult_num = MV88E6XXX_CC_MULT_NUM,
+	.cc_mult_dem = MV88E6XXX_CC_MULT_DEM,
+};
+
 static u64 mv88e6xxx_ptp_clock_read(const struct cyclecounter *cc)
 {
 	struct mv88e6xxx_chip *chip = cc_to_chip(cc);
diff --git a/drivers/net/dsa/mv88e6xxx/ptp.h b/drivers/net/dsa/mv88e6xxx/ptp.h
index 269d5d16a466..badcb72d10a6 100644
--- a/drivers/net/dsa/mv88e6xxx/ptp.h
+++ b/drivers/net/dsa/mv88e6xxx/ptp.h
@@ -151,6 +151,7 @@ void mv88e6xxx_ptp_free(struct mv88e6xxx_chip *chip);
 extern const struct mv88e6xxx_ptp_ops mv88e6165_ptp_ops;
 extern const struct mv88e6xxx_ptp_ops mv88e6250_ptp_ops;
 extern const struct mv88e6xxx_ptp_ops mv88e6352_ptp_ops;
+extern const struct mv88e6xxx_ptp_ops mv88e6341_ptp_ops;
 
 #else /* !CONFIG_NET_DSA_MV88E6XXX_PTP */
 
@@ -171,6 +172,7 @@ static inline void mv88e6xxx_ptp_free(struct mv88e6xxx_chip *chip)
 static const struct mv88e6xxx_ptp_ops mv88e6165_ptp_ops = {};
 static const struct mv88e6xxx_ptp_ops mv88e6250_ptp_ops = {};
 static const struct mv88e6xxx_ptp_ops mv88e6352_ptp_ops = {};
+static const struct mv88e6xxx_ptp_ops mv88e6341_ptp_ops = {};
 
 #endif /* CONFIG_NET_DSA_MV88E6XXX_PTP */
 
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH v4 net-next 1/9] i40e: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb
From: Alexander Lobakin @ 2021-12-09 17:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Lobakin, intel-wired-lan, brouer, Jesse Brandeburg,
	Tony Nguyen, David S. Miller, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Maciej Fijalkowski, Michal Swiatkowski,
	Jithu Joseph, Martin KaFai Lau, Song Liu, KP Singh, Yonghong Song,
	Andrii Nakryiko, netdev, linux-kernel, bpf
In-Reply-To: <da317f39-8679-96f7-ec6f-309216b02f33@redhat.com>

From: Jesper Dangaard Brouer <jbrouer@redhat.com>
Date: Thu, 9 Dec 2021 09:19:46 +0100

> On 08/12/2021 15.06, Alexander Lobakin wrote:
> > {__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD
> > + NET_IP_ALIGN for any skb.
> > OTOH, i40e_construct_skb_zc() currently allocates and reserves
> > additional `xdp->data - xdp->data_hard_start`, which is
> > XDP_PACKET_HEADROOM for XSK frames.
> > There's no need for that at all as the frame is post-XDP and will
> > go only to the networking stack core.
> 
> I disagree with this assumption, that headroom is not needed by netstack.
> Why "no need for that at all" for netstack?

napi_alloc_skb() in our particular case will reserve 64 bytes, it is
sufficient for {TCP,UDP,SCTP,...}/IPv{4,6} etc.

> 
> Having headroom is important for netstack in general.  When packet will 
> grow we avoid realloc of SKB.  Use-case could also be cpumap or veth 
> redirect, or XDP-generic, that expect this headroom.

Well, those are not common cases at all.
Allocating 256 bytes more for some hypothetical usecases (and having
320 in total) is more expensive than expanding headroom in-place.
I don't know any other drivers or ifaces which reserve
XDP_PACKET_HEADROOM just for the case of using both driver-side
and generic XDP at the same time. To be more precise, I can't
remember any driver which would check whether generic XDP is enabled
for its netdev(s).

As a second option, I was trying to get exactly XDP_PACKET_HEADROOM
of headroom, but this involves either __alloc_skb() which is slower
than napi_alloc_skb(), or

	skb = napi_alloc_skb(napi, xdp->data_end -
				   xdp->data_hard_start -
				   NET_SKB_PAD);
	skb_reserve(skb, xdp->data_meta - xdp->data_hard_start -
			 NET_SKB_PAD);

Doesn't look good for me.

We could probably introduce a version of napi_alloc_skb() which
wouldn't reserve any headroom for you to have more control over it,
but that's more global material than these local fixes I'd say.

> 
> 
> > Pass the size of the actual data only to __napi_alloc_skb() and
> > don't reserve anything. This will give enough headroom for stack
> > processing.
> > 
> > Fixes: 0a714186d3c0 ("i40e: add AF_XDP zero-copy Rx support")
> > Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
> > ---
> >   drivers/net/ethernet/intel/i40e/i40e_xsk.c | 4 +---
> >   1 file changed, 1 insertion(+), 3 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index f08d19b8c554..9564906b7da8 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -245,13 +245,11 @@ static struct sk_buff *i40e_construct_skb_zc(struct i40e_ring *rx_ring,
> >   	struct sk_buff *skb;
> >   
> >   	/* allocate a skb to store the frags */
> > -	skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
> > -			       xdp->data_end - xdp->data_hard_start,
> > +	skb = __napi_alloc_skb(&rx_ring->q_vector->napi, datasize,
> >   			       GFP_ATOMIC | __GFP_NOWARN);
> >   	if (unlikely(!skb))
> >   		goto out;
> >   
> > -	skb_reserve(skb, xdp->data - xdp->data_hard_start);
> >   	memcpy(__skb_put(skb, datasize), xdp->data, datasize);
> >   	if (metasize)
> >   		skb_metadata_set(skb, metasize);

Thanks,
Al

^ permalink raw reply

* Re: [RFC PATCH net-next 0/7] DSA master state tracking
From: Vladimir Oltean @ 2021-12-09 17:33 UTC (permalink / raw)
  To: Ansuel Smith
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <61b21641.1c69fb81.d27e9.02a0@mx.google.com>

On Thu, Dec 09, 2021 at 03:44:14PM +0100, Ansuel Smith wrote:
> On Thu, Dec 09, 2021 at 02:28:30PM +0000, Vladimir Oltean wrote:
> > On Thu, Dec 09, 2021 at 04:05:59AM +0100, Ansuel Smith wrote:
> > > On Thu, Dec 09, 2021 at 12:32:23AM +0200, Vladimir Oltean wrote:
> > > > This patch set is provided solely for review purposes (therefore not to
> > > > be applied anywhere) and for Ansuel to test whether they resolve the
> > > > slowdown reported here:
> > > > https://patchwork.kernel.org/project/netdevbpf/cover/20211207145942.7444-1-ansuelsmth@gmail.com/
> > > > 
> > > > It does conflict with net-next due to other patches that are in my tree,
> > > > and which were also posted here and would need to be picked ("Rework DSA
> > > > bridge TX forwarding offload API"):
> > > > https://patchwork.kernel.org/project/netdevbpf/cover/20211206165758.1553882-1-vladimir.oltean@nxp.com/
> > > > 
> > > > Additionally, for Ansuel's work there is also a logical dependency with
> > > > this series ("Replace DSA dp->priv with tagger-owned storage"):
> > > > https://patchwork.kernel.org/project/netdevbpf/cover/20211208200504.3136642-1-vladimir.oltean@nxp.com/
> > > > 
> > > > To get both dependency series, the following commands should be sufficient:
> > > > git b4 20211206165758.1553882-1-vladimir.oltean@nxp.com
> > > > git b4 20211208200504.3136642-1-vladimir.oltean@nxp.com
> > > > 
> > > > where "git b4" is an alias in ~/.gitconfig:
> > > > [b4]
> > > > 	midmask = https://lore.kernel.org/r/%25s
> > > > [alias]
> > > > 	b4 = "!f() { b4 am -t -o - $@ | git am -3; }; f"
> > > > 
> > > > The patches posted here are mainly to offer a consistent
> > > > "master_up"/"master_going_down" chain of events to switches, without
> > > > duplicates, and always starting with "master_up" and ending with
> > > > "master_going_down". This way, drivers should know when they can perform
> > > > Ethernet-based register access.
> > > > 
> > > > Vladimir Oltean (7):
> > > >   net: dsa: only bring down user ports assigned to a given DSA master
> > > >   net: dsa: refactor the NETDEV_GOING_DOWN master tracking into separate
> > > >     function
> > > >   net: dsa: use dsa_tree_for_each_user_port in
> > > >     dsa_tree_master_going_down()
> > > >   net: dsa: provide switch operations for tracking the master state
> > > >   net: dsa: stop updating master MTU from master.c
> > > >   net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
> > > >   net: dsa: replay master state events in
> > > >     dsa_tree_{setup,teardown}_master
> > > > 
> > > >  include/net/dsa.h  |  8 +++++++
> > > >  net/dsa/dsa2.c     | 52 ++++++++++++++++++++++++++++++++++++++++++++--
> > > >  net/dsa/dsa_priv.h | 11 ++++++++++
> > > >  net/dsa/master.c   | 29 +++-----------------------
> > > >  net/dsa/slave.c    | 32 +++++++++++++++-------------
> > > >  net/dsa/switch.c   | 29 ++++++++++++++++++++++++++
> > > >  6 files changed, 118 insertions(+), 43 deletions(-)
> > > > 
> > > > -- 
> > > > 2.25.1
> > > > 
> > > 
> > > I applied this patch and it does work correctly. Sadly the problem is
> > > not solved and still the packet are not tracked correctly. What I notice
> > > is that everything starts to work as soon as the master is set to
> > > promiiscuous mode. Wonder if we should track that event instead of
> > > simple up?
> > > 
> > > Here is a bootlog [0]. I added some log when the function timeouts and when
> > > master up is actually called.
> > > Current implementation for this is just a bool that is set to true on
> > > master up and false on master going down. (final version should use
> > > locking to check if an Ethernet transation is in progress)
> > > 
> > > [0] https://pastebin.com/7w2kgG7a
> > 
> > This is strange. What MAC DA do the ack packets have? Could you give us
> > a pcap with the request and reply packets (not necessarily now)?
> 
> If you want I can give you a pcap from a router bootup to the setup with
> no ethernet cable attached. I notice the switch sends some packet at the
> bootup for some reason but they are not Ethernet mdio packet or other
> type. It seems they are not even tagged (doesn't have qca tag) as the
> header mode is disabled by default)
> Let me know if you need just a pcap for the Ethernet mdio transaction or
> from a bootup. I assume it would be better from a bootup? (they are not
> tons of packet and the mdio Ethernet ones are easy to notice.)

Anything that contains some request and response packets should do, as
long as they're relatively easy to spot. But as stated, this can wait
for a while, I don't think that promiscuity is the issue, after your
second reply.

> > Can you try to set ".promisc_on_master = true" in qca_netdev_ops?
> 
> I already tried and here [0] is a log. I notice with promisc_on_master
> the "eth0 entered promiscuous mode" is missing. Is that correct?
> Unless I was tired and misread the code, the info should be printed
> anyway. Also looking at the comments for promisc_on_master I don't think
> that should be applied to this tagger.
> 
> [0] https://pastebin.com/MN2ttVpr

It isn't missing, it's right there on line 11.
I think the problem is that we also need to track the operstate of the
master (netif_oper_up via NETDEV_CHANGE) before declaring it as good to go.
You can see that this is exactly the line after which the timeouts disappear:

[    7.146901] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

I didn't really want to go there, because now I'm not sure how to
synthesize the information for the switch drivers to consume it.
Anyway I've prepared a v2 patchset and I'll send it out very soon.

^ permalink raw reply

* Re: [PATCHv2 bpf-next] samples/bpf:remove unneeded variable
From: patchwork-bot+netdevbpf @ 2021-12-09 17:30 UTC (permalink / raw)
  To: CGEL
  Cc: andrii.nakryiko, andrii, ast, bpf, chi.minghao, daniel, davem,
	hawk, john.fastabend, kafai, kpsingh, kuba, linux-kernel, netdev,
	songliubraving, yhs, zealci
In-Reply-To: <20211209080051.421844-1-chi.minghao@zte.com.cn>

Hello:

This patch was applied to bpf/bpf-next.git (master)
by Andrii Nakryiko <andrii@kernel.org>:

On Thu,  9 Dec 2021 08:00:51 +0000 you wrote:
> From: Minghao Chi <chi.minghao@zte.com.cn>
> 
> return value form directly instead of
> taking this in another redundant variable.
> 
> Reported-by: Zeal Robot <zealci@zte.com.cm>
> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
> 
> [...]

Here is the summary with links:
  - [PATCHv2,bpf-next] samples/bpf:remove unneeded variable
    https://git.kernel.org/bpf/bpf-next/c/ac55b3f00c32

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCHv2 bpf-next] samples/bpf:remove unneeded variable
From: Andrii Nakryiko @ 2021-12-09 17:27 UTC (permalink / raw)
  To: cgel.zte
  Cc: Andrii Nakryiko, Alexei Starovoitov, bpf, chiminghao,
	Daniel Borkmann, David S. Miller, Jesper Dangaard Brouer,
	john fastabend, Martin Lau, KP Singh, Jakub Kicinski, open list,
	Networking, Song Liu, Yonghong Song, Zeal Robot
In-Reply-To: <20211209080051.421844-1-chi.minghao@zte.com.cn>

On Thu, Dec 9, 2021 at 12:01 AM <cgel.zte@gmail.com> wrote:
>
> From: Minghao Chi <chi.minghao@zte.com.cn>
>
> return value form directly instead of
> taking this in another redundant variable.
>
> Reported-by: Zeal Robot <zealci@zte.com.cm>
> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
> ---

Applied to bpf-next, thanks.

>  samples/bpf/xdp_redirect_cpu.bpf.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/samples/bpf/xdp_redirect_cpu.bpf.c b/samples/bpf/xdp_redirect_cpu.bpf.c
> index f10fe3cf25f6..25e3a405375f 100644
> --- a/samples/bpf/xdp_redirect_cpu.bpf.c
> +++ b/samples/bpf/xdp_redirect_cpu.bpf.c
> @@ -100,7 +100,6 @@ u16 get_dest_port_ipv4_udp(struct xdp_md *ctx, u64 nh_off)
>         void *data     = (void *)(long)ctx->data;
>         struct iphdr *iph = data + nh_off;
>         struct udphdr *udph;
> -       u16 dport;
>
>         if (iph + 1 > data_end)
>                 return 0;
> @@ -111,8 +110,7 @@ u16 get_dest_port_ipv4_udp(struct xdp_md *ctx, u64 nh_off)
>         if (udph + 1 > data_end)
>                 return 0;
>
> -       dport = bpf_ntohs(udph->dest);
> -       return dport;
> +       return bpf_ntohs(udph->dest);
>  }
>
>  static __always_inline
> --
> 2.25.1
>

^ permalink raw reply

* Re: [PATCH bpf-next] libbpf: Skip the pinning of global data map for old kernels.
From: Andrii Nakryiko @ 2021-12-09 17:26 UTC (permalink / raw)
  To: Shuyi Cheng
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, open list:BPF (Safe dynamic programs and tools),
	open list:BPF (Safe dynamic programs and tools)
In-Reply-To: <9eb3216b-a785-9024-0f1d-e5a14dfb025b@linux.alibaba.com>

On Thu, Dec 9, 2021 at 12:44 AM Shuyi Cheng
<chengshuyi@linux.alibaba.com> wrote:
>
>
> Fix error: "failed to pin map: Bad file descriptor, path:
> /sys/fs/bpf/_rodata_str1_1."
>
> In the old kernel, the global data map will not be created, see [0]. So
> we should skip the pinning of the global data map to avoid
> bpf_object__pin_maps returning error.
>
> [0]: https://lore.kernel.org/bpf/20211123200105.387855-1-andrii@kernel.org
>
> Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>
> ---
>   tools/lib/bpf/libbpf.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 6db0b5e8540e..d96cf49cebab 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -7884,6 +7884,10 @@ int bpf_object__pin_maps(struct bpf_object *obj,
> const char *path)
>                 char *pin_path = NULL;
>                 char buf[PATH_MAX];
>
> +               if (bpf_map__is_internal(map) &&
> +                   !kernel_supports(obj, FEAT_GLOBAL_DATA))


doing the same check in 3 different places sucks. Let's add "bool
skipped" to struct bpf_map, which will be set in one place (at the map
creation time) and then check during relocation and during pinning?

> +                       continue;
> +
>                 if (path) {
>                         int len;
>
> --
> 2.19.1.6.gb485710b

^ permalink raw reply

* [GIT PULL] Networking for 5.16-rc5
From: Jakub Kicinski @ 2021-12-09 17:20 UTC (permalink / raw)
  To: torvalds; +Cc: kuba, davem, netdev, bpf, linux-kernel, netfilter-devel,
	linux-can

Hi Linus!

The following changes since commit a51e3ac43ddbad891c2b1a4f3aa52371d6939570:

  Merge tag 'net-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2021-12-02 11:22:06 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-5.16-rc5

for you to fetch changes up to 04ec4e6250e5f58b525b08f3dca45c7d7427620e:

  net: dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports (2021-12-09 08:48:40 -0800)

----------------------------------------------------------------
Networking fixes for 5.16-rc5, including fixes from bpf, can and netfilter.

Current release - regressions:

 - bpf, sockmap: re-evaluate proto ops when psock is removed from sockmap

Current release - new code bugs:

 - bpf: fix bpf_check_mod_kfunc_call for built-in modules

 - ice: fixes for TC classifier offloads

 - vrf: don't run conntrack on vrf with !dflt qdisc

Previous releases - regressions:

 - bpf: fix the off-by-two error in range markings

 - seg6: fix the iif in the IPv6 socket control block

 - devlink: fix netns refcount leak in devlink_nl_cmd_reload()

 - dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"

 - dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports

Previous releases - always broken:

 - ethtool: do not perform operations on net devices being unregistered

 - udp: use datalen to cap max gso segments

 - ice: fix races in stats collection

 - fec: only clear interrupt of handling queue in fec_enet_rx_queue()

 - m_can: pci: fix incorrect reference clock rate

 - m_can: disable and ignore ELO interrupt

 - mvpp2: fix XDP rx queues registering

Misc:

 - treewide: add missing includes masked by cgroup -> bpf.h dependency

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

----------------------------------------------------------------
Ameer Hamza (2):
      gve: fix for null pointer dereference.
      net: dsa: mv88e6xxx: error handling for serdes_power functions

Andrea Mayer (1):
      seg6: fix the iif in the IPv6 socket control block

Andrii Nakryiko (1):
      Merge branch 'Fixes for kfunc-mod regressions and warnings'

Antoine Tenart (1):
      ethtool: do not perform operations on net devices being unregistered

Björn Töpel (1):
      bpf, x86: Fix "no previous prototype" warning

Brian Silverman (1):
      can: m_can: Disable and ignore ELO interrupt

Dan Carpenter (3):
      net: altera: set a couple error code in probe()
      can: sja1000: fix use after free in ems_pcmcia_add_card()
      net/qla3xxx: fix an error code in ql_adapter_up()

Dave Ertman (1):
      ice: Fix problems with DSCP QoS implementation

Eric Dumazet (7):
      inet: use #ifdef CONFIG_SOCK_RX_QUEUE_MAPPING consistently
      tcp: fix another uninit-value (sk_rx_queue_mapping)
      bonding: make tx_rebalance_counter an atomic
      devlink: fix netns refcount leak in devlink_nl_cmd_reload()
      netfilter: conntrack: annotate data-races around ct->timeout
      net, neigh: clear whole pneigh_entry at alloc time
      net/sched: fq_pie: prevent dismantle issue

Florian Westphal (2):
      netfilter: nfnetlink_queue: silence bogus compiler warning
      selftests: netfilter: switch zone stress to socat

Jakub Kicinski (10):
      treewide: Add missing includes masked by cgroup -> bpf dependency
      Merge tag 'linux-can-fixes-for-5.16-20211207' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
      Merge branch 'net-tls-cover-all-ciphers-with-tests'
      Merge branch 'net-phy-fix-doc-build-warning'
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
      Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
      Merge tag 'linux-can-fixes-for-5.16-20211209' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
      Merge branch 'net-wwan-iosm-bug-fixes'

Jesse Brandeburg (2):
      ice: ignore dropped packets during init
      ice: safer stats processing

Jianglei Nie (1):
      nfp: Fix memory leak in nfp_cpp_area_cache_add()

Jianguo Wu (1):
      udp: using datalen to cap max gso segments

Jiasheng Jiang (1):
      net: bcm4908: Handle dma_set_coherent_mask error codes

Jimmy Assarsson (2):
      can: kvaser_pciefd: kvaser_pciefd_rx_error_frame(): increase correct stats->{rx,tx}_errors counter
      can: kvaser_usb: get CAN clock frequency from device

Joakim Zhang (1):
      net: fec: only clear interrupt of handling queue in fec_enet_rx_queue()

Johan Almbladh (1):
      mips, bpf: Fix reference to non-existing Kconfig symbol

John Fastabend (2):
      bpf, sockmap: Attach map progs to psock early for feature probes
      bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap

José Expósito (2):
      net: mana: Fix memory leak in mana_hwc_create_wq
      net: dsa: felix: Fix memory leak in felix_setup_mmio_filtering

Julian Wiedmann (1):
      MAINTAINERS: s390/net: remove myself as maintainer

Karen Sornek (1):
      i40e: Fix failed opcode appearing if handling messages from VF

Krzysztof Kozlowski (1):
      nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done

Kumar Kartikeya Dwivedi (3):
      bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL
      bpf: Fix bpf_check_mod_kfunc_call for built-in modules
      tools/resolve_btfids: Skip unresolved symbol warning for empty BTF sets

Lee Jones (1):
      net: cdc_ncm: Allow for dwNtbOutMaxSize to be unset or zero

Li Zhijian (4):
      selftests/tc-testing: add exit code
      selftests/tc-testing: add missing config
      selftests/tc-testing: Fix cannot create /sys/bus/netdevsim/new_device: Directory nonexistent
      selftests: net/fcnal-test.sh: add exit code

Louis Amas (1):
      net: mvpp2: fix XDP rx queues registering

M Chetan Kumar (3):
      net: wwan: iosm: fixes unnecessary doorbell send
      net: wwan: iosm: fixes net interface nonfunctional after fw flash
      net: wwan: iosm: fixes unable to send AT command during mbim tx

Manish Chopra (1):
      qede: validate non LSO skb length

Mateusz Palczewski (1):
      i40e: Fix pre-set max number of queues for VF

Matthias Schiffer (5):
      can: m_can: pci: fix iomap_read_fifo() and iomap_write_fifo()
      can: m_can: pci: fix incorrect reference clock rate
      Revert "can: m_can: remove support for custom bit timing"
      can: m_can: make custom bittiming fields const
      can: m_can: pci: use custom bit timings for Elkhart Lake

Maxim Mikityanskiy (2):
      bpf: Fix the off-by-two error in range markings
      bpf: Add selftests to cover packet access corner cases

Michal Maloszewski (1):
      iavf: Fix reporting when setting descriptor count

Michal Swiatkowski (2):
      ice: fix choosing UDP header type
      ice: fix adding different tunnels

Mitch Williams (1):
      iavf: restore MSI state on reset

Nicolas Dichtel (1):
      vrf: don't run conntrack on vrf with !dflt qdisc

Norbert Zulinski (1):
      i40e: Fix NULL pointer dereference in i40e_dbg_dump_desc

Pablo Neira Ayuso (1):
      netfilter: nft_exthdr: break evaluation if setting TCP option fails

Paul Greenwalt (1):
      ice: rearm other interrupt cause register after enabling VFs

Peilin Ye (1):
      selftests/fib_tests: Rework fib_rp_filter_test()

Petr Machata (1):
      MAINTAINERS: net: mlxsw: Remove Jiri as a maintainer, add myself

Ronak Doshi (1):
      vmxnet3: fix minimum vectors alloc issue

Russell King (Oracle) (2):
      net: dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"
      net: dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports

Sebastian Andrzej Siewior (2):
      Documentation/locking/locktypes: Update migrate_disable() bits.
      bpf: Make sure bpf_disable_instrumentation() is safe vs preemption.

Stefano Brivio (2):
      nft_set_pipapo: Fix bucket load in AVX2 lookup routine for six 8-bit groups
      selftests: netfilter: Add correctness test for mac,net set type

Tadeusz Struk (1):
      nfc: fix segfault in nfc_genl_dump_devices_done

Vadim Fedorenko (2):
      selftests: tls: add missing AES-CCM cipher tests
      selftests: tls: add missing AES256-GCM cipher

Vincent Mailhol (2):
      can: pch_can: pch_can_rx_normal: fix use after free
      can: m_can: m_can_read_fifo: fix memory leak in error branch

Yahui Cao (1):
      ice: fix FDIR init missing when reset VF

Yanteng Si (2):
      net: phy: Remove unnecessary indentation in the comments of phy_device
      net: phy: Add the missing blank line in the phylink_suspend comment

 Documentation/locking/locktypes.rst                |   9 +-
 MAINTAINERS                                        |   4 +-
 arch/mips/net/bpf_jit_comp.h                       |   2 +-
 block/fops.c                                       |   1 +
 drivers/gpu/drm/drm_gem_shmem_helper.c             |   1 +
 drivers/gpu/drm/i915/gt/intel_gtt.c                |   1 +
 drivers/gpu/drm/i915/i915_request.c                |   1 +
 drivers/gpu/drm/lima/lima_device.c                 |   1 +
 drivers/gpu/drm/msm/msm_gem_shrinker.c             |   1 +
 drivers/gpu/drm/ttm/ttm_tt.c                       |   1 +
 drivers/net/bonding/bond_alb.c                     |  14 +-
 drivers/net/can/kvaser_pciefd.c                    |   8 +-
 drivers/net/can/m_can/m_can.c                      |  42 +-
 drivers/net/can/m_can/m_can.h                      |   3 +
 drivers/net/can/m_can/m_can_pci.c                  |  62 +-
 drivers/net/can/pch_can.c                          |   2 +-
 drivers/net/can/sja1000/ems_pcmcia.c               |   7 +-
 drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c   | 101 +++-
 drivers/net/dsa/mv88e6xxx/chip.c                   |  85 +--
 drivers/net/dsa/mv88e6xxx/serdes.c                 |   8 +-
 drivers/net/dsa/ocelot/felix.c                     |   5 +-
 drivers/net/ethernet/altera/altera_tse_main.c      |   9 +-
 drivers/net/ethernet/broadcom/bcm4908_enet.c       |   4 +-
 drivers/net/ethernet/freescale/fec.h               |   3 +
 drivers/net/ethernet/freescale/fec_main.c          |   2 +-
 drivers/net/ethernet/google/gve/gve_utils.c        |   3 +
 drivers/net/ethernet/huawei/hinic/hinic_sriov.c    |   1 +
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c     |   8 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  75 ++-
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |   2 +
 drivers/net/ethernet/intel/iavf/iavf_ethtool.c     |  43 +-
 drivers/net/ethernet/intel/iavf/iavf_main.c        |   1 +
 drivers/net/ethernet/intel/ice/ice_dcb_nl.c        |  18 +-
 drivers/net/ethernet/intel/ice/ice_ethtool_fdir.c  |   4 +-
 drivers/net/ethernet/intel/ice/ice_fdir.c          |   2 +-
 drivers/net/ethernet/intel/ice/ice_flex_pipe.c     |   7 +-
 drivers/net/ethernet/intel/ice/ice_flex_pipe.h     |   3 +-
 drivers/net/ethernet/intel/ice/ice_main.c          |  32 +-
 drivers/net/ethernet/intel/ice/ice_switch.c        |  19 +-
 drivers/net/ethernet/intel/ice/ice_tc_lib.c        |  30 +-
 drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c   |   6 +
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c    |   4 +-
 .../net/ethernet/marvell/octeontx2/nic/otx2_ptp.c  |   2 +
 drivers/net/ethernet/microsoft/mana/hw_channel.c   |  10 +-
 .../ethernet/netronome/nfp/nfpcore/nfp_cppcore.c   |   4 +-
 drivers/net/ethernet/qlogic/qede/qede_fp.c         |   7 +
 drivers/net/ethernet/qlogic/qla3xxx.c              |  19 +-
 drivers/net/phy/phylink.c                          |   1 +
 drivers/net/usb/cdc_ncm.c                          |   2 +
 drivers/net/vmxnet3/vmxnet3_drv.c                  |  13 +-
 drivers/net/vrf.c                                  |   8 +-
 drivers/net/wwan/iosm/iosm_ipc_imem.c              |  26 +-
 drivers/net/wwan/iosm/iosm_ipc_imem.h              |   4 +-
 drivers/net/wwan/iosm/iosm_ipc_imem_ops.c          |   7 +-
 drivers/pci/controller/dwc/pci-exynos.c            |   1 +
 drivers/pci/controller/dwc/pcie-qcom-ep.c          |   1 +
 drivers/usb/cdns3/host.c                           |   1 +
 include/linux/bpf.h                                |  17 +-
 include/linux/btf.h                                |  14 +-
 include/linux/cacheinfo.h                          |   1 -
 include/linux/device/driver.h                      |   1 +
 include/linux/filter.h                             |   5 +-
 include/linux/phy.h                                |  11 +-
 include/net/bond_alb.h                             |   2 +-
 include/net/busy_poll.h                            |  13 +
 include/net/netfilter/nf_conntrack.h               |   6 +-
 kernel/bpf/btf.c                                   |  11 +-
 kernel/bpf/verifier.c                              |   2 +-
 lib/Kconfig.debug                                  |   1 +
 mm/damon/vaddr.c                                   |   1 +
 mm/memory_hotplug.c                                |   1 +
 mm/swap_slots.c                                    |   1 +
 net/core/devlink.c                                 |  16 +-
 net/core/neighbour.c                               |   3 +-
 net/core/skmsg.c                                   |   5 +
 net/core/sock_map.c                                |  15 +-
 net/ethtool/netlink.c                              |   3 +-
 net/ipv4/inet_connection_sock.c                    |   2 +-
 net/ipv4/tcp_minisocks.c                           |   4 +-
 net/ipv4/udp.c                                     |   2 +-
 net/ipv6/seg6_iptunnel.c                           |   8 +
 net/netfilter/nf_conntrack_core.c                  |   6 +-
 net/netfilter/nf_conntrack_netlink.c               |   2 +-
 net/netfilter/nf_flow_table_core.c                 |   4 +-
 net/netfilter/nfnetlink_queue.c                    |   2 +-
 net/netfilter/nft_exthdr.c                         |  11 +-
 net/netfilter/nft_set_pipapo_avx2.c                |   2 +-
 net/nfc/netlink.c                                  |  12 +-
 net/sched/sch_fq_pie.c                             |   1 +
 tools/bpf/resolve_btfids/main.c                    |   8 +-
 .../bpf/verifier/xdp_direct_packet_access.c        | 632 +++++++++++++++++++--
 tools/testing/selftests/net/fcnal-test.sh          |   8 +
 tools/testing/selftests/net/fib_tests.sh           |  59 +-
 tools/testing/selftests/net/tls.c                  |  36 ++
 tools/testing/selftests/netfilter/conntrack_vrf.sh |  30 +-
 .../selftests/netfilter/nft_concat_range.sh        |  24 +-
 .../testing/selftests/netfilter/nft_zones_many.sh  |  19 +-
 tools/testing/selftests/tc-testing/config          |   2 +
 tools/testing/selftests/tc-testing/tdc.py          |   8 +-
 tools/testing/selftests/tc-testing/tdc.sh          |   1 +
 100 files changed, 1370 insertions(+), 383 deletions(-)

^ permalink raw reply

* Re: [PATCH v1 bpf 1/1] libbpf: don't force user-supplied ifname string to be of fixed size
From: Andrii Nakryiko @ 2021-12-09 17:17 UTC (permalink / raw)
  To: Emmanuel Deloget
  Cc: Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Networking, bpf, open list
In-Reply-To: <20211209120327.551952-1-emmanuel.deloget@eho.link>

On Thu, Dec 9, 2021 at 4:03 AM Emmanuel Deloget
<emmanuel.deloget@eho.link> wrote:
>
> When calling either xsk_socket__create_shared() or xsk_socket__create()
> the user supplies a const char *ifname which is implicitely supposed to
> be a pointer to the start of a char[IFNAMSIZ] array. The internal
> function xsk_create_ctx() then blindly copy IFNAMSIZ bytes from this
> string into the xsk context.
>
> This is counter-intuitive and error-prone.
>
> For example,
>
>         int r = xsk_socket__create(..., "eth0", ...)
>
> may result in an invalid object because of the blind copy. The "eth0"
> string might be followed by random data from the ro data section,
> resulting in ctx->ifname being filled with the correct interface name
> then a bunch and invalid bytes.
>
> The same kind of issue arises when the ifname string is located on the
> stack:
>
>         char ifname[] = "eth0";
>         int r = xsk_socket__create(..., ifname, ...);
>
> Or comes from the command line
>
>         const char *ifname = argv[n];
>         int r = xsk_socket__create(..., ifname, ...);
>
> In both case we'll fill ctx->ifname with random data from the stack.
>
> In practice, we saw that this issue caused various small errors which,
> in then end, prevented us to setup a valid xsk context that would have
> allowed us to capture packets on our interfaces. We fixed this issue in
> our code by forcing our char ifname[] to be of size IFNAMSIZ but that felt
> weird and unnecessary.

I might be missing something, but the eth0 example above would include
terminating zero at the right place, so ifname will still have
"eth0\0" which is a valid string. Yes there will be some garbage after
that, but it shouldn't matter. It could cause ASAN to complain about
reading beyond allocated memory, of course, but I'm curious what
problems you actually ran into in practice.

>
> Fixes: 2f6324a3937f8 (libbpf: Support shared umems between queues and devices)
> Signed-off-by: Emmanuel Deloget <emmanuel.deloget@eho.link>
> ---
>  tools/lib/bpf/xsk.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> index 81f8fbc85e70..8dda80bcefcc 100644
> --- a/tools/lib/bpf/xsk.c
> +++ b/tools/lib/bpf/xsk.c
> @@ -944,6 +944,7 @@ static struct xsk_ctx *xsk_create_ctx(struct xsk_socket *xsk,
>  {
>         struct xsk_ctx *ctx;
>         int err;
> +       size_t ifnamlen;
>
>         ctx = calloc(1, sizeof(*ctx));
>         if (!ctx)
> @@ -965,8 +966,10 @@ static struct xsk_ctx *xsk_create_ctx(struct xsk_socket *xsk,
>         ctx->refcount = 1;
>         ctx->umem = umem;
>         ctx->queue_id = queue_id;
> -       memcpy(ctx->ifname, ifname, IFNAMSIZ - 1);
> -       ctx->ifname[IFNAMSIZ - 1] = '\0';
> +
> +       ifnamlen = strnlen(ifname, IFNAMSIZ);
> +       memcpy(ctx->ifname, ifname, ifnamlen);

maybe use strncpy instead of strnlen + memcpy? keep the guaranteed
zero termination (and keep '\0', why did you change it?)

Also, note that xsk.c is deprecated in libbpf and has been moved into
libxdp, so please contribute a similar fix there.

> +       ctx->ifname[IFNAMSIZ - 1] = 0;
>
>         ctx->fill = fill;
>         ctx->comp = comp;
> --
> 2.32.0
>

^ permalink raw reply

* Re: [PATCH v2, 1/2] yaml: Add dm9051 SPI network yaml file
From: Rob Herring @ 2021-12-09 17:10 UTC (permalink / raw)
  To: JosephCHANG
  Cc: joseph_chang, David S . Miller, linux-kernel, Jakub Kicinski,
	netdev, Rob Herring, devicetree
In-Reply-To: <20211209100702.5609-2-josright123@gmail.com>

On Thu, 09 Dec 2021 18:07:01 +0800, JosephCHANG wrote:
> For support davicom dm9051 device tree config
> 
> Signed-off-by: JosephCHANG <josright123@gmail.com>
> ---
>  .../bindings/net/davicom,dm9051.yaml          | 62 +++++++++++++++++++
>  1 file changed, 62 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/davicom,dm9051.yaml
> 

My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check'
on your patch (DT_CHECKER_FLAGS is new in v5.13):

yamllint warnings/errors:

dtschema/dtc warnings/errors:
Error: Documentation/devicetree/bindings/net/davicom,dm9051.example.dts:29.34-35 syntax error
FATAL ERROR: Unable to parse input tree
make[1]: *** [scripts/Makefile.lib:373: Documentation/devicetree/bindings/net/davicom,dm9051.example.dt.yaml] Error 1
make[1]: *** Waiting for unfinished jobs....
make: *** [Makefile:1413: dt_binding_check] Error 2

doc reference errors (make refcheckdocs):

See https://patchwork.ozlabs.org/patch/1565710

This check can fail if there are any dependencies. The base for a patch
series is generally the most recent rc1.

If you already ran 'make dt_binding_check' and didn't see the above
error(s), then make sure 'yamllint' is installed and dt-schema is up to
date:

pip3 install dtschema --upgrade

Please check and re-submit.

^ permalink raw reply

* [PATCH bpf-next v2 9/9] selftests/bpf: Add test for unstable CT lookup API
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211209170929.3485242-1-memxor@gmail.com>

This tests that we return errors as documented, and also that the kfunc
calls work from both XDP and TC hooks.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 tools/testing/selftests/bpf/config            |   4 +
 .../testing/selftests/bpf/prog_tests/bpf_nf.c |  48 ++++++++
 .../testing/selftests/bpf/progs/test_bpf_nf.c | 113 ++++++++++++++++++
 3 files changed, 165 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_nf.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_bpf_nf.c

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 5192305159ec..4a2a47fcd6ef 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -46,3 +46,7 @@ CONFIG_IMA_READ_POLICY=y
 CONFIG_BLK_DEV_LOOP=y
 CONFIG_FUNCTION_TRACER=y
 CONFIG_DYNAMIC_FTRACE=y
+CONFIG_NETFILTER=y
+CONFIG_NF_DEFRAG_IPV4=y
+CONFIG_NF_DEFRAG_IPV6=y
+CONFIG_NF_CONNTRACK=y
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_nf.c b/tools/testing/selftests/bpf/prog_tests/bpf_nf.c
new file mode 100644
index 000000000000..56e8d745b8c8
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_nf.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "test_bpf_nf.skel.h"
+
+enum {
+	TEST_XDP,
+	TEST_TC_BPF,
+};
+
+void test_bpf_nf_ct(int mode)
+{
+	struct test_bpf_nf *skel;
+	int prog_fd, err, retval;
+
+	skel = test_bpf_nf__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_bpf_nf__open_and_load"))
+		return;
+
+	if (mode == TEST_XDP)
+		prog_fd = bpf_program__fd(skel->progs.nf_xdp_ct_test);
+	else
+		prog_fd = bpf_program__fd(skel->progs.nf_skb_ct_test);
+
+	err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4), NULL, NULL,
+				(__u32 *)&retval, NULL);
+	if (!ASSERT_OK(err, "bpf_prog_test_run"))
+		goto end;
+
+	ASSERT_EQ(skel->bss->test_einval_bpf_tuple, -EINVAL, "Test EINVAL for NULL bpf_tuple");
+	ASSERT_EQ(skel->bss->test_einval_reserved, -EINVAL, "Test EINVAL for reserved not set to 0");
+	ASSERT_EQ(skel->bss->test_einval_netns_id, -EINVAL, "Test EINVAL for netns_id < -1");
+	ASSERT_EQ(skel->bss->test_einval_len_opts, -EINVAL, "Test EINVAL for len__opts != NF_BPF_CT_OPTS_SZ");
+	ASSERT_EQ(skel->bss->test_eproto_l4proto, -EPROTO, "Test EPROTO for l4proto != TCP or UDP");
+	ASSERT_EQ(skel->bss->test_enonet_netns_id, -ENONET, "Test ENONET for bad but valid netns_id");
+	ASSERT_EQ(skel->bss->test_enoent_lookup, -ENOENT, "Test ENOENT for failed lookup");
+	ASSERT_EQ(skel->bss->test_eafnosupport, -EAFNOSUPPORT,"Test EAFNOSUPPORT for invalid len__tuple");
+end:
+	test_bpf_nf__destroy(skel);
+}
+
+void test_bpf_nf(void)
+{
+	if (test__start_subtest("xdp-ct"))
+		test_bpf_nf_ct(TEST_XDP);
+	if (test__start_subtest("tc-bpf-ct"))
+		test_bpf_nf_ct(TEST_TC_BPF);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_bpf_nf.c b/tools/testing/selftests/bpf/progs/test_bpf_nf.c
new file mode 100644
index 000000000000..7cfff245b24f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_bpf_nf.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+#define EAFNOSUPPORT 97
+#define EPROTO 71
+#define ENONET 64
+#define EINVAL 22
+#define ENOENT 2
+
+int test_einval_bpf_tuple = 0;
+int test_einval_reserved = 0;
+int test_einval_netns_id = 0;
+int test_einval_len_opts = 0;
+int test_eproto_l4proto = 0;
+int test_enonet_netns_id = 0;
+int test_enoent_lookup = 0;
+int test_eafnosupport = 0;
+
+struct nf_conn *bpf_xdp_ct_lookup(struct xdp_md *, struct bpf_sock_tuple *, u32,
+				  struct bpf_ct_opts *, u32) __weak __ksym;
+struct nf_conn *bpf_skb_ct_lookup(struct __sk_buff *, struct bpf_sock_tuple *, u32,
+				  struct bpf_ct_opts *, u32) __weak __ksym;
+void bpf_ct_release(struct nf_conn *) __weak __ksym;
+
+#define nf_ct_test(func, ctx)                                                  \
+	({                                                                     \
+		struct bpf_ct_opts opts_def = { .l4proto = IPPROTO_TCP,        \
+						.netns_id = -1 };              \
+		struct bpf_sock_tuple bpf_tuple;                               \
+		struct nf_conn *ct;                                            \
+                                                                               \
+		__builtin_memset(&bpf_tuple, 0, sizeof(bpf_tuple.ipv4));       \
+		ct = func(ctx, NULL, 0, &opts_def, sizeof(opts_def));          \
+		if (ct)                                                        \
+			bpf_ct_release(ct);                                    \
+		else                                                           \
+			test_einval_bpf_tuple = opts_def.error;                \
+                                                                               \
+		opts_def.reserved[0] = 1;                                      \
+		ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,  \
+			  sizeof(opts_def));                                   \
+		opts_def.reserved[0] = 0;                                      \
+		opts_def.l4proto = IPPROTO_TCP;                                \
+		if (ct)                                                        \
+			bpf_ct_release(ct);                                    \
+		else                                                           \
+			test_einval_reserved = opts_def.error;                 \
+                                                                               \
+		opts_def.netns_id = -2;                                        \
+		ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,  \
+			  sizeof(opts_def));                                   \
+		opts_def.netns_id = -1;                                        \
+		if (ct)                                                        \
+			bpf_ct_release(ct);                                    \
+		else                                                           \
+			test_einval_netns_id = opts_def.error;                 \
+                                                                               \
+		ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,  \
+			  sizeof(opts_def) - 1);                               \
+		if (ct)                                                        \
+			bpf_ct_release(ct);                                    \
+		else                                                           \
+			test_einval_len_opts = opts_def.error;                 \
+                                                                               \
+		opts_def.l4proto = IPPROTO_ICMP;                               \
+		ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,  \
+			  sizeof(opts_def));                                   \
+		opts_def.l4proto = IPPROTO_TCP;                                \
+		if (ct)                                                        \
+			bpf_ct_release(ct);                                    \
+		else                                                           \
+			test_eproto_l4proto = opts_def.error;                  \
+                                                                               \
+		opts_def.netns_id = 0xf00f;                                    \
+		ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,  \
+			  sizeof(opts_def));                                   \
+		opts_def.netns_id = -1;                                        \
+		if (ct)                                                        \
+			bpf_ct_release(ct);                                    \
+		else                                                           \
+			test_enonet_netns_id = opts_def.error;                 \
+                                                                               \
+		ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,  \
+			  sizeof(opts_def));                                   \
+		if (ct)                                                        \
+			bpf_ct_release(ct);                                    \
+		else                                                           \
+			test_enoent_lookup = opts_def.error;                   \
+                                                                               \
+		ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4) - 1,         \
+			  &opts_def, sizeof(opts_def));                        \
+		if (ct)                                                        \
+			bpf_ct_release(ct);                                    \
+		else                                                           \
+			test_eafnosupport = opts_def.error;                    \
+	})
+
+SEC("xdp")
+int nf_xdp_ct_test(struct xdp_md *ctx)
+{
+	nf_ct_test(bpf_xdp_ct_lookup, ctx);
+	return 0;
+}
+
+SEC("tc")
+int nf_skb_ct_test(struct __sk_buff *ctx)
+{
+	nf_ct_test(bpf_skb_ct_lookup, ctx);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 8/9] selftests/bpf: Extend kfunc selftests
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211209170929.3485242-1-memxor@gmail.com>

Use the prog_test kfuncs to test the referenced PTR_TO_BTF_ID kfunc
support, and PTR_TO_CTX, PTR_TO_MEM argument passing support. Also
testing the various failure cases.

The failure selftests will test the following cases for kfunc:
kfunc_call_test_fail1 - Argument struct type has non-scalar member
kfunc_call_test_fail2 - Nesting depth of type > 8
kfunc_call_test_fail3 - Struct type has trailing zero-sized FAM
kfunc_call_test_fail4 - Trying to pass reg->type != PTR_TO_CTX when
			argument struct type is a ctx type
kfunc_call_test_fail5 - void * not part of mem, len pair
kfunc_call_test_fail6 - u64 * not part of mem, len pair
kfunc_call_test_fail7 - mark_btf_ld_reg copies ref_obj_id
kfunc_call_test_fail8 - Same type btf_struct_walk reference copy handled
			correctly during release (i.e. only parent
			object can be released)

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/kfunc_call.c     | 28 ++++++++++
 .../selftests/bpf/progs/kfunc_call_test.c     | 52 ++++++++++++++++++-
 .../bpf/progs/kfunc_call_test_fail1.c         | 16 ++++++
 .../bpf/progs/kfunc_call_test_fail2.c         | 16 ++++++
 .../bpf/progs/kfunc_call_test_fail3.c         | 16 ++++++
 .../bpf/progs/kfunc_call_test_fail4.c         | 16 ++++++
 .../bpf/progs/kfunc_call_test_fail5.c         | 16 ++++++
 .../bpf/progs/kfunc_call_test_fail6.c         | 16 ++++++
 .../bpf/progs/kfunc_call_test_fail7.c         | 24 +++++++++
 .../bpf/progs/kfunc_call_test_fail8.c         | 22 ++++++++
 10 files changed, 220 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail1.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail2.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail3.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail4.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail5.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail6.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail7.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail8.c

diff --git a/tools/testing/selftests/bpf/prog_tests/kfunc_call.c b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
index 7d7445ccc141..b6630b9427d0 100644
--- a/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
+++ b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
@@ -5,11 +5,33 @@
 #include "kfunc_call_test.lskel.h"
 #include "kfunc_call_test_subprog.skel.h"
 #include "kfunc_call_test_subprog.lskel.h"
+#include "kfunc_call_test_fail1.skel.h"
+#include "kfunc_call_test_fail2.skel.h"
+#include "kfunc_call_test_fail3.skel.h"
+#include "kfunc_call_test_fail4.skel.h"
+#include "kfunc_call_test_fail5.skel.h"
+#include "kfunc_call_test_fail6.skel.h"
+#include "kfunc_call_test_fail7.skel.h"
+#include "kfunc_call_test_fail8.skel.h"
 
 static void test_main(void)
 {
 	struct kfunc_call_test_lskel *skel;
 	int prog_fd, retval, err;
+	void *fskel;
+
+#define FAIL(nr)                                                               \
+	({                                                                     \
+		fskel = kfunc_call_test_fail##nr##__open_and_load();           \
+		if (!ASSERT_EQ(fskel, NULL,                                    \
+			       "kfunc_call_test_fail" #nr                      \
+			       "__open_and_load")) {                           \
+			kfunc_call_test_fail##nr##__destroy(fskel);            \
+			return;                                                \
+		}                                                              \
+	})
+
+	FAIL(1); FAIL(2); FAIL(3); FAIL(4); FAIL(5); FAIL(6); FAIL(7); FAIL(8);
 
 	skel = kfunc_call_test_lskel__open_and_load();
 	if (!ASSERT_OK_PTR(skel, "skel"))
@@ -27,6 +49,12 @@ static void test_main(void)
 	ASSERT_OK(err, "bpf_prog_test_run(test2)");
 	ASSERT_EQ(retval, 3, "test2-retval");
 
+	prog_fd = skel->progs.kfunc_call_test_ref_btf_id.prog_fd;
+	err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+				NULL, NULL, (__u32 *)&retval, NULL);
+	ASSERT_OK(err, "bpf_prog_test_run(test_ref_btf_id)");
+	ASSERT_EQ(retval, 0, "test_ref_btf_id-retval");
+
 	kfunc_call_test_lskel__destroy(skel);
 }
 
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test.c b/tools/testing/selftests/bpf/progs/kfunc_call_test.c
index 8a8cf59017aa..5aecbb9fdc68 100644
--- a/tools/testing/selftests/bpf/progs/kfunc_call_test.c
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test.c
@@ -1,13 +1,20 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2021 Facebook */
-#include <linux/bpf.h>
+#include <vmlinux.h>
 #include <bpf/bpf_helpers.h>
-#include "bpf_tcp_helpers.h"
 
 extern int bpf_kfunc_call_test2(struct sock *sk, __u32 a, __u32 b) __ksym;
 extern __u64 bpf_kfunc_call_test1(struct sock *sk, __u32 a, __u64 b,
 				  __u32 c, __u64 d) __ksym;
 
+extern struct prog_test_ref_kfunc *bpf_kfunc_call_test_acquire(unsigned long *sp) __ksym;
+extern void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p) __ksym;
+extern void bpf_kfunc_call_test_pass_ctx(struct __sk_buff *skb) __ksym;
+extern void bpf_kfunc_call_test_pass1(struct prog_test_pass1 *p) __ksym;
+extern void bpf_kfunc_call_test_pass2(struct prog_test_pass2 *p) __ksym;
+extern void bpf_kfunc_call_test_mem_len_pass1(void *mem, int len) __ksym;
+extern void bpf_kfunc_call_test_mem_len_fail2(__u64 *mem, int len) __ksym;
+
 SEC("tc")
 int kfunc_call_test2(struct __sk_buff *skb)
 {
@@ -44,4 +51,45 @@ int kfunc_call_test1(struct __sk_buff *skb)
 	return ret;
 }
 
+SEC("tc")
+int kfunc_call_test_ref_btf_id(struct __sk_buff *skb)
+{
+	struct prog_test_ref_kfunc *pt;
+	unsigned long s = 0;
+	int ret = 0;
+
+	pt = bpf_kfunc_call_test_acquire(&s);
+	if (pt) {
+		if (pt->a != 42 || pt->b != 108)
+			ret = -1;
+		bpf_kfunc_call_test_release(pt);
+	}
+	return ret;
+}
+
+SEC("tc")
+int kfunc_call_test_pass(struct __sk_buff *skb)
+{
+	struct prog_test_pass1 p1 = {};
+	struct prog_test_pass2 p2 = {};
+	short a = 0;
+	__u64 b = 0;
+	long c = 0;
+	char d = 0;
+	int e = 0;
+
+	bpf_kfunc_call_test_pass_ctx(skb);
+	bpf_kfunc_call_test_pass1(&p1);
+	bpf_kfunc_call_test_pass2(&p2);
+
+	bpf_kfunc_call_test_mem_len_pass1(&a, sizeof(a));
+	bpf_kfunc_call_test_mem_len_pass1(&b, sizeof(b));
+	bpf_kfunc_call_test_mem_len_pass1(&c, sizeof(c));
+	bpf_kfunc_call_test_mem_len_pass1(&d, sizeof(d));
+	bpf_kfunc_call_test_mem_len_pass1(&e, sizeof(e));
+	bpf_kfunc_call_test_mem_len_fail2(&b, -1);
+
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test_fail1.c b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail1.c
new file mode 100644
index 000000000000..4088000dcfc0
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail1.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+extern void bpf_kfunc_call_test_fail1(struct prog_test_fail1 *p) __ksym;
+
+SEC("tc")
+int kfunc_call_test_fail1(struct __sk_buff *skb)
+{
+	struct prog_test_fail1 s = {};
+
+	bpf_kfunc_call_test_fail1(&s);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test_fail2.c b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail2.c
new file mode 100644
index 000000000000..0c9779693576
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail2.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+extern void bpf_kfunc_call_test_fail2(struct prog_test_fail2 *p) __ksym;
+
+SEC("tc")
+int kfunc_call_test_fail2(struct __sk_buff *skb)
+{
+	struct prog_test_fail2 s = {};
+
+	bpf_kfunc_call_test_fail2(&s);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test_fail3.c b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail3.c
new file mode 100644
index 000000000000..4e5a7493cdf7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail3.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+extern void bpf_kfunc_call_test_fail3(struct prog_test_fail3 *p) __ksym;
+
+SEC("tc")
+int kfunc_call_test_fail3(struct __sk_buff *skb)
+{
+	struct prog_test_fail3 s = {};
+
+	bpf_kfunc_call_test_fail3(&s);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test_fail4.c b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail4.c
new file mode 100644
index 000000000000..01c3523c7c50
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail4.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+extern void bpf_kfunc_call_test_pass_ctx(struct __sk_buff *skb) __ksym;
+
+SEC("tc")
+int kfunc_call_test_fail4(struct __sk_buff *skb)
+{
+	struct __sk_buff local_skb = {};
+
+	bpf_kfunc_call_test_pass_ctx(&local_skb);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test_fail5.c b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail5.c
new file mode 100644
index 000000000000..e32f13709357
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail5.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+extern void bpf_kfunc_call_test_mem_len_fail1(void *mem, int len) __ksym;
+
+SEC("tc")
+int kfunc_call_test_fail5(struct __sk_buff *skb)
+{
+	int a = 0;
+
+	bpf_kfunc_call_test_mem_len_fail1(&a, sizeof(a));
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test_fail6.c b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail6.c
new file mode 100644
index 000000000000..998626aaca35
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail6.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+extern void bpf_kfunc_call_test_mem_len_fail2(__u64 *mem, int len) __ksym;
+
+SEC("tc")
+int kfunc_call_test_fail6(struct __sk_buff *skb)
+{
+	int a = 0;
+
+	bpf_kfunc_call_test_mem_len_fail2((void *)&a, sizeof(a));
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test_fail7.c b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail7.c
new file mode 100644
index 000000000000..05d4914b0533
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail7.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+extern struct prog_test_ref_kfunc *bpf_kfunc_call_test_acquire(unsigned long *sp) __ksym;
+extern void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p) __ksym;
+
+SEC("tc")
+int kfunc_call_test_fail7(struct __sk_buff *skb)
+{
+	struct prog_test_ref_kfunc *p, *p2;
+	unsigned long sp = 0;
+
+	p = bpf_kfunc_call_test_acquire(&sp);
+	if (p) {
+		p2 = p->next->next;
+		bpf_kfunc_call_test_release(p);
+		if (p2->a == 42)
+			return 1;
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test_fail8.c b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail8.c
new file mode 100644
index 000000000000..eac8637ce841
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test_fail8.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+extern struct prog_test_ref_kfunc *bpf_kfunc_call_test_acquire(unsigned long *sp) __ksym;
+extern void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p) __ksym;
+
+SEC("tc")
+int kfunc_call_test_fail8(struct __sk_buff *skb)
+{
+	struct prog_test_ref_kfunc *p, *p2;
+	unsigned long sp = 0;
+
+	p = bpf_kfunc_call_test_acquire(&sp);
+	if (p) {
+		p2 = p->next->next;
+		bpf_kfunc_call_test_release(p2);
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 7/9] net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211209170929.3485242-1-memxor@gmail.com>

This change adds conntrack lookup helpers using the unstable kfunc call
interface for the XDP and TC-BPF hooks. The primary usecase is
implementing a synproxy in XDP, see Maxim's patchset at [0].

Also add acquire/release functions (randomly returning NULL), and also
exercise the PTR_TO_BTF_ID_OR_NULL path so that BPF program caller has
to check for NULL before dereferencing the pointer, for the TC hook.
Introduce kfunc that take various argument types (for PTR_TO_MEM) that
will pass and fail the verifier checks. These will be used in selftests.

Export get_net_ns_by_id as nf_conntrack needs to call it.

Note that we search for acquire, release, and null returning kfuncs in
the intersection of those sets and main set.

This implies that the kfunc_btf_id_list acq_set, rel_set, null_set may
contain BTF ID not in main set, this is explicitly allowed and
recommended (to save on definining more and more sets), since
check_kfunc_call verifier operation would filter out the invalid BTF ID
fairly early, so later checks for acquire, release, and ret_type_null
kfunc will only consider allowed BTF IDs for that program that are
allowed in main set. This is why the nf_conntrack_acq_ids set has BTF
IDs for both xdp and tc hook kfuncs.

  [0]: https://lore.kernel.org/bpf/20211019144655.3483197-1-maximmi@nvidia.com

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h               |  21 +++
 include/linux/btf.h               |  30 +++-
 kernel/bpf/btf.c                  |  19 +++
 net/bpf/test_run.c                | 147 +++++++++++++++++
 net/core/filter.c                 |  27 ++++
 net/core/net_namespace.c          |   1 +
 net/netfilter/nf_conntrack_core.c | 252 ++++++++++++++++++++++++++++++
 7 files changed, 496 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6a14072c72a0..7b370ed06b1d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1675,6 +1675,9 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog,
 				const union bpf_attr *kattr,
 				union bpf_attr __user *uattr);
 bool bpf_prog_test_check_kfunc_call(u32 kfunc_id, struct module *owner);
+bool bpf_prog_test_is_acquire_kfunc(u32 kfunc_id, struct module *owner);
+bool bpf_prog_test_is_release_kfunc(u32 kfunc_id, struct module *owner);
+bool bpf_prog_test_is_kfunc_ret_type_null(u32 kfunc_id, struct module *owner);
 bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 		    const struct bpf_prog *prog,
 		    struct bpf_insn_access_aux *info);
@@ -1933,6 +1936,24 @@ static inline bool bpf_prog_test_check_kfunc_call(u32 kfunc_id,
 	return false;
 }
 
+static inline bool bpf_prog_test_is_acquire_kfunc(u32 kfunc_id,
+						  struct module *owner)
+{
+	return false;
+}
+
+static inline bool bpf_prog_test_is_release_kfunc(u32 kfunc_id,
+						  struct module *owner)
+{
+	return false;
+}
+
+static inline bool bpf_prog_test_is_kfunc_ret_type_null(u32 kfunc_id,
+							struct module *owner)
+{
+	return false;
+}
+
 static inline void bpf_map_put(struct bpf_map *map)
 {
 }
diff --git a/include/linux/btf.h b/include/linux/btf.h
index ec9e69c14480..8362a39fe522 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -321,7 +321,10 @@ static inline const char *btf_name_by_offset(const struct btf *btf,
 #endif
 
 enum kfunc_btf_id_set_types {
-	BTF_SET_CHECK,
+	BTF_SET_CHECK,    /* Allowed kfunc set */
+	BTF_SET_ACQUIRE,  /* Acquire kfunc set */
+	BTF_SET_RELEASE,  /* Release kfunc set */
+	BTF_SET_RET_NULL, /* kfunc with 'return type PTR_TO_BTF_ID_OR_NULL' set */
 	__BTF_SET_MAX,
 };
 
@@ -331,6 +334,9 @@ struct kfunc_btf_id_set {
 		struct btf_id_set *sets[__BTF_SET_MAX];
 		struct {
 			struct btf_id_set *set;
+			struct btf_id_set *acq_set;
+			struct btf_id_set *rel_set;
+			struct btf_id_set *null_set;
 		};
 	};
 	struct module *owner;
@@ -345,6 +351,12 @@ void unregister_kfunc_btf_id_set(struct kfunc_btf_id_list *l,
 				 struct kfunc_btf_id_set *s);
 bool bpf_check_mod_kfunc_call(struct kfunc_btf_id_list *klist, u32 kfunc_id,
 			      struct module *owner);
+bool bpf_is_mod_acquire_kfunc(struct kfunc_btf_id_list *klist, u32 kfunc_id,
+			      struct module *owner);
+bool bpf_is_mod_release_kfunc(struct kfunc_btf_id_list *klist, u32 kfunc_id,
+			      struct module *owner);
+bool bpf_is_mod_kfunc_ret_type_null(struct kfunc_btf_id_list *klist,
+				    u32 kfunc_id, struct module *owner);
 #else
 static inline void register_kfunc_btf_id_set(struct kfunc_btf_id_list *l,
 					     struct kfunc_btf_id_set *s)
@@ -359,9 +371,25 @@ bool bpf_check_mod_kfunc_call(struct kfunc_btf_id_list *klist, u32 kfunc_id,
 {
 	return false;
 }
+bool bpf_is_mod_acquire_kfunc(struct kfunc_btf_id_list *klist, u32 kfunc_id,
+			      struct module *owner)
+{
+	return false;
+}
+bool bpf_is_mod_release_kfunc(struct kfunc_btf_id_list *klist, u32 kfunc_id,
+			      struct module *owner)
+{
+	return false;
+}
+bool bpf_is_mod_kfunc_ret_type_null(struct kfunc_btf_id_list *klist,
+				    u32 kfunc_id, struct module *owner)
+{
+	return false;
+}
 #endif
 
 extern struct kfunc_btf_id_list bpf_tcp_ca_kfunc_list;
 extern struct kfunc_btf_id_list prog_test_kfunc_list;
+extern struct kfunc_btf_id_list xdp_kfunc_list;
 
 #endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index a790ba6d93d8..4b955f47b800 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6539,6 +6539,24 @@ bool bpf_check_mod_kfunc_call(struct kfunc_btf_id_list *klist, u32 kfunc_id,
 	return kfunc_btf_id_set_contains(klist, kfunc_id, owner, BTF_SET_CHECK);
 }
 
+bool bpf_is_mod_acquire_kfunc(struct kfunc_btf_id_list *klist, u32 kfunc_id,
+			      struct module *owner)
+{
+	return kfunc_btf_id_set_contains(klist, kfunc_id, owner, BTF_SET_ACQUIRE);
+}
+
+bool bpf_is_mod_release_kfunc(struct kfunc_btf_id_list *klist, u32 kfunc_id,
+			      struct module *owner)
+{
+	return kfunc_btf_id_set_contains(klist, kfunc_id, owner, BTF_SET_RELEASE);
+}
+
+bool bpf_is_mod_kfunc_ret_type_null(struct kfunc_btf_id_list *klist,
+				    u32 kfunc_id, struct module *owner)
+{
+	return kfunc_btf_id_set_contains(klist, kfunc_id, owner, BTF_SET_RET_NULL);
+}
+
 #endif
 
 #define DEFINE_KFUNC_BTF_ID_LIST(name)                                         \
@@ -6548,6 +6566,7 @@ bool bpf_check_mod_kfunc_call(struct kfunc_btf_id_list *klist, u32 kfunc_id,
 
 DEFINE_KFUNC_BTF_ID_LIST(bpf_tcp_ca_kfunc_list);
 DEFINE_KFUNC_BTF_ID_LIST(prog_test_kfunc_list);
+DEFINE_KFUNC_BTF_ID_LIST(xdp_kfunc_list);
 
 int bpf_core_types_are_compat(const struct btf *local_btf, __u32 local_id,
 			      const struct btf *targ_btf, __u32 targ_id)
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 46dd95755967..779377f4af25 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -232,6 +232,117 @@ struct sock * noinline bpf_kfunc_call_test3(struct sock *sk)
 	return sk;
 }
 
+struct prog_test_ref_kfunc {
+	int a;
+	int b;
+	struct prog_test_ref_kfunc *next;
+};
+
+static struct prog_test_ref_kfunc prog_test_struct = {
+	.a = 42,
+	.b = 108,
+	.next = &prog_test_struct,
+};
+
+noinline struct prog_test_ref_kfunc *
+bpf_kfunc_call_test_acquire(unsigned long *scalar_ptr)
+{
+	/* randomly return NULL */
+	if (get_jiffies_64() % 2)
+		return NULL;
+	return &prog_test_struct;
+}
+
+noinline void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p)
+{
+}
+
+struct prog_test_pass1 {
+	int x0;
+	struct {
+		int x1;
+		struct {
+			int x2;
+			struct {
+				int x3;
+				struct {
+					int x4;
+					struct {
+						int x5;
+						struct {
+							int x6;
+							struct {
+								int x7;
+							};
+						};
+					};
+				};
+			};
+		};
+	};
+};
+
+struct prog_test_pass2 {
+	int len;
+	short arr1[4];
+	struct {
+		char arr2[4];
+		unsigned long arr3[8];
+	} x;
+};
+
+struct prog_test_fail1 {
+	void *p;
+	int x;
+};
+
+struct prog_test_fail2 {
+	int x8;
+	struct prog_test_pass1 x;
+};
+
+struct prog_test_fail3 {
+	int len;
+	char arr1[2];
+	char arr2[0];
+};
+
+noinline void bpf_kfunc_call_test_pass_ctx(struct __sk_buff *skb)
+{
+}
+
+noinline void bpf_kfunc_call_test_pass1(struct prog_test_pass1 *p)
+{
+}
+
+noinline void bpf_kfunc_call_test_pass2(struct prog_test_pass2 *p)
+{
+}
+
+noinline void bpf_kfunc_call_test_fail1(struct prog_test_fail1 *p)
+{
+}
+
+noinline void bpf_kfunc_call_test_fail2(struct prog_test_fail2 *p)
+{
+}
+
+noinline void bpf_kfunc_call_test_fail3(struct prog_test_fail3 *p)
+{
+}
+
+noinline void bpf_kfunc_call_test_mem_len_pass1(void *mem, int len__mem)
+{
+}
+
+noinline void bpf_kfunc_call_test_mem_len_fail1(void *mem, int len)
+{
+}
+
+noinline void bpf_kfunc_call_test_mem_len_fail2(u64 *mem, int len)
+{
+}
+
 __diag_pop();
 
 ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO);
@@ -240,8 +351,23 @@ BTF_SET_START(test_sk_kfunc_ids)
 BTF_ID(func, bpf_kfunc_call_test1)
 BTF_ID(func, bpf_kfunc_call_test2)
 BTF_ID(func, bpf_kfunc_call_test3)
+BTF_ID(func, bpf_kfunc_call_test_acquire)
+BTF_ID(func, bpf_kfunc_call_test_release)
+BTF_ID(func, bpf_kfunc_call_test_pass_ctx)
+BTF_ID(func, bpf_kfunc_call_test_pass1)
+BTF_ID(func, bpf_kfunc_call_test_pass2)
+BTF_ID(func, bpf_kfunc_call_test_fail1)
+BTF_ID(func, bpf_kfunc_call_test_fail2)
+BTF_ID(func, bpf_kfunc_call_test_fail3)
+BTF_ID(func, bpf_kfunc_call_test_mem_len_pass1)
+BTF_ID(func, bpf_kfunc_call_test_mem_len_fail1)
+BTF_ID(func, bpf_kfunc_call_test_mem_len_fail2)
 BTF_SET_END(test_sk_kfunc_ids)
 
+BTF_ID_LIST(test_sk_acq_rel)
+BTF_ID(func, bpf_kfunc_call_test_acquire)
+BTF_ID(func, bpf_kfunc_call_test_release)
+
 bool bpf_prog_test_check_kfunc_call(u32 kfunc_id, struct module *owner)
 {
 	if (btf_id_set_contains(&test_sk_kfunc_ids, kfunc_id))
@@ -249,6 +375,27 @@ bool bpf_prog_test_check_kfunc_call(u32 kfunc_id, struct module *owner)
 	return bpf_check_mod_kfunc_call(&prog_test_kfunc_list, kfunc_id, owner);
 }
 
+bool bpf_prog_test_is_acquire_kfunc(u32 kfunc_id, struct module *owner)
+{
+	if (kfunc_id == test_sk_acq_rel[0])
+		return true;
+	return bpf_is_mod_acquire_kfunc(&prog_test_kfunc_list, kfunc_id, owner);
+}
+
+bool bpf_prog_test_is_release_kfunc(u32 kfunc_id, struct module *owner)
+{
+	if (kfunc_id == test_sk_acq_rel[1])
+		return true;
+	return bpf_is_mod_release_kfunc(&prog_test_kfunc_list, kfunc_id, owner);
+}
+
+bool bpf_prog_test_is_kfunc_ret_type_null(u32 kfunc_id, struct module *owner)
+{
+	if (kfunc_id == test_sk_acq_rel[0])
+		return true;
+	return bpf_is_mod_kfunc_ret_type_null(&prog_test_kfunc_list, kfunc_id, owner);
+}
+
 static void *bpf_test_init(const union bpf_attr *kattr, u32 size,
 			   u32 headroom, u32 tailroom)
 {
diff --git a/net/core/filter.c b/net/core/filter.c
index fe27c91e3758..ad84ec6e0daa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10000,17 +10000,44 @@ const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
 	.gen_prologue		= tc_cls_act_prologue,
 	.gen_ld_abs		= bpf_gen_ld_abs,
 	.check_kfunc_call	= bpf_prog_test_check_kfunc_call,
+	.is_acquire_kfunc	= bpf_prog_test_is_acquire_kfunc,
+	.is_release_kfunc	= bpf_prog_test_is_release_kfunc,
+	.is_kfunc_ret_type_null = bpf_prog_test_is_kfunc_ret_type_null,
 };
 
 const struct bpf_prog_ops tc_cls_act_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+static bool xdp_check_kfunc_call(u32 kfunc_id, struct module *owner)
+{
+	return bpf_check_mod_kfunc_call(&xdp_kfunc_list, kfunc_id, owner);
+}
+
+static bool xdp_is_acquire_kfunc(u32 kfunc_id, struct module *owner)
+{
+	return bpf_is_mod_acquire_kfunc(&xdp_kfunc_list, kfunc_id, owner);
+}
+
+static bool xdp_is_release_kfunc(u32 kfunc_id, struct module *owner)
+{
+	return bpf_is_mod_release_kfunc(&xdp_kfunc_list, kfunc_id, owner);
+}
+
+static bool xdp_is_kfunc_ret_type_null(u32 kfunc_id, struct module *owner)
+{
+	return bpf_is_mod_kfunc_ret_type_null(&xdp_kfunc_list, kfunc_id, owner);
+}
+
 const struct bpf_verifier_ops xdp_verifier_ops = {
 	.get_func_proto		= xdp_func_proto,
 	.is_valid_access	= xdp_is_valid_access,
 	.convert_ctx_access	= xdp_convert_ctx_access,
 	.gen_prologue		= bpf_noop_prologue,
+	.check_kfunc_call	= xdp_check_kfunc_call,
+	.is_acquire_kfunc	= xdp_is_acquire_kfunc,
+	.is_release_kfunc	= xdp_is_release_kfunc,
+	.is_kfunc_ret_type_null = xdp_is_kfunc_ret_type_null,
 };
 
 const struct bpf_prog_ops xdp_prog_ops = {
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 202fa5eacd0f..7b4bfe793002 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -299,6 +299,7 @@ struct net *get_net_ns_by_id(const struct net *net, int id)
 
 	return peer;
 }
+EXPORT_SYMBOL_GPL(get_net_ns_by_id);
 
 /*
  * setup_net runs the initializers for the network namespace object.
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 770a63103c7a..85042cb6f82e 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -11,6 +11,9 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
 #include <linux/types.h>
 #include <linux/netfilter.h>
 #include <linux/module.h>
@@ -2451,6 +2454,249 @@ static int kill_all(struct nf_conn *i, void *data)
 	return net_eq(nf_ct_net(i), data);
 }
 
+/* Unstable Kernel Helpers for XDP and TC-BPF hook
+ *
+ * These are called from the XDP and SCHED_CLS BPF programs. Note that it is
+ * allowed to break compatibility for these functions since the interface they
+ * are exposed through to BPF programs is explicitly unstable.
+ */
+static struct nf_conn *__bpf_nf_ct_lookup(struct net *net,
+					  struct bpf_sock_tuple *bpf_tuple,
+					  u32 tuple_len, u8 protonum,
+					  s32 netns_id)
+{
+	struct nf_conntrack_tuple_hash *hash;
+	struct nf_conntrack_tuple tuple;
+
+	if (unlikely(protonum != IPPROTO_TCP && protonum != IPPROTO_UDP))
+		return ERR_PTR(-EPROTO);
+	if (unlikely(netns_id < BPF_F_CURRENT_NETNS))
+		return ERR_PTR(-EINVAL);
+
+	memset(&tuple, 0, sizeof(tuple));
+	switch (tuple_len) {
+	case sizeof(bpf_tuple->ipv4):
+		tuple.src.l3num = AF_INET;
+		tuple.src.u3.ip = bpf_tuple->ipv4.saddr;
+		tuple.src.u.tcp.port = bpf_tuple->ipv4.sport;
+		tuple.dst.u3.ip = bpf_tuple->ipv4.daddr;
+		tuple.dst.u.tcp.port = bpf_tuple->ipv4.dport;
+		break;
+	case sizeof(bpf_tuple->ipv6):
+		tuple.src.l3num = AF_INET6;
+		memcpy(tuple.src.u3.ip6, bpf_tuple->ipv6.saddr, sizeof(bpf_tuple->ipv6.saddr));
+		tuple.src.u.tcp.port = bpf_tuple->ipv6.sport;
+		memcpy(tuple.dst.u3.ip6, bpf_tuple->ipv6.daddr, sizeof(bpf_tuple->ipv6.daddr));
+		tuple.dst.u.tcp.port = bpf_tuple->ipv6.dport;
+		break;
+	default:
+		return ERR_PTR(-EAFNOSUPPORT);
+	}
+
+	tuple.dst.protonum = protonum;
+
+	if (netns_id >= 0) {
+		net = get_net_ns_by_id(net, netns_id);
+		if (unlikely(!net))
+			return ERR_PTR(-ENONET);
+	}
+
+	hash = nf_conntrack_find_get(net, &nf_ct_zone_dflt, &tuple);
+	if (netns_id >= 0)
+		put_net(net);
+	if (!hash)
+		return ERR_PTR(-ENOENT);
+	return nf_ct_tuplehash_to_ctrack(hash);
+}
+
+/* bpf_ct_opts - Options for CT lookup helpers
+ *
+ * Members:
+ * @error      - Out parameter, set for any errors encountered
+ *		 Values:
+ *		   -EINVAL - Passed NULL for bpf_tuple pointer
+ *		   -EINVAL - opts->reserved is not 0
+ *		   -EINVAL - netns_id is less than -1
+ *		   -EINVAL - len__opts isn't NF_BPF_CT_OPTS_SZ (12)
+ *		   -EPROTO - l4proto isn't one of IPPROTO_TCP or IPPROTO_UDP
+ *		   -ENONET - No network namespace found for netns_id
+ *		   -ENOENT - Conntrack lookup could not find entry for tuple
+ *		   -EAFNOSUPPORT - len__tuple isn't one of sizeof(tuple->ipv4)
+ *				   or sizeof(tuple->ipv6)
+ * @l4proto    - Layer 4 protocol
+ *		 Values:
+ *		   IPPROTO_TCP, IPPROTO_UDP
+ * @reserved   - Reserved member, will be reused for more options in future
+ *		 Values:
+ *		   0
+ * @netns_id   - Specify the network namespace for lookup
+ *		 Values:
+ *		   BPF_F_CURRENT_NETNS (-1)
+ *		     Use namespace associated with ctx (xdp_md, __sk_buff)
+ *		   [0, S32_MAX]
+ *		     Network Namespace ID
+ */
+struct bpf_ct_opts {
+	s32 netns_id;
+	s32 error;
+	u8 l4proto;
+	u8 reserved[3];
+};
+
+enum {
+	NF_BPF_CT_OPTS_SZ = 12,
+};
+
+/* bpf_xdp_ct_lookup - Lookup CT entry for the given tuple, and acquire a
+ *		       reference to it
+ *
+ * Parameters:
+ * @xdp_ctx	- Pointer to ctx (xdp_md) in XDP program
+ *		    Cannot be NULL
+ * @bpf_tuple	- Pointer to memory representing the tuple to look up
+ *		    Cannot be NULL
+ * @len__tuple	- Length of the tuple structure
+ *		    Must be one of sizeof(bpf_tuple->ipv4) or
+ *		    sizeof(bpf_tuple->ipv6)
+ * @opts	- Additional options for lookup (documented above)
+ *		    Cannot be NULL
+ * @len__opts	- Length of the bpf_ct_opts structure
+ *		    Must be NF_BPF_CT_OPTS_SZ (12)
+ */
+struct nf_conn *bpf_xdp_ct_lookup(struct xdp_md *xdp_ctx,
+				  struct bpf_sock_tuple *bpf_tuple,
+				  u32 len__tuple, struct bpf_ct_opts *opts,
+				  u32 len__opts)
+{
+	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+	struct net *caller_net;
+	struct nf_conn *nfct;
+
+	BUILD_BUG_ON(sizeof(struct bpf_ct_opts) != NF_BPF_CT_OPTS_SZ);
+
+	if (!opts)
+		return NULL;
+	if (!bpf_tuple || opts->reserved[0] || opts->reserved[1] ||
+	    opts->reserved[2] || len__opts != NF_BPF_CT_OPTS_SZ) {
+		opts->error = -EINVAL;
+		return NULL;
+	}
+	caller_net = dev_net(ctx->rxq->dev);
+	nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, len__tuple, opts->l4proto,
+				  opts->netns_id);
+	if (IS_ERR(nfct)) {
+		opts->error = PTR_ERR(nfct);
+		return NULL;
+	}
+	return nfct;
+}
+
+/* bpf_skb_ct_lookup - Lookup CT entry for the given tuple, and acquire a
+ *		       reference to it
+ *
+ * Parameters:
+ * @skb_ctx	- Pointer to ctx (__sk_buff) in TC program
+ *		    Cannot be NULL
+ * @bpf_tuple	- Pointer to memory representing the tuple to look up
+ *		    Cannot be NULL
+ * @len__tuple	- Length of the tuple structure
+ *		    Must be one of sizeof(bpf_tuple->ipv4) or
+ *		    sizeof(bpf_tuple->ipv6)
+ * @opts	- Additional options for lookup (documented above)
+ *		    Cannot be NULL
+ * @len__opts	- Length of the bpf_ct_opts structure
+ *		    Must be NF_BPF_CT_OPTS_SZ (12)
+ */
+struct nf_conn *bpf_skb_ct_lookup(struct __sk_buff *skb_ctx,
+				  struct bpf_sock_tuple *bpf_tuple,
+				  u32 len__tuple, struct bpf_ct_opts *opts,
+				  u32 len__opts)
+{
+	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+	struct net *caller_net;
+	struct nf_conn *nfct;
+
+	BUILD_BUG_ON(sizeof(struct bpf_ct_opts) != NF_BPF_CT_OPTS_SZ);
+
+	if (!opts)
+		return NULL;
+	if (!bpf_tuple || opts->reserved[0] || opts->reserved[1] ||
+	    opts->reserved[2] || len__opts != NF_BPF_CT_OPTS_SZ) {
+		opts->error = -EINVAL;
+		return NULL;
+	}
+	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+	nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, len__tuple, opts->l4proto,
+				  opts->netns_id);
+	if (IS_ERR(nfct)) {
+		opts->error = PTR_ERR(nfct);
+		return NULL;
+	}
+	return nfct;
+}
+
+/* bpf_ct_release - Release acquired nf_conn object
+ *
+ * This must be invoked for referenced PTR_TO_BTF_ID, and the verifier rejects
+ * the program if any references remain in the program in all of the explored
+ * states.
+ *
+ * Parameters:
+ * @nf_conn	 - Pointer to referenced nf_conn object, obtained using
+ *		   bpf_xdp_ct_lookup or bpf_skb_ct_lookup.
+ */
+void bpf_ct_release(struct nf_conn *nfct)
+{
+	if (!nfct)
+		return;
+	nf_ct_put(nfct);
+}
+
+/* kfuncs that may return NULL PTR_TO_BTF_ID */
+BTF_SET_START(nf_conntrack_null_ids)
+BTF_ID(func, bpf_xdp_ct_lookup)
+BTF_ID(func, bpf_skb_ct_lookup)
+BTF_SET_END(nf_conntrack_null_ids)
+
+/* XDP hook allowed kfuncs */
+BTF_SET_START(nf_conntrack_xdp_ids)
+BTF_ID(func, bpf_xdp_ct_lookup)
+BTF_ID(func, bpf_ct_release)
+BTF_SET_END(nf_conntrack_xdp_ids)
+
+/* TC-BPF hook allowed kfuncs */
+BTF_SET_START(nf_conntrack_skb_ids)
+BTF_ID(func, bpf_skb_ct_lookup)
+BTF_ID(func, bpf_ct_release)
+BTF_SET_END(nf_conntrack_skb_ids)
+
+/* XDP and TC-BPF hook acquire kfuncs */
+BTF_SET_START(nf_conntrack_acq_ids)
+BTF_ID(func, bpf_xdp_ct_lookup)
+BTF_ID(func, bpf_skb_ct_lookup)
+BTF_SET_END(nf_conntrack_acq_ids)
+
+/* XDP and TC-BPF hook release kfuncs */
+BTF_SET_START(nf_conntrack_rel_ids)
+BTF_ID(func, bpf_ct_release)
+BTF_SET_END(nf_conntrack_rel_ids)
+
+static struct kfunc_btf_id_set nf_ct_xdp_kfunc_set = {
+	.owner                 = THIS_MODULE,
+	.set                   = &nf_conntrack_xdp_ids,
+	.acq_set	       = &nf_conntrack_acq_ids,
+	.rel_set	       = &nf_conntrack_rel_ids,
+	.null_set	       = &nf_conntrack_null_ids,
+};
+
+static struct kfunc_btf_id_set nf_ct_skb_kfunc_set = {
+	.owner                 = THIS_MODULE,
+	.set                   = &nf_conntrack_skb_ids,
+	.acq_set	       = &nf_conntrack_acq_ids,
+	.rel_set	       = &nf_conntrack_rel_ids,
+	.null_set	       = &nf_conntrack_null_ids,
+};
+
 void nf_conntrack_cleanup_start(void)
 {
 	conntrack_gc_work.exiting = true;
@@ -2459,6 +2705,9 @@ void nf_conntrack_cleanup_start(void)
 
 void nf_conntrack_cleanup_end(void)
 {
+	unregister_kfunc_btf_id_set(&xdp_kfunc_list, &nf_ct_xdp_kfunc_set);
+	unregister_kfunc_btf_id_set(&prog_test_kfunc_list, &nf_ct_skb_kfunc_set);
+
 	RCU_INIT_POINTER(nf_ct_hook, NULL);
 	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
 	kvfree(nf_conntrack_hash);
@@ -2745,6 +2994,9 @@ int nf_conntrack_init_start(void)
 	conntrack_gc_work_init(&conntrack_gc_work);
 	queue_delayed_work(system_power_efficient_wq, &conntrack_gc_work.dwork, HZ);
 
+	register_kfunc_btf_id_set(&prog_test_kfunc_list, &nf_ct_skb_kfunc_set);
+	register_kfunc_btf_id_set(&xdp_kfunc_list, &nf_ct_xdp_kfunc_set);
+
 	return 0;
 
 err_proto:
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 5/9] bpf: Add reference tracking support to kfunc
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211209170929.3485242-1-memxor@gmail.com>

This patch adds verifier support for PTR_TO_BTF_ID return type of kfunc
to be a reference, by reusing acquire_reference_state/release_reference
support for existing in-kernel bpf helpers.

Verifier ops struct is extended with three callbacks:

- is_acquire_kfunc
  Return true if kfunc_btf_id, module pair is an acquire kfunc.  This
  will acquire_reference_state for the returned PTR_TO_BTF_ID (this is
  the only allow return value). Note that acquire kfunc must always
  return a PTR_TO_BTF_ID{_OR_NULL}, otherwise the program is rejected.

- is_release_kfunc
  Return true if kfunc_btf_id, module pair is a release kfunc.  This
  will release the reference to the passed in PTR_TO_BTF_ID which has a
  reference state (from earlier acquire kfunc).
  The btf_check_func_arg_match returns the regno (of argument register,
  hence > 0) if the kfunc is a release kfunc, and a proper referenced
  PTR_TO_BTF_ID is being passed to it.
  This is similar to how helper call check uses bpf_call_arg_meta to
  store the ref_obj_id that is later used to release the reference.
  Similar to in-kernel helper, we only allow passing one referenced
  PTR_TO_BTF_ID as an argument. It can also be passed in to normal
  kfunc, but in case of release kfunc there must always be one
  PTR_TO_BTF_ID argument that is referenced.

- is_kfunc_ret_type_null
  For kfunc returning PTR_TO_BTF_ID, tells if it can be NULL, hence
  force caller to mark the pointer not null (using check) before
  accessing it. Note that taking into account the case fixed by commit
  93c230e3f5bd6, we assign a non-zero id for mark_ptr_or_null_reg logic.
  Later, if more return types are supported by kfunc, which have a
  _OR_NULL variant, it might be better to move this id generation under
  a common reg_type_may_be_null check, similar to the case in the
  commit.

Later patches will implement these callbacks.

Referenced PTR_TO_BTF_ID is currently only limited to kfunc, but can be
extended in the future to other BPF helpers as well.  For now, we can
rely on the btf_struct_ids_match check to ensure we get the pointer to
the expected struct type. In the future, care needs to be taken to avoid
ambiguity for reference PTR_TO_BTF_ID passed to release function, in
case multiple candidates can release same BTF ID.

e.g. there might be two release kfuncs (or kfunc and helper):

foo(struct abc *p);
bar(struct abc *p);

... such that both release a PTR_TO_BTF_ID with btf_id of struct abc. In
this case we would need to track the acquire function corresponding to
the release function to avoid type confusion, and store this information
in the register state so that an incorrect program can be rejected. This
is not a problem right now, hence it is left as an exercise for the
future patch introducing such a case in the kernel.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h   |  6 ++++-
 kernel/bpf/btf.c      | 42 ++++++++++++++++++++++++++++-----
 kernel/bpf/verifier.c | 54 +++++++++++++++++++++++++++++++++++++------
 3 files changed, 88 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8bbf08fbab66..6a14072c72a0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -521,6 +521,9 @@ struct bpf_verifier_ops {
 				 enum bpf_access_type atype,
 				 u32 *next_btf_id);
 	bool (*check_kfunc_call)(u32 kfunc_btf_id, struct module *owner);
+	bool (*is_acquire_kfunc)(u32 kfunc_btf_id, struct module *owner);
+	bool (*is_release_kfunc)(u32 kfunc_btf_id, struct module *owner);
+	bool (*is_kfunc_ret_type_null)(u32 kfunc_btf_id, struct module *owner);
 };
 
 struct bpf_prog_offload_ops {
@@ -1717,7 +1720,8 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
 				struct bpf_reg_state *regs);
 int btf_check_kfunc_arg_match(struct bpf_verifier_env *env,
 			      const struct btf *btf, u32 func_id,
-			      struct bpf_reg_state *regs);
+			      struct bpf_reg_state *regs,
+			      struct module *btf_mod);
 int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog,
 			  struct bpf_reg_state *reg);
 int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog *prog,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index df9a3f77fc4a..a790ba6d93d8 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5640,14 +5640,17 @@ static bool is_kfunc_arg_mem_size(const struct btf *btf,
 static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 				    const struct btf *btf, u32 func_id,
 				    struct bpf_reg_state *regs,
-				    bool ptr_to_mem_ok)
+				    bool ptr_to_mem_ok,
+				    struct module *btf_mod)
 {
 	struct bpf_verifier_log *log = &env->log;
+	u32 i, nargs, ref_id, ref_obj_id = 0;
 	bool is_kfunc = btf_is_kernel(btf);
 	const char *func_name, *ref_tname;
 	const struct btf_type *t, *ref_t;
 	const struct btf_param *args;
-	u32 i, nargs, ref_id;
+	int ref_regno = 0;
+	bool rel = false;
 
 	t = btf_type_by_id(btf, func_id);
 	if (!t || !btf_type_is_func(t)) {
@@ -5725,6 +5728,16 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 			if (reg->type == PTR_TO_BTF_ID) {
 				reg_btf = reg->btf;
 				reg_ref_id = reg->btf_id;
+				/* Ensure only one argument is referenced PTR_TO_BTF_ID */
+				if (reg->ref_obj_id) {
+					if (ref_obj_id) {
+						bpf_log(log, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n",
+							regno, reg->ref_obj_id, ref_obj_id);
+						return -EFAULT;
+					}
+					ref_regno = regno;
+					ref_obj_id = reg->ref_obj_id;
+				}
 			} else {
 				reg_btf = btf_vmlinux;
 				reg_ref_id = *reg2btf_ids[reg->type];
@@ -5794,7 +5807,23 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 		}
 	}
 
-	return 0;
+	/* Either both are set, or neither */
+	WARN_ON_ONCE((ref_obj_id && !ref_regno) || (!ref_obj_id && ref_regno));
+	if (is_kfunc) {
+		rel = env->ops->is_release_kfunc &&
+			env->ops->is_release_kfunc(func_id, btf_mod);
+		/* We already made sure ref_obj_id is set only for one argument */
+		if (rel && !ref_obj_id) {
+			bpf_log(log, "release kernel function %s expects refcounted PTR_TO_BTF_ID\n",
+				func_name);
+			return -EINVAL;
+		}
+		/* Allow (!rel && ref_obj_id), so that passing such referenced PTR_TO_BTF_ID to
+		 * other kfuncs works
+		 */
+	}
+	/* returns argument register number > 0 in case of reference release kfunc */
+	return rel ? ref_regno : 0;
 }
 
 /* Compare BTF of a function with given bpf_reg_state.
@@ -5824,7 +5853,7 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
 		return -EINVAL;
 
 	is_global = prog->aux->func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL;
-	err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global);
+	err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, NULL);
 
 	/* Compiler optimizations can remove arguments from static functions
 	 * or mismatched type can be passed into a global function.
@@ -5837,9 +5866,10 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
 
 int btf_check_kfunc_arg_match(struct bpf_verifier_env *env,
 			      const struct btf *btf, u32 func_id,
-			      struct bpf_reg_state *regs)
+			      struct bpf_reg_state *regs,
+			      struct module *btf_mod)
 {
-	return btf_check_func_arg_match(env, btf, func_id, regs, true);
+	return btf_check_func_arg_match(env, btf, func_id, regs, true, btf_mod);
 }
 
 /* Convert BTF of a function into bpf_reg_state if possible
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 074a78a0efa4..b8685fb7ff15 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -466,7 +466,9 @@ static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
 		type == PTR_TO_TCP_SOCK ||
 		type == PTR_TO_TCP_SOCK_OR_NULL ||
 		type == PTR_TO_MEM ||
-		type == PTR_TO_MEM_OR_NULL;
+		type == PTR_TO_MEM_OR_NULL ||
+		type == PTR_TO_BTF_ID ||
+		type == PTR_TO_BTF_ID_OR_NULL;
 }
 
 static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
@@ -6753,16 +6755,18 @@ static void mark_btf_func_reg_size(struct bpf_verifier_env *env, u32 regno,
 	}
 }
 
-static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn)
+static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
+			    int *insn_idx_p)
 {
 	const struct btf_type *t, *func, *func_proto, *ptr_type;
 	struct bpf_reg_state *regs = cur_regs(env);
 	const char *func_name, *ptr_type_name;
 	u32 i, nargs, func_id, ptr_type_id;
+	int err, insn_idx = *insn_idx_p;
 	struct module *btf_mod = NULL;
 	const struct btf_param *args;
 	struct btf *desc_btf;
-	int err;
+	bool acq;
 
 	/* skip for now, but return error when we find this in fixup_kfunc_call */
 	if (!insn->imm)
@@ -6784,16 +6788,37 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn)
 		return -EACCES;
 	}
 
+	/* Is this an acquire kfunc? */
+	acq = env->ops->is_acquire_kfunc &&
+		env->ops->is_acquire_kfunc(func_id, btf_mod);
+
 	/* Check the arguments */
-	err = btf_check_kfunc_arg_match(env, desc_btf, func_id, regs);
-	if (err)
+	err = btf_check_kfunc_arg_match(env, desc_btf, func_id, regs, btf_mod);
+	if (err < 0)
 		return err;
+	/* In case of release function, we get register number of refcounted
+	 * PTR_TO_BTF_ID back from btf_check_kfunc_arg_match, do the release now
+	 */
+	if (err) {
+		err = release_reference(env, regs[err].ref_obj_id);
+		if (err) {
+			verbose(env, "kfunc %s#%d reference has not been acquired before\n",
+				func_name, func_id);
+			return err;
+		}
+	}
 
 	for (i = 0; i < CALLER_SAVED_REGS; i++)
 		mark_reg_not_init(env, regs, caller_saved[i]);
 
 	/* Check return type */
 	t = btf_type_skip_modifiers(desc_btf, func_proto->type, NULL);
+
+	if (acq && !btf_type_is_ptr(t)) {
+		verbose(env, "acquire kernel function does not return PTR_TO_BTF_ID\n");
+		return -EINVAL;
+	}
+
 	if (btf_type_is_scalar(t)) {
 		mark_reg_unknown(env, regs, BPF_REG_0);
 		mark_btf_func_reg_size(env, BPF_REG_0, t->size);
@@ -6808,11 +6833,26 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn)
 				ptr_type_name);
 			return -EINVAL;
 		}
+		if (env->ops->is_kfunc_ret_type_null &&
+		    env->ops->is_kfunc_ret_type_null(func_id, btf_mod)) {
+			regs[BPF_REG_0].type = PTR_TO_BTF_ID_OR_NULL;
+			/* For mark_ptr_or_null_reg, see 93c230e3f5bd6 */
+			regs[BPF_REG_0].id = ++env->id_gen;
+		} else {
+			regs[BPF_REG_0].type = PTR_TO_BTF_ID;
+		}
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].btf = desc_btf;
-		regs[BPF_REG_0].type = PTR_TO_BTF_ID;
 		regs[BPF_REG_0].btf_id = ptr_type_id;
 		mark_btf_func_reg_size(env, BPF_REG_0, sizeof(void *));
+		if (acq) {
+			int id = acquire_reference_state(env, insn_idx);
+
+			if (id < 0)
+				return id;
+			regs[BPF_REG_0].id = id;
+			regs[BPF_REG_0].ref_obj_id = id;
+		}
 	} /* else { add_kfunc_call() ensures it is btf_type_is_void(t) } */
 
 	nargs = btf_type_vlen(func_proto);
@@ -11457,7 +11497,7 @@ static int do_check(struct bpf_verifier_env *env)
 				if (insn->src_reg == BPF_PSEUDO_CALL)
 					err = check_func_call(env, insn, &env->insn_idx);
 				else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL)
-					err = check_kfunc_call(env, insn);
+					err = check_kfunc_call(env, insn, &env->insn_idx);
 				else
 					err = check_helper_call(env, insn, &env->insn_idx);
 				if (err)
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 6/9] bpf: Track provenance for pointers formed from referenced PTR_TO_BTF_ID
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211209170929.3485242-1-memxor@gmail.com>

In the previous commit, we implemented support in the verifier for
working with referenced PTR_TO_BTF_ID registers. These are invalidated
when their corresponding release function is called.

However, PTR_TO_BTF_ID is a bit special, in that distinct PTR_TO_BTF_ID
can be formed by walking pointers in the struct represented by the BTF
ID. mark_btf_ld_reg will copy the relevant register state to the
destination register.

However, we cannot simply copy ref_obj_id (such that
release_reg_references will match and invalidate all pointers formed by
pointer walking), as we obtain the same BTF ID in the destination
register, leading to confusion during release. An example is show below:

For a type like so:
struct foo { struct foo *next; };

r1 = acquire(...); // BTF ID of struct foo
if (r1) {
	r2 = r1->next; // BTF ID of struct foo, and we copied ref_obj_id in
		       // mark_btf_ld_reg.
	release(r2);
}

With this logic, the above snippet succeeds. Hence we need to
distinguish the canonical reference and pointers formed from it.

We introduce a 'parent_ref_obj_id' member in bpf_reg_state, for a
referenced register, only one of ref_obj_id or parent_ref_obj_id may be
set, i.e. either a register holds a canonical reference, or it is
related to a canonical reference for invalidation purposes (contains an
edge pointing to it by way of having the same ref_obj_id in
parent_ref_obj_id, in the graph of objects).

When releasing reference, we ensure that both are not set at once, and
then release if either of them match the requested ref_obj_id to be
released. This ensures that the example given above will not succeed.
A test to this end has been added in later patches.

Typically, kernel objects have a nested object lifetime (where the
parent object 'owns' the objects it holds references to). However, this
is not always true. For now, we don't need support to hold on to
references to objects obtained from a refcounted PTR_TO_BTF_ID after its
release, but this can be relaxed on a case by case basis (i.e. based on
the BTF ID and program type/attach type) in the future.

The safest assumption for the verifier to make in absence of any other
hints, is that all such pointers formed from refcounted PTR_TO_BTF_ID
shall be invalidated.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf_verifier.h | 10 +++++++
 kernel/bpf/verifier.c        | 54 ++++++++++++++++++++++++++++--------
 2 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index b80fe5bf2a02..a6ef11db6823 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -128,6 +128,16 @@ struct bpf_reg_state {
 	 * allowed and has the same effect as bpf_sk_release(sk).
 	 */
 	u32 ref_obj_id;
+	/* This is set for pointers which are derived from referenced
+	 * pointer (e.g. PTR_TO_BTF_ID pointer walking), so that the
+	 * pointers obtained by walking referenced PTR_TO_BTF_ID
+	 * are appropriately invalidated when the lifetime of their
+	 * parent object ends.
+	 *
+	 * Only one of ref_obj_id and parent_ref_obj_id can be set,
+	 * never both at once.
+	 */
+	u32 parent_ref_obj_id;
 	/* For scalar types (SCALAR_VALUE), this represents our knowledge of
 	 * the actual value.
 	 * For pointer types, this represents the variable part of the offset
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b8685fb7ff15..16e30d7f631b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -654,7 +654,8 @@ static void print_verifier_state(struct bpf_verifier_env *env,
 				verbose(env, "%s", kernel_type_name(reg->btf, reg->btf_id));
 			verbose(env, "(id=%d", reg->id);
 			if (reg_type_may_be_refcounted_or_null(t))
-				verbose(env, ",ref_obj_id=%d", reg->ref_obj_id);
+				verbose(env, ",%sref_obj_id=%d", reg->ref_obj_id ? "" : "parent_",
+					reg->ref_obj_id ?: reg->parent_ref_obj_id);
 			if (t != SCALAR_VALUE)
 				verbose(env, ",off=%d", reg->off);
 			if (type_is_pkt_pointer(t))
@@ -1500,7 +1501,8 @@ static void mark_reg_not_init(struct bpf_verifier_env *env,
 static void mark_btf_ld_reg(struct bpf_verifier_env *env,
 			    struct bpf_reg_state *regs, u32 regno,
 			    enum bpf_reg_type reg_type,
-			    struct btf *btf, u32 btf_id)
+			    struct btf *btf, u32 btf_id,
+			    u32 parent_ref_obj_id)
 {
 	if (reg_type == SCALAR_VALUE) {
 		mark_reg_unknown(env, regs, regno);
@@ -1509,6 +1511,7 @@ static void mark_btf_ld_reg(struct bpf_verifier_env *env,
 	mark_reg_known_zero(env, regs, regno);
 	regs[regno].type = PTR_TO_BTF_ID;
 	regs[regno].btf = btf;
+	regs[regno].parent_ref_obj_id = parent_ref_obj_id;
 	regs[regno].btf_id = btf_id;
 }
 
@@ -4136,8 +4139,14 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 	if (ret < 0)
 		return ret;
 
-	if (atype == BPF_READ && value_regno >= 0)
-		mark_btf_ld_reg(env, regs, value_regno, ret, reg->btf, btf_id);
+	if (atype == BPF_READ && value_regno >= 0) {
+		if (WARN_ON_ONCE(reg->ref_obj_id && reg->parent_ref_obj_id)) {
+			verbose(env, "verifier internal error: both ref and parent ref set\n");
+			return -EACCES;
+		}
+		mark_btf_ld_reg(env, regs, value_regno, ret, reg->btf, btf_id,
+				reg->ref_obj_id ?: reg->parent_ref_obj_id);
+	}
 
 	return 0;
 }
@@ -4191,8 +4200,14 @@ static int check_ptr_to_map_access(struct bpf_verifier_env *env,
 	if (ret < 0)
 		return ret;
 
-	if (value_regno >= 0)
-		mark_btf_ld_reg(env, regs, value_regno, ret, btf_vmlinux, btf_id);
+	if (value_regno >= 0) {
+		if (WARN_ON_ONCE(reg->ref_obj_id && reg->parent_ref_obj_id)) {
+			verbose(env, "verifier internal error: both ref and parent ref set\n");
+			return -EACCES;
+		}
+		mark_btf_ld_reg(env, regs, value_regno, ret, btf_vmlinux, btf_id,
+				reg->ref_obj_id ?: reg->parent_ref_obj_id);
+	}
 
 	return 0;
 }
@@ -5865,23 +5880,35 @@ static void mark_pkt_end(struct bpf_verifier_state *vstate, int regn, bool range
 		reg->range = AT_PKT_END;
 }
 
-static void release_reg_references(struct bpf_verifier_env *env,
+static int release_reg_references(struct bpf_verifier_env *env,
 				   struct bpf_func_state *state,
 				   int ref_obj_id)
 {
 	struct bpf_reg_state *regs = state->regs, *reg;
 	int i;
 
-	for (i = 0; i < MAX_BPF_REG; i++)
-		if (regs[i].ref_obj_id == ref_obj_id)
+	for (i = 0; i < MAX_BPF_REG; i++) {
+		if (WARN_ON_ONCE(regs[i].ref_obj_id && regs[i].parent_ref_obj_id)) {
+			verbose(env, "verifier internal error: both ref and parent ref set\n");
+			return -EACCES;
+		}
+		if (regs[i].ref_obj_id == ref_obj_id ||
+		    regs[i].parent_ref_obj_id == ref_obj_id)
 			mark_reg_unknown(env, regs, i);
+	}
 
 	bpf_for_each_spilled_reg(i, state, reg) {
 		if (!reg)
 			continue;
-		if (reg->ref_obj_id == ref_obj_id)
+		if (WARN_ON_ONCE(reg->ref_obj_id && reg->parent_ref_obj_id)) {
+			verbose(env, "verifier internal error: both ref and parent ref set\n");
+			return -EACCES;
+		}
+		if (reg->ref_obj_id == ref_obj_id ||
+		    reg->parent_ref_obj_id == ref_obj_id)
 			__mark_reg_unknown(env, reg);
 	}
+	return 0;
 }
 
 /* The pointer with the specified id has released its reference to kernel
@@ -5898,8 +5925,11 @@ static int release_reference(struct bpf_verifier_env *env,
 	if (err)
 		return err;
 
-	for (i = 0; i <= vstate->curframe; i++)
-		release_reg_references(env, vstate->frame[i], ref_obj_id);
+	for (i = 0; i <= vstate->curframe; i++) {
+		err = release_reg_references(env, vstate->frame[i], ref_obj_id);
+		if (err)
+			return err;
+	}
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 4/9] bpf: Introduce mem, size argument pair support for kfunc
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211209170929.3485242-1-memxor@gmail.com>

BPF helpers can associate two adjacent arguments together to pass memory
of certain size, using ARG_PTR_TO_MEM and ARG_CONST_SIZE arguments.
Since we don't use bpf_func_proto for kfunc, we need to leverage BTF to
implement similar support.

The ARG_CONST_SIZE processing for helpers is refactored into a common
check_mem_size_reg helper that is shared with kfunc as well. kfunc
ptr_to_mem support follows logic similar to global functions, where
verification is done as if pointer is not null, even when it may be
null.

This leads to a simple to follow rule for writing kfunc: always check
the argument pointer for NULL, except when it is PTR_TO_CTX.

Currently, we require the size argument to be prefixed with "len__" in
the parameter name. This information is then recorded in kernel BTF and
verified during function argument checking. In the future we can use BTF
tagging instead, and modify the kernel function definitions. This will
be a purely kernel-side change.

This allows us to have some form of backwards compatibility for
structures that are passed in to the kernel function with their size,
and allow variable length structures to be passed in if they are
accompanied by a size parameter.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf_verifier.h |   2 +
 kernel/bpf/btf.c             |  41 +++++++++++-
 kernel/bpf/verifier.c        | 124 ++++++++++++++++++++++-------------
 3 files changed, 119 insertions(+), 48 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 182b16a91084..b80fe5bf2a02 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -507,6 +507,8 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt);
 
 int check_ctx_reg(struct bpf_verifier_env *env,
 		  const struct bpf_reg_state *reg, int regno);
+int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
+			     u32 regno);
 int check_mem_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
 		   u32 regno, u32 mem_size);
 
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 63b22ff73550..df9a3f77fc4a 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5618,6 +5618,25 @@ static bool __btf_type_is_scalar_struct(struct bpf_verifier_log *log,
 	return true;
 }
 
+static bool is_kfunc_arg_mem_size(const struct btf *btf,
+				  const struct btf_param *arg,
+				  const struct bpf_reg_state *reg)
+{
+	const struct btf_type *t;
+	const char *param_name;
+
+	t = btf_type_skip_modifiers(btf, arg->type, NULL);
+	if (!btf_type_is_scalar(t) || reg->type != SCALAR_VALUE)
+		return false;
+
+	/* In the future, this can be ported to use BTF tagging */
+	param_name = btf_name_by_offset(btf, arg->name_off);
+	if (strncmp(param_name, "len__", sizeof("len__") - 1))
+		return false;
+
+	return true;
+}
+
 static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 				    const struct btf *btf, u32 func_id,
 				    struct bpf_reg_state *regs,
@@ -5729,16 +5748,32 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 			u32 type_size;
 
 			if (is_kfunc) {
+				bool arg_mem_size = i + 1 < nargs && is_kfunc_arg_mem_size(btf, &args[i + 1], &regs[regno + 1]);
+
 				/* Permit pointer to mem, but only when argument
 				 * type is pointer to scalar, or struct composed
 				 * (recursively) of scalars.
+				 * When arg_mem_size is true, the pointer can be
+				 * void *.
 				 */
-				if (!btf_type_is_scalar(ref_t) && !__btf_type_is_scalar_struct(log, btf, ref_t, 0)) {
+				if (!btf_type_is_scalar(ref_t) && !__btf_type_is_scalar_struct(log, btf, ref_t, 0) &&
+				    (arg_mem_size ? !btf_type_is_void(ref_t) : 1)) {
 					bpf_log(log,
-						"arg#%d pointer type %s %s must point to scalar or struct with scalar\n",
-						i, btf_type_str(ref_t), ref_tname);
+						"arg#%d pointer type %s %s must point to %sscalar, or struct with scalar\n",
+						i, btf_type_str(ref_t), ref_tname, arg_mem_size ? "void, " : "");
 					return -EINVAL;
 				}
+
+				/* Check for mem, len pair */
+				if (arg_mem_size) {
+					if (check_kfunc_mem_size_reg(env, &regs[regno + 1], regno + 1)) {
+						bpf_log(log, "arg#%d arg#%d memory, len pair leads to invalid memory access\n",
+							i, i + 1);
+						return -EINVAL;
+					}
+					i++;
+					continue;
+				}
 			}
 
 			resolve_ret = btf_resolve_size(btf, ref_t, &type_size);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1126b75fe650..074a78a0efa4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4780,6 +4780,62 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 	}
 }
 
+static int check_mem_size_reg(struct bpf_verifier_env *env,
+			      struct bpf_reg_state *reg, u32 regno,
+			      bool zero_size_allowed,
+			      struct bpf_call_arg_meta *meta)
+{
+	int err;
+
+	/* This is used to refine r0 return value bounds for helpers
+	 * that enforce this value as an upper bound on return values.
+	 * See do_refine_retval_range() for helpers that can refine
+	 * the return value. C type of helper is u32 so we pull register
+	 * bound from umax_value however, if negative verifier errors
+	 * out. Only upper bounds can be learned because retval is an
+	 * int type and negative retvals are allowed.
+	 */
+	if (meta)
+		meta->msize_max_value = reg->umax_value;
+
+	/* The register is SCALAR_VALUE; the access check
+	 * happens using its boundaries.
+	 */
+	if (!tnum_is_const(reg->var_off))
+		/* For unprivileged variable accesses, disable raw
+		 * mode so that the program is required to
+		 * initialize all the memory that the helper could
+		 * just partially fill up.
+		 */
+		meta = NULL;
+
+	if (reg->smin_value < 0) {
+		verbose(env, "R%d min value is negative, either use unsigned or 'var &= const'\n",
+			regno);
+		return -EACCES;
+	}
+
+	if (reg->umin_value == 0) {
+		err = check_helper_mem_access(env, regno - 1, 0,
+					      zero_size_allowed,
+					      meta);
+		if (err)
+			return err;
+	}
+
+	if (reg->umax_value >= BPF_MAX_VAR_SIZ) {
+		verbose(env, "R%d unbounded memory access, use 'var &= const' or 'if (var < const)'\n",
+			regno);
+		return -EACCES;
+	}
+	err = check_helper_mem_access(env, regno - 1,
+				      reg->umax_value,
+				      zero_size_allowed, meta);
+	if (!err)
+		err = mark_chain_precision(env, regno);
+	return err;
+}
+
 int check_mem_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
 		   u32 regno, u32 mem_size)
 {
@@ -4803,6 +4859,28 @@ int check_mem_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
 	return check_helper_mem_access(env, regno, mem_size, true, NULL);
 }
 
+int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
+			     u32 regno)
+{
+	struct bpf_reg_state *mem_reg = &cur_regs(env)[regno - 1];
+	bool may_be_null = reg_type_may_be_null(mem_reg->type);
+	struct bpf_reg_state saved_reg;
+	int err;
+
+	WARN_ON_ONCE(regno < BPF_REG_2 || regno > BPF_REG_5);
+
+	if (may_be_null) {
+		saved_reg = *mem_reg;
+		mark_ptr_not_null_reg(mem_reg);
+	}
+
+	err = check_mem_size_reg(env, reg, regno, true, NULL);
+
+	if (may_be_null)
+		*mem_reg = saved_reg;
+	return err;
+}
+
 /* Implementation details:
  * bpf_map_lookup returns PTR_TO_MAP_VALUE_OR_NULL
  * Two bpf_map_lookups (even with the same key) will have different reg->id.
@@ -5316,51 +5394,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 	} else if (arg_type_is_mem_size(arg_type)) {
 		bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
 
-		/* This is used to refine r0 return value bounds for helpers
-		 * that enforce this value as an upper bound on return values.
-		 * See do_refine_retval_range() for helpers that can refine
-		 * the return value. C type of helper is u32 so we pull register
-		 * bound from umax_value however, if negative verifier errors
-		 * out. Only upper bounds can be learned because retval is an
-		 * int type and negative retvals are allowed.
-		 */
-		meta->msize_max_value = reg->umax_value;
-
-		/* The register is SCALAR_VALUE; the access check
-		 * happens using its boundaries.
-		 */
-		if (!tnum_is_const(reg->var_off))
-			/* For unprivileged variable accesses, disable raw
-			 * mode so that the program is required to
-			 * initialize all the memory that the helper could
-			 * just partially fill up.
-			 */
-			meta = NULL;
-
-		if (reg->smin_value < 0) {
-			verbose(env, "R%d min value is negative, either use unsigned or 'var &= const'\n",
-				regno);
-			return -EACCES;
-		}
-
-		if (reg->umin_value == 0) {
-			err = check_helper_mem_access(env, regno - 1, 0,
-						      zero_size_allowed,
-						      meta);
-			if (err)
-				return err;
-		}
-
-		if (reg->umax_value >= BPF_MAX_VAR_SIZ) {
-			verbose(env, "R%d unbounded memory access, use 'var &= const' or 'if (var < const)'\n",
-				regno);
-			return -EACCES;
-		}
-		err = check_helper_mem_access(env, regno - 1,
-					      reg->umax_value,
-					      zero_size_allowed, meta);
-		if (!err)
-			err = mark_chain_precision(env, regno);
+		err = check_mem_size_reg(env, reg, regno, zero_size_allowed, meta);
 	} else if (arg_type_is_alloc_size(arg_type)) {
 		if (!tnum_is_const(reg->var_off)) {
 			verbose(env, "R%d is not a known constant'\n",
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 3/9] bpf: Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211209170929.3485242-1-memxor@gmail.com>

Allow passing PTR_TO_CTX, if the kfunc expects a matching struct type,
and punt to PTR_TO_MEM block if reg->type does not fall in one of
PTR_TO_BTF_ID or PTR_TO_SOCK* types. This will be used by future commits
to get access to XDP and TC PTR_TO_CTX, and pass various data (flags,
tuple, netns_id, etc.) encoded in opts struct as a pointer to the kfunc.

For PTR_TO_MEM support, arguments are currently limited to pointer to
scalar, or pointer to struct composed of scalars. This is done so that
unsafe scenarios (like passing PTR_TO_MEM where PTR_TO_BTF_ID of
in-kernel valid structure is expected, which may have pointers) are
avoided. kfunc argument checking is based on the passed in register type
and limited argument type matching, hence this limitation is imposed. In
the future, support for PTR_TO_MEM for kfunc can be extended to serve
other usecases. struct may have maximum 8 nested structs, all
recursively composed of scalars or struct with scalars.

Future commits will add negative tests that check whether these
restrictions imposed for kfunc arguments are duly rejected by BPF
verifier or not.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/btf.c | 96 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 75 insertions(+), 21 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 450f9e37ceca..63b22ff73550 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5575,12 +5575,56 @@ static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
 #endif
 };
 
+/* Returns true if struct is composed of scalars, 8 levels of nesting allowed */
+static bool __btf_type_is_scalar_struct(struct bpf_verifier_log *log,
+					const struct btf *btf,
+					const struct btf_type *t, int rec)
+{
+	const struct btf_type *member_type;
+	const struct btf_member *member;
+	u16 i;
+
+	if (rec == 8) {
+		bpf_log(log, "max struct nesting depth 8 exceeded\n");
+		return false;
+	}
+
+	if (!btf_type_is_struct(t))
+		return false;
+
+	for_each_member(i, t, member) {
+		const struct btf_array *array;
+
+		member_type = btf_type_skip_modifiers(btf, member->type, NULL);
+
+		if (btf_type_is_struct(member_type)) {
+			if (!__btf_type_is_scalar_struct(log, btf, member_type, rec + 1))
+				return false;
+			continue;
+		}
+		if (btf_type_is_array(member_type)) {
+			array = btf_type_array(member_type);
+			/* FAM */
+			if (!array->nelems)
+				return false;
+			member_type = btf_type_skip_modifiers(btf, array->type, NULL);
+			if (!btf_type_is_scalar(member_type))
+				return false;
+			continue;
+		}
+		if (!btf_type_is_scalar(member_type))
+			return false;
+	}
+	return true;
+}
+
 static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 				    const struct btf *btf, u32 func_id,
 				    struct bpf_reg_state *regs,
 				    bool ptr_to_mem_ok)
 {
 	struct bpf_verifier_log *log = &env->log;
+	bool is_kfunc = btf_is_kernel(btf);
 	const char *func_name, *ref_tname;
 	const struct btf_type *t, *ref_t;
 	const struct btf_param *args;
@@ -5633,7 +5677,20 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 
 		ref_t = btf_type_skip_modifiers(btf, t->type, &ref_id);
 		ref_tname = btf_name_by_offset(btf, ref_t->name_off);
-		if (btf_is_kernel(btf)) {
+		if (btf_get_prog_ctx_type(log, btf, t,
+					  env->prog->type, i)) {
+			/* If function expects ctx type in BTF check that caller
+			 * is passing PTR_TO_CTX.
+			 */
+			if (reg->type != PTR_TO_CTX) {
+				bpf_log(log,
+					"arg#%d expected pointer to ctx, but got %s\n",
+					i, btf_type_str(t));
+				return -EINVAL;
+			}
+			if (check_ctx_reg(env, reg, regno))
+				return -EINVAL;
+		} else if (is_kfunc && (reg->type == PTR_TO_BTF_ID || reg2btf_ids[reg->type])) {
 			const struct btf_type *reg_ref_t;
 			const struct btf *reg_btf;
 			const char *reg_ref_tname;
@@ -5649,14 +5706,9 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 			if (reg->type == PTR_TO_BTF_ID) {
 				reg_btf = reg->btf;
 				reg_ref_id = reg->btf_id;
-			} else if (reg2btf_ids[reg->type]) {
+			} else {
 				reg_btf = btf_vmlinux;
 				reg_ref_id = *reg2btf_ids[reg->type];
-			} else {
-				bpf_log(log, "kernel function %s args#%d expected pointer to %s %s but R%d is not a pointer to btf_id\n",
-					func_name, i,
-					btf_type_str(ref_t), ref_tname, regno);
-				return -EINVAL;
 			}
 
 			reg_ref_t = btf_type_skip_modifiers(reg_btf, reg_ref_id,
@@ -5672,23 +5724,23 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 					reg_ref_tname);
 				return -EINVAL;
 			}
-		} else if (btf_get_prog_ctx_type(log, btf, t,
-						 env->prog->type, i)) {
-			/* If function expects ctx type in BTF check that caller
-			 * is passing PTR_TO_CTX.
-			 */
-			if (reg->type != PTR_TO_CTX) {
-				bpf_log(log,
-					"arg#%d expected pointer to ctx, but got %s\n",
-					i, btf_type_str(t));
-				return -EINVAL;
-			}
-			if (check_ctx_reg(env, reg, regno))
-				return -EINVAL;
 		} else if (ptr_to_mem_ok) {
 			const struct btf_type *resolve_ret;
 			u32 type_size;
 
+			if (is_kfunc) {
+				/* Permit pointer to mem, but only when argument
+				 * type is pointer to scalar, or struct composed
+				 * (recursively) of scalars.
+				 */
+				if (!btf_type_is_scalar(ref_t) && !__btf_type_is_scalar_struct(log, btf, ref_t, 0)) {
+					bpf_log(log,
+						"arg#%d pointer type %s %s must point to scalar or struct with scalar\n",
+						i, btf_type_str(ref_t), ref_tname);
+					return -EINVAL;
+				}
+			}
+
 			resolve_ret = btf_resolve_size(btf, ref_t, &type_size);
 			if (IS_ERR(resolve_ret)) {
 				bpf_log(log,
@@ -5701,6 +5753,8 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
 			if (check_mem_reg(env, reg, regno, type_size))
 				return -EINVAL;
 		} else {
+			bpf_log(log, "reg type unsupported for arg#%d %sfunction %s#%d\n", i,
+				is_kfunc ? "kernel " : "", func_name, func_id);
 			return -EINVAL;
 		}
 	}
@@ -5750,7 +5804,7 @@ int btf_check_kfunc_arg_match(struct bpf_verifier_env *env,
 			      const struct btf *btf, u32 func_id,
 			      struct bpf_reg_state *regs)
 {
-	return btf_check_func_arg_match(env, btf, func_id, regs, false);
+	return btf_check_func_arg_match(env, btf, func_id, regs, true);
 }
 
 /* Convert BTF of a function into bpf_reg_state if possible
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 2/9] bpf: Remove DEFINE_KFUNC_BTF_ID_SET
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211209170929.3485242-1-memxor@gmail.com>

The only reason to keep it was to initialize list head, but future
commits will introduce more members that need to be set, which is more
convenient to do using designated initializer.

Hence, remove the macro, convert users, and initialize list head inside
register_kfunc_btf_id_set.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/btf.h                                   | 4 ----
 kernel/bpf/btf.c                                      | 1 +
 net/ipv4/tcp_bbr.c                                    | 5 ++++-
 net/ipv4/tcp_cubic.c                                  | 5 ++++-
 net/ipv4/tcp_dctcp.c                                  | 5 ++++-
 tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c | 5 ++++-
 6 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index d3014493f6fb..ec9e69c14480 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -361,10 +361,6 @@ bool bpf_check_mod_kfunc_call(struct kfunc_btf_id_list *klist, u32 kfunc_id,
 }
 #endif
 
-#define DEFINE_KFUNC_BTF_ID_SET(set, name)                                     \
-	struct kfunc_btf_id_set name = { LIST_HEAD_INIT(name.list), (set),     \
-					 THIS_MODULE }
-
 extern struct kfunc_btf_id_list bpf_tcp_ca_kfunc_list;
 extern struct kfunc_btf_id_list prog_test_kfunc_list;
 
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index c9413d13ca91..450f9e37ceca 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6375,6 +6375,7 @@ struct kfunc_btf_id_list {
 void register_kfunc_btf_id_set(struct kfunc_btf_id_list *l,
 			       struct kfunc_btf_id_set *s)
 {
+	INIT_LIST_HEAD(&s->list);
 	mutex_lock(&l->mutex);
 	list_add(&s->list, &l->list);
 	mutex_unlock(&l->mutex);
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index ec5550089b4d..280dada5d1ae 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -1169,7 +1169,10 @@ BTF_ID(func, bbr_set_state)
 #endif
 BTF_SET_END(tcp_bbr_kfunc_ids)
 
-static DEFINE_KFUNC_BTF_ID_SET(&tcp_bbr_kfunc_ids, tcp_bbr_kfunc_btf_set);
+static struct kfunc_btf_id_set tcp_bbr_kfunc_btf_set = {
+	.owner = THIS_MODULE,
+	.set   = &tcp_bbr_kfunc_ids,
+};
 
 static int __init bbr_register(void)
 {
diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 5e9d9c51164c..70384a8040c5 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -497,7 +497,10 @@ BTF_ID(func, cubictcp_acked)
 #endif
 BTF_SET_END(tcp_cubic_kfunc_ids)
 
-static DEFINE_KFUNC_BTF_ID_SET(&tcp_cubic_kfunc_ids, tcp_cubic_kfunc_btf_set);
+static struct kfunc_btf_id_set tcp_cubic_kfunc_btf_set = {
+	.owner = THIS_MODULE,
+	.set   = &tcp_cubic_kfunc_ids,
+};
 
 static int __init cubictcp_register(void)
 {
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 0d7ab3cc7b61..ac2a47eb89d8 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -251,7 +251,10 @@ BTF_ID(func, dctcp_state)
 #endif
 BTF_SET_END(tcp_dctcp_kfunc_ids)
 
-static DEFINE_KFUNC_BTF_ID_SET(&tcp_dctcp_kfunc_ids, tcp_dctcp_kfunc_btf_set);
+static struct kfunc_btf_id_set tcp_dctcp_kfunc_btf_set = {
+	.owner = THIS_MODULE,
+	.set   = &tcp_dctcp_kfunc_ids,
+};
 
 static int __init dctcp_register(void)
 {
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 5d52ea2768df..a437086e1860 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -93,7 +93,10 @@ BTF_SET_START(bpf_testmod_kfunc_ids)
 BTF_ID(func, bpf_testmod_test_mod_kfunc)
 BTF_SET_END(bpf_testmod_kfunc_ids)
 
-static DEFINE_KFUNC_BTF_ID_SET(&bpf_testmod_kfunc_ids, bpf_testmod_kfunc_btf_set);
+static struct kfunc_btf_id_set bpf_testmod_kfunc_btf_set = {
+	.owner = THIS_MODULE,
+	.set   = &bpf_testmod_kfunc_ids,
+};
 
 static int bpf_testmod_init(void)
 {
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 1/9] bpf: Refactor bpf_check_mod_kfunc_call
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel
In-Reply-To: <20211209170929.3485242-1-memxor@gmail.com>

Future commits adding more callbacks will implement the same pattern of
matching module owner of kfunc_btf_id_set, and then operating on more
sets inside the struct.

While the btf_id_set for check_kfunc_call wouldn't have been NULL so
far, future commits introduce sets that are optional, hence the common
code also checks whether the pointer is valid.

Note that we must continue search on owner match and btf_id_set_contains
returning false, since more entries may have same owner (which can be
NULL for built-in modules). To clarify this case, a comment is added, so
that future commits don't regress the search.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/btf.h | 16 +++++++++++++---
 kernel/bpf/btf.c    | 27 +++++++++++++++++++--------
 2 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index acef6ef28768..d3014493f6fb 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -320,9 +320,19 @@ static inline const char *btf_name_by_offset(const struct btf *btf,
 }
 #endif
 
+enum kfunc_btf_id_set_types {
+	BTF_SET_CHECK,
+	__BTF_SET_MAX,
+};
+
 struct kfunc_btf_id_set {
 	struct list_head list;
-	struct btf_id_set *set;
+	union {
+		struct btf_id_set *sets[__BTF_SET_MAX];
+		struct {
+			struct btf_id_set *set;
+		};
+	};
 	struct module *owner;
 };
 
@@ -344,8 +354,8 @@ static inline void unregister_kfunc_btf_id_set(struct kfunc_btf_id_list *l,
 					       struct kfunc_btf_id_set *s)
 {
 }
-static inline bool bpf_check_mod_kfunc_call(struct kfunc_btf_id_list *klist,
-					    u32 kfunc_id, struct module *owner)
+bool bpf_check_mod_kfunc_call(struct kfunc_btf_id_list *klist, u32 kfunc_id,
+			      struct module *owner)
 {
 	return false;
 }
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 27b7de538697..c9413d13ca91 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6390,22 +6390,33 @@ void unregister_kfunc_btf_id_set(struct kfunc_btf_id_list *l,
 }
 EXPORT_SYMBOL_GPL(unregister_kfunc_btf_id_set);
 
-bool bpf_check_mod_kfunc_call(struct kfunc_btf_id_list *klist, u32 kfunc_id,
-			      struct module *owner)
+/* Caller must hold reference to module 'owner' */
+static bool kfunc_btf_id_set_contains(struct kfunc_btf_id_list *klist,
+				      u32 kfunc_id, struct module *owner,
+				      enum kfunc_btf_id_set_types type)
 {
-	struct kfunc_btf_id_set *s;
+	struct kfunc_btf_id_set *s = NULL;
+	bool ret = false;
 
-	if (!owner)
+	if (type >= __BTF_SET_MAX)
 		return false;
 	mutex_lock(&klist->mutex);
 	list_for_each_entry(s, &klist->list, list) {
-		if (s->owner == owner && btf_id_set_contains(s->set, kfunc_id)) {
-			mutex_unlock(&klist->mutex);
-			return true;
+		if (s->owner == owner && s->sets[type] &&
+		    btf_id_set_contains(s->sets[type], kfunc_id)) {
+			ret = true;
+			break;
 		}
+		/* continue search, since multiple sets may have same owner */
 	}
 	mutex_unlock(&klist->mutex);
-	return false;
+	return ret;
+}
+
+bool bpf_check_mod_kfunc_call(struct kfunc_btf_id_list *klist, u32 kfunc_id,
+			      struct module *owner)
+{
+	return kfunc_btf_id_set_contains(klist, kfunc_id, owner, BTF_SET_CHECK);
 }
 
 #endif
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v2 0/9] Introduce unstable CT lookup helpers
From: Kumar Kartikeya Dwivedi @ 2021-12-09 17:09 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	Maxim Mikityanskiy, Florian Westphal, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen, netdev, netfilter-devel

This series adds unstable conntrack lookup helpers using BPF kfunc support.  The
patch adding the lookup helper is based off of Maxim's recent patch to aid in
rebasing their series on top of this, all adjusted to work with module kfuncs [0].

  [0]: https://lore.kernel.org/bpf/20211019144655.3483197-8-maximmi@nvidia.com

To enable returning a reference to struct nf_conn, the verifier is extended to
support reference tracking for PTR_TO_BTF_ID, and kfunc is extended with support
for working as acquire/release functions, similar to existing BPF helpers. kfunc
returning pointer (limited to PTR_TO_BTF_ID in the kernel) can also return a
PTR_TO_BTF_ID_OR_NULL now, typically needed when acquiring a resource can fail.
kfunc can also receive PTR_TO_CTX and PTR_TO_MEM (with some limitations) as
arguments now. There is also support for passing a mem, len pair as argument
to kfunc now. In such cases, passing pointer to unsized type (void) is also
permitted.

Please see individual commits for details.

Note 1: Patch 1 in this series makes the same change as b12f03104324 ("bpf: Fix
bpf_check_mod_kfunc_call for built-in modules") in bpf tree, so there will be a
conflict if patch 1 is applied against that commit. I incorporated the same diff
change so that testing this set is possible (tests in patch 8 rely on it), but
before applying this, I'll rebase and resend, after bpf tree is merged into
bpf-next.

Note 2: BPF CI needs to add the following to config to test the set. I did
update the selftests config in patch 8, but not sure if that is enough.

	CONFIG_NETFILTER=y
	CONFIG_NF_DEFRAG_IPV4=y
	CONFIG_NF_DEFRAG_IPV6=y
	CONFIG_NF_CONNTRACK=y

Changelog:
----------

RFC v1 -> v2:
v1: https://lore.kernel.org/bpf/20211030144609.263572-1-memxor@gmail.com

 * Limit PTR_TO_MEM support to pointer to scalar, or struct with scalars (Alexei)
 * Use btf_id_set for checking acquire, release, ret type null (Alexei)
 * Introduce opts struct for CT helpers, move int err parameter to it
 * Add l4proto as parameter to CT helper's opts, remove separate tcp/udp helpers
 * Add support for mem, len argument pair to kfunc
 * Allow void * as pointer type for mem, len argument pair
 * Extend selftests to cover new additions to kfuncs
 * Copy ref_obj_id to PTR_TO_BTF_ID dst_reg on btf_struct_access, test it
 * Fix other misc nits, bugs, and expand commit messages

Kumar Kartikeya Dwivedi (9):
  bpf: Refactor bpf_check_mod_kfunc_call
  bpf: Remove DEFINE_KFUNC_BTF_ID_SET
  bpf: Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support
  bpf: Introduce mem, size argument pair support for kfunc
  bpf: Add reference tracking support to kfunc
  bpf: Track provenance for pointers formed from referenced
    PTR_TO_BTF_ID
  net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF
  selftests/bpf: Extend kfunc selftests
  selftests/bpf: Add test for unstable CT lookup API

 include/linux/bpf.h                           |  27 +-
 include/linux/bpf_verifier.h                  |  12 +
 include/linux/btf.h                           |  48 +++-
 kernel/bpf/btf.c                              | 218 ++++++++++++---
 kernel/bpf/verifier.c                         | 232 +++++++++++-----
 net/bpf/test_run.c                            | 147 ++++++++++
 net/core/filter.c                             |  27 ++
 net/core/net_namespace.c                      |   1 +
 net/ipv4/tcp_bbr.c                            |   5 +-
 net/ipv4/tcp_cubic.c                          |   5 +-
 net/ipv4/tcp_dctcp.c                          |   5 +-
 net/netfilter/nf_conntrack_core.c             | 252 ++++++++++++++++++
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |   5 +-
 tools/testing/selftests/bpf/config            |   4 +
 .../testing/selftests/bpf/prog_tests/bpf_nf.c |  48 ++++
 .../selftests/bpf/prog_tests/kfunc_call.c     |  28 ++
 .../selftests/bpf/progs/kfunc_call_test.c     |  52 +++-
 .../bpf/progs/kfunc_call_test_fail1.c         |  16 ++
 .../bpf/progs/kfunc_call_test_fail2.c         |  16 ++
 .../bpf/progs/kfunc_call_test_fail3.c         |  16 ++
 .../bpf/progs/kfunc_call_test_fail4.c         |  16 ++
 .../bpf/progs/kfunc_call_test_fail5.c         |  16 ++
 .../bpf/progs/kfunc_call_test_fail6.c         |  16 ++
 .../bpf/progs/kfunc_call_test_fail7.c         |  24 ++
 .../bpf/progs/kfunc_call_test_fail8.c         |  22 ++
 .../testing/selftests/bpf/progs/test_bpf_nf.c | 113 ++++++++
 26 files changed, 1259 insertions(+), 112 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_nf.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail1.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail2.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail3.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail4.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail5.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail6.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail7.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_fail8.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_bpf_nf.c

-- 
2.34.1


^ permalink raw reply

* Re: [PATCH v2 net-next] net: ocelot: fix missed include in the vsc7514_regs.h file
From: Florian Fainelli @ 2021-12-09 17:02 UTC (permalink / raw)
  To: Colin Foster, linux-kernel, netdev
  Cc: Vladimir Oltean, Claudiu Manoil, Alexandre Belloni,
	UNGLinuxDriver, Andrew Lunn, Vivien Didelot, David S. Miller,
	Jakub Kicinski, Russell King
In-Reply-To: <20211209074010.1813010-1-colin.foster@in-advantage.com>

On 12/8/21 11:40 PM, Colin Foster wrote:
> commit 32ecd22ba60b ("net: mscc: ocelot: split register definitions to a
> separate file") left out an include for <soc/mscc/ocelot_vcap.h>. It was
> missed because the only consumer was ocelot_vsc7514.h, which already
> included ocelot_vcap.
> 
> Fixes: 32ecd22ba60b ("net: mscc: ocelot: split register definitions to a separate file")
> Signed-off-by: Colin Foster <colin.foster@in-advantage.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH net] net: dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
From: patchwork-bot+netdevbpf @ 2021-12-09 17:00 UTC (permalink / raw)
  To: Russell King
  Cc: andrew, hkallweit1, davem, netdev, vivien.didelot, f.fainelli,
	olteanv, kuba
In-Reply-To: <E1mvFhP-00F8Zb-Ul@rmk-PC.armlinux.org.uk>

Hello:

This patch was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 09 Dec 2021 09:26:47 +0000 you wrote:
> Martyn Welch reports that his CPU port is unable to link where it has
> been necessary to use one of the switch ports with an internal PHY for
> the CPU port. The reason behind this is the port control register is
> left forcing the link down, preventing traffic flow.
> 
> This occurs because during initialisation, phylink expects the link to
> be down, and DSA forces the link down by synthesising a call to the
> DSA drivers phylink_mac_link_down() method, but we don't touch the
> forced-link state when we later reconfigure the port.
> 
> [...]

Here is the summary with links:
  - [net] net: dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
    https://git.kernel.org/netdev/net/c/04ec4e6250e5

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH] bridge: extend BR_ISOLATE to full split-horizon
From: Nikolay Aleksandrov @ 2021-12-09 16:57 UTC (permalink / raw)
  To: Jakub Kicinski, David Lamparter; +Cc: netdev, Alexandra Winter
In-Reply-To: <481128e0-d4d0-3fde-6a9e-65e391bbf64d@nvidia.com>

On 09/12/2021 18:23, Nikolay Aleksandrov wrote:
> On 09/12/2021 18:08, Nikolay Aleksandrov wrote:
>> On 09/12/2021 17:42, Jakub Kicinski wrote:
>>> On Thu,  9 Dec 2021 13:14:32 +0100 David Lamparter wrote:
>>>> Split-horizon essentially just means being able to create multiple
>>>> groups of isolated ports that are isolated within the group, but not
>>>> with respect to each other.
>>>>
>>>> The intent is very different, while isolation is a policy feature,
>>>> split-horizon is intended to provide functional "multiple member ports
>>>> are treated as one for loop avoidance."  But it boils down to the same
>>>> thing in the end.
>>>>
>>>> Signed-off-by: David Lamparter <equinox@diac24.net>
>>>> Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
>>>> Cc: Alexandra Winter <wintera@linux.ibm.com>
>>>
>>> Does not apply to net-next, you'll need to repost even if the code is
>>> good. Please put [PATCH net-next] in the subject.
>>>
>>
>> Hi,
>> For some reason this patch didn't make it to my inbox.. Anyway I was
>> able to see it now online, a few comments (sorry can't do them inline due
>> to missing mbox patch):
>> - please drop the sysfs part, we're not extending sysfs anymore
>> - split the bridge change from the driver
>> - drop the /* BR_ISOLATED - previously BIT(16) */ comment
>> - [IFLA_BRPORT_HORIZON_GROUP] = NLA_POLICY_MIN(NLA_S32, 0), why not just { .type = NLA_U32 } ?
> 
>> - just forbid having both set (tb[IFLA_BRPORT_ISOLATED] && tb[IFLA_BRPORT_HORIZON_GROUP])
>>   user-space should use just one of the two, if isolated is set then it overwrites any older
>>   IFLA_BRPORT_HORIZON_GROUP settings, that should simplify things considerably
> 
> Actually they'll have to be exported together always... that's unfortunate. I get
> why you need the extra netlink logic, I think we'll have to keep it.
So one relatively simple way to drop most of the logic is to have BRPORT_HORIZON_GROUP's
attribute get set after ISOLATED so it always overwrites it, which is similar to the current
situation but if you set only ISOLATED later it will function as intended (i.e. drop the logic
for horizon_group when using the old attr). Still will have to disallow ISOLATED = 0/HORIZON_GROUP >0
and ISOLATED=1/HORIZON_GROUP=0 though as these don't make sense.

e.g.:
if (tb[IFLA_BRPORT_ISOLATED])
	p->horizon_group = !!nla_get_u8(tb[IFLA_BRPORT_ISOLATED]);
if (tb[IFLA_BRPORT_HORIZON_GROUP])
	p->horizon_group = nla_get_u32(tb[IFLA_BRPORT_HORIZON_GROUP]);



This will be also in line with how they're exported (ISOLATED = 1 and HORIZON_GROUP >0).





^ permalink raw reply

* RE: [PATCH bpf-next 6/8] bpf: Add XDP_REDIRECT support to XDP for bpf_prog_run()
From: Toke Høiland-Jørgensen @ 2021-12-09 16:51 UTC (permalink / raw)
  To: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer
  Cc: netdev, bpf
In-Reply-To: <87tufhwygr.fsf@toke.dk>

Toke Høiland-Jørgensen <toke@redhat.com> writes:

> John Fastabend <john.fastabend@gmail.com> writes:
>
>> Toke Høiland-Jørgensen wrote:
>>> This adds support for doing real redirects when an XDP program returns
>>> XDP_REDIRECT in bpf_prog_run(). To achieve this, we create a page pool
>>> instance while setting up the test run, and feed pages from that into the
>>> XDP program. The setup cost of this is amortised over the number of
>>> repetitions specified by userspace.
>>> 
>>> To support performance testing use case, we further optimise the setup step
>>> so that all pages in the pool are pre-initialised with the packet data, and
>>> pre-computed context and xdp_frame objects stored at the start of each
>>> page. This makes it possible to entirely avoid touching the page content on
>>> each XDP program invocation, and enables sending up to 11.5 Mpps/core on my
>>> test box.
>>> 
>>> Because the data pages are recycled by the page pool, and the test runner
>>> doesn't re-initialise them for each run, subsequent invocations of the XDP
>>> program will see the packet data in the state it was after the last time it
>>> ran on that particular page. This means that an XDP program that modifies
>>> the packet before redirecting it has to be careful about which assumptions
>>> it makes about the packet content, but that is only an issue for the most
>>> naively written programs.
>>> 
>>> Previous uses of bpf_prog_run() for XDP returned the modified packet data
>>> and return code to userspace, which is a different semantic then this new
>>> redirect mode. For this reason, the caller has to set the new
>>> BPF_F_TEST_XDP_DO_REDIRECT flag when calling bpf_prog_run() to opt in to
>>> the different semantics. Enabling this flag is only allowed if not setting
>>> ctx_out and data_out in the test specification, since it means frames will
>>> be redirected somewhere else, so they can't be returned.
>>> 
>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>> ---
>>
>> [...]
>>
>>> +static int bpf_test_run_xdp_redirect(struct bpf_test_timer *t,
>>> +				     struct bpf_prog *prog, struct xdp_buff *orig_ctx)
>>> +{
>>> +	void *data, *data_end, *data_meta;
>>> +	struct xdp_frame *frm;
>>> +	struct xdp_buff *ctx;
>>> +	struct page *page;
>>> +	int ret, err = 0;
>>> +
>>> +	page = page_pool_dev_alloc_pages(t->xdp.pp);
>>> +	if (!page)
>>> +		return -ENOMEM;
>>> +
>>> +	ctx = ctx_from_page(page);
>>> +	data = ctx->data;
>>> +	data_meta = ctx->data_meta;
>>> +	data_end = ctx->data_end;
>>> +
>>> +	ret = bpf_prog_run_xdp(prog, ctx);
>>> +	if (ret == XDP_REDIRECT) {
>>> +		frm = (struct xdp_frame *)(ctx + 1);
>>> +		/* if program changed pkt bounds we need to update the xdp_frame */
>>
>> Because this reuses the frame repeatedly is there any issue with also
>> updating the ctx each time? Perhaps if the prog keeps shrinking
>> the pkt it might wind up with 0 len pkt? Just wanted to ask.
>
> Sure, it could. But the data buffer comes from userspace anyway, and
> there's nothing preventing userspace from passing a 0-length packet
> anyway, so I just mentally put this in the "don't do that, then" bucket :)
>
> At least I don't *think* there's actually any problem with this that we
> don't have already? A regular XDP program can also shrink an incoming
> packet to zero, then redirect it, no?

Another thought is that we could of course do the opposite here: instead
of updating the xdp_frame when the program resizes the packet, just
reset the pointers so that the next invocation will get the original
size again? The data would still be changed, but maybe that behaviour is
less surprising? WDYT?

-Toke


^ permalink raw reply

* Re: [PATCH net-next v3 6/6] net: lan966x: Add switchdev support
From: Horatiu Vultur @ 2021-12-09 16:43 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	kuba@kernel.org, robh+dt@kernel.org, UNGLinuxDriver@microchip.com,
	linux@armlinux.org.uk, f.fainelli@gmail.com,
	vivien.didelot@gmail.com, andrew@lunn.ch
In-Reply-To: <20211209133616.2kii2xfz5rioii4o@skbuf>

The 12/09/2021 13:36, Vladimir Oltean wrote:
> 
> On Thu, Dec 09, 2021 at 10:46:15AM +0100, Horatiu Vultur wrote:
> > This adds support for switchdev in lan966x.
> > It offloads to the HW basic forwarding and vlan filtering. To be able to
> > offload this to the HW, it is required to disable promisc mode for ports
> > that are part of the bridge.
> >
> > Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
> > ---
> >  .../net/ethernet/microchip/lan966x/Makefile   |   3 +-
> >  .../ethernet/microchip/lan966x/lan966x_main.c |  41 +-
> >  .../ethernet/microchip/lan966x/lan966x_main.h |  18 +
> >  .../microchip/lan966x/lan966x_switchdev.c     | 548 ++++++++++++++++++
> >  .../ethernet/microchip/lan966x/lan966x_vlan.c |  12 +-
> >  5 files changed, 610 insertions(+), 12 deletions(-)
> >  create mode 100644 drivers/net/ethernet/microchip/lan966x/lan966x_switchdev.c
> >
> > diff --git a/drivers/net/ethernet/microchip/lan966x/Makefile b/drivers/net/ethernet/microchip/lan966x/Makefile
> > index f7e6068a91cb..d82e896c2e53 100644
> > --- a/drivers/net/ethernet/microchip/lan966x/Makefile
> > +++ b/drivers/net/ethernet/microchip/lan966x/Makefile
> > @@ -6,4 +6,5 @@
> >  obj-$(CONFIG_LAN966X_SWITCH) += lan966x-switch.o
> >
> >  lan966x-switch-objs  := lan966x_main.o lan966x_phylink.o lan966x_port.o \
> > -                     lan966x_mac.o lan966x_ethtool.o lan966x_vlan.o
> > +                     lan966x_mac.o lan966x_ethtool.o lan966x_switchdev.o \
> > +                     lan966x_vlan.o
> > diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.c b/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
> > index 1b4c7e6b4f85..aee36c1cfa17 100644
> > --- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
> > +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
> > @@ -306,7 +306,7 @@ static int lan966x_port_xmit(struct sk_buff *skb, struct net_device *dev)
> >       return lan966x_port_ifh_xmit(skb, ifh, dev);
> >  }
> >
> > -static void lan966x_set_promisc(struct lan966x_port *port, bool enable)
> > +void lan966x_set_promisc(struct lan966x_port *port, bool enable)
> >  {
> >       struct lan966x *lan966x = port->lan966x;
> >
> 
> My documentation of CPU_SRC_COPY_ENA says:
> 
> If set, all frames received on this port are
> copied to the CPU extraction queue given by
> CPUQ_CFG.CPUQ_SRC_COPY.
> 
> I think it was established a while ago that this isn't what promiscuous
> mode is about? Instead it is about accepting packets on a port
> regardless of whether the MAC DA is in their RX filter or not.

Yes, I am aware that this change interprets the things differently and I
am totally OK to drop this promisc if it is needed.

> 
> Hence the oddity of your change. I understand what it intends to do:
> if this is a standalone port you support IFF_UNICAST_FLT, so you drop
> frames with unknown MAC DA. But if IFF_PROMISC is set, then why do you
> copy all frames to the CPU? Why don't you just put the CPU in the
> unknown flooding mask?

Because I don't want the CPU to be in the unknown flooding mask. I want
to send frames to the CPU only if it is required.

> There's a difference between "force a packet to
> get copied to the CPU" and "copy it to the CPU only if it may have
> business there". Then this change would be compatible with bridge mode.
> You want the bridge to receive unknown traffic too, it may need to
> forward it in software.

I don't want to bridge to receive unknown traffic if I know it would
just drop it.

> 
> > @@ -318,14 +318,18 @@ static void lan966x_set_promisc(struct lan966x_port *port, bool enable)
> >  static void lan966x_port_change_rx_flags(struct net_device *dev, int flags)
> >  {
> >       struct lan966x_port *port = netdev_priv(dev);
> > +     bool enable;
> >
> >       if (!(flags & IFF_PROMISC))
> >               return;
> >
> > -     if (dev->flags & IFF_PROMISC)
> > -             lan966x_set_promisc(port, true);
> > -     else
> > -             lan966x_set_promisc(port, false);
> > +     enable = dev->flags & IFF_PROMISC ? true : false;
> > +     port->promisc = enable;
> > +
> > +     if (port->bridge)
> > +             return;
> > +
> > +     lan966x_set_promisc(port, enable);
> >  }
> >
> >  static int lan966x_port_change_mtu(struct net_device *dev, int new_mtu)
> > @@ -340,7 +344,7 @@ static int lan966x_port_change_mtu(struct net_device *dev, int new_mtu)
> >       return 0;
> >  }
> >
> > -static int lan966x_mc_unsync(struct net_device *dev, const unsigned char *addr)
> > +int lan966x_mc_unsync(struct net_device *dev, const unsigned char *addr)
> >  {
> >       struct lan966x_port *port = netdev_priv(dev);
> >       struct lan966x *lan966x = port->lan966x;
> > @@ -348,7 +352,7 @@ static int lan966x_mc_unsync(struct net_device *dev, const unsigned char *addr)
> >       return lan966x_mac_forget(lan966x, addr, port->pvid, ENTRYTYPE_LOCKED);
> >  }
> >
> > -static int lan966x_mc_sync(struct net_device *dev, const unsigned char *addr)
> > +int lan966x_mc_sync(struct net_device *dev, const unsigned char *addr)
> >  {
> >       struct lan966x_port *port = netdev_priv(dev);
> >       struct lan966x *lan966x = port->lan966x;
> > @@ -401,6 +405,11 @@ static const struct net_device_ops lan966x_port_netdev_ops = {
> >       .ndo_vlan_rx_kill_vid           = lan966x_vlan_rx_kill_vid,
> >  };
> >
> > +bool lan966x_netdevice_check(const struct net_device *dev)
> > +{
> > +     return dev && (dev->netdev_ops == &lan966x_port_netdev_ops);
> 
> Can "dev" ever be NULL?

It doesn't look like that. I will remove it.

> 
> > +}
> > +
> >  static int lan966x_port_xtr_status(struct lan966x *lan966x, u8 grp)
> >  {
> >       return lan_rd(lan966x, QS_XTR_RD(grp));
> > @@ -537,6 +546,11 @@ static irqreturn_t lan966x_xtr_irq_handler(int irq, void *args)
> >
> >               skb->protocol = eth_type_trans(skb, dev);
> >
> > +#ifdef CONFIG_NET_SWITCHDEV
> > +             if (lan966x->ports[src_port]->bridge)
> > +                     skb->offload_fwd_mark = 1;
> > +#endif
> > +
> >               netif_rx_ni(skb);
> >               dev->stats.rx_bytes += len;
> >               dev->stats.rx_packets++;
> > @@ -619,13 +633,16 @@ static int lan966x_probe_port(struct lan966x *lan966x, u32 p,
> >
> >       dev->netdev_ops = &lan966x_port_netdev_ops;
> >       dev->ethtool_ops = &lan966x_ethtool_ops;
> > +     dev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER |
> > +                         NETIF_F_RXFCS;
> > +     dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER |
> > +                      NETIF_F_HW_VLAN_CTAG_TX |
> > +                      NETIF_F_HW_VLAN_STAG_TX;
> > +     dev->priv_flags |= IFF_UNICAST_FLT;
> 
> Too many changes in one patch. IFF_UNICAST_FLT and the handling of
> promiscuous mode have nothing to do with switchdev.

Well, in this case (because it is a bug in the promisc) it has. Because
the bridge will increase the promiscuity of the interfaces, so then all
the ports will copy the frames to the CPU. Which breaks the entire
purpose of the bridge.

> Neither VLAN filtering via dev->features (should have been part of the previous
> patch), or RXFCS.

I agree, the VLAN filtering should be part of the previous patch and
RXFCS should be totally separate change.

> It seems that each one of these changes was previously
> missed and is now snuck in without the explanation it deserves.
> 
> >       dev->needed_headroom = IFH_LEN * sizeof(u32);
> >
> >       eth_hw_addr_gen(dev, lan966x->base_mac, p + 1);
> >
> > -     lan966x_mac_learn(lan966x, PGID_CPU, dev->dev_addr, port->pvid,
> > -                       ENTRYTYPE_LOCKED);
> > -
> 
> Why is this deleted? If the port uses IFF_UNICAST_FLT and isn't
> promiscuous, how does it get the unicast traffic it needs?
> I may be not realizing something because the changes aren't properly
> split.

The functionality is moved inside function 'lan966x_port_add_addr'.

> 
> >       port->phylink_config.dev = &port->dev->dev;
> >       port->phylink_config.type = PHYLINK_NETDEV;
> >       port->phylink_pcs.poll = true;
> > @@ -949,6 +966,8 @@ static int lan966x_probe(struct platform_device *pdev)
> >               lan966x_port_init(lan966x->ports[p]);
> >       }
> >
> > +     lan966x_register_notifier_blocks(lan966x);
> > +
> >       return 0;
> >
> >  cleanup_ports:
> > @@ -967,6 +986,8 @@ static int lan966x_remove(struct platform_device *pdev)
> >  {
> >       struct lan966x *lan966x = platform_get_drvdata(pdev);
> >
> > +     lan966x_unregister_notifier_blocks(lan966x);
> > +
> >       lan966x_cleanup_ports(lan966x);
> >
> >       cancel_delayed_work_sync(&lan966x->stats_work);
> > diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
> > index ec3eccf634b3..4a0988087167 100644
> > --- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
> > +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
> > @@ -80,6 +80,11 @@ struct lan966x {
> >       struct list_head mac_entries;
> >       spinlock_t mac_lock; /* lock for mac_entries list */
> >
> > +     /* Notifiers */
> > +     struct notifier_block netdevice_nb;
> > +     struct notifier_block switchdev_nb;
> > +     struct notifier_block switchdev_blocking_nb;
> > +
> >       u16 vlan_mask[VLAN_N_VID];
> >       DECLARE_BITMAP(cpu_vlan_mask, VLAN_N_VID);
> >
> > @@ -112,6 +117,10 @@ struct lan966x_port {
> >       struct net_device *dev;
> >       struct lan966x *lan966x;
> >
> > +     struct net_device *bridge;
> > +     u8 stp_state;
> > +     u8 promisc;
> > +
> >       u8 chip_port;
> >       u16 pvid;
> >       u16 vid;
> > @@ -129,6 +138,14 @@ extern const struct phylink_mac_ops lan966x_phylink_mac_ops;
> >  extern const struct phylink_pcs_ops lan966x_phylink_pcs_ops;
> >  extern const struct ethtool_ops lan966x_ethtool_ops;
> >
> > +int lan966x_mc_unsync(struct net_device *dev, const unsigned char *addr);
> > +int lan966x_mc_sync(struct net_device *dev, const unsigned char *addr);
> > +
> > +bool lan966x_netdevice_check(const struct net_device *dev);
> > +
> > +int lan966x_register_notifier_blocks(struct lan966x *lan966x);
> > +void lan966x_unregister_notifier_blocks(struct lan966x *lan966x);
> > +
> >  void lan966x_stats_get(struct net_device *dev,
> >                      struct rtnl_link_stats64 *stats);
> >  int lan966x_stats_init(struct lan966x *lan966x);
> > @@ -139,6 +156,7 @@ void lan966x_port_status_get(struct lan966x_port *port,
> >                            struct phylink_link_state *state);
> >  int lan966x_port_pcs_set(struct lan966x_port *port,
> >                        struct lan966x_port_config *config);
> > +void lan966x_set_promisc(struct lan966x_port *port, bool enable);
> >  void lan966x_port_init(struct lan966x_port *port);
> >
> >  int lan966x_mac_learn(struct lan966x *lan966x, int port,
> > diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_switchdev.c b/drivers/net/ethernet/microchip/lan966x/lan966x_switchdev.c
> > new file mode 100644
> > index 000000000000..ed6ec78d2d9a
> > --- /dev/null
> > +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_switchdev.c
> > @@ -0,0 +1,548 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +
> > +#include <linux/if_bridge.h>
> > +#include <net/switchdev.h>
> > +
> > +#include "lan966x_main.h"
> > +
> > +static struct workqueue_struct *lan966x_owq;
> > +
> > +struct lan966x_fdb_event_work {
> > +     struct work_struct work;
> > +     struct switchdev_notifier_fdb_info fdb_info;
> > +     struct net_device *dev;
> > +     struct lan966x *lan966x;
> > +     unsigned long event;
> > +};
> > +
> > +static void lan966x_port_attr_bridge_flags(struct lan966x_port *port,
> > +                                        struct switchdev_brport_flags flags)
> > +{
> > +     u32 val = lan_rd(port->lan966x, ANA_PGID(PGID_MC));
> > +
> > +     val = ANA_PGID_PGID_GET(val);
> > +
> > +     if (flags.mask & BR_MCAST_FLOOD) {
> > +             if (flags.val & BR_MCAST_FLOOD)
> > +                     val |= BIT(port->chip_port);
> > +             else
> > +                     val &= ~BIT(port->chip_port);
> > +     }
> > +
> > +     lan_rmw(ANA_PGID_PGID_SET(val),
> > +             ANA_PGID_PGID,
> > +             port->lan966x, ANA_PGID(PGID_MC));
> > +}
> > +
> > +static u32 lan966x_get_fwd_mask(struct lan966x_port *port)
> > +{
> > +     struct net_device *bridge = port->bridge;
> > +     struct lan966x *lan966x = port->lan966x;
> > +     u8 ingress_src = port->chip_port;
> > +     u32 mask = 0;
> > +     int p;
> > +
> > +     if (port->stp_state != BR_STATE_FORWARDING)
> > +             goto skip_forwarding;
> > +
> > +     for (p = 0; p < lan966x->num_phys_ports; p++) {
> > +             port = lan966x->ports[p];
> > +
> > +             if (!port)
> > +                     continue;
> > +
> > +             if (port->stp_state == BR_STATE_FORWARDING &&
> > +                 port->bridge == bridge)
> > +                     mask |= BIT(p);
> > +     }
> > +
> > +skip_forwarding:
> > +     mask &= ~BIT(ingress_src);
> > +
> > +     return mask;
> > +}
> > +
> > +static void lan966x_update_fwd_mask(struct lan966x *lan966x)
> > +{
> > +     int p;
> > +
> > +     for (p = 0; p < lan966x->num_phys_ports; p++) {
> > +             struct lan966x_port *port = lan966x->ports[p];
> > +             unsigned long mask = 0;
> > +
> > +             if (port->bridge)
> > +                     mask = lan966x_get_fwd_mask(port);
> > +
> > +             mask |= BIT(CPU_PORT);
> > +
> > +             lan_wr(ANA_PGID_PGID_SET(mask),
> > +                    lan966x, ANA_PGID(PGID_SRC + p));
> > +     }
> > +}
> > +
> > +static void lan966x_attr_stp_state_set(struct lan966x_port *port,
> > +                                    u8 state)
> > +{
> > +     struct lan966x *lan966x = port->lan966x;
> > +     bool learn_ena = 0;
> > +
> > +     port->stp_state = state;
> > +
> > +     if (state == BR_STATE_FORWARDING || state == BR_STATE_LEARNING)
> > +             learn_ena = 1;
> 
> Please use true/false for bool types.

Will do that.

> 
> > +
> > +     lan_rmw(ANA_PORT_CFG_LEARN_ENA_SET(learn_ena),
> > +             ANA_PORT_CFG_LEARN_ENA,
> > +             lan966x, ANA_PORT_CFG(port->chip_port));
> > +
> > +     lan966x_update_fwd_mask(lan966x);
> > +}
> > +
> > +static void lan966x_port_attr_ageing_set(struct lan966x_port *port,
> > +                                      unsigned long ageing_clock_t)
> > +{
> > +     unsigned long ageing_jiffies = clock_t_to_jiffies(ageing_clock_t);
> > +     u32 ageing_time = jiffies_to_msecs(ageing_jiffies) / 1000;
> > +
> > +     lan966x_mac_set_ageing(port->lan966x, ageing_time);
> > +}
> > +
> > +static int lan966x_port_attr_set(struct net_device *dev, const void *ctx,
> > +                              const struct switchdev_attr *attr,
> > +                              struct netlink_ext_ack *extack)
> > +{
> > +     struct lan966x_port *port = netdev_priv(dev);
> > +
> > +     switch (attr->id) {
> > +     case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS:
> > +             lan966x_port_attr_bridge_flags(port, attr->u.brport_flags);
> > +             break;
> 
> no SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS? I think it doesn't even work
> if you don't handle that?
> 
> br_switchdev_set_port_flag():
> 
>         attr.id = SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS;
>         attr.u.brport_flags.val = flags;
>         attr.u.brport_flags.mask = mask;
> 
>         /* We run from atomic context here */
>         err = call_switchdev_notifiers(SWITCHDEV_PORT_ATTR_SET, p->dev,
>                                        &info.info, extack);
>         err = notifier_to_errno(err);
>         if (err == -EOPNOTSUPP)
>                 return 0;
> 
> Anyway, a big blob of "switchdev support" is hard to follow and review.
> If you add bridge port flags you could as well add more comprehensive
> support for them, but in a separate change please. Forwarding domain is
> one thing, FDB/MDB is another, VLAN is another.

Good catch. I will try to split it as you suggested.

> 
> > +     case SWITCHDEV_ATTR_ID_PORT_STP_STATE:
> > +             lan966x_attr_stp_state_set(port, attr->u.stp_state);
> > +             break;
> > +     case SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME:
> > +             lan966x_port_attr_ageing_set(port, attr->u.ageing_time);
> > +             break;
> > +     case SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING:
> > +             lan966x_vlan_port_set_vlan_aware(port, attr->u.vlan_filtering);
> > +             lan966x_vlan_port_apply(port);
> > +             lan966x_vlan_cpu_set_vlan_aware(port);
> > +             break;
> > +     default:
> > +             return -EOPNOTSUPP;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static int lan966x_port_bridge_join(struct lan966x_port *port,
> > +                                 struct net_device *bridge,
> > +                                 struct netlink_ext_ack *extack)
> > +{
> > +     struct net_device *dev = port->dev;
> > +     int err;
> > +
> > +     err = switchdev_bridge_port_offload(dev, dev, NULL, NULL, NULL,
> > +                                         false, extack);
> > +     if (err)
> > +             return err;
> > +
> > +     port->bridge = bridge;
> > +
> > +     /* Port enters in bridge mode therefor don't need to copy to CPU
> > +      * frames for multicast in case the bridge is not requesting them
> > +      */
> > +     __dev_mc_unsync(dev, lan966x_mc_unsync);
> 
> Why is there a need to unsync the addresses, though? A driver that
> supports proper MAC table isolation between standalone ports and
> VLAN-unaware bridge ports should use separate pvids for the two modes of
> operation (if it doesn't support other isolation mechanisms apart from VLAN,
> which it looks like lan966x does not). I see that your driver even does
> this in lan966x_vlan_port_get_pvid(). So the non-bridge multicast
> addresses could sit there even when the port is in a bridge.

Really good observation.
Initially I had the same PVID for standalone and vlan-unaware bridge
ports. So then I have added all these but now that they use different
PVID I can remove this.

> 
> > +
> > +     /* make sure that the promisc is disabled when entering under the bridge
> > +      * because we don't want all the frames to come to CPU
> > +      */
> > +     lan966x_set_promisc(port, false);
> 
> What's the story here? Why don't other switchdev drivers handle promisc
> in this way (copy all frames to the CPU)?

I can't say about other switchdev drivers, but in our case we really
don't want all the frames to the CPU. The function lan966x_set_promisc
was setting the port to copy all the frames to the CPU, which we don't
want when the port is part of the bridge.

> 
> > +
> > +     return 0;
> > +}
> > +
> > +static void lan966x_port_bridge_leave(struct lan966x_port *port,
> > +                                   struct net_device *bridge)
> > +{
> > +     struct lan966x *lan966x = port->lan966x;
> > +
> > +     switchdev_bridge_port_unoffload(port->dev, NULL, NULL, NULL);
> 
> The bridge offers a facility to sync and unsync the host addresses on
> joins and leaves. For this to work, the switchdev_bridge_port_unoffload
> call should be during the NETDEV_PRECHANGEUPPER notifier event.
> 
> > +     port->bridge = NULL;
> > +
> > +     /* Set the port back to host mode */
> > +     lan966x_vlan_port_set_vlan_aware(port, 0);
> 
> s/0/false/
> 
> > +     lan966x_vlan_port_set_vid(port, HOST_PVID, false, false);
> > +     lan966x_vlan_port_apply(port);
> > +
> > +     lan966x_mac_cpu_learn(lan966x, port->dev->dev_addr, HOST_PVID);
> > +
> > +     /* Port enters in host more therefore restore mc list */
> > +     __dev_mc_sync(port->dev, lan966x_mc_sync, lan966x_mc_unsync);
> > +
> > +     /* Restore back the promisc as it was before the interfaces was added to
> > +      * the bridge
> > +      */
> > +     lan966x_set_promisc(port, port->promisc);
> > +}
> > +
> > +static int lan966x_port_changeupper(struct net_device *dev,
> > +                                 struct netdev_notifier_changeupper_info *info)
> > +{
> > +     struct lan966x_port *port = netdev_priv(dev);
> > +     struct netlink_ext_ack *extack;
> > +     int err = 0;
> > +
> > +     extack = netdev_notifier_info_to_extack(&info->info);
> > +
> > +     if (netif_is_bridge_master(info->upper_dev)) {
> > +             if (info->linking)
> > +                     err = lan966x_port_bridge_join(port, info->upper_dev,
> > +                                                    extack);
> > +             else
> > +                     lan966x_port_bridge_leave(port, info->upper_dev);
> > +     }
> > +
> > +     return err;
> > +}
> > +
> > +static int lan966x_port_add_addr(struct net_device *dev, bool up)
> > +{
> > +     struct lan966x_port *port = netdev_priv(dev);
> > +     struct lan966x *lan966x = port->lan966x;
> > +     u16 vid;
> > +
> > +     vid = lan966x_vlan_port_get_pvid(port);
> > +
> > +     if (up)
> > +             lan966x_mac_cpu_learn(lan966x, dev->dev_addr, vid);
> > +     else
> > +             lan966x_mac_cpu_forget(lan966x, dev->dev_addr, vid);
> 
> For which uppers is this intended? The bridge? Because the bridge
> notifies you of all the entries it needs, if you also consider the
> replayed events and provide non-NULL pointers to
> switchdev_bridge_port_offload.

No, this up just means if the port is up or down. Doesn't have anything
to do with the bridge. This code replace the code from
lan966x_probe_port.

> 
> > +
> > +     return 0;
> > +}
> > +
> > +static int lan966x_netdevice_port_event(struct net_device *dev,
> > +                                     struct notifier_block *nb,
> > +                                     unsigned long event, void *ptr)
> > +{
> > +     int err = 0;
> > +
> > +     if (!lan966x_netdevice_check(dev))
> > +             return 0;
> > +
> > +     switch (event) {
> > +     case NETDEV_CHANGEUPPER:
> > +             err = lan966x_port_changeupper(dev, ptr);
> > +             break;
> > +     case NETDEV_PRE_UP:
> > +             err = lan966x_port_add_addr(dev, true);
> > +             break;
> > +     case NETDEV_DOWN:
> > +             err = lan966x_port_add_addr(dev, false);
> > +             break;
> > +     }
> > +
> > +     return err;
> > +}
> > +
> > +static int lan966x_netdevice_event(struct notifier_block *nb,
> > +                                unsigned long event, void *ptr)
> > +{
> > +     struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> > +     int ret;
> > +
> > +     ret = lan966x_netdevice_port_event(dev, nb, event, ptr);
> > +
> > +     return notifier_from_errno(ret);
> > +}
> > +
> > +static void lan966x_fdb_event_work(struct work_struct *work)
> > +{
> > +     struct lan966x_fdb_event_work *fdb_work =
> > +             container_of(work, struct lan966x_fdb_event_work, work);
> > +     struct switchdev_notifier_fdb_info *fdb_info;
> > +     struct net_device *dev = fdb_work->dev;
> > +     struct lan966x_port *port;
> > +     struct lan966x *lan966x;
> > +
> > +     rtnl_lock();
> 
> rtnl_lock() shouldn't be needed.
> 
> > +
> > +     fdb_info = &fdb_work->fdb_info;
> > +     lan966x = fdb_work->lan966x;
> > +
> > +     if (lan966x_netdevice_check(dev)) {
> > +             port = netdev_priv(dev);
> > +
> > +             switch (fdb_work->event) {
> > +             case SWITCHDEV_FDB_ADD_TO_DEVICE:
> > +                     if (!fdb_info->added_by_user)
> > +                             break;
> 
> If you get notified of a MAC address dynamically learned by the software
> bridge on a lan966x port, you will have allocated memory for the work
> item, and scheduled it, for nothing. Please try to exit unnecessary work
> early.

Yes, I will add an early check.

> 
> > +                     lan966x_mac_add_entry(lan966x, port, fdb_info->addr,
> > +                                           fdb_info->vid);
> > +                     break;
> > +             case SWITCHDEV_FDB_DEL_TO_DEVICE:
> > +                     if (!fdb_info->added_by_user)
> > +                             break;
> > +                     lan966x_mac_del_entry(lan966x, fdb_info->addr, fdb_info->vid);
> > +                     break;
> > +             }
> > +     } else {
> > +             if (!netif_is_bridge_master(dev))
> > +                     goto out;
> > +
> > +             /* If the CPU is not part of the vlan then there is no point
> > +              * to copy the frames to the CPU because they will be dropped
> > +              */
> > +             if (!lan966x_vlan_cpu_member_vlan_mask(lan966x, fdb_info->vid))
> > +                     goto out;
> 
> It isn't part of the VLAN now, but what about later? I don't see that
> you keep these FDB entries anywhere and restore them when a port joins
> that VLAN.

Actually I do that. It is inside lan966x_vlan_port_add_vlan. There I
check if the CPU is part of the VLAN then add that entry.

> 
> > +
> > +             /* In case the bridge is called */
> > +             switch (fdb_work->event) {
> > +             case SWITCHDEV_FDB_ADD_TO_DEVICE:
> > +                     /* If there is no front port in this vlan, there is no
> > +                      * point to copy the frame to CPU because it would be
> > +                      * just dropped at later point. So add it only if
> > +                      * there is a port
> > +                      */
> > +                     if (!lan966x_vlan_port_any_vlan_mask(lan966x, fdb_info->vid))
> > +                             break;
> > +
> > +                     lan966x_mac_cpu_learn(lan966x, fdb_info->addr, fdb_info->vid);
> 
> Does the lan966x_mac_cpu_learn() operation trigger interrupts, or only
> the dynamic learning process?

Only dynamic learning process.

> How do you handle migration of an FDB entry pointing towards the CPU,
> towards one pointing towards a port?

Shouldn't I get 2 calls that the entry is removed from CPU and then
added to a port?

> 
> > +                     break;
> > +             case SWITCHDEV_FDB_DEL_TO_DEVICE:
> > +                     /* It is OK to always forget the entry it */
> 
> Forget the entry it?

It was a typo. I will remove the 'it'.

> 
> > +                     lan966x_mac_cpu_forget(lan966x, fdb_info->addr, fdb_info->vid);
> > +                     break;
> > +             }
> > +     }
> > +
> > +out:
> > +     rtnl_unlock();
> > +     kfree(fdb_work->fdb_info.addr);
> > +     kfree(fdb_work);
> > +     dev_put(dev);
> > +}
> > +
> > +static int lan966x_switchdev_event(struct notifier_block *nb,
> > +                                unsigned long event, void *ptr)
> > +{
> > +     struct lan966x *lan966x = container_of(nb, struct lan966x, switchdev_nb);
> > +     struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
> > +     struct switchdev_notifier_fdb_info *fdb_info;
> > +     struct switchdev_notifier_info *info = ptr;
> > +     struct lan966x_fdb_event_work *fdb_work;
> > +     int err;
> > +
> > +     switch (event) {
> > +     case SWITCHDEV_PORT_ATTR_SET:
> > +             err = switchdev_handle_port_attr_set(dev, ptr,
> > +                                                  lan966x_netdevice_check,
> > +                                                  lan966x_port_attr_set);
> > +             return notifier_from_errno(err);
> > +     case SWITCHDEV_FDB_ADD_TO_DEVICE:
> > +             fallthrough;
> 
> "fallthrough;" not needed here.
> 
> > +     case SWITCHDEV_FDB_DEL_TO_DEVICE:
> > +             fdb_work = kzalloc(sizeof(*fdb_work), GFP_ATOMIC);
> > +             if (!fdb_work)
> > +                     return NOTIFY_BAD;
> > +
> > +             fdb_info = container_of(info,
> > +                                     struct switchdev_notifier_fdb_info,
> > +                                     info);
> > +
> > +             fdb_work->dev = dev;
> > +             fdb_work->lan966x = lan966x;
> > +             fdb_work->event = event;
> > +             INIT_WORK(&fdb_work->work, lan966x_fdb_event_work);
> > +             memcpy(&fdb_work->fdb_info, ptr, sizeof(fdb_work->fdb_info));
> > +             fdb_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
> > +             if (!fdb_work->fdb_info.addr)
> > +                     goto err_addr_alloc;
> > +
> > +             ether_addr_copy((u8 *)fdb_work->fdb_info.addr, fdb_info->addr);
> > +             dev_hold(dev);
> > +
> > +             queue_work(lan966x_owq, &fdb_work->work);
> > +             break;
> > +     }
> > +
> > +     return NOTIFY_DONE;
> > +err_addr_alloc:
> > +     kfree(fdb_work);
> > +     return NOTIFY_BAD;
> > +}
> > +
> > +static int lan966x_handle_port_vlan_add(struct net_device *dev,
> > +                                     struct notifier_block *nb,
> > +                                     const struct switchdev_obj_port_vlan *v)
> > +{
> > +     struct lan966x_port *port;
> > +     struct lan966x *lan966x;
> > +
> > +     /* When adding a port to a vlan, we get a callback for the port but
> > +      * also for the bridge. When get the callback for the bridge just bail
> > +      * out. Then when the bridge is added to the vlan, then we get a
> > +      * callback here but in this case the flags has set:
> > +      * BRIDGE_VLAN_INFO_BRENTRY. In this case it means that the CPU
> > +      * port is added to the vlan, so the broadcast frames and unicast frames
> > +      * with dmac of the bridge should be foward to CPU.
> > +      */
> > +     if (netif_is_bridge_master(dev) &&
> > +         !(v->flags & BRIDGE_VLAN_INFO_BRENTRY))
> > +             return 0;
> > +
> > +     lan966x = container_of(nb, struct lan966x, switchdev_blocking_nb);
> > +
> > +     /* In case the port gets called */
> > +     if (!(netif_is_bridge_master(dev))) {
> > +             if (!lan966x_netdevice_check(dev))
> > +                     return -EOPNOTSUPP;
> > +
> > +             port = netdev_priv(dev);
> > +             return lan966x_vlan_port_add_vlan(port, v->vid,
> > +                                               v->flags & BRIDGE_VLAN_INFO_PVID,
> > +                                               v->flags & BRIDGE_VLAN_INFO_UNTAGGED);
> > +     }
> > +
> > +     /* In case the bridge gets called */
> > +     if (netif_is_bridge_master(dev))
> > +             return lan966x_vlan_cpu_add_vlan(lan966x, dev, v->vid);
> > +
> > +     return 0;
> > +}
> > +
> > +static int lan966x_handle_port_obj_add(struct net_device *dev,
> > +                                    struct notifier_block *nb,
> > +                                    struct switchdev_notifier_port_obj_info *info)
> > +{
> > +     const struct switchdev_obj *obj = info->obj;
> > +     int err;
> > +
> > +     switch (obj->id) {
> > +     case SWITCHDEV_OBJ_ID_PORT_VLAN:
> > +             err = lan966x_handle_port_vlan_add(dev, nb,
> > +                                                SWITCHDEV_OBJ_PORT_VLAN(obj));
> > +             break;
> > +     default:
> > +             err = -EOPNOTSUPP;
> > +             break;
> > +     }
> > +
> > +     info->handled = true;
> > +     return err;
> > +}
> > +
> > +static int lan966x_handle_port_vlan_del(struct net_device *dev,
> > +                                     struct notifier_block *nb,
> > +                                     const struct switchdev_obj_port_vlan *v)
> > +{
> > +     struct lan966x_port *port;
> > +     struct lan966x *lan966x;
> > +
> > +     lan966x = container_of(nb, struct lan966x, switchdev_blocking_nb);
> > +
> > +     /* In case the physical port gets called */
> > +     if (!netif_is_bridge_master(dev)) {
> > +             if (!lan966x_netdevice_check(dev))
> > +                     return -EOPNOTSUPP;
> > +
> > +             port = netdev_priv(dev);
> > +             return lan966x_vlan_port_del_vlan(port, v->vid);
> > +     }
> > +
> > +     /* In case the bridge gets called */
> > +     if (netif_is_bridge_master(dev))
> > +             return lan966x_vlan_cpu_del_vlan(lan966x, dev, v->vid);
> > +
> > +     return 0;
> > +}
> > +
> > +static int lan966x_handle_port_obj_del(struct net_device *dev,
> > +                                    struct notifier_block *nb,
> > +                                    struct switchdev_notifier_port_obj_info *info)
> > +{
> > +     const struct switchdev_obj *obj = info->obj;
> > +     int err;
> > +
> > +     switch (obj->id) {
> > +     case SWITCHDEV_OBJ_ID_PORT_VLAN:
> > +             err = lan966x_handle_port_vlan_del(dev, nb,
> > +                                                SWITCHDEV_OBJ_PORT_VLAN(obj));
> > +             break;
> > +     default:
> > +             err = -EOPNOTSUPP;
> > +             break;
> > +     }
> > +
> > +     info->handled = true;
> > +     return err;
> > +}
> > +
> > +static int lan966x_switchdev_blocking_event(struct notifier_block *nb,
> > +                                         unsigned long event,
> > +                                         void *ptr)
> > +{
> > +     struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
> > +     int err;
> > +
> > +     switch (event) {
> > +     case SWITCHDEV_PORT_OBJ_ADD:
> > +             err = lan966x_handle_port_obj_add(dev, nb, ptr);
> > +             return notifier_from_errno(err);
> > +     case SWITCHDEV_PORT_OBJ_DEL:
> > +             err = lan966x_handle_port_obj_del(dev, nb, ptr);
> > +             return notifier_from_errno(err);
> > +     case SWITCHDEV_PORT_ATTR_SET:
> > +             err = switchdev_handle_port_attr_set(dev, ptr,
> > +                                                  lan966x_netdevice_check,
> > +                                                  lan966x_port_attr_set);
> > +             return notifier_from_errno(err);
> > +     }
> > +
> > +     return NOTIFY_DONE;
> > +}
> > +
> > +int lan966x_register_notifier_blocks(struct lan966x *lan966x)
> > +{
> > +     int err;
> > +
> > +     lan966x->netdevice_nb.notifier_call = lan966x_netdevice_event;
> > +     err = register_netdevice_notifier(&lan966x->netdevice_nb);
> > +     if (err)
> > +             return err;
> > +
> > +     lan966x->switchdev_nb.notifier_call = lan966x_switchdev_event;
> > +     err = register_switchdev_notifier(&lan966x->switchdev_nb);
> > +     if (err)
> > +             goto err_switchdev_nb;
> > +
> > +     lan966x->switchdev_blocking_nb.notifier_call = lan966x_switchdev_blocking_event;
> > +     err = register_switchdev_blocking_notifier(&lan966x->switchdev_blocking_nb);
> > +     if (err)
> > +             goto err_switchdev_blocking_nb;
> > +
> > +     lan966x_owq = alloc_ordered_workqueue("lan966x_order", 0);
> > +     if (!lan966x_owq) {
> > +             err = -ENOMEM;
> > +             goto err_switchdev_blocking_nb;
> > +     }
> 
> These should be singleton objects, otherwise things get problematic if
> you have more than one switch device instantiated in the system.

Yes, I will update this.

> 
> > +
> > +     return 0;
> > +
> > +err_switchdev_blocking_nb:
> > +     unregister_switchdev_notifier(&lan966x->switchdev_nb);
> > +err_switchdev_nb:
> > +     unregister_netdevice_notifier(&lan966x->netdevice_nb);
> > +
> > +     return err;
> > +}
> > +
> > +void lan966x_unregister_notifier_blocks(struct lan966x *lan966x)
> > +{
> > +     destroy_workqueue(lan966x_owq);
> > +
> > +     unregister_switchdev_blocking_notifier(&lan966x->switchdev_blocking_nb);
> > +     unregister_switchdev_notifier(&lan966x->switchdev_nb);
> > +     unregister_netdevice_notifier(&lan966x->netdevice_nb);
> > +}
> > diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_vlan.c b/drivers/net/ethernet/microchip/lan966x/lan966x_vlan.c
> > index e47552775d06..26644503b4e6 100644
> > --- a/drivers/net/ethernet/microchip/lan966x/lan966x_vlan.c
> > +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_vlan.c
> > @@ -155,6 +155,9 @@ static bool lan966x_vlan_cpu_member_cpu_vlan_mask(struct lan966x *lan966x, u16 v
> >
> >  u16 lan966x_vlan_port_get_pvid(struct lan966x_port *port)
> >  {
> > +     if (!port->bridge)
> > +             return HOST_PVID;
> > +
> >       return port->vlan_aware ? port->pvid : UNAWARE_PVID;
> >  }
> >
> > @@ -210,6 +213,8 @@ void lan966x_vlan_cpu_set_vlan_aware(struct lan966x_port *port)
> >                * table for the front port and the CPU
> >                */
> >               lan966x_mac_cpu_learn(lan966x, port->dev->dev_addr, UNAWARE_PVID);
> > +             lan966x_mac_cpu_learn(lan966x, port->bridge->dev_addr,
> > +                                   UNAWARE_PVID);
> >
> >               lan966x_vlan_port_add_vlan_mask(port, UNAWARE_PVID);
> >               lan966x_vlan_port_apply(port);
> > @@ -218,6 +223,8 @@ void lan966x_vlan_cpu_set_vlan_aware(struct lan966x_port *port)
> >                * to vlan unaware
> >                */
> >               lan966x_mac_cpu_forget(lan966x, port->dev->dev_addr, UNAWARE_PVID);
> > +             lan966x_mac_cpu_forget(lan966x, port->bridge->dev_addr,
> > +                                    UNAWARE_PVID);
> >
> >               lan966x_vlan_port_del_vlan_mask(port, UNAWARE_PVID);
> >               lan966x_vlan_port_apply(port);
> > @@ -293,6 +300,7 @@ int lan966x_vlan_port_add_vlan(struct lan966x_port *port,
> >        */
> >       if (lan966x_vlan_cpu_member_cpu_vlan_mask(lan966x, vid)) {
> >               lan966x_mac_cpu_learn(lan966x, port->dev->dev_addr, vid);
> > +             lan966x_mac_cpu_learn(lan966x, port->bridge->dev_addr, vid);
> >               lan966x_vlan_cpu_add_vlan_mask(lan966x, vid);
> >       }
> >
> > @@ -322,8 +330,10 @@ int lan966x_vlan_port_del_vlan(struct lan966x_port *port,
> >        * that vlan but still keep it in the mask because it may be needed
> >        * again then another port gets added in tha vlan
> >        */
> > -     if (!lan966x_vlan_port_any_vlan_mask(lan966x, vid))
> > +     if (!lan966x_vlan_port_any_vlan_mask(lan966x, vid)) {
> > +             lan966x_mac_cpu_forget(lan966x, port->bridge->dev_addr, vid);
> >               lan966x_vlan_cpu_del_vlan_mask(lan966x, vid);
> > +     }
> >
> >       return 0;
> >  }
> > --
> > 2.33.0
> >

-- 
/Horatiu

^ permalink raw reply

* Re: [PATCH] bridge: extend BR_ISOLATE to full split-horizon
From: Alexandra Winter @ 2021-12-09 16:23 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Jakub Kicinski, David Lamparter; +Cc: netdev
In-Reply-To: <4d18d015-4154-5a0c-e93d-16b8bdbdaddb@nvidia.com>



On 09.12.21 17:08, Nikolay Aleksandrov wrote:
> On 09/12/2021 17:42, Jakub Kicinski wrote:
>> On Thu,  9 Dec 2021 13:14:32 +0100 David Lamparter wrote:
>>> Split-horizon essentially just means being able to create multiple
>>> groups of isolated ports that are isolated within the group, but not
>>> with respect to each other.
>>>
>>> The intent is very different, while isolation is a policy feature,
>>> split-horizon is intended to provide functional "multiple member ports
>>> are treated as one for loop avoidance."  But it boils down to the same
>>> thing in the end.
>>>
>>> Signed-off-by: David Lamparter <equinox@diac24.net>
>>> Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
>>> Cc: Alexandra Winter <wintera@linux.ibm.com>
>>
>> Does not apply to net-next, you'll need to repost even if the code is
>> good. Please put [PATCH net-next] in the subject.
>>
> 
> Hi,
> For some reason this patch didn't make it to my inbox.. Anyway I was
> able to see it now online, a few comments (sorry can't do them inline due
> to missing mbox patch):
> - please drop the sysfs part, we're not extending sysfs anymore
> - split the bridge change from the driver
> - drop the /* BR_ISOLATED - previously BIT(16) */ comment
> - [IFLA_BRPORT_HORIZON_GROUP] = NLA_POLICY_MIN(NLA_S32, 0), why not just { .type = NLA_U32 } ?
> - just forbid having both set (tb[IFLA_BRPORT_ISOLATED] && tb[IFLA_BRPORT_HORIZON_GROUP])
>   user-space should use just one of the two, if isolated is set then it overwrites any older
>   IFLA_BRPORT_HORIZON_GROUP settings, that should simplify things considerably
Yes, please keep it compatible with userspace setting IFLA_BRPORT_ISOLATED only.
> 
> Why the limitation (UAPI limited to positive signed int. (recommended ifindex namespace)) ?
> You have the full unsigned space available, user-space can use it as it sees fit.
> You can just remove the comment about recommended ifindex.
> 
> Also please extend the port isolation self-test with a test for a different horizon group.
> 
> Thanks,
>  Nik
> 
> 
> 
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox