Netdev List


* [PATCH v4 net-next 2/6] net: dsa: Pass ndo_setup_tc slave callback to drivers
From: Vladimir Oltean @ 2019-09-15  1:59 UTC
  To: f.fainelli, vivien.didelot, andrew, davem, vinicius.gomes,
	vedang.patel, richardcochran
  Cc: weifeng.voon, jiri, m-karicheri2, jose.abreu, ilias.apalodimas,
	jhs, xiyou.wangcong, kurt.kanzenbach, joergen.andreasen, netdev,
	Vladimir Oltean
In-Reply-To: <20190915020003.27926-1-olteanv@gmail.com>

DSA currently handles shared block filters (for the classifier-action
qdisc) in the core due to what I believe are simply pragmatic reasons -
hiding the complexity from drivers and offering a simple API for port
mirroring.

Extend the dsa_slave_setup_tc function by passing all other qdisc
offloads to the driver layer, where the driver may choose what it
implements and how. DSA is simply a pass-through in this case.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes since v2:
- Added Florian Fainelli's Reviewed-by.

Changes since v1:
- Added Kurt Kanzenbach's Acked-by.

Changes since RFC:
- Removed the unused declaration of struct tc_taprio_qopt_offload.

 include/net/dsa.h |  2 ++
 net/dsa/slave.c   | 12 ++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 96acb14ec1a8..541fb514e31d 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -515,6 +515,8 @@ struct dsa_switch_ops {
 				   bool ingress);
 	void	(*port_mirror_del)(struct dsa_switch *ds, int port,
 				   struct dsa_mall_mirror_tc_entry *mirror);
+	int	(*port_setup_tc)(struct dsa_switch *ds, int port,
+				 enum tc_setup_type type, void *type_data);
 
 	/*
 	 * Cross-chip operations
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 9a88035517a6..75d58229a4bd 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1035,12 +1035,16 @@ static int dsa_slave_setup_tc_block(struct net_device *dev,
 static int dsa_slave_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			      void *type_data)
 {
-	switch (type) {
-	case TC_SETUP_BLOCK:
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+
+	if (type == TC_SETUP_BLOCK)
 		return dsa_slave_setup_tc_block(dev, type_data);
-	default:
+
+	if (!ds->ops->port_setup_tc)
 		return -EOPNOTSUPP;
-	}
+
+	return ds->ops->port_setup_tc(ds, dp->index, type, type_data);
 }
 
 static void dsa_slave_get_stats64(struct net_device *dev,
-- 
2.17.1
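
For readers who want to try the dispatch logic outside the kernel, here is
a minimal userspace sketch of the pass-through described above. It is not
the DSA code: every sketch_* name is hypothetical, and the "core" block
handler is a stub. It only mirrors the control flow of the patched
dsa_slave_setup_tc(): block setup stays in the core, everything else goes
to an optional driver callback, and a missing callback yields -EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; illustrative only. */
enum sketch_setup_type {
	SKETCH_SETUP_BLOCK,
	SKETCH_SETUP_QDISC_TAPRIO,
};

struct sketch_switch;

struct sketch_switch_ops {
	/* Optional: drivers without qdisc offloads leave this NULL. */
	int (*port_setup_tc)(struct sketch_switch *ds, int port,
			     enum sketch_setup_type type, void *type_data);
};

struct sketch_switch {
	const struct sketch_switch_ops *ops;
};

/* Stub for the shared block handling that remains in the core. */
static int sketch_core_setup_block(void *type_data)
{
	(void)type_data;
	return 0;
}

/* Same shape as the patched function: core first, then pass-through. */
static int sketch_setup_tc(struct sketch_switch *ds, int port,
			   enum sketch_setup_type type, void *type_data)
{
	if (type == SKETCH_SETUP_BLOCK)
		return sketch_core_setup_block(type_data);

	if (!ds->ops->port_setup_tc)
		return -EOPNOTSUPP;

	return ds->ops->port_setup_tc(ds, port, type, type_data);
}

/* Example driver callback that accepts only the taprio type. */
static int sketch_driver_setup_tc(struct sketch_switch *ds, int port,
				  enum sketch_setup_type type, void *type_data)
{
	(void)ds;
	(void)port;
	(void)type_data;

	return type == SKETCH_SETUP_QDISC_TAPRIO ? 0 : -EOPNOTSUPP;
}
```

A switch whose ops leave port_setup_tc NULL keeps the pre-patch behaviour
for every type other than the block type, so drivers can opt in one
offload at a time.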



* [PATCH v4 net-next 4/6] net: dsa: sja1105: Advertise the 8 TX queues
From: Vladimir Oltean @ 2019-09-15  2:00 UTC
  To: f.fainelli, vivien.didelot, andrew, davem, vinicius.gomes,
	vedang.patel, richardcochran
  Cc: weifeng.voon, jiri, m-karicheri2, jose.abreu, ilias.apalodimas,
	jhs, xiyou.wangcong, kurt.kanzenbach, joergen.andreasen, netdev,
	Vladimir Oltean
In-Reply-To: <20190915020003.27926-1-olteanv@gmail.com>

This is a preparation patch for the tc-taprio offload (and potentially
for other future offloads such as tc-mqprio).

Instead of looking directly at skb->priority during xmit, let's get the
netdev queue and the queue-to-traffic-class mapping, and put the
resulting traffic class into the dsa_8021q PCP field. The switch is
configured with a 1-to-1 PCP-to-ingress-queue-to-egress-queue mapping
(see vlan_pmap in sja1105_main.c), so the effect is that we can inject
into a front-panel port's egress traffic class through VLAN tagging from
Linux, completely transparently.

Unfortunately the switch doesn't look at the VLAN PCP in the case of
management traffic to/from the CPU (link-local frames at
01-80-C2-xx-xx-xx or 01-1B-19-xx-xx-xx), so we can't alter the
transmission queue of this type of traffic on a frame-by-frame basis. It
is only selected through the "hostprio" setting, which is currently
hardcoded in the driver to 7.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
---
Changes since v2:
- None.

Changes since v1:
- None, but the use of netdev_txq_to_tc is now finally correct after
  adjusting the gate_mask meaning in the taprio offload structure.

Changes since RFC:
- None.

 drivers/net/dsa/sja1105/sja1105_main.c | 7 ++++++-
 net/dsa/tag_sja1105.c                  | 3 ++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c
index d8cff0107ec4..108f62c27c28 100644
--- a/drivers/net/dsa/sja1105/sja1105_main.c
+++ b/drivers/net/dsa/sja1105/sja1105_main.c
@@ -384,7 +384,9 @@ static int sja1105_init_general_params(struct sja1105_private *priv)
 		/* Disallow dynamic changing of the mirror port */
 		.mirr_ptacu = 0,
 		.switchid = priv->ds->index,
-		/* Priority queue for link-local frames trapped to CPU */
+		/* Priority queue for link-local management frames
+		 * (both ingress to and egress from CPU - PTP, STP etc)
+		 */
 		.hostprio = 7,
 		.mac_fltres1 = SJA1105_LINKLOCAL_FILTER_A,
 		.mac_flt1    = SJA1105_LINKLOCAL_FILTER_A_MASK,
@@ -1711,6 +1713,9 @@ static int sja1105_setup(struct dsa_switch *ds)
 	 */
 	ds->vlan_filtering_is_global = true;
 
+	/* Advertise the 8 egress queues */
+	ds->num_tx_queues = SJA1105_NUM_TC;
+
 	/* The DSA/switchdev model brings up switch ports in standalone mode by
 	 * default, and that means vlan_filtering is 0 since they're not under
 	 * a bridge, so it's safe to set up switch tagging at this time.
diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
index 47ee88163a9d..9c9aff3e52cf 100644
--- a/net/dsa/tag_sja1105.c
+++ b/net/dsa/tag_sja1105.c
@@ -89,7 +89,8 @@ static struct sk_buff *sja1105_xmit(struct sk_buff *skb,
 	struct dsa_port *dp = dsa_slave_to_port(netdev);
 	struct dsa_switch *ds = dp->ds;
 	u16 tx_vid = dsa_8021q_tx_vid(ds, dp->index);
-	u8 pcp = skb->priority;
+	u16 queue_mapping = skb_get_queue_mapping(skb);
+	u8 pcp = netdev_txq_to_tc(netdev, queue_mapping);
 
 	/* Transmitting management traffic does not rely upon switch tagging,
 	 * but instead SPI-installed management routes. Part 2 of this
-- 
2.17.1
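
The queue-to-traffic-class lookup and the PCP placement described in the
commit message can be sketched in userspace as follows. This is a model
under assumptions, not the kernel code: the sketch_* names are
hypothetical, each traffic class is assumed to own one contiguous range of
queues (count@offset, as in mqprio-style mappings), and the TCI layout is
the standard 802.1Q one (PCP in bits 15:13, VID in bits 11:0).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_TC 8

/* One contiguous range of TX queues belonging to a traffic class. */
struct sketch_tc_txq {
	uint16_t count;
	uint16_t offset;
};

struct sketch_netdev {
	unsigned int num_tc;	/* 0 means no tc mapping is configured */
	struct sketch_tc_txq tc_to_txq[SKETCH_MAX_TC];
};

/* Map a TX queue index to its traffic class; -1 if no class owns it. */
static int sketch_txq_to_tc(const struct sketch_netdev *dev, unsigned int txq)
{
	unsigned int i;

	if (!dev->num_tc)
		return 0;

	for (i = 0; i < dev->num_tc && i < SKETCH_MAX_TC; i++) {
		const struct sketch_tc_txq *tc = &dev->tc_to_txq[i];

		if (txq >= tc->offset && txq - tc->offset < tc->count)
			return (int)i;
	}

	return -1;
}

/* Build an 802.1Q TCI: 3-bit PCP in bits 15:13, 12-bit VID in bits 11:0. */
static uint16_t sketch_vlan_tci(uint8_t pcp, uint16_t vid)
{
	return (uint16_t)(((pcp & 0x7u) << 13) | (vid & 0x0fffu));
}
```

With eight traffic classes of one queue each, queue N maps to traffic
class N, which is the situation the sja1105 patch relies on when it
advertises 8 TX queues.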



* [PATCH v4 net-next 1/6] taprio: Add support for hardware offloading
From: Vladimir Oltean @ 2019-09-15  1:59 UTC
  To: f.fainelli, vivien.didelot, andrew, davem, vinicius.gomes,
	vedang.patel, richardcochran
  Cc: weifeng.voon, jiri, m-karicheri2, jose.abreu, ilias.apalodimas,
	jhs, xiyou.wangcong, kurt.kanzenbach, joergen.andreasen, netdev,
	Vladimir Oltean
In-Reply-To: <20190915020003.27926-1-olteanv@gmail.com>

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

This allows taprio to offload the schedule enforcement to capable
network cards, resulting in more precise windows and less CPU usage.

The gate mask acts on traffic classes (groups of queues of same
priority), as specified in IEEE 802.1Q-2018, and following the existing
taprio and mqprio semantics.
It is up to the driver to perform conversion between tc and individual
netdev queues if for some reason it needs to make that distinction.
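
To make the gate mask semantics concrete, here is a small userspace
sketch of how a list of (gate_mask, interval) entries could be evaluated
per traffic class. It is a simplification under stated assumptions: the
sketch_* names are hypothetical, the cycle length is taken to be the sum
of the intervals, and base_time and cycle_time_extension are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_entry {
	uint32_t gate_mask;	/* bit N set: gate of traffic class N is open */
	uint32_t interval;	/* duration of this entry, in nanoseconds */
};

/*
 * Return true if the gate of traffic class @tc is open @offset_ns
 * nanoseconds after the start of the schedule. The schedule repeats
 * once all entries have elapsed.
 */
static bool sketch_gate_open(const struct sketch_entry *entries, size_t n,
			     uint64_t offset_ns, unsigned int tc)
{
	uint64_t cycle = 0;
	uint64_t t;
	size_t i;

	for (i = 0; i < n; i++)
		cycle += entries[i].interval;

	if (!cycle || tc >= 32)
		return false;

	/* Position inside the current cycle. */
	t = offset_ns % cycle;

	for (i = 0; i < n; i++) {
		if (t < entries[i].interval)
			return (entries[i].gate_mask & (1u << tc)) != 0;
		t -= entries[i].interval;
	}

	return false;
}
```

Note that the mask is indexed by traffic class, not by netdev queue; a
driver with several queues per class would still have to expand the bit
to its own queue range.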

Full offload is requested from the network interface by specifying
"flags 2" in the tc qdisc creation command, which in turn corresponds to
the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.

The important detail here is the clockid which is implicitly /dev/ptpN
for full offload, and hence not configurable.

A reference counting API is added to support the use case where Ethernet
drivers need to keep the taprio offload structure locally (i.e. they are
a multi-port switch driver, and configuring a port depends on the
settings of other ports as well). The refcount_t variable is kept in a
private structure (__tc_taprio_qopt_offload) and not exposed to drivers.
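
The wrapper pattern can be illustrated with the userspace sketch below.
It is only a model: the sketch_* names are hypothetical, the public
structure is trimmed (no entries array), and a plain counter stands in
for the kernel's refcount_t, so it is not safe for concurrent use. The
point is that callers only ever hold a pointer to the public part, and
the helpers recover the private wrapper from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Public part: the only thing a driver would see (trimmed). */
struct sketch_offload {
	int enable;
	size_t num_entries;
};

/* Private wrapper: the counter is not exposed to drivers. */
struct sketch_offload_priv {
	unsigned int users;	/* plain counter; the kernel uses refcount_t */
	struct sketch_offload offload;
};

/* Number of wrappers currently allocated, for demonstration only. */
static int sketch_live_objects;

/* Recover the enclosing structure from a pointer to one of its members. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct sketch_offload *sketch_offload_alloc(void)
{
	struct sketch_offload_priv *priv = calloc(1, sizeof(*priv));

	if (!priv)
		return NULL;

	priv->users = 1;
	sketch_live_objects++;

	return &priv->offload;
}

/* Take an extra reference, e.g. when a driver keeps the pointer around. */
static struct sketch_offload *sketch_offload_get(struct sketch_offload *offload)
{
	struct sketch_offload_priv *priv =
		sketch_container_of(offload, struct sketch_offload_priv, offload);

	priv->users++;

	return offload;
}

/* Drop a reference; the memory is released with the last one. */
static void sketch_offload_put(struct sketch_offload *offload)
{
	struct sketch_offload_priv *priv =
		sketch_container_of(offload, struct sketch_offload_priv, offload);

	assert(priv->users > 0);
	if (--priv->users)
		return;

	free(priv);
	sketch_live_objects--;
}
```

In this model, the qdisc side drops its reference right after handing the
structure to the driver, and the object survives only if the driver took
its own reference in the meantime.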

In the future, the private structure might also be expanded with a
backpointer to taprio_sched *q, to implement the notification system
described in the patch (signalling when the admin schedule became oper,
when an error occurred, etc), so that the offload can be monitored with
'tc qdisc show'.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
---
Changes since v2:
- None.

Changes since v1:
- Turned the next_sched hrtimer function into a simple
  taprio_offload_config_changed function called synchronously (for now)
  from taprio_enable_offload. But the idea is that the driver may have a
  lot more means to figure out when the admin schedule is no longer
  pending (perhaps even an interrupt), so leave an open window for
  implementing a notification system from the driver.
- Made it an error to specify 'clockid' with full offload.
- Created a wrapper __tc_taprio_qopt_offload structure which holds the
  refcount_t (for now) and maybe a backpointer to the qdisc_priv in the
  future.
- Renamed taprio_get and taprio_free to taprio_offload_get and
  taprio_offload_free. Renamed the "taprio" variable to "offload".
- Moved the reference counting helper implementations to sch_taprio.c.
- Removed the tc_mask_to_queue_mask manipulation done to the gate_mask
  before passing it on to drivers. Instead of netdev queue gates, they
  now see a mask of traffic class gates, which:
  - They need to care about anyway, if they have a multi-queue device
    and they need to configure the queue-to-tc hardware mapping.
  - Makes no difference to them if the hardware makes no distinction
    between queue and traffic class (there is only one egress queue per
    tc, having a fixed priority). The sja1105 hw is in this situation.

Changes since RFC:
- Made the combination of FULL_OFFLOAD and TXTIME_ASSIST invalid.
- Made ndo_setup_tc be called from sleepable context.
- Added a taprio_alloc helper to avoid passing stack memory to drivers.
- Made taprio_disable_offload take the extack as well.
- Conditioned the setup of the software (and txtime-assisted)
  implementation of taprio on there not being a full offload in place.
- Fixed a lockdep-related compilation bug.

 include/linux/netdevice.h      |   1 +
 include/net/pkt_sched.h        |  23 ++
 include/uapi/linux/pkt_sched.h |   3 +-
 net/sched/sch_taprio.c         | 409 +++++++++++++++++++++++++++++----
 4 files changed, 392 insertions(+), 44 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d7d5626002e9..9eda1c31d1f7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -847,6 +847,7 @@ enum tc_setup_type {
 	TC_SETUP_QDISC_ETF,
 	TC_SETUP_ROOT_QDISC,
 	TC_SETUP_QDISC_GRED,
+	TC_SETUP_QDISC_TAPRIO,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index a16fbe9a2a67..d1632979622e 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -161,4 +161,27 @@ struct tc_etf_qopt_offload {
 	s32 queue;
 };
 
+struct tc_taprio_sched_entry {
+	u8 command; /* TC_TAPRIO_CMD_* */
+
+	/* The gate_mask in the offloading side refers to traffic classes */
+	u32 gate_mask;
+	u32 interval;
+};
+
+struct tc_taprio_qopt_offload {
+	u8 enable;
+	ktime_t base_time;
+	u64 cycle_time;
+	u64 cycle_time_extension;
+
+	size_t num_entries;
+	struct tc_taprio_sched_entry entries[0];
+};
+
+/* Reference counting */
+struct tc_taprio_qopt_offload *taprio_offload_get(struct tc_taprio_qopt_offload
+						  *offload);
+void taprio_offload_free(struct tc_taprio_qopt_offload *offload);
+
 #endif
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 18f185299f47..5011259b8f67 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1160,7 +1160,8 @@ enum {
  *       [TCA_TAPRIO_ATTR_SCHED_ENTRY_INTERVAL]
  */
 
-#define TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST 0x1
+#define TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST	BIT(0)
+#define TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD	BIT(1)
 
 enum {
 	TCA_TAPRIO_ATTR_UNSPEC,
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 84b863e2bdbd..2f7b34205c82 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -29,8 +29,8 @@ static DEFINE_SPINLOCK(taprio_list_lock);
 
 #define TAPRIO_ALL_GATES_OPEN -1
 
-#define FLAGS_VALID(flags) (!((flags) & ~TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST))
 #define TXTIME_ASSIST_IS_ENABLED(flags) ((flags) & TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST)
+#define FULL_OFFLOAD_IS_ENABLED(flags) ((flags) & TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD)
 
 struct sched_entry {
 	struct list_head list;
@@ -75,9 +75,16 @@ struct taprio_sched {
 	struct sched_gate_list __rcu *admin_sched;
 	struct hrtimer advance_timer;
 	struct list_head taprio_list;
+	struct sk_buff *(*dequeue)(struct Qdisc *sch);
+	struct sk_buff *(*peek)(struct Qdisc *sch);
 	u32 txtime_delay;
 };
 
+struct __tc_taprio_qopt_offload {
+	refcount_t users;
+	struct tc_taprio_qopt_offload offload;
+};
+
 static ktime_t sched_base_time(const struct sched_gate_list *sched)
 {
 	if (!sched)
@@ -268,6 +275,19 @@ static bool is_valid_interval(struct sk_buff *skb, struct Qdisc *sch)
 	return entry;
 }
 
+static bool taprio_flags_valid(u32 flags)
+{
+	/* Make sure no other flag bits are set. */
+	if (flags & ~(TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST |
+		      TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD))
+		return false;
+	/* txtime-assist and full offload are mutually exclusive */
+	if ((flags & TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST) &&
+	    (flags & TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD))
+		return false;
+	return true;
+}
+
 /* This returns the tstamp value set by TCP in terms of the set clock. */
 static ktime_t get_tcp_tstamp(struct taprio_sched *q, struct sk_buff *skb)
 {
@@ -417,7 +437,7 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	return qdisc_enqueue(skb, child, to_free);
 }
 
-static struct sk_buff *taprio_peek(struct Qdisc *sch)
+static struct sk_buff *taprio_peek_soft(struct Qdisc *sch)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
@@ -461,6 +481,36 @@ static struct sk_buff *taprio_peek(struct Qdisc *sch)
 	return NULL;
 }
 
+static struct sk_buff *taprio_peek_offload(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct sk_buff *skb;
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		struct Qdisc *child = q->qdiscs[i];
+
+		if (unlikely(!child))
+			continue;
+
+		skb = child->ops->peek(child);
+		if (!skb)
+			continue;
+
+		return skb;
+	}
+
+	return NULL;
+}
+
+static struct sk_buff *taprio_peek(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+
+	return q->peek(sch);
+}
+
 static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
 {
 	atomic_set(&entry->budget,
@@ -468,7 +518,7 @@ static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
 			     atomic64_read(&q->picos_per_byte)));
 }
 
-static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
+static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
@@ -550,6 +600,40 @@ static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
 	return skb;
 }
 
+static struct sk_buff *taprio_dequeue_offload(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct sk_buff *skb;
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		struct Qdisc *child = q->qdiscs[i];
+
+		if (unlikely(!child))
+			continue;
+
+		skb = child->ops->dequeue(child);
+		if (unlikely(!skb))
+			continue;
+
+		qdisc_bstats_update(sch, skb);
+		qdisc_qstats_backlog_dec(sch, skb);
+		sch->q.qlen--;
+
+		return skb;
+	}
+
+	return NULL;
+}
+
+static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+
+	return q->dequeue(sch);
+}
+
 static bool should_restart_cycle(const struct sched_gate_list *oper,
 				 const struct sched_entry *entry)
 {
@@ -932,6 +1016,9 @@ static void taprio_start_sched(struct Qdisc *sch,
 	struct taprio_sched *q = qdisc_priv(sch);
 	ktime_t expires;
 
+	if (FULL_OFFLOAD_IS_ENABLED(q->flags))
+		return;
+
 	expires = hrtimer_get_expires(&q->advance_timer);
 	if (expires == 0)
 		expires = KTIME_MAX;
@@ -1011,6 +1098,254 @@ static void setup_txtime(struct taprio_sched *q,
 	}
 }
 
+static struct tc_taprio_qopt_offload *taprio_offload_alloc(int num_entries)
+{
+	size_t size = sizeof(struct tc_taprio_sched_entry) * num_entries +
+		      sizeof(struct __tc_taprio_qopt_offload);
+	struct __tc_taprio_qopt_offload *__offload;
+
+	__offload = kzalloc(size, GFP_KERNEL);
+	if (!__offload)
+		return NULL;
+
+	refcount_set(&__offload->users, 1);
+
+	return &__offload->offload;
+}
+
+struct tc_taprio_qopt_offload *taprio_offload_get(struct tc_taprio_qopt_offload
+						  *offload)
+{
+	struct __tc_taprio_qopt_offload *__offload;
+
+	__offload = container_of(offload, struct __tc_taprio_qopt_offload,
+				 offload);
+
+	refcount_inc(&__offload->users);
+
+	return offload;
+}
+EXPORT_SYMBOL_GPL(taprio_offload_get);
+
+void taprio_offload_free(struct tc_taprio_qopt_offload *offload)
+{
+	struct __tc_taprio_qopt_offload *__offload;
+
+	__offload = container_of(offload, struct __tc_taprio_qopt_offload,
+				 offload);
+
+	if (!refcount_dec_and_test(&__offload->users))
+		return;
+
+	kfree(__offload);
+}
+EXPORT_SYMBOL_GPL(taprio_offload_free);
+
+/* The function will only serve to keep the pointers to the "oper" and "admin"
+ * schedules valid in relation to their base times, so when calling dump() the
+ * user looks at the right schedules.
+ * When using full offload, the admin configuration is promoted to oper at the
+ * base_time in the PHC time domain.  But because the system time is not
+ * necessarily in sync with that, we can't just trigger a hrtimer to call
+ * switch_schedules at the right hardware time.
+ * At the moment we call this by hand right away from taprio, but in the future
+ * it will be useful to create a mechanism for drivers to notify taprio of the
+ * offload state (PENDING, ACTIVE, INACTIVE) so it can be visible in dump().
+ * This is left as TODO.
+ */
+void taprio_offload_config_changed(struct taprio_sched *q)
+{
+	struct sched_gate_list *oper, *admin;
+
+	spin_lock(&q->current_entry_lock);
+
+	oper = rcu_dereference_protected(q->oper_sched,
+					 lockdep_is_held(&q->current_entry_lock));
+	admin = rcu_dereference_protected(q->admin_sched,
+					  lockdep_is_held(&q->current_entry_lock));
+
+	switch_schedules(q, &admin, &oper);
+
+	spin_unlock(&q->current_entry_lock);
+}
+
+static void taprio_sched_to_offload(struct taprio_sched *q,
+				    struct sched_gate_list *sched,
+				    const struct tc_mqprio_qopt *mqprio,
+				    struct tc_taprio_qopt_offload *offload)
+{
+	struct sched_entry *entry;
+	int i = 0;
+
+	offload->base_time = sched->base_time;
+	offload->cycle_time = sched->cycle_time;
+	offload->cycle_time_extension = sched->cycle_time_extension;
+
+	list_for_each_entry(entry, &sched->entries, list) {
+		struct tc_taprio_sched_entry *e = &offload->entries[i];
+
+		e->command = entry->command;
+		e->interval = entry->interval;
+		e->gate_mask = entry->gate_mask;
+		i++;
+	}
+
+	offload->num_entries = i;
+}
+
+static int taprio_enable_offload(struct net_device *dev,
+				 struct tc_mqprio_qopt *mqprio,
+				 struct taprio_sched *q,
+				 struct sched_gate_list *sched,
+				 struct netlink_ext_ack *extack)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	struct tc_taprio_qopt_offload *offload;
+	int err = 0;
+
+	if (!ops->ndo_setup_tc) {
+		NL_SET_ERR_MSG(extack,
+			       "Device does not support taprio offload");
+		return -EOPNOTSUPP;
+	}
+
+	offload = taprio_offload_alloc(sched->num_entries);
+	if (!offload) {
+		NL_SET_ERR_MSG(extack,
+			       "Not enough memory for enabling offload mode");
+		return -ENOMEM;
+	}
+	offload->enable = 1;
+	taprio_sched_to_offload(q, sched, mqprio, offload);
+
+	err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_TAPRIO, offload);
+	if (err < 0) {
+		NL_SET_ERR_MSG(extack,
+			       "Device failed to setup taprio offload");
+		goto done;
+	}
+
+	taprio_offload_config_changed(q);
+
+done:
+	taprio_offload_free(offload);
+
+	return err;
+}
+
+static int taprio_disable_offload(struct net_device *dev,
+				  struct taprio_sched *q,
+				  struct netlink_ext_ack *extack)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	struct tc_taprio_qopt_offload *offload;
+	int err;
+
+	if (!FULL_OFFLOAD_IS_ENABLED(q->flags))
+		return 0;
+
+	if (!ops->ndo_setup_tc)
+		return -EOPNOTSUPP;
+
+	offload = taprio_offload_alloc(0);
+	if (!offload) {
+		NL_SET_ERR_MSG(extack,
+			       "Not enough memory to disable offload mode");
+		return -ENOMEM;
+	}
+	offload->enable = 0;
+
+	err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_TAPRIO, offload);
+	if (err < 0) {
+		NL_SET_ERR_MSG(extack,
+			       "Device failed to disable offload");
+		goto out;
+	}
+
+out:
+	taprio_offload_free(offload);
+
+	return err;
+}
+
+/* If full offload is enabled, the only possible clockid is the net device's
+ * PHC. For that reason, specifying a clockid through netlink is incorrect.
+ * For txtime-assist, it is implicitly assumed that the device's PHC is kept
+ * in sync with the specified clockid via a user space daemon such as phc2sys.
+ * For both software taprio and txtime-assist, the clockid is used for the
+ * hrtimer that advances the schedule and hence mandatory.
+ */
+static int taprio_parse_clockid(struct Qdisc *sch, struct nlattr **tb,
+				struct netlink_ext_ack *extack)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	int err = -EINVAL;
+
+	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
+		const struct ethtool_ops *ops = dev->ethtool_ops;
+		struct ethtool_ts_info info = {
+			.cmd = ETHTOOL_GET_TS_INFO,
+			.phc_index = -1,
+		};
+
+		if (tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]) {
+			NL_SET_ERR_MSG(extack,
+				       "The 'clockid' cannot be specified for full offload");
+			goto out;
+		}
+
+		if (ops && ops->get_ts_info)
+			err = ops->get_ts_info(dev, &info);
+
+		if (err || info.phc_index < 0) {
+			NL_SET_ERR_MSG(extack,
+				       "Device does not have a PTP clock");
+			err = -ENOTSUPP;
+			goto out;
+		}
+	} else if (tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]) {
+		int clockid = nla_get_s32(tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]);
+
+		/* We only support static clockids and we don't allow
+		 * for it to be modified after the first init.
+		 */
+		if (clockid < 0 ||
+		    (q->clockid != -1 && q->clockid != clockid)) {
+			NL_SET_ERR_MSG(extack,
+				       "Changing the 'clockid' of a running schedule is not supported");
+			err = -ENOTSUPP;
+			goto out;
+		}
+
+		switch (clockid) {
+		case CLOCK_REALTIME:
+			q->tk_offset = TK_OFFS_REAL;
+			break;
+		case CLOCK_MONOTONIC:
+			q->tk_offset = TK_OFFS_MAX;
+			break;
+		case CLOCK_BOOTTIME:
+			q->tk_offset = TK_OFFS_BOOT;
+			break;
+		case CLOCK_TAI:
+			q->tk_offset = TK_OFFS_TAI;
+			break;
+		default:
+			NL_SET_ERR_MSG(extack, "Invalid 'clockid'");
+			err = -EINVAL;
+			goto out;
+		}
+
+		q->clockid = clockid;
+	} else {
+		NL_SET_ERR_MSG(extack, "Specifying a 'clockid' is mandatory");
+		goto out;
+	}
+out:
+	return err;
+}
+
 static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 			 struct netlink_ext_ack *extack)
 {
@@ -1020,9 +1355,9 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 	struct net_device *dev = qdisc_dev(sch);
 	struct tc_mqprio_qopt *mqprio = NULL;
 	u32 taprio_flags = 0;
-	int i, err, clockid;
 	unsigned long flags;
 	ktime_t start;
+	int i, err;
 
 	err = nla_parse_nested_deprecated(tb, TCA_TAPRIO_ATTR_MAX, opt,
 					  taprio_policy, extack);
@@ -1038,7 +1373,7 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		if (q->flags != 0 && q->flags != taprio_flags) {
 			NL_SET_ERR_MSG_MOD(extack, "Changing 'flags' of a running schedule is not supported");
 			return -EOPNOTSUPP;
-		} else if (!FLAGS_VALID(taprio_flags)) {
+		} else if (!taprio_flags_valid(taprio_flags)) {
 			NL_SET_ERR_MSG_MOD(extack, "Specified 'flags' are not valid");
 			return -EINVAL;
 		}
@@ -1078,30 +1413,19 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		goto free_sched;
 	}
 
-	if (tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]) {
-		clockid = nla_get_s32(tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]);
-
-		/* We only support static clockids and we don't allow
-		 * for it to be modified after the first init.
-		 */
-		if (clockid < 0 ||
-		    (q->clockid != -1 && q->clockid != clockid)) {
-			NL_SET_ERR_MSG(extack, "Changing the 'clockid' of a running schedule is not supported");
-			err = -ENOTSUPP;
-			goto free_sched;
-		}
-
-		q->clockid = clockid;
-	}
-
-	if (q->clockid == -1 && !tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]) {
-		NL_SET_ERR_MSG(extack, "Specifying a 'clockid' is mandatory");
-		err = -EINVAL;
+	err = taprio_parse_clockid(sch, tb, extack);
+	if (err < 0)
 		goto free_sched;
-	}
 
 	taprio_set_picos_per_byte(dev, q);
 
+	if (FULL_OFFLOAD_IS_ENABLED(taprio_flags))
+		err = taprio_enable_offload(dev, mqprio, q, new_admin, extack);
+	else
+		err = taprio_disable_offload(dev, q, extack);
+	if (err)
+		goto free_sched;
+
 	/* Protects against enqueue()/dequeue() */
 	spin_lock_bh(qdisc_lock(sch));
 
@@ -1116,6 +1440,7 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 	}
 
 	if (!TXTIME_ASSIST_IS_ENABLED(taprio_flags) &&
+	    !FULL_OFFLOAD_IS_ENABLED(taprio_flags) &&
 	    !hrtimer_active(&q->advance_timer)) {
 		hrtimer_init(&q->advance_timer, q->clockid, HRTIMER_MODE_ABS);
 		q->advance_timer.function = advance_sched;
@@ -1134,23 +1459,15 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 					       mqprio->prio_tc_map[i]);
 	}
 
-	switch (q->clockid) {
-	case CLOCK_REALTIME:
-		q->tk_offset = TK_OFFS_REAL;
-		break;
-	case CLOCK_MONOTONIC:
-		q->tk_offset = TK_OFFS_MAX;
-		break;
-	case CLOCK_BOOTTIME:
-		q->tk_offset = TK_OFFS_BOOT;
-		break;
-	case CLOCK_TAI:
-		q->tk_offset = TK_OFFS_TAI;
-		break;
-	default:
-		NL_SET_ERR_MSG(extack, "Invalid 'clockid'");
-		err = -EINVAL;
-		goto unlock;
+	if (FULL_OFFLOAD_IS_ENABLED(taprio_flags)) {
+		q->dequeue = taprio_dequeue_offload;
+		q->peek = taprio_peek_offload;
+	} else {
+		/* Be sure to always keep the function pointers
+		 * in a consistent state.
+		 */
+		q->dequeue = taprio_dequeue_soft;
+		q->peek = taprio_peek_soft;
 	}
 
 	err = taprio_get_start_time(sch, new_admin, &start);
@@ -1212,6 +1529,8 @@ static void taprio_destroy(struct Qdisc *sch)
 
 	hrtimer_cancel(&q->advance_timer);
 
+	taprio_disable_offload(dev, q, NULL);
+
 	if (q->qdiscs) {
 		for (i = 0; i < dev->num_tx_queues && q->qdiscs[i]; i++)
 			qdisc_put(q->qdiscs[i]);
@@ -1241,6 +1560,9 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
 	hrtimer_init(&q->advance_timer, CLOCK_TAI, HRTIMER_MODE_ABS);
 	q->advance_timer.function = advance_sched;
 
+	q->dequeue = taprio_dequeue_soft;
+	q->peek = taprio_peek_soft;
+
 	q->root = sch;
 
 	/* We only support static clockids. Use an invalid value as default
@@ -1423,7 +1745,8 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put(skb, TCA_TAPRIO_ATTR_PRIOMAP, sizeof(opt), &opt))
 		goto options_error;
 
-	if (nla_put_s32(skb, TCA_TAPRIO_ATTR_SCHED_CLOCKID, q->clockid))
+	if (!FULL_OFFLOAD_IS_ENABLED(q->flags) &&
+	    nla_put_s32(skb, TCA_TAPRIO_ATTR_SCHED_CLOCKID, q->clockid))
 		goto options_error;
 
 	if (q->flags && nla_put_u32(skb, TCA_TAPRIO_ATTR_FLAGS, q->flags))
-- 
2.17.1
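
The flag rules introduced by this patch can be exercised in isolation
with the userspace sketch below. The bit positions follow the uapi
definitions in the diff (bit 0 for txtime-assist, bit 1 for full offload,
hence "flags 2" on the command line); the sketch_* names themselves are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FLAG_TXTIME_ASSIST	(1u << 0)
#define SKETCH_FLAG_FULL_OFFLOAD	(1u << 1)

/* Accept no flags, or exactly one of the two known modes. */
static bool sketch_flags_valid(uint32_t flags)
{
	const uint32_t known = SKETCH_FLAG_TXTIME_ASSIST |
			       SKETCH_FLAG_FULL_OFFLOAD;

	/* Reject any bit outside the known set. */
	if (flags & ~known)
		return false;

	/* txtime-assist and full offload are mutually exclusive. */
	if ((flags & known) == known)
		return false;

	return true;
}
```

Under these rules the values 0, 1 and 2 are accepted, while 3 (both modes
at once) and anything with a higher bit set are rejected.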



* [PATCH v4 net-next 3/6] net: dsa: sja1105: Add static config tables for scheduling
From: Vladimir Oltean @ 2019-09-15  2:00 UTC
  To: f.fainelli, vivien.didelot, andrew, davem, vinicius.gomes,
	vedang.patel, richardcochran
  Cc: weifeng.voon, jiri, m-karicheri2, jose.abreu, ilias.apalodimas,
	jhs, xiyou.wangcong, kurt.kanzenbach, joergen.andreasen, netdev,
	Vladimir Oltean
In-Reply-To: <20190915020003.27926-1-olteanv@gmail.com>

In order to support tc-taprio offload, the TTEthernet egress scheduling
core registers must be made visible through the static interface.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
---
Changes since v2:
- None.

Changes since v1:
- None.

Changes since RFC:
- None.

 .../net/dsa/sja1105/sja1105_dynamic_config.c  |   8 +
 .../net/dsa/sja1105/sja1105_static_config.c   | 167 ++++++++++++++++++
 .../net/dsa/sja1105/sja1105_static_config.h   |  48 ++++-
 3 files changed, 222 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/sja1105/sja1105_dynamic_config.c b/drivers/net/dsa/sja1105/sja1105_dynamic_config.c
index 9988c9d18567..91da430045ff 100644
--- a/drivers/net/dsa/sja1105/sja1105_dynamic_config.c
+++ b/drivers/net/dsa/sja1105/sja1105_dynamic_config.c
@@ -488,6 +488,8 @@ sja1105et_general_params_entry_packing(void *buf, void *entry_ptr,
 
 /* SJA1105E/T: First generation */
 struct sja1105_dynamic_table_ops sja1105et_dyn_ops[BLK_IDX_MAX_DYN] = {
+	[BLK_IDX_SCHEDULE] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS] = {0},
 	[BLK_IDX_L2_LOOKUP] = {
 		.entry_packing = sja1105et_dyn_l2_lookup_entry_packing,
 		.cmd_packing = sja1105et_l2_lookup_cmd_packing,
@@ -529,6 +531,8 @@ struct sja1105_dynamic_table_ops sja1105et_dyn_ops[BLK_IDX_MAX_DYN] = {
 		.packed_size = SJA1105ET_SIZE_MAC_CONFIG_DYN_CMD,
 		.addr = 0x36,
 	},
+	[BLK_IDX_SCHEDULE_PARAMS] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS] = {0},
 	[BLK_IDX_L2_LOOKUP_PARAMS] = {
 		.entry_packing = sja1105et_l2_lookup_params_entry_packing,
 		.cmd_packing = sja1105et_l2_lookup_params_cmd_packing,
@@ -552,6 +556,8 @@ struct sja1105_dynamic_table_ops sja1105et_dyn_ops[BLK_IDX_MAX_DYN] = {
 
 /* SJA1105P/Q/R/S: Second generation */
 struct sja1105_dynamic_table_ops sja1105pqrs_dyn_ops[BLK_IDX_MAX_DYN] = {
+	[BLK_IDX_SCHEDULE] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS] = {0},
 	[BLK_IDX_L2_LOOKUP] = {
 		.entry_packing = sja1105pqrs_dyn_l2_lookup_entry_packing,
 		.cmd_packing = sja1105pqrs_l2_lookup_cmd_packing,
@@ -593,6 +599,8 @@ struct sja1105_dynamic_table_ops sja1105pqrs_dyn_ops[BLK_IDX_MAX_DYN] = {
 		.packed_size = SJA1105PQRS_SIZE_MAC_CONFIG_DYN_CMD,
 		.addr = 0x4B,
 	},
+	[BLK_IDX_SCHEDULE_PARAMS] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS] = {0},
 	[BLK_IDX_L2_LOOKUP_PARAMS] = {
 		.entry_packing = sja1105et_l2_lookup_params_entry_packing,
 		.cmd_packing = sja1105et_l2_lookup_params_cmd_packing,
diff --git a/drivers/net/dsa/sja1105/sja1105_static_config.c b/drivers/net/dsa/sja1105/sja1105_static_config.c
index b31c737dc560..0d03e13e9909 100644
--- a/drivers/net/dsa/sja1105/sja1105_static_config.c
+++ b/drivers/net/dsa/sja1105/sja1105_static_config.c
@@ -371,6 +371,63 @@ size_t sja1105pqrs_mac_config_entry_packing(void *buf, void *entry_ptr,
 	return size;
 }
 
+static size_t
+sja1105_schedule_entry_points_params_entry_packing(void *buf, void *entry_ptr,
+						   enum packing_op op)
+{
+	struct sja1105_schedule_entry_points_params_entry *entry = entry_ptr;
+	const size_t size = SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_PARAMS_ENTRY;
+
+	sja1105_packing(buf, &entry->clksrc,    31, 30, size, op);
+	sja1105_packing(buf, &entry->actsubsch, 29, 27, size, op);
+	return size;
+}
+
+static size_t
+sja1105_schedule_entry_points_entry_packing(void *buf, void *entry_ptr,
+					    enum packing_op op)
+{
+	struct sja1105_schedule_entry_points_entry *entry = entry_ptr;
+	const size_t size = SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_ENTRY;
+
+	sja1105_packing(buf, &entry->subschindx, 31, 29, size, op);
+	sja1105_packing(buf, &entry->delta,      28, 11, size, op);
+	sja1105_packing(buf, &entry->address,    10, 1,  size, op);
+	return size;
+}
+
+static size_t sja1105_schedule_params_entry_packing(void *buf, void *entry_ptr,
+						    enum packing_op op)
+{
+	const size_t size = SJA1105_SIZE_SCHEDULE_PARAMS_ENTRY;
+	struct sja1105_schedule_params_entry *entry = entry_ptr;
+	int offset, i;
+
+	for (i = 0, offset = 16; i < 8; i++, offset += 10)
+		sja1105_packing(buf, &entry->subscheind[i],
+				offset + 9, offset + 0, size, op);
+	return size;
+}
+
+static size_t sja1105_schedule_entry_packing(void *buf, void *entry_ptr,
+					     enum packing_op op)
+{
+	const size_t size = SJA1105_SIZE_SCHEDULE_ENTRY;
+	struct sja1105_schedule_entry *entry = entry_ptr;
+
+	sja1105_packing(buf, &entry->winstindex,  63, 54, size, op);
+	sja1105_packing(buf, &entry->winend,      53, 53, size, op);
+	sja1105_packing(buf, &entry->winst,       52, 52, size, op);
+	sja1105_packing(buf, &entry->destports,   51, 47, size, op);
+	sja1105_packing(buf, &entry->setvalid,    46, 46, size, op);
+	sja1105_packing(buf, &entry->txen,        45, 45, size, op);
+	sja1105_packing(buf, &entry->resmedia_en, 44, 44, size, op);
+	sja1105_packing(buf, &entry->resmedia,    43, 36, size, op);
+	sja1105_packing(buf, &entry->vlindex,     35, 26, size, op);
+	sja1105_packing(buf, &entry->delta,       25, 8,  size, op);
+	return size;
+}
+
 size_t sja1105_vlan_lookup_entry_packing(void *buf, void *entry_ptr,
 					 enum packing_op op)
 {
@@ -447,11 +504,15 @@ static void sja1105_table_write_crc(u8 *table_start, u8 *crc_ptr)
  * before blindly indexing kernel memory with the blk_idx.
  */
 static u64 blk_id_map[BLK_IDX_MAX] = {
+	[BLK_IDX_SCHEDULE] = BLKID_SCHEDULE,
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS] = BLKID_SCHEDULE_ENTRY_POINTS,
 	[BLK_IDX_L2_LOOKUP] = BLKID_L2_LOOKUP,
 	[BLK_IDX_L2_POLICING] = BLKID_L2_POLICING,
 	[BLK_IDX_VLAN_LOOKUP] = BLKID_VLAN_LOOKUP,
 	[BLK_IDX_L2_FORWARDING] = BLKID_L2_FORWARDING,
 	[BLK_IDX_MAC_CONFIG] = BLKID_MAC_CONFIG,
+	[BLK_IDX_SCHEDULE_PARAMS] = BLKID_SCHEDULE_PARAMS,
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS] = BLKID_SCHEDULE_ENTRY_POINTS_PARAMS,
 	[BLK_IDX_L2_LOOKUP_PARAMS] = BLKID_L2_LOOKUP_PARAMS,
 	[BLK_IDX_L2_FORWARDING_PARAMS] = BLKID_L2_FORWARDING_PARAMS,
 	[BLK_IDX_AVB_PARAMS] = BLKID_AVB_PARAMS,
@@ -461,6 +522,13 @@ static u64 blk_id_map[BLK_IDX_MAX] = {
 
 const char *sja1105_static_config_error_msg[] = {
 	[SJA1105_CONFIG_OK] = "",
+	[SJA1105_TTETHERNET_NOT_SUPPORTED] =
+		"schedule-table present, but TTEthernet is "
+		"only supported on T and Q/S",
+	[SJA1105_INCORRECT_TTETHERNET_CONFIGURATION] =
+		"schedule-table present, but one of "
+		"schedule-entry-points-table, schedule-parameters-table or "
+		"schedule-entry-points-parameters table is empty",
 	[SJA1105_MISSING_L2_POLICING_TABLE] =
 		"l2-policing-table needs to have at least one entry",
 	[SJA1105_MISSING_L2_FORWARDING_TABLE] =
@@ -508,6 +576,21 @@ sja1105_static_config_check_valid(const struct sja1105_static_config *config)
 #define IS_FULL(blk_idx) \
 	(tables[blk_idx].entry_count == tables[blk_idx].ops->max_entry_count)
 
+	if (tables[BLK_IDX_SCHEDULE].entry_count) {
+		if (config->device_id != SJA1105T_DEVICE_ID &&
+		    config->device_id != SJA1105QS_DEVICE_ID)
+			return SJA1105_TTETHERNET_NOT_SUPPORTED;
+
+		if (tables[BLK_IDX_SCHEDULE_ENTRY_POINTS].entry_count == 0)
+			return SJA1105_INCORRECT_TTETHERNET_CONFIGURATION;
+
+		if (!IS_FULL(BLK_IDX_SCHEDULE_PARAMS))
+			return SJA1105_INCORRECT_TTETHERNET_CONFIGURATION;
+
+		if (!IS_FULL(BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS))
+			return SJA1105_INCORRECT_TTETHERNET_CONFIGURATION;
+	}
+
 	if (tables[BLK_IDX_L2_POLICING].entry_count == 0)
 		return SJA1105_MISSING_L2_POLICING_TABLE;
 
@@ -614,6 +697,8 @@ sja1105_static_config_get_length(const struct sja1105_static_config *config)
 
 /* SJA1105E: First generation, no TTEthernet */
 struct sja1105_table_ops sja1105e_table_ops[BLK_IDX_MAX] = {
+	[BLK_IDX_SCHEDULE] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS] = {0},
 	[BLK_IDX_L2_LOOKUP] = {
 		.packing = sja1105et_l2_lookup_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_entry),
@@ -644,6 +729,8 @@ struct sja1105_table_ops sja1105e_table_ops[BLK_IDX_MAX] = {
 		.packed_entry_size = SJA1105ET_SIZE_MAC_CONFIG_ENTRY,
 		.max_entry_count = SJA1105_MAX_MAC_CONFIG_COUNT,
 	},
+	[BLK_IDX_SCHEDULE_PARAMS] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS] = {0},
 	[BLK_IDX_L2_LOOKUP_PARAMS] = {
 		.packing = sja1105et_l2_lookup_params_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_params_entry),
@@ -678,6 +765,18 @@ struct sja1105_table_ops sja1105e_table_ops[BLK_IDX_MAX] = {
 
 /* SJA1105T: First generation, TTEthernet */
 struct sja1105_table_ops sja1105t_table_ops[BLK_IDX_MAX] = {
+	[BLK_IDX_SCHEDULE] = {
+		.packing = sja1105_schedule_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_COUNT,
+	},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS] = {
+		.packing = sja1105_schedule_entry_points_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_entry_points_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_ENTRY_POINTS_COUNT,
+	},
 	[BLK_IDX_L2_LOOKUP] = {
 		.packing = sja1105et_l2_lookup_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_entry),
@@ -708,6 +807,18 @@ struct sja1105_table_ops sja1105t_table_ops[BLK_IDX_MAX] = {
 		.packed_entry_size = SJA1105ET_SIZE_MAC_CONFIG_ENTRY,
 		.max_entry_count = SJA1105_MAX_MAC_CONFIG_COUNT,
 	},
+	[BLK_IDX_SCHEDULE_PARAMS] = {
+		.packing = sja1105_schedule_params_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_params_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_PARAMS_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_PARAMS_COUNT,
+	},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS] = {
+		.packing = sja1105_schedule_entry_points_params_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_entry_points_params_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_PARAMS_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_ENTRY_POINTS_PARAMS_COUNT,
+	},
 	[BLK_IDX_L2_LOOKUP_PARAMS] = {
 		.packing = sja1105et_l2_lookup_params_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_params_entry),
@@ -742,6 +853,8 @@ struct sja1105_table_ops sja1105t_table_ops[BLK_IDX_MAX] = {
 
 /* SJA1105P: Second generation, no TTEthernet, no SGMII */
 struct sja1105_table_ops sja1105p_table_ops[BLK_IDX_MAX] = {
+	[BLK_IDX_SCHEDULE] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS] = {0},
 	[BLK_IDX_L2_LOOKUP] = {
 		.packing = sja1105pqrs_l2_lookup_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_entry),
@@ -772,6 +885,8 @@ struct sja1105_table_ops sja1105p_table_ops[BLK_IDX_MAX] = {
 		.packed_entry_size = SJA1105PQRS_SIZE_MAC_CONFIG_ENTRY,
 		.max_entry_count = SJA1105_MAX_MAC_CONFIG_COUNT,
 	},
+	[BLK_IDX_SCHEDULE_PARAMS] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS] = {0},
 	[BLK_IDX_L2_LOOKUP_PARAMS] = {
 		.packing = sja1105pqrs_l2_lookup_params_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_params_entry),
@@ -806,6 +921,18 @@ struct sja1105_table_ops sja1105p_table_ops[BLK_IDX_MAX] = {
 
 /* SJA1105Q: Second generation, TTEthernet, no SGMII */
 struct sja1105_table_ops sja1105q_table_ops[BLK_IDX_MAX] = {
+	[BLK_IDX_SCHEDULE] = {
+		.packing = sja1105_schedule_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_COUNT,
+	},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS] = {
+		.packing = sja1105_schedule_entry_points_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_entry_points_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_ENTRY_POINTS_COUNT,
+	},
 	[BLK_IDX_L2_LOOKUP] = {
 		.packing = sja1105pqrs_l2_lookup_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_entry),
@@ -836,6 +963,18 @@ struct sja1105_table_ops sja1105q_table_ops[BLK_IDX_MAX] = {
 		.packed_entry_size = SJA1105PQRS_SIZE_MAC_CONFIG_ENTRY,
 		.max_entry_count = SJA1105_MAX_MAC_CONFIG_COUNT,
 	},
+	[BLK_IDX_SCHEDULE_PARAMS] = {
+		.packing = sja1105_schedule_params_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_params_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_PARAMS_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_PARAMS_COUNT,
+	},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS] = {
+		.packing = sja1105_schedule_entry_points_params_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_entry_points_params_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_PARAMS_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_ENTRY_POINTS_PARAMS_COUNT,
+	},
 	[BLK_IDX_L2_LOOKUP_PARAMS] = {
 		.packing = sja1105pqrs_l2_lookup_params_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_params_entry),
@@ -870,6 +1009,8 @@ struct sja1105_table_ops sja1105q_table_ops[BLK_IDX_MAX] = {
 
 /* SJA1105R: Second generation, no TTEthernet, SGMII */
 struct sja1105_table_ops sja1105r_table_ops[BLK_IDX_MAX] = {
+	[BLK_IDX_SCHEDULE] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS] = {0},
 	[BLK_IDX_L2_LOOKUP] = {
 		.packing = sja1105pqrs_l2_lookup_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_entry),
@@ -900,6 +1041,8 @@ struct sja1105_table_ops sja1105r_table_ops[BLK_IDX_MAX] = {
 		.packed_entry_size = SJA1105PQRS_SIZE_MAC_CONFIG_ENTRY,
 		.max_entry_count = SJA1105_MAX_MAC_CONFIG_COUNT,
 	},
+	[BLK_IDX_SCHEDULE_PARAMS] = {0},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS] = {0},
 	[BLK_IDX_L2_LOOKUP_PARAMS] = {
 		.packing = sja1105pqrs_l2_lookup_params_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_params_entry),
@@ -934,6 +1077,18 @@ struct sja1105_table_ops sja1105r_table_ops[BLK_IDX_MAX] = {
 
 /* SJA1105S: Second generation, TTEthernet, SGMII */
 struct sja1105_table_ops sja1105s_table_ops[BLK_IDX_MAX] = {
+	[BLK_IDX_SCHEDULE] = {
+		.packing = sja1105_schedule_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_COUNT,
+	},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS] = {
+		.packing = sja1105_schedule_entry_points_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_entry_points_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_ENTRY_POINTS_COUNT,
+	},
 	[BLK_IDX_L2_LOOKUP] = {
 		.packing = sja1105pqrs_l2_lookup_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_entry),
@@ -964,6 +1119,18 @@ struct sja1105_table_ops sja1105s_table_ops[BLK_IDX_MAX] = {
 		.packed_entry_size = SJA1105PQRS_SIZE_MAC_CONFIG_ENTRY,
 		.max_entry_count = SJA1105_MAX_MAC_CONFIG_COUNT,
 	},
+	[BLK_IDX_SCHEDULE_PARAMS] = {
+		.packing = sja1105_schedule_params_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_params_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_PARAMS_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_PARAMS_COUNT,
+	},
+	[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS] = {
+		.packing = sja1105_schedule_entry_points_params_entry_packing,
+		.unpacked_entry_size = sizeof(struct sja1105_schedule_entry_points_params_entry),
+		.packed_entry_size = SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_PARAMS_ENTRY,
+		.max_entry_count = SJA1105_MAX_SCHEDULE_ENTRY_POINTS_PARAMS_COUNT,
+	},
 	[BLK_IDX_L2_LOOKUP_PARAMS] = {
 		.packing = sja1105pqrs_l2_lookup_params_entry_packing,
 		.unpacked_entry_size = sizeof(struct sja1105_l2_lookup_params_entry),
diff --git a/drivers/net/dsa/sja1105/sja1105_static_config.h b/drivers/net/dsa/sja1105/sja1105_static_config.h
index 684465fc0882..7f87022a2d61 100644
--- a/drivers/net/dsa/sja1105/sja1105_static_config.h
+++ b/drivers/net/dsa/sja1105/sja1105_static_config.h
@@ -11,11 +11,15 @@
 
 #define SJA1105_SIZE_DEVICE_ID				4
 #define SJA1105_SIZE_TABLE_HEADER			12
+#define SJA1105_SIZE_SCHEDULE_ENTRY			8
+#define SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_ENTRY	4
 #define SJA1105_SIZE_L2_POLICING_ENTRY			8
 #define SJA1105_SIZE_VLAN_LOOKUP_ENTRY			8
 #define SJA1105_SIZE_L2_FORWARDING_ENTRY		8
 #define SJA1105_SIZE_L2_FORWARDING_PARAMS_ENTRY		12
 #define SJA1105_SIZE_XMII_PARAMS_ENTRY			4
+#define SJA1105_SIZE_SCHEDULE_PARAMS_ENTRY		12
+#define SJA1105_SIZE_SCHEDULE_ENTRY_POINTS_PARAMS_ENTRY	4
 #define SJA1105ET_SIZE_L2_LOOKUP_ENTRY			12
 #define SJA1105ET_SIZE_MAC_CONFIG_ENTRY			28
 #define SJA1105ET_SIZE_L2_LOOKUP_PARAMS_ENTRY		4
@@ -29,11 +33,15 @@
 
 /* UM10944.pdf Page 11, Table 2. Configuration Blocks */
 enum {
+	BLKID_SCHEDULE					= 0x00,
+	BLKID_SCHEDULE_ENTRY_POINTS			= 0x01,
 	BLKID_L2_LOOKUP					= 0x05,
 	BLKID_L2_POLICING				= 0x06,
 	BLKID_VLAN_LOOKUP				= 0x07,
 	BLKID_L2_FORWARDING				= 0x08,
 	BLKID_MAC_CONFIG				= 0x09,
+	BLKID_SCHEDULE_PARAMS				= 0x0A,
+	BLKID_SCHEDULE_ENTRY_POINTS_PARAMS		= 0x0B,
 	BLKID_L2_LOOKUP_PARAMS				= 0x0D,
 	BLKID_L2_FORWARDING_PARAMS			= 0x0E,
 	BLKID_AVB_PARAMS				= 0x10,
@@ -42,11 +50,15 @@ enum {
 };
 
 enum sja1105_blk_idx {
-	BLK_IDX_L2_LOOKUP = 0,
+	BLK_IDX_SCHEDULE = 0,
+	BLK_IDX_SCHEDULE_ENTRY_POINTS,
+	BLK_IDX_L2_LOOKUP,
 	BLK_IDX_L2_POLICING,
 	BLK_IDX_VLAN_LOOKUP,
 	BLK_IDX_L2_FORWARDING,
 	BLK_IDX_MAC_CONFIG,
+	BLK_IDX_SCHEDULE_PARAMS,
+	BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS,
 	BLK_IDX_L2_LOOKUP_PARAMS,
 	BLK_IDX_L2_FORWARDING_PARAMS,
 	BLK_IDX_AVB_PARAMS,
@@ -59,11 +71,15 @@ enum sja1105_blk_idx {
 	BLK_IDX_INVAL = -1,
 };
 
+#define SJA1105_MAX_SCHEDULE_COUNT			1024
+#define SJA1105_MAX_SCHEDULE_ENTRY_POINTS_COUNT		2048
 #define SJA1105_MAX_L2_LOOKUP_COUNT			1024
 #define SJA1105_MAX_L2_POLICING_COUNT			45
 #define SJA1105_MAX_VLAN_LOOKUP_COUNT			4096
 #define SJA1105_MAX_L2_FORWARDING_COUNT			13
 #define SJA1105_MAX_MAC_CONFIG_COUNT			5
+#define SJA1105_MAX_SCHEDULE_PARAMS_COUNT		1
+#define SJA1105_MAX_SCHEDULE_ENTRY_POINTS_PARAMS_COUNT	1
 #define SJA1105_MAX_L2_LOOKUP_PARAMS_COUNT		1
 #define SJA1105_MAX_L2_FORWARDING_PARAMS_COUNT		1
 #define SJA1105_MAX_GENERAL_PARAMS_COUNT		1
@@ -83,6 +99,23 @@ enum sja1105_blk_idx {
 #define SJA1105R_PART_NO				0x9A86
 #define SJA1105S_PART_NO				0x9A87
 
+struct sja1105_schedule_entry {
+	u64 winstindex;
+	u64 winend;
+	u64 winst;
+	u64 destports;
+	u64 setvalid;
+	u64 txen;
+	u64 resmedia_en;
+	u64 resmedia;
+	u64 vlindex;
+	u64 delta;
+};
+
+struct sja1105_schedule_params_entry {
+	u64 subscheind[8];
+};
+
 struct sja1105_general_params_entry {
 	u64 vllupformat;
 	u64 mirr_ptacu;
@@ -112,6 +145,17 @@ struct sja1105_general_params_entry {
 	u64 replay_port;
 };
 
+struct sja1105_schedule_entry_points_entry {
+	u64 subschindx;
+	u64 delta;
+	u64 address;
+};
+
+struct sja1105_schedule_entry_points_params_entry {
+	u64 clksrc;
+	u64 actsubsch;
+};
+
 struct sja1105_vlan_lookup_entry {
 	u64 ving_mirr;
 	u64 vegr_mirr;
@@ -256,6 +300,8 @@ sja1105_static_config_get_length(const struct sja1105_static_config *config);
 
 typedef enum {
 	SJA1105_CONFIG_OK = 0,
+	SJA1105_TTETHERNET_NOT_SUPPORTED,
+	SJA1105_INCORRECT_TTETHERNET_CONFIGURATION,
 	SJA1105_MISSING_L2_POLICING_TABLE,
 	SJA1105_MISSING_L2_FORWARDING_TABLE,
 	SJA1105_MISSING_L2_FORWARDING_PARAMS_TABLE,
-- 
2.17.1
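
For readers following the bit ranges in sja1105_schedule_entry_packing, here
is a minimal userspace sketch of what "pack a field into bits [hi:lo]" means.
It is an illustration only: pack_field() and unpack_field() are hypothetical
helpers, they treat the 8-byte entry as a single 64-bit word, and they do not
model the byte ordering handled by the kernel's sja1105_packing(). The field
positions used in the usage note are the ones from the patch above, where the
ten fields occupy bits 63..8 (56 bits) and the low 8 bits are left unused.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask of 'width' low bits; width is between 1 and 64. */
static uint64_t field_mask(int width)
{
	return (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
}

/* Place 'val' into bits [hi:lo] of 'word' (bit 63 is the MSB).
 * Hypothetical stand-in for illustration, not the kernel API.
 */
static uint64_t pack_field(uint64_t word, uint64_t val, int hi, int lo)
{
	uint64_t mask;

	assert(lo >= 0 && hi >= lo && hi <= 63);
	mask = field_mask(hi - lo + 1);
	assert((val & ~mask) == 0);	/* the value must fit in the field */

	return (word & ~(mask << lo)) | (val << lo);
}

/* Extract bits [hi:lo] of 'word'. */
static uint64_t unpack_field(uint64_t word, int hi, int lo)
{
	assert(lo >= 0 && hi >= lo && hi <= 63);

	return (word >> lo) & field_mask(hi - lo + 1);
}
```

With these helpers, pack_field(0, 5, 25, 8) stores a delta of 5 in the same
bit range the patch uses for entry->delta, and unpack_field() on bits 51..47
reads back a destports value.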



* [PATCH v4 net-next 6/6] docs: net: dsa: sja1105: Add info about the Time-Aware Scheduler
From: Vladimir Oltean @ 2019-09-15  2:00 UTC (permalink / raw)
  To: f.fainelli, vivien.didelot, andrew, davem, vinicius.gomes,
	vedang.patel, richardcochran
  Cc: weifeng.voon, jiri, m-karicheri2, jose.abreu, ilias.apalodimas,
	jhs, xiyou.wangcong, kurt.kanzenbach, joergen.andreasen, netdev,
	Vladimir Oltean
In-Reply-To: <20190915020003.27926-1-olteanv@gmail.com>

While not an exhaustive usage tutorial, this describes the details
needed to build more complex scenarios.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
---
Changes since v2:
- Specified that the HOSTPRIO switch setting is not configurable at the
  moment, since I dropped the patch for configuring it.

Changes since v1:
- Patch is new.

 Documentation/networking/dsa/sja1105.rst | 90 ++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst
index cb2858dece93..eef20d0bcf7c 100644
--- a/Documentation/networking/dsa/sja1105.rst
+++ b/Documentation/networking/dsa/sja1105.rst
@@ -146,6 +146,96 @@ enslaves eth0 and eth1 (the DSA master of the switch ports). This is because in
 this mode, the switch ports beneath br0 are not capable of regular traffic, and
 are only used as a conduit for switchdev operations.
 
+Offloads
+========
+
+Time-aware scheduling
+---------------------
+
+The switch supports a variation of the enhancements for scheduled traffic
+specified in IEEE 802.1Q-2018 (formerly 802.1Qbv). This means it can be used to
+ensure deterministic latency for priority traffic that is sent in-band with its
+gate-open event in the network schedule.
+
+This capability can be managed through the tc-taprio offload ('flags 2'). The
+difference compared to the software implementation of taprio is that the latter
+would only be able to shape traffic originated from the CPU, but not
+autonomously forwarded flows.
+
+The device has 8 traffic classes, and maps incoming frames to one of them based
+on the VLAN PCP bits (if no VLAN is present, the port-based default is used).
+As described in the previous sections, depending on the value of
+``vlan_filtering``, the EtherType recognized by the switch as being VLAN can
+either be the typical 0x8100 or a custom value used internally by the driver
+for tagging. Therefore, the switch ignores the VLAN PCP if used in standalone
+or bridge mode with ``vlan_filtering=0``, as it will not recognize the 0x8100
+EtherType. In these modes, injecting into a particular TX queue can only be
+done by the DSA net devices, which populate the PCP field of the tagging header
+on egress. Using ``vlan_filtering=1``, the behavior is the other way around:
+offloaded flows can be steered to TX queues based on the VLAN PCP, but the DSA
+net devices are no longer able to do that. To inject frames into a hardware TX
+queue with VLAN awareness active, it is necessary to create a VLAN
+sub-interface on the DSA master port, and send normal (0x8100) VLAN-tagged
+frames towards the switch, with the VLAN PCP bits set appropriately.
+
+Management traffic (having DMAC 01-80-C2-xx-xx-xx or 01-19-1B-xx-xx-xx) is the
+notable exception: the switch always treats it with a fixed priority and
+disregards any VLAN PCP bits even if present. The traffic class for management
+traffic has a value of 7 (highest priority) at the moment, which is not
+configurable in the driver.
+
+Below is an example of configuring a 500 us cyclic schedule on egress port
+``swp5``. The traffic class gate for management traffic (7) is open for 100 us,
+and the gates for all other traffic classes are open for 400 us::
+
+  #!/bin/bash
+
+  set -e -u -o pipefail
+
+  NSEC_PER_SEC="1000000000"
+
+  gatemask() {
+          local tc_list="$1"
+          local mask=0
+
+          for tc in ${tc_list}; do
+                  mask=$((${mask} | (1 << ${tc})))
+          done
+
+          printf "%02x" ${mask}
+  }
+
+  if ! systemctl is-active --quiet ptp4l; then
+          echo "Please start the ptp4l service"
+          exit 1
+  fi
+
+  now=$(phc_ctl /dev/ptp1 get | gawk '/clock time is/ { print $5; }')
+  # Phase-align the base time to the start of the next second.
+  sec=$(echo "${now}" | gawk -F. '{ print $1; }')
+  base_time="$(((${sec} + 1) * ${NSEC_PER_SEC}))"
+
+  tc qdisc add dev swp5 parent root handle 100 taprio \
+          num_tc 8 \
+          map 0 1 2 3 4 5 6 7 \
+          queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
+          base-time ${base_time} \
+          sched-entry S $(gatemask 7) 100000 \
+          sched-entry S $(gatemask "0 1 2 3 4 5 6") 400000 \
+          flags 2
+
+It is possible to apply the tc-taprio offload on multiple egress ports. There
+are hardware restrictions related to the fact that no gate event may trigger
+simultaneously on two ports. The driver checks the consistency of the schedules
+against this restriction and errors out when appropriate. Schedule analysis is
+needed to avoid this, which is outside the scope of this document.
+
+At the moment, the time-aware scheduler can only be triggered based on a
+standalone clock and not based on PTP time. This means the base-time argument
+from tc-taprio is ignored and the schedule starts right away. It also means it
+is more difficult to phase-align the scheduler with the other devices in the
+network.
+
 Device Tree bindings and board design
 =====================================
 
-- 
2.17.1
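
The two small calculations in the documentation script (the gate mask and the
base time rounded up to the next second) can also be written in C. This is a
sketch for illustration, not part of the series: gatemask() mirrors the shell
function of the same name, next_second_ns() mirrors the base_time arithmetic,
and NUM_TC is a local constant taken from the statement that the device has 8
traffic classes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TC		8	/* traffic classes, per the documentation */
#define NSEC_PER_SEC	INT64_C(1000000000)

/* OR together one bit per traffic class, like the shell gatemask(). */
static uint8_t gatemask(const int *tc_list, size_t n)
{
	uint8_t mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(tc_list[i] >= 0 && tc_list[i] < NUM_TC);
		mask |= (uint8_t)(1u << tc_list[i]);
	}

	return mask;
}

/* Base time in nanoseconds, aligned to the start of the next second. */
static int64_t next_second_ns(int64_t now_sec)
{
	return (now_sec + 1) * NSEC_PER_SEC;
}
```

For the example schedule, the management gate (traffic class 7) gives a mask
of 0x80 and traffic classes 0 to 6 give 0x7f; the two intervals of 100000 ns
and 400000 ns add up to the 500 us cycle.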



* [PATCH v4 net-next 5/6] net: dsa: sja1105: Configure the Time-Aware Scheduler via tc-taprio offload
From: Vladimir Oltean @ 2019-09-15  2:00 UTC (permalink / raw)
  To: f.fainelli, vivien.didelot, andrew, davem, vinicius.gomes,
	vedang.patel, richardcochran
  Cc: weifeng.voon, jiri, m-karicheri2, jose.abreu, ilias.apalodimas,
	jhs, xiyou.wangcong, kurt.kanzenbach, joergen.andreasen, netdev,
	Vladimir Oltean
In-Reply-To: <20190915020003.27926-1-olteanv@gmail.com>

This qdisc offload is the closest thing to what the SJA1105 supports in
hardware for time-based egress shaping. The switch core really is built
around SAE AS6802/TTEthernet (a TTTech standard) but can be made to
operate similarly to IEEE 802.1Qbv with some constraints:

- The gate control list is a global list for all ports. There are 8
  execution threads that iterate through this global list in parallel.
  I don't know why 8, there are only 4 front-panel ports.

- Care must be taken by the user to make sure that two execution threads
  never get to execute a GCL entry simultaneously. I created an O(n^4)
  checker for this hardware limitation, prior to accepting a taprio
  offload configuration as valid.

- The spec says that if a GCL entry's interval is shorter than the frame
  length, you shouldn't send it (and end up in head-of-line blocking).
  Well, this switch does anyway.

- The switch has no concept of ADMIN and OPER configurations. Because
  it's so simple, the TAS settings are loaded through the static config
  tables interface, so there isn't even place for any discussion about
  'graceful switchover between ADMIN and OPER'. You just reset the
  switch and upload a new OPER config.

- The switch accepts multiple time sources for the gate events. Right
  now I am using the standalone clock source as opposed to PTP. So the
  base time parameter doesn't really do much. Support for the PTP clock
  source will be added in a future series.
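
To illustrate the "no two execution threads fire at the same instant"
constraint from the list above, here is a simplified userspace sketch. It is
not the O(n^4) checker from this patch: struct cyclic_sched and
schedules_conflict() are hypothetical names, times are plain nanoseconds, and
the test relies on the observation that events at ta + m * ca and
tb + n * cb coincide for some integers m and n exactly when (ta - tb) is a
multiple of gcd(ca, cb), with ca and cb being the two cycle times.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one port's cyclic schedule, not a driver struct. */
struct cyclic_sched {
	int64_t base_ns;		/* time of the first gate event */
	const int64_t *interval_ns;	/* duration of each GCL entry */
	size_t num_entries;
};

/* The cycle time is the sum of all entry durations. */
static int64_t cycle_time(const struct cyclic_sched *s)
{
	int64_t sum = 0;
	size_t i;

	for (i = 0; i < s->num_entries; i++) {
		assert(s->interval_ns[i] > 0);
		sum += s->interval_ns[i];
	}

	return sum;
}

static int64_t gcd64(int64_t a, int64_t b)
{
	while (b) {
		int64_t t = a % b;

		a = b;
		b = t;
	}

	return a;
}

/* Return true if a gate event of 'a' ever coincides with one of 'b'. */
static bool schedules_conflict(const struct cyclic_sched *a,
			       const struct cyclic_sched *b)
{
	int64_t g = gcd64(cycle_time(a), cycle_time(b));
	int64_t ta = a->base_ns;
	size_t i, j;

	for (i = 0; i < a->num_entries; i++) {
		int64_t tb = b->base_ns;

		for (j = 0; j < b->num_entries; j++) {
			/* Same residue modulo the gcd means a collision. */
			if ((ta - tb) % g == 0)
				return true;
			tb += b->interval_ns[j];
		}
		ta += a->interval_ns[i];
	}

	return false;
}
```

Two ports running the same 100 us + 400 us schedule conflict when their base
times are equal, and stop conflicting when one of them is shifted by 50 us.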

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
---
Changes since v2:
- Made all functions that are entry points into sja1105_tas.c take a
  dsa_switch *ds argument instead of sja1105_private *priv. This also
  happens to avoid a build error reported by the Kbuild test robot.
- Moved the iteration over switch ports outside of
  sja1105_tas_check_conflicts, to address some checkpatch complaints.
- Renamed "new" -> "admin".

Changes since v1:
- Adapted to the naming convention changes in 01/07 (taprio_get ->
  taprio_offload_get, tas_config -> offload, etc).

Changes since RFC:
- Removed the sja1105_tas_config_work workqueue.
- Allocating memory with GFP_KERNEL.
- Made the ASCII art drawing fit in < 80 characters.
- Made most of the time-holding variables s64 instead of u64 (for fear
  of them not holding the result of signed arithmetic properly).

 drivers/net/dsa/sja1105/Kconfig        |   8 +
 drivers/net/dsa/sja1105/Makefile       |   4 +
 drivers/net/dsa/sja1105/sja1105.h      |   6 +
 drivers/net/dsa/sja1105/sja1105_main.c |  19 +-
 drivers/net/dsa/sja1105/sja1105_tas.c  | 423 +++++++++++++++++++++++++
 drivers/net/dsa/sja1105/sja1105_tas.h  |  41 +++
 6 files changed, 500 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/dsa/sja1105/sja1105_tas.c
 create mode 100644 drivers/net/dsa/sja1105/sja1105_tas.h

diff --git a/drivers/net/dsa/sja1105/Kconfig b/drivers/net/dsa/sja1105/Kconfig
index 770134a66e48..55424f39cb0d 100644
--- a/drivers/net/dsa/sja1105/Kconfig
+++ b/drivers/net/dsa/sja1105/Kconfig
@@ -23,3 +23,11 @@ config NET_DSA_SJA1105_PTP
 	help
 	  This enables support for timestamping and PTP clock manipulations in
 	  the SJA1105 DSA driver.
+
+config NET_DSA_SJA1105_TAS
+	bool "Support for the Time-Aware Scheduler on NXP SJA1105"
+	depends on NET_DSA_SJA1105
+	help
+	  This enables support for the TTEthernet-based egress scheduling
+	  engine in the SJA1105 DSA driver, which is controlled using a
+	  hardware offload of the tc-taprio qdisc.
diff --git a/drivers/net/dsa/sja1105/Makefile b/drivers/net/dsa/sja1105/Makefile
index 4483113e6259..66161e874344 100644
--- a/drivers/net/dsa/sja1105/Makefile
+++ b/drivers/net/dsa/sja1105/Makefile
@@ -12,3 +12,7 @@ sja1105-objs := \
 ifdef CONFIG_NET_DSA_SJA1105_PTP
 sja1105-objs += sja1105_ptp.o
 endif
+
+ifdef CONFIG_NET_DSA_SJA1105_TAS
+sja1105-objs += sja1105_tas.o
+endif
diff --git a/drivers/net/dsa/sja1105/sja1105.h b/drivers/net/dsa/sja1105/sja1105.h
index 78094db32622..e53e494c22e0 100644
--- a/drivers/net/dsa/sja1105/sja1105.h
+++ b/drivers/net/dsa/sja1105/sja1105.h
@@ -20,6 +20,8 @@
  */
 #define SJA1105_AGEING_TIME_MS(ms)	((ms) / 10)
 
+#include "sja1105_tas.h"
+
 /* Keeps the different addresses between E/T and P/Q/R/S */
 struct sja1105_regs {
 	u64 device_id;
@@ -104,6 +106,7 @@ struct sja1105_private {
 	 */
 	struct mutex mgmt_lock;
 	struct sja1105_tagger_data tagger_data;
+	struct sja1105_tas_data tas_data;
 };
 
 #include "sja1105_dynamic_config.h"
@@ -120,6 +123,9 @@ typedef enum {
 	SPI_WRITE = 1,
 } sja1105_spi_rw_mode_t;
 
+/* From sja1105_main.c */
+int sja1105_static_config_reload(struct sja1105_private *priv);
+
 /* From sja1105_spi.c */
 int sja1105_spi_send_packed_buf(const struct sja1105_private *priv,
 				sja1105_spi_rw_mode_t rw, u64 reg_addr,
diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c
index 108f62c27c28..b9def744bcb3 100644
--- a/drivers/net/dsa/sja1105/sja1105_main.c
+++ b/drivers/net/dsa/sja1105/sja1105_main.c
@@ -22,6 +22,7 @@
 #include <linux/if_ether.h>
 #include <linux/dsa/8021q.h>
 #include "sja1105.h"
+#include "sja1105_tas.h"
 
 static void sja1105_hw_reset(struct gpio_desc *gpio, unsigned int pulse_len,
 			     unsigned int startup_delay)
@@ -1382,7 +1383,7 @@ static void sja1105_bridge_leave(struct dsa_switch *ds, int port,
  * modify at runtime (currently only MAC) and restore them after uploading,
  * such that this operation is relatively seamless.
  */
-static int sja1105_static_config_reload(struct sja1105_private *priv)
+int sja1105_static_config_reload(struct sja1105_private *priv)
 {
 	struct sja1105_mac_config_entry *mac;
 	int speed_mbps[SJA1105_NUM_PORTS];
@@ -1727,6 +1728,7 @@ static void sja1105_teardown(struct dsa_switch *ds)
 {
 	struct sja1105_private *priv = ds->priv;
 
+	sja1105_tas_teardown(ds);
 	cancel_work_sync(&priv->tagger_data.rxtstamp_work);
 	skb_queue_purge(&priv->tagger_data.skb_rxtstamp_queue);
 	sja1105_ptp_clock_unregister(priv);
@@ -2056,6 +2058,18 @@ static bool sja1105_port_txtstamp(struct dsa_switch *ds, int port,
 	return true;
 }
 
+static int sja1105_port_setup_tc(struct dsa_switch *ds, int port,
+				 enum tc_setup_type type,
+				 void *type_data)
+{
+	switch (type) {
+	case TC_SETUP_QDISC_TAPRIO:
+		return sja1105_setup_tc_taprio(ds, port, type_data);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 static const struct dsa_switch_ops sja1105_switch_ops = {
 	.get_tag_protocol	= sja1105_get_tag_protocol,
 	.setup			= sja1105_setup,
@@ -2088,6 +2102,7 @@ static const struct dsa_switch_ops sja1105_switch_ops = {
 	.port_hwtstamp_set	= sja1105_hwtstamp_set,
 	.port_rxtstamp		= sja1105_port_rxtstamp,
 	.port_txtstamp		= sja1105_port_txtstamp,
+	.port_setup_tc		= sja1105_port_setup_tc,
 };
 
 static int sja1105_check_device_id(struct sja1105_private *priv)
@@ -2197,6 +2212,8 @@ static int sja1105_probe(struct spi_device *spi)
 	}
 	mutex_init(&priv->mgmt_lock);
 
+	sja1105_tas_setup(ds);
+
 	return dsa_register_switch(priv->ds);
 }
 
diff --git a/drivers/net/dsa/sja1105/sja1105_tas.c b/drivers/net/dsa/sja1105/sja1105_tas.c
new file mode 100644
index 000000000000..33eca6a82ec5
--- /dev/null
+++ b/drivers/net/dsa/sja1105/sja1105_tas.c
@@ -0,0 +1,423 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
+ */
+#include "sja1105.h"
+
+#define SJA1105_TAS_CLKSRC_DISABLED	0
+#define SJA1105_TAS_CLKSRC_STANDALONE	1
+#define SJA1105_TAS_CLKSRC_AS6802	2
+#define SJA1105_TAS_CLKSRC_PTP		3
+#define SJA1105_TAS_MAX_DELTA		BIT(19)
+#define SJA1105_GATE_MASK		GENMASK_ULL(SJA1105_NUM_TC - 1, 0)
+
+/* This is not a preprocessor macro because the "ns" argument may or may not be
+ * s64 at caller side. This ensures it is properly type-cast before div_s64.
+ */
+static s64 ns_to_sja1105_delta(s64 ns)
+{
+	return div_s64(ns, 200);
+}
+
+/* Lo and behold: the egress scheduler from hell.
+ *
+ * At the hardware level, the Time-Aware Shaper holds a global linear array of
+ * all schedule entries for all ports. These are the Gate Control List (GCL)
+ * entries, let's call them "timeslots" for short. This linear array of
+ * timeslots is held in BLK_IDX_SCHEDULE.
+ *
+ * Then there are a maximum of 8 "execution threads" inside the switch, which
+ * iterate cyclically through the "schedule". Each "cycle" has an entry point
+ * and an exit point, both being timeslot indices in the schedule table. The
+ * hardware calls each cycle a "subschedule".
+ *
+ * Subschedule (cycle) i starts when
+ *   ptpclkval >= ptpschtm + BLK_IDX_SCHEDULE_ENTRY_POINTS[i].delta.
+ *
+ * The hardware scheduler iterates BLK_IDX_SCHEDULE with a k ranging from
+ *   k = BLK_IDX_SCHEDULE_ENTRY_POINTS[i].address to
+ *   k = BLK_IDX_SCHEDULE_PARAMS.subscheind[i]
+ *
+ * For each schedule entry (timeslot) k, the engine executes the gate control
+ * list entry for the duration of BLK_IDX_SCHEDULE[k].delta.
+ *
+ *         +---------+
+ *         |         | BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS
+ *         +---------+
+ *              |
+ *              +-----------------+
+ *                                | .actsubsch
+ *  BLK_IDX_SCHEDULE_ENTRY_POINTS v
+ *                 +-------+-------+
+ *                 |cycle 0|cycle 1|
+ *                 +-------+-------+
+ *                   |  |      |  |
+ *  +----------------+  |      |  +-------------------------------------+
+ *  |   .subschindx     |      |             .subschindx                |
+ *  |                   |      +---------------+                        |
+ *  |          .address |        .address      |                        |
+ *  |                   |                      |                        |
+ *  |                   |                      |                        |
+ *  |  BLK_IDX_SCHEDULE v                      v                        |
+ *  |              +-------+-------+-------+-------+-------+------+     |
+ *  |              |entry 0|entry 1|entry 2|entry 3|entry 4|entry5|     |
+ *  |              +-------+-------+-------+-------+-------+------+     |
+ *  |                                  ^                    ^  ^  ^     |
+ *  |                                  |                    |  |  |     |
+ *  |        +-------------------------+                    |  |  |     |
+ *  |        |              +-------------------------------+  |  |     |
+ *  |        |              |              +-------------------+  |     |
+ *  |        |              |              |                      |     |
+ *  | +---------------------------------------------------------------+ |
+ *  | |subscheind[0]<=subscheind[1]<=subscheind[2]<=...<=subscheind[7]| |
+ *  | +---------------------------------------------------------------+ |
+ *  |        ^              ^                BLK_IDX_SCHEDULE_PARAMS    |
+ *  |        |              |                                           |
+ *  +--------+              +-------------------------------------------+
+ *
+ *  In the above picture there are two subschedules (cycles):
+ *
+ *  - cycle 0: iterates the schedule table from 0 to 2 (and back)
+ *  - cycle 1: iterates the schedule table from 3 to 5 (and back)
+ *
+ *  All other possible execution threads must be marked as unused by making
+ *  their "subschedule end index" (subscheind) equal to the last valid
+ *  subschedule's end index (in this case 5).
+ */
+static int sja1105_init_scheduling(struct sja1105_private *priv)
+{
+	struct sja1105_schedule_entry_points_entry *schedule_entry_points;
+	struct sja1105_schedule_entry_points_params_entry
+					*schedule_entry_points_params;
+	struct sja1105_schedule_params_entry *schedule_params;
+	struct sja1105_tas_data *tas_data = &priv->tas_data;
+	struct sja1105_schedule_entry *schedule;
+	struct sja1105_table *table;
+	int schedule_start_idx;
+	s64 entry_point_delta;
+	int schedule_end_idx;
+	int num_entries = 0;
+	int num_cycles = 0;
+	int cycle = 0;
+	int i, k = 0;
+	int port;
+
+	/* Discard previous Schedule Table */
+	table = &priv->static_config.tables[BLK_IDX_SCHEDULE];
+	if (table->entry_count) {
+		kfree(table->entries);
+		table->entry_count = 0;
+	}
+
+	/* Discard previous Schedule Entry Points Parameters Table */
+	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS];
+	if (table->entry_count) {
+		kfree(table->entries);
+		table->entry_count = 0;
+	}
+
+	/* Discard previous Schedule Parameters Table */
+	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_PARAMS];
+	if (table->entry_count) {
+		kfree(table->entries);
+		table->entry_count = 0;
+	}
+
+	/* Discard previous Schedule Entry Points Table */
+	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS];
+	if (table->entry_count) {
+		kfree(table->entries);
+		table->entry_count = 0;
+	}
+
+	/* Figure out the dimensioning of the problem */
+	for (port = 0; port < SJA1105_NUM_PORTS; port++) {
+		if (tas_data->offload[port]) {
+			num_entries += tas_data->offload[port]->num_entries;
+			num_cycles++;
+		}
+	}
+
+	/* Nothing to do */
+	if (!num_cycles)
+		return 0;
+
+	/* Pre-allocate space in the static config tables */
+
+	/* Schedule Table */
+	table = &priv->static_config.tables[BLK_IDX_SCHEDULE];
+	table->entries = kcalloc(num_entries, table->ops->unpacked_entry_size,
+				 GFP_KERNEL);
+	if (!table->entries)
+		return -ENOMEM;
+	table->entry_count = num_entries;
+	schedule = table->entries;
+
+	/* Schedule Entry Points Parameters Table */
+	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS];
+	table->entries = kcalloc(SJA1105_MAX_SCHEDULE_ENTRY_POINTS_PARAMS_COUNT,
+				 table->ops->unpacked_entry_size, GFP_KERNEL);
+	if (!table->entries)
+		/* Previously allocated memory will be freed automatically in
+		 * sja1105_static_config_free. This is true for all early
+		 * returns below.
+		 */
+		return -ENOMEM;
+	table->entry_count = SJA1105_MAX_SCHEDULE_ENTRY_POINTS_PARAMS_COUNT;
+	schedule_entry_points_params = table->entries;
+
+	/* Schedule Parameters Table */
+	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_PARAMS];
+	table->entries = kcalloc(SJA1105_MAX_SCHEDULE_PARAMS_COUNT,
+				 table->ops->unpacked_entry_size, GFP_KERNEL);
+	if (!table->entries)
+		return -ENOMEM;
+	table->entry_count = SJA1105_MAX_SCHEDULE_PARAMS_COUNT;
+	schedule_params = table->entries;
+
+	/* Schedule Entry Points Table */
+	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS];
+	table->entries = kcalloc(num_cycles, table->ops->unpacked_entry_size,
+				 GFP_KERNEL);
+	if (!table->entries)
+		return -ENOMEM;
+	table->entry_count = num_cycles;
+	schedule_entry_points = table->entries;
+
+	/* Finally start populating the static config tables */
+	schedule_entry_points_params->clksrc = SJA1105_TAS_CLKSRC_STANDALONE;
+	schedule_entry_points_params->actsubsch = num_cycles - 1;
+
+	for (port = 0; port < SJA1105_NUM_PORTS; port++) {
+		const struct tc_taprio_qopt_offload *offload;
+
+		offload = tas_data->offload[port];
+		if (!offload)
+			continue;
+
+		schedule_start_idx = k;
+		schedule_end_idx = k + offload->num_entries - 1;
+		/* TODO this is the base time for the port's subschedule,
+		 * relative to PTPSCHTM. But as we're using the standalone
+		 * clock source and not PTP clock as time reference, there's
+		 * little point in even trying to put more logic into this,
+		 * like preserving the phases between the subschedules of
+		 * different ports. We'll get all of that when switching to the
+		 * PTP clock source.
+		 */
+		entry_point_delta = 1;
+
+		schedule_entry_points[cycle].subschindx = cycle;
+		schedule_entry_points[cycle].delta = entry_point_delta;
+		schedule_entry_points[cycle].address = schedule_start_idx;
+
+		/* The subschedule end indices need to be
+		 * monotonically increasing.
+		 */
+		for (i = cycle; i < 8; i++)
+			schedule_params->subscheind[i] = schedule_end_idx;
+
+		for (i = 0; i < offload->num_entries; i++, k++) {
+			s64 delta_ns = offload->entries[i].interval;
+
+			schedule[k].delta = ns_to_sja1105_delta(delta_ns);
+			schedule[k].destports = BIT(port);
+			schedule[k].resmedia_en = true;
+			schedule[k].resmedia = SJA1105_GATE_MASK &
+					~offload->entries[i].gate_mask;
+		}
+		cycle++;
+	}
+
+	return 0;
+}
+
+/* Consider 2 port subschedules, each executing an arbitrary number of gate
+ * open/close events cyclically.
+ * None of those gate events must ever occur at the exact same time, otherwise
+ * the switch is known to act in exotically strange ways.
+ * However the hardware doesn't bother performing these integrity checks.
+ * So here we are with the task of validating whether the new @admin offload
+ * has any conflict with the already established TAS configuration in
+ * tas_data->offload.  We already know the other ports are in harmony with one
+ * another, otherwise we wouldn't have saved them.
+ * Each gate event executes periodically, with a period of @cycle_time and a
+ * phase given by its cycle's @base_time plus its offset within the cycle
+ * (which in turn is given by the length of the events prior to it).
+ * There are two aspects to possible collisions:
+ * - Collisions within one cycle's (actually the longest cycle's) time frame.
+ *   For that, we need to compare the cartesian product of each possible
+ *   occurrence of each event within one cycle time.
+ * - Collisions in the future. Events may not collide within one cycle time,
+ *   but if two port schedules don't have the same periodicity (i.e. the cycle
+ *   times aren't multiples of one another), they surely will some time in the
+ *   future (actually they will collide an infinite number of times).
+ */
+static bool
+sja1105_tas_check_conflicts(struct sja1105_private *priv, int port,
+			    const struct tc_taprio_qopt_offload *admin)
+{
+	struct sja1105_tas_data *tas_data = &priv->tas_data;
+	const struct tc_taprio_qopt_offload *offload;
+	s64 max_cycle_time, min_cycle_time;
+	s64 delta1, delta2;
+	s64 rbt1, rbt2;
+	s64 stop_time;
+	s64 t1, t2;
+	int i, j;
+	s32 rem;
+
+	offload = tas_data->offload[port];
+	if (!offload)
+		return false;
+
+	/* Check if the two cycle times are multiples of one another.
+	 * If they aren't, then they will surely collide.
+	 */
+	max_cycle_time = max(offload->cycle_time, admin->cycle_time);
+	min_cycle_time = min(offload->cycle_time, admin->cycle_time);
+	div_s64_rem(max_cycle_time, min_cycle_time, &rem);
+	if (rem)
+		return true;
+
+	/* Calculate the "reduced" base time of each of the two cycles
+	 * (transposed back as close to 0 as possible) by taking the
+	 * remainder of the base time divided by the cycle time.
+	 */
+	div_s64_rem(offload->base_time, offload->cycle_time, &rem);
+	rbt1 = rem;
+
+	div_s64_rem(admin->base_time, admin->cycle_time, &rem);
+	rbt2 = rem;
+
+	stop_time = max_cycle_time + max(rbt1, rbt2);
+
+	/* delta1 is the relative base time of each GCL entry within
+	 * the established port's TAS config.
+	 */
+	for (i = 0, delta1 = 0;
+	     i < offload->num_entries;
+	     delta1 += offload->entries[i].interval, i++) {
+		/* delta2 is the relative base time of each GCL entry
+		 * within the newly added TAS config.
+		 */
+		for (j = 0, delta2 = 0;
+		     j < admin->num_entries;
+		     delta2 += admin->entries[j].interval, j++) {
+			/* t1 follows all possible occurrences of the
+			 * established port's GCL entry i within the
+			 * first cycle time.
+			 */
+			for (t1 = rbt1 + delta1;
+			     t1 <= stop_time;
+			     t1 += offload->cycle_time) {
+				/* t2 follows all possible occurrences
+				 * of the newly added GCL entry j
+				 * within the first cycle time.
+				 */
+				for (t2 = rbt2 + delta2;
+				     t2 <= stop_time;
+				     t2 += admin->cycle_time) {
+					if (t1 == t2) {
+						dev_warn(priv->ds->dev,
+							 "GCL entry %d collides with entry %d of port %d\n",
+							 j, i, port);
+						return true;
+					}
+				}
+			}
+		}
+	}
+
+	return false;
+}
+
+int sja1105_setup_tc_taprio(struct dsa_switch *ds, int port,
+			    struct tc_taprio_qopt_offload *admin)
+{
+	struct sja1105_private *priv = ds->priv;
+	struct sja1105_tas_data *tas_data = &priv->tas_data;
+	int other_port, rc, i;
+
+	/* Can't change an already configured port (must delete qdisc first).
+	 * Can't delete the qdisc from an unconfigured port.
+	 */
+	if (!!tas_data->offload[port] == admin->enable)
+		return -EINVAL;
+
+	if (!admin->enable) {
+		taprio_offload_free(tas_data->offload[port]);
+		tas_data->offload[port] = NULL;
+
+		rc = sja1105_init_scheduling(priv);
+		if (rc < 0)
+			return rc;
+
+		return sja1105_static_config_reload(priv);
+	}
+
+	/* The cycle time extension is the amount of time the last cycle from
+	 * the old OPER needs to be extended in order to phase-align with the
+	 * base time of the ADMIN when that becomes the new OPER.
+	 * But of course our switch needs to be reset to switch-over between
+	 * the ADMIN and the OPER configs - so much for a seamless transition.
+	 * So don't add insult to injury and just say we don't support cycle
+	 * time extension.
+	 */
+	if (admin->cycle_time_extension)
+		return -ENOTSUPP;
+
+	if (!ns_to_sja1105_delta(admin->base_time)) {
+		dev_err(ds->dev, "A base time of zero is not hardware-allowed\n");
+		return -ERANGE;
+	}
+
+	for (i = 0; i < admin->num_entries; i++) {
+		s64 delta_ns = admin->entries[i].interval;
+		s64 delta_cycles = ns_to_sja1105_delta(delta_ns);
+		bool too_long, too_short;
+
+		too_long = (delta_cycles >= SJA1105_TAS_MAX_DELTA);
+		too_short = (delta_cycles == 0);
+		if (too_long || too_short) {
+			dev_err(priv->ds->dev,
+				"Interval %lld too %s for GCL entry %d\n",
+				delta_ns, too_long ? "long" : "short", i);
+			return -ERANGE;
+		}
+	}
+
+	for (other_port = 0; other_port < SJA1105_NUM_PORTS; other_port++) {
+		if (other_port == port)
+			continue;
+
+		if (sja1105_tas_check_conflicts(priv, other_port, admin))
+			return -ERANGE;
+	}
+
+	tas_data->offload[port] = taprio_offload_get(admin);
+
+	rc = sja1105_init_scheduling(priv);
+	if (rc < 0)
+		return rc;
+
+	return sja1105_static_config_reload(priv);
+}
+
+void sja1105_tas_setup(struct dsa_switch *ds)
+{
+}
+
+void sja1105_tas_teardown(struct dsa_switch *ds)
+{
+	struct sja1105_private *priv = ds->priv;
+	struct tc_taprio_qopt_offload *offload;
+	int port;
+
+	for (port = 0; port < SJA1105_NUM_PORTS; port++) {
+		offload = priv->tas_data.offload[port];
+		if (!offload)
+			continue;
+
+		taprio_offload_free(offload);
+	}
+}
diff --git a/drivers/net/dsa/sja1105/sja1105_tas.h b/drivers/net/dsa/sja1105/sja1105_tas.h
new file mode 100644
index 000000000000..0b803c30e640
--- /dev/null
+++ b/drivers/net/dsa/sja1105/sja1105_tas.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
+ */
+#ifndef _SJA1105_TAS_H
+#define _SJA1105_TAS_H
+
+#include <net/pkt_sched.h>
+
+#if IS_ENABLED(CONFIG_NET_DSA_SJA1105_TAS)
+
+struct sja1105_tas_data {
+	struct tc_taprio_qopt_offload *offload[SJA1105_NUM_PORTS];
+};
+
+int sja1105_setup_tc_taprio(struct dsa_switch *ds, int port,
+			    struct tc_taprio_qopt_offload *admin);
+
+void sja1105_tas_setup(struct dsa_switch *ds);
+
+void sja1105_tas_teardown(struct dsa_switch *ds);
+
+#else
+
+/* C doesn't allow empty structures, bah! */
+struct sja1105_tas_data {
+	u8 dummy;
+};
+
+static inline int sja1105_setup_tc_taprio(struct dsa_switch *ds, int port,
+					  struct tc_taprio_qopt_offload *admin)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void sja1105_tas_setup(struct dsa_switch *ds) { }
+
+static inline void sja1105_tas_teardown(struct dsa_switch *ds) { }
+
+#endif /* IS_ENABLED(CONFIG_NET_DSA_SJA1105_TAS) */
+
+#endif /* _SJA1105_TAS_H */
-- 
2.17.1
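
For readers who want to experiment with the collision check described in
the comment above sja1105_tas_check_conflicts() outside the kernel, here is
a minimal userspace sketch. It mirrors the same two steps (cycle times must
be multiples of one another, then a brute-force comparison of every event
occurrence up to one long cycle past the larger reduced base time). The
struct sched type and the helper names are hypothetical stand-ins for
illustration only, not the driver's or taprio's types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct tc_taprio_qopt_offload */
struct sched {
	int64_t base_time;		/* ns */
	int64_t cycle_time;		/* ns */
	int num_entries;
	const int64_t *interval;	/* ns, duration of each gate event */
};

static int64_t max64(int64_t a, int64_t b) { return a > b ? a : b; }
static int64_t min64(int64_t a, int64_t b) { return a < b ? a : b; }

/* Returns true if any gate event of @a starts at the same time as one of @b */
static bool sched_conflict(const struct sched *a, const struct sched *b)
{
	int64_t max_ct, min_ct, rbt1, rbt2, stop, d1, d2, t1, t2;
	int i, j;

	assert(a->cycle_time > 0 && b->cycle_time > 0);

	max_ct = max64(a->cycle_time, b->cycle_time);
	min_ct = min64(a->cycle_time, b->cycle_time);

	/* Cycle times that are not multiples are treated as a conflict */
	if (max_ct % min_ct)
		return true;

	/* "Reduced" base times: base time modulo cycle time */
	rbt1 = a->base_time % a->cycle_time;
	rbt2 = b->base_time % b->cycle_time;
	stop = max_ct + max64(rbt1, rbt2);

	/* d1/d2 are the offsets of entries i/j within their own cycle */
	for (i = 0, d1 = 0; i < a->num_entries; d1 += a->interval[i], i++)
		for (j = 0, d2 = 0; j < b->num_entries; d2 += b->interval[j], j++)
			for (t1 = rbt1 + d1; t1 <= stop; t1 += a->cycle_time)
				for (t2 = rbt2 + d2; t2 <= stop; t2 += b->cycle_time)
					if (t1 == t2)
						return true;

	return false;
}
```

As a usage example, a schedule with base time 0, cycle time 1000 and
intervals {500, 500} does not conflict with one that has base time 100,
cycle time 1000 and intervals {300, 700}, but it does conflict with one
that has base time 500, cycle time 2000 and intervals {1000, 1000}, since
both have an event starting at t = 500.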


* Re: [PATCH v4.14-stable 1/2] tcp: Reset send_head when removing skb from write-queue
From: kbuild test robot @ 2019-09-15  2:22 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: kbuild-all, stable, netdev, gregkh, Sasha Levin, David Miller,
	Eric Dumazet, Jason Baron, Vladimir Rutsky, Soheil Hassas Yeganeh,
	Neal Cardwell
In-Reply-To: <20190913200819.32686-2-cpaasch@apple.com>

Hi Christoph,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.3-rc8 next-20190904]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Christoph-Paasch/tcp-Reset-send_head-when-removing-skb-from-write-queue/20190914-144256
config: x86_64-randconfig-s0-201937 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.2-10+deb8u1) 4.9.2
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   net//ipv4/tcp.c: In function 'tcp_remove_empty_skb':
>> net//ipv4/tcp.c:948:3: error: implicit declaration of function 'tcp_check_send_head' [-Werror=implicit-function-declaration]
      tcp_check_send_head(sk, skb);
      ^
   Cyclomatic Complexity 5 include/linux/compiler.h:__read_once_size
   Cyclomatic Complexity 5 include/linux/compiler.h:__write_once_size
   Cyclomatic Complexity 1 include/linux/kasan-checks.h:kasan_check_read
   Cyclomatic Complexity 1 include/linux/kasan-checks.h:kasan_check_write
   Cyclomatic Complexity 1 arch/x86/include/asm/atomic.h:arch_atomic_read
   Cyclomatic Complexity 1 arch/x86/include/asm/atomic.h:arch_atomic_set
   Cyclomatic Complexity 1 arch/x86/include/asm/atomic.h:arch_atomic_inc
   Cyclomatic Complexity 2 arch/x86/include/asm/atomic.h:arch_atomic_dec_and_test
   Cyclomatic Complexity 1 arch/x86/include/asm/atomic64_64.h:arch_atomic64_read
   Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic_read
   Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic_set
   Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic_inc
   Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic_dec_and_test
   Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic64_read
   Cyclomatic Complexity 1 include/asm-generic/atomic-long.h:atomic_long_read
   Cyclomatic Complexity 2 arch/x86/include/asm/bitops.h:arch_set_bit
   Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:arch___set_bit
   Cyclomatic Complexity 2 arch/x86/include/asm/bitops.h:arch_clear_bit
   Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:arch___clear_bit
   Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:constant_test_bit
   Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:variable_test_bit
   Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:fls64
   Cyclomatic Complexity 1 include/asm-generic/bitops-instrumented.h:set_bit
   Cyclomatic Complexity 1 include/asm-generic/bitops-instrumented.h:__set_bit
   Cyclomatic Complexity 1 include/asm-generic/bitops-instrumented.h:clear_bit
   Cyclomatic Complexity 1 include/asm-generic/bitops-instrumented.h:__clear_bit
   Cyclomatic Complexity 2 include/asm-generic/bitops-instrumented.h:test_bit
   Cyclomatic Complexity 1 include/linux/bitops.h:ror32
   Cyclomatic Complexity 1 include/linux/log2.h:__ilog2_u64
   Cyclomatic Complexity 1 include/linux/kernel.h:kstrtoul
   Cyclomatic Complexity 1 include/linux/list.h:INIT_LIST_HEAD
   Cyclomatic Complexity 1 include/linux/percpu-defs.h:__this_cpu_preempt_check
   Cyclomatic Complexity 1 include/linux/math64.h:div_u64_rem
   Cyclomatic Complexity 1 include/linux/math64.h:div_u64
   Cyclomatic Complexity 1 arch/x86/include/asm/current.h:get_current
   Cyclomatic Complexity 2 arch/x86/include/asm/page_64.h:__phys_addr_nodebug
   Cyclomatic Complexity 69 include/asm-generic/getorder.h:get_order
   Cyclomatic Complexity 1 include/linux/jump_label.h:static_key_count
   Cyclomatic Complexity 2 include/linux/jump_label.h:static_key_slow_inc
   Cyclomatic Complexity 4 include/linux/jump_label.h:static_key_enable
   Cyclomatic Complexity 3 include/linux/string.h:memset
   Cyclomatic Complexity 4 include/linux/string.h:memcpy
   Cyclomatic Complexity 1 include/linux/err.h:IS_ERR
   Cyclomatic Complexity 1 include/linux/thread_info.h:test_ti_thread_flag
   Cyclomatic Complexity 1 include/linux/thread_info.h:check_object_size
   Cyclomatic Complexity 2 include/linux/thread_info.h:copy_overflow
   Cyclomatic Complexity 8 include/linux/thread_info.h:check_copy_size
   Cyclomatic Complexity 1 arch/x86/include/asm/preempt.h:preempt_count
   Cyclomatic Complexity 1 arch/x86/include/asm/preempt.h:should_resched
   Cyclomatic Complexity 1 include/linux/bottom_half.h:local_bh_disable
   Cyclomatic Complexity 1 include/linux/bottom_half.h:local_bh_enable
   Cyclomatic Complexity 1 include/linux/lockdep.h:lock_is_held
   Cyclomatic Complexity 1 include/linux/spinlock.h:spinlock_check
   Cyclomatic Complexity 1 include/linux/spinlock.h:spin_lock
   Cyclomatic Complexity 1 include/linux/spinlock.h:spin_unlock
   Cyclomatic Complexity 1 include/linux/spinlock.h:spin_unlock_bh
   Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_lock_acquire
   Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_lock_release
   Cyclomatic Complexity 4 include/linux/rcupdate.h:rcu_read_lock
   Cyclomatic Complexity 4 include/linux/rcupdate.h:rcu_read_unlock
   Cyclomatic Complexity 1 include/linux/time32.h:timespec64_to_timespec
   Cyclomatic Complexity 1 include/linux/ktime.h:ktime_to_ns
   Cyclomatic Complexity 1 include/linux/timekeeping.h:ktime_get_ns
   Cyclomatic Complexity 2 include/linux/page-flags.h:compound_head
   Cyclomatic Complexity 1 include/linux/gfp.h:gfpflags_allow_blocking
   Cyclomatic Complexity 3 include/linux/slab.h:kmalloc_type
   Cyclomatic Complexity 28 include/linux/slab.h:kmalloc_index
   Cyclomatic Complexity 1 include/linux/slab.h:__kmalloc_node
   Cyclomatic Complexity 1 include/linux/slab.h:kmem_cache_alloc_node_trace
   Cyclomatic Complexity 1 include/linux/slab.h:kmalloc_large
   Cyclomatic Complexity 4 include/linux/slab.h:kmalloc
   Cyclomatic Complexity 4 include/linux/slab.h:kmalloc_node
   Cyclomatic Complexity 1 include/linux/slab.h:kzalloc
   Cyclomatic Complexity 1 include/linux/refcount.h:refcount_read
   Cyclomatic Complexity 1 include/linux/sched.h:task_pid_nr
   Cyclomatic Complexity 1 include/linux/sched.h:task_thread_info
   Cyclomatic Complexity 1 include/linux/sched.h:test_tsk_thread_flag
   Cyclomatic Complexity 2 include/linux/uaccess.h:copy_from_user
   Cyclomatic Complexity 2 include/linux/uaccess.h:copy_to_user
   Cyclomatic Complexity 1 include/crypto/hash.h:__crypto_ahash_cast
   Cyclomatic Complexity 1 include/crypto/hash.h:crypto_ahash_tfm
   Cyclomatic Complexity 1 include/crypto/hash.h:crypto_ahash_reqtfm
   Cyclomatic Complexity 1 include/crypto/hash.h:crypto_ahash_reqsize
   Cyclomatic Complexity 1 include/crypto/hash.h:crypto_ahash_update
   Cyclomatic Complexity 1 include/crypto/hash.h:ahash_request_set_tfm
   Cyclomatic Complexity 2 include/crypto/hash.h:ahash_request_alloc
   Cyclomatic Complexity 1 include/crypto/hash.h:ahash_request_set_callback
   Cyclomatic Complexity 1 include/crypto/hash.h:ahash_request_set_crypt
   Cyclomatic Complexity 1 include/linux/percpu_counter.h:percpu_counter_init
   Cyclomatic Complexity 2 include/linux/percpu_counter.h:percpu_counter_add
   Cyclomatic Complexity 1 include/linux/percpu_counter.h:percpu_counter_read_positive
   Cyclomatic Complexity 1 include/linux/percpu_counter.h:percpu_counter_sum_positive
   Cyclomatic Complexity 1 include/linux/percpu_counter.h:percpu_counter_inc
   Cyclomatic Complexity 4 include/linux/poll.h:poll_wait
   Cyclomatic Complexity 3 include/linux/poll.h:poll_does_not_wait
   Cyclomatic Complexity 2 include/linux/uio.h:copy_to_iter
   Cyclomatic Complexity 2 include/linux/uio.h:copy_from_iter_full
   Cyclomatic Complexity 2 include/linux/uio.h:copy_from_iter_full_nocache

vim +/tcp_check_send_head +948 net//ipv4/tcp.c

   937	
   938	/* In some cases, both sendpage() and sendmsg() could have added
   939	 * an skb to the write queue, but failed adding payload on it.
   940	 * We need to remove it to consume less memory, but more
   941	 * importantly be able to generate EPOLLOUT for Edge Trigger epoll()
   942	 * users.
   943	 */
   944	static void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
   945	{
   946		if (skb && !skb->len) {
   947			tcp_unlink_write_queue(skb, sk);
 > 948			tcp_check_send_head(sk, skb);
   949			sk_wmem_free_skb(sk, skb);
   950		}
   951	}
   952	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31840 bytes --]

* Re: [PATCH v2] net: mdio: switch to using gpiod_get_optional()
From: Dmitry Torokhov @ 2019-09-15  6:05 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Andrew Lunn, Florian Fainelli, Heiner Kallweit, David S. Miller,
	Linus Walleij, netdev, linux-kernel
In-Reply-To: <20190914170933.GV2680@smile.fi.intel.com>

On Sat, Sep 14, 2019 at 08:09:33PM +0300, Andy Shevchenko wrote:
> On Fri, Sep 13, 2019 at 03:55:47PM -0700, Dmitry Torokhov wrote:
> > The MDIO device reset line is optional and now that gpiod_get_optional()
> > returns proper value when GPIO support is compiled out, there is no
> > reason to use fwnode_get_named_gpiod() that I plan to hide away.
> > 
> > Let's switch to using more standard gpiod_get_optional() and
> > gpiod_set_consumer_name() to keep the nice "PHY reset" label.
> > 
> > Also there is no reason to only try to fetch the reset GPIO when we have
> > OF node, gpiolib can fetch GPIO data from firmwares as well.
> > 
> 
> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

Thanks Andy.

> 
> But see comment below.
> 

> > +	mdiodev->reset_gpio = gpiod_get_optional(&mdiodev->dev,
> > +						 "reset", GPIOD_OUT_LOW);
> > +	error = PTR_ERR_OR_ZERO(mdiodev->reset_gpio);
> > +	if (error)
> > +		return error;
> > +
> 
> > +	if (mdiodev->reset_gpio)
> 
> This is redundant check.

I see that the gpiod_* APIs handle a NULL desc and usually return
immediately, but frankly I am not that comfortable with it. I'm OK with
functions that free/destroy objects recognizing NULL resources, but it is
unusual for other types of APIs.
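
To make the two styles under discussion concrete, here is a small
userspace sketch. Everything in it is a hypothetical mock (struct mock_desc,
mock_set_consumer_name() and the two setup helpers are illustration names,
not gpiolib symbols); it only assumes that the setter treats a NULL
descriptor as a successful no-op, which is the behaviour being discussed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an optional resource descriptor */
struct mock_desc {
	const char *label;
};

/* NULL-tolerant setter: a missing optional resource is a no-op success */
static int mock_set_consumer_name(struct mock_desc *desc, const char *name)
{
	if (!desc)
		return 0;

	desc->label = name;
	return 0;
}

/* Style A: rely on the callee to ignore a NULL descriptor */
static int setup_implicit(struct mock_desc *reset)
{
	return mock_set_consumer_name(reset, "PHY reset");
}

/* Style B: make the optional nature explicit at the call site */
static int setup_explicit(struct mock_desc *reset)
{
	if (reset)
		return mock_set_consumer_name(reset, "PHY reset");

	return 0;
}
```

Both helpers behave identically here; the difference is only whether the
reader of the caller can see that the resource may be absent.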

Thanks.

-- 
Dmitry

* Re: [PATCH net-next v8 2/3] net: phy: add support for clause 37 auto-negotiation
From: Tao Ren @ 2019-09-15  6:15 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, Heiner Kallweit, David S . Miller,
	Vladimir Oltean, Arun Parameswaran, Justin Chen,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	openbmc@lists.ozlabs.org
In-Reply-To: <20190914141752.GC27922@lunn.ch>

On 9/14/19 7:17 AM, Andrew Lunn wrote:
> On Mon, Sep 09, 2019 at 01:49:06PM -0700, Tao Ren wrote:
>> From: Heiner Kallweit <hkallweit1@gmail.com>
>>
>> This patch adds support for clause 37 1000Base-X auto-negotiation.
>>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>> Signed-off-by: Tao Ren <taoren@fb.com>
>> Tested-by: René van Dorst <opensource@vdorst.com>
> 
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
> 
>     Andrew

Thanks a lot, Andrew.


Cheers,

Tao

* [PATCH net-next 0/2] drop_monitor: Better sanitize notified packets
From: Ido Schimmel @ 2019-09-15  6:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, nhorman, jakub.kicinski, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@mellanox.com>

When working in 'packet' mode, drop monitor generates a notification
with a potentially truncated payload of the dropped packet. The payload
is copied from the MAC header, but I forgot to check that the MAC header
was set, so do it now.

Patch #1 sets the offsets to the various protocol layers in netdevsim,
so that it will continue to work after the MAC header check is added to
drop monitor in patch #2.
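
For illustration, the idea can be modelled in userspace roughly as below.
This is only a sketch under stated assumptions: struct mini_skb and all
mini_* helpers are hypothetical miniatures, not the kernel's struct sk_buff
API, and the "unset" header offset is modelled as an all-ones sentinel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MINI_HDR_UNSET ((uint16_t)~0U)	/* sentinel: offset never set */

/* Hypothetical miniature of a packet buffer with header offsets */
struct mini_skb {
	unsigned char buf[128];
	size_t len;			/* bytes currently in buf */
	uint16_t mac_header;		/* offsets into buf */
	uint16_t network_header;
	uint16_t transport_header;
};

static void mini_skb_init(struct mini_skb *skb)
{
	memset(skb, 0, sizeof(*skb));
	skb->mac_header = MINI_HDR_UNSET;
	skb->network_header = MINI_HDR_UNSET;
	skb->transport_header = MINI_HDR_UNSET;
}

/* Append @len bytes and return a pointer to the new area */
static unsigned char *mini_put(struct mini_skb *skb, size_t len)
{
	unsigned char *tail = skb->buf + skb->len;

	assert(skb->len + len <= sizeof(skb->buf));
	skb->len += len;
	return tail;
}

static bool mini_mac_header_was_set(const struct mini_skb *skb)
{
	return skb->mac_header != MINI_HDR_UNSET;
}

/* Copy up to @trunc_len bytes starting at the MAC header into @out.
 * Returns the number of bytes copied, or 0 if the packet is skipped
 * because its MAC header offset was never set.
 */
static size_t mini_notify(const struct mini_skb *skb, unsigned char *out,
			  size_t trunc_len)
{
	size_t n;

	if (!mini_mac_header_was_set(skb))
		return 0;

	n = skb->len - skb->mac_header;
	if (n > trunc_len)
		n = trunc_len;
	memcpy(out, skb->buf + skb->mac_header, n);
	return n;
}
```

A producer that records each offset just before appending the matching
header (offset 0 for the 14-byte Ethernet header, 14 for the 20-byte IP
header, 34 for UDP) gets a truncated or full copy from mini_notify(); one
that never records the MAC header offset is silently skipped.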

Ido Schimmel (2):
  netdevsim: Set offsets to various protocol layers
  drop_monitor: Better sanitize notified packets

 drivers/net/netdevsim/dev.c | 3 +++
 net/core/drop_monitor.c     | 6 ++++++
 2 files changed, 9 insertions(+)

-- 
2.21.0

* [PATCH net-next 1/2] netdevsim: Set offsets to various protocol layers
From: Ido Schimmel @ 2019-09-15  6:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, nhorman, jakub.kicinski, mlxsw, Ido Schimmel
In-Reply-To: <20190915064636.6884-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

The driver periodically generates "trapped" UDP packets that it then
passes on to devlink. Set the offsets to the various protocol layers.

This is a prerequisite to the next patch, where drop monitor is taught
to check that the offset to the MAC header was set.

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/netdevsim/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index 7fba7b271a57..56576d4f34a5 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -374,12 +374,14 @@ static struct sk_buff *nsim_dev_trap_skb_build(void)
 		return NULL;
 	tot_len = sizeof(struct iphdr) + sizeof(struct udphdr) + data_len;
 
+	skb_reset_mac_header(skb);
 	eth = skb_put(skb, sizeof(struct ethhdr));
 	eth_random_addr(eth->h_dest);
 	eth_random_addr(eth->h_source);
 	eth->h_proto = htons(ETH_P_IP);
 	skb->protocol = htons(ETH_P_IP);
 
+	skb_set_network_header(skb, skb->len);
 	iph = skb_put(skb, sizeof(struct iphdr));
 	iph->protocol = IPPROTO_UDP;
 	iph->saddr = in_aton("192.0.2.1");
@@ -392,6 +394,7 @@ static struct sk_buff *nsim_dev_trap_skb_build(void)
 	iph->check = 0;
 	iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
 
+	skb_set_transport_header(skb, skb->len);
 	udph = skb_put_zero(skb, sizeof(struct udphdr) + data_len);
 	get_random_bytes(&udph->source, sizeof(u16));
 	get_random_bytes(&udph->dest, sizeof(u16));
-- 
2.21.0


* [PATCH net-next 2/2] drop_monitor: Better sanitize notified packets
From: Ido Schimmel @ 2019-09-15  6:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, nhorman, jakub.kicinski, mlxsw, Ido Schimmel
In-Reply-To: <20190915064636.6884-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

When working in 'packet' mode, drop monitor generates a notification
with a potentially truncated payload of the dropped packet. The payload
is copied from the MAC header, but I forgot to check that the MAC header
was set, so do it now.

Fixes: ca30707dee2b ("drop_monitor: Add packet alert mode")
Fixes: 5e58109b1ea4 ("drop_monitor: Add support for packet alert mode for hardware drops")
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 net/core/drop_monitor.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index cc60cc22e2db..536e032d95c8 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -487,6 +487,9 @@ static void net_dm_packet_trace_kfree_skb_hit(void *ignore,
 	struct sk_buff *nskb;
 	unsigned long flags;
 
+	if (!skb_mac_header_was_set(skb))
+		return;
+
 	nskb = skb_clone(skb, GFP_ATOMIC);
 	if (!nskb)
 		return;
@@ -900,6 +903,9 @@ net_dm_hw_packet_probe(struct sk_buff *skb,
 	struct sk_buff *nskb;
 	unsigned long flags;
 
+	if (!skb_mac_header_was_set(skb))
+		return;
+
 	nskb = skb_clone(skb, GFP_ATOMIC);
 	if (!nskb)
 		return;
-- 
2.21.0


* Re: [patch iproute2-next 2/2] devlink: extend reload command to add support for network namespace change
From: Ido Schimmel @ 2019-09-15  7:16 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190914065757.27295-2-jiri@resnulli.us>

On Sat, Sep 14, 2019 at 08:57:57AM +0200, Jiri Pirko wrote:
> diff --git a/man/man8/devlink-dev.8 b/man/man8/devlink-dev.8
> index 1804463b2321..0e1a5523fa7b 100644
> --- a/man/man8/devlink-dev.8
> +++ b/man/man8/devlink-dev.8
> @@ -25,6 +25,13 @@ devlink-dev \- devlink device configuration
>  .ti -8
>  .B devlink dev help
>  
> +.ti -8
> +.BR "devlink dev set"
> +.IR DEV
> +.RI "[ "
> +.BI "netns { " PID " | " NAME " | " ID " }
> +.RI "]"
> +
>  .ti -8
>  .BR "devlink dev eswitch set"
>  .IR DEV
> @@ -92,6 +99,11 @@ Format is:
>  .in +2
>  BUS_NAME/BUS_ADDRESS
>  
> +.SS devlink dev set  - sets devlink device attributes
> +
> +.TP
> +.BI "netns { " PID " | " NAME " | " ID " }

This looks like a leftover from the previous version?

> +
>  .SS devlink dev eswitch show - display devlink device eswitch attributes
>  .SS devlink dev eswitch set  - sets devlink device eswitch attributes
>  
> -- 
> 2.21.0
> 

* Re: [patch net-next 02/15] net: fib_notifier: make FIB notifier per-netns
From: Ido Schimmel @ 2019-09-15  8:06 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190914064608.26799-3-jiri@resnulli.us>

On Sat, Sep 14, 2019 at 08:45:55AM +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> Currently all users of FIB notifier only cares about events in init_net.

s/cares/care/

> Later in this patchset, users get interested in other namespaces too.
> However, for every registered block user is interested only about one
> namespace. Make the FIB notifier registration per-netns and avoid
> unnecessary calls of notifier block for other namespaces.

...

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c b/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c
> index 5d20d615663e..fe0cc969cf94 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c
> @@ -248,9 +248,6 @@ static int mlx5_lag_fib_event(struct notifier_block *nb,
>  	struct net_device *fib_dev;
>  	struct fib_info *fi;
>  
> -	if (!net_eq(info->net, &init_net))
> -		return NOTIFY_DONE;

I don't see any more uses of 'info->net'. Can it be removed from 'struct
fib_notifier_info'?

* Re: [patch net-next 03/15] net: fib_notifier: propagate possible error during fib notifier registration
From: Ido Schimmel @ 2019-09-15  8:17 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190914064608.26799-4-jiri@resnulli.us>

On Sat, Sep 14, 2019 at 08:45:56AM +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> Unlike events for registered notifier, during the registration, the
> errors that happened for the block being registered are not propagated
> up to the caller. For fib rules, this is already present, but not for

What do you mean by "already present"? You added it below for rules as
well...

> fib entries. So make sure the error is propagated for those as well.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  include/net/ip_fib.h    |  2 +-
>  net/core/fib_notifier.c |  2 --
>  net/core/fib_rules.c    | 11 ++++++++---
>  net/ipv4/fib_notifier.c |  4 +---
>  net/ipv4/fib_trie.c     | 31 ++++++++++++++++++++++---------
>  net/ipv4/ipmr_base.c    | 22 +++++++++++++++-------
>  net/ipv6/ip6_fib.c      | 36 ++++++++++++++++++++++++------------
>  7 files changed, 71 insertions(+), 37 deletions(-)
> 
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 4cec9ecaa95e..caae0fa610aa 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -229,7 +229,7 @@ int __net_init fib4_notifier_init(struct net *net);
>  void __net_exit fib4_notifier_exit(struct net *net);
>  
>  void fib_info_notify_update(struct net *net, struct nl_info *info);
> -void fib_notify(struct net *net, struct notifier_block *nb);
> +int fib_notify(struct net *net, struct notifier_block *nb);
>  
>  struct fib_table {
>  	struct hlist_node	tb_hlist;
> diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
> index b965f3c0ec9a..fbd029425638 100644
> --- a/net/core/fib_notifier.c
> +++ b/net/core/fib_notifier.c
> @@ -65,8 +65,6 @@ static int fib_net_dump(struct net *net, struct notifier_block *nb)
>  
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(ops, &fn_net->fib_notifier_ops, list) {
> -		int err;

Looks like this should have been removed in the previous patch.

> -
>  		if (!try_module_get(ops->owner))
>  			continue;
>  		err = ops->fib_dump(net, nb);
> diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
> index 28cbf07102bc..592d8aef90e3 100644
> --- a/net/core/fib_rules.c
> +++ b/net/core/fib_rules.c
> @@ -354,15 +354,20 @@ int fib_rules_dump(struct net *net, struct notifier_block *nb, int family)
>  {
>  	struct fib_rules_ops *ops;
>  	struct fib_rule *rule;
> +	int err = 0;
>  
>  	ops = lookup_rules_ops(net, family);
>  	if (!ops)
>  		return -EAFNOSUPPORT;
> -	list_for_each_entry_rcu(rule, &ops->rules_list, list)
> -		call_fib_rule_notifier(nb, FIB_EVENT_RULE_ADD, rule, family);
> +	list_for_each_entry_rcu(rule, &ops->rules_list, list) {
> +		err = call_fib_rule_notifier(nb, FIB_EVENT_RULE_ADD,
> +					     rule, family);

Here you add it for rules

> +		if (err)
> +			break;
> +	}
>  	rules_ops_put(ops);
>  
> -	return 0;
> +	return err;
>  }
>  EXPORT_SYMBOL_GPL(fib_rules_dump);
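
The pattern the quoted hunk introduces (stop at the first failing notifier
call, still run the shared cleanup, return the error) can be sketched in
userspace as below. This is only an illustration, not kernel code: every
model_* name is hypothetical, and the callback stands in for
call_fib_rule_notifier().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type standing in for the per-rule notifier call. */
typedef int (*model_notify_fn)(int rule, void *ctx);

/* Example callback: counts calls in *ctx and rejects negative rule values. */
static int model_count_and_check(int rule, void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	return rule < 0 ? -EINVAL : 0;
}

/* Walk the rules, stop at the first error, and hand it back to the caller. */
static int model_rules_dump(const int *rules, size_t n,
			    model_notify_fn notify, void *ctx)
{
	int err = 0;
	size_t i;

	assert(notify != NULL);
	for (i = 0; i < n; i++) {
		err = notify(rules[i], ctx);
		if (err)
			break;
	}
	/* Cleanup shared by the success and failure paths would go here. */
	return err;
}
```

With the rules { 1, -2, 3 } the walk stops at the second entry, so the
callback runs twice and the caller sees -EINVAL instead of an unconditional 0.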

^ permalink raw reply

* Re: [PATCH] igb/igc: Don't warn on fatal read failures when the device is removed
From: Feng Tang @ 2019-09-15  8:27 UTC (permalink / raw)
  To: Lyude Paul
  Cc: intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	Neftin, Sasha, Kirsher, Jeffrey T, David S. Miller,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190822183318.27634-1-lyude@redhat.com>

On Fri, Aug 23, 2019 at 02:33:18AM +0800, Lyude Paul wrote:
> Fatal read errors are worth warning about, unless of course the device
> was just unplugged from the machine - something that's a rather normal
> occurence when the igb/igc adapter is located on a Thunderbolt dock. So,
> let's only WARN() if there's a fatal read error while the device is
> still present.
> 
> This fixes the following WARN splat that's been appearing whenever I
> unplug my Caldigit TS3 Thunderbolt dock from my laptop:
> 
>   igb 0000:09:00.0 enp9s0: PCIe link lost
>   ------------[ cut here ]------------
>   igb: Failed to read reg 0x18!
>   WARNING: CPU: 7 PID: 516 at
>   drivers/net/ethernet/intel/igb/igb_main.c:756 igb_rd32+0x57/0x6a [igb]
>   Modules linked in: igb dca thunderbolt fuse vfat fat elan_i2c mei_wdt
>   mei_hdcp i915 wmi_bmof intel_wmi_thunderbolt iTCO_wdt
>   iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp joydev
>   coretemp crct10dif_pclmul crc32_pclmul i2c_algo_bit ghash_clmulni_intel
>   intel_cstate drm_kms_helper intel_uncore syscopyarea sysfillrect
>   sysimgblt fb_sys_fops intel_rapl_perf intel_xhci_usb_role_switch mei_me
>   drm roles idma64 i2c_i801 ucsi_acpi typec_ucsi mei intel_lpss_pci
>   processor_thermal_device typec intel_pch_thermal intel_soc_dts_iosf
>   intel_lpss int3403_thermal thinkpad_acpi wmi int340x_thermal_zone
>   ledtrig_audio int3400_thermal acpi_thermal_rel acpi_pad video
>   pcc_cpufreq ip_tables serio_raw nvme nvme_core crc32c_intel uas
>   usb_storage e1000e i2c_dev
>   CPU: 7 PID: 516 Comm: kworker/u16:3 Not tainted 5.2.0-rc1Lyude-Test+ #14
>   Hardware name: LENOVO 20L8S2N800/20L8S2N800, BIOS N22ET35W (1.12 ) 04/09/2018
>   Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>   RIP: 0010:igb_rd32+0x57/0x6a [igb]
>   Code: 87 b8 fc ff ff 48 c7 47 08 00 00 00 00 48 c7 c6 33 42 9b c0 4c 89
>   c7 e8 47 45 cd dc 89 ee 48 c7 c7 43 42 9b c0 e8 c1 94 71 dc <0f> 0b eb
>   08 8b 00 ff c0 75 b0 eb c8 44 89 e0 5d 41 5c c3 0f 1f 44
>   RSP: 0018:ffffba5801cf7c48 EFLAGS: 00010286
>   RAX: 0000000000000000 RBX: ffff9e7956608840 RCX: 0000000000000007
>   RDX: 0000000000000000 RSI: ffffba5801cf7b24 RDI: ffff9e795e3d6a00
>   RBP: 0000000000000018 R08: 000000009dec4a01 R09: ffffffff9e61018f
>   R10: 0000000000000000 R11: ffffba5801cf7ae5 R12: 00000000ffffffff
>   R13: ffff9e7956608840 R14: ffff9e795a6f10b0 R15: 0000000000000000
>   FS:  0000000000000000(0000) GS:ffff9e795e3c0000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 0000564317bc4088 CR3: 000000010e00a006 CR4: 00000000003606e0
>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   Call Trace:
>    igb_release_hw_control+0x1a/0x30 [igb]
>    igb_remove+0xc5/0x14b [igb]
>    pci_device_remove+0x3b/0x93
>    device_release_driver_internal+0xd7/0x17e
>    pci_stop_bus_device+0x36/0x75
>    pci_stop_bus_device+0x66/0x75
>    pci_stop_bus_device+0x66/0x75
>    pci_stop_and_remove_bus_device+0xf/0x19
>    trim_stale_devices+0xc5/0x13a
>    ? __pm_runtime_resume+0x6e/0x7b
>    trim_stale_devices+0x103/0x13a
>    ? __pm_runtime_resume+0x6e/0x7b
>    trim_stale_devices+0x103/0x13a
>    acpiphp_check_bridge+0xd8/0xf5
>    acpiphp_hotplug_notify+0xf7/0x14b
>    ? acpiphp_check_bridge+0xf5/0xf5
>    acpi_device_hotplug+0x357/0x3b5
>    acpi_hotplug_work_fn+0x1a/0x23
>    process_one_work+0x1a7/0x296
>    worker_thread+0x1a8/0x24c
>    ? process_scheduled_works+0x2c/0x2c
>    kthread+0xe9/0xee
>    ? kthread_destroy_worker+0x41/0x41
>    ret_from_fork+0x35/0x40
>   ---[ end trace 252bf10352c63d22 ]---
>

Thanks for the fix.

Acked-by: Feng Tang <feng.tang@intel.com>

>
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Fixes: 47e16692b26b ("igb/igc: warn when fatal read failure happens")
> Cc: Feng Tang <feng.tang@intel.com>
> Cc: Sasha Neftin <sasha.neftin@intel.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: intel-wired-lan@lists.osuosl.org
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 3 ++-
>  drivers/net/ethernet/intel/igc/igc_main.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index e5b7e638df28..1a7f7cd28df9 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -753,7 +753,8 @@ u32 igb_rd32(struct e1000_hw *hw, u32 reg)
>  		struct net_device *netdev = igb->netdev;
>  		hw->hw_addr = NULL;
>  		netdev_err(netdev, "PCIe link lost\n");
> -		WARN(1, "igb: Failed to read reg 0x%x!\n", reg);
> +		WARN(pci_device_is_present(igb->pdev),
> +		     "igb: Failed to read reg 0x%x!\n", reg);
>  	}
>  
>  	return value;
> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> index 28072b9aa932..f873a4b35eaf 100644
> --- a/drivers/net/ethernet/intel/igc/igc_main.c
> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> @@ -3934,7 +3934,8 @@ u32 igc_rd32(struct igc_hw *hw, u32 reg)
>  		hw->hw_addr = NULL;
>  		netif_device_detach(netdev);
>  		netdev_err(netdev, "PCIe link lost, device now detached\n");
> -		WARN(1, "igc: Failed to read reg 0x%x!\n", reg);
> +		WARN(pci_device_is_present(igc->pdev),
> +		     "igc: Failed to read reg 0x%x!\n", reg);
>  	}
>  
>  	return value;
> -- 
> 2.21.0

^ permalink raw reply

* Re: [patch net-next 08/15] mlxsw: Register port netdevices into net of core
From: Ido Schimmel @ 2019-09-15  8:37 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190914064608.26799-9-jiri@resnulli.us>

On Sat, Sep 14, 2019 at 08:46:01AM +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> When creating netdevices for ports, put then under network namespace

s/then/them/

> that the core/parent devlink belongs to.

^ permalink raw reply

* Re: [patch net-next 09/15] mlxsw: Propagate extack down to register_fib_notifier()
From: Ido Schimmel @ 2019-09-15  8:39 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190914064608.26799-10-jiri@resnulli.us>

On Sat, Sep 14, 2019 at 08:46:02AM +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> During the devlink reaload the extack is present, so propagate it all

s/reaload/reload/

> the way down to register_fib_notifier() call in spectrum_router.c.

^ permalink raw reply

* Hello! Are you interested in customer databases?
From: netdev @ 2019-09-14 23:39 UTC (permalink / raw)
  To: netdev

Hello! Are you interested in customer databases?

^ permalink raw reply

* Re: [patch net-next 14/15] net: devlink: allow to change namespaces during reload
From: Ido Schimmel @ 2019-09-15  8:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190914064608.26799-15-jiri@resnulli.us>

On Sat, Sep 14, 2019 at 08:46:07AM +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> All devlink instances are created in init_net and stay there for a
> lifetime. Allow user to be able to move devlink instances into
> namespaces during devlink reload operation. That ensures proper
> re-instantiation of driver objects, including netdevices.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/main.c |   4 +
>  include/uapi/linux/devlink.h              |   4 +
>  net/core/devlink.c                        | 155 ++++++++++++++++++++--
>  3 files changed, 155 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index ef3f3d06ff1e..989d0882aaa9 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -3942,6 +3942,10 @@ static int mlx4_devlink_reload_down(struct devlink *devlink,
>  	struct mlx4_dev *dev = &priv->dev;
>  	struct mlx4_dev_persistent *persist = dev->persist;
>  
> +	if (!net_eq(devlink_net(devlink), &init_net)) {
> +		NL_SET_ERR_MSG_MOD(extack, "Namespace change is not supported");
> +		return -EOPNOTSUPP;
> +	}

Are you sure that this actually works? I see that you first invoke
reload_down(), then set the new namespace, then invoke reload_up().

So shouldn't this check be done in reload_up() callback instead?
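
To make the ordering concern concrete, here is a minimal userspace sketch.
It is not kernel code: the model_* types and functions are hypothetical
stand-ins, and only the call order (down, namespace switch, up) is taken
from devlink_reload() in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct net and struct devlink. */
struct model_net {
	int id;
};

struct model_devlink {
	const struct model_net *net;	/* namespace the instance lives in */
};

static const struct model_net model_init_net = { 0 };

/* Mirrors the check the patch adds to mlx4_devlink_reload_down(). */
static int model_reload_down(const struct model_devlink *dl)
{
	if (dl->net != &model_init_net)
		return -EOPNOTSUPP;
	return 0;
}

/* Same order as devlink_reload() in the patch: down, switch netns, up. */
static int model_reload(struct model_devlink *dl, const struct model_net *dest)
{
	int err;

	assert(dl != NULL);
	err = model_reload_down(dl);
	if (err)
		return err;
	if (dest && dest != dl->net)
		dl->net = dest;
	/* reload_up() would run here, already in the new namespace. */
	return 0;
}
```

In this model, a reload that moves the instance out of init_net passes the
down-phase check, because the namespace is switched only afterwards. The
check fires only on a later reload, once the instance already sits in the
other namespace.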

>  	if (persist->num_vfs)
>  		mlx4_warn(persist->dev, "Reload performed on PF, will cause reset on operating Virtual Functions\n");
>  	mlx4_restart_one_down(persist->pdev);
> diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
> index 580b7a2e40e1..b558ea88b766 100644
> --- a/include/uapi/linux/devlink.h
> +++ b/include/uapi/linux/devlink.h
> @@ -421,6 +421,10 @@ enum devlink_attr {
>  
>  	DEVLINK_ATTR_RELOAD_FAILED,			/* u8 0 or 1 */
>  
> +	DEVLINK_ATTR_NETNS_FD,			/* u32 */
> +	DEVLINK_ATTR_NETNS_PID,			/* u32 */
> +	DEVLINK_ATTR_NETNS_ID,			/* u32 */
> +
>  	/* add new attributes above here, update the policy in devlink.c */
>  
>  	__DEVLINK_ATTR_MAX,
> diff --git a/net/core/devlink.c b/net/core/devlink.c
> index 362cbbcca225..2a5db95cce3c 100644
> --- a/net/core/devlink.c
> +++ b/net/core/devlink.c
> @@ -435,8 +435,16 @@ static void devlink_nl_post_doit(const struct genl_ops *ops,
>  {
>  	struct devlink *devlink;
>  
> -	devlink = devlink_get_from_info(info);
> -	if (~ops->internal_flags & DEVLINK_NL_FLAG_NO_LOCK)
> +	/* When devlink changes netns, it would not be found
> +	 * by devlink_get_from_info(). So try if it is stored first.
> +	 */
> +	if (ops->internal_flags & DEVLINK_NL_FLAG_NEED_DEVLINK) {
> +		devlink = info->user_ptr[0];
> +	} else {
> +		devlink = devlink_get_from_info(info);
> +		WARN_ON(IS_ERR(devlink));
> +	}
> +	if (!IS_ERR(devlink) && ~ops->internal_flags & DEVLINK_NL_FLAG_NO_LOCK)
>  		mutex_unlock(&devlink->lock);
>  	mutex_unlock(&devlink_mutex);
>  }
> @@ -2675,6 +2683,73 @@ devlink_resources_validate(struct devlink *devlink,
>  	return err;
>  }
>  
> +static struct net *devlink_netns_get(struct sk_buff *skb,
> +				     struct devlink *devlink,
> +				     struct genl_info *info)
> +{
> +	struct nlattr *netns_pid_attr = info->attrs[DEVLINK_ATTR_NETNS_PID];
> +	struct nlattr *netns_fd_attr = info->attrs[DEVLINK_ATTR_NETNS_FD];
> +	struct nlattr *netns_id_attr = info->attrs[DEVLINK_ATTR_NETNS_ID];
> +	struct net *net;
> +
> +	if (!!netns_pid_attr + !!netns_fd_attr + !!netns_id_attr > 1) {
> +		NL_SET_ERR_MSG(info->extack, "multiple netns identifying attributes specified");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	if (netns_pid_attr) {
> +		net = get_net_ns_by_pid(nla_get_u32(netns_pid_attr));
> +	} else if (netns_fd_attr) {
> +		net = get_net_ns_by_fd(nla_get_u32(netns_fd_attr));
> +	} else if (netns_id_attr) {
> +		net = get_net_ns_by_id(sock_net(skb->sk),
> +				       nla_get_u32(netns_id_attr));
> +		if (!net)
> +			net = ERR_PTR(-EINVAL);
> +	} else {
> +		WARN_ON(1);
> +		net = ERR_PTR(-EINVAL);
> +	}
> +	if (IS_ERR(net)) {
> +		NL_SET_ERR_MSG(info->extack, "Unknown network namespace");
> +		return ERR_PTR(-EINVAL);
> +	}
> +	if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) {
> +		put_net(net);
> +		return ERR_PTR(-EPERM);
> +	}
> +	return net;
> +}
> +
> +static void devlink_param_notify(struct devlink *devlink,
> +				 unsigned int port_index,
> +				 struct devlink_param_item *param_item,
> +				 enum devlink_command cmd);
> +
> +static void devlink_reload_netns_change(struct devlink *devlink,
> +					struct net *dest_net)
> +{
> +	struct devlink_param_item *param_item;
> +
> +	/* Userspace needs to be notified about devlink objects
> +	 * removed from original and entering new network namespace.
> +	 * The rest of the devlink objects are re-created during
> +	 * reload process so the notifications are generated separatelly.
> +	 */
> +
> +	list_for_each_entry(param_item, &devlink->param_list, list)
> +		devlink_param_notify(devlink, 0, param_item,
> +				     DEVLINK_CMD_PARAM_DEL);
> +	devlink_notify(devlink, DEVLINK_CMD_DEL);
> +
> +	devlink_net_set(devlink, dest_net);
> +
> +	devlink_notify(devlink, DEVLINK_CMD_NEW);
> +	list_for_each_entry(param_item, &devlink->param_list, list)
> +		devlink_param_notify(devlink, 0, param_item,
> +				     DEVLINK_CMD_PARAM_NEW);
> +}
> +
>  static bool devlink_reload_supported(struct devlink *devlink)
>  {
>  	return devlink->ops->reload_down && devlink->ops->reload_up;
> @@ -2695,9 +2770,27 @@ bool devlink_is_reload_failed(const struct devlink *devlink)
>  }
>  EXPORT_SYMBOL_GPL(devlink_is_reload_failed);
>  
> +static int devlink_reload(struct devlink *devlink, struct net *dest_net,
> +			  struct netlink_ext_ack *extack)
> +{
> +	int err;
> +
> +	err = devlink->ops->reload_down(devlink, extack);
> +	if (err)
> +		return err;
> +
> +	if (dest_net && !net_eq(dest_net, devlink_net(devlink)))
> +		devlink_reload_netns_change(devlink, dest_net);
> +
> +	err = devlink->ops->reload_up(devlink, extack);
> +	devlink_reload_failed_set(devlink, !!err);
> +	return err;
> +}
> +
>  static int devlink_nl_cmd_reload(struct sk_buff *skb, struct genl_info *info)
>  {
>  	struct devlink *devlink = info->user_ptr[0];
> +	struct net *dest_net = NULL;
>  	int err;
>  
>  	if (!devlink_reload_supported(devlink))
> @@ -2708,11 +2801,20 @@ static int devlink_nl_cmd_reload(struct sk_buff *skb, struct genl_info *info)
>  		NL_SET_ERR_MSG_MOD(info->extack, "resources size validation failed");
>  		return err;
>  	}
> -	err = devlink->ops->reload_down(devlink, info->extack);
> -	if (err)
> -		return err;
> -	err = devlink->ops->reload_up(devlink, info->extack);
> -	devlink_reload_failed_set(devlink, !!err);
> +
> +	if (info->attrs[DEVLINK_ATTR_NETNS_PID] ||
> +	    info->attrs[DEVLINK_ATTR_NETNS_FD] ||
> +	    info->attrs[DEVLINK_ATTR_NETNS_ID]) {
> +		dest_net = devlink_netns_get(skb, devlink, info);

Hmm, you're never using 'devlink' there, so I guess you can drop it.

> +		if (IS_ERR(dest_net))
> +			return PTR_ERR(dest_net);
> +	}
> +
> +	err = devlink_reload(devlink, dest_net, info->extack);
> +
> +	if (dest_net)
> +		put_net(dest_net);
> +
>  	return err;
>  }
>  
> @@ -5794,6 +5896,9 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
>  	[DEVLINK_ATTR_TRAP_NAME] = { .type = NLA_NUL_STRING },
>  	[DEVLINK_ATTR_TRAP_ACTION] = { .type = NLA_U8 },
>  	[DEVLINK_ATTR_TRAP_GROUP_NAME] = { .type = NLA_NUL_STRING },
> +	[DEVLINK_ATTR_NETNS_PID] = { .type = NLA_U32 },
> +	[DEVLINK_ATTR_NETNS_FD] = { .type = NLA_U32 },
> +	[DEVLINK_ATTR_NETNS_ID] = { .type = NLA_U32 },
>  };
>  
>  static const struct genl_ops devlink_nl_ops[] = {
> @@ -8061,9 +8166,43 @@ int devlink_compat_switch_id_get(struct net_device *dev,
>  	return 0;
>  }
>  
> +static void __net_exit devlink_pernet_pre_exit(struct net *net)
> +{
> +	struct devlink *devlink;
> +	int err;
> +
> +	/* In case network namespace is getting destroyed, reload
> +	 * all devlink instances from this namespace into init_net.
> +	 */
> +	mutex_lock(&devlink_mutex);
> +	list_for_each_entry(devlink, &devlink_list, list) {
> +		if (net_eq(devlink_net(devlink), net)) {
> +			if (WARN_ON(!devlink_reload_supported(devlink)))
> +				continue;
> +			err = devlink_reload(devlink, &init_net, NULL);
> +			if (err)
> +				pr_warn("Failed to reload devlink instance into init_net\n");
> +		}
> +	}
> +	mutex_unlock(&devlink_mutex);
> +}
> +
> +static struct pernet_operations devlink_pernet_ops __net_initdata = {
> +	.pre_exit = devlink_pernet_pre_exit,
> +};
> +
>  static int __init devlink_init(void)
>  {
> -	return genl_register_family(&devlink_nl_family);
> +	int err;
> +
> +	err = genl_register_family(&devlink_nl_family);
> +	if (err)
> +		goto out;
> +	err = register_pernet_subsys(&devlink_pernet_ops);
> +
> +out:
> +	WARN_ON(err);
> +	return err;
>  }
>  
>  subsys_initcall(devlink_init);
> -- 
> 2.21.0
> 

^ permalink raw reply

* rt_uses_gateway was removed?
From: Julian Anastasov @ 2019-09-15  9:08 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, Ido Schimmel


	Hello,

	Looks like I'm a bit late with the storm of changes
in the routing.

	By default, after allocation rt_uses_gateway was set to 0.
Later it can be set to 1 if nh_gw is not the final route target,
i.e. it is an indirect GW and not a target on the LAN (the
RT_SCOPE_LINK check in rt_set_nexthop).

	What remains hidden about the rt_uses_gateway semantics
is this code in rt_set_nexthop:

	if (unlikely(!cached)) {
		/* Routes we intend to cache in nexthop exception or
		 * FIB nexthop have the DST_NOCACHE bit clear.
		 * However, if we are unsuccessful at storing this
		 * route into the cache we really need to set it.
		 */
		if (!rt->rt_gateway)
			rt->rt_gateway = daddr;
		rt_add_uncached_list(rt);
	}

and this code in rt_bind_exception:

	if (!rt->rt_gateway)
		rt->rt_gateway = daddr;

	I.e. even if rt_uses_gateway remains 0, rt_gateway
can contain an address, meaning the target is on the LAN and not
behind nh_gw.

	Now I see commit 1550c171935d wrongly changes that to
"If rt_gw_family is set it implies rt_uses_gateway.".
As a result, we set rt_gw_family while rt_uses_gateway was 0
for the above cases. Think about it this way: there should be
a reason why we used the rt_uses_gateway flag instead of a simple
"rt_gateway != 0" check, right?

	Replacing rt->rt_gateway checks with rt_gw_family
checks is right, but the rt_uses_gateway checks should be put
back because they indicate that the route has more hops to the
target.

	As the problem is related to some FNHE and non-cached
routes, redirects, etc., the easiest way to see the problem is with
'ip route get LOCAL_IP oif eth0', where an extra 'via GW' line is
shown.
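
	The distinction can be sketched with a small userspace model.
This is an illustration under stated assumptions, not kernel code: the
model_* names are hypothetical, the struct keeps only the fields discussed
above, and model_fill_gateway() stands in for the "if (!rt->rt_gateway)"
fallback as it behaves once rt_gw_family is set together with the address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_AF_INET 2	/* hypothetical stand-in for AF_INET */

/* Hypothetical model of the struct rtable fields discussed above. */
struct model_rtable {
	bool     uses_gateway;	/* rt_uses_gateway: more hops to the target */
	uint8_t  gw_family;	/* rt_gw_family: 0 means no address stored */
	uint32_t gw4;		/* stored gateway or destination address */
};

/* Models the fallback that stores daddr when no gateway address is set. */
static void model_fill_gateway(struct model_rtable *rt, uint32_t daddr)
{
	assert(rt != NULL);
	if (!rt->gw_family) {
		rt->gw_family = MODEL_AF_INET;
		rt->gw4 = daddr;
	}
}

/* Flag-based check: only the flag says the route is indirect. */
static bool model_is_indirect(const struct model_rtable *rt)
{
	return rt->uses_gateway;
}

/* Address-based check: treats any stored address as a gateway. */
static bool model_is_indirect_by_family(const struct model_rtable *rt)
{
	return rt->gw_family != 0;
}
```

For an on-link route that went through the fallback, the flag-based check
still reports "direct", while the address-based check reports "indirect",
which matches the extra 'via GW' symptom described above.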

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [patch net-next 02/15] net: fib_notifier: make FIB notifier per-netns
From: Jiri Pirko @ 2019-09-15  9:37 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190915080602.GA11194@splinter>

Sun, Sep 15, 2019 at 10:06:02AM CEST, idosch@idosch.org wrote:
>On Sat, Sep 14, 2019 at 08:45:55AM +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> Currently all users of FIB notifier only cares about events in init_net.
>
>s/cares/care/

ok


>
>> Later in this patchset, users get interested in other namespaces too.
>> However, for every registered block user is interested only about one
>> namespace. Make the FIB notifier registration per-netns and avoid
>> unnecessary calls of notifier block for other namespaces.
>
>...
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c b/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c
>> index 5d20d615663e..fe0cc969cf94 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c
>> @@ -248,9 +248,6 @@ static int mlx5_lag_fib_event(struct notifier_block *nb,
>>  	struct net_device *fib_dev;
>>  	struct fib_info *fi;
>>  
>> -	if (!net_eq(info->net, &init_net))
>> -		return NOTIFY_DONE;
>
>I don't see anymore uses of 'info->net'. Can it be removed from 'struct
>fib_notifier_info' ?

correct. I missed that.


^ permalink raw reply

* Re: [patch net-next 03/15] net: fib_notifier: propagate possible error during fib notifier registration
From: Jiri Pirko @ 2019-09-15  9:41 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190915081746.GB11194@splinter>

Sun, Sep 15, 2019 at 10:17:46AM CEST, idosch@idosch.org wrote:
>On Sat, Sep 14, 2019 at 08:45:56AM +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> Unlike events for registered notifier, during the registration, the
>> errors that happened for the block being registered are not propagated
>> up to the caller. For fib rules, this is already present, but not for
>
>What do you mean by "already present" ? You added it below for rules as
>well...

Right, will fix.


>
>> fib entries. So make sure the error is propagated for those as well.
>> 
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  include/net/ip_fib.h    |  2 +-
>>  net/core/fib_notifier.c |  2 --
>>  net/core/fib_rules.c    | 11 ++++++++---
>>  net/ipv4/fib_notifier.c |  4 +---
>>  net/ipv4/fib_trie.c     | 31 ++++++++++++++++++++++---------
>>  net/ipv4/ipmr_base.c    | 22 +++++++++++++++-------
>>  net/ipv6/ip6_fib.c      | 36 ++++++++++++++++++++++++------------
>>  7 files changed, 71 insertions(+), 37 deletions(-)
>> 
>> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
>> index 4cec9ecaa95e..caae0fa610aa 100644
>> --- a/include/net/ip_fib.h
>> +++ b/include/net/ip_fib.h
>> @@ -229,7 +229,7 @@ int __net_init fib4_notifier_init(struct net *net);
>>  void __net_exit fib4_notifier_exit(struct net *net);
>>  
>>  void fib_info_notify_update(struct net *net, struct nl_info *info);
>> -void fib_notify(struct net *net, struct notifier_block *nb);
>> +int fib_notify(struct net *net, struct notifier_block *nb);
>>  
>>  struct fib_table {
>>  	struct hlist_node	tb_hlist;
>> diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
>> index b965f3c0ec9a..fbd029425638 100644
>> --- a/net/core/fib_notifier.c
>> +++ b/net/core/fib_notifier.c
>> @@ -65,8 +65,6 @@ static int fib_net_dump(struct net *net, struct notifier_block *nb)
>>  
>>  	rcu_read_lock();
>>  	list_for_each_entry_rcu(ops, &fn_net->fib_notifier_ops, list) {
>> -		int err;
>
>Looks like this should have been removed in previous patch

Correct, will move.


>
>> -
>>  		if (!try_module_get(ops->owner))
>>  			continue;
>>  		err = ops->fib_dump(net, nb);
>> diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
>> index 28cbf07102bc..592d8aef90e3 100644
>> --- a/net/core/fib_rules.c
>> +++ b/net/core/fib_rules.c
>> @@ -354,15 +354,20 @@ int fib_rules_dump(struct net *net, struct notifier_block *nb, int family)
>>  {
>>  	struct fib_rules_ops *ops;
>>  	struct fib_rule *rule;
>> +	int err = 0;
>>  
>>  	ops = lookup_rules_ops(net, family);
>>  	if (!ops)
>>  		return -EAFNOSUPPORT;
>> -	list_for_each_entry_rcu(rule, &ops->rules_list, list)
>> -		call_fib_rule_notifier(nb, FIB_EVENT_RULE_ADD, rule, family);
>> +	list_for_each_entry_rcu(rule, &ops->rules_list, list) {
>> +		err = call_fib_rule_notifier(nb, FIB_EVENT_RULE_ADD,
>> +					     rule, family);
>
>Here you add it for rules
>
>> +		if (err)
>> +			break;
>> +	}
>>  	rules_ops_put(ops);
>>  
>> -	return 0;
>> +	return err;
>>  }
>>  EXPORT_SYMBOL_GPL(fib_rules_dump);

^ permalink raw reply

* Re: [patch net-next 14/15] net: devlink: allow to change namespaces during reload
From: Jiri Pirko @ 2019-09-15  9:43 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190915085829.GE11194@splinter>

Sun, Sep 15, 2019 at 10:58:29AM CEST, idosch@idosch.org wrote:
>On Sat, Sep 14, 2019 at 08:46:07AM +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> All devlink instances are created in init_net and stay there for a
>> lifetime. Allow user to be able to move devlink instances into
>> namespaces during devlink reload operation. That ensures proper
>> re-instantiation of driver objects, including netdevices.
>> 
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx4/main.c |   4 +
>>  include/uapi/linux/devlink.h              |   4 +
>>  net/core/devlink.c                        | 155 ++++++++++++++++++++--
>>  3 files changed, 155 insertions(+), 8 deletions(-)
>> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>> index ef3f3d06ff1e..989d0882aaa9 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>> @@ -3942,6 +3942,10 @@ static int mlx4_devlink_reload_down(struct devlink *devlink,
>>  	struct mlx4_dev *dev = &priv->dev;
>>  	struct mlx4_dev_persistent *persist = dev->persist;
>>  
>> +	if (!net_eq(devlink_net(devlink), &init_net)) {
>> +		NL_SET_ERR_MSG_MOD(extack, "Namespace change is not supported");
>> +		return -EOPNOTSUPP;
>> +	}
>
>Are you sure that this actually works? I see that you first invoke
>reload_down(), then set the new namespace, then invoke reload_up().
>
>So shouldn't this check be done in reload_up() callback instead?

Correct, I need to fix this. But I need to do this in the down phase so
the objects are not removed.


>
>>  	if (persist->num_vfs)
>>  		mlx4_warn(persist->dev, "Reload performed on PF, will cause reset on operating Virtual Functions\n");
>>  	mlx4_restart_one_down(persist->pdev);
>> diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
>> index 580b7a2e40e1..b558ea88b766 100644
>> --- a/include/uapi/linux/devlink.h
>> +++ b/include/uapi/linux/devlink.h
>> @@ -421,6 +421,10 @@ enum devlink_attr {
>>  
>>  	DEVLINK_ATTR_RELOAD_FAILED,			/* u8 0 or 1 */
>>  
>> +	DEVLINK_ATTR_NETNS_FD,			/* u32 */
>> +	DEVLINK_ATTR_NETNS_PID,			/* u32 */
>> +	DEVLINK_ATTR_NETNS_ID,			/* u32 */
>> +
>>  	/* add new attributes above here, update the policy in devlink.c */
>>  
>>  	__DEVLINK_ATTR_MAX,
>> diff --git a/net/core/devlink.c b/net/core/devlink.c
>> index 362cbbcca225..2a5db95cce3c 100644
>> --- a/net/core/devlink.c
>> +++ b/net/core/devlink.c
>> @@ -435,8 +435,16 @@ static void devlink_nl_post_doit(const struct genl_ops *ops,
>>  {
>>  	struct devlink *devlink;
>>  
>> -	devlink = devlink_get_from_info(info);
>> -	if (~ops->internal_flags & DEVLINK_NL_FLAG_NO_LOCK)
>> +	/* When devlink changes netns, it would not be found
>> +	 * by devlink_get_from_info(). So try if it is stored first.
>> +	 */
>> +	if (ops->internal_flags & DEVLINK_NL_FLAG_NEED_DEVLINK) {
>> +		devlink = info->user_ptr[0];
>> +	} else {
>> +		devlink = devlink_get_from_info(info);
>> +		WARN_ON(IS_ERR(devlink));
>> +	}
>> +	if (!IS_ERR(devlink) && ~ops->internal_flags & DEVLINK_NL_FLAG_NO_LOCK)
>>  		mutex_unlock(&devlink->lock);
>>  	mutex_unlock(&devlink_mutex);
>>  }
>> @@ -2675,6 +2683,73 @@ devlink_resources_validate(struct devlink *devlink,
>>  	return err;
>>  }
>>  
>> +static struct net *devlink_netns_get(struct sk_buff *skb,
>> +				     struct devlink *devlink,
>> +				     struct genl_info *info)
>> +{
>> +	struct nlattr *netns_pid_attr = info->attrs[DEVLINK_ATTR_NETNS_PID];
>> +	struct nlattr *netns_fd_attr = info->attrs[DEVLINK_ATTR_NETNS_FD];
>> +	struct nlattr *netns_id_attr = info->attrs[DEVLINK_ATTR_NETNS_ID];
>> +	struct net *net;
>> +
>> +	if (!!netns_pid_attr + !!netns_fd_attr + !!netns_id_attr > 1) {
>> +		NL_SET_ERR_MSG(info->extack, "multiple netns identifying attributes specified");
>> +		return ERR_PTR(-EINVAL);
>> +	}
>> +
>> +	if (netns_pid_attr) {
>> +		net = get_net_ns_by_pid(nla_get_u32(netns_pid_attr));
>> +	} else if (netns_fd_attr) {
>> +		net = get_net_ns_by_fd(nla_get_u32(netns_fd_attr));
>> +	} else if (netns_id_attr) {
>> +		net = get_net_ns_by_id(sock_net(skb->sk),
>> +				       nla_get_u32(netns_id_attr));
>> +		if (!net)
>> +			net = ERR_PTR(-EINVAL);
>> +	} else {
>> +		WARN_ON(1);
>> +		net = ERR_PTR(-EINVAL);
>> +	}
>> +	if (IS_ERR(net)) {
>> +		NL_SET_ERR_MSG(info->extack, "Unknown network namespace");
>> +		return ERR_PTR(-EINVAL);
>> +	}
>> +	if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) {
>> +		put_net(net);
>> +		return ERR_PTR(-EPERM);
>> +	}
>> +	return net;
>> +}
>> +
>> +static void devlink_param_notify(struct devlink *devlink,
>> +				 unsigned int port_index,
>> +				 struct devlink_param_item *param_item,
>> +				 enum devlink_command cmd);
>> +
>> +static void devlink_reload_netns_change(struct devlink *devlink,
>> +					struct net *dest_net)
>> +{
>> +	struct devlink_param_item *param_item;
>> +
>> +	/* Userspace needs to be notified about devlink objects
>> +	 * removed from original and entering new network namespace.
>> +	 * The rest of the devlink objects are re-created during
>> +	 * reload process so the notifications are generated separatelly.
>> +	 */
>> +
>> +	list_for_each_entry(param_item, &devlink->param_list, list)
>> +		devlink_param_notify(devlink, 0, param_item,
>> +				     DEVLINK_CMD_PARAM_DEL);
>> +	devlink_notify(devlink, DEVLINK_CMD_DEL);
>> +
>> +	devlink_net_set(devlink, dest_net);
>> +
>> +	devlink_notify(devlink, DEVLINK_CMD_NEW);
>> +	list_for_each_entry(param_item, &devlink->param_list, list)
>> +		devlink_param_notify(devlink, 0, param_item,
>> +				     DEVLINK_CMD_PARAM_NEW);
>> +}
>> +
>>  static bool devlink_reload_supported(struct devlink *devlink)
>>  {
>>  	return devlink->ops->reload_down && devlink->ops->reload_up;
>> @@ -2695,9 +2770,27 @@ bool devlink_is_reload_failed(const struct devlink *devlink)
>>  }
>>  EXPORT_SYMBOL_GPL(devlink_is_reload_failed);
>>  
>> +static int devlink_reload(struct devlink *devlink, struct net *dest_net,
>> +			  struct netlink_ext_ack *extack)
>> +{
>> +	int err;
>> +
>> +	err = devlink->ops->reload_down(devlink, extack);
>> +	if (err)
>> +		return err;
>> +
>> +	if (dest_net && !net_eq(dest_net, devlink_net(devlink)))
>> +		devlink_reload_netns_change(devlink, dest_net);
>> +
>> +	err = devlink->ops->reload_up(devlink, extack);
>> +	devlink_reload_failed_set(devlink, !!err);
>> +	return err;
>> +}
>> +
>>  static int devlink_nl_cmd_reload(struct sk_buff *skb, struct genl_info *info)
>>  {
>>  	struct devlink *devlink = info->user_ptr[0];
>> +	struct net *dest_net = NULL;
>>  	int err;
>>  
>>  	if (!devlink_reload_supported(devlink))
>> @@ -2708,11 +2801,20 @@ static int devlink_nl_cmd_reload(struct sk_buff *skb, struct genl_info *info)
>>  		NL_SET_ERR_MSG_MOD(info->extack, "resources size validation failed");
>>  		return err;
>>  	}
>> -	err = devlink->ops->reload_down(devlink, info->extack);
>> -	if (err)
>> -		return err;
>> -	err = devlink->ops->reload_up(devlink, info->extack);
>> -	devlink_reload_failed_set(devlink, !!err);
>> +
>> +	if (info->attrs[DEVLINK_ATTR_NETNS_PID] ||
>> +	    info->attrs[DEVLINK_ATTR_NETNS_FD] ||
>> +	    info->attrs[DEVLINK_ATTR_NETNS_ID]) {
>> +		dest_net = devlink_netns_get(skb, devlink, info);
>
>Hmm, you're never using 'devlink' there, so I guess you can drop it.

Right, will do.


>
>> +		if (IS_ERR(dest_net))
>> +			return PTR_ERR(dest_net);
>> +	}
>> +
>> +	err = devlink_reload(devlink, dest_net, info->extack);
>> +
>> +	if (dest_net)
>> +		put_net(dest_net);
>> +
>>  	return err;
>>  }
>>  
>> @@ -5794,6 +5896,9 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
>>  	[DEVLINK_ATTR_TRAP_NAME] = { .type = NLA_NUL_STRING },
>>  	[DEVLINK_ATTR_TRAP_ACTION] = { .type = NLA_U8 },
>>  	[DEVLINK_ATTR_TRAP_GROUP_NAME] = { .type = NLA_NUL_STRING },
>> +	[DEVLINK_ATTR_NETNS_PID] = { .type = NLA_U32 },
>> +	[DEVLINK_ATTR_NETNS_FD] = { .type = NLA_U32 },
>> +	[DEVLINK_ATTR_NETNS_ID] = { .type = NLA_U32 },
>>  };
>>  
>>  static const struct genl_ops devlink_nl_ops[] = {
>> @@ -8061,9 +8166,43 @@ int devlink_compat_switch_id_get(struct net_device *dev,
>>  	return 0;
>>  }
>>  
>> +static void __net_exit devlink_pernet_pre_exit(struct net *net)
>> +{
>> +	struct devlink *devlink;
>> +	int err;
>> +
>> +	/* In case network namespace is getting destroyed, reload
>> +	 * all devlink instances from this namespace into init_net.
>> +	 */
>> +	mutex_lock(&devlink_mutex);
>> +	list_for_each_entry(devlink, &devlink_list, list) {
>> +		if (net_eq(devlink_net(devlink), net)) {
>> +			if (WARN_ON(!devlink_reload_supported(devlink)))
>> +				continue;
>> +			err = devlink_reload(devlink, &init_net, NULL);
>> +			if (err)
>> +				pr_warn("Failed to reload devlink instance into init_net\n");
>> +		}
>> +	}
>> +	mutex_unlock(&devlink_mutex);
>> +}
>> +
>> +static struct pernet_operations devlink_pernet_ops __net_initdata = {
>> +	.pre_exit = devlink_pernet_pre_exit,
>> +};
>> +
>>  static int __init devlink_init(void)
>>  {
>> -	return genl_register_family(&devlink_nl_family);
>> +	int err;
>> +
>> +	err = genl_register_family(&devlink_nl_family);
>> +	if (err)
>> +		goto out;
>> +	err = register_pernet_subsys(&devlink_pernet_ops);
>> +
>> +out:
>> +	WARN_ON(err);
>> +	return err;
>>  }
>>  
>>  subsys_initcall(devlink_init);
>> -- 
>> 2.21.0
>> 

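For readers following the reload flow in the quoted patch (reload_down, optional namespace switch, reload_up, then recording the failed state), here is a minimal userspace sketch of that ordering. It is not the kernel code: every `mock_*` name, the struct layouts, and the `down_err`/`up_err` knobs are hypothetical stand-ins introduced only for illustration, and the namespace switch is reduced to a pointer assignment without any notifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel types. */
struct mock_net {
	int id;
};

struct mock_devlink {
	struct mock_net *net;	/* namespace the instance currently lives in */
	bool reload_failed;	/* mirrors the "reload failed" state */
	int down_err;		/* value the mock reload_down() returns */
	int up_err;		/* value the mock reload_up() returns */
	int netns_changes;	/* how many times the namespace was switched */
};

static int mock_reload_down(struct mock_devlink *dl)
{
	return dl->down_err;
}

static int mock_reload_up(struct mock_devlink *dl)
{
	return dl->up_err;
}

/* Same ordering as the quoted devlink_reload(): down, optional move, up. */
static int mock_devlink_reload(struct mock_devlink *dl,
			       struct mock_net *dest_net)
{
	int err;

	assert(dl != NULL);

	err = mock_reload_down(dl);
	if (err)
		return err;	/* early return: namespace and failed flag untouched */

	/* Only switch when a destination is given and it differs. */
	if (dest_net && dest_net != dl->net) {
		dl->net = dest_net;
		dl->netns_changes++;
	}

	err = mock_reload_up(dl);
	dl->reload_failed = err != 0;
	return err;
}
```

One consequence of this ordering, visible in the sketch, is that a failing reload_up leaves the instance already moved to the destination namespace with the failed flag set, while a failing reload_down leaves everything as it was.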
* Re: [patch iproute2-next 2/2] devlink: extend reload command to add support for network namespace change
From: Jiri Pirko @ 2019-09-15  9:44 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, idosch, dsahern, jakub.kicinski, tariqt, saeedm,
	kuznet, yoshfuji, shuah, mlxsw
In-Reply-To: <20190915071639.GA8776@splinter>

Sun, Sep 15, 2019 at 09:16:39AM CEST, idosch@idosch.org wrote:
>On Sat, Sep 14, 2019 at 08:57:57AM +0200, Jiri Pirko wrote:
>> diff --git a/man/man8/devlink-dev.8 b/man/man8/devlink-dev.8
>> index 1804463b2321..0e1a5523fa7b 100644
>> --- a/man/man8/devlink-dev.8
>> +++ b/man/man8/devlink-dev.8
>> @@ -25,6 +25,13 @@ devlink-dev \- devlink device configuration
>>  .ti -8
>>  .B devlink dev help
>>  
>> +.ti -8
>> +.BR "devlink dev set"
>> +.IR DEV
>> +.RI "[ "
>> +.BI "netns { " PID " | " NAME " | " ID " }
>> +.RI "]"
>> +
>>  .ti -8
>>  .BR "devlink dev eswitch set"
>>  .IR DEV
>> @@ -92,6 +99,11 @@ Format is:
>>  .in +2
>>  BUS_NAME/BUS_ADDRESS
>>  
>> +.SS devlink dev set  - sets devlink device attributes
>> +
>> +.TP
>> +.BI "netns { " PID " | " NAME " | " ID " }
>
>This looks like a leftover from the previous version?

Will fix. Thanks!

>
>> +
>>  .SS devlink dev eswitch show - display devlink device eswitch attributes
>>  .SS devlink dev eswitch set  - sets devlink device eswitch attributes
>>  
>> -- 
>> 2.21.0
>> 
