Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: Lawrence Brakmo @ 2018-06-29 18:55 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Yuchung Cheng, Matt Mathis, Netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Eric Dumazet, Wei Wang, Steve Ibanez,
	Yousuk Seung
In-Reply-To: <CADVnQy=MsiEBCr+Mnp97mp0MxDqrA+_KiZEQehgcDfe9L-hghQ@mail.gmail.com>

On 6/28/18, 1:48 PM, "netdev-owner@vger.kernel.org on behalf of Neal Cardwell" <netdev-owner@vger.kernel.org on behalf of ncardwell@google.com> wrote:

    On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo <brakmo@fb.com> wrote:
    >
    > I just looked at 4.18 traces and the behavior is as follows:
    >
    >    Host A sends the last packets of the request
    >
    >    Host B receives them, and the last packet is marked with congestion (CE)
    >
    >    Host B sends ACKs for packets not marked with congestion
    >
    >    Host B sends data packet with reply and ACK for packet marked with congestion (TCP flag ECE)
    >
    >    Host A receives ACKs with no ECE flag
    >
    >    Host A receives data packet with ACK for the last packet of request and has TCP ECE bit set
    >
    >    Host A sends 1st data packet of the next request with TCP flag CWR
    >
    >    Host B receives the packet (as seen in tcpdump at B), no CE flag
    >
    >    Host B sends a dup ACK that also has the TCP ECE flag
    >
    >    Host A RTO timer fires!
    >
    >    Host A to send the next packet
    >
    >    Host A receives an ACK for everything it has sent (i.e. Host B did receive 1st packet of request)
    >
    >    Host A send more packets…
    
    Thanks, Larry! This is very interesting. I don't know the cause, but
    this reminds me of an issue  Steve Ibanez raised on the netdev list
    last December, where he was seeing cases with DCTCP where a CWR packet
    would be received and buffered by Host B but not ACKed by Host B. This
    was the thread "Re: Linux ECN Handling", starting around December 5. I
    have cc-ed Steve.
    
    I wonder if this may somehow be related to the DCTCP logic to rewind
    tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
    DCTCP notices that the incoming CE bits have been changed while the
    receiver thinks it is holding on to a delayed ACK (in
    dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
    "synthetic" call to tcp_send_ack() somehow has side effects in the
    delayed ACK state machine that can cause the connection to forget that
    it still needs to fire a delayed ACK, even though it just sent an ACK
    just now.
    
    neal
    
You were right Neal, one of the bugs is related to this and is caused by a lack of state update to DCTCP. DCTCP is first informed that the ACK was delayed but it is not updated when the ACK is sent with a data packet.

I am working on a patch to fix this which hopefully should be out today. Thanks everyone for the great feedback! 

^ permalink raw reply

* Re: [PATCH v1 net-next 14/14] net/sched: Make etf report drops on error_queue
From: Willem de Bruijn @ 2018-06-29 18:49 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: Network Development, Thomas Gleixner, jan.altenberg,
	Vinicius Gomes, kurt.kanzenbach, Henrik Austad, Richard Cochran,
	ilias.apalodimas, ivan.khoronzhuk, Miroslav Lichvar,
	Willem de Bruijn, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko, Alexander Duyck
In-Reply-To: <88846dcd-b4dc-ddde-6c4b-5a29ffde016b@intel.com>

> >> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
> >> index 5514a8aa3bd5..166f4b72875b 100644
> >> --- a/net/sched/sch_etf.c
> >> +++ b/net/sched/sch_etf.c
> >> @@ -11,6 +11,7 @@
> >>  #include <linux/kernel.h>
> >>  #include <linux/string.h>
> >>  #include <linux/errno.h>
> >> +#include <linux/errqueue.h>
> >>  #include <linux/rbtree.h>
> >>  #include <linux/skbuff.h>
> >>  #include <linux/posix-timers.h>
> >> @@ -124,6 +125,35 @@ static void reset_watchdog(struct Qdisc *sch)
> >>         qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
> >>  }
> >>
> >> +static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
> >> +{
> >> +       struct sock_exterr_skb *serr;
> >> +       ktime_t txtime = skb->tstamp;
> >> +
> >> +       if (!skb->sk || !(skb->sk->sk_txtime_flags & SK_TXTIME_RECV_ERR_MASK))
> >> +               return;
> >> +
> >> +       skb = skb_clone_sk(skb);
> >> +       if (!skb)
> >> +               return;
> >> +
> >> +       sock_hold(skb->sk);
> >
> > Why take an extra reference? The skb holds a ref on the sk.
>
>
> Yes, the cloned skb holds a ref on the socket, but the documentation of
> skb_clone_sk() makes this explicit suggestion:
>
> (...)
>  * When passing buffers allocated with this function to sock_queue_err_skb
>  * it is necessary to wrap the call with sock_hold/sock_put in order to
>  * prevent the socket from being released prior to being enqueued on
>  * the sk_error_queue.
>  */
>
> which I believe is here just so we are protected against a possible race after
> skb_orphan() is called from sock_queue_err_skb(). Please let me know if I'm
> misreading anything.

Yes, indeed. Code only has to worry about that if there are no
concurrent references
on the socket.

I may be mistaken, but I believe that this complicated logic exists
only for cases where
the timestamp may be queued after the original skb has been released.
Specifically,
when a tx timestamp is returned from a hardware device after transmission of the
original skb. Then the cloned timestamp skb needs its own reference on
the sk while
it is waiting for the timestamp data (i.e., until the device
completion arrives) and then
we need a temporary extra ref to work around the skb_orphan in
sock_queue_err_skb.

Compare skb_complete_tx_timestamp with skb_tstamp_tx. The second is used in
the regular datapath to clone an skb and queue it on the error queue
immediately,
while holding the original skb. This does not call skb_clone_sk and
does not need the
extra sock_hold. This should be good enough for this code path, too.
As kb holds a
ref on skb->sk, the socket cannot go away in the middle of report_sock_error.

^ permalink raw reply

* Re: [PATCH] cfg80211: use IDA to allocate wiphy indeces
From: Brian Norris @ 2018-06-29 18:48 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-kernel, linux-wireless, netdev
In-Reply-To: <1530258140.3481.4.camel@sipsolutions.net>

Hi Johannes,

On Fri, Jun 29, 2018 at 09:42:20AM +0200, Johannes Berg wrote:
> On Wed, 2018-06-20 at 18:29 -0700, Brian Norris wrote:
> > It's annoying to see the phy index increase arbitrarily, just because a
> > device got removed and re-probed (e.g., during a device reset, or due to
> > probe testing). We can use the in-kernel index allocator for this,
> > instead of just an increasing counter.
> 
> I can understand that it's somewhat annoying to people, but it was
> actually done on purpose to avoid userspace talking to the wrong device.

Hmm, interesting. I'm not dead-set on this patch, so if there are good
reasons to reject it, I won't fret.

> Imagine you have some userspace process running that has remembered the
> wiphy index to use it to talk to nl80211, and now underneath the device
> goes away and reappears. This process should understand that situation,
> and handle it accordingly, rather than being blind to the reset.

How is this different from the wlan (netdev) device naming? We allow
'wlan0' to leave and return under the same name. Isn't the right answer
that user space should be listening for udev and/or netlink events?

Brian

^ permalink raw reply

* Re: [PATCH bpf 3/3] bpf: undo prog rejection on read-only lock failure
From: Kees Cook @ 2018-06-29 18:42 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, Network Development, Laura Abbott
In-Reply-To: <20180628213459.28631-4-daniel@iogearbox.net>

On Thu, Jun 28, 2018 at 2:34 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> Kees suggested that if set_memory_*() can fail, we should annotate it with
> __must_check, and all callers need to deal with it gracefully given those
> set_memory_*() markings aren't "advisory", but they're expected to actually
> do what they say. This might be an option worth to move forward in future
> but would at the same time require that set_memory_*() calls from supporting
> archs are guaranteed to be "atomic" in that they provide rollback if part
> of the range fails, once that happened, the transition from RW -> RO could
> be made more robust that way, while subsequent RO -> RW transition /must/
> continue guaranteeing to always succeed the undo part.

Does this mean we can have BPF filters that aren't read-only then?
What's the situation where set_memory_ro() fails? (Can it be induced
by the user?)

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* [PATCH] atm: zatm: Fix potential Spectre v1
From: Gustavo A. R. Silva @ 2018-06-29 18:28 UTC (permalink / raw)
  To: Chas Williams, David S. Miller
  Cc: linux-atm-general, netdev, linux-kernel, Gustavo A. R. Silva

pool can be indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

drivers/atm/zatm.c:1491 zatm_ioctl() warn: potential spectre issue
'zatm_dev->pool_info' (local cap)

Fix this by sanitizing pool before using it to index
zatm_dev->pool_info

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/atm/zatm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/atm/zatm.c b/drivers/atm/zatm.c
index a8d2eb0..2c288d1 100644
--- a/drivers/atm/zatm.c
+++ b/drivers/atm/zatm.c
@@ -1483,6 +1483,8 @@ static int zatm_ioctl(struct atm_dev *dev,unsigned int cmd,void __user *arg)
 					return -EFAULT;
 				if (pool < 0 || pool > ZATM_LAST_POOL)
 					return -EINVAL;
+				pool = array_index_nospec(pool,
+							  ZATM_LAST_POOL + 1);
 				if (copy_from_user(&info,
 				    &((struct zatm_pool_req __user *) arg)->info,
 				    sizeof(info))) return -EFAULT;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v6 0/3] multi-threading device shutdown
From: Pavel Tatashin @ 2018-06-29 18:25 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh,
	alexander.duyck, tobin, andy.shevchenko

Changelog
v6 - v5
	- Added Review-by from Andy.
	- Synced with mainline
v4 - v5
	- Addressed comments from Andy Shevchenko and Greg
	  Kroah-Hartman
	- Split the patch into a series of 3 patches in order to
	  provide a better bisecting, and facilitate with reviewing.
v3 - v4
	- Added device_shutdown_serial kernel parameter to disable
	  multi-threading as suggested by Greg Kroah-Hartman

v2 - v3
	- Fixed warning from kbuild test.
	- Moved device_lock/device_unlock inside device_shutdown_tree().

v1 - v2
	- It turns out we cannot lock more than MAX_LOCK_DEPTH by a single
	  thread. (By default this value is 48), and is used to detect
	  deadlocks. So, I re-wrote the code to only lock one devices per
	  thread instead of pre-locking all devices by the main thread.
	- Addressed comments from Tobin C. Harding.
	- As suggested by Alexander Duyck removed ixgbe changes. It can be
	  done as a separate work scaling RTNL mutex.

Do a faster shutdown by calling dev->*->shutdown(dev) in parallel.
device_shutdown() calls these functions for every single device but
only using one thread.

Since, nothing else is running on the machine by the time
device_shutdown() is called, there is no reason not to utilize all the
available CPU resources.

Pavel Tatashin (3):
  drivers core: refactor device_shutdown
  drivers core: prepare device_shutdown for multi-threading
  drivers core: multi-threading device shutdown

 drivers/base/core.c | 290 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 243 insertions(+), 47 deletions(-)

-- 
2.18.0

^ permalink raw reply

* [PATCH v6 3/3] drivers core: multi-threading device shutdown
From: Pavel Tatashin @ 2018-06-29 18:25 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh,
	alexander.duyck, tobin, andy.shevchenko
In-Reply-To: <20180629182541.6735-1-pasha.tatashin@oracle.com>

When system is rebooted, halted or kexeced device_shutdown() is
called.

This function shuts down every single device by calling either:

	dev->bus->shutdown(dev)
	dev->driver->shutdown(dev)

Even on a machine with just a moderate amount of devices, device_shutdown()
may take multiple seconds to complete. This is because many devices require
a specific delays to perform this operation.

Here is a sample analysis of time it takes to call device_shutdown() on a
two socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz machine.

device_shutdown		2.95s
-----------------------------
 mlx4_shutdown		1.14s
 megasas_shutdown	0.24s
 ixgbe_shutdown		0.37s x 4 (four ixgbe devices on this machine).
 the rest		0.09s

In mlx4 we spent the most time, but that is because there is a 1 second
sleep, which is defined by hardware specifications:
mlx4_shutdown
 mlx4_unload_one
  mlx4_free_ownership
   msleep(1000)

With megasas we spend a quarter of a second, but sometimes longer (up-to
0.5s) in this path:

    megasas_shutdown
      megasas_flush_cache
        megasas_issue_blocked_cmd
          wait_event_timeout

Finally, with ixgbe_shutdown() it takes 0.37 for each device, but that time
is spread all over the place, with bigger offenders:

    ixgbe_shutdown
      __ixgbe_shutdown
        ixgbe_close_suspend
          ixgbe_down
            ixgbe_init_hw_generic
              ixgbe_reset_hw_X540
                msleep(100);                        0.104483472
                ixgbe_get_san_mac_addr_generic      0.048414851
                ixgbe_get_wwn_prefix_generic        0.048409893
              ixgbe_start_hw_X540
                ixgbe_start_hw_generic
                  ixgbe_clear_hw_cntrs_generic      0.048581502
                  ixgbe_setup_fc_generic            0.024225800

    All the ixgbe_*generic functions end-up calling:
    ixgbe_read_eerd_X540()
      ixgbe_acquire_swfw_sync_X540
        usleep_range(5000, 6000);
      ixgbe_release_swfw_sync_X540
        usleep_range(5000, 6000);

While these are short sleeps, they end-up calling them over 24 times!
24 * 0.0055s = 0.132s. Adding-up to 0.528s for four devices. Also we have
four msleep(100). Totaling to:  0.928s

While we should keep optimizing the individual device drivers, in some
cases this is simply a hardware property that forces a specific delay, and
we must wait.

So, the solution for this problem is to shutdown devices in parallel.
However, we must shutdown children before shutting down parents, so parent
device must wait for its children to finish.

With this patch, on the same machine devices_shutdown() takes 1.142s, and
without mlx4 one second delay only 0.38s

This feature can be optionally disabled via kernel parameter:
device_shutdown_serial. When booted with this parameter, device_shutdown()
will shutdown devices one by one.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/core.c | 70 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 5d8623b38d4d..b0f9e70daf52 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -22,6 +22,7 @@
 #include <linux/genhd.h>
 #include <linux/mutex.h>
 #include <linux/pm_runtime.h>
+#include <linux/kthread.h>
 #include <linux/netdevice.h>
 #include <linux/sched/signal.h>
 #include <linux/sysfs.h>
@@ -2887,19 +2888,39 @@ static void device_shutdown_one(struct device *dev)
 /*
  * Passed as an argument to device_shutdown_child_task().
  * child_next_index	the next available child index.
+ * tasks_running	number of tasks still running. Each tasks decrements it
+ *			when job is finished and the last task signals that the
+ *			job is complete.
+ * complete		Used to signal job competition.
  * parent		Parent device.
  */
 struct device_shutdown_task_data {
 	atomic_t		child_next_index;
+	atomic_t		tasks_running;
+	struct completion	complete;
 	struct device		*parent;
 };
 
 static int device_shutdown_child_task(void *data);
+static bool device_shutdown_serial;
+
+/*
+ * These globals are used by tasks that are started for root devices.
+ * device_root_tasks_finished	Number of root devices finished shutting down.
+ * device_root_tasks_started	Total number of root devices tasks started.
+ * device_root_tasks_done	The completion signal to the main thread.
+ */
+static atomic_t			device_root_tasks_finished;
+static atomic_t			device_root_tasks_started;
+static struct completion	device_root_tasks_done;
 
 /*
  * Shutdown device tree with root started in dev. If dev has no children
  * simply shutdown only this device. If dev has children recursively shutdown
  * children first, and only then the parent.
+ * For performance reasons children are shutdown in parallel using kernel
+ * threads. because we lock dev its children cannot be removed while this
+ * functions is running.
  */
 static void device_shutdown_tree(struct device *dev)
 {
@@ -2913,11 +2934,20 @@ static void device_shutdown_tree(struct device *dev)
 		int i;
 
 		atomic_set(&tdata.child_next_index, 0);
+		atomic_set(&tdata.tasks_running, children_count);
+		init_completion(&tdata.complete);
 		tdata.parent = dev;
 
 		for (i = 0; i < children_count; i++) {
-			device_shutdown_child_task(&tdata);
+			if (device_shutdown_serial) {
+				device_shutdown_child_task(&tdata);
+			} else {
+				kthread_run(device_shutdown_child_task,
+					    &tdata, "device_shutdown.%s",
+					    dev_name(dev));
+			}
 		}
+		wait_for_completion(&tdata.complete);
 	}
 	device_shutdown_one(dev);
 	device_unlock(dev);
@@ -2937,6 +2967,10 @@ static int device_shutdown_child_task(void *data)
 	/* ref. counter is going to be decremented in device_shutdown_one() */
 	get_device(dev);
 	device_shutdown_tree(dev);
+
+	/* If we are the last to exit, signal the completion */
+	if (atomic_dec_return(&tdata->tasks_running) == 0)
+		complete(&tdata->complete);
 	return 0;
 }
 
@@ -2947,9 +2981,14 @@ static int device_shutdown_child_task(void *data)
 static int device_shutdown_root_task(void *data)
 {
 	struct device *dev = (struct device *)data;
+	int root_devices;
 
 	device_shutdown_tree(dev);
 
+	/* If we are the last to exit, signal the completion */
+	root_devices = atomic_inc_return(&device_root_tasks_finished);
+	if (root_devices == atomic_read(&device_root_tasks_started))
+		complete(&device_root_tasks_done);
 	return 0;
 }
 
@@ -2958,10 +2997,17 @@ static int device_shutdown_root_task(void *data)
  */
 void device_shutdown(void)
 {
+	int root_devices = 0;
 	struct device *dev;
 
+	atomic_set(&device_root_tasks_finished, 0);
+	atomic_set(&device_root_tasks_started, 0);
+	init_completion(&device_root_tasks_done);
+
 	/* Shutdown the root devices. The children are going to be
 	 * shutdown first in device_shutdown_tree().
+	 * We shutdown root devices in parallel by starting thread
+	 * for each root device.
 	 */
 	spin_lock(&devices_kset->list_lock);
 	while (!list_empty(&devices_kset->list)) {
@@ -2992,13 +3038,33 @@ void device_shutdown(void)
 			 */
 			spin_unlock(&devices_kset->list_lock);
 
-			device_shutdown_root_task(dev);
+			root_devices++;
+			if (device_shutdown_serial) {
+				device_shutdown_root_task(dev);
+			} else {
+				kthread_run(device_shutdown_root_task,
+					    dev, "device_root_shutdown.%s",
+					    dev_name(dev));
+			}
 			spin_lock(&devices_kset->list_lock);
 		}
 	}
 	spin_unlock(&devices_kset->list_lock);
+
+	/* Set number of root tasks started, and waits for completion */
+	atomic_set(&device_root_tasks_started, root_devices);
+	if (root_devices != atomic_read(&device_root_tasks_finished))
+		wait_for_completion(&device_root_tasks_done);
+}
+
+static int __init _device_shutdown_serial(char *arg)
+{
+	device_shutdown_serial = true;
+	return 0;
 }
 
+early_param("device_shutdown_serial", _device_shutdown_serial);
+
 /*
  * Device logging functions
  */
-- 
2.18.0

^ permalink raw reply related

* [PATCH v6 2/3] drivers core: prepare device_shutdown for multi-threading
From: Pavel Tatashin @ 2018-06-29 18:25 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh,
	alexander.duyck, tobin, andy.shevchenko
In-Reply-To: <20180629182541.6735-1-pasha.tatashin@oracle.com>

Do all the necessary refactoring to prepare device_shutdown() logic to
be multi-threaded.

Which includes:
1. Change the direction of traversing the list instead of going backward,
we now go forward.
2. Children are shutdown recursively for each root device from bottom-up.
3. Functions that can be multi-threaded have _task() in their name.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/core.c | 178 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 149 insertions(+), 29 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 07380aa0683b..5d8623b38d4d 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2134,6 +2134,59 @@ const char *device_get_devnode(struct device *dev,
 	return *tmp = s;
 }
 
+/*
+ * device_children_count - device children count
+ * @parent: parent struct device.
+ *
+ * Returns number of children for this device or 0 if none.
+ */
+static int device_children_count(struct device *parent)
+{
+	struct klist_iter i;
+	int children = 0;
+
+	if (!parent->p)
+		return 0;
+
+	klist_iter_init(&parent->p->klist_children, &i);
+	while (next_device(&i))
+		children++;
+	klist_iter_exit(&i);
+
+	return children;
+}
+
+/*
+ * device_get_child_by_index - Return child using the provided index.
+ * @parent: parent struct device.
+ * @index:  Index of the child, where 0 is the first child in the children list,
+ * and so on.
+ *
+ * Returns child or NULL if child with this index is not present.
+ */
+static struct device *
+device_get_child_by_index(struct device *parent, int index)
+{
+	struct klist_iter i;
+	struct device *dev = NULL, *d;
+	int child_index = 0;
+
+	if (!parent->p || index < 0)
+		return NULL;
+
+	klist_iter_init(&parent->p->klist_children, &i);
+	while ((d = next_device(&i))) {
+		if (child_index == index) {
+			dev = d;
+			break;
+		}
+		child_index++;
+	}
+	klist_iter_exit(&i);
+
+	return dev;
+}
+
 /**
  * device_for_each_child - device child iterator.
  * @parent: parent struct device.
@@ -2831,50 +2884,117 @@ static void device_shutdown_one(struct device *dev)
 	put_device(dev);
 }
 
+/*
+ * Passed as an argument to device_shutdown_child_task().
+ * child_next_index	the next available child index.
+ * parent		Parent device.
+ */
+struct device_shutdown_task_data {
+	atomic_t		child_next_index;
+	struct device		*parent;
+};
+
+static int device_shutdown_child_task(void *data);
+
+/*
+ * Shutdown device tree with root started in dev. If dev has no children
+ * simply shutdown only this device. If dev has children recursively shutdown
+ * children first, and only then the parent.
+ */
+static void device_shutdown_tree(struct device *dev)
+{
+	int children_count;
+
+	device_lock(dev);
+	children_count = device_children_count(dev);
+
+	if (children_count) {
+		struct device_shutdown_task_data tdata;
+		int i;
+
+		atomic_set(&tdata.child_next_index, 0);
+		tdata.parent = dev;
+
+		for (i = 0; i < children_count; i++) {
+			device_shutdown_child_task(&tdata);
+		}
+	}
+	device_shutdown_one(dev);
+	device_unlock(dev);
+}
+
+/*
+ * Only devices with parent are going through this function. The parent is
+ * locked and waits for all of its children to finish shutting down before
+ * calling shutdown function for itself.
+ */
+static int device_shutdown_child_task(void *data)
+{
+	struct device_shutdown_task_data *tdata = data;
+	int cidx = atomic_inc_return(&tdata->child_next_index) - 1;
+	struct device *dev = device_get_child_by_index(tdata->parent, cidx);
+
+	/* ref. counter is going to be decremented in device_shutdown_one() */
+	get_device(dev);
+	device_shutdown_tree(dev);
+	return 0;
+}
+
+/*
+ * On shutdown each root device (the one that does not have a parent) goes
+ * through this function.
+ */
+static int device_shutdown_root_task(void *data)
+{
+	struct device *dev = (struct device *)data;
+
+	device_shutdown_tree(dev);
+
+	return 0;
+}
+
 /**
  * device_shutdown - call ->shutdown() on each device to shutdown.
  */
 void device_shutdown(void)
 {
-	struct device *dev, *parent;
+	struct device *dev;
 
-	spin_lock(&devices_kset->list_lock);
-	/*
-	 * Walk the devices list backward, shutting down each in turn.
-	 * Beware that device unplug events may also start pulling
-	 * devices offline, even as the system is shutting down.
+	/* Shutdown the root devices. The children are going to be
+	 * shutdown first in device_shutdown_tree().
 	 */
+	spin_lock(&devices_kset->list_lock);
 	while (!list_empty(&devices_kset->list)) {
-		dev = list_entry(devices_kset->list.prev, struct device,
-				kobj.entry);
+		dev = list_entry(devices_kset->list.next, struct device,
+				 kobj.entry);
 
-		/*
-		 * hold reference count of device's parent to
-		 * prevent it from being freed because parent's
-		 * lock is to be held
-		 */
-		parent = get_device(dev->parent);
-		get_device(dev);
 		/*
 		 * Make sure the device is off the kset list, in the
 		 * event that dev->*->shutdown() doesn't remove it.
 		 */
 		list_del_init(&dev->kobj.entry);
-		spin_unlock(&devices_kset->list_lock);
 
-		/* hold lock to avoid race with probe/release */
-		if (parent)
-			device_lock(parent);
-		device_lock(dev);
-
-		device_shutdown_one(dev);
-		device_unlock(dev);
-		if (parent)
-			device_unlock(parent);
-
-		put_device(parent);
-
-		spin_lock(&devices_kset->list_lock);
+		/* Here we start tasks for root devices only */
+		if (!dev->parent) {
+			/* Prevents devices from being freed. The counter is
+			 * going to be decremented in device_shutdown_one() once
+			 * this root device is shutdown.
+			 */
+			get_device(dev);
+
+			/* We unlock list for performance reasons,
+			 * dev->*->shutdown(), may try to take this lock to
+			 * remove us from kset list. To avoid unlocking this
+			 * list we could replace spin lock in:
+			 * dev->kobj.kset->list_lock with a dummy one once
+			 * device is locked in device_shutdown_root_task() and
+			 * in device_shutdown_child_task().
+			 */
+			spin_unlock(&devices_kset->list_lock);
+
+			device_shutdown_root_task(dev);
+			spin_lock(&devices_kset->list_lock);
+		}
 	}
 	spin_unlock(&devices_kset->list_lock);
 }
-- 
2.18.0

^ permalink raw reply related

* [PATCH v6 1/3] drivers core: refactor device_shutdown
From: Pavel Tatashin @ 2018-06-29 18:25 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh,
	alexander.duyck, tobin, andy.shevchenko
In-Reply-To: <20180629182541.6735-1-pasha.tatashin@oracle.com>

device_shutdown() traverses through the list of devices, and calls
dev->{bug/driver}->shutdown() for each entry in the list.

Refactor the function by keeping device_shutdown() to do the logic of
traversing the list of devices, and device_shutdown_one() to perform the
actual shutdown operation on one device.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
---
 drivers/base/core.c | 50 +++++++++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index df3e1a44707a..07380aa0683b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2802,6 +2802,35 @@ int device_move(struct device *dev, struct device *new_parent,
 }
 EXPORT_SYMBOL_GPL(device_move);
 
+/*
+ * device_shutdown_one - call ->shutdown() for the device passed as
+ * argument.
+ */
+static void device_shutdown_one(struct device *dev)
+{
+	/* Don't allow any more runtime suspends */
+	pm_runtime_get_noresume(dev);
+	pm_runtime_barrier(dev);
+
+	if (dev->class && dev->class->shutdown_pre) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown_pre\n");
+		dev->class->shutdown_pre(dev);
+	}
+	if (dev->bus && dev->bus->shutdown) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown\n");
+		dev->bus->shutdown(dev);
+	} else if (dev->driver && dev->driver->shutdown) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown\n");
+		dev->driver->shutdown(dev);
+	}
+
+	/* decrement the reference counter */
+	put_device(dev);
+}
+
 /**
  * device_shutdown - call ->shutdown() on each device to shutdown.
  */
@@ -2838,30 +2867,11 @@ void device_shutdown(void)
 			device_lock(parent);
 		device_lock(dev);
 
-		/* Don't allow any more runtime suspends */
-		pm_runtime_get_noresume(dev);
-		pm_runtime_barrier(dev);
-
-		if (dev->class && dev->class->shutdown_pre) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown_pre\n");
-			dev->class->shutdown_pre(dev);
-		}
-		if (dev->bus && dev->bus->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->bus->shutdown(dev);
-		} else if (dev->driver && dev->driver->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->driver->shutdown(dev);
-		}
-
+		device_shutdown_one(dev);
 		device_unlock(dev);
 		if (parent)
 			device_unlock(parent);
 
-		put_device(dev);
 		put_device(parent);
 
 		spin_lock(&devices_kset->list_lock);
-- 
2.18.0

^ permalink raw reply related

* Re: [PATCH bpf 0/3] Three BPF fixes
From: Alexei Starovoitov @ 2018-06-29 17:54 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: ast, netdev
In-Reply-To: <20180628213459.28631-1-daniel@iogearbox.net>

On Thu, Jun 28, 2018 at 11:34:56PM +0200, Daniel Borkmann wrote:
> This set contains three fixes that are mostly JIT and set_memory_*()
> related. The third in the series in particular fixes the syzkaller
> bugs that were still pending; aside from local reproduction & testing,
> also 'syz test' wasn't able to trigger them anymore. I've tested this
> series on x86_64, arm64 and s390x, and kbuild bot wasn't yelling either
> for the rest. For details, please see patches as usual, thanks!

Applied, Thanks

^ permalink raw reply

* Re: [PATCH v1 net-next 14/14] net/sched: Make etf report drops on error_queue
From: Jesus Sanchez-Palencia @ 2018-06-29 17:48 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Thomas Gleixner, jan.altenberg,
	Vinicius Gomes, kurt.kanzenbach, Henrik Austad, Richard Cochran,
	ilias.apalodimas, ivan.khoronzhuk, Miroslav Lichvar,
	Willem de Bruijn, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko
In-Reply-To: <CAF=yD-+R867otr0n9q-pHvvUhv6o3Nj42a3Ts54J7gwPgr0G9Q@mail.gmail.com>

Hi Willem,


On 06/28/2018 07:27 AM, Willem de Bruijn wrote:

(...)

> 
>>  struct sock_txtime {
>>         clockid_t       clockid;        /* reference clockid */
>> -       u16             flags;          /* bit 0: txtime in deadline_mode */
>> +       u16             flags;          /* bit 0: txtime in deadline_mode
>> +                                        * bit 1: report drops on sk err queue
>> +                                        */
>>  };
> 
> If this is shared with userspace, should be defined in an uapi header.
> Same on the flag bits below. Self documenting code is preferable over
> comments.


Fixed for v2.


> 
>>  /*
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 73f4404e49e4..e681a45cfe7e 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -473,6 +473,7 @@ struct sock {
>>         u16                     sk_clockid;
>>         u16                     sk_txtime_flags;
>>  #define SK_TXTIME_DEADLINE_MASK        BIT(0)
>> +#define SK_TXTIME_RECV_ERR_MASK        BIT(1)
> 
> Integer bitfields are (arguably) more readable. There is no requirement
> that the user interface be the same as the in-kernel implementation. Indeed
> if you can save bits in struct sock, that is preferable (but not so for the ABI,
> which cannot easily be extended).


Sure, changed for v2.

(...)


>> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
>> index 5514a8aa3bd5..166f4b72875b 100644
>> --- a/net/sched/sch_etf.c
>> +++ b/net/sched/sch_etf.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/kernel.h>
>>  #include <linux/string.h>
>>  #include <linux/errno.h>
>> +#include <linux/errqueue.h>
>>  #include <linux/rbtree.h>
>>  #include <linux/skbuff.h>
>>  #include <linux/posix-timers.h>
>> @@ -124,6 +125,35 @@ static void reset_watchdog(struct Qdisc *sch)
>>         qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
>>  }
>>
>> +static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
>> +{
>> +       struct sock_exterr_skb *serr;
>> +       ktime_t txtime = skb->tstamp;
>> +
>> +       if (!skb->sk || !(skb->sk->sk_txtime_flags & SK_TXTIME_RECV_ERR_MASK))
>> +               return;
>> +
>> +       skb = skb_clone_sk(skb);
>> +       if (!skb)
>> +               return;
>> +
>> +       sock_hold(skb->sk);
> 
> Why take an extra reference? The skb holds a ref on the sk.


Yes, the cloned skb holds a ref on the socket, but the documentation of
skb_clone_sk() makes this explicit suggestion:

(...)
 * When passing buffers allocated with this function to sock_queue_err_skb
 * it is necessary to wrap the call with sock_hold/sock_put in order to
 * prevent the socket from being released prior to being enqueued on
 * the sk_error_queue.
 */

which I believe is here just so we are protected against a possible race after
skb_orphan() is called from sock_queue_err_skb(). Please let me know if I'm
misreading anything.

And for v2 I will move the sock_hold() call to immediately before the
sock_queue_err_skb() to avoid any future confusion.



> 
>> +
>> +       serr = SKB_EXT_ERR(skb);
>> +       serr->ee.ee_errno = err;
>> +       serr->ee.ee_origin = SO_EE_ORIGIN_LOCAL;
> 
> I suggest adding a new SO_EE_ORIGIN_TXTIME as opposed to overloading
> the existing
> local origin. Then the EE_CODE can start at 1, as ee_code can be
> demultiplexed by origin.


OK, it looks better indeed. Fixed for v2.


> 
>> +       serr->ee.ee_type = 0;
>> +       serr->ee.ee_code = code;
>> +       serr->ee.ee_pad = 0;
>> +       serr->ee.ee_data = (txtime >> 32); /* high part of tstamp */
>> +       serr->ee.ee_info = txtime; /* low part of tstamp */
>> +
>> +       if (sock_queue_err_skb(skb->sk, skb))
>> +               kfree_skb(skb);
>> +
>> +       sock_put(skb->sk);
>> +}


Thanks,
Jesus

^ permalink raw reply

* [PATCH net 5/5] s390/qeth: consistently re-enable device features
From: Julian Wiedmann @ 2018-06-29 17:45 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Stefan Raspl, Ursula Braun, Julian Wiedmann
In-Reply-To: <20180629174554.53718-1-jwi@linux.ibm.com>

commit e830baa9c3f0 ("qeth: restore device features after recovery") and
commit ce3443564145 ("s390/qeth: rely on kernel for feature recovery")
made sure that the HW functions for device features get re-programmed
after recovery.

But we missed that the same handling is also required when a card is
first set offline (destroying all HW context), and then online again.
Fix this by moving the re-enable action out of the recovery-only path.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
---
 drivers/s390/net/qeth_core.h      |  2 +-
 drivers/s390/net/qeth_core_main.c | 23 +++++++++++------------
 drivers/s390/net/qeth_l2_main.c   |  5 ++---
 drivers/s390/net/qeth_l3_main.c   |  3 ++-
 4 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 940fd7b558d3..a246a618f9a4 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -1040,7 +1040,7 @@ struct qeth_cmd_buffer *qeth_get_setassparms_cmd(struct qeth_card *,
 						 __u16, __u16,
 						 enum qeth_prot_versions);
 int qeth_set_features(struct net_device *, netdev_features_t);
-void qeth_recover_features(struct net_device *dev);
+void qeth_enable_hw_features(struct net_device *dev);
 netdev_features_t qeth_fix_features(struct net_device *, netdev_features_t);
 netdev_features_t qeth_features_check(struct sk_buff *skb,
 				      struct net_device *dev,
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index d20a69a3bc40..d01ac29fd986 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -6469,28 +6469,27 @@ static int qeth_set_ipa_rx_csum(struct qeth_card *card, bool on)
 #define QETH_HW_FEATURES (NETIF_F_RXCSUM | NETIF_F_IP_CSUM | NETIF_F_TSO | \
 			  NETIF_F_IPV6_CSUM)
 /**
- * qeth_recover_features() - Restore device features after recovery
- * @dev:	the recovering net_device
- *
- * Caller must hold rtnl lock.
+ * qeth_enable_hw_features() - (Re-)Enable HW functions for device features
+ * @dev:	a net_device
  */
-void qeth_recover_features(struct net_device *dev)
+void qeth_enable_hw_features(struct net_device *dev)
 {
-	netdev_features_t features = dev->features;
 	struct qeth_card *card = dev->ml_priv;
+	netdev_features_t features;
 
+	rtnl_lock();
+	features = dev->features;
 	/* force-off any feature that needs an IPA sequence.
 	 * netdev_update_features() will restart them.
 	 */
 	dev->features &= ~QETH_HW_FEATURES;
 	netdev_update_features(dev);
-
-	if (features == dev->features)
-		return;
-	dev_warn(&card->gdev->dev,
-		 "Device recovery failed to restore all offload features\n");
+	if (features != dev->features)
+		dev_warn(&card->gdev->dev,
+			 "Device recovery failed to restore all offload features\n");
+	rtnl_unlock();
 }
-EXPORT_SYMBOL_GPL(qeth_recover_features);
+EXPORT_SYMBOL_GPL(qeth_enable_hw_features);
 
 int qeth_set_features(struct net_device *dev, netdev_features_t features)
 {
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 5464515b71f1..2487f0aeb165 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -1119,6 +1119,8 @@ static int __qeth_l2_set_online(struct ccwgroup_device *gdev, int recovery_mode)
 		netif_carrier_off(card->dev);
 
 	qeth_set_allowed_threads(card, 0xffffffff, 0);
+
+	qeth_enable_hw_features(card->dev);
 	if (recover_flag == CARD_STATE_RECOVER) {
 		if (recovery_mode &&
 		    card->info.type != QETH_CARD_TYPE_OSN) {
@@ -1130,9 +1132,6 @@ static int __qeth_l2_set_online(struct ccwgroup_device *gdev, int recovery_mode)
 		}
 		/* this also sets saved unicast addresses */
 		qeth_l2_set_rx_mode(card->dev);
-		rtnl_lock();
-		qeth_recover_features(card->dev);
-		rtnl_unlock();
 	}
 	/* let user_space know that device is online */
 	kobject_uevent(&gdev->dev.kobj, KOBJ_CHANGE);
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index e7fa479adf47..5905dc63e256 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -2662,6 +2662,8 @@ static int __qeth_l3_set_online(struct ccwgroup_device *gdev, int recovery_mode)
 		netif_carrier_on(card->dev);
 	else
 		netif_carrier_off(card->dev);
+
+	qeth_enable_hw_features(card->dev);
 	if (recover_flag == CARD_STATE_RECOVER) {
 		rtnl_lock();
 		if (recovery_mode)
@@ -2669,7 +2671,6 @@ static int __qeth_l3_set_online(struct ccwgroup_device *gdev, int recovery_mode)
 		else
 			dev_open(card->dev);
 		qeth_l3_set_rx_mode(card->dev);
-		qeth_recover_features(card->dev);
 		rtnl_unlock();
 	}
 	qeth_trace_features(card);
-- 
2.16.4

^ permalink raw reply related

* [PATCH net 3/5] s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]
From: Julian Wiedmann @ 2018-06-29 17:45 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Stefan Raspl, Ursula Braun, Vasily Gorbik, Julian Wiedmann
In-Reply-To: <20180629174554.53718-1-jwi@linux.ibm.com>

From: Vasily Gorbik <gor@linux.ibm.com>

*ether_addr*_64bits functions have been introduced to optimize
performance critical paths, which access 6-byte ethernet address as u64
value to get "nice" assembly. A harmless hack works nicely on ethernet
addresses shoved into a structure or a larger buffer, until busted by
Kasan on smth like plain (u8 *)[6].

qeth_l2_set_mac_address calls qeth_l2_remove_mac passing
u8 old_addr[ETH_ALEN] as an argument.

Adding/removing macs for an ethernet adapter is not that performance
critical. Moreover is_multicast_ether_addr_64bits itself on s390 is not
faster than is_multicast_ether_addr:

is_multicast_ether_addr(%r2) -> %r2
llc	%r2,0(%r2)
risbg	%r2,%r2,63,191,0

is_multicast_ether_addr_64bits(%r2) -> %r2
llgc	%r2,0(%r2)
risbg	%r2,%r2,63,191,0

So, let's just use is_multicast_ether_addr instead of
is_multicast_ether_addr_64bits.

Fixes: bcacfcbc82b4 ("s390/qeth: fix MAC address update sequence")
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
---
 drivers/s390/net/qeth_l2_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 7daf125dae76..5464515b71f1 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -140,7 +140,7 @@ static int qeth_l2_send_setmac(struct qeth_card *card, __u8 *mac)
 
 static int qeth_l2_write_mac(struct qeth_card *card, u8 *mac)
 {
-	enum qeth_ipa_cmds cmd = is_multicast_ether_addr_64bits(mac) ?
+	enum qeth_ipa_cmds cmd = is_multicast_ether_addr(mac) ?
 					IPA_CMD_SETGMAC : IPA_CMD_SETVMAC;
 	int rc;
 
@@ -157,7 +157,7 @@ static int qeth_l2_write_mac(struct qeth_card *card, u8 *mac)
 
 static int qeth_l2_remove_mac(struct qeth_card *card, u8 *mac)
 {
-	enum qeth_ipa_cmds cmd = is_multicast_ether_addr_64bits(mac) ?
+	enum qeth_ipa_cmds cmd = is_multicast_ether_addr(mac) ?
 					IPA_CMD_DELGMAC : IPA_CMD_DELVMAC;
 	int rc;
 
-- 
2.16.4

^ permalink raw reply related

* [PATCH net 4/5] s390/qeth: don't clobber buffer on async TX completion
From: Julian Wiedmann @ 2018-06-29 17:45 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Stefan Raspl, Ursula Braun, Julian Wiedmann
In-Reply-To: <20180629174554.53718-1-jwi@linux.ibm.com>

If qeth_qdio_output_handler() detects that a transmit requires async
completion, it replaces the pending buffer's metadata object
(qeth_qdio_out_buffer) so that this queue buffer can be re-used while
the data is pending completion.

Later when the CQ indicates async completion of such a metadata object,
qeth_qdio_cq_handler() tries to free any data associated with this
object (since HW has now completed the transfer). By calling
qeth_clear_output_buffer(), it erronously operates on the queue buffer
that _previously_ belonged to this transfer ... but which has been
potentially re-used several times by now.
This results in double-free's of the buffer's data, and failing
transmits as the buffer descriptor is scrubbed in mid-air.

The correct way of handling this situation is to
1. scrub the queue buffer when it is prepared for re-use, and
2. later obtain the data addresses from the async-completion notifier
   (ie. the AOB), instead of the queue buffer.

All this only affects qeth devices used for af_iucv HiperTransport.

Fixes: 0da9581ddb0f ("qeth: exploit asynchronous delivery of storage blocks")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
---
 drivers/s390/net/qeth_core.h      | 11 +++++++++++
 drivers/s390/net/qeth_core_main.c | 22 ++++++++++++++++------
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 2a5fec55bf60..940fd7b558d3 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -829,6 +829,17 @@ struct qeth_trap_id {
 /*some helper functions*/
 #define QETH_CARD_IFNAME(card) (((card)->dev)? (card)->dev->name : "")
 
+static inline void qeth_scrub_qdio_buffer(struct qdio_buffer *buf,
+					  unsigned int elements)
+{
+	unsigned int i;
+
+	for (i = 0; i < elements; i++)
+		memset(&buf->element[i], 0, sizeof(struct qdio_buffer_element));
+	buf->element[14].sflags = 0;
+	buf->element[15].sflags = 0;
+}
+
 /**
  * qeth_get_elements_for_range() -	find number of SBALEs to cover range.
  * @start:				Start of the address range.
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 9d9182ed8ac4..d20a69a3bc40 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -73,9 +73,6 @@ static void qeth_notify_skbs(struct qeth_qdio_out_q *queue,
 		struct qeth_qdio_out_buffer *buf,
 		enum iucv_tx_notify notification);
 static void qeth_release_skbs(struct qeth_qdio_out_buffer *buf);
-static void qeth_clear_output_buffer(struct qeth_qdio_out_q *queue,
-		struct qeth_qdio_out_buffer *buf,
-		enum qeth_qdio_buffer_states newbufstate);
 static int qeth_init_qdio_out_buf(struct qeth_qdio_out_q *, int);
 
 struct workqueue_struct *qeth_wq;
@@ -489,6 +486,7 @@ static void qeth_qdio_handle_aob(struct qeth_card *card,
 	struct qaob *aob;
 	struct qeth_qdio_out_buffer *buffer;
 	enum iucv_tx_notify notification;
+	unsigned int i;
 
 	aob = (struct qaob *) phys_to_virt(phys_aob_addr);
 	QETH_CARD_TEXT(card, 5, "haob");
@@ -513,10 +511,18 @@ static void qeth_qdio_handle_aob(struct qeth_card *card,
 	qeth_notify_skbs(buffer->q, buffer, notification);
 
 	buffer->aob = NULL;
-	qeth_clear_output_buffer(buffer->q, buffer,
-				 QETH_QDIO_BUF_HANDLED_DELAYED);
+	/* Free dangling allocations. The attached skbs are handled by
+	 * qeth_cleanup_handled_pending().
+	 */
+	for (i = 0;
+	     i < aob->sb_count && i < QETH_MAX_BUFFER_ELEMENTS(card);
+	     i++) {
+		if (aob->sba[i] && buffer->is_header[i])
+			kmem_cache_free(qeth_core_header_cache,
+					(void *) aob->sba[i]);
+	}
+	atomic_set(&buffer->state, QETH_QDIO_BUF_HANDLED_DELAYED);
 
-	/* from here on: do not touch buffer anymore */
 	qdio_release_aob(aob);
 }
 
@@ -3759,6 +3765,10 @@ static void qeth_qdio_output_handler(struct ccw_device *ccwdev,
 			QETH_CARD_TEXT(queue->card, 5, "aob");
 			QETH_CARD_TEXT_(queue->card, 5, "%lx",
 					virt_to_phys(buffer->aob));
+
+			/* prepare the queue slot for re-use: */
+			qeth_scrub_qdio_buffer(buffer->buffer,
+					       QETH_MAX_BUFFER_ELEMENTS(card));
 			if (qeth_init_qdio_out_buf(queue, bidx)) {
 				QETH_CARD_TEXT(card, 2, "outofbuf");
 				qeth_schedule_recovery(card);
-- 
2.16.4

^ permalink raw reply related

* [PATCH net 0/5] s390/qeth: fixes 2018-06-29
From: Julian Wiedmann @ 2018-06-29 17:45 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Stefan Raspl, Ursula Braun, Julian Wiedmann

Hi Dave,

please apply a few qeth fixes for -net and your 4.17 stable queue.

Patches 1-3 fix several issues wrt to MAC address management that were
introduced during the 4.17 cycle.
Patch 4 tackles a long-standing issue with busy multi-connection workloads
on devices in af_iucv mode.
Patch 5 makes sure to re-enable all active HW offloads, after a card was
previously set offline and thus lost its HW context.

Thanks,
Julian

Julian Wiedmann (4):
  Revert "s390/qeth: use Read device to query hypervisor for MAC"
  s390/qeth: fix race when setting MAC address
  s390/qeth: don't clobber buffer on async TX completion
  s390/qeth: consistently re-enable device features

Vasily Gorbik (1):
  s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]

 drivers/s390/net/qeth_core.h      | 13 ++++++++++-
 drivers/s390/net/qeth_core_main.c | 47 +++++++++++++++++++++++----------------
 drivers/s390/net/qeth_l2_main.c   | 24 ++++++++++++--------
 drivers/s390/net/qeth_l3_main.c   |  3 ++-
 4 files changed, 57 insertions(+), 30 deletions(-)

-- 
2.16.4

^ permalink raw reply

* [PATCH net 1/5] Revert "s390/qeth: use Read device to query hypervisor for MAC"
From: Julian Wiedmann @ 2018-06-29 17:45 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Stefan Raspl, Ursula Braun, Julian Wiedmann
In-Reply-To: <20180629174554.53718-1-jwi@linux.ibm.com>

This reverts commit b7493e91c11a757cf0f8ab26989642ee4bb2c642.

On its own, querying RDEV for a MAC address works fine. But when upgrading
from a qeth that previously queried DDEV on a z/VM NIC (ie. any kernel with
commit ec61bd2fd2a2), the RDEV query now returns a _different_ MAC address
than the DDEV query.

If the NIC is configured with MACPROTECT, z/VM apparently requires us to
use the MAC that was initially returned (on DDEV) and registered. So after
upgrading to a kernel that uses RDEV, the SETVMAC registration cmd for the
new MAC address fails and we end up with a non-operabel interface.

To avoid regressions on upgrade, switch back to using DDEV for the MAC
address query. The downgrade path (first RDEV, later DDEV) is fine, in this
case both queries return the same MAC address.

Fixes: b7493e91c11a ("s390/qeth: use Read device to query hypervisor for MAC")
Reported-by: Michal Kubecek <mkubecek@suse.com>
Tested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
---
 drivers/s390/net/qeth_core_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 8e1474f1ffac..9d9182ed8ac4 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -4834,7 +4834,7 @@ int qeth_vm_request_mac(struct qeth_card *card)
 		goto out;
 	}

-	ccw_device_get_id(CARD_RDEV(card), &id);
+	ccw_device_get_id(CARD_DDEV(card), &id);
 	request->resp_buf_len = sizeof(*response);
 	request->resp_version = DIAG26C_VERSION2;
 	request->op_code = DIAG26C_GET_MAC;
-- 
2.16.4

^ permalink raw reply related

* [PATCH net 2/5] s390/qeth: fix race when setting MAC address
From: Julian Wiedmann @ 2018-06-29 17:45 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Stefan Raspl, Ursula Braun, Julian Wiedmann
In-Reply-To: <20180629174554.53718-1-jwi@linux.ibm.com>

When qeth_l2_set_mac_address() finds the card in a non-reachable state,
it merely copies the new MAC address into dev->dev_addr so that
__qeth_l2_set_online() can later register it with the HW.

But __qeth_l2_set_online() may very well be running concurrently, so we
can't trust the card state without appropriate locking:
If the online sequence is past the point where it registers
dev->dev_addr (but not yet in SOFTSETUP state), any address change needs
to be properly programmed into the HW. Otherwise the netdevice ends up
with a different MAC address than what's set in the HW, and inbound
traffic is not forwarded as expected.

This is most likely to occur for OSD in LPAR, where
commit 21b1702af12e ("s390/qeth: improve fallback to random MAC address")
now triggers eg. systemd to immediately change the MAC when the netdevice
is registered with a NET_ADDR_RANDOM address.

Fixes: bcacfcbc82b4 ("s390/qeth: fix MAC address update sequence")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
---
 drivers/s390/net/qeth_l2_main.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index a7cb37da6a21..7daf125dae76 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -501,27 +501,34 @@ static int qeth_l2_set_mac_address(struct net_device *dev, void *p)
 		return -ERESTARTSYS;
 	}

+	/* avoid racing against concurrent state change: */
+	if (!mutex_trylock(&card->conf_mutex))
+		return -EAGAIN;
+
 	if (!qeth_card_hw_is_reachable(card)) {
 		ether_addr_copy(dev->dev_addr, addr->sa_data);
-		return 0;
+		goto out_unlock;
 	}

 	/* don't register the same address twice */
 	if (ether_addr_equal_64bits(dev->dev_addr, addr->sa_data) &&
 	    (card->info.mac_bits & QETH_LAYER2_MAC_REGISTERED))
-		return 0;
+		goto out_unlock;

 	/* add the new address, switch over, drop the old */
 	rc = qeth_l2_send_setmac(card, addr->sa_data);
 	if (rc)
-		return rc;
+		goto out_unlock;
 	ether_addr_copy(old_addr, dev->dev_addr);
 	ether_addr_copy(dev->dev_addr, addr->sa_data);

 	if (card->info.mac_bits & QETH_LAYER2_MAC_REGISTERED)
 		qeth_l2_remove_mac(card, old_addr);
 	card->info.mac_bits |= QETH_LAYER2_MAC_REGISTERED;
-	return 0;
+
+out_unlock:
+	mutex_unlock(&card->conf_mutex);
+	return rc;
 }

 static void qeth_promisc_to_bridge(struct qeth_card *card)
-- 
2.16.4

^ permalink raw reply related

* Re: [PATCH RFC 2/2] net: phy: sfp: Add HWMON support for module sensors
From: Guenter Roeck @ 2018-06-29 17:34 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Florian Fainelli, Russell King, vadimp, linux-hwmon
In-Reply-To: <20180629074540.GC11285@lunn.ch>

On Fri, Jun 29, 2018 at 09:45:40AM +0200, Andrew Lunn wrote:
> > > +	case hwmon_power:
> > > +		/* External calibration of receive power requires
> > > +		 * floating point arithmetic. Doing that in the kernel
> > > +		 * is not easy, so just skip it. If the module does
> > > +		 * not require external calibration, we can however
> > > +		 * show receiver power, since FP is then not needed.
> > > +		 */
> > > +		if (sfp->id.ext.diagmon & SFP_DIAGMON_EXT_CAL &&
> > > +		    channel == 1)
> > > +			return 0;
> > 
> > It would be nice if it was possible to convert the floting point to
> > a fixed point calculation. Would that be possible ?
> 
> Maybe. I decided to leave it for later.
> 
> The kernel has some support for emulating floating point hardware, by
> doing floating point operations in software. I didn't find any
> examples of using that code outside of emulation, but i wondered if it
> would be possible to use it here. We don't need high performance here,
> when just reading a sensor once per second.
> 
> > > +/* Sensors values are stored as two bytes, MSB second */
> > > +static int sfp_hwmon_read_sensor(struct sfp *sfp, int reg, long *value)
> > > +{
> > > +	u8 val[2];
> > > +	int err;
> > > +
> > > +	err = sfp_read(sfp, true, reg, val, 2);
> > > +	if (err < 0)
> > > +		return err;
> > > +
> > > +	*value = val[0] << 8 | val[1];
> > > +
> > 
> > Any chance to use something like __be16 and be16_to_cpu() ?
> > You do that elsewhere - why not here ?
> 
> Yes. I want to look at this again. I don't like it either.
> 
> > > +	for (i = j = 0; sfp->hwmon_name[i]; i++) {
> > > +		if (isalnum(sfp->hwmon_name[i])) {
> > > +			if (i != j)
> > > +				sfp->hwmon_name[j] = sfp->hwmon_name[i];
> > > +			j++;
> > > +		}
> > > +	}
> > 
> > It might be better and simpler to replace invalid characters with '_'
> > instead of dropping them. Also note that '_' is a valid character.
> > Strictly speaking only "-* \t\n" are invalid.
> 
> I borrowed this code from the marvell10g driver. I don't know where it

... which wasn't reviewed by a hwmon maintainer, so I take no
responsibility (it does look pretty clean, though). Wonder if anyone
noticed that the hwmon interface is disabled if HWMON is built as module.

> borrowed it from. Is there a hwmon core function which we can pass an
> arbitrary name to and it returned a sanitised one? Maybe we should add
> one?
> 
Maybe, but I am not sure how to allocate the replacement string.
You are using devm_kstrdup() which is another devm_ function that you
should probably not use. How about declaring hwmon_name[] with a fixed
maximum length in sfp ? The memory savings from dynamic allocation (if
there are any) seem negligible.

> > > +	sfp->hwmon_name[j] = '\0';
> > > +
> > Is it possible that j == 0 ?
> 
> Hummm....
> 
> sfp->hwmon_name is derived from dev_name(sfp->dev), which comes from
> pdev->dev in the probe function. That comes from the device tree node
> name. I suppose it is possible to name the node $@#$@, but i suspect
> Rob would NACK it :-)
> 
> I can add a check for j==0 and return -EINVAL.
> 
I would prefer replacing invalid characters with '_', but I won't argue. 

> > > +	sfp->hwmon_dev = devm_hwmon_device_register_with_info(sfp->dev,
> > > +				sfp->hwmon_name, sfp, &sfp_hwmon_chip_info,
> > > +				NULL);
> > > +
> > > +	return PTR_ERR_OR_ZERO(sfp->hwmon_dev);
> > > +}
> > > +
> > > +static void sfp_hwmon_remove(struct sfp *sfp)
> > > +{
> > > +	devm_hwmon_device_unregister(sfp->hwmon_dev);
> > 
> > If registartion and removal are not tied to a device, it doesn't make sense
> > to use devm_ functions. Either use hwmon_device_register_with_info()
> > and hwmon_device_unregister(), or drop the remove function.
> 
> Yes. I can change it. We have a few different lifetimes involved
> here. You can consider the driver probe being for the SFP cage. The
> SFP module being inserted into the cage is a different life time, and
> the lifetime of the hwmon device.
> 
As Russell pointed out, devm_ functions are inappropriate in this case.

Thanks,
Guenter

^ permalink raw reply

* How should NFS, SUNRPC interact with Network and User Namespaces?
From: Sargun Dhillon @ 2018-06-29 17:32 UTC (permalink / raw)
  To: linux-fsdevel, netdev; +Cc: David Howells, Eric W . Biederman

Today, sunrpc lives in net/sunrpc. As far as I can tell, the primary
production consumer of it is NFS. The RPC clients have the concept of
being tied back to a network namespace. On the other hand, NFS has its
own superblock with its own user namespace.

When sunrpc convert kuids to UIDs to send over the wire, should it use
the user namespace of the network namespace that the RPC client is
associated with? This is required for auth_unix (UID based).
Alternatively, should the sunrpc RPC client use the user namespace
associated with the NFS superblock?

^ permalink raw reply

* Re: [PATCH RFC 1/2] hwmon: Add support for power min, lcrit, min_alarm and lcrit_alarm
From: Guenter Roeck @ 2018-06-29 17:12 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Florian Fainelli, Russell King, vadimp, linux-hwmon
In-Reply-To: <20180629072156.GA11285@lunn.ch>

On Fri, Jun 29, 2018 at 09:21:56AM +0200, Andrew Lunn wrote:
> On Thu, Jun 28, 2018 at 03:42:36PM -0700, Guenter Roeck wrote:
> > On Thu, Jun 28, 2018 at 10:41:14PM +0200, Andrew Lunn wrote:
> > > Some sensors support reporting minimal and lower critical power, as
> > > well as alarms when these thresholds are reached. Add support for
> > > these attributes to the hwmon core.
> > > 
> > > Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> > 
> > I am inclined to accept this patch immediately. I'll do that
> > in the next couple of days unless someone gives me a good reason
> > not to.
> 
> Hi Guenter
> 
> We need to watch out for merge dependencies. If you take it, you
> probably should also take the second patch into your tree as
> well. Otherwise, you need a stable branch DaveM can pull into net-next
> if he takes the second patch.
> 
Good point. I don't have anything queued for hwmon.c, so it should be ok
for the patch to go through networking. I'll Ack it when the time comes.

> I also have a patch to lm-sensors sensors, so it prints these
> values. I will create a github pull request.
> 
Excellent.

Thanks,
Guenter

^ permalink raw reply

* Re: [PATCH v4 09/18] ARM: davinci: da850-evm: remove dead MTD code
From: David Lechner @ 2018-06-29 17:09 UTC (permalink / raw)
  To: Bartosz Golaszewski, Sekhar Nori, Kevin Hilman, Russell King,
	Grygorii Strashko, David S . Miller, Srinivas Kandagatla,
	Lukas Wunner, Rob Herring, Florian Fainelli, Dan Carpenter,
	Ivan Khoronzhuk, Greg Kroah-Hartman, Andrew Lunn, Jonathan Corbet
  Cc: linux-arm-kernel, linux-kernel, linux-omap, netdev,
	Bartosz Golaszewski
In-Reply-To: <20180629094039.7543-10-brgl@bgdev.pl>

On 06/29/2018 04:40 AM, Bartosz Golaszewski wrote:
> From: Bartosz Golaszewski <bgolaszewski@baylibre.com>
> 
> We no longer need to register the MTD notifier to read the MAC address
> as it's now being done in the emac address.

I think you mean "it's now being done in the emac _driver_"

^ permalink raw reply

* Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Samudrala, Sridhar @ 2018-06-29 17:06 UTC (permalink / raw)
  To: Jiri Pirko, David Ahern
  Cc: Jamal Hadi Salim, Cong Wang, Linux Kernel Network Developers,
	David Miller, Jakub Kicinski, Simon Horman, john.hurley, mlxsw
In-Reply-To: <20180629130529.GE2195@nanopsycho.orion>

On 6/29/2018 6:05 AM, Jiri Pirko wrote:
> Fri, Jun 29, 2018 at 02:54:36PM CEST, dsahern@gmail.com wrote:
>> On 6/29/18 6:48 AM, Jiri Pirko wrote:
>>> Fri, Jun 29, 2018 at 02:12:21PM CEST, jhs@mojatatu.com wrote:
>>>> On 29/06/18 04:39 AM, Jiri Pirko wrote:
>>>>> Fri, Jun 29, 2018 at 12:25:53AM CEST, xiyou.wangcong@gmail.com wrote:
>>>>>> On Thu, Jun 28, 2018 at 6:10 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>> Add a template of type flower allowing to insert rules matching on last
>>>>>>> 2 bytes of destination mac address:
>>>>>>> # tc chaintemplate add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
>>>>>>>
>>>>>>> The template is now showed in the list:
>>>>>>> # tc chaintemplate show dev dummy0 ingress
>>>>>>> chaintemplate flower chain 0
>>>>>>>     dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
>>>>>>>     eth_type ipv4
>>>>>>>
>>>>>>> Add another template, this time for chain number 22:
>>>>>>> # tc chaintemplate add dev dummy0 ingress proto ip chain 22 flower dst_ip 0.0.0.0/16
>>>>>>> # tc chaintemplate show dev dummy0 ingress
>>>>>>> chaintemplate flower chain 0
>>>>>>>     dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
>>>>>>>     eth_type ipv4
>>>>>>> chaintemplate flower chain 22
>>>>>>>     eth_type ipv4
>>>>>>>     dst_ip 0.0.0.0/16
>>>>>> So, if I want to check the template of a chain, I have to use
>>>>>> 'tc chaintemplate... chain X'.
>>>>>>
>>>>>> If I want to check the filters in a chain, I have to use
>>>>>> 'tc filter show .... chain X'.
>>>>>>
>>>>>> If you introduce 'tc chain', it would just need one command:
>>>>>> `tc chain show ... X` which could list its template first and
>>>>>> followed by filters in this chain, something like:
>>>>>>
>>>>>> # tc chain show dev eth0 chain X
>>>>>> template: # could be none
>>>>>> ....
>>>>>> filter1
>>>>>> ...
>>>>>> filter2
>>>>>> ...
>>>>>>
>>>>>> Isn't it more elegant?
>>>>> Well, that is just another iproute2 command. It would use the same
>>>>> kernel uapi. Filters+templates. Sure, why not. Can be easily introduced.
>>>>> Let's do it in a follow-up iproute2 patch.
>>>>>
>>>> Half a dozen or 6 - take your pick, really.
>>>> I would call the template an attribute as opposed to a stand alone
>>>> object i.e A chain of filters may have a template. If you have to
>>>> introduce a new object then Sridhar's suggested syntax seems appealing.
>>> I think what I have makes sense. Maps nicely to what it really is inside
>>> kernel. What Cong proposes looks like nice extension of userspace app to
>>> do more things in one go. Will address that in followup iproute2 patch.
>> The resolution of the syntax affect the uapi changes proposed. You are
>> wanting to add new RTM commands which suggests new objects. If a
>> template is an attribute of an existing object then the netlink API
>> should indicate that.
> There is no existing "chain" object. So no, no uapi changes.

So instead of introducing 'chaintemplate' object in the kernel, can't we add 'chain'
object in the kernel that takes the 'template' as an attribute?

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
From: Jakub Kicinski @ 2018-06-29 17:01 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jiri Benc, davem, Roopa Prabhu, jiri, jhs, xiyou.wangcong,
	oss-drivers, netdev, Pieter Jansen van Vuuren
In-Reply-To: <c2b40603-6b42-7c87-49cf-f139e80dfd8a@iogearbox.net>

On Fri, 29 Jun 2018 09:04:15 +0200, Daniel Borkmann wrote:
> On 06/28/2018 06:54 PM, Jakub Kicinski wrote:
> > On Thu, 28 Jun 2018 09:42:06 +0200, Jiri Benc wrote:  
> >> On Wed, 27 Jun 2018 11:49:49 +0200, Daniel Borkmann wrote:  
> >>> Looks good to me, and yes in BPF case a mask like TUNNEL_OPTIONS_PRESENT is
> >>> right approach since this is opaque info and solely defined by the BPF prog
> >>> that is using the generic helper.    
> >>
> >> Wouldn't it make sense to introduce some safeguards here (in a backward
> >> compatible way, of course)? It's easy to mistakenly set data for a
> >> different tunnel type in a BPF program and then be surprised by the
> >> result. It might help users if such usage was detected by the kernel,
> >> one way or another.  
> > 
> > Well, that's how it works today ;)  
> 
> Well, it was designed like that on purpose, to be i) agnostic of the underlying
> device, ii) to not clutter BPF API with tens of different APIs effectively doing
> the same thing, and at the same time to avoid adding protocol specifics. E.g. at
> least core bits of bpf_skb_{set,get}_tunnel_key() will work whether I use vxlan
> or geneve underneath (we are actually using it this way) and I could use things
> like tun_id to encode custom meta data from BPF for either of them depending on flavor
> picked by orchestration system. For the tunnel options in bpf_skb_{set,get}_tunnel_opt()
> it's similar although here there needs to be awareness of the underlying dev depending
> on whether you encode data into e.g. gbp or tlvs, etc. However, downside right now I
> can see with a patch like below is that:
> 
> i) People might still just keep using 'TUNNEL_OPTIONS_PRESENT path' since available
> and backwards compatible with current/older kernels, ii) we cut bits away from
> size over time for each new tunnel proto added in future that would support tunnel
> options, iii) that extension is one-sided (at least below) and same would be needed
> in getter part, and iv) there needs to be a way for the case when folks add new
> tunnel options where we don't need to worry that we forget updating BPF_F_TUN_*
> each time otherwise this will easily slip through and again people will just rely
> on using TUNNEL_OPTIONS_PRESENT catchall. Given latter and in particular point i)
> I wouldn't think it's worth the pain, the APIs were added to BPF in v4.6 so this
> would buy them 2 more years wrt kernel compatibility with same functionality level.
> And point v), I just noticed the patch is actually buggy: size is ARG_CONST_SIZE and
> verifier will attempt to check the value whether the buffer passed in argument 2 is
> valid or not, so using flags here in upper bits would let verification fail, you'd
> really have to make a new helper just for this.

Ah, indeed.  I'd rather avoid a new helper, if we reuse an old one
people can always write a program like:

	err = helper(all_flags);
	if (err == -EINVAL)
		err = helper(fallback_flags);

With a new helper the program will not even load on old kernels :(

Could we add the flags new to bpf_skb_set_tunnel_key(), maybe?  It is a
bit ugly because only options care about the flags and in theory
bpf_skb_set_tunnel_key() doesn't have to be called before
bpf_skb_set_tunnel_opt() ...

Alternatively we could do this:

diff --git a/net/core/filter.c b/net/core/filter.c
index dade922678f6..9edcc2947be5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3582,7 +3582,9 @@ BPF_CALL_3(bpf_skb_set_tunnel_opt, struct sk_buff *, skb,
        if (unlikely(size > IP_TUNNEL_OPTS_MAX))
                return -ENOMEM;
 
-       ip_tunnel_info_opts_set(info, from, size, TUNNEL_OPTIONS_PRESENT);
+       ip_tunnel_info_opts_set(info, from, size, TUNNEL_VXLAN_OPT |
+                                                 TUNNEL_GENEVE_OPT |
+                                                 TUNNEL_ERSPAN_OPT);
 
        return 0;
 }

to force ourselves to solve the problem when a next protocol is added ;)

Or should we just leave things as they are?  As you explain the new
helper would really be of marginal use given the backward compatibility
requirement and availability of the old one...

^ permalink raw reply related

* Re: [PATCH vhost] vhost_net: Fix too many vring kick on busypoll
From: Michael S. Tsirkin @ 2018-06-29 16:38 UTC (permalink / raw)
  To: Toshiaki Makita; +Cc: netdev, kvm, virtualization
In-Reply-To: <1530259790-2414-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

On Fri, Jun 29, 2018 at 05:09:50PM +0900, Toshiaki Makita wrote:
> Under heavy load vhost busypoll may run without suppressing
> notification. For example tx zerocopy callback can push tx work while
> handle_tx() is running, then busyloop exits due to vhost_has_work()
> condition and enables notification but immediately reenters handle_tx()
> because the pushed work was tx. In this case handle_tx() tries to
> disable notification again, but when using event_idx it by design
> cannot. Then busyloop will run without suppressing notification.
> Another example is the case where handle_tx() tries to enable
> notification but avail idx is advanced so disables it again. This case
> also lead to the same situation with event_idx.
> 
> The problem is that once we enter this situation busyloop does not work
> under heavy load for considerable amount of time, because notification
> is likely to happen during busyloop and handle_tx() immediately enables
> notification after notification happens. Specifically busyloop detects
> notification by vhost_has_work() and then handle_tx() calls
> vhost_enable_notify().

I'd like to understand the problem a bit better.
Why does this happen?
Doesn't this only happen if ring is empty?

> Because the detected work was the tx work, it
> enters handle_tx(), and enters busyloop without suppression again.
> This is likely to be repeated, so with event_idx we are almost not able
> to suppress notification in this case.
> 
> To fix this, poll the work instead of enabling notification when
> busypoll is interrupted by something. IMHO signal_pending() and
> vhost_has_work() are kind of interruptions rather than signals to
> completely cancel the busypoll, so let's run busypoll after the
> necessary work is done. In order to avoid too long busyloop due to
> interruption, save the endtime in vq field and use it when reentering
> the work function.
> 
> I observed this problem makes tx performance poor but does not rx. I
> guess it is because rx notification from socket is not that heavy so
> did not impact the performance even when we failed to mask the
> notification. Anyway for consistency I fixed rx routine as well as tx.
> 
> Performance numbers:
> 
> - Bulk transfer from guest to external physical server.
>     [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server]
> - Set 10us busypoll.
> - Guest disables checksum and TSO because of host XDP.
> - Measured single flow Mbps by netperf, and kicks by perf kvm stat
>   (EPT_MISCONFIG event).
> 
>                             Before              After
>                           Mbps  kicks/s      Mbps  kicks/s
> UDP_STREAM 1472byte              247758                 27
>                 Send   3645.37            6958.10
>                 Recv   3588.56            6958.10
>               1byte                9865                 37
>                 Send      4.34               5.43
>                 Recv      4.17               5.26
> TCP_STREAM             8801.03    45794   9592.77     2884
> 
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Is this with busy poll enabled?
Are there CPU utilization #s?

> ---
>  drivers/vhost/net.c   | 94 +++++++++++++++++++++++++++++++++++----------------
>  drivers/vhost/vhost.c |  1 +
>  drivers/vhost/vhost.h |  1 +
>  3 files changed, 66 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index eeaf6739215f..0e85f628b965 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -391,13 +391,14 @@ static inline unsigned long busy_clock(void)
>  	return local_clock() >> 10;
>  }
>  
> -static bool vhost_can_busy_poll(struct vhost_dev *dev,
> -				unsigned long endtime)
> +static bool vhost_can_busy_poll(unsigned long endtime)
>  {
> -	return likely(!need_resched()) &&
> -	       likely(!time_after(busy_clock(), endtime)) &&
> -	       likely(!signal_pending(current)) &&
> -	       !vhost_has_work(dev);
> +	return likely(!need_resched() && !time_after(busy_clock(), endtime));
> +}
> +
> +static bool vhost_busy_poll_interrupted(struct vhost_dev *dev)
> +{
> +	return unlikely(signal_pending(current)) || vhost_has_work(dev);
>  }
>  
>  static void vhost_net_disable_vq(struct vhost_net *n,
> @@ -437,10 +438,21 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>  
>  	if (r == vq->num && vq->busyloop_timeout) {
>  		preempt_disable();
> -		endtime = busy_clock() + vq->busyloop_timeout;
> -		while (vhost_can_busy_poll(vq->dev, endtime) &&
> -		       vhost_vq_avail_empty(vq->dev, vq))
> +		if (vq->busyloop_endtime) {
> +			endtime = vq->busyloop_endtime;
> +			vq->busyloop_endtime = 0;
> +		} else {
> +			endtime = busy_clock() + vq->busyloop_timeout;
> +		}
> +		while (vhost_can_busy_poll(endtime)) {
> +			if (vhost_busy_poll_interrupted(vq->dev)) {
> +				vq->busyloop_endtime = endtime;
> +				break;
> +			}
> +			if (!vhost_vq_avail_empty(vq->dev, vq))
> +				break;
>  			cpu_relax();
> +		}
>  		preempt_enable();
>  		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>  				      out_num, in_num, NULL, NULL);
> @@ -509,12 +521,16 @@ static void handle_tx(struct vhost_net *net)
>  			break;
>  		/* Nothing new?  Wait for eventfd to tell us they refilled. */
>  		if (head == vq->num) {
> -			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> +			if (unlikely(vq->busyloop_endtime)) {
> +				/* Busy poll is interrupted. */
> +				vhost_poll_queue(&vq->poll);
> +			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>  				vhost_disable_notify(&net->dev, vq);
>  				continue;
>  			}
>  			break;
>  		}
> +		vq->busyloop_endtime = 0;
>  		if (in) {
>  			vq_err(vq, "Unexpected descriptor format for TX: "
>  			       "out %d, int %d\n", out, in);
> @@ -642,39 +658,51 @@ static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
>  
>  static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
>  {
> -	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
> -	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
> -	struct vhost_virtqueue *vq = &nvq->vq;
> +	struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX];
> +	struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX];
> +	struct vhost_virtqueue *rvq = &rnvq->vq;
> +	struct vhost_virtqueue *tvq = &tnvq->vq;
>  	unsigned long uninitialized_var(endtime);
> -	int len = peek_head_len(rvq, sk);
> +	int len = peek_head_len(rnvq, sk);
>  
> -	if (!len && vq->busyloop_timeout) {
> +	if (!len && tvq->busyloop_timeout) {
>  		/* Flush batched heads first */
> -		vhost_rx_signal_used(rvq);
> +		vhost_rx_signal_used(rnvq);
>  		/* Both tx vq and rx socket were polled here */
> -		mutex_lock_nested(&vq->mutex, 1);
> -		vhost_disable_notify(&net->dev, vq);
> +		mutex_lock_nested(&tvq->mutex, 1);
> +		vhost_disable_notify(&net->dev, tvq);
>  
>  		preempt_disable();
> -		endtime = busy_clock() + vq->busyloop_timeout;
> +		if (rvq->busyloop_endtime) {
> +			endtime = rvq->busyloop_endtime;
> +			rvq->busyloop_endtime = 0;
> +		} else {
> +			endtime = busy_clock() + tvq->busyloop_timeout;
> +		}
>  
> -		while (vhost_can_busy_poll(&net->dev, endtime) &&
> -		       !sk_has_rx_data(sk) &&
> -		       vhost_vq_avail_empty(&net->dev, vq))
> +		while (vhost_can_busy_poll(endtime)) {
> +			if (vhost_busy_poll_interrupted(&net->dev)) {
> +				rvq->busyloop_endtime = endtime;
> +				break;
> +			}
> +			if (sk_has_rx_data(sk) ||
> +			    !vhost_vq_avail_empty(&net->dev, tvq))
> +				break;
>  			cpu_relax();
> +		}
>  
>  		preempt_enable();
>  
> -		if (!vhost_vq_avail_empty(&net->dev, vq))
> -			vhost_poll_queue(&vq->poll);
> -		else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> -			vhost_disable_notify(&net->dev, vq);
> -			vhost_poll_queue(&vq->poll);
> +		if (!vhost_vq_avail_empty(&net->dev, tvq)) {
> +			vhost_poll_queue(&tvq->poll);
> +		} else if (unlikely(vhost_enable_notify(&net->dev, tvq))) {
> +			vhost_disable_notify(&net->dev, tvq);
> +			vhost_poll_queue(&tvq->poll);
>  		}
>  
> -		mutex_unlock(&vq->mutex);
> +		mutex_unlock(&tvq->mutex);
>  
> -		len = peek_head_len(rvq, sk);
> +		len = peek_head_len(rnvq, sk);
>  	}
>  
>  	return len;
> @@ -804,6 +832,7 @@ static void handle_rx(struct vhost_net *net)
>  	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
>  
>  	while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
> +		vq->busyloop_endtime = 0;
>  		sock_len += sock_hlen;
>  		vhost_len = sock_len + vhost_hlen;
>  		headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
> @@ -889,7 +918,12 @@ static void handle_rx(struct vhost_net *net)
>  			goto out;
>  		}
>  	}
> -	vhost_net_enable_vq(net, vq);
> +	if (unlikely(vq->busyloop_endtime)) {
> +		/* Busy poll is interrupted. */
> +		vhost_poll_queue(&vq->poll);
> +	} else {
> +		vhost_net_enable_vq(net, vq);
> +	}
>  out:
>  	vhost_rx_signal_used(nvq);
>  	mutex_unlock(&vq->mutex);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 9beefa6ed1ce..fe83578fe336 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -323,6 +323,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vhost_reset_is_le(vq);
>  	vhost_disable_cross_endian(vq);
>  	vq->busyloop_timeout = 0;
> +	vq->busyloop_endtime = 0;
>  	vq->umem = NULL;
>  	vq->iotlb = NULL;
>  	__vhost_vq_meta_reset(vq);
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 6c844b90a168..7e9cf80ccae9 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -144,6 +144,7 @@ struct vhost_virtqueue {
>  	bool user_be;
>  #endif
>  	u32 busyloop_timeout;
> +	unsigned long busyloop_endtime;
>  };
>  
>  struct vhost_msg_node {
> -- 
> 2.14.2
> 

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] net: phy: DP83TC811: Add INT_STAT3
From: Dan Murphy @ 2018-06-29 16:17 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: f.fainelli, netdev, linux-kernel
In-Reply-To: <20180629154502.GG11285@lunn.ch>

Andrew

On 06/29/2018 10:45 AM, Andrew Lunn wrote:
> On Fri, Jun 29, 2018 at 10:35:45AM -0500, Dan Murphy wrote:
>> Add INT_STAT3 interrupt setting and clearing
>> support.
>>
>> Signed-off-by: Dan Murphy <dmurphy@ti.com>
>> ---
>>
>> v2 - Removed bug fix removal of writing INT_STAT1 twice when disabling interrupts
>>
>>  drivers/net/phy/dp83tc811.c | 26 ++++++++++++++++++++++++++
>>  1 file changed, 26 insertions(+)
>>
>> diff --git a/drivers/net/phy/dp83tc811.c b/drivers/net/phy/dp83tc811.c
>> index 49ac678eb2dc..f8653f5d8789 100644
>> --- a/drivers/net/phy/dp83tc811.c
>> +++ b/drivers/net/phy/dp83tc811.c
>> @@ -21,6 +21,7 @@
>>  #define MII_DP83811_SGMII_CTRL	0x09
>>  #define MII_DP83811_INT_STAT1	0x12
>>  #define MII_DP83811_INT_STAT2	0x13
>> +#define MII_DP83811_INT_STAT3	0x18
>>  #define MII_DP83811_RESET_CTRL	0x1f
>>  
>>  #define DP83811_HW_RESET	BIT(15)
>> @@ -44,6 +45,11 @@
>>  #define DP83811_OVERVOLTAGE_INT_EN	BIT(6)
>>  #define DP83811_UNDERVOLTAGE_INT_EN	BIT(7)
>>  
>> +/* INT_STAT3 bits */
>> +#define DP83811_LPS_INT_EN	BIT(0)
>> +#define DP83811_NO_FRAME_INT_EN	BIT(3)
>> +#define DP83811_POR_DONE_INT_EN	BIT(4)
>> +
>>  #define MII_DP83811_RXSOP1	0x04a5
>>  #define MII_DP83811_RXSOP2	0x04a6
>>  #define MII_DP83811_RXSOP3	0x04a7
>> @@ -81,6 +87,10 @@ static int dp83811_ack_interrupt(struct phy_device *phydev)
>>  	if (err < 0)
>>  		return err;
>>  
>> +	err = phy_read(phydev, MII_DP83811_INT_STAT3);
>> +	if (err < 0)
>> +		return err;
>> +
>>  	return 0;
>>  }
>>  
>> @@ -216,6 +226,18 @@ static int dp83811_config_intr(struct phy_device *phydev)
>>  				DP83811_UNDERVOLTAGE_INT_EN);
>>  
>>  		err = phy_write(phydev, MII_DP83811_INT_STAT2, misr_status);
> 
> Hi Dan
> 
> Isn't this going to fail to apply because net-next says STAT1 here?
> 

Yes but this should not be a pre-requisite for a code review.

Maybe I should have put the RFC in the subject.

> That is why i said you need to wait for David to merge net into
> net-next. Then you can submit these patches, and not have conflicts.
> 
> 	  Andrew
> 


-- 
------------------
Dan Murphy

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox