* [RFC PATCH 0/2] net: threadable napi poll loop
@ 2016-05-10 14:11 Paolo Abeni
  2016-05-10 14:11 ` [RFC PATCH 1/2] net: implement threaded-able napi poll loop support Paolo Abeni
                   ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Paolo Abeni @ 2016-05-10 14:11 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jiri Pirko, Daniel Borkmann,
	Alexei Starovoitov, Alexander Duyck, Tom Herbert, Peter Zijlstra,
	Ingo Molnar, Rik van Riel, Hannes Frederic Sowa, linux-kernel
Currently, the softirq loop can be scheduled both inside the ksofirqd kernel
thread and inside any running process. This makes nearly impossible for the
process scheduler to balance in a fair way the amount of time that
a given core spends performing the softirq loop.
Under high network load, the softirq loop can take nearly 100% of a given CPU,
leaving very little time for use space processing. On single core hosts, this
means that the user space can nearly starve; for example super_netperf
UDP_STREAM tests towards a remote single core vCPU guest[1] can measure an
aggregated throughput of a few thousands pps, and the same behavior can be
reproduced even on bare-metal, eventually simulating a single core with taskset
and/or sysfs configuration.
This patch series allows the administrator to let the napi poll loop run inside
its own kernel thread, a thread for each napi instance, while retaining the
default, softirq-based behavior. The RPS mechanism is currently not affected.
When the napi poll loop is run inside a proper kernel thread, the process
scheduler can fairly balance the rx job between the user space application and
the kernel and give the administrator the ability to manage the network workload
with scheduler tools and configuration.
With the default scheduling policy, the starvation issue observed on single vCPU
guest under UDP flood is solved and the throughput measured under heavy
overload is quite stable around the peak performance.
In the remote host to VM scenario, running even the hypervisor napi poll loop
in threaded mode gives additional benefit, since the process scheduler can
more easily avoid cpu conflict between the VM process and the kernel thread
processing the rx packets.
The raw numbers, obtained with the super_neterf UDP_STREAM test, in a remote
host to VM scenario, using a tun device with a noqueue qdisc in the hypervisor
and using 'sdfn' for the rx flow hash on the ingress device, are as follow:
		vanilla		guest threaded		both hypevisor and
							guest threaded
size/flow	kpps		kpps/delta		kpps/delta
1/1		746		901/+20%		1024/+37%
1/25		185		585/+215%		789/+325%
1/50		330		642/+94%		843/+155%
1/100		180		662/+267%		872/+383%
1/200		177		672/+279%		812/+358%
64/1		707		1042/+47%		1062/+50%
64/25		320		586/+83%		746/+132%
64/50		195		648/+232%		761/+290%
64/100		221		666/+200%		787/+255%
64/200		186		688/+268%		793/+325%
256/1		475		777/+63%		809/+70%
256/25		303		589/+83%		860/+183%
256/50		308		584/+89%		825/+168%
256/100		268		698/+159%		785/+191%
256/200		186		656/+398%		795/+503%
1438/1		619		664/+7%			640/+3%
1438/25		519		766/+47%		829/+59%
1438/50		451		712/+57%		820/+81%
1438/100	294		759/+158%		797/+170%
1438/200	262		728/+177%		769/+193%
4096/1		176		207/+17%		200/+13%
4096/25		225		275/+22%		286/+27%
4096/50		212		272/+28%		283/+33%
4096/100	168		264/+57%		283/+68%
4096/200	134		240/+78%		273/+102%
64000/1		16		18/+13%			18/+13%
64000/25	18		18/0			18/0
64000/50	18		18/0			18/0
64000/100	18		18/0			18/0
64000/200	15		15/0			15/0
This patchset is a first RFC but in the long run we would like to move
more and more NAPI instances into kthreads. The kthread approach should
give a lot of new advantages over the softirq based approach:
* moving into a more dpdk-alike busy poll packet processing direction:
we can even use busy polling without the need of a connected UDP or TCP
socket and can leverage busy polling for forwarding setups. This could
very well increase latency and packet throughput without hurting other
processes if the networking stack gets more and more preemptive in the
future.
* possibility to acquire mutexes in the networking processing path: e.g.
we would need that to configure hw_breakpoints if we want to add
watchpoints in the memory based on some rules in the kernel
* more and better tooling to adjust the weight of the networking
kthreads, preferring certain networking cards or setting cpus affinity
on packet processing threads. Maybe also using deadline scheduling or
other scheduler features might be worthwhile.
* scheduler statistics can be used to observe network packet processing
At this point we are not really sure if we should go with this simpler
approach by putting NAPI itself into kthreads or leverage the threadirqs
function by putting the whole interrupt into a thread and signaling NAPI
that it does not reschedule itself in a softirq but to simply run at
this particular context of the interrupt handler.
While the threaded irq way seems to better integrate into the kernel and
also other devices could move their interrupts into the threads easily
on a common policy, we don't know how to really express the necessary
knobs with the current device driver model (module parameters, sysfs
attributes, etc.). This is where we would like to hear some opinions.
NAPI would e.g. have to query the kernel if the particular IRQ/MSI if it
should be scheduled in a softirq or in a thread, so we don't have to
rewrite all device drivers. This might even be needed on a per rx-queue
granularity.
[1] when the flows are processed by the hypervisor on different rx queues, i.e.
the flows use different source/destination IPs or the hypervisor uses the L4
header to compute the rx hash.
Paolo Abeni (2):
  net: implement threaded-able napi poll loop support
  net: add sysfs attribute to control napi threaded mode
 include/linux/netdevice.h |   4 ++
 net/core/dev.c            | 113 ++++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-sysfs.c      | 102 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 219 insertions(+)
-- 
1.8.3.1
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [RFC PATCH 1/2] net: implement threaded-able napi poll loop support
  2016-05-10 14:11 [RFC PATCH 0/2] net: threadable napi poll loop Paolo Abeni
@ 2016-05-10 14:11 ` Paolo Abeni
  2016-05-10 14:11 ` [RFC PATCH 2/2] net: add sysfs attribute to control napi threaded mode Paolo Abeni
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Paolo Abeni @ 2016-05-10 14:11 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jiri Pirko, Daniel Borkmann,
	Alexei Starovoitov, Alexander Duyck, Tom Herbert, Peter Zijlstra,
	Ingo Molnar, Rik van Riel, Hannes Frederic Sowa, linux-kernel
This patch allows running each napi poll loop inside its
own kernel thread.
The rx mode can be enabled per napi instance via the
newly addded napi_set_threaded() api; the requested kthread
will be created on demand and shut down on device stop.
Once that threaded mode is enabled and the kthread is
started, napi_schedule() will wake-up such thread instead
of scheduling the softirq.
The threaded poll loop behaves quite likely the net_rx_action,
but it does not have to manipulate local irqs and uses
an explicit scheduling point based on netdev_budget.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/netdevice.h |   4 ++
 net/core/dev.c            | 113 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 63580e6..0722ed5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -323,6 +323,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+	struct task_struct	*thread;
 };
 
 enum {
@@ -331,6 +332,7 @@ enum {
 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
 	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
 	NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
+	NAPI_STATE_THREADED,	/* The poll is performed inside its own thread*/
 };
 
 enum gro_result {
@@ -475,6 +477,8 @@ static inline void napi_complete(struct napi_struct *n)
  */
 void napi_hash_add(struct napi_struct *napi);
 
+int napi_set_threaded(struct napi_struct *n, bool threded);
+
 /**
  *	napi_hash_del - remove a NAPI from global table
  *	@napi: NAPI context
diff --git a/net/core/dev.c b/net/core/dev.c
index c749033..0de286b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -94,6 +94,7 @@
 #include <linux/ethtool.h>
 #include <linux/notifier.h>
 #include <linux/skbuff.h>
+#include <linux/kthread.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/busy_poll.h>
@@ -1305,9 +1306,19 @@ void netdev_notify_peers(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_notify_peers);
 
+static int napi_threaded_poll(void *data);
+
+static inline void napi_thread_start(struct napi_struct *n)
+{
+	if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
+		n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
+					   n->dev->name, n->napi_id);
+}
+
 static int __dev_open(struct net_device *dev)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
+	struct napi_struct *n;
 	int ret;
 
 	ASSERT_RTNL();
@@ -1334,6 +1345,9 @@ static int __dev_open(struct net_device *dev)
 	if (!ret && ops->ndo_open)
 		ret = ops->ndo_open(dev);
 
+	list_for_each_entry(n, &dev->napi_list, dev_list)
+		napi_thread_start(n);
+
 	netpoll_poll_enable(dev);
 
 	if (ret)
@@ -1378,6 +1392,14 @@ int dev_open(struct net_device *dev)
 }
 EXPORT_SYMBOL(dev_open);
 
+static inline void napi_thread_stop(struct napi_struct *n)
+{
+	if (!n->thread)
+		return;
+	kthread_stop(n->thread);
+	n->thread = NULL;
+}
+
 static int __dev_close_many(struct list_head *head)
 {
 	struct net_device *dev;
@@ -1406,6 +1428,7 @@ static int __dev_close_many(struct list_head *head)
 
 	list_for_each_entry(dev, head, close_list) {
 		const struct net_device_ops *ops = dev->netdev_ops;
+		struct napi_struct *n;
 
 		/*
 		 *	Call the device specific close. This cannot fail.
@@ -1417,6 +1440,9 @@ static int __dev_close_many(struct list_head *head)
 		if (ops->ndo_stop)
 			ops->ndo_stop(dev);
 
+		list_for_each_entry(n, &dev->napi_list, dev_list)
+			napi_thread_stop(n);
+
 		dev->flags &= ~IFF_UP;
 		netpoll_poll_enable(dev);
 	}
@@ -3456,6 +3482,11 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
 static inline void ____napi_schedule(struct softnet_data *sd,
 				     struct napi_struct *napi)
 {
+	if (napi->thread) {
+		wake_up_process(napi->thread);
+		return;
+	}
+
 	list_add_tail(&napi->poll_list, &sd->poll_list);
 	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 }
@@ -5174,6 +5205,88 @@ out_unlock:
 	return work;
 }
 
+static int napi_thread_wait(struct napi_struct *napi)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	while (!kthread_should_stop() && !napi_disable_pending(napi)) {
+		if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+			__set_current_state(TASK_RUNNING);
+			return 0;
+		}
+
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+	__set_current_state(TASK_RUNNING);
+	return -1;
+}
+
+static int napi_threaded_poll(void *data)
+{
+	struct napi_struct *napi = data;
+
+	while (!napi_thread_wait(napi)) {
+		struct list_head dummy_repoll;
+		int budget = netdev_budget;
+		unsigned long time_limit;
+		bool again = true;
+
+		INIT_LIST_HEAD(&dummy_repoll);
+		local_bh_disable();
+		time_limit = jiffies + 2;
+		do {
+			/* ensure that the poll list is not empty */
+			if (list_empty(&dummy_repoll))
+				list_add(&napi->poll_list, &dummy_repoll);
+
+			budget -= napi_poll(napi, &dummy_repoll);
+			if (unlikely(budget <= 0 ||
+				     time_after_eq(jiffies, time_limit))) {
+				cond_resched_softirq();
+
+				/* refresh the budget */
+				budget = netdev_budget;
+				__kfree_skb_flush();
+				time_limit = jiffies + 2;
+			}
+
+			if (napi_disable_pending(napi))
+				again = false;
+			else if (!test_bit(NAPI_STATE_SCHED, &napi->state))
+				again = false;
+		} while (again);
+
+		__kfree_skb_flush();
+		local_bh_enable();
+	}
+	return 0;
+}
+
+int napi_set_threaded(struct napi_struct *n, bool threaded)
+{
+	ASSERT_RTNL();
+
+	if (n->dev->flags & IFF_UP)
+		return -EBUSY;
+
+	if (threaded == !!test_bit(NAPI_STATE_THREADED, &n->state))
+		return 0;
+	if (threaded)
+		set_bit(NAPI_STATE_THREADED, &n->state);
+	else
+		clear_bit(NAPI_STATE_THREADED, &n->state);
+
+	/* if the device is initializing, nothing todo */
+	if (test_bit(__LINK_STATE_START, &n->dev->state))
+		return 0;
+
+	napi_thread_stop(n);
+	napi_thread_start(n);
+	return 0;
+}
+EXPORT_SYMBOL(napi_set_threaded);
+
 static void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
-- 
1.8.3.1
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [RFC PATCH 2/2] net: add sysfs attribute to control napi threaded mode
  2016-05-10 14:11 [RFC PATCH 0/2] net: threadable napi poll loop Paolo Abeni
  2016-05-10 14:11 ` [RFC PATCH 1/2] net: implement threaded-able napi poll loop support Paolo Abeni
@ 2016-05-10 14:11 ` Paolo Abeni
  2016-05-10 14:29 ` [RFC PATCH 0/2] net: threadable napi poll loop Eric Dumazet
  2016-05-10 15:57 ` Thomas Gleixner
  3 siblings, 0 replies; 48+ messages in thread
From: Paolo Abeni @ 2016-05-10 14:11 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jiri Pirko, Daniel Borkmann,
	Alexei Starovoitov, Alexander Duyck, Tom Herbert, Peter Zijlstra,
	Ingo Molnar, Rik van Riel, Hannes Frederic Sowa, linux-kernel
this patch addis a new sysfs attribute to the network
device class. Said attribute is a bitmask that allows controlling
the threaded mode for all the napi instances of the given
network device.
The threaded mode can be switched only if related network device
is down.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/core/net-sysfs.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 102 insertions(+)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 2b3f76f..60bc768 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -489,6 +489,107 @@ static ssize_t phys_switch_id_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(phys_switch_id);
 
+unsigned long *__alloc_thread_bitmap(struct net_device *netdev, int *bits)
+{
+	struct napi_struct *n;
+
+	*bits = 0;
+	list_for_each_entry(n, &netdev->napi_list, dev_list)
+		(*bits)++;
+
+	return kmalloc_array(BITS_TO_LONGS(*bits), sizeof(unsigned long),
+			     GFP_ATOMIC | __GFP_ZERO);
+}
+
+static ssize_t threaded_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct napi_struct *n;
+	unsigned long *bmap;
+	size_t count = 0;
+	int i, bits;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	if (!dev_isalive(netdev))
+		goto unlock;
+
+	bmap = __alloc_thread_bitmap(netdev, &bits);
+	if (!bmap) {
+		count = -ENOMEM;
+		goto unlock;
+	}
+
+	i = 0;
+	list_for_each_entry(n, &netdev->napi_list, dev_list) {
+		if (test_bit(NAPI_STATE_THREADED, &n->state))
+			set_bit(i, bmap);
+		i++;
+	}
+
+	count = bitmap_print_to_pagebuf(true, buf, bmap, bits);
+	kfree(bmap);
+
+unlock:
+	rtnl_unlock();
+
+	return count;
+}
+
+static ssize_t threaded_store(struct device *dev,
+			      struct device_attribute *attr,
+			      const char *buf, size_t len)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct napi_struct *n;
+	unsigned long *bmap;
+	int i, bits;
+	size_t ret;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	if (!dev_isalive(netdev)) {
+		ret = len;
+		goto unlock;
+	}
+
+	if (netdev->flags & IFF_UP) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	bmap = __alloc_thread_bitmap(netdev, &bits);
+	if (!bmap) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	ret = bitmap_parselist(buf, bmap, bits);
+	if (ret)
+		goto free_unlock;
+
+	i = 0;
+	list_for_each_entry(n, &netdev->napi_list, dev_list) {
+		napi_set_threaded(n, test_bit(i, bmap));
+		i++;
+	}
+	ret = len;
+
+free_unlock:
+	kfree(bmap);
+
+unlock:
+	rtnl_unlock();
+	return ret;
+}
+static DEVICE_ATTR_RW(threaded);
+
 static struct attribute *net_class_attrs[] = {
 	&dev_attr_netdev_group.attr,
 	&dev_attr_type.attr,
@@ -517,6 +618,7 @@ static struct attribute *net_class_attrs[] = {
 	&dev_attr_phys_port_name.attr,
 	&dev_attr_phys_switch_id.attr,
 	&dev_attr_proto_down.attr,
+	&dev_attr_threaded.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(net_class);
-- 
1.8.3.1
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 14:11 [RFC PATCH 0/2] net: threadable napi poll loop Paolo Abeni
  2016-05-10 14:11 ` [RFC PATCH 1/2] net: implement threaded-able napi poll loop support Paolo Abeni
  2016-05-10 14:11 ` [RFC PATCH 2/2] net: add sysfs attribute to control napi threaded mode Paolo Abeni
@ 2016-05-10 14:29 ` Eric Dumazet
  2016-05-10 15:51   ` David Miller
                     ` (2 more replies)
  2016-05-10 15:57 ` Thomas Gleixner
  3 siblings, 3 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-10 14:29 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Eric Dumazet, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, Hannes Frederic Sowa,
	linux-kernel
On Tue, 2016-05-10 at 16:11 +0200, Paolo Abeni wrote:
> Currently, the softirq loop can be scheduled both inside the ksofirqd kernel
> thread and inside any running process. This makes nearly impossible for the
> process scheduler to balance in a fair way the amount of time that
> a given core spends performing the softirq loop.
> 
> Under high network load, the softirq loop can take nearly 100% of a given CPU,
> leaving very little time for use space processing. On single core hosts, this
> means that the user space can nearly starve; for example super_netperf
> UDP_STREAM tests towards a remote single core vCPU guest[1] can measure an
> aggregated throughput of a few thousands pps, and the same behavior can be
> reproduced even on bare-metal, eventually simulating a single core with taskset
> and/or sysfs configuration.
I hate these patches and ideas guys, sorry. That is before my breakfast,
but still...
I have enough hard time dealing with loads where ksoftirqd has to
compete with user threads that thought that playing with priorities was
a nice idea.
Guess what, when they lose networking they complain.
We already have ksoftirqd to normally cope with the case you are
describing.
If it is not working as intended, please identify the bugs and fix them,
instead of adding yet another tests in fast path and extra complexity in
the stack.
In the one vcpu case, allowing the user thread to consume more UDP
packets from the target UDP socket will also make your NIC drop more
packets, that are not necessarily packets for the same socket.
So you are shifting the attack to a different target,
at the expense of more kernel bloat.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 14:29 ` [RFC PATCH 0/2] net: threadable napi poll loop Eric Dumazet
@ 2016-05-10 15:51   ` David Miller
  2016-05-10 16:03   ` Paolo Abeni
  2016-05-10 20:46   ` Hannes Frederic Sowa
  2 siblings, 0 replies; 48+ messages in thread
From: David Miller @ 2016-05-10 15:51 UTC (permalink / raw)
  To: eric.dumazet
  Cc: pabeni, netdev, edumazet, jiri, daniel, ast, aduyck, tom, peterz,
	mingo, riel, hannes, linux-kernel
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 10 May 2016 07:29:50 -0700
> We already have ksoftirqd to normally cope with the case you are
> describing.
> 
> If it is not working as intended, please identify the bugs and fix them,
> instead of adding yet another tests in fast path and extra complexity in
> the stack.
+1
Indeed, if ksoftirqd is not doing it's job, please fix it.  It is designed
exactly to deal with the problems described here.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 14:11 [RFC PATCH 0/2] net: threadable napi poll loop Paolo Abeni
                   ` (2 preceding siblings ...)
  2016-05-10 14:29 ` [RFC PATCH 0/2] net: threadable napi poll loop Eric Dumazet
@ 2016-05-10 15:57 ` Thomas Gleixner
  2016-05-10 20:41   ` Paolo Abeni
  3 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2016-05-10 15:57 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Eric Dumazet, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, Hannes Frederic Sowa,
	linux-kernel
On Tue, 10 May 2016, Paolo Abeni wrote:
Nice patch set and very promising results! 
> At this point we are not really sure if we should go with this simpler
> approach by putting NAPI itself into kthreads or leverage the threadirqs
> function by putting the whole interrupt into a thread and signaling NAPI
> that it does not reschedule itself in a softirq but to simply run at
> this particular context of the interrupt handler.
> 
> While the threaded irq way seems to better integrate into the kernel and
> also other devices could move their interrupts into the threads easily
> on a common policy, we don't know how to really express the necessary
> knobs with the current device driver model (module parameters, sysfs
> attributes, etc.). This is where we would like to hear some opinions.
> NAPI would e.g. have to query the kernel if the particular IRQ/MSI if it
> should be scheduled in a softirq or in a thread, so we don't have to
> rewrite all device drivers. This might even be needed on a per rx-queue
> granularity.
Utilizing threaded irqs should be halfways simple even without touching the
device driver at all.
We can do the switch to threading in two ways:
1) Let the driver request the interrupt(s) as it does now and then have a
   /proc/irq/NNN/threaded file which converts it to a threaded interrupt on
   the fly. That should be fairly trivial.
2) Let the driver request the interrupt(s) as it does now and retrieve the
   interrupt number which belongs to the device/queue from the network core
   and let the irq core switch it over to threaded.
So the interrupt flow of the device would be:
interrupt
    IRQ_WAKE_THREAD
    
irq thread()
{ 
    local_bh_disable();
    action->thread_fn(action->irq, action->dev_id); <-- driver handler
    irq_finalize_oneshot(desc, action);
    local_bh_enable();
}
The driver irq handler calls napi_schedule(). So if your napi_struct is
flagged POLL_IRQ_THREAD then you can call your polling machinery from there
instead of raising the softirq.
You surely need some way to figure out whether the interrupt is threaded when
you set up the device in order to flag your napi struct, but that should be
not too hard to achieve.
Thanks,
	tglx
   
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 14:29 ` [RFC PATCH 0/2] net: threadable napi poll loop Eric Dumazet
  2016-05-10 15:51   ` David Miller
@ 2016-05-10 16:03   ` Paolo Abeni
  2016-05-10 16:08     ` Eric Dumazet
  2016-05-10 20:46   ` Hannes Frederic Sowa
  2 siblings, 1 reply; 48+ messages in thread
From: Paolo Abeni @ 2016-05-10 16:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Eric Dumazet, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, Hannes Frederic Sowa,
	linux-kernel
Hi,
On Tue, 2016-05-10 at 07:29 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 16:11 +0200, Paolo Abeni wrote:
> > Currently, the softirq loop can be scheduled both inside the ksofirqd kernel
> > thread and inside any running process. This makes nearly impossible for the
> > process scheduler to balance in a fair way the amount of time that
> > a given core spends performing the softirq loop.
> > 
> > Under high network load, the softirq loop can take nearly 100% of a given CPU,
> > leaving very little time for use space processing. On single core hosts, this
> > means that the user space can nearly starve; for example super_netperf
> > UDP_STREAM tests towards a remote single core vCPU guest[1] can measure an
> > aggregated throughput of a few thousands pps, and the same behavior can be
> > reproduced even on bare-metal, eventually simulating a single core with taskset
> > and/or sysfs configuration.
> 
> I hate these patches and ideas guys, sorry. That is before my breakfast,
> but still...
I'm sorry, I did not meant to spoil your breakfast ;-)
> I have enough hard time dealing with loads where ksoftirqd has to
> compete with user threads that thought that playing with priorities was
> a nice idea.
I fear there is a misunderstanding. I'm not suggesting to fiddle with
priorities; the above 'taskset' reference was just an hint to replicate
the starvation issue on bare-metal in the lack of a single core host.
> 
> Guess what, when they lose networking they complain.
> 
> We already have ksoftirqd to normally cope with the case you are
> describing.
> 
> If it is not working as intended, please identify the bugs and fix them,
> instead of adding yet another tests in fast path and extra complexity in
> the stack.
The idea it exactly that: the problem is how the softirq loop is
scheduled and executed, i.e. the current ksoftirqd/"inline loop" model. 
If a single core host is under network flood, i.e. ksoftirqd is
scheduled and it eventually (after processing ~640 packets) will let the
user space process run. The latter will execute a syscall to receive a
packet, which will have to disable/enable bh at least once and that will
cause the processing of another ~640 packets. To receive a single packet
in user space, the kernel has to process more than one thousand packets.
AFAICS it can't be solved without changing how the net_rx_action is
served.
The user space starvation issue don't affect large server, but AFAIK
many small devices have a lot of out-of-tree hacks to cope with this
sort of issues.
In the VM scenario, the starvation issue was not a real concern up to a
little time ago because the vhost/tun device was not able to push
packets fast enough into the guest to trigger the issue. Recent
improvements have changed the situation.
Also, the scheduler's ability to migrate the napi threads is quite
beneficial for hypervisor when the VMs are receiving a lot of network
traffic. 
Please have a look at the performance numbers.
The current patch adds a single, simple, test per napi_schedule
invocation, and with minimal changes, the kernel won't access any
additional cache-line when the napi thread is disabled. Even in the
current form, in my tests no regression is seen with the patched kernel
when the napi thread mode is disabled.
> In the one vcpu case, allowing the user thread to consume more UDP
> packets from the target UDP socket will also make your NIC drop more
> packets, that are not necessarily packets for the same socket.
That is true. But the threaded napi will not starve, i.e. the forwarding
process, to a nearly zero packet rate, while with the current code the
reverse scenario can happen. 
Cheers,
Paolo
> 
> So you are shifting the attack to a different target,
> at the expense of more kernel bloat.
> 
> 
> 
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 16:03   ` Paolo Abeni
@ 2016-05-10 16:08     ` Eric Dumazet
  2016-05-10 20:22       ` Paolo Abeni
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2016-05-10 16:08 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Eric Dumazet, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, Hannes Frederic Sowa,
	linux-kernel
On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
> If a single core host is under network flood, i.e. ksoftirqd is
> scheduled and it eventually (after processing ~640 packets) will let the
> user space process run. The latter will execute a syscall to receive a
> packet, which will have to disable/enable bh at least once and that will
> cause the processing of another ~640 packets. To receive a single packet
> in user space, the kernel has to process more than one thousand packets.
Looks you found the bug then. Have you tried to fix it ?
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 16:08     ` Eric Dumazet
@ 2016-05-10 20:22       ` Paolo Abeni
  2016-05-10 20:45         ` David Miller
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Abeni @ 2016-05-10 20:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Eric Dumazet, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, Hannes Frederic Sowa,
	linux-kernel
On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
> 
> > If a single core host is under network flood, i.e. ksoftirqd is
> > scheduled and it eventually (after processing ~640 packets) will let the
> > user space process run. The latter will execute a syscall to receive a
> > packet, which will have to disable/enable bh at least once and that will
> > cause the processing of another ~640 packets. To receive a single packet
> > in user space, the kernel has to process more than one thousand packets.
> 
> Looks you found the bug then. Have you tried to fix it ?
The core functionality is implemented in ~100 lines of code, is that
the kind of bloat that do concerns you ?
That could probably be improved removing some code duplication, i.e.
factorizing napi_thread_wait() with irq_wait_for_interrupt() and
possibly napi_threaded_poll() with net_rx_action(). 
If the additional test inside napi_schedule() is really scaring, it can
be guarded with a static_key.
The ksoftirq and the local_bh_enable() design are the root of the
problem, they need to be touched/affected to solve it.
We actually experimented several different options.
Limiting the amount of work performed by local_bh_enable() somewhat
mitigate the issue, but it adds just another kernel parameter difficult
to be tuned.
Running the softirq loop exclusively inside the ksoftirqd will solve the
issue, but this is a very invasive approach, affecting all others
subsystem.
The above can be restricted to the net_rx_action only (i.e. running
net_rx_action always in ksoftirqd context). The related patch isn't
really much simpler than this and will add at least the same number of
additional tests in fast path.
Running the napi loop in a thread that can be migrated gives additional
benefit in the hyper-visor/VM scenario, which can't be achieved
elsewhere.
Would you consider the threaded irq alternative more viable ?
Cheers,
Paolo
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 15:57 ` Thomas Gleixner
@ 2016-05-10 20:41   ` Paolo Abeni
  0 siblings, 0 replies; 48+ messages in thread
From: Paolo Abeni @ 2016-05-10 20:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, David S. Miller, Eric Dumazet, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, Hannes Frederic Sowa,
	linux-kernel
On Tue, 2016-05-10 at 17:57 +0200, Thomas Gleixner wrote:
> On Tue, 10 May 2016, Paolo Abeni wrote:
> 
> Nice patch set and very promising results! 
> 
> > At this point we are not really sure if we should go with this simpler
> > approach by putting NAPI itself into kthreads or leverage the threadirqs
> > function by putting the whole interrupt into a thread and signaling NAPI
> > that it does not reschedule itself in a softirq but to simply run at
> > this particular context of the interrupt handler.
> > 
> > While the threaded irq way seems to better integrate into the kernel and
> > also other devices could move their interrupts into the threads easily
> > on a common policy, we don't know how to really express the necessary
> > knobs with the current device driver model (module parameters, sysfs
> > attributes, etc.). This is where we would like to hear some opinions.
> > NAPI would e.g. have to query the kernel if the particular IRQ/MSI if it
> > should be scheduled in a softirq or in a thread, so we don't have to
> > rewrite all device drivers. This might even be needed on a per rx-queue
> > granularity.
> 
> Utilizing threaded irqs should be halfways simple even without touching the
> device driver at all.
> 
> We can do the switch to threading in two ways:
> 
> 1) Let the driver request the interrupt(s) as it does now and then have a
>    /proc/irq/NNN/threaded file which converts it to a threaded interrupt on
>    the fly. That should be fairly trivial.
> 
> 2) Let the driver request the interrupt(s) as it does now and retrieve the
>    interrupt number which belongs to the device/queue from the network core
>    and let the irq core switch it over to threaded.
Thank you for the feedback.
We actually experimented something similar to (2). In our implementation
we needed a per device chunk of code to do the actual irq number ->
queue mapping (and than we performed as well the switch in the device
code).
> You surely need some way to figure out whether the interrupt is threaded when
> you set up the device in order to flag your napi struct, but that should be
> not too hard to achieve.
This is the part that required per device changes and complicated a bit
the implementation. We can research further to simplify it, according to
the overall discussion.
Cheers,
Paolo
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 20:22       ` Paolo Abeni
@ 2016-05-10 20:45         ` David Miller
  2016-05-10 20:50           ` Rik van Riel
  0 siblings, 1 reply; 48+ messages in thread
From: David Miller @ 2016-05-10 20:45 UTC (permalink / raw)
  To: pabeni
  Cc: eric.dumazet, netdev, edumazet, jiri, daniel, ast, aduyck, tom,
	peterz, mingo, riel, hannes, linux-kernel
From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 10 May 2016 22:22:50 +0200
> On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
>> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
>> 
>> > If a single core host is under network flood, i.e. ksoftirqd is
>> > scheduled and it eventually (after processing ~640 packets) will let the
>> > user space process run. The latter will execute a syscall to receive a
>> > packet, which will have to disable/enable bh at least once and that will
>> > cause the processing of another ~640 packets. To receive a single packet
>> > in user space, the kernel has to process more than one thousand packets.
>> 
>> Looks you found the bug then. Have you tried to fix it ?
 ...
> The ksoftirq and the local_bh_enable() design are the root of the
> problem, they need to be touched/affected to solve it.
That's not what I read from your description, processing 640 packets
before going to ksoftirqd seems to the be the absolute root problem.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 14:29 ` [RFC PATCH 0/2] net: threadable napi poll loop Eric Dumazet
  2016-05-10 15:51   ` David Miller
  2016-05-10 16:03   ` Paolo Abeni
@ 2016-05-10 20:46   ` Hannes Frederic Sowa
  2016-05-10 21:09     ` Eric Dumazet
  2 siblings, 1 reply; 48+ messages in thread
From: Hannes Frederic Sowa @ 2016-05-10 20:46 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni
  Cc: netdev, David S. Miller, Eric Dumazet, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, linux-kernel
Hello,
On 10.05.2016 16:29, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 16:11 +0200, Paolo Abeni wrote:
>> Currently, the softirq loop can be scheduled both inside the ksofirqd kernel
>> thread and inside any running process. This makes nearly impossible for the
>> process scheduler to balance in a fair way the amount of time that
>> a given core spends performing the softirq loop.
>>
>> Under high network load, the softirq loop can take nearly 100% of a given CPU,
>> leaving very little time for use space processing. On single core hosts, this
>> means that the user space can nearly starve; for example super_netperf
>> UDP_STREAM tests towards a remote single core vCPU guest[1] can measure an
>> aggregated throughput of a few thousands pps, and the same behavior can be
>> reproduced even on bare-metal, eventually simulating a single core with taskset
>> and/or sysfs configuration.
> 
> I hate these patches and ideas guys, sorry. That is before my breakfast,
> but still...
:)
> I have enough hard time dealing with loads where ksoftirqd has to
> compete with user threads that thought that playing with priorities was
> a nice idea.
We tried a lot of approaches so far and this seemed to be the best
architectural RFC we could post. I was quite surprised to see such good
performance numbers with threaded NAPI, thus I think it could be a way
forward.
Your mentioned problem above seems to be a configuration mistake, no?
Otherwise isn't that something user space/cgroups might solve?
> Guess what, when they lose networking they complain.
> 
> We already have ksoftirqd to normally cope with the case you are
> describing.
Indeed, but the time until we wake up ksoftirqd can be already quite
long and for every packet we get in udp_recvmsg the local_bh_enable call
let's us pick up quite a lot of new packets, which we drop before user
space can make any progress. By being more fair between user space and
"napid" we hoped to solve this. We also want more feedback from the
scheduler people, so we Cc'ed them also.
> If it is not working as intended, please identify the bugs and fix them,
> instead of adding yet another tests in fast path and extra complexity in
> the stack.
We could use _local_bh_enable instead of local_bh_enable in udp_recvmsg,
which certainly wouldn't branch down to softirqs as often, but this
feels wrong to me and certainly is.
After the discussion on netdev@ with Peter Hurley here [1] about
"Softirq priority inversion from "softirq: reduce latencies"", we didn't
want to propose some patch looking like this again, but this could help.
The idea would be to limit the number we recheck for softirqs but give
back control to user space.
[1] https://lkml.org/lkml/2016/2/27/152
If I remember local_bh_enable in kernel-rt processes one softirq
directly and defers its work to ksoftirqd much more quickly.
> In the one vcpu case, allowing the user thread to consume more UDP
> packets from the target UDP socket will also make your NIC drop more
> packets, that are not necessarily packets for the same socket.
> 
> So you are shifting the attack to a different target,
> at the expense of more kernel bloat.
I agree here, but I don't think this patch particularly is a lot of
bloat and something very interesting people can play with and extend upon.
Thanks,
Hannes
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 20:45         ` David Miller
@ 2016-05-10 20:50           ` Rik van Riel
  2016-05-10 20:52             ` David Miller
  0 siblings, 1 reply; 48+ messages in thread
From: Rik van Riel @ 2016-05-10 20:50 UTC (permalink / raw)
  To: David Miller, pabeni
  Cc: eric.dumazet, netdev, edumazet, jiri, daniel, ast, aduyck, tom,
	peterz, mingo, hannes, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1439 bytes --]
On Tue, 2016-05-10 at 16:45 -0400, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 10 May 2016 22:22:50 +0200
> 
> > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
> >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
> >> 
> >> > If a single core host is under network flood, i.e. ksoftirqd is
> >> > scheduled and it eventually (after processing ~640 packets) will
> let the
> >> > user space process run. The latter will execute a syscall to
> receive a
> >> > packet, which will have to disable/enable bh at least once and
> that will
> >> > cause the processing of another ~640 packets. To receive a
> single packet
> >> > in user space, the kernel has to process more than one thousand
> packets.
> >> 
> >> Looks you found the bug then. Have you tried to fix it ?
>  ...
> > The ksoftirq and the local_bh_enable() design are the root of the
> > problem, they need to be touched/affected to solve it.
> 
> That's not what I read from your description, processing 640 packets
> before going to ksoftirqd seems to the be the absolute root problem.
What would a fix for that look like?
Keep track of the number of processed incoming packets,
and the number of packets handed off, and defer to
ksoftirqd earlier if the statistics suggest packets are
getting dropped on the floor?
Is there a cheap way to do that kind of thing?
-- 
All Rights Reversed.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 20:50           ` Rik van Riel
@ 2016-05-10 20:52             ` David Miller
  2016-05-10 21:01               ` Rik van Riel
  0 siblings, 1 reply; 48+ messages in thread
From: David Miller @ 2016-05-10 20:52 UTC (permalink / raw)
  To: riel
  Cc: pabeni, eric.dumazet, netdev, edumazet, jiri, daniel, ast, aduyck,
	tom, peterz, mingo, hannes, linux-kernel
From: Rik van Riel <riel@redhat.com>
Date: Tue, 10 May 2016 16:50:56 -0400
> On Tue, 2016-05-10 at 16:45 -0400, David Miller wrote:
>> From: Paolo Abeni <pabeni@redhat.com>
>> Date: Tue, 10 May 2016 22:22:50 +0200
>> 
>> > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
>> >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
>> >> 
>> >> > If a single core host is under network flood, i.e. ksoftirqd is
>> >> > scheduled and it eventually (after processing ~640 packets) will
>> let the
>> >> > user space process run. The latter will execute a syscall to
>> receive a
>> >> > packet, which will have to disable/enable bh at least once and
>> that will
>> >> > cause the processing of another ~640 packets. To receive a
>> single packet
>> >> > in user space, the kernel has to process more than one thousand
>> packets.
>> >> 
>> >> Looks you found the bug then. Have you tried to fix it ?
>>  ...
>> > The ksoftirq and the local_bh_enable() design are the root of the
>> > problem, they need to be touched/affected to solve it.
>> 
>> That's not what I read from your description, processing 640 packets
>> before going to ksoftirqd seems to the be the absolute root problem.
> 
> What would a fix for that look like?
> 
> Keep track of the number of processed incoming packets,
> and the number of packets handed off, and defer to
> ksoftirqd earlier if the statistics suggest packets are
> getting dropped on the floor?
Not by packet count but by something more easily to measure and
scalable to fairness like processing time.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 20:52             ` David Miller
@ 2016-05-10 21:01               ` Rik van Riel
  0 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2016-05-10 21:01 UTC (permalink / raw)
  To: David Miller
  Cc: pabeni, eric.dumazet, netdev, edumazet, jiri, daniel, ast, aduyck,
	tom, peterz, mingo, hannes, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1913 bytes --]
On Tue, 2016-05-10 at 16:52 -0400, David Miller wrote:
> From: Rik van Riel <riel@redhat.com>
> Date: Tue, 10 May 2016 16:50:56 -0400
> 
> > On Tue, 2016-05-10 at 16:45 -0400, David Miller wrote:
> >> From: Paolo Abeni <pabeni@redhat.com>
> >> Date: Tue, 10 May 2016 22:22:50 +0200
> >> 
> >> > On Tue, 2016-05-10 at 09:08 -0700, Eric Dumazet wrote:
> >> >> On Tue, 2016-05-10 at 18:03 +0200, Paolo Abeni wrote:
> >> >> 
> >> >> > If a single core host is under network flood, i.e. ksoftirqd
> is
> >> >> > scheduled and it eventually (after processing ~640 packets)
> will
> >> let the
> >> >> > user space process run. The latter will execute a syscall to
> >> receive a
> >> >> > packet, which will have to disable/enable bh at least once
> and
> >> that will
> >> >> > cause the processing of another ~640 packets. To receive a
> >> single packet
> >> >> > in user space, the kernel has to process more than one
> thousand
> >> packets.
> >> >> 
> >> >> Looks you found the bug then. Have you tried to fix it ?
> >>  ...
> >> > The ksoftirq and the local_bh_enable() design are the root of
> the
> >> > problem, they need to be touched/affected to solve it.
> >> 
> >> That's not what I read from your description, processing 640
> packets
> >> before going to ksoftirqd seems to the be the absolute root
> problem.
> > 
> > What would a fix for that look like?
> > 
> > Keep track of the number of processed incoming packets,
> > and the number of packets handed off, and defer to
> > ksoftirqd earlier if the statistics suggest packets are
> > getting dropped on the floor?
> 
> Not by packet count but by something more easily to measure and
> scalable to fairness like processing time.
I need to get back to fixing irq & softirq time accounting,
which does not currently work correctly in all time keeping
modes...
-- 
All Rights Reversed.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 20:46   ` Hannes Frederic Sowa
@ 2016-05-10 21:09     ` Eric Dumazet
  2016-05-10 21:31       ` Eric Dumazet
  2016-05-10 22:32       ` Hannes Frederic Sowa
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-10 21:09 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, Paolo Abeni, netdev, David S. Miller, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> I agree here, but I don't think this patch particularly is a lot of
> bloat and something very interesting people can play with and extend upon.
>
Sure, very rarely patch authors think their stuff is bloat.
I prefer to fix kernel softirq.c, or at least show me that you tried
hard enough.
I am pretty sure that the following would work :
When ksoftirqd is scheduled, remember this in a per cpu variable
(ksoftiqd_scheduled)
When enabling BH , do not call do_softirq() if this variable is set.
ksoftirqd would clear the variable at the right place (probably in
run_ksoftirqd())
Sure, this might add a lot of latency regressions, but lets fix them.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 21:09     ` Eric Dumazet
@ 2016-05-10 21:31       ` Eric Dumazet
  2016-05-10 21:35         ` Rik van Riel
  2016-05-10 22:32       ` Hannes Frederic Sowa
  1 sibling, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2016-05-10 21:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Paolo Abeni, netdev, David S. Miller,
	Jiri Pirko, Daniel Borkmann, Alexei Starovoitov, Alexander Duyck,
	Tom Herbert, Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On Tue, 2016-05-10 at 14:09 -0700, Eric Dumazet wrote:
> On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> 
> > I agree here, but I don't think this patch particularly is a lot of
> > bloat and something very interesting people can play with and extend upon.
> >
> 
> Sure, very rarely patch authors think their stuff is bloat.
> 
> I prefer to fix kernel softirq.c, or at least show me that you tried
> hard enough.
> 
> I am pretty sure that the following would work :
> 
> When ksoftirqd is scheduled, remember this in a per cpu variable
> (ksoftiqd_scheduled)
> 
> When enabling BH , do not call do_softirq() if this variable is set.
> 
> ksoftirqd would clear the variable at the right place (probably in
> run_ksoftirqd())
> 
> Sure, this might add a lot of latency regressions, but lets fix them.
Only to give the idea (it is completely untested and probably buggy)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..cb30cfd76687 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
 static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
 
 const char * const softirq_to_name[NR_SOFTIRQS] = {
 	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
@@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
 	/* Interrupts are disabled: no need to stop preemption */
 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
 
-	if (tsk && tsk->state != TASK_RUNNING)
+	if (tsk && tsk->state != TASK_RUNNING) {
+		__this_cpu_write(ksoftirqd_scheduled, true);
 		wake_up_process(tsk);
+	}
 }
 
 /*
@@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 	 */
 	preempt_count_sub(cnt - 1);
 
-	if (unlikely(!in_interrupt() && local_softirq_pending())) {
+	if (unlikely(!in_interrupt() &&
+		     local_softirq_pending() &&
+		     !__this_cpu_read(ksoftirqd_scheduled))) {
 		/*
 		 * Run softirq if any pending. And do it in its own stack
 		 * as we may be calling this deep in a task call stack already.
@@ -660,6 +665,8 @@ static void run_ksoftirqd(unsigned int cpu)
 		 * in the task stack here.
 		 */
 		__do_softirq();
+		if (!local_softirq_pending())
+			__this_cpu_write(ksoftirqd_scheduled, false);
 		local_irq_enable();
 		cond_resched_rcu_qs();
 		return;
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 21:31       ` Eric Dumazet
@ 2016-05-10 21:35         ` Rik van Riel
  2016-05-10 21:53           ` Eric Dumazet
  0 siblings, 1 reply; 48+ messages in thread
From: Rik van Riel @ 2016-05-10 21:35 UTC (permalink / raw)
  To: Eric Dumazet, Eric Dumazet
  Cc: Hannes Frederic Sowa, Paolo Abeni, netdev, David S. Miller,
	Jiri Pirko, Daniel Borkmann, Alexei Starovoitov, Alexander Duyck,
	Tom Herbert, Peter Zijlstra, Ingo Molnar, LKML
[-- Attachment #1: Type: text/plain, Size: 2439 bytes --]
On Tue, 2016-05-10 at 14:31 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 14:09 -0700, Eric Dumazet wrote:
> > 
> > On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa
> > <hannes@stressinduktion.org> wrote:
> > 
> > > 
> > > I agree here, but I don't think this patch particularly is a lot
> > > of
> > > bloat and something very interesting people can play with and
> > > extend upon.
> > > 
> > Sure, very rarely patch authors think their stuff is bloat.
> > 
> > I prefer to fix kernel softirq.c, or at least show me that you
> > tried
> > hard enough.
> > 
> > I am pretty sure that the following would work :
> > 
> > When ksoftirqd is scheduled, remember this in a per cpu variable
> > (ksoftiqd_scheduled)
> > 
> > When enabling BH , do not call do_softirq() if this variable is
> > set.
> > 
> > ksoftirqd would clear the variable at the right place (probably in
> > run_ksoftirqd())
> > 
> > Sure, this might add a lot of latency regressions, but lets fix
> > them.
> Only to give the idea (it is completely untested and probably buggy)
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 17caf4b63342..cb30cfd76687 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
>  static struct softirq_action softirq_vec[NR_SOFTIRQS]
> __cacheline_aligned_in_smp;
>  
>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
>  
>  const char * const softirq_to_name[NR_SOFTIRQS] = {
>  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
> @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
>  	/* Interrupts are disabled: no need to stop preemption */
>  	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
>  
> -	if (tsk && tsk->state != TASK_RUNNING)
> +	if (tsk && tsk->state != TASK_RUNNING) {
> +		__this_cpu_write(ksoftirqd_scheduled, true);
>  		wake_up_process(tsk);
> +	}
>  }
>  
>  /*
> @@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip,
> unsigned int cnt)
>  	 */
>  	preempt_count_sub(cnt - 1);
>  
> -	if (unlikely(!in_interrupt() && local_softirq_pending())) {
> +	if (unlikely(!in_interrupt() &&
> +		     local_softirq_pending() &&
> +		     !__this_cpu_read(ksoftirqd_scheduled))) {
>  		/*
> 
You might need another one of these in invoke_softirq()
-- 
All Rights Reversed.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 21:35         ` Rik van Riel
@ 2016-05-10 21:53           ` Eric Dumazet
  2016-05-10 22:02             ` Eric Dumazet
                               ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-10 21:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Eric Dumazet, Hannes Frederic Sowa, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Peter Zijlstra, Ingo Molnar, LKML
On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:
> You might need another one of these in invoke_softirq()
> 
Excellent.
I gave it a quick try (without your suggestion), and host seems to
survive a stress test.
Of course we do have to fix these problems :
[  147.781629] NOHZ: local_softirq_pending 48
[  147.785546] NOHZ: local_softirq_pending 48
[  147.788344] NOHZ: local_softirq_pending 48
[  147.788992] NOHZ: local_softirq_pending 48
[  147.790943] NOHZ: local_softirq_pending 48
[  147.791232] NOHZ: local_softirq_pending 24a
[  147.791258] NOHZ: local_softirq_pending 48
[  147.791366] NOHZ: local_softirq_pending 48
[  147.792118] NOHZ: local_softirq_pending 48
[  147.793428] NOHZ: local_softirq_pending 48
Thanks.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 21:53           ` Eric Dumazet
@ 2016-05-10 22:02             ` Eric Dumazet
  2016-05-10 22:44               ` Eric Dumazet
  2016-05-10 22:02             ` Rik van Riel
  2016-05-11 17:55             ` Eric Dumazet
  2 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2016-05-10 22:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Eric Dumazet, Hannes Frederic Sowa, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Peter Zijlstra, Ingo Molnar, LKML
On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:
> 
> > You might need another one of these in invoke_softirq()
> > 
> 
> Excellent.
> 
> I gave it a quick try (without your suggestion), and host seems to
> survive a stress test.
> 
> Of course we do have to fix these problems :
> 
> [  147.781629] NOHZ: local_softirq_pending 48
> [  147.785546] NOHZ: local_softirq_pending 48
> [  147.788344] NOHZ: local_softirq_pending 48
> [  147.788992] NOHZ: local_softirq_pending 48
> [  147.790943] NOHZ: local_softirq_pending 48
> [  147.791232] NOHZ: local_softirq_pending 24a
> [  147.791258] NOHZ: local_softirq_pending 48
> [  147.791366] NOHZ: local_softirq_pending 48
> [  147.792118] NOHZ: local_softirq_pending 48
> [  147.793428] NOHZ: local_softirq_pending 48
Well, with your suggestion, these warnings disappear ;)
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 21:53           ` Eric Dumazet
  2016-05-10 22:02             ` Eric Dumazet
@ 2016-05-10 22:02             ` Rik van Riel
  2016-05-11 17:55             ` Eric Dumazet
  2 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2016-05-10 22:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Hannes Frederic Sowa, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Peter Zijlstra, Ingo Molnar, LKML
[-- Attachment #1: Type: text/plain, Size: 1001 bytes --]
On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:
> 
> > 
> > You might need another one of these in invoke_softirq()
> > 
> Excellent.
> 
> I gave it a quick try (without your suggestion), and host seems to
> survive a stress test.
> 
> Of course we do have to fix these problems :
> 
> [  147.781629] NOHZ: local_softirq_pending 48
> [  147.785546] NOHZ: local_softirq_pending 48
> [  147.788344] NOHZ: local_softirq_pending 48
> [  147.788992] NOHZ: local_softirq_pending 48
> [  147.790943] NOHZ: local_softirq_pending 48
> [  147.791232] NOHZ: local_softirq_pending 24a
> [  147.791258] NOHZ: local_softirq_pending 48
> [  147.791366] NOHZ: local_softirq_pending 48
> [  147.792118] NOHZ: local_softirq_pending 48
> [  147.793428] NOHZ: local_softirq_pending 48
As long as ksoftirqd is running, that should not be
an actual problem, just a false positive.
-- 
All Rights Reversed.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 21:09     ` Eric Dumazet
  2016-05-10 21:31       ` Eric Dumazet
@ 2016-05-10 22:32       ` Hannes Frederic Sowa
  2016-05-10 22:51         ` Eric Dumazet
  1 sibling, 1 reply; 48+ messages in thread
From: Hannes Frederic Sowa @ 2016-05-10 22:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Paolo Abeni, netdev, David S. Miller, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On 10.05.2016 23:09, Eric Dumazet wrote:
> On Tue, May 10, 2016 at 1:46 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> 
>> I agree here, but I don't think this patch particularly is a lot of
>> bloat and something very interesting people can play with and extend upon.
>>
> 
> Sure, very rarely patch authors think their stuff is bloat.
> 
> I prefer to fix kernel softirq.c, or at least show me that you tried
> hard enough.
> 
> I am pretty sure that the following would work :
> 
> When ksoftirqd is scheduled, remember this in a per cpu variable
> (ksoftiqd_scheduled)
> 
> When enabling BH , do not call do_softirq() if this variable is set.
> 
> ksoftirqd would clear the variable at the right place (probably in
> run_ksoftirqd())
> 
> Sure, this might add a lot of latency regressions, but lets fix them.
Probably, yes.
We had a version which limited the number of restarts if softirqs were
invoked from local_bh_enable (so that at least timers etc. would run)
and would defer all other work to ksoftirqd. That also solved the
initial live lock problem. I do have concerns about the fairness of this
approach, but we now have to investigate this. ;)
Not only did we want to present this solely as a bugfix but also as as
performance enhancements in case of virtio (as you can see in the cover
letter). Given that a long time ago there was a tendency to remove
softirqs completely, we thought it might be very interesting, that a
threaded napi in general seems to be absolutely viable nowadays and
might offer new features.
Bye,
Hannes
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 22:02             ` Eric Dumazet
@ 2016-05-10 22:44               ` Eric Dumazet
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-10 22:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Eric Dumazet, Hannes Frederic Sowa, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Peter Zijlstra, Ingo Molnar, LKML
On Tue, 2016-05-10 at 15:02 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote:
> > On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:
> > 
> > > You might need another one of these in invoke_softirq()
> > > 
> > 
> > Excellent.
> > 
> > I gave it a quick try (without your suggestion), and host seems to
> > survive a stress test.
> > 
> > Of course we do have to fix these problems :
> > 
> > [  147.781629] NOHZ: local_softirq_pending 48
> > [  147.785546] NOHZ: local_softirq_pending 48
> > [  147.788344] NOHZ: local_softirq_pending 48
> > [  147.788992] NOHZ: local_softirq_pending 48
> > [  147.790943] NOHZ: local_softirq_pending 48
> > [  147.791232] NOHZ: local_softirq_pending 24a
> > [  147.791258] NOHZ: local_softirq_pending 48
> > [  147.791366] NOHZ: local_softirq_pending 48
> > [  147.792118] NOHZ: local_softirq_pending 48
> > [  147.793428] NOHZ: local_softirq_pending 48
> 
> 
> Well, with your suggestion, these warnings disappear ;)
This is really nice.
Under stress number of context switches is really small.
ksoftirqd and my netserver compete equally to get the cpu cycles (on
CPU0)
lpaa23:~# vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 260668416  37240 2414428    0    0    21     0  329  349  0  3 96  0
 1  0      0 260667904  37240 2414428    0    0     0    12 193126 1050  0  2 98  0
 1  0      0 260667904  37240 2414428    0    0     0     0 194354 1056  0  2 98  0
 1  0      0 260669104  37240 2414492    0    0     0     0 200897 1095  0  2 98  0
 1  0      0 260668592  37240 2414492    0    0     0     0 205731  964  0  2 98  0
 1  0      0 260678832  37240 2414492    0    0     0     0 201689  981  0  2 98  0
 1  0      0 260678832  37240 2414492    0    0     0     0 204899  742  0  2 98  0
 1  0      0 260678320  37240 2414492    0    0     0     0 199148  792  0  3 97  0
 1  0      0 260678832  37240 2414492    0    0     0     0 196398  766  0  2 98  0
 1  0      0 260678832  37240 2414492    0    0     0     0 201930  858  0  2 98  0
And we can see that ksoftirqd/0 runs for longer periods (~500 usec),
instead of stupid 4 usec before the patch. Less overhead.
lpaa23:~# cat /proc/3/sched
ksoftirqd/0 (3, #threads: 1)
-------------------------------------------------------------------
se.exec_start                                :       1552401.399526
se.vruntime                                  :        237599.421560
se.sum_exec_runtime                          :         75432.494199
se.nr_migrations                             :                    0
nr_switches                                  :               144333
nr_voluntary_switches                        :               143828
nr_involuntary_switches                      :                  505
se.load.weight                               :                 1024
se.avg.load_sum                              :                10445
se.avg.util_sum                              :                10445
se.avg.load_avg                              :                    0
se.avg.util_avg                              :                    0
se.avg.last_update_time                      :        1552401399526
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                   47
lpaa23:~# echo 75432.494199/144333|bc -l
.52262818758703830724
And yes indeed, user space can progress way faster under flood.
lpaa23:~# nstat >/dev/null;sleep 1;nstat | grep Udp
UdpInDatagrams                  186132             0.0
UdpInErrors                     735462             0.0
UdpOutDatagrams                 10                 0.0
UdpRcvbufErrors                 735461             0.0
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 22:32       ` Hannes Frederic Sowa
@ 2016-05-10 22:51         ` Eric Dumazet
  2016-05-11  6:55           ` Peter Zijlstra
  2016-05-11  9:48           ` Paolo Abeni
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-10 22:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, Paolo Abeni, netdev, David S. Miller, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> Not only did we want to present this solely as a bugfix but also as as
> performance enhancements in case of virtio (as you can see in the cover
> letter). Given that a long time ago there was a tendency to remove
> softirqs completely, we thought it might be very interesting, that a
> threaded napi in general seems to be absolutely viable nowadays and
> might offer new features.
Well, you did not fix the bug, you worked around by adding yet another
layer, with another sysctl that admins or programs have to manage.
If you have a special need for virtio, do not hide it behind a 'bug fix'
but add it as a features request.
This ksoftirqd issue is real and a fix looks very reasonable.
Please try this patch, as I had very good success with it.
Thanks.
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..22463217e3cf 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
 static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
 
 const char * const softirq_to_name[NR_SOFTIRQS] = {
 	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
@@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
 	/* Interrupts are disabled: no need to stop preemption */
 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
 
-	if (tsk && tsk->state != TASK_RUNNING)
+	if (tsk && tsk->state != TASK_RUNNING) {
+		__this_cpu_write(ksoftirqd_scheduled, true);
 		wake_up_process(tsk);
+	}
 }
 
 /*
@@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 	 */
 	preempt_count_sub(cnt - 1);
 
-	if (unlikely(!in_interrupt() && local_softirq_pending())) {
+	if (unlikely(!in_interrupt() &&
+		     local_softirq_pending() &&
+		     !__this_cpu_read(ksoftirqd_scheduled))) {
 		/*
 		 * Run softirq if any pending. And do it in its own stack
 		 * as we may be calling this deep in a task call stack already.
@@ -340,6 +345,9 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
+	if (__this_cpu_read(ksoftirqd_scheduled))
+		return;
+
 	if (!force_irqthreads) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*
@@ -660,6 +668,8 @@ static void run_ksoftirqd(unsigned int cpu)
 		 * in the task stack here.
 		 */
 		__do_softirq();
+		if (!local_softirq_pending())
+			__this_cpu_write(ksoftirqd_scheduled, false);
 		local_irq_enable();
 		cond_resched_rcu_qs();
 		return;
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 22:51         ` Eric Dumazet
@ 2016-05-11  6:55           ` Peter Zijlstra
  2016-05-11 13:13             ` Hannes Frederic Sowa
  2016-05-11 21:56             ` Eric Dumazet
  2016-05-11  9:48           ` Paolo Abeni
  1 sibling, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-05-11  6:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Eric Dumazet, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML
On Tue, May 10, 2016 at 03:51:37PM -0700, Eric Dumazet wrote:
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 17caf4b63342..22463217e3cf 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
>  static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
>  
>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
>  
>  const char * const softirq_to_name[NR_SOFTIRQS] = {
>  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
> @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
>  	/* Interrupts are disabled: no need to stop preemption */
>  	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
>  
> -	if (tsk && tsk->state != TASK_RUNNING)
> +	if (tsk && tsk->state != TASK_RUNNING) {
> +		__this_cpu_write(ksoftirqd_scheduled, true);
>  		wake_up_process(tsk);
Since we're already looking at tsk->state, and the wake_up_process()
ensures the thing becomes TASK_RUNNING, you could add:
static inline bool ksoftirqd_running(void)
{
	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
}
> +	}
>  }
>  
>  /*
> @@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
>  	 */
>  	preempt_count_sub(cnt - 1);
>  
> -	if (unlikely(!in_interrupt() && local_softirq_pending())) {
> +	if (unlikely(!in_interrupt() &&
> +		     local_softirq_pending() &&
> +		     !__this_cpu_read(ksoftirqd_scheduled))) {
And use it here,
>  		/*
>  		 * Run softirq if any pending. And do it in its own stack
>  		 * as we may be calling this deep in a task call stack already.
> @@ -340,6 +345,9 @@ void irq_enter(void)
>  
>  static inline void invoke_softirq(void)
>  {
> +	if (__this_cpu_read(ksoftirqd_scheduled))
and here.
> +		return;
> +
>  	if (!force_irqthreads) {
>  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
>  		/*
> @@ -660,6 +668,8 @@ static void run_ksoftirqd(unsigned int cpu)
>  		 * in the task stack here.
>  		 */
>  		__do_softirq();
> +		if (!local_softirq_pending())
> +			__this_cpu_write(ksoftirqd_scheduled, false);
And avoid twiddling the new variable which only seems to mirror
tsk->state.
>  		local_irq_enable();
>  		cond_resched_rcu_qs();
>  		return;
> 
> 
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 22:51         ` Eric Dumazet
  2016-05-11  6:55           ` Peter Zijlstra
@ 2016-05-11  9:48           ` Paolo Abeni
  2016-05-11 13:08             ` Eric Dumazet
  1 sibling, 1 reply; 48+ messages in thread
From: Paolo Abeni @ 2016-05-11  9:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Eric Dumazet, netdev, David S. Miller,
	Jiri Pirko, Daniel Borkmann, Alexei Starovoitov, Alexander Duyck,
	Tom Herbert, Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
Hi Eric,
On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> 
> > Not only did we want to present this solely as a bugfix but also as as
> > performance enhancements in case of virtio (as you can see in the cover
> > letter). Given that a long time ago there was a tendency to remove
> > softirqs completely, we thought it might be very interesting, that a
> > threaded napi in general seems to be absolutely viable nowadays and
> > might offer new features.
> 
> Well, you did not fix the bug, you worked around by adding yet another
> layer, with another sysctl that admins or programs have to manage.
> 
> If you have a special need for virtio, do not hide it behind a 'bug fix'
> but add it as a features request.
> 
> This ksoftirqd issue is real and a fix looks very reasonable.
> 
> Please try this patch, as I had very good success with it.
Thank you for your time and your effort.
I tested your patch on the bare metal "single core" scenario, disabling
the unneeded cores with:
CPUS=`nproc`
for I in `seq 1 $CPUS`; do echo 0  >  /sys/devices/system/node/node0/cpu$I/online; done
And I got a:
[   86.925249] Broke affinity for irq <num>
for each irq number generated by a network device.
In this scenario, your patch solves the ksoftirqd issue, performing
comparable to the napi threaded patches (with a negative delta in the
noise range) and introducing a minor regression with a single flow, in
the noise range (3%).
As said in a previous mail, we actually experimented something similar,
but it felt quite hackish.
AFAICS this patch adds three more tests in the fast path and affect all
other softirq use case. I'm not sure how to check for regression there.
The napi thread patches are actually a new feature, that also fixes the
ksoftirqd issue: hunting the ksoftirqd issue has been the initial
trigger for this work. I'm sorry for not being clear enough in the cover
letter.
The napi thread patches offer additional benefits, i.e. an additional
relevant gain in the described test scenario, and do not impact on other
subsystems/kernel entities. 
I still think they are worthy, and I bet you would disagree, but could
you please articulate more which parts concern you most and/or are more
bloated ?
Thank you,
Paolo
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11  9:48           ` Paolo Abeni
@ 2016-05-11 13:08             ` Eric Dumazet
  2016-05-11 13:39               ` Hannes Frederic Sowa
  2016-05-11 14:38               ` Paolo Abeni
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-11 13:08 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Hannes Frederic Sowa, Eric Dumazet, netdev, David S. Miller,
	Jiri Pirko, Daniel Borkmann, Alexei Starovoitov, Alexander Duyck,
	Tom Herbert, Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
> Hi Eric,
> On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
> > On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> > 
> > > Not only did we want to present this solely as a bugfix but also as as
> > > performance enhancements in case of virtio (as you can see in the cover
> > > letter). Given that a long time ago there was a tendency to remove
> > > softirqs completely, we thought it might be very interesting, that a
> > > threaded napi in general seems to be absolutely viable nowadays and
> > > might offer new features.
> > 
> > Well, you did not fix the bug, you worked around by adding yet another
> > layer, with another sysctl that admins or programs have to manage.
> > 
> > If you have a special need for virtio, do not hide it behind a 'bug fix'
> > but add it as a features request.
> > 
> > This ksoftirqd issue is real and a fix looks very reasonable.
> > 
> > Please try this patch, as I had very good success with it.
> 
> Thank you for your time and your effort.
> 
> I tested your patch on the bare metal "single core" scenario, disabling
> the unneeded cores with:
> CPUS=`nproc`
> for I in `seq 1 $CPUS`; do echo 0  >  /sys/devices/system/node/node0/cpu$I/online; done
> 
> And I got a:
> 
> [   86.925249] Broke affinity for irq <num>
> 
Was it fatal, or simply a warning that you are removing the cpu that was
the only allowed cpu in an affinity_mask ?
Looks another bug to fix then ? We disabled CPU hotplug here at Google
for our production, as it was notoriously buggy. No time to fix dozens
of issues added by a crowd of developers that do not even know a cpu can
be unplugged.
Maybe some caller of local_bh_disable()/local_bh_enable() expected that
current softirq would be processed. Obviously flaky even before the
patches.
> for each irq number generated by a network device.
> 
> In this scenario, your patch solves the ksoftirqd issue, performing
> comparable to the napi threaded patches (with a negative delta in the
> noise range) and introducing a minor regression with a single flow, in
> the noise range (3%).
> 
> As said in a previous mail, we actually experimented something similar,
> but it felt quite hackish.
Right, we are networking guys, and we feel that messing with such core
infra is not for us. So we feel comfortable adding a pure networking
patch.
> 
> AFAICS this patch adds three more tests in the fast path and affect all
> other softirq use case. I'm not sure how to check for regression there.
It is obvious to me that ksoftird mechanism is not working as intended.
Fixing it might uncover bugs from parts of the kernel relying on the
bug, indirectly or directly. Is it a good thing ?
I can not tell before trying.
Just by looking at /proc/{ksoftirqs_pid}/sched you can see the problem,
as we normally schedule ksoftird under stress but most of the time,
the softirq items were processed by another tasks as you found out.
> 
> The napi thread patches are actually a new feature, that also fixes the
> ksoftirqd issue: hunting the ksoftirqd issue has been the initial
> trigger for this work. I'm sorry for not being clear enough in the cover
> letter.
> 
> The napi thread patches offer additional benefits, i.e. an additional
> relevant gain in the described test scenario, and do not impact on other
> subsystems/kernel entities. 
> 
> I still think they are worthy, and I bet you would disagree, but could
> you please articulate more which parts concern you most and/or are more
> bloated ?
Just look at the added code. napi_threaded_poll() is very buggy, but
honestly I do not want to fix the bugs you added there. If you have only
one vcpu, how jiffies can ever change since you block BH ?
I was planning to remove cond_resched_softirq() that we no longer use
after my recent changes to TCP stack,
and you call it again (while it is obviously buggy since it does not
check if a BH is pending, only if a thread needs the cpu)
I prefer fixing the existing code, really. It took us years to
understand it and maybe fix it.
Just think of what will happen if you have 10 devices (10 new threads in
your model) and one cpu.
Instead of the nice existing netif_rx() doing 64 packets per device
rounds, you'll now rely on process scheduler behavior that has no such
granularity.
Adding more threads is the natural answer of userland programmers, but
in the kernel it is not the right answer. We already have mechanism,
just use them and fix them if they are broken.
Sorry, I really do not think your patches are the way to go.
But this thread is definitely interesting.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11  6:55           ` Peter Zijlstra
@ 2016-05-11 13:13             ` Hannes Frederic Sowa
  2016-05-11 14:40               ` Eric Dumazet
  2016-05-11 21:56             ` Eric Dumazet
  1 sibling, 1 reply; 48+ messages in thread
From: Hannes Frederic Sowa @ 2016-05-11 13:13 UTC (permalink / raw)
  To: Peter Zijlstra, Eric Dumazet
  Cc: Eric Dumazet, Paolo Abeni, netdev, David S. Miller, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Ingo Molnar, Rik van Riel, LKML
On 11.05.2016 08:55, Peter Zijlstra wrote:
> On Tue, May 10, 2016 at 03:51:37PM -0700, Eric Dumazet wrote:
>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>> index 17caf4b63342..22463217e3cf 100644
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
>>  static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
>>  
>>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
>> +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
>>  
>>  const char * const softirq_to_name[NR_SOFTIRQS] = {
>>  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
>> @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
>>  	/* Interrupts are disabled: no need to stop preemption */
>>  	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
>>  
>> -	if (tsk && tsk->state != TASK_RUNNING)
>> +	if (tsk && tsk->state != TASK_RUNNING) {
>> +		__this_cpu_write(ksoftirqd_scheduled, true);
>>  		wake_up_process(tsk);
> 
> Since we're already looking at tsk->state, and the wake_up_process()
> ensures the thing becomes TASK_RUNNING, you could add:
> 
> static inline bool ksoftirqd_running(void)
> {
> 	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> }
This looks racy to me as the ksoftirqd could be in the progress to stop
and we would miss another softirq invocation.
Thanks,
Hannes
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 13:08             ` Eric Dumazet
@ 2016-05-11 13:39               ` Hannes Frederic Sowa
  2016-05-11 13:47                 ` Hannes Frederic Sowa
  2016-05-11 14:38               ` Paolo Abeni
  1 sibling, 1 reply; 48+ messages in thread
From: Hannes Frederic Sowa @ 2016-05-11 13:39 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni
  Cc: Eric Dumazet, netdev, David S. Miller, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
Hi all,
On 11.05.2016 15:08, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
>> Hi Eric,
>> On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
>>> On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
>>>
>>>> Not only did we want to present this solely as a bugfix but also as as
>>>> performance enhancements in case of virtio (as you can see in the cover
>>>> letter). Given that a long time ago there was a tendency to remove
>>>> softirqs completely, we thought it might be very interesting, that a
>>>> threaded napi in general seems to be absolutely viable nowadays and
>>>> might offer new features.
>>>
>>> Well, you did not fix the bug, you worked around by adding yet another
>>> layer, with another sysctl that admins or programs have to manage.
>>>
>>> If you have a special need for virtio, do not hide it behind a 'bug fix'
>>> but add it as a features request.
>>>
>>> This ksoftirqd issue is real and a fix looks very reasonable.
>>>
>>> Please try this patch, as I had very good success with it.
>>
>> Thank you for your time and your effort.
>>
>> I tested your patch on the bare metal "single core" scenario, disabling
>> the unneeded cores with:
>> CPUS=`nproc`
>> for I in `seq 1 $CPUS`; do echo 0  >  /sys/devices/system/node/node0/cpu$I/online; done
>>
>> And I got a:
>>
>> [   86.925249] Broke affinity for irq <num>
>>
> 
> Was it fatal, or simply a warning that you are removing the cpu that was
> the only allowed cpu in an affinity_mask ?
> 
> Looks another bug to fix then ? We disabled CPU hotplug here at Google
> for our production, as it was notoriously buggy. No time to fix dozens
> of issues added by a crowd of developers that do not even know a cpu can
> be unplugged.
> 
> Maybe some caller of local_bh_disable()/local_bh_enable() expected that
> current softirq would be processed. Obviously flaky even before the
> patches.
Yes, I fear this could come up. If we want to target net or stable maybe
we should maybe special case this patch specifically for net-rx?
>> for each irq number generated by a network device.
>>
>> In this scenario, your patch solves the ksoftirqd issue, performing
>> comparable to the napi threaded patches (with a negative delta in the
>> noise range) and introducing a minor regression with a single flow, in
>> the noise range (3%).
>>
>> As said in a previous mail, we actually experimented something similar,
>> but it felt quite hackish.
> 
> Right, we are networking guys, and we feel that messing with such core
> infra is not for us. So we feel comfortable adding a pure networking
> patch.
We posted this patch as an RFC. My initial internal proposal only had a
check in ___napi_schedule and completely relied on threaded irqs and
didn't spawn a thread per napi instance in the networking stack. I think
this is the better approach long term, as it allows to configure
threaded irqs per device and doesn't specifically deal with networking
only. NAPI must be aware of when to schedule, obviously, so we need
another check in napi_schedule.
My plan was definitely to go with something more generic, but we didn't
yet know how to express that in a generic way, but relied on the forced
threaded irqs kernel parameter.
>> AFAICS this patch adds three more tests in the fast path and affect all
>> other softirq use case. I'm not sure how to check for regression there.
> 
> It is obvious to me that ksoftird mechanism is not working as intended.
Yes.
> Fixing it might uncover bugs from parts of the kernel relying on the
> bug, indirectly or directly. Is it a good thing ?
> 
> I can not tell before trying.
> 
> Just by looking at /proc/{ksoftirqs_pid}/sched you can see the problem,
> as we normally schedule ksoftird under stress but most of the time,
> the softirq items were processed by another tasks as you found out.
Exactly, the pending mask gets reset by the task handling the softirq
inline and ksoftirqd runs dry too early not processing any more softirq
notifications.
>> The napi thread patches are actually a new feature, that also fixes the
>> ksoftirqd issue: hunting the ksoftirqd issue has been the initial
>> trigger for this work. I'm sorry for not being clear enough in the cover
>> letter.
>>
>> The napi thread patches offer additional benefits, i.e. an additional
>> relevant gain in the described test scenario, and do not impact on other
>> subsystems/kernel entities. 
>>
>> I still think they are worthy, and I bet you would disagree, but could
>> you please articulate more which parts concern you most and/or are more
>> bloated ?
> 
> Just look at the added code. napi_threaded_poll() is very buggy, but
> honestly I do not want to fix the bugs you added there. If you have only
> one vcpu, how jiffies can ever change since you block BH ?
I think the local_bh_disable/enable needs to be more fine granular,
correct (inside the loop).
> I was planning to remove cond_resched_softirq() that we no longer use
> after my recent changes to TCP stack,
> and you call it again (while it is obviously buggy since it does not
> check if a BH is pending, only if a thread needs the cpu)
Good point, thanks!
> I prefer fixing the existing code, really. It took us years to
> understand it and maybe fix it.
I agree, we should find a simple way to let ksoftirqd and netrx behave
better and target threaded napi as a new feature.
> Just think of what will happen if you have 10 devices (10 new threads in
> your model) and one cpu.
> 
> Instead of the nice existing netif_rx() doing 64 packets per device
> rounds, you'll now rely on process scheduler behavior that has no such
> granularity.
We didn't inspect all kinds of workloads right now but I hoped to see
benefits in forwarding with multiple interfaces. This is also why we
want to have some "easy" way to configure that for admins. Also
integration with RPS is on the todo list.
> Adding more threads is the natural answer of userland programmers, but
> in the kernel it is not the right answer. We already have mechanism,
> just use them and fix them if they are broken.
> 
> Sorry, I really do not think your patches are the way to go.
> But this thread is definitely interesting.
I am fine with that. It is us to show clear benefits or use cases for
that. If we fail with that, no problem at all that the patches get
rejected, we don't want to add bloat to the kernel, for sure! At this
point I still think a possibility to run napi in kthreads will allow
specific workloads to see an improvement. Maybe the simple branch in
napi_schedule is just worth so people can play around with it. As it
shouldn't change behavior we can later on simply remove it.
Thanks,
Hannes
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 13:39               ` Hannes Frederic Sowa
@ 2016-05-11 13:47                 ` Hannes Frederic Sowa
  0 siblings, 0 replies; 48+ messages in thread
From: Hannes Frederic Sowa @ 2016-05-11 13:47 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni
  Cc: Eric Dumazet, netdev, David S. Miller, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On 11.05.2016 15:39, Hannes Frederic Sowa wrote:
> I am fine with that. It is us to show clear benefits or use cases for
> that. If we fail with that, no problem at all that the patches get
> rejected, we don't want to add bloat to the kernel, for sure! At this
> point I still think a possibility to run napi in kthreads will allow
> specific workloads to see an improvement. Maybe the simple branch in
> napi_schedule is just worth so people can play around with it. As it
> shouldn't change behavior we can later on simply remove it.
Actually, I consider it a bug that when we force the kernel to use
threaded irqs that we only schedule the softirq for napi later on and
don't do the processing within the thread.
Bye,
Hannes
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 13:08             ` Eric Dumazet
  2016-05-11 13:39               ` Hannes Frederic Sowa
@ 2016-05-11 14:38               ` Paolo Abeni
  2016-05-11 14:45                 ` Eric Dumazet
  1 sibling, 1 reply; 48+ messages in thread
From: Paolo Abeni @ 2016-05-11 14:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Eric Dumazet, netdev, David S. Miller,
	Jiri Pirko, Daniel Borkmann, Alexei Starovoitov, Alexander Duyck,
	Tom Herbert, Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On Wed, 2016-05-11 at 06:08 -0700, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
> > Hi Eric,
> > On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
> > > On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> > > 
> > > > Not only did we want to present this solely as a bugfix but also as as
> > > > performance enhancements in case of virtio (as you can see in the cover
> > > > letter). Given that a long time ago there was a tendency to remove
> > > > softirqs completely, we thought it might be very interesting, that a
> > > > threaded napi in general seems to be absolutely viable nowadays and
> > > > might offer new features.
> > > 
> > > Well, you did not fix the bug, you worked around by adding yet another
> > > layer, with another sysctl that admins or programs have to manage.
> > > 
> > > If you have a special need for virtio, do not hide it behind a 'bug fix'
> > > but add it as a features request.
> > > 
> > > This ksoftirqd issue is real and a fix looks very reasonable.
> > > 
> > > Please try this patch, as I had very good success with it.
> > 
> > Thank you for your time and your effort.
> > 
> > I tested your patch on the bare metal "single core" scenario, disabling
> > the unneeded cores with:
> > CPUS=`nproc`
> > for I in `seq 1 $CPUS`; do echo 0  >  /sys/devices/system/node/node0/cpu$I/online; done
> > 
> > And I got a:
> > 
> > [   86.925249] Broke affinity for irq <num>
> > 
> 
> Was it fatal, or simply a warning that you are removing the cpu that was
> the only allowed cpu in an affinity_mask ?
The above message is emitted with pr_notice() by the x86 version of
fixup_irqs(). It's not fatal, the host is alive and well after that. The
un-patched kernel does not emit it on cpus disabling.
I'll try to look into this later.
> Looks another bug to fix then ? We disabled CPU hotplug here at Google
> for our production, as it was notoriously buggy. No time to fix dozens
> of issues added by a crowd of developers that do not even know a cpu can
> be unplugged.
> 
> Maybe some caller of local_bh_disable()/local_bh_enable() expected that
> current softirq would be processed. Obviously flaky even before the
> patches.
> 
> > for each irq number generated by a network device.
> > 
> > In this scenario, your patch solves the ksoftirqd issue, performing
> > comparable to the napi threaded patches (with a negative delta in the
> > noise range) and introducing a minor regression with a single flow, in
> > the noise range (3%).
> > 
> > As said in a previous mail, we actually experimented something similar,
> > but it felt quite hackish.
> 
> Right, we are networking guys, and we feel that messing with such core
> infra is not for us. So we feel comfortable adding a pure networking
> patch.
> 
> > 
> > AFAICS this patch adds three more tests in the fast path and affect all
> > other softirq use case. I'm not sure how to check for regression there.
> 
> It is obvious to me that ksoftird mechanism is not working as intended.
> 
> Fixing it might uncover bugs from parts of the kernel relying on the
> bug, indirectly or directly. Is it a good thing ?
> 
> I can not tell before trying.
> 
> Just by looking at /proc/{ksoftirqs_pid}/sched you can see the problem,
> as we normally schedule ksoftird under stress but most of the time,
> the softirq items were processed by another tasks as you found out.
> 
> 
> > 
> > The napi thread patches are actually a new feature, that also fixes the
> > ksoftirqd issue: hunting the ksoftirqd issue has been the initial
> > trigger for this work. I'm sorry for not being clear enough in the cover
> > letter.
> > 
> > The napi thread patches offer additional benefits, i.e. an additional
> > relevant gain in the described test scenario, and do not impact on other
> > subsystems/kernel entities. 
> > 
> > I still think they are worthy, and I bet you would disagree, but could
> > you please articulate more which parts concern you most and/or are more
> > bloated ?
> 
> Just look at the added code. napi_threaded_poll() is very buggy, but
> honestly I do not want to fix the bugs you added there. If you have only
> one vcpu, how jiffies can ever change since you block BH ?
Uh, we have likely the same issue in the net_rx_action() function, which
also execute with bh disabled and check for jiffies changes even on
single core hosts ?!?
Aren't jiffies updated by the timer interrupt ? and thous even with
bh_disabled ?!?
> I was planning to remove cond_resched_softirq() that we no longer use
> after my recent changes to TCP stack,
> and you call it again (while it is obviously buggy since it does not
> check if a BH is pending, only if a thread needs the cpu)
I missed that, thank you for pointing out.
> I prefer fixing the existing code, really. It took us years to
> understand it and maybe fix it.
> 
> Just think of what will happen if you have 10 devices (10 new threads in
> your model) and one cpu.
> 
> Instead of the nice existing netif_rx() doing 64 packets per device
> rounds, you'll now rely on process scheduler behavior that has no such
> granularity.
> 
> Adding more threads is the natural answer of userland programmers, but
> in the kernel it is not the right answer. We already have mechanism,
> just use them and fix them if they are broken.
> 
> Sorry, I really do not think your patches are the way to go.
> But this thread is definitely interesting.
Oh, this is a far better comment that I would have expected ;-)
Cheers,
Paolo
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 13:13             ` Hannes Frederic Sowa
@ 2016-05-11 14:40               ` Eric Dumazet
  2016-05-11 15:01                 ` Rik van Riel
  2016-05-11 15:50                 ` Eric Dumazet
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-11 14:40 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Peter Zijlstra, Eric Dumazet, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML
On Wed, May 11, 2016 at 6:13 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> This looks racy to me as the ksoftirqd could be in the progress to stop
> and we would miss another softirq invocation.
Looking at smpboot_thread_fn(), it looks fine :
                if (!ht->thread_should_run(td->cpu)) {
                        preempt_enable_no_resched();
                        schedule();
                } else {
                        __set_current_state(TASK_RUNNING);
                        preempt_enable();
                        ht->thread_fn(td->cpu);
                }
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 14:38               ` Paolo Abeni
@ 2016-05-11 14:45                 ` Eric Dumazet
  2016-05-11 22:47                   ` Hannes Frederic Sowa
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2016-05-11 14:45 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Eric Dumazet, Hannes Frederic Sowa, netdev, David S. Miller,
	Jiri Pirko, Daniel Borkmann, Alexei Starovoitov, Alexander Duyck,
	Tom Herbert, Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On Wed, May 11, 2016 at 7:38 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> Uh, we have likely the same issue in the net_rx_action() function, which
> also execute with bh disabled and check for jiffies changes even on
> single core hosts ?!?
That is why we have a loop break after netdev_budget=300 packets.
And a sysctl to eventually tune this.
Same issue for softirq handler, look at commit
34376a50fb1fa095b9d0636fa41ed2e73125f214
Your questions about this central piece of networking code are worrying.
>
> Aren't jiffies updated by the timer interrupt ? and thous even with
> bh_disabled ?!?
Exactly my point : jiffie wont be updated in your code, since you block BH.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 14:40               ` Eric Dumazet
@ 2016-05-11 15:01                 ` Rik van Riel
  2016-05-11 15:50                 ` Eric Dumazet
  1 sibling, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2016-05-11 15:01 UTC (permalink / raw)
  To: Eric Dumazet, Hannes Frederic Sowa
  Cc: Peter Zijlstra, Eric Dumazet, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, LKML
[-- Attachment #1: Type: text/plain, Size: 630 bytes --]
On Wed, 2016-05-11 at 07:40 -0700, Eric Dumazet wrote:
> On Wed, May 11, 2016 at 6:13 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> 
> > This looks racy to me as the ksoftirqd could be in the progress to
> > stop
> > and we would miss another softirq invocation.
> 
> Looking at smpboot_thread_fn(), it looks fine :
> 
Additionally, we are talking about waking up
ksoftirqd on the same CPU.
That means the wakeup code could interrupt
ksoftirqd almost going to sleep, but the
two code paths could not run simultaneously.
That does narrow the scope considerably.
-- 
All rights reversed
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 14:40               ` Eric Dumazet
  2016-05-11 15:01                 ` Rik van Riel
@ 2016-05-11 15:50                 ` Eric Dumazet
  1 sibling, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-11 15:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Peter Zijlstra, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML
On Wed, 2016-05-11 at 07:40 -0700, Eric Dumazet wrote:
> On Wed, May 11, 2016 at 6:13 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> 
> > This looks racy to me as the ksoftirqd could be in the progress to stop
> > and we would miss another softirq invocation.
> 
> Looking at smpboot_thread_fn(), it looks fine :
> 
>                 if (!ht->thread_should_run(td->cpu)) {
>                         preempt_enable_no_resched();
>                         schedule();
>                 } else {
>                         __set_current_state(TASK_RUNNING);
>                         preempt_enable();
>                         ht->thread_fn(td->cpu);
>                 }
BTW, I wonder why we pass td->cpu as argument to ht->thread_fn(td->cpu)
This always should be the current processor id.
Or do we have an issue because we ignore it in :
static int ksoftirqd_should_run(unsigned int cpu)
{
        return local_softirq_pending();
}
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-10 21:53           ` Eric Dumazet
  2016-05-10 22:02             ` Eric Dumazet
  2016-05-10 22:02             ` Rik van Riel
@ 2016-05-11 17:55             ` Eric Dumazet
  2 siblings, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-11 17:55 UTC (permalink / raw)
  To: Rik van Riel, Paul E. McKenney
  Cc: Eric Dumazet, Hannes Frederic Sowa, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Peter Zijlstra, Ingo Molnar, LKML
On Tue, 2016-05-10 at 14:53 -0700, Eric Dumazet wrote:
> On Tue, 2016-05-10 at 17:35 -0400, Rik van Riel wrote:
> 
> > You might need another one of these in invoke_softirq()
> > 
> 
> Excellent.
> 
> I gave it a quick try (without your suggestion), and host seems to
> survive a stress test.
Well, we instantly trigger rcu issues.
How to reproduce :
netserver &
for i in `seq 1 100`
do
  netperf -H 127.0.0.1 -t TCP_RR -l 1000 &
done
# local hack to enable the new behavior
# without having to add a new sysctl, but hacking an existing one
echo 1001 >/proc/sys/net/core/netdev_max_backlog
<bang :>
[  236.977511] INFO: rcu_sched self-detected stall on CPU
[  236.977512] INFO: rcu_sched self-detected stall on CPU
[  236.977515] INFO: rcu_sched self-detected stall on CPU
[  236.977518] INFO: rcu_sched self-detected stall on CPU
[  236.977519] INFO: rcu_sched self-detected stall on CPU
[  236.977521] INFO: rcu_sched self-detected stall on CPU
[  236.977522] INFO: rcu_sched self-detected stall on CPU
[  236.977523] INFO: rcu_sched self-detected stall on CPU
[  236.977525] INFO: rcu_sched self-detected stall on CPU
[  236.977526] INFO: rcu_sched self-detected stall on CPU
[  236.977527] INFO: rcu_sched self-detected stall on CPU
[  236.977529] INFO: rcu_sched self-detected stall on CPU
[  236.977530] INFO: rcu_sched self-detected stall on CPU
[  236.977532] INFO: rcu_sched self-detected stall on CPU
[  236.977532] 	47-...: (1 GPs behind) idle=8d1/1/0 softirq=2500/2506 fqs=1 
[  236.977535] INFO: rcu_sched self-detected stall on CPU
[  236.977536] INFO: rcu_sched self-detected stall on CPU
[  236.977540] 	36-...: (1 GPs behind) idle=d05/1/0 softirq=2637/2644 fqs=1 
[  236.977546] 	
[  236.977546] 	38-...: (1 GPs behind) idle=a5b/1/0 softirq=2612/2618 fqs=1 
[  236.977549] 	0-...: (1 GPs behind) idle=c39/1/0 softirq=15315/15321 fqs=1 
[  236.977551] 	24-...: (1 GPs behind) idle=ea3/1/0 softirq=2455/2461 fqs=1 
[  236.977554] 	18-...: (20995 ticks this GP) idle=ef5/1/0 softirq=8530/8530 fqs=1 
[  236.977556] 	39-...: (1 GPs behind) idle=f9d/1/0 softirq=2144/2150 fqs=1 
[  236.977558] 	
[  236.977558] 	22-...: (1 GPs behind) idle=5a7/1/0 softirq=10238/10244 fqs=1 
[  236.977561] 	7-...: (1 GPs behind) idle=323/1/0 softirq=5279/5285 fqs=1 
[  236.977563] 	31-...: (1 GPs behind) idle=47d/1/0 softirq=2526/2532 fqs=1 
[  236.977565] 	33-...: (1 GPs behind) idle=175/1/0 softirq=2060/2066 fqs=1 
[  236.977568] 	10-...: (1 GPs behind) idle=c3d/1/0 softirq=4864/4870 fqs=1 
[  236.977570] 	34-...: (20995 ticks this GP) idle=dd5/1/0 softirq=2243/2243 fqs=1 
[  236.977574] 	
[  236.977574] 	37-...: (1 GPs behind) idle=aef/1/0 softirq=2660/2666 fqs=1 
[  236.977576] 	13-...: (1 GPs behind) idle=a2b/1/0 softirq=9928/9934 fqs=1 
[  236.977578] 	
[  236.977578] 	
[  236.977579] 	
[  236.977580] 	
[  236.977582] 	
[  236.977583] 	
[  236.977583] 	
[  236.977584] 	
[  236.977584] 	
[  236.977586] 	
[  236.977587] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977588] 	
[  236.977589] 	
[  236.977595] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977603] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977607] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977609] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977610] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977612] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977614] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977616] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977618] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977619] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977620] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977622] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977626] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.977627] rcu_sched kthread starved for 20997 jiffies! g33049 c33048 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  236.978512] INFO: rcu_sched self-detected stall on CPU
[  236.978512] INFO: rcu_sched self-detected stall on CPU
[  236.978514] INFO: rcu_sched self-detected stall on CPU
[  236.978516] INFO: rcu_sched self-detected stall on CPU
[  236.978517] INFO: rcu_sched self-detected stall on CPU
[  236.978518] INFO: rcu_sched self-detected stall on CPU
[  236.978519] INFO: rcu_sched self-detected stall on CPU
[  236.978520] INFO: rcu_sched self-detected stall on CPU
[  236.978521] INFO: rcu_sched self-detected stall on CPU
[  236.978522] INFO: rcu_sched self-detected stall on CPU
[  236.978523] INFO: rcu_sched self-detected stall on CPU
[  236.978524] INFO: rcu_sched self-detected stall on CPU
[  236.978532] 	45-...: (1 GPs behind) idle=8ed/1/0 softirq=3047/3053 fqs=1 
[  236.978534] 	19-...: (20996 ticks this GP) idle=b5d/1/0 softirq=8157/8157 fqs=1 
[  236.978538] 	17-...: (1 GPs behind) idle=5ad/1/0 softirq=7839/7845 fqs=1 
[  236.978539] 	41-...: (1 GPs behind) idle=f4f/1/0 softirq=2345/2351 fqs=1 
[  236.978542] 	6-...: (1 GPs behind) idle=a39/1/0 softirq=5492/5498 fqs=1 
[  236.978544] 	30-...: (1 GPs behind) idle=c51/1/0 softirq=2499/2505 fqs=1 
[  236.978546] 	5-...: (1 GPs behind) idle=917/1/0 softirq=5196/5202 fqs=1 
[  236.978548] 	26-...: (20996 ticks this GP) idle=c61/1/0 softirq=2863/2863 fqs=1 
[  236.978550] 	32-...: (1 GPs behind) idle=8db/1/0 softirq=2588/2594 fqs=1 
[  236.978552] 	35-...: (1 GPs behind) idle=351/1/0 softirq=1869/1875 fqs=1 
[  236.978554] 	8-...: (1 GPs behind) idle=221/1/0 softirq=5192/5198 fqs=1 
[  236.978556] 	11-...: (1 GPs behind) idle=485/1/0 softirq=4480/4486 fqs=1 
[  236.978557] 	
[  236.978558] 	
[  236.978559] 	
[  236.978560] 	
[  236.978561] 	
Tentative proto / patch (not including Peter suggestions yet)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..be94e0241a70 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -56,6 +56,14 @@ EXPORT_SYMBOL(irq_stat);
 static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
+
+static inline bool ksoftirqd_running(void)
+{
+	extern int netdev_max_backlog; /* temp hack */
+
+	return (netdev_max_backlog & 1) && __this_cpu_read(ksoftirqd_scheduled);
+}
 
 const char * const softirq_to_name[NR_SOFTIRQS] = {
 	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
@@ -73,8 +81,10 @@ static void wakeup_softirqd(void)
 	/* Interrupts are disabled: no need to stop preemption */
 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
 
-	if (tsk && tsk->state != TASK_RUNNING)
+	if (tsk && tsk->state != TASK_RUNNING) {
+		__this_cpu_write(ksoftirqd_scheduled, true);
 		wake_up_process(tsk);
+	}
 }
 
 /*
@@ -313,7 +323,7 @@ asmlinkage __visible void do_softirq(void)
 
 	pending = local_softirq_pending();
 
-	if (pending)
+	if (pending && !ksoftirqd_running())
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);
@@ -340,6 +350,9 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
+	if (ksoftirqd_running())
+		return;
+
 	if (!force_irqthreads) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*
@@ -660,6 +673,8 @@ static void run_ksoftirqd(unsigned int cpu)
 		 * in the task stack here.
 		 */
 		__do_softirq();
+		if (!local_softirq_pending())
+			__this_cpu_write(ksoftirqd_scheduled, false);
 		local_irq_enable();
 		cond_resched_rcu_qs();
 		return;
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11  6:55           ` Peter Zijlstra
  2016-05-11 13:13             ` Hannes Frederic Sowa
@ 2016-05-11 21:56             ` Eric Dumazet
  2016-05-12 20:07               ` Paolo Abeni
  2016-05-13 16:50               ` Paolo Abeni
  1 sibling, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-11 21:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hannes Frederic Sowa, Eric Dumazet, Paolo Abeni, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Wed, 2016-05-11 at 08:55 +0200, Peter Zijlstra wrote:
> On Tue, May 10, 2016 at 03:51:37PM -0700, Eric Dumazet wrote:
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index 17caf4b63342..22463217e3cf 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
> >  static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
> >  
> >  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> > +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
> >  
> >  const char * const softirq_to_name[NR_SOFTIRQS] = {
> >  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
> > @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
> >  	/* Interrupts are disabled: no need to stop preemption */
> >  	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
> >  
> > -	if (tsk && tsk->state != TASK_RUNNING)
> > +	if (tsk && tsk->state != TASK_RUNNING) {
> > +		__this_cpu_write(ksoftirqd_scheduled, true);
> >  		wake_up_process(tsk);
> 
> Since we're already looking at tsk->state, and the wake_up_process()
> ensures the thing becomes TASK_RUNNING, you could add:
> 
> static inline bool ksoftirqd_running(void)
> {
> 	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> }
Indeed, and the patch looks quite simple now ;)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342d7839528f367b283a386413b0362..23c364485d03618773c385d943c0ef39f5931d09 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -57,6 +57,11 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
 
+static inline bool ksoftirqd_running(void)
+{
+	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
+}
+
 const char * const softirq_to_name[NR_SOFTIRQS] = {
 	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
 	"TASKLET", "SCHED", "HRTIMER", "RCU"
@@ -313,7 +318,7 @@ asmlinkage __visible void do_softirq(void)
 
 	pending = local_softirq_pending();
 
-	if (pending)
+	if (pending && !ksoftirqd_running())
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);
@@ -340,6 +345,9 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
+	if (ksoftirqd_running())
+		return;
+
 	if (!force_irqthreads) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 14:45                 ` Eric Dumazet
@ 2016-05-11 22:47                   ` Hannes Frederic Sowa
  0 siblings, 0 replies; 48+ messages in thread
From: Hannes Frederic Sowa @ 2016-05-11 22:47 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni
  Cc: Eric Dumazet, netdev, David S. Miller, Jiri Pirko,
	Daniel Borkmann, Alexei Starovoitov, Alexander Duyck, Tom Herbert,
	Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On 11.05.2016 16:45, Eric Dumazet wrote:
> On Wed, May 11, 2016 at 7:38 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> 
>> Uh, we have likely the same issue in the net_rx_action() function, which
>> also execute with bh disabled and check for jiffies changes even on
>> single core hosts ?!?
> 
> That is why we have a loop break after netdev_budget=300 packets.
> And a sysctl to eventually tune this.
> 
> Same issue for softirq handler, look at commit
> 34376a50fb1fa095b9d0636fa41ed2e73125f214
> 
> Your questions about this central piece of networking code are worrying.
> 
>>
>> Aren't jiffies updated by the timer interrupt ? and thous even with
>> bh_disabled ?!?
> 
> Exactly my point : jiffie wont be updated in your code, since you block BH.
To be fair, jiffies get updated in hardirq and not softirq context. The
cond_resched_softirq not looking for pending softirqs is indeed a problem.
Thanks,
Hannes
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 21:56             ` Eric Dumazet
@ 2016-05-12 20:07               ` Paolo Abeni
  2016-05-12 20:49                 ` Eric Dumazet
  2016-05-13 16:50               ` Paolo Abeni
  1 sibling, 1 reply; 48+ messages in thread
From: Paolo Abeni @ 2016-05-12 20:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Hannes Frederic Sowa, Eric Dumazet, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Wed, 2016-05-11 at 14:56 -0700, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 08:55 +0200, Peter Zijlstra wrote:
> > On Tue, May 10, 2016 at 03:51:37PM -0700, Eric Dumazet wrote:
> > > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > > index 17caf4b63342..22463217e3cf 100644
> > > --- a/kernel/softirq.c
> > > +++ b/kernel/softirq.c
> > > @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
> > >  static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
> > >  
> > >  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> > > +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
> > >  
> > >  const char * const softirq_to_name[NR_SOFTIRQS] = {
> > >  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
> > > @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
> > >  	/* Interrupts are disabled: no need to stop preemption */
> > >  	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
> > >  
> > > -	if (tsk && tsk->state != TASK_RUNNING)
> > > +	if (tsk && tsk->state != TASK_RUNNING) {
> > > +		__this_cpu_write(ksoftirqd_scheduled, true);
> > >  		wake_up_process(tsk);
> > 
> > Since we're already looking at tsk->state, and the wake_up_process()
> > ensures the thing becomes TASK_RUNNING, you could add:
> > 
> > static inline bool ksoftirqd_running(void)
> > {
> > 	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
here something like:
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
        return tsk && (tsk->state == TASK_RUNNING);
is needed since __this_cpu_read(ksoftirqd) can be NULL on boot.
> > }
> 
> Indeed, and the patch looks quite simple now ;)
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 17caf4b63342d7839528f367b283a386413b0362..23c364485d03618773c385d943c0ef39f5931d09 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -57,6 +57,11 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp
>  
>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
>  
> +static inline bool ksoftirqd_running(void)
> +{
> +	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> +}
> +
>  const char * const softirq_to_name[NR_SOFTIRQS] = {
>  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
>  	"TASKLET", "SCHED", "HRTIMER", "RCU"
> @@ -313,7 +318,7 @@ asmlinkage __visible void do_softirq(void)
>  
>  	pending = local_softirq_pending();
>  
> -	if (pending)
> +	if (pending && !ksoftirqd_running())
>  		do_softirq_own_stack();
>  
>  	local_irq_restore(flags);
> @@ -340,6 +345,9 @@ void irq_enter(void)
>  
>  static inline void invoke_softirq(void)
>  {
> +	if (ksoftirqd_running())
> +		return;
> +
>  	if (!force_irqthreads) {
>  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
>  		/*
> 
> 
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-12 20:07               ` Paolo Abeni
@ 2016-05-12 20:49                 ` Eric Dumazet
  2016-05-12 20:58                   ` Paolo Abeni
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2016-05-12 20:49 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Peter Zijlstra, Hannes Frederic Sowa, Eric Dumazet, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Thu, 2016-05-12 at 22:07 +0200, Paolo Abeni wrote:
> > > static inline bool ksoftirqd_running(void)
> > > {
> > > 	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> 
> here something like:
> 
> 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
>         return tsk && (tsk->state == TASK_RUNNING);
> 
> is needed since __this_cpu_read(ksoftirqd) can be NULL on boot.
Indeed I've seen this but only when backporting to an older linux kernel
this morning.
Have you got this with current linux kernel ?
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-12 20:49                 ` Eric Dumazet
@ 2016-05-12 20:58                   ` Paolo Abeni
  2016-05-12 21:05                     ` Eric Dumazet
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Abeni @ 2016-05-12 20:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Hannes Frederic Sowa, Eric Dumazet, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Thu, 2016-05-12 at 13:49 -0700, Eric Dumazet wrote:
> On Thu, 2016-05-12 at 22:07 +0200, Paolo Abeni wrote:
> 
> > > > static inline bool ksoftirqd_running(void)
> > > > {
> > > > 	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> > 
> > here something like:
> > 
> > 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
> >         return tsk && (tsk->state == TASK_RUNNING);
> > 
> > is needed since __this_cpu_read(ksoftirqd) can be NULL on boot.
> 
> Indeed I've seen this but only when backporting to an older linux kernel
> this morning.
> 
> Have you got this with current linux kernel ?
Yes, on net-next updated to 
commit c66b2581123cd1527b6a084f39e9271cb02673b7
Author: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Date:   Sat May 7 14:09:01 2016 -0700
    sh_eth: reuse sh_eth_chip_reset()
Cheers,
Paolo
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-12 20:58                   ` Paolo Abeni
@ 2016-05-12 21:05                     ` Eric Dumazet
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-12 21:05 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Peter Zijlstra, Hannes Frederic Sowa, Eric Dumazet, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Thu, 2016-05-12 at 22:58 +0200, Paolo Abeni wrote:
> On Thu, 2016-05-12 at 13:49 -0700, Eric Dumazet wrote:
> > On Thu, 2016-05-12 at 22:07 +0200, Paolo Abeni wrote:
> > 
> > > > > static inline bool ksoftirqd_running(void)
> > > > > {
> > > > > 	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> > > 
> > > here something like:
> > > 
> > > 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
> > >         return tsk && (tsk->state == TASK_RUNNING);
> > > 
> > > is needed since __this_cpu_read(ksoftirqd) can be NULL on boot.
> > 
> > Indeed I've seen this but only when backporting to an older linux kernel
> > this morning.
> > 
> > Have you got this with current linux kernel ?
> 
> Yes, on net-next updated to 
> 
> commit c66b2581123cd1527b6a084f39e9271cb02673b7
> Author: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
> Date:   Sat May 7 14:09:01 2016 -0700
> 
>     sh_eth: reuse sh_eth_chip_reset()
> 
Yeah, I was unsure if the same test in wakeup_softirqd() was still
relevant today.
static void wakeup_softirqd(void)
{
        /* Interrupts are disabled: no need to stop preemption */
        struct task_struct *tsk = __this_cpu_read(ksoftirqd);
        if (tsk && tsk->state != TASK_RUNNING)
                wake_up_process(tsk);
}
I guess we could avoid the NULL test if all these per_cpu var where
pointing to a dummy task_struct at boot time, before they are properly
allocated.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-11 21:56             ` Eric Dumazet
  2016-05-12 20:07               ` Paolo Abeni
@ 2016-05-13 16:50               ` Paolo Abeni
  2016-05-13 17:03                 ` Eric Dumazet
  1 sibling, 1 reply; 48+ messages in thread
From: Paolo Abeni @ 2016-05-13 16:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Hannes Frederic Sowa, Eric Dumazet, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Wed, 2016-05-11 at 14:56 -0700, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 08:55 +0200, Peter Zijlstra wrote:
> > On Tue, May 10, 2016 at 03:51:37PM -0700, Eric Dumazet wrote:
> > > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > > index 17caf4b63342..22463217e3cf 100644
> > > --- a/kernel/softirq.c
> > > +++ b/kernel/softirq.c
> > > @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
> > >  static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
> > >  
> > >  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> > > +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
> > >  
> > >  const char * const softirq_to_name[NR_SOFTIRQS] = {
> > >  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
> > > @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
> > >  	/* Interrupts are disabled: no need to stop preemption */
> > >  	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
> > >  
> > > -	if (tsk && tsk->state != TASK_RUNNING)
> > > +	if (tsk && tsk->state != TASK_RUNNING) {
> > > +		__this_cpu_write(ksoftirqd_scheduled, true);
> > >  		wake_up_process(tsk);
> > 
> > Since we're already looking at tsk->state, and the wake_up_process()
> > ensures the thing becomes TASK_RUNNING, you could add:
> > 
> > static inline bool ksoftirqd_running(void)
> > {
> > 	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> > }
> 
> Indeed, and the patch looks quite simple now ;)
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 17caf4b63342d7839528f367b283a386413b0362..23c364485d03618773c385d943c0ef39f5931d09 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -57,6 +57,11 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp
>  
>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
>  
> +static inline bool ksoftirqd_running(void)
> +{
> +	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> +}
> +
>  const char * const softirq_to_name[NR_SOFTIRQS] = {
>  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
>  	"TASKLET", "SCHED", "HRTIMER", "RCU"
> @@ -313,7 +318,7 @@ asmlinkage __visible void do_softirq(void)
>  
>  	pending = local_softirq_pending();
>  
> -	if (pending)
> +	if (pending && !ksoftirqd_running())
>  		do_softirq_own_stack();
>  
>  	local_irq_restore(flags);
> @@ -340,6 +345,9 @@ void irq_enter(void)
>  
>  static inline void invoke_softirq(void)
>  {
> +	if (ksoftirqd_running())
> +		return;
> +
>  	if (!force_irqthreads) {
>  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
>  		/*
In this version of the path, the chunk affecting __local_bh_enable_ip()
has been removed.
I think it is beneficial, because it allows avoiding a
local_irq_save()/local_irq_restore() pairs per local_bh_enable under heavy load.
Cheers,
Paolo
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-13 16:50               ` Paolo Abeni
@ 2016-05-13 17:03                 ` Eric Dumazet
  2016-05-13 17:19                   ` Paolo Abeni
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2016-05-13 17:03 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Eric Dumazet, Peter Zijlstra, Hannes Frederic Sowa, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Fri, May 13, 2016 at 9:50 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>> Indeed, and the patch looks quite simple now ;)
>>
>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>> index 17caf4b63342d7839528f367b283a386413b0362..23c364485d03618773c385d943c0ef39f5931d09 100644
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -57,6 +57,11 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp
>>
>>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
>>
>> +static inline bool ksoftirqd_running(void)
>> +{
>> +     return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
>> +}
>> +
>>  const char * const softirq_to_name[NR_SOFTIRQS] = {
>>       "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
>>       "TASKLET", "SCHED", "HRTIMER", "RCU"
>> @@ -313,7 +318,7 @@ asmlinkage __visible void do_softirq(void)
>>
>>       pending = local_softirq_pending();
>>
>> -     if (pending)
>> +     if (pending && !ksoftirqd_running())
>>               do_softirq_own_stack();
>>
>>       local_irq_restore(flags);
>> @@ -340,6 +345,9 @@ void irq_enter(void)
>>
>>  static inline void invoke_softirq(void)
>>  {
>> +     if (ksoftirqd_running())
>> +             return;
>> +
>>       if (!force_irqthreads) {
>>  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
>>               /*
>
> In this version of the path, the chunk affecting __local_bh_enable_ip()
> has been removed.
>
> I think it is beneficial, because it allows avoiding a
> local_irq_save()/local_irq_restore() pairs per local_bh_enable under heavy load.
>
Interesting, do you have any numbers ?
I believe I did this so that we factorize the logic in do_softirq()
and keep the code local to kernel/softirq.c
Otherwise, netif_rx_ni() could also process softirq while ksoftirqd
was scheduled,
so I would have to  'export' the ksoftirqd_running(void) helper in an
include file.
I noticed that simply doing a "ping -n gateway" while the UDP flood
was occurring, my udp receiver had quite a different efficiency.
Thanks.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-13 17:03                 ` Eric Dumazet
@ 2016-05-13 17:19                   ` Paolo Abeni
  2016-05-13 17:36                     ` Eric Dumazet
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Abeni @ 2016-05-13 17:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Peter Zijlstra, Hannes Frederic Sowa, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Fri, 2016-05-13 at 10:03 -0700, Eric Dumazet wrote:
> On Fri, May 13, 2016 at 9:50 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> 
> >> Indeed, and the patch looks quite simple now ;)
> >>
> >> diff --git a/kernel/softirq.c b/kernel/softirq.c
> >> index 17caf4b63342d7839528f367b283a386413b0362..23c364485d03618773c385d943c0ef39f5931d09 100644
> >> --- a/kernel/softirq.c
> >> +++ b/kernel/softirq.c
> >> @@ -57,6 +57,11 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp
> >>
> >>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> >>
> >> +static inline bool ksoftirqd_running(void)
> >> +{
> >> +     return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> >> +}
> >> +
> >>  const char * const softirq_to_name[NR_SOFTIRQS] = {
> >>       "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
> >>       "TASKLET", "SCHED", "HRTIMER", "RCU"
> >> @@ -313,7 +318,7 @@ asmlinkage __visible void do_softirq(void)
> >>
> >>       pending = local_softirq_pending();
> >>
> >> -     if (pending)
> >> +     if (pending && !ksoftirqd_running())
> >>               do_softirq_own_stack();
> >>
> >>       local_irq_restore(flags);
> >> @@ -340,6 +345,9 @@ void irq_enter(void)
> >>
> >>  static inline void invoke_softirq(void)
> >>  {
> >> +     if (ksoftirqd_running())
> >> +             return;
> >> +
> >>       if (!force_irqthreads) {
> >>  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
> >>               /*
> >
> > In this version of the path, the chunk affecting __local_bh_enable_ip()
> > has been removed.
> >
> > I think it is beneficial, because it allows avoiding a
> > local_irq_save()/local_irq_restore() pairs per local_bh_enable under heavy load.
> >
> 
> Interesting, do you have any numbers ?
The difference is small, in the noise range:
[with this patch applied]
super_netperf 100 -H 192.168.122.1 -t UDP_STREAM -l 60 -- -m 1 
9.00
[adding the test into __local_bh_enable_ip(), too]
super_netperf 100 -H 192.168.122.1 -t UDP_STREAM -l 60 -- -m 1 
9.14
but reproducible, in my experiments.
I have similar data for different number of flows.
> I believe I did this so that we factorize the logic in do_softirq()
> and keep the code local to kernel/softirq.c
> 
> Otherwise, netif_rx_ni() could also process softirq while ksoftirqd
> was scheduled,
> so I would have to  'export' the ksoftirqd_running(void) helper in an
> include file.
The idea could be to add the test in __local_bh_enable_ip(), maintaining
the test also in do_softirq() (as currently done, i.e for
local_softirq_pending())
Cheers,
Paolo
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-13 17:19                   ` Paolo Abeni
@ 2016-05-13 17:36                     ` Eric Dumazet
  2016-05-16 13:10                       ` Paolo Abeni
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2016-05-13 17:36 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Eric Dumazet, Peter Zijlstra, Hannes Frederic Sowa, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Fri, May 13, 2016 at 10:19 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> The difference is small, in the noise range:
>
> [with this patch applied]
> super_netperf 100 -H 192.168.122.1 -t UDP_STREAM -l 60 -- -m 1
> 9.00
>
> [adding the test into __local_bh_enable_ip(), too]
> super_netperf 100 -H 192.168.122.1 -t UDP_STREAM -l 60 -- -m 1
> 9.14
>
> but reproducible, in my experiments.
> I have similar data for different number of flows.
>
>> I believe I did this so that we factorize the logic in do_softirq()
>> and keep the code local to kernel/softirq.c
>>
>> Otherwise, netif_rx_ni() could also process softirq while ksoftirqd
>> was scheduled,
>> so I would have to  'export' the ksoftirqd_running(void) helper in an
>> include file.
>
> The idea could be to add the test in __local_bh_enable_ip(), maintaining
> the test also in do_softirq() (as currently done, i.e for
> local_softirq_pending())
>
Then I guess even the !in_interrupt() test we do is expensive and
could be avoided,
since do_softirq() is doing it again in the unlikely case it really is needed.
@@ -162,7 +170,8 @@ void __local_bh_enable_ip(unsigned long ip,
unsigned int cnt)
         */
        preempt_count_sub(cnt - 1);
-       if (unlikely(!in_interrupt() && local_softirq_pending())) {
+       if (unlikely(local_softirq_pending()) &&
+                    !ksoftirqd_running()) {
                /*
                 * Run softirq if any pending. And do it in its own stack
                 * as we may be calling this deep in a task call stack already.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-13 17:36                     ` Eric Dumazet
@ 2016-05-16 13:10                       ` Paolo Abeni
  2016-05-16 13:38                         ` Eric Dumazet
  0 siblings, 1 reply; 48+ messages in thread
From: Paolo Abeni @ 2016-05-16 13:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Peter Zijlstra, Hannes Frederic Sowa, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Fri, 2016-05-13 at 10:36 -0700, Eric Dumazet wrote:
> On Fri, May 13, 2016 at 10:19 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > The difference is small, in the noise range:
> >
> > [with this patch applied]
> > super_netperf 100 -H 192.168.122.1 -t UDP_STREAM -l 60 -- -m 1
> > 9.00
> >
> > [adding the test into __local_bh_enable_ip(), too]
> > super_netperf 100 -H 192.168.122.1 -t UDP_STREAM -l 60 -- -m 1
> > 9.14
> >
> > but reproducible, in my experiments.
> > I have similar data for different number of flows.
> >
> >> I believe I did this so that we factorize the logic in do_softirq()
> >> and keep the code local to kernel/softirq.c
> >>
> >> Otherwise, netif_rx_ni() could also process softirq while ksoftirqd
> >> was scheduled,
> >> so I would have to  'export' the ksoftirqd_running(void) helper in an
> >> include file.
> >
> > The idea could be to add the test in __local_bh_enable_ip(), maintaining
> > the test also in do_softirq() (as currently done, i.e for
> > local_softirq_pending())
> >
> 
> Then I guess even the !in_interrupt() test we do is expensive and
> could be avoided,
> since do_softirq() is doing it again in the unlikely case it really is needed.
> 
> @@ -162,7 +170,8 @@ void __local_bh_enable_ip(unsigned long ip,
> unsigned int cnt)
>          */
>         preempt_count_sub(cnt - 1);
> 
> -       if (unlikely(!in_interrupt() && local_softirq_pending())) {
> +       if (unlikely(local_softirq_pending()) &&
> +                    !ksoftirqd_running()) {
>                 /*
>                  * Run softirq if any pending. And do it in its own stack
>                  * as we may be calling this deep in a task call stack already.
I'm sorry for the not-so-prompt reply. I had to use a different H/W, so
I had to re-run the tests with all the patch flavors to get comparable
results.
While I can confirm that adding the '!ksoftirqd_running()' condition
improves the throughput a little, but in a reproducible way, removing
the '!in_interrupt()' don't change the result measurably, in my
environment.
While running the test against a kernel with the above chunk applied I
got a couple of:
[  702.791025] NOHZ: local_softirq_pending 08
Not seen with the other versions of this patch.
Cheers,
Paolo
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC PATCH 0/2] net: threadable napi poll loop
  2016-05-16 13:10                       ` Paolo Abeni
@ 2016-05-16 13:38                         ` Eric Dumazet
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2016-05-16 13:38 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Eric Dumazet, Peter Zijlstra, Hannes Frederic Sowa, netdev,
	David S. Miller, Jiri Pirko, Daniel Borkmann, Alexei Starovoitov,
	Alexander Duyck, Tom Herbert, Ingo Molnar, Rik van Riel, LKML,
	Paul E. McKenney
On Mon, May 16, 2016 at 6:10 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>
> I'm sorry for the not-so-prompt reply. I had to use a different H/W, so
> I had to re-run the tests with all the patch flavors to get comparable
> results.
>
> While I can confirm that adding the '!ksoftirqd_running()' condition
> improves the throughput a little, but in a reproducible way, removing
> the '!in_interrupt()' don't change the result measurably, in my
> environment.
>
> While running the test against a kernel with the above chunk applied I
> got a couple of:
>
> [  702.791025] NOHZ: local_softirq_pending 08
>
> Not seen with the other versions of this patch.
I've seen this message in all versions, depending on the workload.
Either a barrier of some kind is missing, or we uncover an existing bug.
Note that in my tests on an older base kernel (something based on 3.11
but with thousands of patches),
I would not have the scary rcu messages that I got with current
upstream kernels.
^ permalink raw reply	[flat|nested] 48+ messages in thread
end of thread, other threads:[~2016-05-16 13:38 UTC | newest]
Thread overview: 48+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-05-10 14:11 [RFC PATCH 0/2] net: threadable napi poll loop Paolo Abeni
2016-05-10 14:11 ` [RFC PATCH 1/2] net: implement threaded-able napi poll loop support Paolo Abeni
2016-05-10 14:11 ` [RFC PATCH 2/2] net: add sysfs attribute to control napi threaded mode Paolo Abeni
2016-05-10 14:29 ` [RFC PATCH 0/2] net: threadable napi poll loop Eric Dumazet
2016-05-10 15:51   ` David Miller
2016-05-10 16:03   ` Paolo Abeni
2016-05-10 16:08     ` Eric Dumazet
2016-05-10 20:22       ` Paolo Abeni
2016-05-10 20:45         ` David Miller
2016-05-10 20:50           ` Rik van Riel
2016-05-10 20:52             ` David Miller
2016-05-10 21:01               ` Rik van Riel
2016-05-10 20:46   ` Hannes Frederic Sowa
2016-05-10 21:09     ` Eric Dumazet
2016-05-10 21:31       ` Eric Dumazet
2016-05-10 21:35         ` Rik van Riel
2016-05-10 21:53           ` Eric Dumazet
2016-05-10 22:02             ` Eric Dumazet
2016-05-10 22:44               ` Eric Dumazet
2016-05-10 22:02             ` Rik van Riel
2016-05-11 17:55             ` Eric Dumazet
2016-05-10 22:32       ` Hannes Frederic Sowa
2016-05-10 22:51         ` Eric Dumazet
2016-05-11  6:55           ` Peter Zijlstra
2016-05-11 13:13             ` Hannes Frederic Sowa
2016-05-11 14:40               ` Eric Dumazet
2016-05-11 15:01                 ` Rik van Riel
2016-05-11 15:50                 ` Eric Dumazet
2016-05-11 21:56             ` Eric Dumazet
2016-05-12 20:07               ` Paolo Abeni
2016-05-12 20:49                 ` Eric Dumazet
2016-05-12 20:58                   ` Paolo Abeni
2016-05-12 21:05                     ` Eric Dumazet
2016-05-13 16:50               ` Paolo Abeni
2016-05-13 17:03                 ` Eric Dumazet
2016-05-13 17:19                   ` Paolo Abeni
2016-05-13 17:36                     ` Eric Dumazet
2016-05-16 13:10                       ` Paolo Abeni
2016-05-16 13:38                         ` Eric Dumazet
2016-05-11  9:48           ` Paolo Abeni
2016-05-11 13:08             ` Eric Dumazet
2016-05-11 13:39               ` Hannes Frederic Sowa
2016-05-11 13:47                 ` Hannes Frederic Sowa
2016-05-11 14:38               ` Paolo Abeni
2016-05-11 14:45                 ` Eric Dumazet
2016-05-11 22:47                   ` Hannes Frederic Sowa
2016-05-10 15:57 ` Thomas Gleixner
2016-05-10 20:41   ` Paolo Abeni
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).