* [RFC 0/2] pm,net: Introduce QoS requests per CPU
@ 2014-03-25 13:18 Amir Vadai
From: Amir Vadai @ 2014-03-25 13:18 UTC (permalink / raw)
To: David S. Miller
Cc: linux-pm, netdev, Pavel Machek, Rafael J. Wysocki, Len Brown,
yuvali, Or Gerlitz, Yevgeny Petrilin, idos, Amir Vadai
Hi,
This patchset is preliminary work to add power management requests per
core. The patches compile, but more work is needed before a non-RFC
submission. Please treat them as a reference; I would like to prepare the
final implementation after a discussion in the community.
The problem
-----------
I'm maintaining Mellanox's network driver (mlx4_en) in the kernel.
The current pm_qos implementation has a problem. During a short pause in high
bandwidth traffic, the kernel can let the CPU drop into a deep, power-saving
C-state. When the pause ends and the traffic resumes, the NIC hardware buffers
may overflow before the CPU wakes up and starts to process the traffic,
because of the CPU wake-up latency.
The driver can add a request to constrain the C-state during high bandwidth
traffic - but pm_qos only allows a global constraint covering all the CPUs.
While this fixes the wake-up latency problem, it is bad for the power
consumption of the server.
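For reference, this is roughly how a driver expresses such a constraint
today with the existing global API (a sketch only; the helper names
nic_traffic_start/stop and the 50 usec value are made up for illustration):

  #include <linux/pm_qos.h>

  static struct pm_qos_request nic_qos_req;

  static void nic_traffic_start(void)
  {
          /* System wide: keep every CPU out of deep C-states by
           * requesting a wake-up latency of at most 50 usec.
           */
          pm_qos_add_request(&nic_qos_req, PM_QOS_CPU_DMA_LATENCY, 50);
  }

  static void nic_traffic_stop(void)
  {
          /* Drop the constraint; all CPUs may sleep deeply again. */
          pm_qos_remove_request(&nic_qos_req);
  }

Note that this single request affects every core in the system, which is
exactly the power problem described above.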
Suggested solution
------------------
The idea is to extend the current pm_qos_request API to support a
pm_qos_request per core.
The networking driver will add a request on the CPU which handles the traffic,
to prevent it from entering a deep power-saving C-state. The request will be
deleted once there is no need to keep the CPU responsive.
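With the proposed per-CPU API, the driver can constrain only the CPU that
services a given RX ring. A minimal sketch, assuming a hypothetical
nic_rx_ring structure that embeds the request (the helper names follow
patch 1/2; the 50 usec value is again only illustrative):

  static void rx_ring_traffic_start(struct nic_rx_ring *ring, int cpu)
  {
          /* Constrain only the CPU that handles this ring's interrupts. */
          pm_qos_add_request_cpu(&ring->qos_req, PM_QOS_CPU_DMA_LATENCY,
                                 cpu, 50);
  }

  static void rx_ring_traffic_stop(struct nic_rx_ring *ring)
  {
          /* The CPU number is stored in the request itself, so later
           * update/remove operations need no extra argument.
           */
          pm_qos_remove_request(&ring->qos_req);
  }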
The governor selects the next idle state by looking at the target value of the
specific core in addition to the global target value, instead of using only the
global one.
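Concretely, in patch 1/2 the menu governor's menu_select() switches from the
global read to a per-CPU read:

  int latency_req = pm_qos_request_cpu(PM_QOS_CPU_DMA_LATENCY,
                                       smp_processor_id());

Since the per-CPU target value already folds in the global constraints, this
is the only change needed in the governor itself.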
If a global request is added/removed/updated, the target values of all the CPUs
are recalculated.
When a CPU-specific request is added/removed/updated, only the target value of
that core is recalculated, as the min/max (according to the constraint type) of
all the global and CPU-specific constraints.
During initialization, before the CPU-specific data structures are allocated
and initialized, only the global target value is used.
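To make the aggregation concrete, here is a simplified sketch of the MIN-type
path of pm_qos_get_value() from patch 1/2, renamed to avoid confusion with the
real helper (the real function also handles MAX-type classes and the case
where the per-CPU array was not allocated):

  /* Effective target for one CPU: the strictest (smallest) of the
   * global constraints and that CPU's own constraints.
   */
  static s32 pm_qos_target_for_cpu(struct pm_qos_constraints *c, int cpu)
  {
          struct pm_qos_constraints_percpu *cc = per_cpu_ptr(c->percpu, cpu);
          bool have_global = !plist_head_empty(&c->list);
          bool have_cpu = !plist_head_empty(&cc->list);

          if (!have_global && !have_cpu)
                  return c->default_value;
          if (!have_global)
                  return plist_first(&cc->list)->prio;
          if (!have_cpu)
                  return plist_first(&c->list)->prio;

          return min(plist_first(&c->list)->prio,
                     plist_first(&cc->list)->prio);
  }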
I added to this patchset preliminary work on mlx4_en. In this version the
driver restricts the C-state of all CPUs during high bandwidth traffic. In
the final version this patch will use the new API and restrict only the
relevant CPU's C-state.
TODOs
-----
- Use a cpumask instead of an int, to enable add/del/modify requests for
  cpusets, in order to specify a set of CPUs, e.g. a NUMA node (a possible
  shape is sketched after this list).
- Update Documentation/tracing
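Purely as a strawman for the cpumask TODO above - none of this exists in the
patches, and the function name and signature are hypothetical:

  /* Hypothetical cpumask variant: one request covering a set of CPUs,
   * e.g. the CPUs of one NUMA node. Not part of this series.
   */
  static struct pm_qos_request node_qos_req;

  static void nic_constrain_numa_node(int node)
  {
          pm_qos_add_request_cpumask(&node_qos_req, PM_QOS_CPU_DMA_LATENCY,
                                     cpumask_of_node(node), 50);
  }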
Thanks,
Amir
Amir Vadai (2):
pm: Introduce QoS requests per CPU
net/mlx4_en: Use pm_qos API to avoid packet loss in high CPU c-states
Documentation/trace/events-power.txt | 2 +
drivers/base/power/qos.c | 6 +-
drivers/cpuidle/governors/menu.c | 2 +-
drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 37 ++++
drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 40 +++++
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 +
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 13 ++
include/linux/pm_qos.h | 22 ++-
include/trace/events/power.h | 20 ++-
kernel/power/qos.c | 221 ++++++++++++++++++------
10 files changed, 302 insertions(+), 68 deletions(-)
--
1.8.3.4
^ permalink raw reply [flat|nested] 11+ messages in thread* [RFC 1/2] pm: Introduce QoS requests per CPU 2014-03-25 13:18 [RFC 0/2] pm,net: Introduce QoS requests per CPU Amir Vadai @ 2014-03-25 13:18 ` Amir Vadai 2014-03-25 18:44 ` Rafael J. Wysocki 2014-03-25 13:18 ` [RFC 2/2] net/mlx4_en: Use pm_qos API to avoid packet loss in high CPU c-states Amir Vadai 2014-03-25 15:14 ` [RFC 0/2] pm,net: Introduce QoS requests per CPU Eric Dumazet 2 siblings, 1 reply; 11+ messages in thread From: Amir Vadai @ 2014-03-25 13:18 UTC (permalink / raw) To: David S. Miller Cc: linux-pm, netdev, Pavel Machek, Rafael J. Wysocki, Len Brown, yuvali, Or Gerlitz, Yevgeny Petrilin, idos, Amir Vadai Extend the current pm_qos_request API - to have pm_qos_request per core. When a global request is added, it is added under the global plist. When a core specific request is added, it is added to the core specific list. core number is saved in the request and later modify/delete operations are using it to access the right list. When a cpu specific request is added/removed/updated, the target value of the specific core is recalculated to be the min/max (according to the constrain type) value of all the global and the cpu specific constraints. If a global request is added/removed/updated, the target values of all the cpu's are recalculated. During initialization, before the cpu specific data structures are allocated and initialized, only global target value is begin used. Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> --- Documentation/trace/events-power.txt | 2 + drivers/base/power/qos.c | 6 +- drivers/cpuidle/governors/menu.c | 2 +- include/linux/pm_qos.h | 22 +++- include/trace/events/power.h | 20 ++-- kernel/power/qos.c | 221 ++++++++++++++++++++++++++--------- 6 files changed, 205 insertions(+), 68 deletions(-) diff --git a/Documentation/trace/events-power.txt b/Documentation/trace/events-power.txt index 3bd33b8..be54c6c 100644 --- a/Documentation/trace/events-power.txt +++ b/Documentation/trace/events-power.txt @@ -68,6 +68,8 @@ The second parameter is the power domain target state. The PM QoS events are used for QoS add/update/remove request and for target/flags update. +TODO: update documentation here + pm_qos_add_request "pm_qos_class=%s value=%d" pm_qos_update_request "pm_qos_class=%s value=%d" pm_qos_remove_request "pm_qos_class=%s value=%d" diff --git a/drivers/base/power/qos.c b/drivers/base/power/qos.c index 5c1361a..d816f00 100644 --- a/drivers/base/power/qos.c +++ b/drivers/base/power/qos.c @@ -105,7 +105,7 @@ EXPORT_SYMBOL_GPL(dev_pm_qos_flags); s32 __dev_pm_qos_read_value(struct device *dev) { return IS_ERR_OR_NULL(dev->power.qos) ? 
- 0 : pm_qos_read_value(&dev->power.qos->latency); + 0 : pm_qos_read_value(&dev->power.qos->latency, -1); } /** @@ -143,9 +143,9 @@ static int apply_constraint(struct dev_pm_qos_request *req, switch(req->type) { case DEV_PM_QOS_LATENCY: ret = pm_qos_update_target(&qos->latency, &req->data.pnode, - action, value); + -1, action, value); if (ret) { - value = pm_qos_read_value(&qos->latency); + value = pm_qos_read_value(&qos->latency, -1); blocking_notifier_call_chain(&dev_pm_notifiers, (unsigned long)value, req); diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index cf7f2f0..edb7b4a 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -287,7 +287,7 @@ again: static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) { struct menu_device *data = &__get_cpu_var(menu_devices); - int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY); + int latency_req = pm_qos_request_cpu(PM_QOS_CPU_DMA_LATENCY, smp_processor_id()); int i; int multiplier; struct timespec t; diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index 5a95013..62a15ee 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -40,6 +40,7 @@ enum pm_qos_flags_status { struct pm_qos_request { struct plist_node node; int pm_qos_class; + int cpu; struct delayed_work work; /* for pm_qos_update_request_timeout */ }; @@ -68,6 +69,11 @@ enum pm_qos_type { PM_QOS_MIN /* return the smallest value */ }; +struct pm_qos_constraints_percpu { + struct plist_head list; + s32 target_value; +}; + /* * Note: The lockless read path depends on the CPU accessing target_value * or effective_flags atomically. Atomic access is only guaranteed on all CPU @@ -75,7 +81,11 @@ enum pm_qos_type { */ struct pm_qos_constraints { struct plist_head list; - s32 target_value; /* Do not change to 64 bit */ + struct pm_qos_constraints_percpu __percpu *percpu; + s32 target_value; /* Do not change to 64 bit. 
+ * Value will be overriden by percpu + * target_value if exists + */ s32 default_value; enum pm_qos_type type; struct blocking_notifier_head *notifiers; @@ -106,12 +116,15 @@ static inline int dev_pm_qos_request_active(struct dev_pm_qos_request *req) } int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, + int cpu, enum pm_qos_req_action action, int value); bool pm_qos_update_flags(struct pm_qos_flags *pqf, struct pm_qos_flags_request *req, enum pm_qos_req_action action, s32 val); -void pm_qos_add_request(struct pm_qos_request *req, int pm_qos_class, - s32 value); +void pm_qos_add_request_cpu(struct pm_qos_request *req, int pm_qos_class, + int cpu, s32 value); +#define pm_qos_add_request(req, pm_qos_class, value) \ + pm_qos_add_request_cpu(req, pm_qos_class, -1, value) void pm_qos_update_request(struct pm_qos_request *req, s32 new_value); void pm_qos_update_request_timeout(struct pm_qos_request *req, @@ -119,10 +132,11 @@ void pm_qos_update_request_timeout(struct pm_qos_request *req, void pm_qos_remove_request(struct pm_qos_request *req); int pm_qos_request(int pm_qos_class); +int pm_qos_request_cpu(int pm_qos_class, int cpu); int pm_qos_add_notifier(int pm_qos_class, struct notifier_block *notifier); int pm_qos_remove_notifier(int pm_qos_class, struct notifier_block *notifier); int pm_qos_request_active(struct pm_qos_request *req); -s32 pm_qos_read_value(struct pm_qos_constraints *c); +s32 pm_qos_read_value(struct pm_qos_constraints *c, int cpu); #ifdef CONFIG_PM enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask); diff --git a/include/trace/events/power.h b/include/trace/events/power.h index e5bf9a7..2e441e2 100644 --- a/include/trace/events/power.h +++ b/include/trace/events/power.h @@ -340,42 +340,46 @@ TRACE_EVENT(pm_qos_update_request_timeout, DECLARE_EVENT_CLASS(pm_qos_update, - TP_PROTO(enum pm_qos_req_action action, int prev_value, int curr_value), + TP_PROTO(enum pm_qos_req_action action, int cpu, int prev_value, int curr_value), - TP_ARGS(action, prev_value, curr_value), + TP_ARGS(action, cpu, prev_value, curr_value), TP_STRUCT__entry( __field( enum pm_qos_req_action, action ) + __field( int, cpu ) __field( int, prev_value ) __field( int, curr_value ) ), TP_fast_assign( __entry->action = action; + __entry->cpu = cpu; __entry->prev_value = prev_value; __entry->curr_value = curr_value; ), - TP_printk("action=%s prev_value=%d curr_value=%d", + TP_printk("action=%s cpu=%d prev_value=%d curr_value=%d", __print_symbolic(__entry->action, { PM_QOS_ADD_REQ, "ADD_REQ" }, { PM_QOS_UPDATE_REQ, "UPDATE_REQ" }, { PM_QOS_REMOVE_REQ, "REMOVE_REQ" }), - __entry->prev_value, __entry->curr_value) + __entry->cpu, __entry->prev_value, __entry->curr_value) ); DEFINE_EVENT(pm_qos_update, pm_qos_update_target, - TP_PROTO(enum pm_qos_req_action action, int prev_value, int curr_value), + TP_PROTO(enum pm_qos_req_action action, int cpu, int prev_value, int + curr_value), - TP_ARGS(action, prev_value, curr_value) + TP_ARGS(action, cpu, prev_value, curr_value) ); DEFINE_EVENT_PRINT(pm_qos_update, pm_qos_update_flags, - TP_PROTO(enum pm_qos_req_action action, int prev_value, int curr_value), + TP_PROTO(enum pm_qos_req_action action, int cpu, int prev_value, + int curr_value), - TP_ARGS(action, prev_value, curr_value), + TP_ARGS(action, cpu, prev_value, curr_value), TP_printk("action=%s prev_value=0x%x curr_value=0x%x", __print_symbolic(__entry->action, diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 8dff9b4..4a38329 100644 --- a/kernel/power/qos.c 
+++ b/kernel/power/qos.c @@ -125,59 +125,124 @@ static const struct file_operations pm_qos_power_fops = { }; /* unlocked internal variant */ -static inline int pm_qos_get_value(struct pm_qos_constraints *c) +static inline int pm_qos_get_value(struct pm_qos_constraints *c, int cpu) { - if (plist_head_empty(&c->list)) + struct pm_qos_constraints_percpu *cc; + struct plist_head *clist = NULL; + int val; + + if (c->percpu && cpu >= 0) { + cc = per_cpu_ptr(c->percpu, cpu); + if (!plist_head_empty(&cc->list)) + clist= &cc->list; + } + + if (plist_head_empty(&c->list) && !clist) return c->default_value; switch (c->type) { case PM_QOS_MIN: - return plist_first(&c->list)->prio; - + if (plist_head_empty(&c->list)) + return plist_first(clist)->prio; + val = plist_first(&c->list)->prio; + if (clist) + val = min(val, plist_first(clist)->prio); + return val; case PM_QOS_MAX: - return plist_last(&c->list)->prio; - + if (plist_head_empty(&c->list)) + return plist_last(clist)->prio; + val = plist_last(&c->list)->prio; + if (clist) + val = max(val, plist_last(clist)->prio); + return val; default: /* runtime check for not using enum */ BUG(); - return PM_QOS_DEFAULT_VALUE; } + + return PM_QOS_DEFAULT_VALUE; } -s32 pm_qos_read_value(struct pm_qos_constraints *c) +s32 pm_qos_read_value(struct pm_qos_constraints *c, int cpu) { + struct pm_qos_constraints_percpu *cc; + + if (c->percpu && cpu >= 0) { + cc = per_cpu_ptr(c->percpu, cpu); + return cc->target_value; + } + return c->target_value; } -static inline void pm_qos_set_value(struct pm_qos_constraints *c, s32 value) +static inline int _pm_qos_set_value(enum pm_qos_req_action act, + struct pm_qos_constraints *c, int cpu) { + struct pm_qos_constraints_percpu *cc = per_cpu_ptr(c->percpu, cpu); + s32 value = pm_qos_get_value(c, cpu); + + if (value == cc->target_value) + return 0; + + trace_pm_qos_update_target(act, cpu, c->target_value, value); + cc->target_value = value; + + return 1; +} + +static inline int pm_qos_set_value(enum pm_qos_req_action act, + struct pm_qos_constraints *c, int cpu) +{ + s32 value; + + if (cpu >= 0) + return _pm_qos_set_value(act, c, cpu); + + value = pm_qos_get_value(c, -1); + if (value == c->target_value) + return 0; + + trace_pm_qos_update_target(act, -1, c->target_value, value); c->target_value = value; + + if (c->percpu) { + for_each_possible_cpu(cpu) + _pm_qos_set_value(act, c, cpu); + } + + return 1; } -/** - * pm_qos_update_target - manages the constraints list and calls the notifiers - * if needed - * @c: constraints data struct - * @node: request to add to the list, to update or to remove - * @action: action to take on the constraints list - * @value: value of the request to add or update - * - * This function returns 1 if the aggregated constraint value has changed, 0 - * otherwise. 
- */ -int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, - enum pm_qos_req_action action, int value) +static int _pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, + enum pm_qos_req_action action, int value, int cpu) { unsigned long flags; - int prev_value, curr_value, new_value; + int new_value; + int ret; + struct plist_head *plist; spin_lock_irqsave(&pm_qos_lock, flags); - prev_value = pm_qos_get_value(c); if (value == PM_QOS_DEFAULT_VALUE) new_value = c->default_value; else new_value = value; + if (cpu == -1) { + plist = &c->list; + } else { + struct pm_qos_constraints_percpu *cc; + + if (!c->percpu) { + ret = 0; + printk(KERN_ERR "%s: Can't set cpu constraint\n", + __func__); + goto out; + } + + cc = per_cpu_ptr(c->percpu, cpu); + plist = &cc->list; + } + switch (action) { case PM_QOS_REMOVE_REQ: plist_del(node, &c->list); @@ -198,20 +263,39 @@ int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, ; } - curr_value = pm_qos_get_value(c); - pm_qos_set_value(c, curr_value); + ret = pm_qos_set_value(action, c, cpu); +out: spin_unlock_irqrestore(&pm_qos_lock, flags); - trace_pm_qos_update_target(action, prev_value, curr_value); - if (prev_value != curr_value) { - blocking_notifier_call_chain(c->notifiers, - (unsigned long)curr_value, - NULL); - return 1; - } else { - return 0; + if (ret) { + if (c->notifiers) + blocking_notifier_call_chain(c->notifiers, + 0, /* XXX should change to notifier per core? */ + /*(unsigned long)curr_value,*/ + NULL); } + + return ret; +} + +/** + * pm_qos_update_target - manages the constraints list and calls the notifiers + * if needed + * @c: constraints data struct + * @node: request to add to the list, to update or to remove + * @cpu: core number, -1 for all + * @action: action to take on the constraints list + * @value: value of the request to add or update + * + * This function returns 1 if the aggregated constraint value has changed, 0 + * otherwise. + */ +int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, + int cpu, + enum pm_qos_req_action action, int value) +{ + return _pm_qos_update_target(c, node, action, value, -1); } /** @@ -274,7 +358,7 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, spin_unlock_irqrestore(&pm_qos_lock, irqflags); - trace_pm_qos_update_flags(action, prev_value, curr_value); + trace_pm_qos_update_flags(action, -1, prev_value, curr_value); return prev_value != curr_value; } @@ -286,10 +370,23 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, */ int pm_qos_request(int pm_qos_class) { - return pm_qos_read_value(pm_qos_array[pm_qos_class]->constraints); + return pm_qos_read_value(pm_qos_array[pm_qos_class]->constraints, -1); } EXPORT_SYMBOL_GPL(pm_qos_request); +/** + * pm_qos_request_cpu - returns current per cpu qos expectation + * @pm_qos_class: identification of which qos value is requested + * @cpu: core number + * + * This function returns the current target value for a core. 
+ */ +int pm_qos_request_cpu(int pm_qos_class, int cpu) +{ + return pm_qos_read_value(pm_qos_array[pm_qos_class]->constraints, cpu); +} +EXPORT_SYMBOL_GPL(pm_qos_request_cpu); + int pm_qos_request_active(struct pm_qos_request *req) { return req->pm_qos_class != 0; @@ -297,14 +394,14 @@ int pm_qos_request_active(struct pm_qos_request *req) EXPORT_SYMBOL_GPL(pm_qos_request_active); static void __pm_qos_update_request(struct pm_qos_request *req, - s32 new_value) + s32 new_value) { trace_pm_qos_update_request(req->pm_qos_class, new_value); if (new_value != req->node.prio) pm_qos_update_target( pm_qos_array[req->pm_qos_class]->constraints, - &req->node, PM_QOS_UPDATE_REQ, new_value); + &req->node, req->cpu, PM_QOS_UPDATE_REQ, new_value); } /** @@ -326,6 +423,7 @@ static void pm_qos_work_fn(struct work_struct *work) * pm_qos_add_request - inserts new qos request into the list * @req: pointer to a preallocated handle * @pm_qos_class: identifies which list of qos request to use + * @cpu : cpu which the request belong to. -1 for global request * @value: defines the qos request * * This function inserts a new entry in the pm_qos_class list of requested qos @@ -334,9 +432,8 @@ static void pm_qos_work_fn(struct work_struct *work) * handle. Caller needs to save this handle for later use in updates and * removal. */ - -void pm_qos_add_request(struct pm_qos_request *req, - int pm_qos_class, s32 value) +void pm_qos_add_request_cpu(struct pm_qos_request *req, + int pm_qos_class, int cpu, s32 value) { if (!req) /*guard against callers passing in null */ return; @@ -346,12 +443,13 @@ void pm_qos_add_request(struct pm_qos_request *req, return; } req->pm_qos_class = pm_qos_class; + req->cpu = cpu; INIT_DELAYED_WORK(&req->work, pm_qos_work_fn); trace_pm_qos_add_request(pm_qos_class, value); pm_qos_update_target(pm_qos_array[pm_qos_class]->constraints, - &req->node, PM_QOS_ADD_REQ, value); + &req->node, cpu, PM_QOS_ADD_REQ, value); } -EXPORT_SYMBOL_GPL(pm_qos_add_request); +EXPORT_SYMBOL_GPL(pm_qos_add_request_cpu); /** * pm_qos_update_request - modifies an existing qos request @@ -364,7 +462,7 @@ EXPORT_SYMBOL_GPL(pm_qos_add_request); * Attempts are made to make this code callable on hot code paths. */ void pm_qos_update_request(struct pm_qos_request *req, - s32 new_value) + s32 new_value) { if (!req) /*guard against callers passing in null */ return; @@ -387,8 +485,8 @@ EXPORT_SYMBOL_GPL(pm_qos_update_request); * * After timeout_us, this qos request is cancelled automatically. 
*/ -void pm_qos_update_request_timeout(struct pm_qos_request *req, s32 new_value, - unsigned long timeout_us) +void pm_qos_update_request_timeout(struct pm_qos_request *req, + s32 new_value, unsigned long timeout_us) { if (!req) return; @@ -403,7 +501,7 @@ void pm_qos_update_request_timeout(struct pm_qos_request *req, s32 new_value, if (new_value != req->node.prio) pm_qos_update_target( pm_qos_array[req->pm_qos_class]->constraints, - &req->node, PM_QOS_UPDATE_REQ, new_value); + &req->node, req->cpu, PM_QOS_UPDATE_REQ, new_value); schedule_delayed_work(&req->work, usecs_to_jiffies(timeout_us)); } @@ -431,7 +529,7 @@ void pm_qos_remove_request(struct pm_qos_request *req) trace_pm_qos_remove_request(req->pm_qos_class, PM_QOS_DEFAULT_VALUE); pm_qos_update_target(pm_qos_array[req->pm_qos_class]->constraints, - &req->node, PM_QOS_REMOVE_REQ, + &req->node, req->cpu, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE); memset(req, 0, sizeof(*req)); } @@ -529,7 +627,7 @@ static int pm_qos_power_release(struct inode *inode, struct file *filp) return 0; } - +// TODO sysfs for non global static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, size_t count, loff_t *f_pos) { @@ -543,7 +641,7 @@ static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, return -EINVAL; spin_lock_irqsave(&pm_qos_lock, flags); - value = pm_qos_get_value(pm_qos_array[req->pm_qos_class]->constraints); + value = pm_qos_get_value(pm_qos_array[req->pm_qos_class]->constraints, -1); spin_unlock_irqrestore(&pm_qos_lock, flags); return simple_read_from_buffer(buf, count, f_pos, &value, sizeof(s32)); @@ -581,10 +679,29 @@ static int __init pm_qos_power_init(void) BUILD_BUG_ON(ARRAY_SIZE(pm_qos_array) != PM_QOS_NUM_CLASSES); for (i = PM_QOS_CPU_DMA_LATENCY; i < PM_QOS_NUM_CLASSES; i++) { - ret = register_pm_qos_misc(pm_qos_array[i]); + struct pm_qos_object *pm_qos = pm_qos_array[i]; + struct pm_qos_constraints *c = pm_qos->constraints; + + c->percpu = alloc_percpu(struct pm_qos_constraints_percpu); + if (!c->percpu) { + printk(KERN_ERR "pm_qos_param: %s per cpu alloc failed\n", + pm_qos->name); + } else { + int cpu; + + for_each_possible_cpu(cpu) { + struct pm_qos_constraints_percpu *cc = + per_cpu_ptr(c->percpu, cpu); + + plist_head_init(&cc->list); + cc->target_value = pm_qos_read_value(c, -1); + } + } + + ret = register_pm_qos_misc(pm_qos); if (ret < 0) { printk(KERN_ERR "pm_qos_param: %s setup failed\n", - pm_qos_array[i]->name); + pm_qos->name); return ret; } } -- 1.8.3.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC 1/2] pm: Introduce QoS requests per CPU 2014-03-25 13:18 ` [RFC 1/2] pm: " Amir Vadai @ 2014-03-25 18:44 ` Rafael J. Wysocki 2014-03-26 15:40 ` Amir Vadai 2014-03-26 17:36 ` Jeremy Eder 0 siblings, 2 replies; 11+ messages in thread From: Rafael J. Wysocki @ 2014-03-25 18:44 UTC (permalink / raw) To: Amir Vadai Cc: David S. Miller, linux-pm, netdev, Pavel Machek, Len Brown, yuvali, Or Gerlitz, Yevgeny Petrilin, idos On Tuesday, March 25, 2014 03:18:24 PM Amir Vadai wrote: > Extend the current pm_qos_request API - to have pm_qos_request per core. > When a global request is added, it is added under the global plist. > When a core specific request is added, it is added to the core specific > list. > core number is saved in the request and later modify/delete operations > are using it to access the right list. > > When a cpu specific request is added/removed/updated, the target value > of the specific core is recalculated to be the min/max (according to the > constrain type) value of all the global and the cpu specific > constraints. > > If a global request is added/removed/updated, the target values of all > the cpu's are recalculated. > > During initialization, before the cpu specific data structures are > allocated and initialized, only global target value is begin used. I have to review this in detail (which rather won't be possible before the next week), but in principle I don't really like it, because it assumes that its users will know what's going to run on which CPU cores and I'm not sure where that knowledge is going to come from. > Signed-off-by: Ido Shamay <idos@mellanox.com> > Signed-off-by: Amir Vadai <amirv@mellanox.com> > --- > Documentation/trace/events-power.txt | 2 + > drivers/base/power/qos.c | 6 +- > drivers/cpuidle/governors/menu.c | 2 +- > include/linux/pm_qos.h | 22 +++- > include/trace/events/power.h | 20 ++-- > kernel/power/qos.c | 221 ++++++++++++++++++++++++++--------- > 6 files changed, 205 insertions(+), 68 deletions(-) > > diff --git a/Documentation/trace/events-power.txt b/Documentation/trace/events-power.txt > index 3bd33b8..be54c6c 100644 > --- a/Documentation/trace/events-power.txt > +++ b/Documentation/trace/events-power.txt > @@ -68,6 +68,8 @@ The second parameter is the power domain target state. > The PM QoS events are used for QoS add/update/remove request and for > target/flags update. > > +TODO: update documentation here > + > pm_qos_add_request "pm_qos_class=%s value=%d" > pm_qos_update_request "pm_qos_class=%s value=%d" > pm_qos_remove_request "pm_qos_class=%s value=%d" > diff --git a/drivers/base/power/qos.c b/drivers/base/power/qos.c > index 5c1361a..d816f00 100644 > --- a/drivers/base/power/qos.c > +++ b/drivers/base/power/qos.c > @@ -105,7 +105,7 @@ EXPORT_SYMBOL_GPL(dev_pm_qos_flags); > s32 __dev_pm_qos_read_value(struct device *dev) > { > return IS_ERR_OR_NULL(dev->power.qos) ? 
> - 0 : pm_qos_read_value(&dev->power.qos->latency); > + 0 : pm_qos_read_value(&dev->power.qos->latency, -1); > } > > /** > @@ -143,9 +143,9 @@ static int apply_constraint(struct dev_pm_qos_request *req, > switch(req->type) { > case DEV_PM_QOS_LATENCY: > ret = pm_qos_update_target(&qos->latency, &req->data.pnode, > - action, value); > + -1, action, value); > if (ret) { > - value = pm_qos_read_value(&qos->latency); > + value = pm_qos_read_value(&qos->latency, -1); > blocking_notifier_call_chain(&dev_pm_notifiers, > (unsigned long)value, > req); > diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c > index cf7f2f0..edb7b4a 100644 > --- a/drivers/cpuidle/governors/menu.c > +++ b/drivers/cpuidle/governors/menu.c > @@ -287,7 +287,7 @@ again: > static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) > { > struct menu_device *data = &__get_cpu_var(menu_devices); > - int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY); > + int latency_req = pm_qos_request_cpu(PM_QOS_CPU_DMA_LATENCY, smp_processor_id()); > int i; > int multiplier; > struct timespec t; > diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h > index 5a95013..62a15ee 100644 > --- a/include/linux/pm_qos.h > +++ b/include/linux/pm_qos.h > @@ -40,6 +40,7 @@ enum pm_qos_flags_status { > struct pm_qos_request { > struct plist_node node; > int pm_qos_class; > + int cpu; > struct delayed_work work; /* for pm_qos_update_request_timeout */ > }; > > @@ -68,6 +69,11 @@ enum pm_qos_type { > PM_QOS_MIN /* return the smallest value */ > }; > > +struct pm_qos_constraints_percpu { > + struct plist_head list; > + s32 target_value; > +}; > + > /* > * Note: The lockless read path depends on the CPU accessing target_value > * or effective_flags atomically. Atomic access is only guaranteed on all CPU > @@ -75,7 +81,11 @@ enum pm_qos_type { > */ > struct pm_qos_constraints { > struct plist_head list; > - s32 target_value; /* Do not change to 64 bit */ > + struct pm_qos_constraints_percpu __percpu *percpu; > + s32 target_value; /* Do not change to 64 bit. 
> + * Value will be overriden by percpu > + * target_value if exists > + */ > s32 default_value; > enum pm_qos_type type; > struct blocking_notifier_head *notifiers; > @@ -106,12 +116,15 @@ static inline int dev_pm_qos_request_active(struct dev_pm_qos_request *req) > } > > int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, > + int cpu, > enum pm_qos_req_action action, int value); > bool pm_qos_update_flags(struct pm_qos_flags *pqf, > struct pm_qos_flags_request *req, > enum pm_qos_req_action action, s32 val); > -void pm_qos_add_request(struct pm_qos_request *req, int pm_qos_class, > - s32 value); > +void pm_qos_add_request_cpu(struct pm_qos_request *req, int pm_qos_class, > + int cpu, s32 value); > +#define pm_qos_add_request(req, pm_qos_class, value) \ > + pm_qos_add_request_cpu(req, pm_qos_class, -1, value) > void pm_qos_update_request(struct pm_qos_request *req, > s32 new_value); > void pm_qos_update_request_timeout(struct pm_qos_request *req, > @@ -119,10 +132,11 @@ void pm_qos_update_request_timeout(struct pm_qos_request *req, > void pm_qos_remove_request(struct pm_qos_request *req); > > int pm_qos_request(int pm_qos_class); > +int pm_qos_request_cpu(int pm_qos_class, int cpu); > int pm_qos_add_notifier(int pm_qos_class, struct notifier_block *notifier); > int pm_qos_remove_notifier(int pm_qos_class, struct notifier_block *notifier); > int pm_qos_request_active(struct pm_qos_request *req); > -s32 pm_qos_read_value(struct pm_qos_constraints *c); > +s32 pm_qos_read_value(struct pm_qos_constraints *c, int cpu); > > #ifdef CONFIG_PM > enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask); > diff --git a/include/trace/events/power.h b/include/trace/events/power.h > index e5bf9a7..2e441e2 100644 > --- a/include/trace/events/power.h > +++ b/include/trace/events/power.h > @@ -340,42 +340,46 @@ TRACE_EVENT(pm_qos_update_request_timeout, > > DECLARE_EVENT_CLASS(pm_qos_update, > > - TP_PROTO(enum pm_qos_req_action action, int prev_value, int curr_value), > + TP_PROTO(enum pm_qos_req_action action, int cpu, int prev_value, int curr_value), > > - TP_ARGS(action, prev_value, curr_value), > + TP_ARGS(action, cpu, prev_value, curr_value), > > TP_STRUCT__entry( > __field( enum pm_qos_req_action, action ) > + __field( int, cpu ) > __field( int, prev_value ) > __field( int, curr_value ) > ), > > TP_fast_assign( > __entry->action = action; > + __entry->cpu = cpu; > __entry->prev_value = prev_value; > __entry->curr_value = curr_value; > ), > > - TP_printk("action=%s prev_value=%d curr_value=%d", > + TP_printk("action=%s cpu=%d prev_value=%d curr_value=%d", > __print_symbolic(__entry->action, > { PM_QOS_ADD_REQ, "ADD_REQ" }, > { PM_QOS_UPDATE_REQ, "UPDATE_REQ" }, > { PM_QOS_REMOVE_REQ, "REMOVE_REQ" }), > - __entry->prev_value, __entry->curr_value) > + __entry->cpu, __entry->prev_value, __entry->curr_value) > ); > > DEFINE_EVENT(pm_qos_update, pm_qos_update_target, > > - TP_PROTO(enum pm_qos_req_action action, int prev_value, int curr_value), > + TP_PROTO(enum pm_qos_req_action action, int cpu, int prev_value, int > + curr_value), > > - TP_ARGS(action, prev_value, curr_value) > + TP_ARGS(action, cpu, prev_value, curr_value) > ); > > DEFINE_EVENT_PRINT(pm_qos_update, pm_qos_update_flags, > > - TP_PROTO(enum pm_qos_req_action action, int prev_value, int curr_value), > + TP_PROTO(enum pm_qos_req_action action, int cpu, int prev_value, > + int curr_value), > > - TP_ARGS(action, prev_value, curr_value), > + TP_ARGS(action, cpu, prev_value, curr_value), > > 
TP_printk("action=%s prev_value=0x%x curr_value=0x%x", > __print_symbolic(__entry->action, > diff --git a/kernel/power/qos.c b/kernel/power/qos.c > index 8dff9b4..4a38329 100644 > --- a/kernel/power/qos.c > +++ b/kernel/power/qos.c > @@ -125,59 +125,124 @@ static const struct file_operations pm_qos_power_fops = { > }; > > /* unlocked internal variant */ > -static inline int pm_qos_get_value(struct pm_qos_constraints *c) > +static inline int pm_qos_get_value(struct pm_qos_constraints *c, int cpu) > { > - if (plist_head_empty(&c->list)) > + struct pm_qos_constraints_percpu *cc; > + struct plist_head *clist = NULL; > + int val; > + > + if (c->percpu && cpu >= 0) { > + cc = per_cpu_ptr(c->percpu, cpu); > + if (!plist_head_empty(&cc->list)) > + clist= &cc->list; > + } > + > + if (plist_head_empty(&c->list) && !clist) > return c->default_value; > > switch (c->type) { > case PM_QOS_MIN: > - return plist_first(&c->list)->prio; > - > + if (plist_head_empty(&c->list)) > + return plist_first(clist)->prio; > + val = plist_first(&c->list)->prio; > + if (clist) > + val = min(val, plist_first(clist)->prio); > + return val; > case PM_QOS_MAX: > - return plist_last(&c->list)->prio; > - > + if (plist_head_empty(&c->list)) > + return plist_last(clist)->prio; > + val = plist_last(&c->list)->prio; > + if (clist) > + val = max(val, plist_last(clist)->prio); > + return val; > default: > /* runtime check for not using enum */ > BUG(); > - return PM_QOS_DEFAULT_VALUE; > } > + > + return PM_QOS_DEFAULT_VALUE; > } > > -s32 pm_qos_read_value(struct pm_qos_constraints *c) > +s32 pm_qos_read_value(struct pm_qos_constraints *c, int cpu) > { > + struct pm_qos_constraints_percpu *cc; > + > + if (c->percpu && cpu >= 0) { > + cc = per_cpu_ptr(c->percpu, cpu); > + return cc->target_value; > + } > + > return c->target_value; > } > > -static inline void pm_qos_set_value(struct pm_qos_constraints *c, s32 value) > +static inline int _pm_qos_set_value(enum pm_qos_req_action act, > + struct pm_qos_constraints *c, int cpu) > { > + struct pm_qos_constraints_percpu *cc = per_cpu_ptr(c->percpu, cpu); > + s32 value = pm_qos_get_value(c, cpu); > + > + if (value == cc->target_value) > + return 0; > + > + trace_pm_qos_update_target(act, cpu, c->target_value, value); > + cc->target_value = value; > + > + return 1; > +} > + > +static inline int pm_qos_set_value(enum pm_qos_req_action act, > + struct pm_qos_constraints *c, int cpu) > +{ > + s32 value; > + > + if (cpu >= 0) > + return _pm_qos_set_value(act, c, cpu); > + > + value = pm_qos_get_value(c, -1); > + if (value == c->target_value) > + return 0; > + > + trace_pm_qos_update_target(act, -1, c->target_value, value); > c->target_value = value; > + > + if (c->percpu) { > + for_each_possible_cpu(cpu) > + _pm_qos_set_value(act, c, cpu); > + } > + > + return 1; > } > > -/** > - * pm_qos_update_target - manages the constraints list and calls the notifiers > - * if needed > - * @c: constraints data struct > - * @node: request to add to the list, to update or to remove > - * @action: action to take on the constraints list > - * @value: value of the request to add or update > - * > - * This function returns 1 if the aggregated constraint value has changed, 0 > - * otherwise. 
> - */ > -int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, > - enum pm_qos_req_action action, int value) > +static int _pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, > + enum pm_qos_req_action action, int value, int cpu) > { > unsigned long flags; > - int prev_value, curr_value, new_value; > + int new_value; > + int ret; > + struct plist_head *plist; > > spin_lock_irqsave(&pm_qos_lock, flags); > - prev_value = pm_qos_get_value(c); > if (value == PM_QOS_DEFAULT_VALUE) > new_value = c->default_value; > else > new_value = value; > > + if (cpu == -1) { > + plist = &c->list; > + } else { > + struct pm_qos_constraints_percpu *cc; > + > + if (!c->percpu) { > + ret = 0; > + printk(KERN_ERR "%s: Can't set cpu constraint\n", > + __func__); > + goto out; > + } > + > + cc = per_cpu_ptr(c->percpu, cpu); > + plist = &cc->list; > + } > + > switch (action) { > case PM_QOS_REMOVE_REQ: > plist_del(node, &c->list); > @@ -198,20 +263,39 @@ int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, > ; > } > > - curr_value = pm_qos_get_value(c); > - pm_qos_set_value(c, curr_value); > + ret = pm_qos_set_value(action, c, cpu); > > +out: > spin_unlock_irqrestore(&pm_qos_lock, flags); > > - trace_pm_qos_update_target(action, prev_value, curr_value); > - if (prev_value != curr_value) { > - blocking_notifier_call_chain(c->notifiers, > - (unsigned long)curr_value, > - NULL); > - return 1; > - } else { > - return 0; > + if (ret) { > + if (c->notifiers) > + blocking_notifier_call_chain(c->notifiers, > + 0, /* XXX should change to notifier per core? */ > + /*(unsigned long)curr_value,*/ > + NULL); > } > + > + return ret; > +} > + > +/** > + * pm_qos_update_target - manages the constraints list and calls the notifiers > + * if needed > + * @c: constraints data struct > + * @node: request to add to the list, to update or to remove > + * @cpu: core number, -1 for all > + * @action: action to take on the constraints list > + * @value: value of the request to add or update > + * > + * This function returns 1 if the aggregated constraint value has changed, 0 > + * otherwise. > + */ > +int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, > + int cpu, > + enum pm_qos_req_action action, int value) > +{ > + return _pm_qos_update_target(c, node, action, value, -1); > } > > /** > @@ -274,7 +358,7 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, > > spin_unlock_irqrestore(&pm_qos_lock, irqflags); > > - trace_pm_qos_update_flags(action, prev_value, curr_value); > + trace_pm_qos_update_flags(action, -1, prev_value, curr_value); > return prev_value != curr_value; > } > > @@ -286,10 +370,23 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, > */ > int pm_qos_request(int pm_qos_class) > { > - return pm_qos_read_value(pm_qos_array[pm_qos_class]->constraints); > + return pm_qos_read_value(pm_qos_array[pm_qos_class]->constraints, -1); > } > EXPORT_SYMBOL_GPL(pm_qos_request); > > +/** > + * pm_qos_request_cpu - returns current per cpu qos expectation > + * @pm_qos_class: identification of which qos value is requested > + * @cpu: core number > + * > + * This function returns the current target value for a core. 
> + */ > +int pm_qos_request_cpu(int pm_qos_class, int cpu) > +{ > + return pm_qos_read_value(pm_qos_array[pm_qos_class]->constraints, cpu); > +} > +EXPORT_SYMBOL_GPL(pm_qos_request_cpu); > + > int pm_qos_request_active(struct pm_qos_request *req) > { > return req->pm_qos_class != 0; > @@ -297,14 +394,14 @@ int pm_qos_request_active(struct pm_qos_request *req) > EXPORT_SYMBOL_GPL(pm_qos_request_active); > > static void __pm_qos_update_request(struct pm_qos_request *req, > - s32 new_value) > + s32 new_value) > { > trace_pm_qos_update_request(req->pm_qos_class, new_value); > > if (new_value != req->node.prio) > pm_qos_update_target( > pm_qos_array[req->pm_qos_class]->constraints, > - &req->node, PM_QOS_UPDATE_REQ, new_value); > + &req->node, req->cpu, PM_QOS_UPDATE_REQ, new_value); > } > > /** > @@ -326,6 +423,7 @@ static void pm_qos_work_fn(struct work_struct *work) > * pm_qos_add_request - inserts new qos request into the list > * @req: pointer to a preallocated handle > * @pm_qos_class: identifies which list of qos request to use > + * @cpu : cpu which the request belong to. -1 for global request > * @value: defines the qos request > * > * This function inserts a new entry in the pm_qos_class list of requested qos > @@ -334,9 +432,8 @@ static void pm_qos_work_fn(struct work_struct *work) > * handle. Caller needs to save this handle for later use in updates and > * removal. > */ > - > -void pm_qos_add_request(struct pm_qos_request *req, > - int pm_qos_class, s32 value) > +void pm_qos_add_request_cpu(struct pm_qos_request *req, > + int pm_qos_class, int cpu, s32 value) > { > if (!req) /*guard against callers passing in null */ > return; > @@ -346,12 +443,13 @@ void pm_qos_add_request(struct pm_qos_request *req, > return; > } > req->pm_qos_class = pm_qos_class; > + req->cpu = cpu; > INIT_DELAYED_WORK(&req->work, pm_qos_work_fn); > trace_pm_qos_add_request(pm_qos_class, value); > pm_qos_update_target(pm_qos_array[pm_qos_class]->constraints, > - &req->node, PM_QOS_ADD_REQ, value); > + &req->node, cpu, PM_QOS_ADD_REQ, value); > } > -EXPORT_SYMBOL_GPL(pm_qos_add_request); > +EXPORT_SYMBOL_GPL(pm_qos_add_request_cpu); > > /** > * pm_qos_update_request - modifies an existing qos request > @@ -364,7 +462,7 @@ EXPORT_SYMBOL_GPL(pm_qos_add_request); > * Attempts are made to make this code callable on hot code paths. > */ > void pm_qos_update_request(struct pm_qos_request *req, > - s32 new_value) > + s32 new_value) > { > if (!req) /*guard against callers passing in null */ > return; > @@ -387,8 +485,8 @@ EXPORT_SYMBOL_GPL(pm_qos_update_request); > * > * After timeout_us, this qos request is cancelled automatically. 
> */ > -void pm_qos_update_request_timeout(struct pm_qos_request *req, s32 new_value, > - unsigned long timeout_us) > +void pm_qos_update_request_timeout(struct pm_qos_request *req, > + s32 new_value, unsigned long timeout_us) > { > if (!req) > return; > @@ -403,7 +501,7 @@ void pm_qos_update_request_timeout(struct pm_qos_request *req, s32 new_value, > if (new_value != req->node.prio) > pm_qos_update_target( > pm_qos_array[req->pm_qos_class]->constraints, > - &req->node, PM_QOS_UPDATE_REQ, new_value); > + &req->node, req->cpu, PM_QOS_UPDATE_REQ, new_value); > > schedule_delayed_work(&req->work, usecs_to_jiffies(timeout_us)); > } > @@ -431,7 +529,7 @@ void pm_qos_remove_request(struct pm_qos_request *req) > > trace_pm_qos_remove_request(req->pm_qos_class, PM_QOS_DEFAULT_VALUE); > pm_qos_update_target(pm_qos_array[req->pm_qos_class]->constraints, > - &req->node, PM_QOS_REMOVE_REQ, > + &req->node, req->cpu, PM_QOS_REMOVE_REQ, > PM_QOS_DEFAULT_VALUE); > memset(req, 0, sizeof(*req)); > } > @@ -529,7 +627,7 @@ static int pm_qos_power_release(struct inode *inode, struct file *filp) > return 0; > } > > - > +// TODO sysfs for non global > static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, > size_t count, loff_t *f_pos) > { > @@ -543,7 +641,7 @@ static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, > return -EINVAL; > > spin_lock_irqsave(&pm_qos_lock, flags); > - value = pm_qos_get_value(pm_qos_array[req->pm_qos_class]->constraints); > + value = pm_qos_get_value(pm_qos_array[req->pm_qos_class]->constraints, -1); > spin_unlock_irqrestore(&pm_qos_lock, flags); > > return simple_read_from_buffer(buf, count, f_pos, &value, sizeof(s32)); > @@ -581,10 +679,29 @@ static int __init pm_qos_power_init(void) > BUILD_BUG_ON(ARRAY_SIZE(pm_qos_array) != PM_QOS_NUM_CLASSES); > > for (i = PM_QOS_CPU_DMA_LATENCY; i < PM_QOS_NUM_CLASSES; i++) { > - ret = register_pm_qos_misc(pm_qos_array[i]); > + struct pm_qos_object *pm_qos = pm_qos_array[i]; > + struct pm_qos_constraints *c = pm_qos->constraints; > + > + c->percpu = alloc_percpu(struct pm_qos_constraints_percpu); > + if (!c->percpu) { > + printk(KERN_ERR "pm_qos_param: %s per cpu alloc failed\n", > + pm_qos->name); > + } else { > + int cpu; > + > + for_each_possible_cpu(cpu) { > + struct pm_qos_constraints_percpu *cc = > + per_cpu_ptr(c->percpu, cpu); > + > + plist_head_init(&cc->list); > + cc->target_value = pm_qos_read_value(c, -1); > + } > + } > + > + ret = register_pm_qos_misc(pm_qos); > if (ret < 0) { > printk(KERN_ERR "pm_qos_param: %s setup failed\n", > - pm_qos_array[i]->name); > + pm_qos->name); > return ret; > } > } > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC 1/2] pm: Introduce QoS requests per CPU 2014-03-25 18:44 ` Rafael J. Wysocki @ 2014-03-26 15:40 ` Amir Vadai 2014-03-26 17:36 ` Jeremy Eder 1 sibling, 0 replies; 11+ messages in thread From: Amir Vadai @ 2014-03-26 15:40 UTC (permalink / raw) To: Rafael J. Wysocki Cc: David S. Miller, linux-pm, netdev, Pavel Machek, Len Brown, yuvali, Or Gerlitz, Yevgeny Petrilin, idos, hadarh [This mail might be double posted due to problems I have with the mail server] On 25/03/14 19:44 +0100, Rafael J. Wysocki wrote: > On Tuesday, March 25, 2014 03:18:24 PM Amir Vadai wrote: > > Extend the current pm_qos_request API - to have pm_qos_request per core. > > When a global request is added, it is added under the global plist. > > When a core specific request is added, it is added to the core specific > > list. > > core number is saved in the request and later modify/delete operations > > are using it to access the right list. > > > > When a cpu specific request is added/removed/updated, the target value > > of the specific core is recalculated to be the min/max (according to the > > constrain type) value of all the global and the cpu specific > > constraints. > > > > If a global request is added/removed/updated, the target values of all > > the cpu's are recalculated. > > > > During initialization, before the cpu specific data structures are > > allocated and initialized, only global target value is begin used. > > I have to review this in detail (which rather won't be possible before > the next week), but in principle I don't really like it, because it > assumes that its users will know what's going to run on which CPU cores > and I'm not sure where that knowledge is going to come from. > The network driver can use affinity hint and irq balancer to set the affinity of IRQ (and rx ring) to a core. A stream is tied to an rx ring by RSS/Flow steering. So, if we have an active flow, we can know which CPU will be used. The feature could be turned off by default and be used only when the driver has all the needed information. Thank you for your time, Amir ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC 1/2] pm: Introduce QoS requests per CPU 2014-03-25 18:44 ` Rafael J. Wysocki 2014-03-26 15:40 ` Amir Vadai @ 2014-03-26 17:36 ` Jeremy Eder 2014-03-27 19:41 ` Amir Vadai 1 sibling, 1 reply; 11+ messages in thread From: Jeremy Eder @ 2014-03-26 17:36 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Amir Vadai, David S. Miller, linux-pm, netdev, Pavel Machek, Len Brown, yuvali, Or Gerlitz, Yevgeny Petrilin, idos On 140325 19:44:53, Rafael J. Wysocki wrote: > On Tuesday, March 25, 2014 03:18:24 PM Amir Vadai wrote: > > Extend the current pm_qos_request API - to have pm_qos_request per core. > > When a global request is added, it is added under the global plist. > > When a core specific request is added, it is added to the core specific > > list. > > core number is saved in the request and later modify/delete operations > > are using it to access the right list. > > > > When a cpu specific request is added/removed/updated, the target value > > of the specific core is recalculated to be the min/max (according to the > > constrain type) value of all the global and the cpu specific > > constraints. > > > > If a global request is added/removed/updated, the target values of all > > the cpu's are recalculated. > > > > During initialization, before the cpu specific data structures are > > allocated and initialized, only global target value is begin used. > > I have to review this in detail (which rather won't be possible before > the next week), but in principle I don't really like it, because it > assumes that its users will know what's going to run on which CPU cores > and I'm not sure where that knowledge is going to come from. Hi guys, I think busy_poll can accomplish the basic goals of this patch set. Stop drops due to c-state transition latency. Get into more performant c-states only on active cores with SO_BUSY_POLL or the sysctl. Whether it's system-wide or per-cpu, cpu_dma_latency wastes power and worse, it's a static thing. We need adaptable power management for the general case. I guess that might look like power-aware scheduling, or wiring menu.c to incorporate hints from drivers/userspace. cpu_dma_latency reduces TDP headroom because non-active cores are in unnecessarily high c-states, reduces the amount of turbo boost you can have, and thus reduces performance of (i.e.) low-thread-count workloads. busy_poll has another positive side-effect; it's even more granular (thus more power friendly) than the percpu idea: it will only affect cores that have active sockets on them. When the sockets aren't active, the core can settle into a deep c-state, and possibly the socket can settle into a deeper package c-state. There's some data in the blog post that Jesper sent. I also want to mention that this "class" of issue is not particularly related to networking. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC 1/2] pm: Introduce QoS requests per CPU 2014-03-26 17:36 ` Jeremy Eder @ 2014-03-27 19:41 ` Amir Vadai 0 siblings, 0 replies; 11+ messages in thread From: Amir Vadai @ 2014-03-27 19:41 UTC (permalink / raw) To: Jeremy Eder, Rafael J. Wysocki Cc: David S. Miller, linux-pm, netdev, Pavel Machek, Len Brown, yuvali, Or Gerlitz, Yevgeny Petrilin, idos On 3/26/2014 7:36 PM, Jeremy Eder wrote: > On 140325 19:44:53, Rafael J. Wysocki wrote: >> On Tuesday, March 25, 2014 03:18:24 PM Amir Vadai wrote: >>> Extend the current pm_qos_request API - to have pm_qos_request per core. >>> When a global request is added, it is added under the global plist. >>> When a core specific request is added, it is added to the core specific >>> list. >>> core number is saved in the request and later modify/delete operations >>> are using it to access the right list. >>> >>> When a cpu specific request is added/removed/updated, the target value >>> of the specific core is recalculated to be the min/max (according to the >>> constrain type) value of all the global and the cpu specific >>> constraints. >>> >>> If a global request is added/removed/updated, the target values of all >>> the cpu's are recalculated. >>> >>> During initialization, before the cpu specific data structures are >>> allocated and initialized, only global target value is begin used. >> >> I have to review this in detail (which rather won't be possible before >> the next week), but in principle I don't really like it, because it >> assumes that its users will know what's going to run on which CPU cores >> and I'm not sure where that knowledge is going to come from. > > Hi guys, > > I think busy_poll can accomplish the basic goals of this patch > set. Stop drops due to c-state transition latency. Get into more performant > c-states only on active cores with SO_BUSY_POLL or the sysctl. > > Whether it's system-wide or per-cpu, cpu_dma_latency wastes power and > worse, it's a static thing. We need adaptable power management for the > general case. I guess that might look like power-aware scheduling, or wiring > menu.c to incorporate hints from drivers/userspace. > > cpu_dma_latency reduces TDP headroom because non-active cores are in > unnecessarily high c-states, reduces the amount of turbo boost you can have, > and thus reduces performance of (i.e.) low-thread-count workloads. > > busy_poll has another positive side-effect; it's even more granular (thus > more power friendly) than the percpu idea: it will only affect cores that > have active sockets on them. When the sockets aren't active, the core can > settle into a deep c-state, and possibly the socket can settle into a deeper > package c-state. There's some data in the blog post that Jesper sent. > > I also want to mention that this "class" of issue is not particularly > related to networking. > Thanks Jeremy, it was very interesting talking to you over the phone. We agree that it should solve the packets drops and power consumption issue - still a bit afraid that it will have negative influence on the CPU utilization. We will run some tests using busy-poll and gather data, and see if the per-cpu path is still needed. Amir ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC 2/2] net/mlx4_en: Use pm_qos API to avoid packet loss in high CPU c-states 2014-03-25 13:18 [RFC 0/2] pm,net: Introduce QoS requests per CPU Amir Vadai 2014-03-25 13:18 ` [RFC 1/2] pm: " Amir Vadai @ 2014-03-25 13:18 ` Amir Vadai 2014-03-25 15:14 ` [RFC 0/2] pm,net: Introduce QoS requests per CPU Eric Dumazet 2 siblings, 0 replies; 11+ messages in thread From: Amir Vadai @ 2014-03-25 13:18 UTC (permalink / raw) To: David S. Miller Cc: linux-pm, netdev, Pavel Machek, Rafael J. Wysocki, Len Brown, yuvali, Or Gerlitz, Yevgeny Petrilin, idos, Amir Vadai TODO: change this patch to use the new per core api In order to avoid packet loss during traffic, we need to limit the CPU wake up time, as long as we have work to do. We restore the system default when there's no traffic. Feature could be enabled/disabled using a private flag in ethtool: $ ethtool --set-priv-flags eth10 pm_qos_request_low_latency on $ ethtool --show-priv-flags eth10 Private flags for eth10: pm_qos_request_low_latency: on Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> --- drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 37 +++++++++++++++++++++++ drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 40 +++++++++++++++++++++++++ drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 +++++ drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 13 ++++++++ 4 files changed, 97 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c index 3e8d336..5a43992 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c @@ -34,6 +34,7 @@ #include <linux/kernel.h> #include <linux/ethtool.h> #include <linux/netdevice.h> +#include <linux/pm_qos.h> #include <linux/mlx4/driver.h> #include <linux/in.h> #include <net/ip.h> @@ -98,6 +99,10 @@ mlx4_en_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *drvinfo) drvinfo->eedump_len = 0; } +static const char mlx4_en_priv_flags[][ETH_GSTRING_LEN] = { + "pm_qos_request_low_latency", +}; + static const char main_strings[][ETH_GSTRING_LEN] = { "rx_packets", "tx_packets", "rx_bytes", "tx_bytes", "rx_errors", "tx_errors", "rx_dropped", "tx_dropped", "multicast", "collisions", @@ -235,6 +240,8 @@ static int mlx4_en_get_sset_count(struct net_device *dev, int sset) case ETH_SS_TEST: return MLX4_EN_NUM_SELF_TEST - !(priv->mdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_UC_LOOPBACK) * 2; + case ETH_SS_PRIV_FLAGS: + return ARRAY_SIZE(mlx4_en_priv_flags); default: return -EOPNOTSUPP; } @@ -358,6 +365,10 @@ static void mlx4_en_get_strings(struct net_device *dev, #endif } break; + case ETH_SS_PRIV_FLAGS: + for (i = 0; i < ARRAY_SIZE(mlx4_en_priv_flags); i++) + strcpy(data + i * ETH_GSTRING_LEN, mlx4_en_priv_flags[i]); + break; } } @@ -1201,6 +1212,29 @@ static int mlx4_en_get_ts_info(struct net_device *dev, return ret; } +int mlx4_en_set_priv_flags(struct net_device *dev, u32 flag) +{ + struct mlx4_en_priv *priv = netdev_priv(dev); + int cpu = smp_processor_id(); + u32 flags = priv->pflags; + + if (flags & MLX4_EN_PRIV_FLAGS_PM_QOS) { + priv->pflags |= MLX4_EN_PRIV_FLAGS_PM_QOS; + } else { + pm_qos_update_request(&priv->pm_qos_req, PM_QOS_DEFAULT_VALUE); + priv->last_cpu_dma_latency = PM_QOS_DEFAULT_VALUE; + priv->pflags &= ~MLX4_EN_PRIV_FLAGS_PM_QOS; + } + + return 0; +} + +u32 mlx4_en_get_priv_flags(struct net_device *dev) +{ + struct mlx4_en_priv *priv = netdev_priv(dev); + return priv->pflags; +} + const struct ethtool_ops mlx4_en_ethtool_ops = { 
.get_drvinfo = mlx4_en_get_drvinfo, .get_settings = mlx4_en_get_settings, @@ -1228,6 +1262,9 @@ const struct ethtool_ops mlx4_en_ethtool_ops = { .get_channels = mlx4_en_get_channels, .set_channels = mlx4_en_set_channels, .get_ts_info = mlx4_en_get_ts_info, + + .set_priv_flags = mlx4_en_set_priv_flags, + .get_priv_flags = mlx4_en_get_priv_flags, }; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 84a96f7..fb3fe6d 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -1404,6 +1404,7 @@ static void mlx4_en_set_default_moderation(struct mlx4_en_priv *priv) priv->sample_interval = MLX4_EN_SAMPLE_INTERVAL; priv->adaptive_rx_coal = 1; priv->last_moder_jiffies = 0; + priv->last_cstate_jiffies = 0; priv->last_moder_tx_packets = 0; } @@ -1470,6 +1471,33 @@ static void mlx4_en_auto_moderation(struct mlx4_en_priv *priv) priv->last_moder_jiffies = jiffies; } +static void mlx4_en_default_cstate(struct mlx4_en_priv *priv) +{ + unsigned long period = (unsigned long) + (jiffies - priv->last_cstate_jiffies); + unsigned long packets = 0; + unsigned long global_rate = 0; + int ring; + + if (!(priv->pflags & MLX4_EN_PRIV_FLAGS_PM_QOS) || + period < priv->sample_interval * HZ) + return; + + for (ring = 0; ring < priv->rx_ring_num; ring++) + packets += priv->rx_ring[ring]->packets; + + global_rate = (packets - priv->last_packets) * HZ / period; + priv->last_packets = packets; + + if ((global_rate < MLX4_EN_RX_RATE_THRESH) && + (priv->last_cpu_dma_latency == 0)) { + pm_qos_update_request(&priv->pm_qos_req, + PM_QOS_DEFAULT_VALUE); + priv->last_cpu_dma_latency = PM_QOS_DEFAULT_VALUE; + } + priv->last_cstate_jiffies = jiffies; +} + static void mlx4_en_do_get_stats(struct work_struct *work) { struct delayed_work *delay = to_delayed_work(work); @@ -1486,6 +1514,7 @@ static void mlx4_en_do_get_stats(struct work_struct *work) en_dbg(HW, priv, "Could not update stats\n"); mlx4_en_auto_moderation(priv); + mlx4_en_default_cstate(priv); } queue_delayed_work(mdev->workqueue, &priv->stats_task, STATS_DELAY); @@ -2073,6 +2102,10 @@ void mlx4_en_destroy_netdev(struct net_device *dev) mdev->pndev[priv->port] = NULL; mutex_unlock(&mdev->state_lock); + /* Remove pm_qos request, after reseting to default value */ + pm_qos_update_request(&priv->pm_qos_req, PM_QOS_DEFAULT_VALUE); + pm_qos_remove_request(&priv->pm_qos_req); + mlx4_en_free_resources(priv); kfree(priv->tx_ring); @@ -2452,6 +2485,13 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port, en_err(priv, "Failed to allocate page for rx qps\n"); goto out; } + + /* Initialize pm_qos request object */ + priv->last_cpu_dma_latency = PM_QOS_DEFAULT_VALUE; + pm_qos_add_request(&priv->pm_qos_req, + PM_QOS_CPU_DMA_LATENCY, + PM_QOS_DEFAULT_VALUE); + priv->allocated = 1; /* diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 890922c..1823119 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -865,6 +865,13 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget) if (!mlx4_en_cq_lock_napi(cq)) return budget; + /* Request best DMA latency possible from CPU while in traffic */ + if (priv->pflags & MLX4_EN_PRIV_FLAGS_PM_QOS && + priv->last_cpu_dma_latency != 0) { + pm_qos_update_request(&priv->pm_qos_req, 0); + priv->last_cpu_dma_latency = 0; + } + done = mlx4_en_process_rx_cq(dev, cq, budget); mlx4_en_cq_unlock_napi(cq); diff --git 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index b57e8c8..72f0337 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -41,6 +41,8 @@
 #include <linux/netdevice.h>
 #include <linux/if_vlan.h>
 #include <linux/net_tstamp.h>
+#include <linux/pm_qos.h>
+
 #ifdef CONFIG_MLX4_EN_DCB
 #include <linux/dcbnl.h>
 #endif
@@ -478,6 +480,10 @@ enum {
 	MLX4_EN_FLAG_FORCE_PROMISC = (1 << 4)
 };
 
+enum {
+	MLX4_EN_PRIV_FLAGS_PM_QOS = (1 << 0)
+};
+
 #define MLX4_EN_MAC_HASH_SIZE (1 << BITS_PER_BYTE)
 #define MLX4_EN_MAC_HASH_IDX 5
 
@@ -513,6 +519,12 @@ struct mlx4_en_priv {
 	u32 loopback_ok;
 	u32 validate_loopback;
 
+	/* pm_qos related variables */
+	unsigned long last_cstate_jiffies;
+	unsigned long last_packets;
+	int last_cpu_dma_latency;
+	struct pm_qos_request pm_qos_req;
+
 	struct mlx4_hwq_resources res;
 	int link_state;
 	int last_link_state;
@@ -530,6 +542,7 @@ struct mlx4_en_priv {
 	struct mlx4_en_rss_map rss_map;
 	__be32 ctrl_flags;
 	u32 flags;
+	u32 pflags;
 	u8 num_tx_rings_p_up;
 	u32 tx_ring_num;
 	u32 rx_ring_num;
-- 
1.8.3.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread
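A minimal sketch of how the RX path might look once the TODO above is done
and the driver moves to the per-CPU request API proposed in patch 1/2. The
pm_qos_add_request_cpu() helper, its signature, and the per-ring fields used
below (pm_qos_req, affinity_cpu, last_cpu_dma_latency in struct
mlx4_en_rx_ring) are assumptions made for illustration only; the RFC has not
fixed the final interface, so treat this as a sketch rather than the driver's
actual code:

/* Sketch only: pm_qos_add_request_cpu() and the per-ring fields are
 * assumed here, not part of the posted patches. The idea is that the
 * request remembers which CPU it applies to, so later updates touch
 * only that CPU's c-state target instead of the global one.
 */
#include <linux/pm_qos.h>

static void mlx4_en_ring_qos_init(struct mlx4_en_rx_ring *ring, int cpu)
{
	ring->affinity_cpu = cpu;	/* CPU that runs this ring's NAPI poll */
	ring->last_cpu_dma_latency = PM_QOS_DEFAULT_VALUE;
	pm_qos_add_request_cpu(&ring->pm_qos_req, PM_QOS_CPU_DMA_LATENCY,
			       PM_QOS_DEFAULT_VALUE, ring->affinity_cpu);
}

/* Called from mlx4_en_poll_rx_cq() while traffic is flowing: ask for the
 * lowest exit latency, but only on the CPU that services this ring. */
static void mlx4_en_ring_qos_busy(struct mlx4_en_priv *priv,
				  struct mlx4_en_rx_ring *ring)
{
	if ((priv->pflags & MLX4_EN_PRIV_FLAGS_PM_QOS) &&
	    ring->last_cpu_dma_latency != 0) {
		pm_qos_update_request(&ring->pm_qos_req, 0);
		ring->last_cpu_dma_latency = 0;
	}
}

/* Called from the stats/moderation work once the RX rate drops below the
 * threshold: relax this ring's constraint back to the default. */
static void mlx4_en_ring_qos_idle(struct mlx4_en_rx_ring *ring)
{
	if (ring->last_cpu_dma_latency == 0) {
		pm_qos_update_request(&ring->pm_qos_req, PM_QOS_DEFAULT_VALUE);
		ring->last_cpu_dma_latency = PM_QOS_DEFAULT_VALUE;
	}
}

With something along these lines, mlx4_en_default_cstate() would walk the
rings and relax each ring's request individually, instead of keeping the
single priv-wide request used in the patch above.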
* Re: [RFC 0/2] pm,net: Introduce QoS requests per CPU
  2014-03-25 13:18 [RFC 0/2] pm,net: Introduce QoS requests per CPU Amir Vadai
  2014-03-25 13:18 ` [RFC 1/2] pm: " Amir Vadai
  2014-03-25 13:18 ` [RFC 2/2] net/mlx4_en: Use pm_qos API to avoid packet loss in high CPU c-states Amir Vadai
@ 2014-03-25 15:14 ` Eric Dumazet
  2014-03-25 22:47   ` Ben Hutchings
  2014-03-26 15:42   ` Amir Vadai
  2 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2014-03-25 15:14 UTC (permalink / raw)
To: Amir Vadai
Cc: David S. Miller, linux-pm, netdev, Pavel Machek, Rafael J. Wysocki,
	Len Brown, yuvali, Or Gerlitz, Yevgeny Petrilin, idos

On Tue, 2014-03-25 at 15:18 +0200, Amir Vadai wrote:

> The current pm_qos implementation has a problem. During a short pause in a high
> bandwidth traffic, the kernel can lower the c-state to preserve energy.
> When the pause ends, and the traffic resumes, the NIC hardware buffers may be
> overflowed before the CPU starts to process the traffic due to the CPU wake-up
> latency.

This is the point I never understood with mlx4.

RX ring buffers should allow the NIC to buffer quite a large amount of
incoming frames. But apparently we miss frames, even in a single TCP
flow. I really can't understand why, as the sender in my case does not
have more than 90 packets in flight (cwnd is limited to 90).

# ethtool -S eth0 | grep error
     rx_errors: 268
     tx_errors: 0
     rx_length_errors: 0
     rx_over_errors: 40
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 40
     rx_missed_errors: 40
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0

# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             8192
RX Mini:        0
RX Jumbo:       0
TX:             8192
Current hardware settings:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [RFC 0/2] pm,net: Introduce QoS requests per CPU
  2014-03-25 15:14 ` [RFC 0/2] pm,net: Introduce QoS requests per CPU Eric Dumazet
@ 2014-03-25 22:47   ` Ben Hutchings
  2014-03-26  7:12     ` Yevgeny Petrilin
  0 siblings, 1 reply; 11+ messages in thread
From: Ben Hutchings @ 2014-03-25 22:47 UTC (permalink / raw)
To: Eric Dumazet
Cc: Amir Vadai, David S. Miller, linux-pm, netdev, Pavel Machek,
	Rafael J. Wysocki, Len Brown, yuvali, Or Gerlitz,
	Yevgeny Petrilin, idos

[-- Attachment #1: Type: text/plain, Size: 1740 bytes --]

On Tue, 2014-03-25 at 08:14 -0700, Eric Dumazet wrote:
> On Tue, 2014-03-25 at 15:18 +0200, Amir Vadai wrote:
>
> > The current pm_qos implementation has a problem. During a short pause in a high
> > bandwidth traffic, the kernel can lower the c-state to preserve energy.
> > When the pause ends, and the traffic resumes, the NIC hardware buffers may be
> > overflowed before the CPU starts to process the traffic due to the CPU wake-up
> > latency.
>
> This is the point I never understood with mlx4
>
> RX ring buffers should allow NIC to buffer quite a large amount of
> incoming frames. But apparently we miss frames, even in a single TCP
> flow. I really cant understand why, as sender in my case do not have
> more than 90 packets in flight (cwnd is limited to 90)
[...]

The time taken for software to clean the RX ring is only half the story.

A DMA write requires every CPU's cache controller to invalidate affected
cache lines. It may also require reading from cache, if the write
covers only part of a cache line. So at least the cache controllers
need to be woken from sleep, and until then all DMA writes must be
buffered in some combination of CPUs, bridges and the network
controller's RX FIFOs. If those buffers aren't long enough for the
delay, packets will be dropped. (Ethernet flow control may help with
this, if enabled.)

Back in 2007, colleagues at Solarflare measured DMA write delays of
about 10 us when CPUs had to be woken up, rising to 40-50 us for one
buggy Intel model. This motivated a large increase to RX FIFO size in
the SFC4000B and subsequent controllers.

Ben.

-- 
Ben Hutchings
Make three consecutive correct guesses and you will be considered an expert.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread
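To put rough numbers on those delays (an illustrative calculation, assuming
a 10 Gb/s link rather than anything measured in this thread): at 10 Gb/s the
NIC receives about 1.25 GB/s, so a 40 us DMA-write stall corresponds to
roughly 1.25e9 * 40e-6 = 50 KB, i.e. a few dozen full-size 1500-byte frames
that must be absorbed by NIC and bridge buffering until the CPUs wake. At
40 Gb/s the same stall needs around 200 KB of buffering.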
* RE: [RFC 0/2] pm,net: Introduce QoS requests per CPU
  2014-03-25 22:47 ` Ben Hutchings
@ 2014-03-26 7:12   ` Yevgeny Petrilin
  0 siblings, 0 replies; 11+ messages in thread
From: Yevgeny Petrilin @ 2014-03-26 7:12 UTC (permalink / raw)
To: Ben Hutchings, Eric Dumazet
Cc: Amir Vadai, David S. Miller, linux-pm@vger.kernel.org,
	netdev@vger.kernel.org, Pavel Machek, Rafael J. Wysocki,
	Len Brown, Yuval Itkin, Or Gerlitz, Ido Shamay

> > > The current pm_qos implementation has a problem. During a short pause in a high
> > > bandwidth traffic, the kernel can lower the c-state to preserve energy.
> > > When the pause ends, and the traffic resumes, the NIC hardware buffers may be
> > > overflowed before the CPU starts to process the traffic due to the CPU wake-up
> > > latency.
> >
> > This is the point I never understood with mlx4
> >
> > RX ring buffers should allow NIC to buffer quite a large amount of
> > incoming frames. But apparently we miss frames, even in a single TCP
> > flow. I really cant understand why, as sender in my case do not have
> > more than 90 packets in flight (cwnd is limited to 90)
> [...]
>
> The time taken for software to clean the RX ring is only half the story.
>
> A DMA write requires every CPU's cache controller to invalidate affected
> cache lines. It may also require reading from cache, if the write
> covers only part of a cache line. So at least the cache controllers
> need to be woken from sleep, and until then all DMA writes must be
> buffered in some combination of CPUs, bridges and the network
> controller's RX FIFOs. If those buffers aren't long enough for the
> delay, packets will be dropped. (Ethernet flow control may help with
> this, if enabled.)
>
> Back in 2007, colleagues at Solarflare measured DMA write delays of
> about 10 us when CPUs had to be woken up, rising to 40-50 us for one
> buggy Intel model. This motivated a large increase to RX FIFO size in
> the SFC4000B and subsequent controllers.
>

This is exactly the story here.
Not all NICs have HW buffers which are big enough. And for those that
don't, changing c-states would cause packet drops with bursty traffic,
even if the number of packets is lower than the ring size.
Indeed, flow control could prevent those drops, but in many cases it is
not enabled.

Yevgeny

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [RFC 0/2] pm,net: Introduce QoS requests per CPU
  2014-03-25 15:14 ` [RFC 0/2] pm,net: Introduce QoS requests per CPU Eric Dumazet
  2014-03-25 22:47   ` Ben Hutchings
@ 2014-03-26 15:42   ` Amir Vadai
  1 sibling, 0 replies; 11+ messages in thread
From: Amir Vadai @ 2014-03-26 15:42 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, linux-pm, netdev, Pavel Machek, Rafael J. Wysocki,
	Len Brown, yuvali, Or Gerlitz, Yevgeny Petrilin, idos

[This mail might be double posted due to problems I have with the mail server]

On 25/03/14 08:14 -0700, Eric Dumazet wrote:
> On Tue, 2014-03-25 at 15:18 +0200, Amir Vadai wrote:
>
> > The current pm_qos implementation has a problem. During a short pause in a high
> > bandwidth traffic, the kernel can lower the c-state to preserve energy.
> > When the pause ends, and the traffic resumes, the NIC hardware buffers may be
> > overflowed before the CPU starts to process the traffic due to the CPU wake-up
> > latency.
>
> This is the point I never understood with mlx4
>
> RX ring buffers should allow NIC to buffer quite a large amount of
> incoming frames. But apparently we miss frames, even in a single TCP
> flow. I really cant understand why, as sender in my case do not have
> more than 90 packets in flight (cwnd is limited to 90)

Hi,

We would like to nail down the errors you experience.

> # ethtool -S eth0 | grep error
>      rx_errors: 268

This is an indication of a bad cable.

>      tx_errors: 0
>      rx_length_errors: 0
>      rx_over_errors: 40
>      rx_crc_errors: 0
>      rx_frame_errors: 0
>      rx_fifo_errors: 40
>      rx_missed_errors: 40

Did you experience the rx_over_errors, rx_fifo_errors and
rx_missed_errors on a setup where rx_errors is 0?
The above 3 counters are actually the same HW counter, which indicates
that the HW buffer is full - probably, as Ben indicated, because the
DMA wasn't fast enough.

>      tx_aborted_errors: 0
>      tx_carrier_errors: 0
>      tx_fifo_errors: 0
>      tx_heartbeat_errors: 0
>      tx_window_errors: 0
> # ethtool -g eth0
> Ring parameters for eth0:
> Pre-set maximums:
> RX:             8192
> RX Mini:        0
> RX Jumbo:       0
> TX:             8192
> Current hardware settings:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       0
> TX:             4096

This is relevant to the buffers in host memory; the error statistics
above indicate that the problem is in the HW buffers on the NIC memory.

Assuming that there are no cable issues here, please give us
instructions on how to reproduce the issue.

Just to make sure, are you running with flow control disabled?
When flow control is enabled, we didn't see any errors - with single-
or multi-stream traffic.
When flow control is disabled, we didn't see any errors on a single
stream of 27 Gb/s. Only with multi-stream traffic (full line rate) did
we see drops - but that is expected. In any case we didn't get
rx_errors.

Amir

^ permalink raw reply	[flat|nested] 11+ messages in thread
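For reference, pause-frame (802.3x) flow control can be queried and toggled
with ethtool; the interface name below is only an example:

$ ethtool -a eth0                # show current pause parameters
$ ethtool -A eth0 rx on tx on    # enable RX/TX flow control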
end of thread, other threads:[~2014-03-27 19:41 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-25 13:18 [RFC 0/2] pm,net: Introduce QoS requests per CPU Amir Vadai
2014-03-25 13:18 ` [RFC 1/2] pm: " Amir Vadai
2014-03-25 18:44   ` Rafael J. Wysocki
2014-03-26 15:40     ` Amir Vadai
2014-03-26 17:36       ` Jeremy Eder
2014-03-27 19:41         ` Amir Vadai
2014-03-25 13:18 ` [RFC 2/2] net/mlx4_en: Use pm_qos API to avoid packet loss in high CPU c-states Amir Vadai
2014-03-25 15:14 ` [RFC 0/2] pm,net: Introduce QoS requests per CPU Eric Dumazet
2014-03-25 22:47   ` Ben Hutchings
2014-03-26 7:12     ` Yevgeny Petrilin
2014-03-26 15:42   ` Amir Vadai