Linux Documentation

* [PATCH v10 5/9] cpuset: Make sure that domain roots work properly with CPU hotplug
From: Waiman Long @ 2018-06-18  4:14 UTC
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long
In-Reply-To: <1529295249-5207-1-git-send-email-longman@redhat.com>

When there is a CPU hotplug event (CPU online or offline), the scheduling
domains need to be reconfigured and regenerated. So code is added to
the hotplug functions to make them work with the new reserved_cpus mask
and compute the right effective_cpus for each of the affected cpusets.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  7 +++++++
 kernel/cgroup/cpuset.c                  | 26 ++++++++++++++++++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 5ee5e77..6ef3516 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1626,6 +1626,13 @@ Cpuset Interface Files
 	2) No CPU that has been distributed to child scheduling domain
 	   roots is deleted.
 
+	When all the CPUs allocated to a scheduling domain are offlined,
+	that scheduling domain will be temporarily gone and all the
+	tasks in that scheduling domain will migrate to another one that
+	belongs to the parent of the scheduling domain root.  When any
+	of those offlined CPUs is onlined again, a new scheduling domain
+	will be re-created and the tasks will be migrated back.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b1abe3d..26ac083 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -900,7 +900,8 @@ static void update_tasks_cpumask(struct cpuset *cs)
  * @parent: the parent cpuset
  *
  * If the parent has reserved CPUs, include them in the list of allowable
- * CPUs in computing the new effective_cpus mask.
+ * CPUs in computing the new effective_cpus mask. The cpu_active_mask is
+ * used to mask off cpus that are to be offlined.
  */
 static void compute_effective_cpumask(struct cpumask *new_cpus,
 				      struct cpuset *cs, struct cpuset *parent)
@@ -909,6 +910,7 @@ static void compute_effective_cpumask(struct cpumask *new_cpus,
 		cpumask_or(new_cpus, parent->effective_cpus,
 			   parent->reserved_cpus);
 		cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
+		cpumask_and(new_cpus, new_cpus, cpu_active_mask);
 	} else {
 		cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
 	}
@@ -2571,9 +2573,17 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
 		goto retry;
 	}
 
-	cpumask_and(&new_cpus, cs->cpus_allowed, parent_cs(cs)->effective_cpus);
+	compute_effective_cpumask(&new_cpus, cs, parent_cs(cs));
 	nodes_and(new_mems, cs->mems_allowed, parent_cs(cs)->effective_mems);
 
+	if (cs->nr_reserved) {
+		/*
+		 * Some of the CPUs may have been distributed to child
+		 * domain roots. So we need to skip those when computing the
+		 * real effective cpus.
+		 */
+		cpumask_andnot(&new_cpus, &new_cpus, cs->reserved_cpus);
+	}
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
 
@@ -2623,6 +2633,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	cpumask_copy(&new_cpus, cpu_active_mask);
 	new_mems = node_states[N_MEMORY];
 
+	/*
+	 * If reserved_cpus is populated, it is likely that the check below
+	 * will produce a false positive on cpus_updated when the cpu list
+	 * isn't changed. It is extra work, but it is better to be safe.
+	 */
 	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
 	mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
@@ -2631,6 +2646,13 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+		/*
+		 * Make sure that the reserved cpus aren't in the
+		 * effective cpus.
+		 */
+		if (top_cpuset.nr_reserved)
+			cpumask_andnot(&new_cpus, &new_cpus,
+					top_cpuset.reserved_cpus);
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
 		spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
-- 
1.8.3.1
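
The effective-CPU computation that this patch wires into the hotplug path
can be sketched in userspace as below. This is only an illustrative model,
not kernel code: the struct cpuset_model type and the model_* functions are
hypothetical, each mask is a plain 64-bit word with one bit per CPU, and the
branch condition (the parent holds reserved CPUs) is assumed from the
function comment because it is not visible in the hunk.

```c
#include <stdint.h>

/* Hypothetical userspace model of a cpuset: one bit per CPU. */
struct cpuset_model {
	uint64_t cpus_allowed;   /* requested CPUs ("cpuset.cpus") */
	uint64_t effective_cpus; /* CPUs currently usable by tasks */
	uint64_t reserved_cpus;  /* CPUs handed to child domain roots */
};

/*
 * Models compute_effective_cpumask() after the patch: when the parent has
 * reserved CPUs (assumed condition), they are allowable too, and CPUs that
 * are not active are masked off.
 */
uint64_t model_compute_effective(const struct cpuset_model *cs,
				 const struct cpuset_model *parent,
				 uint64_t active)
{
	if (parent->reserved_cpus)
		return (parent->effective_cpus | parent->reserved_cpus) &
		       cs->cpus_allowed & active;
	return cs->cpus_allowed & parent->effective_cpus;
}

/*
 * Models the hotplug update: CPUs already distributed to child domain
 * roots are removed from this cpuset's own effective mask.
 */
uint64_t model_hotplug_effective(const struct cpuset_model *cs,
				 const struct cpuset_model *parent,
				 uint64_t active)
{
	return model_compute_effective(cs, parent, active) &
	       ~cs->reserved_cpus;
}
```

For example, with a parent whose effective CPUs are 0-3 and reserved CPUs
are 4-5, a child that requests CPUs 4-5 ends up with only CPU 4 once CPU 5
goes offline.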

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* [PATCH v10 6/9] cpuset: Make generate_sched_domains() recognize isolated_cpus
From: Waiman Long @ 2018-06-18  4:14 UTC
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long
In-Reply-To: <1529295249-5207-1-git-send-email-longman@redhat.com>

The generate_sched_domains() function and the hotplug code are modified
to make them use the newly introduced isolated_cpus mask for scheduling
domain generation.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index cfc9b7b..5ee4239 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	int ndoms = 0;		/* number of sched domains in result */
 	int nslot;		/* next empty doms[] struct cpumask slot */
 	struct cgroup_subsys_state *pos_css;
+	bool root_load_balance = is_sched_load_balance(&top_cpuset);
 
 	doms = NULL;
 	dattr = NULL;
 	csa = NULL;
 
 	/* Special case for the 99% of systems with one, full, sched domain */
-	if (is_sched_load_balance(&top_cpuset)) {
+	if (root_load_balance && !top_cpuset.isolation_count) {
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
 		if (!doms)
@@ -701,6 +702,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	csn = 0;
 
 	rcu_read_lock();
+	if (root_load_balance)
+		csa[csn++] = &top_cpuset;
 	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
 		if (cp == &top_cpuset)
 			continue;
@@ -711,6 +714,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		 * parent's cpus, so just skip them, and then we call
 		 * update_domain_attr_tree() to calc relax_domain_level of
 		 * the corresponding sched domain.
+		 *
+		 * If root is load-balancing, we can skip @cp if it
+		 * is a subset of the root's effective_cpus.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
@@ -718,11 +724,16 @@ static int generate_sched_domains(cpumask_var_t **domains,
 					 housekeeping_cpumask(HK_FLAG_DOMAIN))))
 			continue;
 
+		if (root_load_balance &&
+		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+			continue;
+
 		if (is_sched_load_balance(cp))
 			csa[csn++] = cp;
 
-		/* skip @cp's subtree */
-		pos_css = css_rightmost_descendant(pos_css);
+		/* skip @cp's subtree if not a scheduling domain root */
+		if (!is_sched_domain_root(cp))
+			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
 
@@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
 	 * passing doms with offlined cpu to partition_sched_domains().
 	 * Anyways, hotplug work item will rebuild sched domains.
 	 */
-	if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+	if (!top_cpuset.isolation_count &&
+	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+		goto out;
+
+	if (top_cpuset.isolation_count &&
+	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
 		goto out;
 
 	/* Generate domain masks and attrs */
-- 
1.8.3.1
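
The changed bail-out test in rebuild_sched_domains_locked() can be modelled
as follows. This is a userspace sketch under stated assumptions: the
function name model_defer_rebuild() is hypothetical, and the masks are
64-bit words rather than struct cpumask.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the rebuild should be skipped and left to the hotplug
 * work item. Without isolated CPUs the top cpuset's effective mask must
 * equal the active mask; with isolated CPUs it only has to be a subset,
 * since the isolated CPUs are active but no longer in effective_cpus.
 */
bool model_defer_rebuild(uint64_t effective, uint64_t active,
			 int isolation_count)
{
	if (!isolation_count)
		return effective != active;
	return (effective & ~active) != 0;
}
```

With isolation in use, an effective mask of CPUs 0-2 against an active mask
of CPUs 0-3 no longer defers the rebuild, while an effective mask that still
names an offlined CPU does.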

* [PATCH v10 4/9] cpuset: Allow changes to cpus in a domain root
From: Waiman Long @ 2018-06-18  4:14 UTC
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long
In-Reply-To: <1529295249-5207-1-git-send-email-longman@redhat.com>

The previous patch introduces a new domain_root flag, but does not allow
changes to "cpuset.cpus" once the flag is on. That may be too
restrictive in some use cases. So this restriction is now relaxed to
allow changes to the "cpuset.cpus" file under some constraints:

 1) The new set of cpus must still be exclusive.
 2) Newly added cpus must be a proper subset of the parent's
    effective_cpus.
 3) None of the deleted cpus can be one of those allocated to
    child domain roots, if present.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  9 ++++
 kernel/cgroup/cpuset.c                  | 81 ++++++++++++++++++++++++++-------
 2 files changed, 73 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index d5e25a0..5ee5e77 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1617,6 +1617,15 @@ Cpuset Interface Files
 	There must be at least one cpu left in the parent scheduling
 	domain root cgroup.
 
+	In a scheduling domain root, changes to "cpuset.cpus" are allowed
+	as long as the first condition above as well as the following
+	two additional conditions are true.
+
+	1) Any added CPUs must be a proper subset of the parent's
+	   "cpuset.cpus.effective".
+	2) No CPU that has been distributed to child scheduling domain
+	   roots is deleted.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a1d5ccd..b1abe3d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -957,6 +957,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 
 		spin_lock_irq(&callback_lock);
 		cpumask_copy(cp->effective_cpus, new_cpus);
+		if (cp->nr_reserved)
+			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
+				       cp->reserved_cpus);
 		spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
@@ -984,24 +987,26 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 /**
  * update_reserved_cpumask - update the reserved_cpus mask of parent cpuset
  * @cpuset:  The cpuset that requests CPU reservation
- * @delmask: The old reserved cpumask to be removed from the parent
- * @addmask: The new reserved cpumask to be added to the parent
+ * @oldmask: The old reserved cpumask to be removed from the parent
+ * @newmask: The new reserved cpumask to be added to the parent
  * Return: 0 if successful, an error code otherwise
  *
  * Changes to the reserved CPUs are not allowed if any of CPUs changing
  * state are in any of the child cpusets of the parent except the requesting
  * child.
  *
- * If the sched_domain_root flag changes, either the delmask (0=>1) or the
- * addmask (1=>0) will be NULL.
+ * If the sched_domain_root flag changes, either the oldmask (0=>1) or the
+ * newmask (1=>0) will be NULL.
  *
  * Called with cpuset_mutex held. Some of the checks are skipped if the
  * cpuset is being offlined (dying).
  */
 static int update_reserved_cpumask(struct cpuset *cpuset,
-	struct cpumask *delmask, struct cpumask *addmask)
+	struct cpumask *oldmask, struct cpumask *newmask)
 {
 	int retval;
+	int adding, deleting;
+	cpumask_var_t addmask, delmask;
 	struct cpuset *parent = parent_cs(cpuset);
 	struct cpuset *sibling;
 	struct cgroup_subsys_state *pos_css;
@@ -1013,15 +1018,15 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * The new cpumask, if present, must not be empty.
 	 */
 	if (!is_sched_domain_root(parent) ||
-	   (addmask && cpumask_empty(addmask)))
+	   (newmask && cpumask_empty(newmask)))
 		return -EINVAL;
 
 	/*
-	 * The delmask, if present, must be a subset of parent's reserved
+	 * The oldmask, if present, must be a subset of parent's reserved
 	 * CPUs.
 	 */
-	if (delmask && !cpumask_empty(delmask) && (!parent->nr_reserved ||
-		       !cpumask_subset(delmask, parent->reserved_cpus))) {
+	if (oldmask && !cpumask_empty(oldmask) && (!parent->nr_reserved ||
+		       !cpumask_subset(oldmask, parent->reserved_cpus))) {
 		WARN_ON_ONCE(1);
 		return -EINVAL;
 	}
@@ -1030,9 +1035,17 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * A sched_domain_root state change is not allowed if there are
 	 * online children and the cpuset is not dying.
 	 */
-	if (!dying && css_has_online_children(&cpuset->css))
+	if (!dying && (!oldmask || !newmask) &&
+	    css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
+	if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
+		return -ENOMEM;
+	if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
+		free_cpumask_var(addmask);
+		return -ENOMEM;
+	}
+
 	if (!old_count) {
 		if (!zalloc_cpumask_var(&parent->reserved_cpus, GFP_KERNEL)) {
 			retval = -ENOMEM;
@@ -1042,12 +1055,29 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	}
 
 	retval = -EBUSY;
+	adding = deleting = false;
+	/*
+	 * addmask = newmask & ~oldmask
+	 * delmask = oldmask & ~newmask
+	 */
+	if (oldmask && newmask) {
+		adding   = cpumask_andnot(addmask, newmask, oldmask);
+		deleting = cpumask_andnot(delmask, oldmask, newmask);
+		if (!adding && !deleting)
+			goto out_ok;
+	} else if (newmask) {
+		adding = true;
+		cpumask_copy(addmask, newmask);
+	} else if (oldmask) {
+		deleting = true;
+		cpumask_copy(delmask, oldmask);
+	}
 
 	/*
 	 * The cpus to be added must be a proper subset of the parent's
 	 * effective_cpus mask but not in the reserved_cpus mask.
 	 */
-	if (addmask) {
+	if (adding) {
 		if (!cpumask_subset(addmask, parent->effective_cpus) ||
 		     cpumask_equal(addmask, parent->effective_cpus))
 			goto out;
@@ -1057,6 +1087,15 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	}
 
 	/*
+	 * For cpu changes in a domain root, cpu deletion isn't allowed
+	 * if any of the deleted CPUs is in reserved_cpus (distributed
+	 * to child domain roots).
+	 */
+	if (oldmask && newmask && cpuset->nr_reserved && deleting &&
+	    cpumask_intersects(delmask, cpuset->reserved_cpus))
+		goto out;
+
+	/*
 	 * Check if any CPUs in addmask or delmask are in the effective_cpus
 	 * of a sibling cpuset. The implied cpu_exclusive of a scheduling
 	 * domain root will ensure there are no overlap in cpus_allowed.
@@ -1070,10 +1109,10 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	cpuset_for_each_child(sibling, pos_css, parent) {
 		if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
 			continue;
-		if (addmask &&
+		if (adding &&
 		    cpumask_intersects(sibling->effective_cpus, addmask))
 			goto out_unlock;
-		if (delmask &&
+		if (deleting &&
 		    cpumask_intersects(sibling->effective_cpus, delmask))
 			goto out_unlock;
 	}
@@ -1086,13 +1125,13 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 */
 updated_reserved_cpus:
 	spin_lock_irq(&callback_lock);
-	if (addmask) {
+	if (adding) {
 		cpumask_or(parent->reserved_cpus,
 			   parent->reserved_cpus, addmask);
 		cpumask_andnot(parent->effective_cpus,
 			       parent->effective_cpus, addmask);
 	}
-	if (delmask) {
+	if (deleting) {
 		cpumask_andnot(parent->reserved_cpus,
 			       parent->reserved_cpus, delmask);
 		cpumask_or(parent->effective_cpus,
@@ -1101,8 +1140,12 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 
 	parent->nr_reserved = cpumask_weight(parent->reserved_cpus);
 	spin_unlock_irq(&callback_lock);
+
+out_ok:
 	retval = 0;
 out:
+	free_cpumask_var(addmask);
+	free_cpumask_var(delmask);
 	if (old_count && !parent->nr_reserved)
 		free_cpumask_var(parent->reserved_cpus);
 
@@ -1154,8 +1197,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
-	if (is_sched_domain_root(cs))
-		return -EBUSY;
+	if (is_sched_domain_root(cs)) {
+		retval = update_reserved_cpumask(cs, cs->cpus_allowed,
+						 trialcs->cpus_allowed);
+		if (retval < 0)
+			return retval;
+	}
 
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
-- 
1.8.3.1
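
The addmask/delmask derivation and two of the checks in the patch above can
be sketched in userspace as below. The names struct mask_delta,
model_mask_delta() and model_check_cpus_change() are hypothetical, masks are
64-bit words, and the sibling-overlap check and the exclusivity check are
deliberately left out, so this is a partial model rather than the full
update_reserved_cpumask() logic.

```c
#include <errno.h>
#include <stdint.h>

/* CPUs gained and lost when "cpuset.cpus" changes. */
struct mask_delta {
	uint64_t addmask; /* newmask & ~oldmask */
	uint64_t delmask; /* oldmask & ~newmask */
};

struct mask_delta model_mask_delta(uint64_t oldmask, uint64_t newmask)
{
	struct mask_delta d = {
		.addmask = newmask & ~oldmask,
		.delmask = oldmask & ~newmask,
	};
	return d;
}

/*
 * Returns 0 if the change passes the two modelled checks, -EBUSY otherwise:
 *  - added CPUs must be a proper subset of the parent's effective CPUs;
 *  - no deleted CPU may be reserved for a child domain root.
 */
int model_check_cpus_change(uint64_t oldmask, uint64_t newmask,
			    uint64_t parent_effective, uint64_t own_reserved)
{
	struct mask_delta d = model_mask_delta(oldmask, newmask);

	if (d.addmask &&
	    ((d.addmask & ~parent_effective) || d.addmask == parent_effective))
		return -EBUSY;
	if (d.delmask & own_reserved)
		return -EBUSY;
	return 0;
}
```

Changing a domain root from CPUs 0-3 to CPUs 2-5 yields addmask = CPUs 4-5
and delmask = CPUs 0-1; the change is refused if CPUs 4-5 are all that the
parent has left, or if CPU 0 or 1 already belongs to a child domain root.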

* [PATCH v10 3/9] cpuset: Simulate auto-off of sched.domain_root at cgroup removal
From: Waiman Long @ 2018-06-18  4:14 UTC
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long
In-Reply-To: <1529295249-5207-1-git-send-email-longman@redhat.com>

Making a cgroup a domain root will reserve cpu resource at its parent.
So when a domain root cgroup is destroyed, we need to free the
reserved cpus at its parent. This is now done by doing an auto-off of
the sched.domain_root flag in the offlining phase when a domain root
cgroup is being removed.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 68a9c25..a1d5ccd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -995,7 +995,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
  * If the sched_domain_root flag changes, either the delmask (0=>1) or the
  * addmask (1=>0) will be NULL.
  *
- * Called with cpuset_mutex held.
+ * Called with cpuset_mutex held. Some of the checks are skipped if the
+ * cpuset is being offlined (dying).
  */
 static int update_reserved_cpumask(struct cpuset *cpuset,
 	struct cpumask *delmask, struct cpumask *addmask)
@@ -1005,6 +1006,7 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	struct cpuset *sibling;
 	struct cgroup_subsys_state *pos_css;
 	int old_count = parent->nr_reserved;
+	bool dying = cpuset->css.flags & CSS_DYING;
 
 	/*
 	 * The parent must be a scheduling domain root.
@@ -1026,9 +1028,9 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 
 	/*
 	 * A sched_domain_root state change is not allowed if there are
-	 * online children.
+	 * online children and the cpuset is not dying.
 	 */
-	if (css_has_online_children(&cpuset->css))
+	if (!dying && css_has_online_children(&cpuset->css))
 		return -EBUSY;
 
 	if (!old_count) {
@@ -1058,7 +1060,12 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * Check if any CPUs in addmask or delmask are in the effective_cpus
 	 * of a sibling cpuset. The implied cpu_exclusive of a scheduling
 	 * domain root will ensure there are no overlap in cpus_allowed.
+	 *
+	 * This check is skipped if the cpuset is dying.
 	 */
+	if (dying)
+		goto updated_reserved_cpus;
+
 	rcu_read_lock();
 	cpuset_for_each_child(sibling, pos_css, parent) {
 		if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
@@ -1077,6 +1084,7 @@ static int update_reserved_cpumask(struct cpuset *cpuset,
 	 * Newly added reserved CPUs will be removed from effective_cpus
 	 * and newly deleted ones will be added back if they are online.
 	 */
+updated_reserved_cpus:
 	spin_lock_irq(&callback_lock);
 	if (addmask) {
 		cpumask_or(parent->reserved_cpus,
@@ -2278,7 +2286,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 /*
  * If the cpuset being removed has its flag 'sched_load_balance'
  * enabled, then simulate turning sched_load_balance off, which
- * will call rebuild_sched_domains_locked().
+ * will call rebuild_sched_domains_locked(). That is not needed
+ * in the default hierarchy where only changes in domain_root
+ * will cause repartitioning.
+ *
+ * If the cpuset has the 'sched.domain_root' flag enabled, simulate
+ * turning 'sched.domain_root' off.
  */
 
 static void cpuset_css_offline(struct cgroup_subsys_state *css)
@@ -2287,7 +2300,18 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 
 	mutex_lock(&cpuset_mutex);
 
-	if (is_sched_load_balance(cs))
+	/*
+	 * A WARN_ON_ONCE() check after calling update_flag() to make
+	 * sure that the operation succeeds without failure.
+	 */
+	if (is_sched_domain_root(cs)) {
+		int ret = update_flag(CS_SCHED_DOMAIN_ROOT, cs, 0);
+
+		WARN_ON_ONCE(ret);
+	}
+
+	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+	    is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
 	cpuset_dec();
-- 
1.8.3.1
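
What the auto-off does to the parent when a domain root is removed can be
modelled as follows. This is a sketch under assumptions: struct parent_masks
and model_release_reserved() are hypothetical names, masks are 64-bit words,
and the "added back if they are online" behaviour is taken from the code
comment rather than from the lines visible in the hunk.

```c
#include <stdint.h>

/* Hypothetical view of the two parent masks touched by the update. */
struct parent_masks {
	uint64_t effective_cpus;
	uint64_t reserved_cpus;
};

/*
 * Give the CPUs in delmask back to the parent: they leave reserved_cpus
 * and return to effective_cpus, limited here to CPUs that are active.
 */
struct parent_masks model_release_reserved(struct parent_masks parent,
					   uint64_t delmask, uint64_t active)
{
	parent.reserved_cpus &= ~delmask;
	parent.effective_cpus |= delmask & active;
	return parent;
}
```

If the parent had CPUs 0-3 effective and CPUs 4-5 reserved for the dying
domain root, and CPU 5 is offline, the parent ends up with CPUs 0-4
effective and nothing reserved.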

* [PATCH v10 1/9] cpuset: Enable cpuset controller in default hierarchy
From: Waiman Long @ 2018-06-18  4:14 UTC
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long
In-Reply-To: <1529295249-5207-1-git-send-email-longman@redhat.com>

Given the fact that thread mode was merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller has experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to
the memory controller or maybe to the cpu controller instead of
staying with cpuset.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 109 ++++++++++++++++++++++++++++++--
 kernel/cgroup/cpuset.c                  |  48 +++++++++++++-
 2 files changed, 149 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8a2c52d..fbc30b6 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
        5-3-2. Writeback
      5-4. PID
        5-4-1. PID Interface Files
-     5-5. Device
-     5-6. RDMA
-       5-6-1. RDMA Interface Files
-     5-7. Misc
-       5-7-1. perf_event
+     5-5. Cpuset
+       5-5-1. Cpuset Interface Files
+     5-6. Device
+     5-7. RDMA
+       5-7-1. RDMA Interface Files
+     5-8. Misc
+       5-8-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -1486,6 +1488,103 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested CPUs to be used by tasks within this
+	cgroup.  The actual list of CPUs to be granted, however, is
+	subjected to constraints imposed by its parent and can differ
+	from the requested CPUs.
+
+	The CPU numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.cpus
+	  0-4,6,8-10
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.cpus" or all the available CPUs if none is found.
+
+	The value of "cpuset.cpus" stays constant until the next update
+	and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined CPUs that are actually granted to this
+	cgroup by its parent.  These CPUs are allowed to be used by
+	tasks within the current cgroup.
+
+	If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
+	all the CPUs from the parent cgroup that can be available to
+	be used by this cgroup.  Otherwise, it should be a subset of
+	"cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
+	can be granted.  In this case, it will be treated just like an
+	empty "cpuset.cpus".
+
+	Its value will be affected by CPU hotplug events.
+
+  cpuset.mems
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the requested memory nodes to be used by tasks within
+	this cgroup.  The actual list of memory nodes granted, however,
+	is subjected to constraints imposed by its parent and can differ
+	from the requested memory nodes.
+
+	The memory node numbers are comma-separated numbers or ranges.
+	For example:
+
+	  # cat cpuset.mems
+	  0-1,3
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.mems" or all the available memory nodes if none
+	is found.
+
+	The value of "cpuset.mems" stays constant until the next update
+	and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined memory nodes that are actually granted to
+	this cgroup by its parent. These memory nodes are allowed to
+	be used by tasks within the current cgroup.
+
+	If "cpuset.mems" is empty, it shows all the memory nodes from the
+	parent cgroup that will be available to be used by this cgroup.
+	Otherwise, it should be a subset of "cpuset.mems" unless none of
+	the memory nodes listed in "cpuset.mems" can be granted.  In this
+	case, it will be treated just like an empty "cpuset.mems".
+
+	Its value will be affected by memory nodes hotplug events.
+
+
 Device controller
 -----------------
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 266f10c..2b5c447 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1824,12 +1824,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
-
 /*
  * for the common functions, 'private' gives the type of file
  */
 
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
 	{
 		.name = "cpus",
 		.seq_show = cpuset_common_seq_show,
@@ -1932,6 +1931,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 /*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+	{
+		.name = "cpus",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * MAX_NUMNODES),
+		.private = FILE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "cpus.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{ }	/* terminate */
+};
+
+
+/*
  *	cpuset_css_alloc - allocate a cpuset css
  *	cgrp:	control group that the new cpuset will be part of
  */
@@ -2105,8 +2145,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.fork		= cpuset_fork,
-	.legacy_cftypes	= files,
+	.legacy_cftypes	= legacy_files,
+	.dfl_cftypes	= dfl_files,
 	.early_init	= true,
+	.threaded	= true,
 };
 
 /**
-- 
1.8.3.1
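
The relationship between "cpuset.cpus" and "cpuset.cpus.effective" described
in the documentation hunk above can be sketched as below. The function name
model_cpus_effective() is hypothetical, masks are 64-bit words with one bit
per CPU, and the parent's effective mask is assumed to already reflect CPU
hotplug state.

```c
#include <stdint.h>

/*
 * The effective CPUs are the requested CPUs that the parent can grant.
 * An empty request, or a request of which nothing can be granted, falls
 * back to everything the parent has available.
 */
uint64_t model_cpus_effective(uint64_t requested, uint64_t parent_effective)
{
	uint64_t granted = requested & parent_effective;

	return granted ? granted : parent_effective;
}
```

A cgroup that requests CPUs 0-3 under a parent with CPUs 2-5 effective gets
CPUs 2-3; one that requests only CPUs 0-1 is treated as if "cpuset.cpus"
were empty and gets CPUs 2-5.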

* Re: [PATCH v4 22/26] devicetree: fix name of pinctrl-bindings.txt
From: Lee Jones @ 2018-06-18  5:48 UTC
  To: Mauro Carvalho Chehab
  Cc: Linux Doc Mailing List, Jonathan Corbet, Mauro Carvalho Chehab,
	linux-kernel, Rob Herring, Mark Rutland, Ulf Hansson,
	Linus Walleij, Greg Kroah-Hartman, Mark Brown, devicetree,
	linux-mmc, linux-gpio, linux-serial, linux-spi
In-Reply-To: <1832904c012f823958663464ac4fb249f77978b5.1529079120.git.mchehab+samsung@kernel.org>

On Fri, 15 Jun 2018, Mauro Carvalho Chehab wrote:

> Rename:
> 	pinctrl-binding.txt -> pinctrl-bindings.txt
> 
> In order to match the current name of this file.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> ---
>  Documentation/devicetree/bindings/media/stih407-c8sectpfe.txt | 2 +-

>  Documentation/devicetree/bindings/mfd/as3722.txt              | 2 +-

Acked-by: Lee Jones <lee.jones@linaro.org>

>  .../devicetree/bindings/mmc/microchip,sdhci-pic32.txt         | 2 +-
>  Documentation/devicetree/bindings/mmc/sdhci-st.txt            | 2 +-
>  .../devicetree/bindings/pinctrl/pinctrl-max77620.txt          | 4 ++--
>  .../devicetree/bindings/pinctrl/pinctrl-mcp23s08.txt          | 4 ++--
>  Documentation/devicetree/bindings/pinctrl/pinctrl-rk805.txt   | 4 ++--
>  .../devicetree/bindings/serial/microchip,pic32-uart.txt       | 2 +-
>  Documentation/devicetree/bindings/spi/spi-st-ssc.txt          | 2 +-
>  9 files changed, 12 insertions(+), 12 deletions(-)

-- 
Lee Jones [李琼斯]
Linaro Services Technical Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog

* Re: [PATCH v4 23/26] devicetree: fix a series of wrong file references
From: Lee Jones @ 2018-06-18  5:49 UTC
  To: Mauro Carvalho Chehab
  Cc: Linux Doc Mailing List, Jonathan Corbet, Mauro Carvalho Chehab,
	linux-kernel, Dmitry Torokhov, Rob Herring, Mark Rutland,
	Maxime Ripard, Chen-Yu Tsai, Zhou Wang, Bjorn Helgaas,
	Xiaowei Song, Binghui Wang, Liam Girdwood, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, linux-input, devicetree,
	linux-arm-kernel, linux-pci, alsa-devel
In-Reply-To: <266e5f874b36aa83f3327879d85921fab52d5461.1529079120.git.mchehab+samsung@kernel.org>

On Fri, 15 Jun 2018, Mauro Carvalho Chehab wrote:

> As files got renamed, their references broke.
> 
> Manually fix a series of broken refs at the DT bindings.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> ---
>  .../devicetree/bindings/input/rmi4/rmi_2d_sensor.txt |  2 +-
>  Documentation/devicetree/bindings/mfd/sun6i-prcm.txt |  2 +-

Acked-by: Lee Jones <lee.jones@linaro.org>

>  .../devicetree/bindings/pci/hisilicon-pcie.txt       |  2 +-
>  Documentation/devicetree/bindings/pci/kirin-pcie.txt |  2 +-
>  .../devicetree/bindings/pci/pci-keystone.txt         |  4 ++--
>  .../devicetree/bindings/sound/st,stm32-i2s.txt       |  2 +-
>  .../devicetree/bindings/sound/st,stm32-sai.txt       |  2 +-
>  MAINTAINERS                                          | 12 ++++++------
>  8 files changed, 14 insertions(+), 14 deletions(-)

-- 
Lee Jones [李琼斯]
Linaro Services Technical Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog

^ permalink raw reply

* [PATCH resend*3] VFS: simplify seq_file iteration code and interface
From: NeilBrown @ 2018-06-18  6:46 UTC (permalink / raw)
  To: Andrew Morton, Alexander Viro, Linus Torvalds
  Cc: linux-doc, linux-kernel, linux-fsdevel, Jonathan Corbet
In-Reply-To: <874lintqa6.fsf@notabene.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 11420 bytes --]


The documentation for seq_file suggests that it is necessary to be
able to move the iterator to a given offset; however, that is not the
case.  If the iterator is stored in the private data and is stable
from one read() syscall to the next, it is only necessary to support
first/next iteration.  Implementing this in a client is currently a
little clumsy:
- if ->start() is given a pos of zero, it should go to the start of
  the sequence.
- if ->start() is given the same pos that was given to the most recent
  ->next() or ->start(), it should restore the iterator to the state
  just before that last call.
- if ->start() is given another number, it should set the iterator one
  beyond the state just before the last ->start() or ->next() call.


Also, the documentation says that the implementation can interpret the
pos however it likes (other than zero meaning start), but seq_file
increments the pos sometimes which does impose on the implementation.

This patch simplifies the interface for first/next iteration and
simplifies the code, while maintaining complete backward
compatibility.  Now:

- if ->start() is given a pos of zero, it should return an iterator
  placed at the start of the sequence
- if ->start() is given a non-zero pos, it should return the iterator
  in the same state it was after the last ->start or ->next.

This is particularly useful for iterators which walk the multiple
chains in a hash table, e.g. using rhashtable_walk*. See
fs/gfs2/glock.c and drivers/staging/lustre/lustre/llite/vvp_dev.c.
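
As a rough illustration of the two rules above, here is a userspace
model of a first/next iterator that lives in "private data" and walks
the chains of a small hash table.  It is only a sketch: struct walker,
model_start(), model_next() and demo_two_sessions() are hypothetical
names invented for this example, a plain long stands in for loff_t,
and none of this is the kernel seq_file API.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 3

struct node {
	int value;
	struct node *next;
};

struct table {
	struct node *bucket[NBUCKETS];
};

/* Iterator state kept between sessions (the "private data"). */
struct walker {
	struct table *tbl;
	size_t bucket;		/* chain currently being walked */
	struct node *cur;	/* next entry to show, NULL when finished */
};

/* If cur is NULL, move to the first entry of the next non-empty chain. */
static struct node *settle(struct walker *w)
{
	while (!w->cur && w->bucket < NBUCKETS) {
		w->cur = w->tbl->bucket[w->bucket];
		if (!w->cur)
			w->bucket++;
	}
	return w->cur;
}

/* pos == 0: rewind.  pos != 0: resume exactly where the walker is. */
static struct node *model_start(struct walker *w, long *pos)
{
	assert(w && pos);
	if (*pos == 0) {
		w->bucket = 0;
		w->cur = NULL;
		return settle(w);
	}
	return w->cur;
}

/* Always advance *pos, even when the end of the sequence is reached. */
static struct node *model_next(struct walker *w, long *pos)
{
	++*pos;
	if (!w->cur)
		return NULL;
	w->cur = w->cur->next;
	if (!w->cur) {
		w->bucket++;
		settle(w);
	}
	return w->cur;
}

/* Read two entries in one session, then the rest in a second session. */
size_t demo_two_sessions(int *out, size_t max)
{
	struct node c = { 20, NULL };
	struct node b = { 11, NULL };
	struct node a = { 10, &b };
	struct table tbl = { { &a, NULL, &c } };
	struct walker w = { &tbl, 0, NULL };
	long pos = 0;
	size_t n = 0;
	struct node *p;

	/* Session 1: next() is called after every entry that was shown. */
	p = model_start(&w, &pos);
	while (p && n < 2 && n < max) {
		out[n++] = p->value;
		p = model_next(&w, &pos);
	}

	/* Session 2: pos is non-zero, so start() resumes in place. */
	p = model_start(&w, &pos);
	while (p && n < max) {
		out[n++] = p->value;
		p = model_next(&w, &pos);
	}
	return n;
}
```

The model never has to translate a numeric position back into a
location in the table; it only distinguishes zero from non-zero, which
is the simplification described above.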

A large part of achieving this is to *always* call ->next after ->show
has successfully stored all of an entry in the buffer.  Never just
increment the index instead.
Also:
 - always pass &m->index to ->start() and ->next(), never a temp
   variable
 - don't clear ->from when ->count is zero, as ->from is dead when
   ->count is zero.


Some ->next functions do not increment *pos when they return NULL.
To maintain compatibility with this, we still need to increment
m->index in one place, if ->next didn't increment it.
Note that such ->next functions are buggy and should be fixed.
A simple demonstration is:
   dd if=/proc/swaps bs=1000 skip=1
Choose any block size larger than the size of /proc/swaps.
This will always show the whole last line of /proc/swaps.

This patch doesn't work around buggy next() functions for this case.
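
The compatibility fix-up mentioned above (advance the index on behalf
of a ->next that left *pos unchanged) can be modelled in userspace
roughly as below.  This is a sketch only: next_fn, good_next(),
buggy_next() and call_next_compat() are hypothetical names for this
example, and a plain long stands in for loff_t.

```c
#include <assert.h>
#include <stddef.h>

typedef void *(*next_fn)(void *v, long *pos);

/* Well-behaved: advances *pos even when it returns NULL at the end. */
void *good_next(void *v, long *pos)
{
	(void)v;
	++*pos;
	return NULL;
}

/* Buggy: returns NULL at the end without touching *pos. */
void *buggy_next(void *v, long *pos)
{
	(void)v;
	(void)pos;
	return NULL;
}

/* Call next() and, if it did not move the index, move it for it. */
void *call_next_compat(next_fn next, void *v, long *index)
{
	long before = *index;
	void *p;

	assert(next && index);
	p = next(v, index);
	if (*index == before)
		++*index;	/* buggy ->next function */
	return p;
}
```

Either way the caller ends up with an index one past the entry that
was just consumed, so a later start() is not handed a stale position.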

Acked-by: Jonathan Corbet <corbet@lwn.net> (For the docs part)
Signed-off-by: NeilBrown <neilb@suse.com>
---

Still hoping someone might apply this, or at least review it,
or maybe just tell me how insane it is - anything but silence :-(

NeilBrown


 Documentation/filesystems/seq_file.txt | 63 ++++++++++++++++++++++------------
 fs/seq_file.c                          | 53 +++++++++++-----------------
 2 files changed, 62 insertions(+), 54 deletions(-)

diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt
index 9de4303201e1..d412b236a9d6 100644
--- a/Documentation/filesystems/seq_file.txt
+++ b/Documentation/filesystems/seq_file.txt
@@ -66,23 +66,39 @@ kernel 3.10. Current versions require the following update
 
 The iterator interface
 
-Modules implementing a virtual file with seq_file must implement a simple
-iterator object that allows stepping through the data of interest.
-Iterators must be able to move to a specific position - like the file they
-implement - but the interpretation of that position is up to the iterator
-itself. A seq_file implementation that is formatting firewall rules, for
-example, could interpret position N as the Nth rule in the chain.
-Positioning can thus be done in whatever way makes the most sense for the
-generator of the data, which need not be aware of how a position translates
-to an offset in the virtual file. The one obvious exception is that a
-position of zero should indicate the beginning of the file.
+Modules implementing a virtual file with seq_file must implement an
+iterator object that allows stepping through the data of interest
+during a "session" (roughly one read() system call).  If the iterator
+is able to move to a specific position - like the file they implement,
+though with freedom to map the position number to a sequence location
+in whatever way is convenient - the iterator need only exist
+transiently during a session.  If the iterator cannot easily find a
+numerical position but works well with a first/next interface, the
+iterator can be stored in the private data area and continue from one
+session to the next.
+
+A seq_file implementation that is formatting firewall rules from a
+table, for example, could provide a simple iterator that interprets
+position N as the Nth rule in the chain.  A seq_file implementation
+that presents the content of a, potentially volatile, linked list
+might record a pointer into that list, providing that can be done
+without risk of the current location being removed.
+
+Positioning can thus be done in whatever way makes the most sense for
+the generator of the data, which need not be aware of how a position
+translates to an offset in the virtual file. The one obvious exception
+is that a position of zero should indicate the beginning of the file.
 
 The /proc/sequence iterator just uses the count of the next number it
 will output as its position.
 
-Four functions must be implemented to make the iterator work. The first,
-called start() takes a position as an argument and returns an iterator
-which will start reading at that position. For our simple sequence example,
+Four functions must be implemented to make the iterator work. The
+first, called start(), starts a session and takes a position as an
+argument, returning an iterator which will start reading at that
+position.  The pos passed to start() will always be either zero, or
+the most recent pos used in the previous session.
+
+For our simple sequence example,
 the start() function looks like:
 
 	static void *ct_seq_start(struct seq_file *s, loff_t *pos)
@@ -101,11 +117,12 @@ implementations; in most cases the start() function should check for a
 "past end of file" condition and return NULL if need be.
 
 For more complicated applications, the private field of the seq_file
-structure can be used. There is also a special value which can be returned
-by the start() function called SEQ_START_TOKEN; it can be used if you wish
-to instruct your show() function (described below) to print a header at the
-top of the output. SEQ_START_TOKEN should only be used if the offset is
-zero, however.
+structure can be used to hold state from session to session.  There is
+also a special value which can be returned by the start() function
+called SEQ_START_TOKEN; it can be used if you wish to instruct your
+show() function (described below) to print a header at the top of the
+output. SEQ_START_TOKEN should only be used if the offset is zero,
+however.
 
 The next function to implement is called, amazingly, next(); its job is to
 move the iterator forward to the next position in the sequence.  The
@@ -121,9 +138,13 @@ complete. Here's the example version:
 	        return spos;
 	}
 
-The stop() function is called when iteration is complete; its job, of
-course, is to clean up. If dynamic memory is allocated for the iterator,
-stop() is the place to free it.
+The stop() function closes a session; its job, of course, is to clean
+up. If dynamic memory is allocated for the iterator, stop() is the
+place to free it; if a lock was taken by start(), stop() must release
+that lock.  The value that *pos was set to by the last next() call
+before stop() is remembered, and used for the first start() call of
+the next session unless lseek() has been called on the file; in that
+case next start() will be asked to start at position zero.
 
 	static void ct_seq_stop(struct seq_file *s, void *v)
 	{
diff --git a/fs/seq_file.c b/fs/seq_file.c
index 4cc090b50cc5..fd82585ab50f 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -90,23 +90,22 @@ EXPORT_SYMBOL(seq_open);
 
 static int traverse(struct seq_file *m, loff_t offset)
 {
-	loff_t pos = 0, index;
+	loff_t pos = 0;
 	int error = 0;
 	void *p;
 
 	m->version = 0;
-	index = 0;
+	m->index = 0;
 	m->count = m->from = 0;
-	if (!offset) {
-		m->index = index;
+	if (!offset)
 		return 0;
-	}
+
 	if (!m->buf) {
 		m->buf = seq_buf_alloc(m->size = PAGE_SIZE);
 		if (!m->buf)
 			return -ENOMEM;
 	}
-	p = m->op->start(m, &index);
+	p = m->op->start(m, &m->index);
 	while (p) {
 		error = PTR_ERR(p);
 		if (IS_ERR(p))
@@ -123,20 +122,15 @@ static int traverse(struct seq_file *m, loff_t offset)
 		if (pos + m->count > offset) {
 			m->from = offset - pos;
 			m->count -= m->from;
-			m->index = index;
 			break;
 		}
 		pos += m->count;
 		m->count = 0;
-		if (pos == offset) {
-			index++;
-			m->index = index;
+		p = m->op->next(m, p, &m->index);
+		if (pos == offset)
 			break;
-		}
-		p = m->op->next(m, p, &index);
 	}
 	m->op->stop(m, p);
-	m->index = index;
 	return error;
 
 Eoverflow:
@@ -160,7 +154,6 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 {
 	struct seq_file *m = file->private_data;
 	size_t copied = 0;
-	loff_t pos;
 	size_t n;
 	void *p;
 	int err = 0;
@@ -223,16 +216,11 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 		size -= n;
 		buf += n;
 		copied += n;
-		if (!m->count) {
-			m->from = 0;
-			m->index++;
-		}
 		if (!size)
 			goto Done;
 	}
 	/* we need at least one record in buffer */
-	pos = m->index;
-	p = m->op->start(m, &pos);
+	p = m->op->start(m, &m->index);
 	while (1) {
 		err = PTR_ERR(p);
 		if (!p || IS_ERR(p))
@@ -243,8 +231,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 		if (unlikely(err))
 			m->count = 0;
 		if (unlikely(!m->count)) {
-			p = m->op->next(m, p, &pos);
-			m->index = pos;
+			p = m->op->next(m, p, &m->index);
 			continue;
 		}
 		if (m->count < m->size)
@@ -256,29 +243,33 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 		if (!m->buf)
 			goto Enomem;
 		m->version = 0;
-		pos = m->index;
-		p = m->op->start(m, &pos);
+		p = m->op->start(m, &m->index);
 	}
 	m->op->stop(m, p);
 	m->count = 0;
 	goto Done;
 Fill:
 	/* they want more? let's try to get some more */
-	while (m->count < size) {
+	while (1) {
 		size_t offs = m->count;
-		loff_t next = pos;
-		p = m->op->next(m, p, &next);
+		loff_t pos = m->index;
+
+		p = m->op->next(m, p, &m->index);
+		if (pos == m->index)
+			/* Buggy ->next function */
+			m->index++;
 		if (!p || IS_ERR(p)) {
 			err = PTR_ERR(p);
 			break;
 		}
+		if (m->count >= size)
+			break;
 		err = m->op->show(m, p);
 		if (seq_has_overflowed(m) || err) {
 			m->count = offs;
 			if (likely(err <= 0))
 				break;
 		}
-		pos = next;
 	}
 	m->op->stop(m, p);
 	n = min(m->count, size);
@@ -287,11 +278,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 		goto Efault;
 	copied += n;
 	m->count -= n;
-	if (m->count)
-		m->from = n;
-	else
-		pos++;
-	m->index = pos;
+	m->from = n;
 Done:
 	if (!copied)
 		copied = err;
-- 
2.14.0.rc0.dirty


^ permalink raw reply related

* Re: [PATCH v3 24/27] devicetree: fix a series of wrong file references
From: Alexandre Torgue @ 2018-06-18  7:57 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, Jonathan Corbet,
	Dmitry Torokhov, Rob Herring, Mark Rutland, Lee Jones,
	Maxime Ripard, Chen-Yu Tsai, Zhou Wang, Bjorn Helgaas,
	Xiaowei Song, Binghui Wang, Liam Girdwood, Mark Brown,
	Maxime Coquelin, linux-input, devicetree, linux-arm-kernel,
	linux-pci, alsa-devel
In-Reply-To: <bce1b59268ac2d659fd7cb37f89285ef2acc8316.1528990947.git.mchehab+samsung@kernel.org>



On 06/14/2018 06:09 PM, Mauro Carvalho Chehab wrote:
> As files got renamed, their references broke.
> 
> Manually fix a series of broken refs at the DT bindings.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> ---

For stm32 part:

Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>

Thanks
Alex


>   .../devicetree/bindings/input/rmi4/rmi_2d_sensor.txt |  2 +-
>   Documentation/devicetree/bindings/mfd/sun6i-prcm.txt |  2 +-
>   .../devicetree/bindings/pci/hisilicon-pcie.txt       |  2 +-
>   Documentation/devicetree/bindings/pci/kirin-pcie.txt |  2 +-
>   .../devicetree/bindings/pci/pci-keystone.txt         |  4 ++--
>   .../devicetree/bindings/sound/st,stm32-i2s.txt       |  2 +-
>   .../devicetree/bindings/sound/st,stm32-sai.txt       |  2 +-
>   MAINTAINERS                                          | 12 ++++++------
>   8 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/input/rmi4/rmi_2d_sensor.txt b/Documentation/devicetree/bindings/input/rmi4/rmi_2d_sensor.txt
> index f2c30c8b725d..9afffbdf6e28 100644
> --- a/Documentation/devicetree/bindings/input/rmi4/rmi_2d_sensor.txt
> +++ b/Documentation/devicetree/bindings/input/rmi4/rmi_2d_sensor.txt
> @@ -12,7 +12,7 @@ Additional documentation for F11 can be found at:
>   http://www.synaptics.com/sites/default/files/511-000136-01-Rev-E-RMI4-Interfacing-Guide.pdf
>   
>   Optional Touch Properties:
> -Description in Documentation/devicetree/bindings/input/touch
> +Description in Documentation/devicetree/bindings/input/touchscreen
>   - touchscreen-inverted-x
>   - touchscreen-inverted-y
>   - touchscreen-swapped-x-y
> diff --git a/Documentation/devicetree/bindings/mfd/sun6i-prcm.txt b/Documentation/devicetree/bindings/mfd/sun6i-prcm.txt
> index 4d21ffdb0fc1..daa091c2e67b 100644
> --- a/Documentation/devicetree/bindings/mfd/sun6i-prcm.txt
> +++ b/Documentation/devicetree/bindings/mfd/sun6i-prcm.txt
> @@ -8,7 +8,7 @@ Required properties:
>    - reg: The PRCM registers range
>   
>   The prcm node may contain several subdevices definitions:
> - - see Documentation/devicetree/clk/sunxi.txt for clock devices
> + - see Documentation/devicetree/bindings/clock/sunxi.txt for clock devices
>    - see Documentation/devicetree/bindings/reset/allwinner,sunxi-clock-reset.txt for reset
>      controller devices
>   
> diff --git a/Documentation/devicetree/bindings/pci/hisilicon-pcie.txt b/Documentation/devicetree/bindings/pci/hisilicon-pcie.txt
> index 7bf9df047a1e..0dcb87d6554f 100644
> --- a/Documentation/devicetree/bindings/pci/hisilicon-pcie.txt
> +++ b/Documentation/devicetree/bindings/pci/hisilicon-pcie.txt
> @@ -3,7 +3,7 @@ HiSilicon Hip05 and Hip06 PCIe host bridge DT description
>   HiSilicon PCIe host controller is based on the Synopsys DesignWare PCI core.
>   It shares common functions with the PCIe DesignWare core driver and inherits
>   common properties defined in
> -Documentation/devicetree/bindings/pci/designware-pci.txt.
> +Documentation/devicetree/bindings/pci/designware-pcie.txt.
>   
>   Additional properties are described here:
>   
> diff --git a/Documentation/devicetree/bindings/pci/kirin-pcie.txt b/Documentation/devicetree/bindings/pci/kirin-pcie.txt
> index 6e217c63123d..6bbe43818ad5 100644
> --- a/Documentation/devicetree/bindings/pci/kirin-pcie.txt
> +++ b/Documentation/devicetree/bindings/pci/kirin-pcie.txt
> @@ -3,7 +3,7 @@ HiSilicon Kirin SoCs PCIe host DT description
>   Kirin PCIe host controller is based on the Synopsys DesignWare PCI core.
>   It shares common functions with the PCIe DesignWare core driver and
>   inherits common properties defined in
> -Documentation/devicetree/bindings/pci/designware-pci.txt.
> +Documentation/devicetree/bindings/pci/designware-pcie.txt.
>   
>   Additional properties are described here:
>   
> diff --git a/Documentation/devicetree/bindings/pci/pci-keystone.txt b/Documentation/devicetree/bindings/pci/pci-keystone.txt
> index 7e05487544ed..3d4a209b0fd0 100644
> --- a/Documentation/devicetree/bindings/pci/pci-keystone.txt
> +++ b/Documentation/devicetree/bindings/pci/pci-keystone.txt
> @@ -3,9 +3,9 @@ TI Keystone PCIe interface
>   Keystone PCI host Controller is based on the Synopsys DesignWare PCI
>   hardware version 3.65.  It shares common functions with the PCIe DesignWare
>   core driver and inherits common properties defined in
> -Documentation/devicetree/bindings/pci/designware-pci.txt
> +Documentation/devicetree/bindings/pci/designware-pcie.txt
>   
> -Please refer to Documentation/devicetree/bindings/pci/designware-pci.txt
> +Please refer to Documentation/devicetree/bindings/pci/designware-pcie.txt
>   for the details of DesignWare DT bindings.  Additional properties are
>   described here as well as properties that are not applicable.
>   
> diff --git a/Documentation/devicetree/bindings/sound/st,stm32-i2s.txt b/Documentation/devicetree/bindings/sound/st,stm32-i2s.txt
> index 4bda52042402..58c341300552 100644
> --- a/Documentation/devicetree/bindings/sound/st,stm32-i2s.txt
> +++ b/Documentation/devicetree/bindings/sound/st,stm32-i2s.txt
> @@ -18,7 +18,7 @@ Required properties:
>       See Documentation/devicetree/bindings/dma/stm32-dma.txt.
>     - dma-names: Identifier for each DMA request line. Must be "tx" and "rx".
>     - pinctrl-names: should contain only value "default"
> -  - pinctrl-0: see Documentation/devicetree/bindings/pinctrl/pinctrl-stm32.txt
> +  - pinctrl-0: see Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.txt
>   
>   Optional properties:
>     - resets: Reference to a reset controller asserting the reset controller
> diff --git a/Documentation/devicetree/bindings/sound/st,stm32-sai.txt b/Documentation/devicetree/bindings/sound/st,stm32-sai.txt
> index f301cdf0b7e6..3a3fc506e43a 100644
> --- a/Documentation/devicetree/bindings/sound/st,stm32-sai.txt
> +++ b/Documentation/devicetree/bindings/sound/st,stm32-sai.txt
> @@ -37,7 +37,7 @@ SAI subnodes required properties:
>   	"tx": if sai sub-block is configured as playback DAI
>   	"rx": if sai sub-block is configured as capture DAI
>     - pinctrl-names: should contain only value "default"
> -  - pinctrl-0: see Documentation/devicetree/bindings/pinctrl/pinctrl-stm32.txt
> +  - pinctrl-0: see Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.txt
>   
>   SAI subnodes Optional properties:
>     - st,sync: specify synchronization mode.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 67641f5bb373..69c9e9924902 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6965,7 +6965,7 @@ IIO MULTIPLEXER
>   M:	Peter Rosin <peda@axentia.se>
>   L:	linux-iio@vger.kernel.org
>   S:	Maintained
> -F:	Documentation/devicetree/bindings/iio/multiplexer/iio-mux.txt
> +F:	Documentation/devicetree/bindings/iio/multiplexer/io-channel-mux.txt
>   F:	drivers/iio/multiplexer/iio-mux.c
>   
>   IIO SUBSYSTEM AND DRIVERS
> @@ -9695,7 +9695,7 @@ MXSFB DRM DRIVER
>   M:	Marek Vasut <marex@denx.de>
>   S:	Supported
>   F:	drivers/gpu/drm/mxsfb/
> -F:	Documentation/devicetree/bindings/display/mxsfb-drm.txt
> +F:	Documentation/devicetree/bindings/display/mxsfb.txt
>   
>   MYRICOM MYRI-10G 10GbE DRIVER (MYRI10GE)
>   M:	Chris Lee <christopher.lee@cspi.com>
> @@ -10884,7 +10884,7 @@ M:	Will Deacon <will.deacon@arm.com>
>   L:	linux-pci@vger.kernel.org
>   L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
>   S:	Maintained
> -F:	Documentation/devicetree/bindings/pci/controller-generic-pci.txt
> +F:	Documentation/devicetree/bindings/pci/host-generic-pci.txt
>   F:	drivers/pci/controller/pci-host-common.c
>   F:	drivers/pci/controller/pci-host-generic.c
>   
> @@ -11065,7 +11065,7 @@ M:	Xiaowei Song <songxiaowei@hisilicon.com>
>   M:	Binghui Wang <wangbinghui@hisilicon.com>
>   L:	linux-pci@vger.kernel.org
>   S:	Maintained
> -F:	Documentation/devicetree/bindings/pci/pcie-kirin.txt
> +F:	Documentation/devicetree/bindings/pci/kirin-pcie.txt
>   F:	drivers/pci/controller/dwc/pcie-kirin.c
>   
>   PCIE DRIVER FOR HISILICON STB
> @@ -12456,7 +12456,7 @@ L:	linux-crypto@vger.kernel.org
>   L:	linux-samsung-soc@vger.kernel.org
>   S:	Maintained
>   F:	drivers/crypto/exynos-rng.c
> -F:	Documentation/devicetree/bindings/crypto/samsung,exynos-rng4.txt
> +F:	Documentation/devicetree/bindings/rng/samsung,exynos4-rng.txt
>   
>   SAMSUNG EXYNOS TRUE RANDOM NUMBER GENERATOR (TRNG) DRIVER
>   M:	Łukasz Stelmach <l.stelmach@samsung.com>
> @@ -13570,7 +13570,7 @@ F:	drivers/*/stm32-*timer*
>   F:	drivers/pwm/pwm-stm32*
>   F:	include/linux/*/stm32-*tim*
>   F:	Documentation/ABI/testing/*timer-stm32
> -F:	Documentation/devicetree/bindings/*/stm32-*timer
> +F:	Documentation/devicetree/bindings/*/stm32-*timer*
>   F:	Documentation/devicetree/bindings/pwm/pwm-stm32*
>   
>   STMMAC ETHERNET DRIVER
> 

^ permalink raw reply

* Re: [PATCH] net: fix e100.rst Documentation build errors
From: Jani Nikula @ 2018-06-18  8:04 UTC (permalink / raw)
  To: Randy Dunlap, linux-doc@vger.kernel.org, netdev@vger.kernel.org,
	Jeff Kirsher, David Miller
  Cc: LKML, Aaron Brown
In-Reply-To: <72f13386-c484-3eed-c363-a5b667aea2e6@infradead.org>

On Sat, 16 Jun 2018, Randy Dunlap <rdunlap@infradead.org> wrote:
> From: Randy Dunlap <rdunlap@infradead.org>
>
> Fix Documentation build errors in e100.rst.  Several section titles
> and the corresponding underlines should not be indented.

Really the content blocks below the titles should not be indented
either. It's not an error, but the end result is probably not what you
want.

BR,
Jani.


>
> Documentation/networking/e100.rst:90: (SEVERE/4) Unexpected section title.
> Documentation/networking/e100.rst:109: (SEVERE/4) Unexpected section title.
>
> Fixes: 85d63445f411 ("Documentation: e100: Update the Intel 10/100 driver doc")
>
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: Aaron Brown <aaron.f.brown@intel.com>
> ---
> Is there a Sphinx version problem here?  Tested-by: should indicate
> that there was no error like I am seeing.
>
>  Documentation/networking/e100.rst |   24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> --- lnx-418-rc1.orig/Documentation/networking/e100.rst
> +++ lnx-418-rc1/Documentation/networking/e100.rst
> @@ -86,8 +86,8 @@ Event Log Message Level:  The driver use
>  Additional Configurations
>  =========================
>  
> -  Configuring the Driver on Different Distributions
> -  -------------------------------------------------
> +Configuring the Driver on Different Distributions
> +-------------------------------------------------
>  
>    Configuring a network driver to load properly when the system is started is
>    distribution dependent. Typically, the configuration process involves adding
> @@ -105,8 +105,8 @@ Additional Configurations
>         alias eth0 e100
>         alias eth1 e100
>  
> -  Viewing Link Messages
> -  ---------------------
> +Viewing Link Messages
> +---------------------
>    In order to see link messages and other Intel driver information on your
>    console, you must set the dmesg level up to six. This can be done by
>    entering the following on the command line before loading the e100 driver::
> @@ -119,8 +119,8 @@ Additional Configurations
>    NOTE: This setting is not saved across reboots.
>  
>  
> -  ethtool
> -  -------
> +ethtool
> +-------
>  
>    The driver utilizes the ethtool interface for driver configuration and
>    diagnostics, as well as displaying statistical information.  The ethtool
> @@ -129,8 +129,8 @@ Additional Configurations
>    The latest release of ethtool can be found from
>    https://www.kernel.org/pub/software/network/ethtool/
>  
> -  Enabling Wake on LAN* (WoL)
> -  ---------------------------
> +Enabling Wake on LAN* (WoL)
> +---------------------------
>    WoL is provided through the ethtool* utility.  For instructions on enabling
>    WoL with ethtool, refer to the ethtool man page.
>  
> @@ -138,16 +138,16 @@ Additional Configurations
>    this driver version, in order to enable WoL, the e100 driver must be
>    loaded when shutting down or rebooting the system.
>  
> -  NAPI
> -  ----
> +NAPI
> +----
>  
>    NAPI (Rx polling mode) is supported in the e100 driver.
>  
>    See https://wiki.linuxfoundation.org/networking/napi for more information
>    on NAPI.
>  
> -  Multiple Interfaces on Same Ethernet Broadcast Network
> -  ------------------------------------------------------
> +Multiple Interfaces on Same Ethernet Broadcast Network
> +------------------------------------------------------
>  
>    Due to the default ARP behavior on Linux, it is not possible to have
>    one system on two IP networks in the same Ethernet broadcast domain
>
>

-- 
Jani Nikula, Intel Open Source Graphics Center

^ permalink raw reply

* Re: [PATCH RESEND v4 2/2] arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS
From: gengdongjiu @ 2018-06-18  8:24 UTC (permalink / raw)
  To: James Morse
  Cc: rkrcmar@redhat.com, corbet@lwn.net, christoffer.dall@arm.com,
	marc.zyngier@arm.com, linux@armlinux.org.uk,
	catalin.marinas@arm.com, will.deacon@arm.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org

> On 12/06/18 15:50, gengdongjiu wrote:
> > On 2018/6/11 21:36, James Morse wrote:
> >> On 08/06/18 20:48, Dongjiu Geng wrote:
> >>> For the migrating VMs, user space may need to know the exception
> >>> state. For example, in the machine A, KVM make an SError pending,
> >>> when migrate to B, KVM also needs to pend an SError.
> >>>
> >>> This new IOCTL exports user-invisible states related to SError.
> >>> Together with appropriate user space changes, user space can get/set
> >>> the SError exception state to do migrate/snapshot/suspend.
> 
> 
> >>> diff --git a/arch/arm/include/uapi/asm/kvm.h
> >>> b/arch/arm/include/uapi/asm/kvm.h index caae484..c3e6975 100644
> >>> --- a/arch/arm/include/uapi/asm/kvm.h
> >>> +++ b/arch/arm/include/uapi/asm/kvm.h
> >>> @@ -124,6 +124,18 @@ struct kvm_sync_regs {  struct
> >>> kvm_arch_memory_slot {  };
> >>>
> >>> +/* for KVM_GET/SET_VCPU_EVENTS */
> >>> +struct kvm_vcpu_events {
> >>> +	struct {
> >>> +		__u8 serror_pending;
> >>> +		__u8 serror_has_esr;
> >>> +		/* Align it to 8 bytes */
> >>> +		__u8 pad[6];
> >>> +		__u64 serror_esr;
> >>> +	} exception;
> >>> +	__u32 reserved[12];
> >>> +};
> >>> +
> >>
> >> You haven't defined __KVM_HAVE_VCPU_EVENTS for 32bit, so presumably
> >> this struct will never be used. Why is it here?
> 
> >   If the struct is not added for 32-bit, the 32-bit Arm build fails. Is there a good
> >   way to avoid that failure without adding this struct for 32-bit?
> 
> How does this 32bit code build without this patch?
> If you do provide the struct, how will that code build with older headers?
> 
> As far as I can see, this is what the __KVM_HAVE_VCPU_EVENTS define is for.
> 
> This should be both, or neither. Having just the struct is useless.
> 
> 
> >>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
> >>> +			struct kvm_vcpu_events *events)
> >>> +{
> >>> +	bool serror_pending = events->exception.serror_pending;
> >>> +	bool has_esr = events->exception.serror_has_esr;
> >>> +
> >>> +	if (serror_pending && has_esr) {
> >>> +		if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN))
> >>> +			return -EINVAL;
> >>> +
> >>> +		kvm_set_sei_esr(vcpu, events->exception.serror_esr);
> >>
> >> kvm_set_sei_esr() will silently discard the top 40 bits of
> >> serror_esr, (which is correct, we shouldn't copy them into hardware without know what they do).
> >>
> >> Could we please force user-space to zero these bits, we can advertise
> >> extra CAPs if new features turn up in that space, instead of
> >> user-space passing <something> and relying on the kernel to remove it.
> >
> >   yes, I can zero these bits in the  user-space and not depend on kernel to remove it.
> 
> But the kernel must check that user-space did zero those bits. Otherwise user-space may start using them when a future version of the

For this comment, how about adding the kernel check below to verify that user-space zeroed those bits? Thanks.

+               if (!((events->exception.serror_esr) & ~ESR_ELx_ISS_MASK))
+                       kvm_set_sei_esr(vcpu, events->exception.serror_esr);
+               else
+                       return -EINVAL;
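
The same reserved-bits check can be tried out in a standalone userspace
sketch like the one below.  EXAMPLE_ISS_MASK and
example_check_serror_esr() are hypothetical names for this example; the
mask assumes that only the low 24 bits are valid, following the "24bit
value" description later in this thread, and is not taken from the
kernel's ESR_ELx_ISS_MASK definition, which may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumption: only bits [23:0] may be set by user space. */
#define EXAMPLE_ISS_MASK UINT64_C(0x00ffffff)

/* Return 0 if all bits outside the mask are zero, -EINVAL otherwise. */
int example_check_serror_esr(uint64_t serror_esr)
{
	if (serror_esr & ~EXAMPLE_ISS_MASK)
		return -EINVAL;
	return 0;
}
```

Rejecting the value up front, rather than masking it silently, means a
caller that sets a reserved bit gets an error today instead of
different behaviour if a later architecture version gives that bit a
meaning.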


> architecture gives them a meaning, but an older kernel version doesn't know it has to do extra work, but still lets the bits from user-space
> through into the hardware.
> 
> If new bits do turn up, we can advertise a CAP that says that KVM supports whatever that feature is.
> 
> 
> >> (Background: VSESR is a 64bit register that holds the value to go in
> >> a 32bit register. I suspect the top-half could get re-used for
> >> control values or something we don't want to give user-space)
> 
> >   do you mean when user-space get the VSESR value through
> > KVM_GET_VCPU_EVENTS it only return the low-half 32 bits?
> 
> No, the kernel will only ever set a 24bit value here. If we force user-space to only provide a 24bit value then we don't need to check it on
> read. We never read the value back from hardware.
> 
> These high bits are RES0 at the moment, they may get used for something in the future. As we are exposing this via a user-space ABI we
> need to make sure we only expose the bits we understand today.

Ok

> 
> 
> Thanks,
> 
> James

^ permalink raw reply

* Re: [PATCH v3 1/3] usb: gadget: ccid: add support for USB CCID Gadget Device
From: Felipe Balbi @ 2018-06-18  8:22 UTC (permalink / raw)
  To: Marcus Folkesson
  Cc: Greg Kroah-Hartman, Jonathan Corbet, davem, Mauro Carvalho Chehab,
	Andrew Morton, Randy Dunlap, Ruslan Bilovol, Thomas Gleixner,
	Kate Stewart, linux-usb, linux-doc, linux-kernel
In-Reply-To: <20180608185443.GB874@gmail.com>


Hi,

Marcus Folkesson <marcus.folkesson@gmail.com> writes:
> Hi Felipe,
>
> Should I send out v4 or what do you think?

sorry for the delay, have been busy with other tasks.

> On Wed, May 30, 2018 at 04:04:15PM +0200, Marcus Folkesson wrote:
>> Hi Filipe,
>> 
>> On Wed, May 30, 2018 at 03:28:18PM +0300, Felipe Balbi wrote:
>> > Marcus Folkesson <marcus.folkesson@gmail.com> writes:
>> > 
>> > > Chip Card Interface Device (CCID) protocol is a USB protocol that
>> > > allows a smartcard device to be connected to a computer via a card
>> > > reader using a standard USB interface, without the need for each manufacturer
>> > > of smartcards to provide its own reader or protocol.
>> > >
>> > > This gadget driver makes Linux show up as a CCID device to the host and let a
>> > > userspace daemon act as the smartcard.
>> > >
>> > > This is useful when the Linux gadget itself should act as a cryptographic
>> > > device or forward APDUs to an embedded smartcard device.
>> > >
>> > > Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>
>> > 
>> > this could be done entirely in userspace with functionfs, why do we need
>> > this part in the kernel? It does very little.
>> 
>> Andrzej pointed this out, and I actually do not have any good answer
>> more than that the userspace application could be kept small and the
>> important configuration of the CCID device is done with well (I hope)
>> documented configfs attributes.

can we use existing open source applications without modification by
accepting this glue layer?

-- 
balbi

^ permalink raw reply

* Re: [PATCH v7 1/4] ipc: IPCMNI limit check for msgmni and shmmni
From: Waiman Long @ 2018-06-18  9:52 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Kees Cook, Andrew Morton, Jonathan Corbet, linux-kernel,
	linux-fsdevel, linux-doc, Al Viro, Matthew Wilcox,
	Eric W. Biederman
In-Reply-To: <20180509193200.GU27853@wotan.suse.de>

On 05/10/2018 03:32 AM, Luis R. Rodriguez wrote:
> On Mon, May 07, 2018 at 07:57:12PM -0400, Waiman Long wrote:
>> On 05/07/2018 06:39 PM, Luis R. Rodriguez wrote:
>>> On Mon, May 07, 2018 at 04:59:09PM -0400, Waiman Long wrote:
>>>> A user can write arbitrary integer values to msgmni and shmmni sysctl
>>>> parameters without getting error, but the actual limit is really
>>>> IPCMNI (32k). This can mislead users as they think they can get a
>>>> value that is not real.
>>>>
>>>> The right limits are now set for msgmni and shmmni so that the users
>>>> will become aware if they set a value outside of the acceptable range.
>>>>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>> ---
>>>>  ipc/ipc_sysctl.c | 7 +++++--
>>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
>>>> index 8ad93c2..f87cb29 100644
>>>> --- a/ipc/ipc_sysctl.c
>>>> +++ b/ipc/ipc_sysctl.c
>>>> @@ -99,6 +99,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
>>>>  static int zero;
>>>>  static int one = 1;
>>>>  static int int_max = INT_MAX;
>>>> +static int ipc_mni = IPCMNI;
>>>>  
>>>>  static struct ctl_table ipc_kern_table[] = {
>>>>  	{
>>>> @@ -120,7 +121,9 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
>>>>  		.data		= &init_ipc_ns.shm_ctlmni,
>>>>  		.maxlen		= sizeof(init_ipc_ns.shm_ctlmni),
>>>>  		.mode		= 0644,
>>>> -		.proc_handler	= proc_ipc_dointvec,
>>>> +		.proc_handler	= proc_ipc_dointvec_minmax,
>>>> +		.extra1		= &zero,
>>>> +		.extra2		= &ipc_mni,
>>>>  	},
>>>>  	{
>>>>  		.procname	= "shm_rmid_forced",
>>>> @@ -147,7 +150,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
>>>>  		.mode		= 0644,
>>>>  		.proc_handler	= proc_ipc_dointvec_minmax,
>>>>  		.extra1		= &zero,
>>>> -		.extra2		= &int_max,
>>>> +		.extra2		= &ipc_mni,
>>>>  	},
>>>>  	{
>>>>  		.procname	= "auto_msgmni",
>>>> -- 
>>>> 1.8.3.1
>>> It seems negative values are not allowed, if true then having
>>> a caller to use proc_douintvec_minmax() would help with ensuring
>>> no invalid negative input values are used as well.
>>>
>>>   Luis
>> A negative value doesn't make sense here. So it is true that we can use
>> proc_douintvec_minmax() instead. However, the data types themselves are
>> defined as "int". So I think it is better to keep using
>> proc_dointvec_minmax() to be consistent with the data type.
> Huh, no... If you *know* the valid values *are* only positive, the right
> thing to do is to then *change* the data type. Tons of odd bugs can creep
> up because of these stupid things.
>
>   Luis

Sorry for the late reply.

First of all, a negative value will not be accepted because of the zero
lower limit check. The types of msgmni, shmmni and semmni are defined as
int in uapi/linux/msg.h, uapi/linux/shm.h and uapi/linux/sem.h.
They are exposed to userspace, and changing them to "unsigned int"
may have undesirable consequences. Again, this is a case of
introducing risk without any noticeable benefit.

I understand your desire to clean things up. However, I am hesitant to
take this risk without seeing any real benefit in this case.
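
To illustrate the first point, here is a minimal user-space sketch of a
[min, max] check on a signed int. demo_set_int_minmax() and DEMO_IPCMNI
are made-up names for this example; the sketch only mimics the behavior
of rejecting out-of-range writes and is not the kernel handler itself.

```c
#include <errno.h>

#define DEMO_IPCMNI 32768	/* 32k, the limit discussed in this thread */

/*
 * Accept val only if it lies in [min, max]. On rejection the stored
 * value is left unchanged and -EINVAL is returned.
 */
static int demo_set_int_minmax(int *dst, int val, int min, int max)
{
	if (val < min || val > max)
		return -EINVAL;

	*dst = val;
	return 0;
}
```

With min set to 0, a call such as demo_set_int_minmax(&mni, -1, 0,
DEMO_IPCMNI) fails even though the storage type stays int.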

Cheers,
Longman


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v8 0/5] ipc: IPCMNI limit check for *mni & increase that limit
From: Waiman Long @ 2018-06-18 10:28 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet
  Cc: linux-kernel, linux-fsdevel, linux-doc, Al Viro, Matthew Wilcox,
	Eric W. Biederman, Takashi Iwai, Davidlohr Bueso, Waiman Long

v7->v8:
 - Remove the __read_mostly tag for ipc_mni and related variables as their
   accesses are not really in performance critical path.
 - Add a new ipcmni_compat sysctl parameter that can be set to restore old
   range check behavior if desired.

v6->v7:
 - Drop the range clamping code and just return error instead for now
   until there is user request for clamping support.
 - Fix compilation error when CONFIG_SYSVIPC_SYSCTL isn't defined.

v5->v6:
 - Consolidate the 3 ctl_table flags into 2.
 - Make similar changes to proc_doulongvec_minmax() and its associates
   to complete the clamping change.
 - Remove the sysctl registration failure test patch for now for later
   consideration.
 - Add extra braces to patch 1 to reduce code diff in a later patch.

v4->v5:
 - Revert the flags back to 16-bit so that there will be no change to
   the size of ctl_table.
 - Enhance the sysctl_check_flags() as requested by Luis to perform more
   checks to spot incorrect ctl_table entries.
 - Change the sysctl selftest to use dummy sysctls instead of production
   ones & enhance it to do more checks.
 - Add one more sysctl selftest for registration failure.
 - Add 2 ipc patches to add an extended mode to increase IPCMNI from
   32k to 2M.
 - Miscellaneous change to incorporate feedback comments from
   reviewers.

v3->v4:
 - Remove v3 patches 1 & 2 as they have been merged into the mm tree.
 - Change flags from uint16_t to unsigned int.
 - Remove CTL_FLAGS_OOR_WARNED and use pr_warn_ratelimited() instead.
 - Simplify the warning message code.
 - Add a new patch to fail the ctl_table registration with invalid flag.
 - Add a test case for range clamping in sysctl selftest.

v2->v3:
 - Fix kdoc comment errors.
 - Incorporate comments and suggestions from Luis R. Rodriguez.
 - Add a patch to fix a typo error in fs/proc/proc_sysctl.c.

v1->v2:
 - Add kdoc comments to the do_proc_do{u}intvec_minmax_conv_param
   structures.
 - Add a new flags field to the ctl_table structure for specifying
   whether range clamping should be activated instead of adding new
   sysctl parameter handlers.
 - Clamp the semmni value embedded in the multi-values sem parameter.

v5 patch: https://lkml.org/lkml/2018/3/16/1106
v6 patch: https://lkml.org/lkml/2018/4/27/1094
v7 patch: https://lkml.org/lkml/2018/5/7/666

The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPCMNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values, which are not real.

The real IPCMNI limit is now enforced to make sure that users won't
put in an unrealistic value. The first 2 patches enforce the limits.

There are also users out there requesting an increase in the IPCMNI value.
Patches 3 and 4 attempt to do that by using a kernel boot parameter
"ipcmni_extend" to increase the IPCMNI limit from 32k to 2M if the users
really want the extended value.

Enforcing the range limit check may cause some existing applications to break
if they unwittingly set a value higher than 32k. To allow system administrators
to work around this issue, a new ipcmni_compat sysctl parameter can now be set
to restore the old behavior. This compatibility mode can only be set if the
ipcmni_extend boot parameter is not specified. Patch 5 implements this new
sysctl parameter.
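
The difference between strict and compatibility mode can be sketched in
user space as follows. All demo_* names are hypothetical, and the sketch
ignores the ipcmni_extend interaction; it only models "accept the write,
but enforce the real limit internally".

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

#define DEMO_IPCMNI (1 << 15)	/* 32k default limit */

struct demo_ipc_limits {
	bool compat;	/* models the ipcmni_compat flag */
	int  mni;	/* value last written by the user */
};

/* Strict mode rejects values above 32k; compat mode accepts them. */
static int demo_write_mni(struct demo_ipc_limits *l, int val)
{
	int max = l->compat ? INT_MAX : DEMO_IPCMNI;

	if (val < 0 || val > max)
		return -EINVAL;

	l->mni = val;
	return 0;
}

/* In either mode the limit actually used never exceeds 32k. */
static int demo_effective_limit(const struct demo_ipc_limits *l)
{
	return l->mni > DEMO_IPCMNI ? DEMO_IPCMNI : l->mni;
}
```

In compat mode a write of 100000 succeeds and reads back as 100000,
while demo_effective_limit() still reports 32768.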

Waiman Long (5):
  ipc: IPCMNI limit check for msgmni and shmmni
  ipc: IPCMNI limit check for semmni
  ipc: Allow boot time extension of IPCMNI from 32k to 2M
  ipc: Conserve sequence numbers in extended IPCMNI mode
  ipc: Add a new ipcmni_compat sysctl to fall back to old behavior

 Documentation/admin-guide/kernel-parameters.txt |  3 +
 Documentation/sysctl/kernel.txt                 | 15 +++++
 include/linux/ipc_namespace.h                   |  1 +
 ipc/ipc_sysctl.c                                | 78 ++++++++++++++++++++++++-
 ipc/util.c                                      | 41 ++++++++-----
 ipc/util.h                                      | 50 +++++++++++++---
 6 files changed, 164 insertions(+), 24 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH v8 5/5] ipc: Add a new ipcmni_compat sysctl to fall back to old behavior
From: Waiman Long @ 2018-06-18 10:28 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet
  Cc: linux-kernel, linux-fsdevel, linux-doc, Al Viro, Matthew Wilcox,
	Eric W. Biederman, Takashi Iwai, Davidlohr Bueso, Waiman Long
In-Reply-To: <1529317698-16575-1-git-send-email-longman@redhat.com>

With strict range limit enforcement of msgmni, shmmni and sem, it is
possible that some existing applications that set those values to above
32k may fail. To help users work around this potential problem, a new
boolean ipcmni_compat sysctl is added to provide the old behavior for
compatibility when it is set to 1. In other words, the limit will then be
enforced internally but no error will be reported.

This compatibility mode can only be enabled if the ipcmni_extend kernel
boot parameter is not specified.

The sysctl documentation is also updated accordingly.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/sysctl/kernel.txt | 15 +++++++++++++++
 ipc/ipc_sysctl.c                | 42 ++++++++++++++++++++++++++++++++++++++---
 ipc/util.h                      |  5 +++--
 3 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index eded671d..e98d967 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -39,6 +39,7 @@ show up in /proc/sys/kernel:
 - hung_task_check_count
 - hung_task_timeout_secs
 - hung_task_warnings
+- ipcmni_compat
 - kexec_load_disabled
 - kptr_restrict
 - l2cr                        [ PPC only ]
@@ -374,6 +375,20 @@ This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
 
 ==============================================================
 
+ipcmni_compat:
+
+A boolean flag to control range checking behavior of msgmni, shmmni
+and the mni portion of sem.
+
+0: Range limits will be strictly enforced and error will be returned
+   if limits are exceeded.
+1: Range limits will only be enforced internally and no error will be
+   returned if the upper limit is exceeded. This compatibility behavior
+   can only be selected if the ipcmni_extend kernel boot parameter is
+   not specified.
+
+==============================================================
+
 kexec_load_disabled:
 
 A toggle indicating if the kexec_load syscall has been disabled. This
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index d9ac6ca..5c0eac4 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -18,6 +18,8 @@
 #include <linux/msg.h>
 #include "util.h"
 
+static int ipcmni_compat;
+
 static void *get_ipc(struct ctl_table *table)
 {
 	char *which = table->data;
@@ -108,6 +110,25 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 	return ret;
 }
 
+static int proc_ipcmni_compat_minmax(struct ctl_table *table, int write,
+	void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (ret)
+		return ret;
+
+	/*
+	 * ipcmni_compat can only be set if !ipcmni_extend.
+	 */
+	if (ipcmni_compat && ipc_mni_extended) {
+		ipcmni_compat = 0;
+		return -EINVAL;
+	}
+	ipcmni_max = ipcmni_compat ? INT_MAX : ipc_mni;
+	return 0;
+}
+
 #else
 #define proc_ipc_doulongvec_minmax NULL
 #define proc_ipc_dointvec	   NULL
@@ -115,11 +136,13 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 #define proc_ipc_dointvec_minmax_orphans   NULL
 #define proc_ipc_auto_msgmni	   NULL
 #define proc_ipc_sem_dointvec	   NULL
+#define proc_ipcmni_compat_minmax  NULL
 #endif
 
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
+int ipcmni_max = IPCMNI;
 int ipc_mni = IPCMNI;
 int ipc_mni_shift = IPCMNI_SHIFT;
 bool ipc_mni_extended;
@@ -146,7 +169,7 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 		.mode		= 0644,
 		.proc_handler	= proc_ipc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &ipc_mni,
+		.extra2		= &ipcmni_max,
 	},
 	{
 		.procname	= "shm_rmid_forced",
@@ -173,7 +196,7 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 		.mode		= 0644,
 		.proc_handler	= proc_ipc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &ipc_mni,
+		.extra2		= &ipcmni_max,
 	},
 	{
 		.procname	= "auto_msgmni",
@@ -229,6 +252,19 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 		.extra2		= &int_max,
 	},
 #endif
+	/*
+	 * Unlike other IPC sysctl parameters above, the following sysctl
+	 * parameter is global and affect behavior for all the namespaces.
+	 */
+	{
+		.procname	= "ipcmni_compat",
+		.data		= &ipcmni_compat,
+		.maxlen		= sizeof(ipcmni_compat),
+		.mode		= 0644,
+		.proc_handler	= proc_ipcmni_compat_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 	{}
 };
 
@@ -251,7 +287,7 @@ static int __init ipc_sysctl_init(void)
 
 static int __init ipc_mni_extend(char *str)
 {
-	ipc_mni = IPCMNI_EXTEND;
+	ipc_mni = ipcmni_max = IPCMNI_EXTEND;
 	ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
 	ipc_mni_extended = true;
 	pr_info("IPCMNI extended to %d.\n", ipc_mni);
diff --git a/ipc/util.h b/ipc/util.h
index 62b6247..dde5014 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_SYSVIPC_SYSCTL
 extern int ipc_mni;
 extern int ipc_mni_shift;
+extern int ipcmni_max;
 extern bool ipc_mni_extended;
 
 #define SEQ_SHIFT		ipc_mni_shift
@@ -244,10 +245,10 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 static inline int sem_check_semmni(struct ipc_namespace *ns) {
 	/*
-	 * Check semmni range [0, ipc_mni]
+	 * Check semmni range [0, ipcmni_max]
 	 * semmni is the last element of sem_ctls[4] array
 	 */
-	return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > ipc_mni))
+	return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > ipcmni_max))
 		? -ERANGE : 0;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v8 3/5] ipc: Allow boot time extension of IPCMNI from 32k to 2M
From: Waiman Long @ 2018-06-18 10:28 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet
  Cc: linux-kernel, linux-fsdevel, linux-doc, Al Viro, Matthew Wilcox,
	Eric W. Biederman, Takashi Iwai, Davidlohr Bueso, Waiman Long
In-Reply-To: <1529317698-16575-1-git-send-email-longman@redhat.com>

The maximum number of unique System V IPC identifiers was limited to
32k.  That limit should be big enough for most use cases.

However, there are some users out there requesting more. To satisfy
the needs of those users, a new boot time kernel option "ipcmni_extend"
is added to extend the IPCMNI value to 2M. This is a 64X increase which
hopefully is big enough for them.

This new option does have the side effect of reducing the maximum
number of unique sequence numbers from 64k down to 1k. So it is
a trade-off.
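
The trade-off follows from how an ID packs a sequence number above an
array index. Below is a small user-space sketch of that arithmetic,
assuming a 32-bit int; the demo_* helpers are illustrative names that
mirror the shift-based macros in this patch, not the macros themselves.

```c
#include <limits.h>

#define DEMO_SHIFT_DEFAULT	15	/* 32k index entries */
#define DEMO_SHIFT_EXTEND	21	/* 2M index entries */

/* Pack a sequence number above the index bits. */
static int demo_buildid(int seq, int idx, int shift)
{
	return (seq << shift) + idx;
}

/* Recover the index from the low bits of an ID. */
static int demo_id_to_idx(int id, int shift)
{
	return id & ((1 << shift) - 1);
}

/* Recover the sequence number from the high bits of an ID. */
static int demo_id_to_seq(int id, int shift)
{
	return id >> shift;
}

/* Largest sequence number that still fits in a positive int. */
static int demo_seq_max(int shift)
{
	return INT_MAX >> shift;
}
```

With a 32-bit int, demo_seq_max(15) is 65535 and demo_seq_max(21) is
1023, which matches the 64k to 1k reduction described above.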

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  3 ++
 ipc/ipc_sysctl.c                                | 12 ++++++-
 ipc/util.c                                      | 12 +++----
 ipc/util.h                                      | 42 +++++++++++++++++++------
 4 files changed, 52 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index efc7aa7..6712a42 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1744,6 +1744,9 @@
 	ip=		[IP_PNP]
 			See Documentation/filesystems/nfs/nfsroot.txt.
 
+	ipcmni_extend	[KNL] Extend the maximum number of unique System V
+			IPC identifiers from 32768 to 2097152.
+
 	irqaffinity=	[SMP] Set the default irq affinity mask
 			The argument is a cpu list, as described above.
 
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 49f9bf4..73b7782 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -120,7 +120,8 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
-static int ipc_mni = IPCMNI;
+int ipc_mni = IPCMNI;
+int ipc_mni_shift = IPCMNI_SHIFT;
 
 static struct ctl_table ipc_kern_table[] = {
 	{
@@ -246,3 +247,12 @@ static int __init ipc_sysctl_init(void)
 }
 
 device_initcall(ipc_sysctl_init);
+
+static int __init ipc_mni_extend(char *str)
+{
+	ipc_mni = IPCMNI_EXTEND;
+	ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+	pr_info("IPCMNI extended to %d.\n", ipc_mni);
+	return 0;
+}
+early_param("ipcmni_extend", ipc_mni_extend);
diff --git a/ipc/util.c b/ipc/util.c
index 4e81182..782a8d0 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -113,7 +113,7 @@ static int __init ipc_init(void)
  * @ids: ipc identifier set
  *
  * Set up the sequence range to use for the ipc identifier range (limited
- * below IPCMNI) then initialise the keys hashtable and ids idr.
+ * below ipc_mni) then initialise the keys hashtable and ids idr.
  */
 int ipc_init_ids(struct ipc_ids *ids)
 {
@@ -214,7 +214,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
 		ids->next_id = -1;
 	}
 
-	return SEQ_MULTIPLIER * new->seq + id;
+	return (new->seq << SEQ_SHIFT) + id;
 }
 
 #else
@@ -228,7 +228,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
 	if (ids->seq > IPCID_SEQ_MAX)
 		ids->seq = 0;
 
-	return SEQ_MULTIPLIER * new->seq + id;
+	return (new->seq << SEQ_SHIFT) + id;
 }
 
 #endif /* CONFIG_CHECKPOINT_RESTORE */
@@ -252,8 +252,8 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
 	kgid_t egid;
 	int id, err;
 
-	if (limit > IPCMNI)
-		limit = IPCMNI;
+	if (limit > ipc_mni)
+		limit = ipc_mni;
 
 	if (!ids->tables_initialized || ids->in_use >= limit)
 		return -ENOSPC;
@@ -777,7 +777,7 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos,
 	if (total >= ids->in_use)
 		return NULL;
 
-	for (; pos < IPCMNI; pos++) {
+	for (; pos < ipc_mni; pos++) {
 		ipc = idr_find(&ids->ipcs_idr, pos);
 		if (ipc != NULL) {
 			*new_pos = pos + 1;
diff --git a/ipc/util.h b/ipc/util.h
index 8e9c52c..d103630 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -15,8 +15,30 @@
 #include <linux/err.h>
 #include <linux/ipc_namespace.h>
 
-#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
-#define SEQ_MULTIPLIER	(IPCMNI)
+/*
+ * By default, the ipc arrays can have up to 32k (15 bits) entries.
+ * When IPCMNI extension mode is turned on, the ipc arrays can have up
+ * to 2M (21 bits) entries. However, the space for sequence number will
+ * be shrunk from 16 bits to 10 bits.
+ */
+#define IPCMNI_SHIFT		15
+#define IPCMNI_EXTEND_SHIFT	21
+#define IPCMNI			(1 << IPCMNI_SHIFT)
+#define IPCMNI_EXTEND		(1 << IPCMNI_EXTEND_SHIFT)
+
+#ifdef CONFIG_SYSVIPC_SYSCTL
+extern int ipc_mni;
+extern int ipc_mni_shift;
+
+#define SEQ_SHIFT		ipc_mni_shift
+#define SEQ_MASK		((1 << ipc_mni_shift) - 1)
+
+#else /* CONFIG_SYSVIPC_SYSCTL */
+
+#define ipc_mni 		IPCMNI
+#define SEQ_SHIFT		IPCMNI_SHIFT
+#define SEQ_MASK		((1 << IPCMNI_SHIFT) - 1)
+#endif /* CONFIG_SYSVIPC_SYSCTL */
 
 int sem_init(void);
 int msg_init(void);
@@ -96,9 +118,9 @@ void __init ipc_init_proc_interface(const char *path, const char *header,
 #define IPC_MSG_IDS	1
 #define IPC_SHM_IDS	2
 
-#define ipcid_to_idx(id) ((id) % SEQ_MULTIPLIER)
-#define ipcid_to_seqx(id) ((id) / SEQ_MULTIPLIER)
-#define IPCID_SEQ_MAX min_t(int, INT_MAX/SEQ_MULTIPLIER, USHRT_MAX)
+#define ipcid_to_idx(id)  ((id) & SEQ_MASK)
+#define ipcid_to_seqx(id) ((id) >> SEQ_SHIFT)
+#define IPCID_SEQ_MAX	  (INT_MAX >> SEQ_SHIFT)
 
 /* must be called with ids->rwsem acquired for writing */
 int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int);
@@ -123,8 +145,8 @@ static inline int ipc_get_maxid(struct ipc_ids *ids)
 	if (ids->in_use == 0)
 		return -1;
 
-	if (ids->in_use == IPCMNI)
-		return IPCMNI - 1;
+	if (ids->in_use == ipc_mni)
+		return ipc_mni - 1;
 
 	return ids->max_id;
 }
@@ -175,7 +197,7 @@ static inline void ipc_update_pid(struct pid **pos, struct pid *pid)
 
 static inline int ipc_checkid(struct kern_ipc_perm *ipcp, int uid)
 {
-	return uid / SEQ_MULTIPLIER != ipcp->seq;
+	return (uid >> SEQ_SHIFT) != ipcp->seq;
 }
 
 static inline void ipc_lock_object(struct kern_ipc_perm *perm)
@@ -220,10 +242,10 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 static inline int sem_check_semmni(struct ipc_namespace *ns) {
 	/*
-	 * Check semmni range [0, IPCMNI]
+	 * Check semmni range [0, ipc_mni]
 	 * semmni is the last element of sem_ctls[4] array
 	 */
-	return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > IPCMNI))
+	return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > ipc_mni))
 		? -ERANGE : 0;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v8 4/5] ipc: Conserve sequence numbers in extended IPCMNI mode
From: Waiman Long @ 2018-06-18 10:28 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet
  Cc: linux-kernel, linux-fsdevel, linux-doc, Al Viro, Matthew Wilcox,
	Eric W. Biederman, Takashi Iwai, Davidlohr Bueso, Waiman Long
In-Reply-To: <1529317698-16575-1-git-send-email-longman@redhat.com>

The mixing in of a sequence number into the IPC IDs is probably to
avoid ID reuse in userspace as much as possible. With extended IPCMNI
mode, the number of usable sequence numbers is greatly reduced leading
to higher chance of ID reuse.

To address this issue, we need to conserve the sequence number space
as much as possible. Right now, the sequence number is incremented
for every new ID created. In reality, we only need to increment the
sequence number when one or more IDs have been removed previously to
make sure that those IDs will not be reused when a new one is built.
This is being done in the extended IPCMNI mode.
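
The allocation rule can be simulated in user space. The sketch below is
a simplification under stated assumptions: it uses a plain int for the
sequence counter (the patch uses an unsigned short) and the demo_* names
are made up for illustration.

```c
#include <stdbool.h>

struct demo_ids {
	int  seq;	/* current sequence number */
	bool deleted;	/* set when an ID has been removed */
};

static void demo_init(struct demo_ids *ids, bool extended)
{
	ids->deleted = false;
	/* In default mode the counter is pre-incremented on first use. */
	ids->seq = extended ? 0 : -1;
}

/* Return the sequence number to embed in the next new ID. */
static int demo_next_seq(struct demo_ids *ids, bool extended, int seq_max)
{
	if (!extended || ids->deleted) {
		ids->seq++;
		if (ids->seq > seq_max)
			ids->seq = 0;
		ids->deleted = false;
	}
	return ids->seq;
}

/* Record that an ID was removed, so the next allocation bumps seq. */
static void demo_remove(struct demo_ids *ids)
{
	ids->deleted = true;
}
```

In default mode three allocations yield 0, 1, 2. In extended mode they
all yield 0 until demo_remove() is called, after which the next
allocation yields 1.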

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/ipc_namespace.h |  1 +
 ipc/ipc_sysctl.c              |  2 ++
 ipc/util.c                    | 29 ++++++++++++++++++++++-------
 ipc/util.h                    |  2 ++
 4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8..9c86fd9 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -16,6 +16,7 @@
 struct ipc_ids {
 	int in_use;
 	unsigned short seq;
+	unsigned short deleted;
 	bool tables_initialized;
 	struct rw_semaphore rwsem;
 	struct idr ipcs_idr;
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 73b7782..d9ac6ca 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -122,6 +122,7 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 static int int_max = INT_MAX;
 int ipc_mni = IPCMNI;
 int ipc_mni_shift = IPCMNI_SHIFT;
+bool ipc_mni_extended;
 
 static struct ctl_table ipc_kern_table[] = {
 	{
@@ -252,6 +253,7 @@ static int __init ipc_mni_extend(char *str)
 {
 	ipc_mni = IPCMNI_EXTEND;
 	ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+	ipc_mni_extended = true;
 	pr_info("IPCMNI extended to %d.\n", ipc_mni);
 	return 0;
 }
diff --git a/ipc/util.c b/ipc/util.c
index 782a8d0..7c8e733 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -119,7 +119,8 @@ int ipc_init_ids(struct ipc_ids *ids)
 {
 	int err;
 	ids->in_use = 0;
-	ids->seq = 0;
+	ids->deleted = false;
+	ids->seq = ipc_mni_extended ? 0 : -1; /* seq # is pre-incremented */
 	init_rwsem(&ids->rwsem);
 	err = rhashtable_init(&ids->key_ht, &ipc_kht_params);
 	if (err)
@@ -193,6 +194,11 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
 	return NULL;
 }
 
+/*
+ * To conserve sequence number space with extended ipc_mni when new ID
+ * is built, the sequence number is incremented only when one or more
+ * IDs have been removed previously.
+ */
 #ifdef CONFIG_CHECKPOINT_RESTORE
 /*
  * Specify desired id for next allocated IPC object.
@@ -206,9 +212,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
 			      struct kern_ipc_perm *new)
 {
 	if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
-		new->seq = ids->seq++;
-		if (ids->seq > IPCID_SEQ_MAX)
-			ids->seq = 0;
+		if (!ipc_mni_extended || ids->deleted) {
+			ids->seq++;
+			if (ids->seq > IPCID_SEQ_MAX)
+				ids->seq = 0;
+			ids->deleted = false;
+		}
+		new->seq = ids->seq;
 	} else {
 		new->seq = ipcid_to_seqx(ids->next_id);
 		ids->next_id = -1;
@@ -224,9 +234,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
 static inline int ipc_buildid(int id, struct ipc_ids *ids,
 			      struct kern_ipc_perm *new)
 {
-	new->seq = ids->seq++;
-	if (ids->seq > IPCID_SEQ_MAX)
-		ids->seq = 0;
+	if (!ipc_mni_extended || ids->deleted) {
+		ids->seq++;
+		if (ids->seq > IPCID_SEQ_MAX)
+			ids->seq = 0;
+		ids->deleted = false;
+	}
+	new->seq = ids->seq;
 
 	return (new->seq << SEQ_SHIFT) + id;
 }
@@ -436,6 +450,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 	idr_remove(&ids->ipcs_idr, lid);
 	ipc_kht_remove(ids, ipcp);
 	ids->in_use--;
+	ids->deleted = true;
 	ipcp->deleted = true;
 
 	if (unlikely(lid == ids->max_id)) {
diff --git a/ipc/util.h b/ipc/util.h
index d103630..62b6247 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_SYSVIPC_SYSCTL
 extern int ipc_mni;
 extern int ipc_mni_shift;
+extern bool ipc_mni_extended;
 
 #define SEQ_SHIFT		ipc_mni_shift
 #define SEQ_MASK		((1 << ipc_mni_shift) - 1)
@@ -36,6 +37,7 @@
 #else /* CONFIG_SYSVIPC_SYSCTL */
 
 #define ipc_mni 		IPCMNI
+#define ipc_mni_extended	false
 #define SEQ_SHIFT		IPCMNI_SHIFT
 #define SEQ_MASK		((1 << IPCMNI_SHIFT) - 1)
 #endif /* CONFIG_SYSVIPC_SYSCTL */
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v8 1/5] ipc: IPCMNI limit check for msgmni and shmmni
From: Waiman Long @ 2018-06-18 10:28 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet
  Cc: linux-kernel, linux-fsdevel, linux-doc, Al Viro, Matthew Wilcox,
	Eric W. Biederman, Takashi Iwai, Davidlohr Bueso, Waiman Long
In-Reply-To: <1529317698-16575-1-git-send-email-longman@redhat.com>

A user can write arbitrary integer values to the msgmni and shmmni sysctl
parameters without getting an error, but the actual limit is really
IPCMNI (32k). This can mislead users as they think they can get a
value that is not real.

The right limits are now set for msgmni and shmmni so that the users
will become aware if they set a value outside of the acceptable range.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 ipc/ipc_sysctl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 8ad93c2..f87cb29 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -99,6 +99,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
+static int ipc_mni = IPCMNI;
 
 static struct ctl_table ipc_kern_table[] = {
 	{
@@ -120,7 +121,9 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 		.data		= &init_ipc_ns.shm_ctlmni,
 		.maxlen		= sizeof(init_ipc_ns.shm_ctlmni),
 		.mode		= 0644,
-		.proc_handler	= proc_ipc_dointvec,
+		.proc_handler	= proc_ipc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &ipc_mni,
 	},
 	{
 		.procname	= "shm_rmid_forced",
@@ -147,7 +150,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 		.mode		= 0644,
 		.proc_handler	= proc_ipc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &int_max,
+		.extra2		= &ipc_mni,
 	},
 	{
 		.procname	= "auto_msgmni",
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v8 2/5] ipc: IPCMNI limit check for semmni
From: Waiman Long @ 2018-06-18 10:28 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet
  Cc: linux-kernel, linux-fsdevel, linux-doc, Al Viro, Matthew Wilcox,
	Eric W. Biederman, Takashi Iwai, Davidlohr Bueso, Waiman Long
In-Reply-To: <1529317698-16575-1-git-send-email-longman@redhat.com>

For SysV semaphores, the semmni value is the last part of the 4-element
sem number array. To make semmni behave in a similar way to msgmni and
shmmni, we can't directly use the _minmax handler. Instead, a special
sem specific handler is added to check the last argument to make sure
that it is limited to the [0, IPCMNI] range. An error will be returned
if this is not the case.
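
The check-and-restore flow can be sketched in user space like this.
demo_write_sem() and DEMO_IPCMNI are hypothetical names; the sketch only
models the behavior described above, where the four values are written
first and semmni alone is rolled back on a range error.

```c
#include <errno.h>
#include <string.h>

#define DEMO_IPCMNI 32768	/* 32k upper bound for semmni */

/*
 * Write all four sem values, then validate the last one (semmni).
 * If semmni is outside [0, DEMO_IPCMNI], restore its old value and
 * return -ERANGE. The other three values keep what was written.
 */
static int demo_write_sem(int ctls[4], const int newvals[4])
{
	int old_semmni = ctls[3];

	memcpy(ctls, newvals, 4 * sizeof(int));

	if (ctls[3] < 0 || ctls[3] > DEMO_IPCMNI) {
		ctls[3] = old_semmni;
		return -ERANGE;
	}
	return 0;
}
```

Note that a failed write still changes the first three elements, so a
caller that wants an all-or-nothing update would need to save and
restore the whole array.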

Signed-off-by: Waiman Long <longman@redhat.com>
---
 ipc/ipc_sysctl.c | 23 ++++++++++++++++++++++-
 ipc/util.h       |  9 +++++++++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index f87cb29..49f9bf4 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -88,12 +88,33 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 	return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
 }
 
+static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
+	void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret, semmni;
+	struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+
+	semmni = ns->sem_ctls[3];
+	ret = proc_ipc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret)
+		ret = sem_check_semmni(current->nsproxy->ipc_ns);
+
+	/*
+	 * Reset the semmni value if an error happens.
+	 */
+	if (ret)
+		ns->sem_ctls[3] = semmni;
+	return ret;
+}
+
 #else
 #define proc_ipc_doulongvec_minmax NULL
 #define proc_ipc_dointvec	   NULL
 #define proc_ipc_dointvec_minmax   NULL
 #define proc_ipc_dointvec_minmax_orphans   NULL
 #define proc_ipc_auto_msgmni	   NULL
+#define proc_ipc_sem_dointvec	   NULL
 #endif
 
 static int zero;
@@ -175,7 +196,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 		.data		= &init_ipc_ns.sem_ctls,
 		.maxlen		= 4*sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_ipc_dointvec,
+		.proc_handler	= proc_ipc_sem_dointvec,
 	},
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	{
diff --git a/ipc/util.h b/ipc/util.h
index 0aba323..8e9c52c 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -218,6 +218,15 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 		void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 
+static inline int sem_check_semmni(struct ipc_namespace *ns) {
+	/*
+	 * Check semmni range [0, IPCMNI]
+	 * semmni is the last element of sem_ctls[4] array
+	 */
+	return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > IPCMNI))
+		? -ERANGE : 0;
+}
+
 #ifdef CONFIG_COMPAT
 #include <linux/compat.h>
 struct compat_ipc_perm {
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH v3 2/2] locking: Implement an algorithm choice for Wound-Wait mutexes
From: Thomas Hellstrom @ 2018-06-18 11:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dri-devel, linux-kernel, Ingo Molnar, Jonathan Corbet,
	Gustavo Padovan, Maarten Lankhorst, Sean Paul, David Airlie,
	Davidlohr Bueso, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Kate Stewart, Philippe Ombredanne, Greg Kroah-Hartman, linux-doc,
	linux-media, linaro-mm-sig, stern
In-Reply-To: <20180615164604.GD2458@hirez.programming.kicks-ass.net>

On 06/15/2018 06:46 PM, Peter Zijlstra wrote:
> On Fri, Jun 15, 2018 at 02:08:27PM +0200, Thomas Hellstrom wrote:
>
>> @@ -772,6 +856,25 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
>>   	}
>>   
>>   	list_add_tail(&waiter->list, pos);
>> +	if (__mutex_waiter_is_first(lock, waiter))
>> +		__mutex_set_flag(lock, MUTEX_FLAG_WAITERS);
>> +
>> +	/*
>> +	 * Wound-Wait: if we're blocking on a mutex owned by a younger context,
>> +	 * wound that such that we might proceed.
>> +	 */
>> +	if (!is_wait_die) {
>> +		struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
>> +
>> +		/*
>> +		 * See ww_mutex_set_context_fastpath(). Orders setting
>> +		 * MUTEX_FLAG_WAITERS (atomic operation) vs the ww->ctx load,
>> +		 * such that either we or the fastpath will wound @ww->ctx.
>> +		 */
>> +		smp_mb__after_atomic();
>> +
>> +		__ww_mutex_wound(lock, ww_ctx, ww->ctx);
>> +	}
> I think we want the smp_mb__after_atomic() in the same branch as
> __mutex_set_flag(). So something like:
>
> 	if (__mutex_waiter_is_first()) {
> 		__mutex_set_flag();
> 		if (!is_wait_die)
> 			smp_mb__after_atomic();
> 	}
>
> Or possibly even without the !is_wait_die. The rules for
> smp_mb__*_atomic() are such that we want it unconditional after an
> atomic, otherwise the semantics get too fuzzy.
>
> Alan (rightfully) complained about that a while ago when he was auditing
> users.
>
>
Hmm, yes that's understandable, although I must admit that when one of 
the accesses we want to order is actually an atomic this shouldn't 
really be causing much confusion.

But I think I'll change it back to an smp_mb() then. It's in a 
slowpath, and awkward constructs around smp_mb__after_atomic() might be 
causing grief in the future.

/Thomas



^ permalink raw reply

* Re: [PATCH v8 5/5] ipc: Add a new ipcmni_compat sysctl to fall back to old behavior
From: kbuild test robot @ 2018-06-18 11:36 UTC (permalink / raw)
  To: Waiman Long
  Cc: kbuild-all, Luis R. Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, linux-kernel, linux-fsdevel, linux-doc, Al Viro,
	Matthew Wilcox, Eric W. Biederman, Takashi Iwai, Davidlohr Bueso,
	Waiman Long
In-Reply-To: <1529317698-16575-6-git-send-email-longman@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 2507 bytes --]

Hi Waiman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc1 next-20180618]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/ipc-IPCMNI-limit-check-for-mni-increase-that-limit/20180618-183206
config: i386-randconfig-x013-201824 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from ipc/mqueue.c:43:0:
   ipc/util.h: In function 'sem_check_semmni':
>> ipc/util.h:251:54: error: 'ipcmni_max' undeclared (first use in this function); did you mean 'icmp_mib'?
     return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > ipcmni_max))
                                                         ^~~~~~~~~~
                                                         icmp_mib
   ipc/util.h:251:54: note: each undeclared identifier is reported only once for each function it appears in
--
   In file included from ipc/msgutil.c:22:0:
   ipc/util.h: In function 'sem_check_semmni':
>> ipc/util.h:251:54: error: 'ipcmni_max' undeclared (first use in this function); did you mean 'pci_iomap'?
     return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > ipcmni_max))
                                                         ^~~~~~~~~~
                                                         pci_iomap
   ipc/util.h:251:54: note: each undeclared identifier is reported only once for each function it appears in

vim +251 ipc/util.h

   239	
   240	struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id);
   241	int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
   242				const struct ipc_ops *ops, struct ipc_params *params);
   243	void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
   244			void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
   245	
   246	static inline int sem_check_semmni(struct ipc_namespace *ns) {
   247		/*
   248		 * Check semmni range [0, ipcmni_max]
   249		 * semmni is the last element of sem_ctls[4] array
   250		 */
 > 251		return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > ipcmni_max))
   252			? -ERANGE : 0;
   253	}
   254	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28855 bytes --]

^ permalink raw reply

* Re: [PATCH] net: fix e100.rst Documentation build errors
From: Michal Kubecek @ 2018-06-18 11:44 UTC (permalink / raw)
  To: Jani Nikula
  Cc: Randy Dunlap, linux-doc@vger.kernel.org, netdev@vger.kernel.org,
	Jeff Kirsher, David Miller, LKML, Aaron Brown
In-Reply-To: <87efh4y0yk.fsf@intel.com>

On Mon, Jun 18, 2018 at 11:04:51AM +0300, Jani Nikula wrote:
> On Sat, 16 Jun 2018, Randy Dunlap <rdunlap@infradead.org> wrote:
> > From: Randy Dunlap <rdunlap@infradead.org>
> >
> > Fix Documentation build errors in e100.rst.  Several section titles
> > and the corresponding underlines should not be indented.
> 
> Really the content blocks below the titles should not be indented
> either. It's not an error, but the end result is probably not what you
> want.

Also the indentation of this part:

> Rx Descriptors: Number of receive descriptors. A receive descriptor is a data
>    structure that describes a receive buffer and its attributes to the network
>    controller. The data in the descriptor is used by the controller to write
>    data from the controller to host memory. In the 3.x.x driver the valid range
>    for this parameter is 64-256. The default value is 256. This parameter can be
>    changed using the command::
> 
>    ethtool -G eth? rx n
> 
>    Where n is the number of desired Rx descriptors.
> 
> Tx Descriptors: Number of transmit descriptors. A transmit descriptor is a data
>    structure that describes a transmit buffer and its attributes to the network
>    controller. The data in the descriptor is used by the controller to read
>    data from the host memory to the controller. In the 3.x.x driver the valid
>    range for this parameter is 64-256. The default value is 128. This parameter
>    can be changed using the command::
> 
>    ethtool -G eth? tx n
> 
>    Where n is the number of desired Tx descriptors.
> 
> Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by
>    default. The ethtool utility can be used as follows to force speed/duplex.::
> 
>    ethtool -s eth?  autoneg off speed {10|100} duplex {full|half}
> 
>    NOTE: setting the speed/duplex to incorrect values will cause the link to
>    fail.
> 
> Event Log Message Level:  The driver uses the message level flag to log events
>    to syslog. The message level can be set at driver load time. It can also be
>    set using the command::
> 
>    ethtool -s eth? msglvl n

causes

.../Documentation/networking/e100.rst:56: WARNING: Literal block expected; none found.
.../Documentation/networking/e100.rst:67: WARNING: Literal block expected; none found.
.../Documentation/networking/e100.rst:74: WARNING: Literal block expected; none found.
.../Documentation/networking/e100.rst:83: WARNING: Literal block expected; none found.

as the literal block has the same indentation as the preceding paragraph
(except for the first line).
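
A rough way to spot this pattern is sketched below. It is only a heuristic written for illustration, not how docutils parses reStructuredText: the function name find_unindented_literal_blocks() is hypothetical, indentation is measured in raw characters, and any line ending in "::" is treated as introducing a literal block.

```python
def find_unindented_literal_blocks(text):
    """Return 1-based numbers of '::' lines whose following block is not indented further."""
    lines = text.splitlines()
    problems = []
    for i, line in enumerate(lines):
        if not line.rstrip().endswith("::"):
            continue
        indent = len(line) - len(line.lstrip())
        # The next non-blank line is the candidate literal block.
        for nxt in lines[i + 1:]:
            if nxt.strip():
                if len(nxt) - len(nxt.lstrip()) <= indent:
                    problems.append(i + 1)
                break
    return problems


# Shortened sample modelled on the quoted e100.rst text.
BAD = (
    "Rx Descriptors: Number of receive descriptors.\n"
    "   This parameter can be changed using the command::\n"
    "\n"
    "   ethtool -G eth? rx n\n"
)
# Same text with the command indented deeper than its paragraph.
GOOD = BAD.replace("   ethtool", "     ethtool")
```

Running the function on BAD reports the line ending in "::", while GOOD yields no hits because the command is indented relative to the paragraph that introduces it.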

Michal Kubecek

^ permalink raw reply

* [PATCH v8 5/5 RESEND] ipc: Add a new ipcmni_compat sysctl to fall back to old behavior
From: Waiman Long @ 2018-06-18 13:25 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet
  Cc: linux-kernel, linux-fsdevel, linux-doc, Al Viro, Matthew Wilcox,
	Eric W. Biederman, Takashi Iwai, Davidlohr Bueso, Waiman Long

With strict range limit enforcement of msgmni, shmmni and sem, it is
possible that some existing applications that set those values to above
32k may fail. To help users to work around this potential problem, a new
boolean ipcmni_compat sysctl is added to provide the old behavior for
compatibility when it is set to 1. In other words, the limit will then be
enforced internally but no error will be reported.

This compatibility mode can only be enabled if the ipcmni_extend kernel
boot parameter is not specified.
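
The interaction between ipcmni_compat and ipcmni_extend can be modelled in userspace as a rough sketch. This is not kernel code: the class IpcMniState and its method names are hypothetical, the IPCMNI value of 32768 is an assumption, and the model only tracks which upper bound the sysctl handlers would accept, not the internal clamping.

```python
import errno

IPCMNI = 32768       # assumed default limit
INT_MAX = 2**31 - 1


class IpcMniState:
    """Hypothetical model of the globals touched by the ipcmni_compat handler."""

    def __init__(self, ipc_mni=IPCMNI, extended=False):
        self.ipc_mni = ipc_mni
        self.extended = extended    # True if ipcmni_extend was given at boot
        self.ipcmni_compat = 0
        self.ipcmni_max = ipc_mni

    def set_compat(self, value):
        # proc_dointvec_minmax() rejects values outside [0, 1].
        if value not in (0, 1):
            return -errno.EINVAL
        self.ipcmni_compat = value
        # ipcmni_compat can only be set if ipcmni_extend was not specified.
        if self.ipcmni_compat and self.extended:
            self.ipcmni_compat = 0
            return -errno.EINVAL
        self.ipcmni_max = INT_MAX if self.ipcmni_compat else self.ipc_mni
        return 0
```

With compat mode on, the accepted upper bound becomes INT_MAX, so writes above the real limit no longer fail; with ipcmni_extend in effect, the attempt to enable compat mode is refused and the bound is left unchanged.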

The sysctl documentation is also updated accordingly.

RESEND: Add missing ipcmni_max macro in util.h.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/sysctl/kernel.txt | 15 +++++++++++++++
 ipc/ipc_sysctl.c                | 42 ++++++++++++++++++++++++++++++++++++++---
 ipc/util.h                      |  6 ++++--
 3 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index eded671d..e98d967 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -39,6 +39,7 @@ show up in /proc/sys/kernel:
 - hung_task_check_count
 - hung_task_timeout_secs
 - hung_task_warnings
+- ipcmni_compat
 - kexec_load_disabled
 - kptr_restrict
 - l2cr                        [ PPC only ]
@@ -374,6 +375,20 @@ This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
 
 ==============================================================
 
+ipcmni_compat:
+
+A boolean flag to control range checking behavior of msgmni, shmmni
+and the mni portion of sem.
+
+0: Range limits will be strictly enforced and error will be returned
+   if limits are exceeded.
+1: Range limits will only be enforced internally and no error will be
+   returned if the upper limit is exceeded. This compatibility behavior
+   can only be selected if the ipcmni_extend kernel boot parameter is
+   not specified.
+
+==============================================================
+
 kexec_load_disabled:
 
 A toggle indicating if the kexec_load syscall has been disabled. This
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index d9ac6ca..5c0eac4 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -18,6 +18,8 @@
 #include <linux/msg.h>
 #include "util.h"
 
+static int ipcmni_compat;
+
 static void *get_ipc(struct ctl_table *table)
 {
 	char *which = table->data;
@@ -108,6 +110,25 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 	return ret;
 }
 
+static int proc_ipcmni_compat_minmax(struct ctl_table *table, int write,
+	void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (ret)
+		return ret;
+
+	/*
+	 * ipcmni_compat can only be set if !ipcmni_extend.
+	 */
+	if (ipcmni_compat && ipc_mni_extended) {
+		ipcmni_compat = 0;
+		return -EINVAL;
+	}
+	ipcmni_max = ipcmni_compat ? INT_MAX : ipc_mni;
+	return 0;
+}
+
 #else
 #define proc_ipc_doulongvec_minmax NULL
 #define proc_ipc_dointvec	   NULL
@@ -115,11 +136,13 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 #define proc_ipc_dointvec_minmax_orphans   NULL
 #define proc_ipc_auto_msgmni	   NULL
 #define proc_ipc_sem_dointvec	   NULL
+#define proc_ipcmni_compat_minmax  NULL
 #endif
 
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
+int ipcmni_max = IPCMNI;
 int ipc_mni = IPCMNI;
 int ipc_mni_shift = IPCMNI_SHIFT;
 bool ipc_mni_extended;
@@ -146,7 +169,7 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 		.mode		= 0644,
 		.proc_handler	= proc_ipc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &ipc_mni,
+		.extra2		= &ipcmni_max,
 	},
 	{
 		.procname	= "shm_rmid_forced",
@@ -173,7 +196,7 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 		.mode		= 0644,
 		.proc_handler	= proc_ipc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &ipc_mni,
+		.extra2		= &ipcmni_max,
 	},
 	{
 		.procname	= "auto_msgmni",
@@ -229,6 +252,19 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 		.extra2		= &int_max,
 	},
 #endif
+	/*
+	 * Unlike other IPC sysctl parameters above, the following sysctl
+	 * parameter is global and affect behavior for all the namespaces.
+	 */
+	{
+		.procname	= "ipcmni_compat",
+		.data		= &ipcmni_compat,
+		.maxlen		= sizeof(ipcmni_compat),
+		.mode		= 0644,
+		.proc_handler	= proc_ipcmni_compat_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 	{}
 };
 
@@ -251,7 +287,7 @@ static int __init ipc_sysctl_init(void)
 
 static int __init ipc_mni_extend(char *str)
 {
-	ipc_mni = IPCMNI_EXTEND;
+	ipc_mni = ipcmni_max = IPCMNI_EXTEND;
 	ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
 	ipc_mni_extended = true;
 	pr_info("IPCMNI extended to %d.\n", ipc_mni);
diff --git a/ipc/util.h b/ipc/util.h
index 62b6247..ce2db71 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_SYSVIPC_SYSCTL
 extern int ipc_mni;
 extern int ipc_mni_shift;
+extern int ipcmni_max;
 extern bool ipc_mni_extended;
 
 #define SEQ_SHIFT		ipc_mni_shift
@@ -38,6 +39,7 @@
 
 #define ipc_mni 		IPCMNI
 #define ipc_mni_extended	false
+#define ipcmni_max		IPCMNI
 #define SEQ_SHIFT		IPCMNI_SHIFT
 #define SEQ_MASK		((1 << IPCMNI_SHIFT) - 1)
 #endif /* CONFIG_SYSVIPC_SYSCTL */
@@ -244,10 +246,10 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 static inline int sem_check_semmni(struct ipc_namespace *ns) {
 	/*
-	 * Check semmni range [0, ipc_mni]
+	 * Check semmni range [0, ipcmni_max]
 	 * semmni is the last element of sem_ctls[4] array
 	 */
-	return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > ipc_mni))
+	return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > ipcmni_max))
 		? -ERANGE : 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH 0/2] Documentation/sphinx: add "nodocs" directive
From: Mike Rapoport @ 2018-06-18 13:36 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: Matthew Wilcox, Jani Nikula, linux-doc, Mike Rapoport

Hi,

These patches allow passing the "-no-doc-sections" option to
scripts/kernel-doc from the sphinx generator.

This avoids duplicated DOC: sections when the "kernel-doc::" directive
is used without explicit selection of functions or function types. For
instance, [1] has "IDA description" and "idr synchronization" twice.

[1] https://www.kernel.org/doc/html/v4.17/core-api/idr.html

Mike Rapoport (2):
  Documentation/sphinx: add "nodocs" directive
  docs/idr: use "nodocs" directive

 Documentation/core-api/idr.rst    | 2 ++
 Documentation/sphinx/kerneldoc.py | 3 +++
 2 files changed, 5 insertions(+)

-- 
2.7.4


^ permalink raw reply

* [PATCH 1/2] Documentation/sphinx: add "nodocs" directive
From: Mike Rapoport @ 2018-06-18 13:36 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: Matthew Wilcox, Jani Nikula, linux-doc, Mike Rapoport
In-Reply-To: <1529328996-16247-1-git-send-email-rppt@linux.vnet.ibm.com>

When kernel-doc:: is specified in an .rst document without explicit
directives, it outputs both comment and DOC: sections. If a DOC: section
was explicitly included in the same document, it will be duplicated. For
example, the
output generated for Documentation/core-api/idr.rst [1] has "IDA
description" in the "IDA usage" section and in the middle of the API
reference.

Adding the "nodocs" directive prevents the duplication without the need to
explicitly define which functions should be included in the API reference.
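
As a rough illustration of the option handling, the sketch below mirrors only the two branches visible in this patch. The helper name kernel_doc_args() is hypothetical, and the real directive checks other options before these, which are left out here.

```python
def kernel_doc_args(options):
    """Translate directive options into scripts/kernel-doc arguments (simplified sketch)."""
    cmd = []
    if 'functions' in options:
        # One '-function NAME' pair per listed function.
        for f in str(options.get('functions')).split():
            cmd += ['-function', f]
    elif 'nodocs' in options:
        # Suppress DOC: sections so they are not emitted a second time.
        cmd += ['-no-doc-sections']
    return cmd
```

Because the new branch is an elif, "nodocs" only takes effect when no explicit function list is given.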

[1] https://www.kernel.org/doc/html/v4.17/core-api/idr.html

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
 Documentation/sphinx/kerneldoc.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/sphinx/kerneldoc.py b/Documentation/sphinx/kerneldoc.py
index fbedcc3..bc5dd05 100644
--- a/Documentation/sphinx/kerneldoc.py
+++ b/Documentation/sphinx/kerneldoc.py
@@ -50,6 +50,7 @@ class KernelDocDirective(Directive):
         'functions': directives.unchanged_required,
         'export': directives.unchanged,
         'internal': directives.unchanged,
+        'nodocs': directives.unchanged,
     }
     has_content = False
 
@@ -77,6 +78,8 @@ class KernelDocDirective(Directive):
         elif 'functions' in self.options:
             for f in str(self.options.get('functions')).split():
                 cmd += ['-function', f]
+        elif 'nodocs' in self.options:
+            cmd += ['-no-doc-sections']
 
         for pattern in export_file_patterns:
             for f in glob.glob(env.config.kerneldoc_srctree + '/' + pattern):
-- 
2.7.4


^ permalink raw reply related
