Linux Documentation
* Re: [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches
From: Waiman Long @ 2019-07-02 19:15 UTC
  To: David Rientjes
  Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Andrew Morton,
	Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, linux-mm,
	linux-doc, linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli
In-Reply-To: <alpine.DEB.2.21.1907021206000.67286@chino.kir.corp.google.com>

On 7/2/19 3:09 PM, David Rientjes wrote:
> On Tue, 2 Jul 2019, Waiman Long wrote:
>
>> diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
>> index 29601d93a1c2..2a3d0fc4b4ac 100644
>> --- a/Documentation/ABI/testing/sysfs-kernel-slab
>> +++ b/Documentation/ABI/testing/sysfs-kernel-slab
>> @@ -429,10 +429,12 @@ KernelVersion:	2.6.22
>>  Contact:	Pekka Enberg <penberg@cs.helsinki.fi>,
>>  		Christoph Lameter <cl@linux-foundation.org>
>>  Description:
>> -		The shrink file is written when memory should be reclaimed from
>> -		a cache.  Empty partial slabs are freed and the partial list is
>> -		sorted so the slabs with the fewest available objects are used
>> -		first.
>> +		A value of '1' is written to the shrink file when memory should
>> +		be reclaimed from a cache.  Empty partial slabs are freed and
>> +		the partial list is sorted so the slabs with the fewest
>> +		available objects are used first.  When a value of '2' is
>> +		written, all the corresponding child memory cgroup caches
>> +		should be shrunk as well.  All other values are invalid.
>>  
> This should likely call out that '2' also does '1'; that might not be
> clear enough.

You are right. I will reword the text to make it clearer.


>>  What:		/sys/kernel/slab/cache/slab_size
>>  Date:		May 2007
>> diff --git a/mm/slab.h b/mm/slab.h
>> index 3b22931bb557..a16b2c7ff4dd 100644
>> --- a/mm/slab.h
>> +++ b/mm/slab.h
>> @@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *);
>>  void __kmemcg_cache_deactivate(struct kmem_cache *s);
>>  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
>>  void slab_kmem_cache_release(struct kmem_cache *);
>> +int kmem_cache_shrink_all(struct kmem_cache *s);
>>  
>>  struct seq_file;
>>  struct file;
>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>> index 464faaa9fd81..493697ba1da5 100644
>> --- a/mm/slab_common.c
>> +++ b/mm/slab_common.c
>> @@ -981,6 +981,49 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
>>  }
>>  EXPORT_SYMBOL(kmem_cache_shrink);
>>  
>> +/**
>> + * kmem_cache_shrink_all - shrink a cache and all its memcg children
>> + * @s: The root cache to shrink.
>> + *
>> + * Return: 0 if successful, -EINVAL if not a root cache
>> + */
>> +int kmem_cache_shrink_all(struct kmem_cache *s)
>> +{
>> +	struct kmem_cache *c;
>> +
>> +	if (!IS_ENABLED(CONFIG_MEMCG_KMEM)) {
>> +		kmem_cache_shrink(s);
>> +		return 0;
>> +	}
>> +	if (!is_root_cache(s))
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * The caller should have a reference to the root cache and so
>> +	 * we don't need to take the slab_mutex. We have to take the
>> +	 * slab_mutex, however, to iterate the memcg caches.
>> +	 */
>> +	get_online_cpus();
>> +	get_online_mems();
>> +	kasan_cache_shrink(s);
>> +	__kmem_cache_shrink(s);
>> +
>> +	mutex_lock(&slab_mutex);
>> +	for_each_memcg_cache(c, s) {
>> +		/*
>> +		 * Don't need to shrink deactivated memcg caches.
>> +		 */
>> +		if (c->flags & SLAB_DEACTIVATED)
>> +			continue;
>> +		kasan_cache_shrink(c);
>> +		__kmem_cache_shrink(c);
>> +	}
>> +	mutex_unlock(&slab_mutex);
>> +	put_online_mems();
>> +	put_online_cpus();
>> +	return 0;
>> +}
>> +
>>  bool slab_is_available(void)
>>  {
>>  	return slab_state >= UP;
> I'm wondering how long this could take, i.e. how long we hold slab_mutex 
> while we traverse each cache and shrink it.

It will depend on how many memcg caches there are. Actually, I have
been thinking about using the show method to report the time spent in
the last shrink operation. I am just not sure if it is worth doing. What
do you think?

-Longman



* Re: [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches
From: David Rientjes @ 2019-07-02 19:09 UTC
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, Joonsoo Kim, Andrew Morton,
	Alexander Viro, Jonathan Corbet, Luis Chamberlain, Kees Cook,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, linux-mm,
	linux-doc, linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli
In-Reply-To: <20190702183730.14461-1-longman@redhat.com>

On Tue, 2 Jul 2019, Waiman Long wrote:

> diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
> index 29601d93a1c2..2a3d0fc4b4ac 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-slab
> +++ b/Documentation/ABI/testing/sysfs-kernel-slab
> @@ -429,10 +429,12 @@ KernelVersion:	2.6.22
>  Contact:	Pekka Enberg <penberg@cs.helsinki.fi>,
>  		Christoph Lameter <cl@linux-foundation.org>
>  Description:
> -		The shrink file is written when memory should be reclaimed from
> -		a cache.  Empty partial slabs are freed and the partial list is
> -		sorted so the slabs with the fewest available objects are used
> -		first.
> +		A value of '1' is written to the shrink file when memory should
> +		be reclaimed from a cache.  Empty partial slabs are freed and
> +		the partial list is sorted so the slabs with the fewest
> +		available objects are used first.  When a value of '2' is
> +		written, all the corresponding child memory cgroup caches
> +		should be shrunk as well.  All other values are invalid.
>  

This should likely call out that '2' also does '1'; that might not be
clear enough.

>  What:		/sys/kernel/slab/cache/slab_size
>  Date:		May 2007
> diff --git a/mm/slab.h b/mm/slab.h
> index 3b22931bb557..a16b2c7ff4dd 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *);
>  void __kmemcg_cache_deactivate(struct kmem_cache *s);
>  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
>  void slab_kmem_cache_release(struct kmem_cache *);
> +int kmem_cache_shrink_all(struct kmem_cache *s);
>  
>  struct seq_file;
>  struct file;
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 464faaa9fd81..493697ba1da5 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -981,6 +981,49 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
>  }
>  EXPORT_SYMBOL(kmem_cache_shrink);
>  
> +/**
> + * kmem_cache_shrink_all - shrink a cache and all its memcg children
> + * @s: The root cache to shrink.
> + *
> + * Return: 0 if successful, -EINVAL if not a root cache
> + */
> +int kmem_cache_shrink_all(struct kmem_cache *s)
> +{
> +	struct kmem_cache *c;
> +
> +	if (!IS_ENABLED(CONFIG_MEMCG_KMEM)) {
> +		kmem_cache_shrink(s);
> +		return 0;
> +	}
> +	if (!is_root_cache(s))
> +		return -EINVAL;
> +
> +	/*
> +	 * The caller should have a reference to the root cache and so
> +	 * we don't need to take the slab_mutex. We have to take the
> +	 * slab_mutex, however, to iterate the memcg caches.
> +	 */
> +	get_online_cpus();
> +	get_online_mems();
> +	kasan_cache_shrink(s);
> +	__kmem_cache_shrink(s);
> +
> +	mutex_lock(&slab_mutex);
> +	for_each_memcg_cache(c, s) {
> +		/*
> +		 * Don't need to shrink deactivated memcg caches.
> +		 */
> +		if (c->flags & SLAB_DEACTIVATED)
> +			continue;
> +		kasan_cache_shrink(c);
> +		__kmem_cache_shrink(c);
> +	}
> +	mutex_unlock(&slab_mutex);
> +	put_online_mems();
> +	put_online_cpus();
> +	return 0;
> +}
> +
>  bool slab_is_available(void)
>  {
>  	return slab_state >= UP;

I'm wondering how long this could take, i.e. how long we hold slab_mutex 
while we traverse each cache and shrink it.

Acked-by: David Rientjes <rientjes@google.com>


* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
From: Waiman Long @ 2019-07-02 18:41 UTC
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm, linux-doc,
	linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli
In-Reply-To: <20190628073128.GC2751@dhcp22.suse.cz>

On 6/28/19 3:31 AM, Michal Hocko wrote:
> On Thu 27-06-19 17:16:04, Waiman Long wrote:
>> On 6/27/19 11:15 AM, Michal Hocko wrote:
>>> On Mon 24-06-19 13:42:19, Waiman Long wrote:
>>>> With the slub memory allocator, the numbers of active slab objects
>>>> reported in /proc/slabinfo are not real because they include objects
>>>> that are held by the per-cpu slab structures whether they are actually
>>>> used or not.  The problem gets worse the more CPUs a system has. For
>>>> instance, looking at the reported number of active task_struct objects,
>>>> one will wonder where all the missing tasks have gone.
>>>>
>>>> I know it is hard and costly to get a real count of active objects.
>>> What exactly is expensive? Why cannot slabinfo reduce the number of
>>> active objects by per-cpu cached objects?
>>>
>> The number of cachelines that need to be accessed in order to get an
>> accurate count will be much higher if we need to iterate through all the
>> per-cpu structures. In addition, accessing the per-cpu partial list will
>> be racy.
> Why is all that a problem for a root only interface that should be used
> quite rarely (it is not something that you should be reading hundreds of
> times per second, right)?

That can be true. Anyway, I have posted a new patch to use the existing
<slab>/shrink sysfs file to perform memcg cache shrinking as well. So I
am not going to pursue this patch.

Thanks,
Longman



* Re: [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches
From: Waiman Long @ 2019-07-02 18:39 UTC
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Roman Gushchin, Shakeel Butt, Andrea Arcangeli
In-Reply-To: <20190702183730.14461-1-longman@redhat.com>

On 7/2/19 2:37 PM, Waiman Long wrote:
> Currently, a value of '1' can be written to the
> /sys/kernel/slab/<slab>/shrink file to shrink the slab by flushing all
> the per-cpu slabs and freeing the empty slabs in partial lists. This
> applies only to the root caches, though.
>
> Extend this capability by shrinking all the child memcg caches and
> the root cache when a value of '2' is written to the shrink sysfs file.
>
> On a 4-socket 112-core 224-thread x86-64 system after a parallel kernel
> build, the amount of memory occupied by slabs before shrinking
> slabs was:
>
>  # grep task_struct /proc/slabinfo
>  task_struct         7114   7296   7744    4    8 : tunables    0    0
>  0 : slabdata   1824   1824      0
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            1310444 kB
>  SReclaimable:     377604 kB
>  SUnreclaim:       932840 kB
>
> After shrinking slabs:
>
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:             695652 kB
>  SReclaimable:     322796 kB
>  SUnreclaim:       372856 kB
>  # grep task_struct /proc/slabinfo
>  task_struct         2262   2572   7744    4    8 : tunables    0    0
>  0 : slabdata    643    643      0
>
> Signed-off-by: Waiman Long <longman@redhat.com>

This is a follow-up of my previous patch "mm, slab: Extend
vm/drop_caches to shrink kmem slabs". It is based on the linux-next tree.

-Longman

> ---
>  Documentation/ABI/testing/sysfs-kernel-slab | 10 +++--
>  mm/slab.h                                   |  1 +
>  mm/slab_common.c                            | 43 +++++++++++++++++++++
>  mm/slub.c                                   |  2 +
>  4 files changed, 52 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
> index 29601d93a1c2..2a3d0fc4b4ac 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-slab
> +++ b/Documentation/ABI/testing/sysfs-kernel-slab
> @@ -429,10 +429,12 @@ KernelVersion:	2.6.22
>  Contact:	Pekka Enberg <penberg@cs.helsinki.fi>,
>  		Christoph Lameter <cl@linux-foundation.org>
>  Description:
> -		The shrink file is written when memory should be reclaimed from
> -		a cache.  Empty partial slabs are freed and the partial list is
> -		sorted so the slabs with the fewest available objects are used
> -		first.
> +		A value of '1' is written to the shrink file when memory should
> +		be reclaimed from a cache.  Empty partial slabs are freed and
> +		the partial list is sorted so the slabs with the fewest
> +		available objects are used first.  When a value of '2' is
> +		written, all the corresponding child memory cgroup caches
> +		should be shrunk as well.  All other values are invalid.
>  
>  What:		/sys/kernel/slab/cache/slab_size
>  Date:		May 2007
> diff --git a/mm/slab.h b/mm/slab.h
> index 3b22931bb557..a16b2c7ff4dd 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *);
>  void __kmemcg_cache_deactivate(struct kmem_cache *s);
>  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
>  void slab_kmem_cache_release(struct kmem_cache *);
> +int kmem_cache_shrink_all(struct kmem_cache *s);
>  
>  struct seq_file;
>  struct file;
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 464faaa9fd81..493697ba1da5 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -981,6 +981,49 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
>  }
>  EXPORT_SYMBOL(kmem_cache_shrink);
>  
> +/**
> + * kmem_cache_shrink_all - shrink a cache and all its memcg children
> + * @s: The root cache to shrink.
> + *
> + * Return: 0 if successful, -EINVAL if not a root cache
> + */
> +int kmem_cache_shrink_all(struct kmem_cache *s)
> +{
> +	struct kmem_cache *c;
> +
> +	if (!IS_ENABLED(CONFIG_MEMCG_KMEM)) {
> +		kmem_cache_shrink(s);
> +		return 0;
> +	}
> +	if (!is_root_cache(s))
> +		return -EINVAL;
> +
> +	/*
> +	 * The caller should have a reference to the root cache and so
> +	 * we don't need to take the slab_mutex. We have to take the
> +	 * slab_mutex, however, to iterate the memcg caches.
> +	 */
> +	get_online_cpus();
> +	get_online_mems();
> +	kasan_cache_shrink(s);
> +	__kmem_cache_shrink(s);
> +
> +	mutex_lock(&slab_mutex);
> +	for_each_memcg_cache(c, s) {
> +		/*
> +		 * Don't need to shrink deactivated memcg caches.
> +		 */
> +		if (c->flags & SLAB_DEACTIVATED)
> +			continue;
> +		kasan_cache_shrink(c);
> +		__kmem_cache_shrink(c);
> +	}
> +	mutex_unlock(&slab_mutex);
> +	put_online_mems();
> +	put_online_cpus();
> +	return 0;
> +}
> +
>  bool slab_is_available(void)
>  {
>  	return slab_state >= UP;
> diff --git a/mm/slub.c b/mm/slub.c
> index a384228ff6d3..5d7b0004c51f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5298,6 +5298,8 @@ static ssize_t shrink_store(struct kmem_cache *s,
>  {
>  	if (buf[0] == '1')
>  		kmem_cache_shrink(s);
> +	else if (buf[0] == '2')
> +		kmem_cache_shrink_all(s);
>  	else
>  		return -EINVAL;
>  	return length;




* [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches
From: Waiman Long @ 2019-07-02 18:37 UTC
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Roman Gushchin, Shakeel Butt, Andrea Arcangeli, Waiman Long

Currently, a value of '1' can be written to the
/sys/kernel/slab/<slab>/shrink file to shrink the slab by flushing all
the per-cpu slabs and freeing the empty slabs in partial lists. This
applies only to the root caches, though.

Extend this capability by shrinking all the child memcg caches and
the root cache when a value of '2' is written to the shrink sysfs file.

On a 4-socket 112-core 224-thread x86-64 system after a parallel kernel
build, the amount of memory occupied by slabs before shrinking
slabs was:

 # grep task_struct /proc/slabinfo
 task_struct         7114   7296   7744    4    8 : tunables    0    0
 0 : slabdata   1824   1824      0
 # grep "^S[lRU]" /proc/meminfo
 Slab:            1310444 kB
 SReclaimable:     377604 kB
 SUnreclaim:       932840 kB

After shrinking slabs:

 # grep "^S[lRU]" /proc/meminfo
 Slab:             695652 kB
 SReclaimable:     322796 kB
 SUnreclaim:       372856 kB
 # grep task_struct /proc/slabinfo
 task_struct         2262   2572   7744    4    8 : tunables    0    0
 0 : slabdata    643    643      0

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/ABI/testing/sysfs-kernel-slab | 10 +++--
 mm/slab.h                                   |  1 +
 mm/slab_common.c                            | 43 +++++++++++++++++++++
 mm/slub.c                                   |  2 +
 4 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d93a1c2..2a3d0fc4b4ac 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -429,10 +429,12 @@ KernelVersion:	2.6.22
 Contact:	Pekka Enberg <penberg@cs.helsinki.fi>,
 		Christoph Lameter <cl@linux-foundation.org>
 Description:
-		The shrink file is written when memory should be reclaimed from
-		a cache.  Empty partial slabs are freed and the partial list is
-		sorted so the slabs with the fewest available objects are used
-		first.
+		A value of '1' is written to the shrink file when memory should
+		be reclaimed from a cache.  Empty partial slabs are freed and
+		the partial list is sorted so the slabs with the fewest
+		available objects are used first.  When a value of '2' is
+		written, all the corresponding child memory cgroup caches
+		should be shrunk as well.  All other values are invalid.
 
 What:		/sys/kernel/slab/cache/slab_size
 Date:		May 2007
diff --git a/mm/slab.h b/mm/slab.h
index 3b22931bb557..a16b2c7ff4dd 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *);
 void __kmemcg_cache_deactivate(struct kmem_cache *s);
 void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
+int kmem_cache_shrink_all(struct kmem_cache *s);
 
 struct seq_file;
 struct file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 464faaa9fd81..493697ba1da5 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -981,6 +981,49 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+/**
+ * kmem_cache_shrink_all - shrink a cache and all its memcg children
+ * @s: The root cache to shrink.
+ *
+ * Return: 0 if successful, -EINVAL if not a root cache
+ */
+int kmem_cache_shrink_all(struct kmem_cache *s)
+{
+	struct kmem_cache *c;
+
+	if (!IS_ENABLED(CONFIG_MEMCG_KMEM)) {
+		kmem_cache_shrink(s);
+		return 0;
+	}
+	if (!is_root_cache(s))
+		return -EINVAL;
+
+	/*
+	 * The caller should have a reference to the root cache and so
+	 * we don't need to take the slab_mutex. We have to take the
+	 * slab_mutex, however, to iterate the memcg caches.
+	 */
+	get_online_cpus();
+	get_online_mems();
+	kasan_cache_shrink(s);
+	__kmem_cache_shrink(s);
+
+	mutex_lock(&slab_mutex);
+	for_each_memcg_cache(c, s) {
+		/*
+		 * Don't need to shrink deactivated memcg caches.
+		 */
+		if (c->flags & SLAB_DEACTIVATED)
+			continue;
+		kasan_cache_shrink(c);
+		__kmem_cache_shrink(c);
+	}
+	mutex_unlock(&slab_mutex);
+	put_online_mems();
+	put_online_cpus();
+	return 0;
+}
+
 bool slab_is_available(void)
 {
 	return slab_state >= UP;
diff --git a/mm/slub.c b/mm/slub.c
index a384228ff6d3..5d7b0004c51f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5298,6 +5298,8 @@ static ssize_t shrink_store(struct kmem_cache *s,
 {
 	if (buf[0] == '1')
 		kmem_cache_shrink(s);
+	else if (buf[0] == '2')
+		kmem_cache_shrink_all(s);
 	else
 		return -EINVAL;
 	return length;
-- 
2.18.1



* Re: [PATCH v1 1/2] Documentation/filesystems: add binderfs
From: Matthew Wilcox @ 2019-07-02 17:57 UTC
  To: Jonathan Corbet; +Cc: Christian Brauner, linux-doc, linux-kernel
In-Reply-To: <20190114172401.018afb9c@lwn.net>

On Mon, Jan 14, 2019 at 05:24:01PM -0700, Jonathan Corbet wrote:
> On Fri, 11 Jan 2019 14:40:59 +0100
> Christian Brauner <christian@brauner.io> wrote:
> > This documents the Android binderfs filesystem used to dynamically add and
> > remove binder devices that are private to each instance.
> 
> You didn't add it to index.rst, so it won't actually become part of the
> docs build.

I think you added it in the wrong place.

From 8167b80c950834da09a9204b6236f238197c197b Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Tue, 2 Jul 2019 13:54:38 -0400
Subject: [PATCH] docs: Move binderfs to admin-guide

The documentation is more appropriate for the administrator than for
the internal kernel API section it is currently in.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 .../{filesystems => admin-guide}/binderfs.rst          |  0
 Documentation/admin-guide/index.rst                    |  1 +
 Documentation/filesystems/index.rst                    | 10 ----------
 3 files changed, 1 insertion(+), 10 deletions(-)
 rename Documentation/{filesystems => admin-guide}/binderfs.rst (100%)

diff --git a/Documentation/filesystems/binderfs.rst b/Documentation/admin-guide/binderfs.rst
similarity index 100%
rename from Documentation/filesystems/binderfs.rst
rename to Documentation/admin-guide/binderfs.rst
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 8001917ee012..24fbe0568eff 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -70,6 +70,7 @@ configure specific aspects of kernel behavior to your liking.
    ras
    bcache
    ext4
+   binderfs
    pm/index
    thunderbolt
    LSM/index
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 1131c34d77f6..970c0a3ec377 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -31,13 +31,3 @@ filesystem implementations.
 
    journalling
    fscrypt
-
-Filesystem-specific documentation
-=================================
-
-Documentation for individual filesystem types can be found here.
-
-.. toctree::
-   :maxdepth: 2
-
-   binderfs.rst
-- 
2.20.1



* Re: [PATCH v5 07/18] kunit: test: add initial tests
From: Brendan Higgins @ 2019-07-02 17:52 UTC
  To: Luis Chamberlain
  Cc: Frank Rowand, Greg KH, Josh Poimboeuf, Kees Cook, Kieran Bingham,
	Peter Zijlstra, Rob Herring, Stephen Boyd, shuah,
	Theodore Ts'o, Masahiro Yamada, devicetree, dri-devel,
	kunit-dev, open list:DOCUMENTATION, linux-fsdevel, linux-kbuild,
	Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK,
	linux-nvdimm, linux-um, Sasha Levin, Bird, Timothy,
	Amir Goldstein, Dan Carpenter, Daniel Vetter, Jeff Dike,
	Joel Stanley, Julia Lawall, Kevin Hilman, Knut Omang,
	Logan Gunthorpe, Michael Ellerman, Petr Mladek, Randy Dunlap,
	Richard Weinberger, David Rientjes, Steven Rostedt, wfg
In-Reply-To: <CAFd5g46mnd=a0OqFCx0hOHX+DxW+5yA2LXH5Q0gEg8yUZK=4FA@mail.gmail.com>

On Wed, Jun 26, 2019 at 12:53 AM Brendan Higgins
<brendanhiggins@google.com> wrote:
>
> On Tue, Jun 25, 2019 at 4:22 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Jun 17, 2019 at 01:26:02AM -0700, Brendan Higgins wrote:
> > > diff --git a/kunit/example-test.c b/kunit/example-test.c
> > > new file mode 100644
> > > index 0000000000000..f44b8ece488bb
> > > --- /dev/null
> > > +++ b/kunit/example-test.c
> >
> > <-- snip -->
> >
> > > +/*
> > > + * This defines a suite or grouping of tests.
> > > + *
> > > + * Test cases are defined as belonging to the suite by adding them to
> > > + * `kunit_cases`.
> > > + *
> > > + * Often it is desirable to run some function which will set up things which
> > > + * will be used by every test; this is accomplished with an `init` function
> > > + * which runs before each test case is invoked. Similarly, an `exit` function
> > > > + * may be specified which runs after every test case and can be used for
> > > + * cleanup. For clarity, running tests in a test module would behave as follows:
> > > + *
> >
> > To be clear this is not the kernel module init, but rather the kunit
> > module init. I think using kmodule would make this clearer to a reader.
>
> Seems reasonable. Will fix in next revision.
>
> > > + * module.init(test);
> > > + * module.test_case[0](test);
> > > + * module.exit(test);
> > > + * module.init(test);
> > > + * module.test_case[1](test);
> > > + * module.exit(test);
> > > + * ...;
> > > + */

Do you think it might be clearer yet to rename `struct kunit_module
*module;` to `struct kunit_suite *suite;`?


* RE: [PATCH v7 1/2] fTPM: firmware TPM running in TEE
From: Thirupathaiah Annapureddy @ 2019-07-02 16:54 UTC
  To: Ilias Apalodimas, Jarkko Sakkinen
  Cc: Sasha Levin, peterhuewe@gmx.de, jgg@ziepe.ca, corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-integrity@vger.kernel.org, Microsoft Linux Kernel List,
	Bryan Kelly (CSI), tee-dev@lists.linaro.org,
	sumit.garg@linaro.org, rdunlap@infradead.org
In-Reply-To: <20190702142109.GA32069@apalos>

Hi Ilias,

First of all, thanks a lot for trying to test the driver.

> -----Original Message-----
> From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Sent: Tuesday, July 2, 2019 7:21 AM
> To: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Cc: Sasha Levin <sashal@kernel.org>; peterhuewe@gmx.de; jgg@ziepe.ca;
> corbet@lwn.net; linux-kernel@vger.kernel.org; linux-doc@vger.kernel.org;
> linux-integrity@vger.kernel.org; Microsoft Linux Kernel List
> <linux-kernel@microsoft.com>; Thirupathaiah Annapureddy <thiruan@microsoft.com>;
> Bryan Kelly (CSI) <bryankel@microsoft.com>; tee-dev@lists.linaro.org;
> sumit.garg@linaro.org; rdunlap@infradead.org
> Subject: Re: [PATCH v7 1/2] fTPM: firmware TPM running in TEE
> 
> Hi,
> 
> > On Thu, 2019-06-27 at 16:30 +0300, Ilias Apalodimas wrote:
> > > is really useful. I don't have hardware to test this at the moment, but once i
> > > get it, i'll give it a spin.
> >
> > Thank you for responding, really appreciate it.
> >
> No worries
> > Please note, however, that I already did my v5.3 PR so there is a lot of
> > time to give it a spin. In all cases, we will find a way to put this to
> > my v5.4 PR. I don't see any reason why not.
> >
> > As soon as the cosmetic stuff is fixed that I remarked in v7 I'm ready
> > to take this to my tree and after that soonish make it available on
> > linux-next.
> I managed to do some quick testing in QEMU.
> Everything works fine when i build this as a module (using IBM's TPM 2.0 TSS)
> 
> - As module
> # insmod /lib/modules/5.2.0-rc1/kernel/drivers/char/tpm/tpm_ftpm_tee.ko
> # getrandom -by 8
> randomBytes length 8
> 23 b9 3d c3 90 13 d9 6b
> 
> - Built-in
> # dmesg | grep optee
> ftpm-tee firmware:optee: ftpm_tee_probe:tee_client_open_session failed, err=ffff0008
This (0xffff0008) translates to TEE_ERROR_ITEM_NOT_FOUND.

Where is the fTPM TA located in your test setup?
Is it stitched into the TEE binary as an EARLY_TA, or is it expected to
be loaded at run time with the help of the user mode OP-TEE supplicant?

My guess is that you are trying to load the fTPM TA through the user mode
OP-TEE supplicant. Can you confirm?
If that is true:
- In the case of the driver built as a module (CONFIG_TCG_FTPM_TEE=m),
this works fine as the user mode supplicant is ready.
- In the built-in case (CONFIG_TCG_FTPM_TEE=y), this would result in the
above error 0xffff0008 as the TEE is unable to find the fTPM TA.

The expectation is that the fTPM TA is built as an EARLY_TA (in BL32) so
that the U-Boot and Linux driver stacks work seamlessly without a
dependency on the supplicant.


> ftpm-tee: probe of firmware:optee failed with error -22
> # getrandom -by 8
> random: fast init done
> urandom_read: 2 callbacks suppressed
> random: getrandom: uninitialized urandom read (32 bytes read)
> TSS_Dev_Open: Error opening /dev/tpm0
> getrandom: failed, rc 000b0008
> TSS_RC_NO_CONNECTION - Failure connecting to lower layer
> 
> Am i missing anything?
> 
> Thanks
> /Ilias


* Re: [PATCH 10/10] tools/power/x86: A tool to validate Intel Speed Select commands
From: Andy Shevchenko @ 2019-07-02 15:39 UTC
  To: Len Brown
  Cc: Srinivas Pandruvada, Darren Hart, Andy Shevchenko,
	Andriy Shevchenko, Jonathan Corbet, Rafael J. Wysocki, Alan Cox,
	Prarit Bhargava, David Arcari, Linux Documentation List,
	Linux Kernel Mailing List, Platform Driver
In-Reply-To: <CAJvTdK=S1vPGg9HZjUxJN2aXSfSXBDyYYLawONA0PP_yKvf19A@mail.gmail.com>

On Tue, Jul 2, 2019 at 5:42 PM Len Brown <lenb@kernel.org> wrote:
>
> Acked-by: Len Brown <len.brown@intel.com>
>

Thanks!
I hope this is applicable for v2.

> On Sat, Jun 29, 2019 at 10:31 AM Andy Shevchenko
> <andy.shevchenko@gmail.com> wrote:
> >
> > On Thu, Jun 27, 2019 at 1:39 AM Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com> wrote:
> > >
> > > > The Intel(R) Speed Select technologies contain four features.
> > > >
> > > > Performance profile: A non-architectural mechanism that allows multiple
> > > > optimized performance profiles per system via static and/or dynamic
> > > > adjustment of core count, workload, Tjmax, and TDP, etc. aka ISS
> > > > in the documentation.
> > > >
> > > > Base Frequency: Enables users to increase guaranteed base frequency on
> > > > certain cores (high priority cores) in exchange for lower base frequency
> > > > on remaining cores (low priority cores). aka PBF in the documentation.
> > > >
> > > > Turbo frequency: Enables the ability to set different turbo ratio limits
> > > > to cores based on priority. aka FACT in the documentation.
> > > >
> > > > Core power: An interface that allows the user to define per core/tile
> > > > priority.
> > >
> > > There is a multi level help for commands and options. This can be used
> > > to check required arguments for each feature and commands for the
> > > feature.
> > >
> > > To start navigating the features start with
> > >
> > > $sudo intel-speed-select --help
> > >
> > > For help on a specific feature for example
> > > $sudo intel-speed-select perf-profile --help
> > >
> > > To get help for a command for a feature for example
> > > $sudo intel-speed-select perf-profile get-lock-status --help
> > >
> >
> > I need an Ack from tools/power maintainer(s) for this.

-- 
With Best Regards,
Andy Shevchenko


* Re: [linux-kernel-mentees] [PATCH v5] Doc : fs : convert xfs.txt to ReST
From: Darrick J. Wong @ 2019-07-02 15:22 UTC
  To: Sheriff Esseson
  Cc: skhan, linux-xfs, corbet, linux-doc, linux-kernel,
	linux-kernel-mentees
In-Reply-To: <20190702123040.GA30111@localhost>

On Tue, Jul 02, 2019 at 01:30:40PM +0100, Sheriff Esseson wrote:
> Convert xfs.txt to ReST, rename it, and fix the resulting broken references.
> 
> Make the name "value" in "option=value" look like a variable (which it probably
> is) by embedding it in angle brackets "<>", rather than something predefined
> elsewhere. This is in line with the conventions in manuals.
> 
> Also, mark the defaults of boolean options with "(*)". This is so that
> options can be compressed to "[no]option" on a single line, which renders
> consistently and nicely in htmldocs.
> 
> Lastly, enforce a "one option, one definition" policy to keep things
> consistent and simple.
> 
> 
> Signed-off-by: Sheriff Esseson <sheriffesseson@gmail.com>
> ---
> 
> v5 aims to comply with the guiding comments on its previous versions.
> 
>  Documentation/filesystems/dax.txt   |   2 +-
>  Documentation/filesystems/index.rst |   5 +-
>  Documentation/filesystems/xfs.rst   | 468 +++++++++++++++++++++++++++
>  Documentation/filesystems/xfs.txt   | 470 ----------------------------
>  MAINTAINERS                         |   2 +-
>  5 files changed, 473 insertions(+), 474 deletions(-)
>  create mode 100644 Documentation/filesystems/xfs.rst
>  delete mode 100644 Documentation/filesystems/xfs.txt
> 
> diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
> index 6d2c0d340..c333285b8 100644
> --- a/Documentation/filesystems/dax.txt
> +++ b/Documentation/filesystems/dax.txt
> @@ -76,7 +76,7 @@ exposure of uninitialized data through mmap.
>  These filesystems may be used for inspiration:
>  - ext2: see Documentation/filesystems/ext2.txt
>  - ext4: see Documentation/filesystems/ext4/
> -- xfs:  see Documentation/filesystems/xfs.txt
> +- xfs:  see Documentation/filesystems/xfs.rst
>  
>  
>  Handling Media Errors
> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
> index 1131c34d7..a4cf5fca4 100644
> --- a/Documentation/filesystems/index.rst
> +++ b/Documentation/filesystems/index.rst
> @@ -16,7 +16,7 @@ algorithms work.
>  .. toctree::
>     :maxdepth: 2
>  
> -   path-lookup.rst
> +   path-lookup
>     api-summary
>     splice
>  
> @@ -40,4 +40,5 @@ Documentation for individual filesystem types can be found here.
>  .. toctree::
>     :maxdepth: 2
>  
> -   binderfs.rst
> +   binderfs
> +   xfs
> diff --git a/Documentation/filesystems/xfs.rst b/Documentation/filesystems/xfs.rst
> new file mode 100644
> index 000000000..d36ef042c
> --- /dev/null
> +++ b/Documentation/filesystems/xfs.rst
> @@ -0,0 +1,468 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +======================
> +The SGI XFS Filesystem
> +======================
> +
> +XFS is a high performance journaling filesystem which originated
> +on the SGI IRIX platform.  It is completely multi-threaded, can
> +support large files and large filesystems, extended attributes,
> +variable block sizes, is extent based, and makes extensive use of
> +Btrees (directories, extents, free space) to aid both performance
> +and scalability.
> +
> +Refer to the documentation at https://xfs.wiki.kernel.org/
> +for further details.  This implementation is on-disk compatible
> +with the IRIX version of XFS.
> +
> +
> +Mount Options
> +=============
> +
> +When mounting an XFS filesystem, the following options are accepted.  For
> +boolean mount options, the name with the "(*)" prefix is the default behaviour.
> +For example, take a behaviour enabled by default to be a one (1) or, a zero (0)
> +otherwise, ``(*)[no]default`` would be 0 while ``[no](*)default`` , a 1.

That's really confusing.  Does this mean that I can pass discard=1 now?

(no)

I also don't really understand why we need to cram so much into a single
line.  Why not just:

    discard
    nodiscard (default)
	Something something discard chomps on free space, chomp chomp
	a chewy chomp, pretend I wrote the real text here, etc.

Or if you really want the single-line header...

    [no]discard
	Something something discard chomps on free space, chomp chomp
	a chewy chomp, pretend I wrote the real text here, etc.

	``nodiscard`` is the default setting.

Please don't introduce '1's and '0's here because there are other parts
of xfs where you /can/ enable or disable features by saying "foo=0" or
"foo=1".

There's probably more to say but the text reflowing since v1 makes this
patch unreviewable because ugh 1000-line diff.

--D

> +
> +   allocsize=<size>
> +        Sets the buffered I/O end-of-file preallocation size when doing delayed
> +        allocation writeout (default size is 64KiB).  Valid values for this
> +        option are page size (typically 4KiB) through to 1GiB, inclusive, in
> +        power-of-2 increments.
> +
> +        The default behaviour is for dynamic end-of-file preallocation size,
> +        which uses a set of heuristics to optimise the preallocation size based
> +        on the current allocation patterns within the file and the access
> +        patterns to the file. Specifying a fixed allocsize value turns off the
> +        dynamic behaviour.
> +
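
A minimal sketch of the allocsize range rule quoted above, for anyone
checking values before mounting.  It reads "power-of-2 increments" as
"the value must itself be a power of two", and it assumes a 4KiB page
size; the function and constant names are made up for illustration and
are not taken from the kernel.

```python
PAGE_SIZE = 4096          # assumption: 4KiB pages, the "typical" case in the text
MAX_ALLOCSIZE = 1 << 30   # 1GiB upper bound from the text


def allocsize_is_valid(size: int, page_size: int = PAGE_SIZE) -> bool:
    """Return True if size is a power of two in [page_size, 1GiB]."""
    # A positive integer is a power of two when it has exactly one bit set.
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and page_size <= size <= MAX_ALLOCSIZE
```

Under this reading, the 64KiB value mentioned in the text passes, while
64KiB + 4KiB and anything below the page size do not.
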
> +   [no]attr2
> +        These options enable/disable an "opportunistic" improvement to be made in
> +        the way inline extended attributes are stored on-disk.  When the new
> +        form is used for the first time when ``attr2`` is selected (either when
> +        setting or removing extended attributes) the on-disk superblock feature
> +        bit field will be updated to reflect this format being in use.
> +
> +        The default behaviour is determined by the on-disk feature bit
> +        indicating that ``attr2`` behaviour is active. If either mount option is
> +        set, then that becomes the new default used by the filesystem. However,
> +        CRC enabled filesystems always use the ``attr2`` format, and so will
> +        reject the ``noattr2`` mount option if it is set.
> +
> +   (*)[no]discard
> +        Enable/disable the issuing of commands to let the block device reclaim
> +        space freed by the filesystem.  This is useful for SSD devices, thinly
> +        provisioned LUNs and virtual machine images, but may have a performance
> +        impact.
> +
> +        Note: It is currently recommended that you use the ``fstrim``
> +        application to discard unused blocks rather than the ``discard`` mount
> +        option because the performance impact of this option is quite severe.
> +
> +   grpid/bsdgroups
> +   nogrpid/(*)sysvgroups
> +        These options define what group ID a newly created file gets.  When
> +        ``grpid`` is set, it takes the group ID of the directory in which it is
> +        created; otherwise it takes the ``fsgid`` of the current process, unless
> +        the directory has the ``setgid`` bit set, in which case it takes the
> +        ``gid`` from the parent directory, and also gets the ``setgid`` bit set
> +        if it is a directory itself.
> +
> +   filestreams
> +        Make the data allocator use the filestreams allocation mode across the
> +        entire filesystem rather than just on directories configured to use it.
> +
> +   (*)[no]ikeep
> +        When ``ikeep`` is specified, XFS does not delete empty inode clusters
> +        and keeps them around on disk.  When ``noikeep`` is specified, empty
> +        inode clusters are returned to the free space pool.
> +
> +   inode32 | (*)inode64
> +        When ``inode32`` is specified, it indicates that XFS limits inode
> +        creation to locations which will not result in inode numbers with more
> +        than 32 bits of significance.
> +
> +        When ``inode64`` is specified, it indicates that XFS is allowed to
> +        create inodes at any location in the filesystem, including those which
> +        will result in inode numbers occupying more than 32 bits of
> +        significance.
> +
> +        ``inode32`` is provided for backwards compatibility with older systems
> +        and applications, since 64-bit inode numbers might cause problems for
> +        some applications that cannot handle large inode numbers.  If
> +        applications are in use which do not handle inode numbers bigger than 32
> +        bits, the ``inode32`` option should be specified.
> +
> +
> +   (*)[no]largeio
> +        If ``nolargeio`` is specified, the optimal I/O reported in st_blksize by
> +        **stat(2)** will be as small as possible to allow user applications to
> +        avoid inefficient read/modify/write I/O.  This is typically the page
> +        size of the machine, as this is the granularity of the page cache.
> +
> +        If ``largeio`` is specified, a filesystem that was created with a
> +        ``swidth`` specified will return the ``swidth`` value (in bytes) in
> +        st_blksize. If the filesystem does not have a ``swidth`` specified but
> +        does specify an ``allocsize`` then ``allocsize`` (in bytes) will be
> +        returned instead. Otherwise the behaviour is the same as if
> +        ``nolargeio`` was specified.
> +
> +   logbufs=<value>
> +        Set the number of in-memory log buffers to ``value``.  Valid numbers
> +        range from 2-8 inclusive.
> +
> +        The default value is 8 buffers.
> +
> +        If the memory cost of 8 log buffers is too high on small systems, then
> +        it may be reduced at some cost to performance on metadata intensive
> +        workloads. The ``logbsize`` option below controls the size of each
> +        buffer and so is also relevant to this case.
> +
> +   logbsize=<value>
> +        Set the size of each in-memory log buffer to ``value``.  The size may be
> +        specified in bytes, or in kilobytes with a "k" suffix. Valid sizes for
> +        version 1 and version 2 logs are 16384 (16k) and 32768 (32k).  Valid
> +        sizes for version 2 logs also include 65536 (64k), 131072 (128k) and
> +        262144 (256k). The ``logbsize`` must be an integer multiple of the
> +        "log stripe unit" configured at mkfs time.
> +
> +        The default value for version 1 logs is 32768, while the default
> +        value for version 2 logs is ``MAX(32768, log_sunit)``.
> +
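
The logbsize rules above can be sketched roughly as follows.  This is
only a model of the documented constraints, not the kernel's check: the
names are hypothetical, log_sunit is assumed to be given in bytes, and
a log_sunit of 0 is assumed to mean "no log stripe unit configured".

```python
V1_SIZES = (16384, 32768)                 # valid for version 1 and 2 logs
V2_EXTRA_SIZES = (65536, 131072, 262144)  # additionally valid for version 2


def logbsize_is_valid(size: int, log_version: int, log_sunit: int = 0) -> bool:
    """Check size against the documented list and the stripe-unit rule."""
    # Anything other than version 1 is treated as version 2 in this sketch.
    allowed = V1_SIZES if log_version == 1 else V1_SIZES + V2_EXTRA_SIZES
    if size not in allowed:
        return False
    # logbsize must be an integer multiple of the log stripe unit, if any.
    return log_sunit == 0 or size % log_sunit == 0


def default_logbsize(log_version: int, log_sunit: int = 0) -> int:
    """Documented default: 32768 for v1, MAX(32768, log_sunit) for v2."""
    return 32768 if log_version == 1 else max(32768, log_sunit)
```

For example, with a 64k log stripe unit on a version 2 log, 256k is
accepted by this model while 32k is not, since 32k is not a multiple of
the stripe unit.
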
> +   logdev=<device>
> +        Use ``device`` as an external log (metadata journal).  In an XFS
> +        filesystem, the log device can be separate from the data device or
> +        contained within it.
> +
> +   rtdev=<device>
> +        An XFS filesystem has up to three parts: a data section, a log section,
> +        and a real-time section.  The real-time section is optional.  If
> +        enabled, ``rtdev`` sets ``device`` to be used as an external real-time
> +        section, similar to ``logdev`` above.
> +
> +   noalign
> +        Data allocations will not be aligned at stripe unit boundaries. This is
> +        only relevant to filesystems created with non-zero data alignment
> +        parameters (sunit, swidth) by mkfs.
> +
> +   norecovery
> +        The filesystem will be mounted without running log recovery.  If the
> +        filesystem was not cleanly unmounted, it is likely to be inconsistent
> +        when mounted in ``norecovery`` mode.  Some files or directories may not
> +        be accessible because of this.  Filesystems mounted ``norecovery`` must
> +        be mounted read-only or the mount will fail.
> +
> +   nouuid
> +        Don't check for double mounted file systems using the file system uuid.
> +        This is useful to mount LVM snapshot volumes, and often used in
> +        combination with ``norecovery`` for mounting read-only snapshots.
> +
> +   noquota
> +        Forcibly turns off all quota accounting and enforcement
> +        within the filesystem.
> +
> +   uquota/usrquota/uqnoenforce/quota
> +        User disk quota accounting enabled, and limits (optionally) enforced.
> +        Refer to **xfs_quota(8)** for further details.
> +
> +   gquota/grpquota/gqnoenforce
> +        Group disk quota accounting enabled and limits (optionally) enforced.
> +        Refer to **xfs_quota(8)** for further details.
> +
> +   pquota/prjquota/pqnoenforce
> +        Project disk quota accounting enabled and limits (optionally) enforced.
> +        Refer to **xfs_quota(8)** for further details.
> +
> +   sunit=<value>
> +        Used to specify the stripe unit for a RAID device or (in conjunction
> +        with ``swidth`` below) a stripe volume.  ``value`` must be specified in
> +        512-byte block units. This option is only relevant to filesystems that
> +        were created with non-zero data alignment parameters.
> +
> +        The ``sunit`` parameter specified must be compatible with the existing
> +        filesystem alignment characteristics.  In general, that means the only
> +        valid changes to ``sunit`` are increasing it by a power-of-2 multiple.
> +
> +        Typically, this mount option is necessary only after an underlying RAID
> +        device has had its geometry modified, such as adding a new disk to a
> +        RAID5 lun and reshaping it.
> +
> +   swidth=<value>
> +        Used to specify the stripe width for a RAID device or (in conjunction
> +        with ``sunit`` above) a stripe volume.  ``value`` must be specified in
> +        512-byte block units. This option, like ``sunit`` above, is only
> +        relevant to filesystems that were created with non-zero data alignment
> +        parameters.
> +
> +        The ``swidth`` parameter specified must be compatible with the existing
> +        filesystem alignment characteristics.  In general, that means the only
> +        valid swidth values are any integer multiple of a valid ``sunit`` value.
> +
> +        Typically, this mount option is necessary only after an underlying RAID
> +        device has had its geometry modified, such as adding a new disk to a
> +        RAID5 lun and reshaping it.
> +
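
Since both sunit and swidth take 512-byte block units, a short
conversion sketch may help.  The helper names are hypothetical, and
computing swidth as "stripe unit times number of data disks" is an
assumption about a typical RAID layout; the text itself only requires
swidth to be an integer multiple of a valid sunit.

```python
UNIT_BYTES = 512  # sunit/swidth are given in 512-byte block units


def to_512_units(nbytes: int) -> int:
    """Convert a byte count to 512-byte units, rejecting partial units."""
    if nbytes <= 0 or nbytes % UNIT_BYTES:
        raise ValueError("size must be a positive multiple of 512 bytes")
    return nbytes // UNIT_BYTES


def stripe_mount_options(stripe_unit_bytes: int, data_disks: int) -> str:
    """Build a 'sunit=...,swidth=...' mount option string."""
    sunit = to_512_units(stripe_unit_bytes)
    swidth = sunit * data_disks  # always an integer multiple of sunit
    return f"sunit={sunit},swidth={swidth}"
```

With a 64KiB stripe unit across four data disks this yields
"sunit=128,swidth=512".
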
> +
> +   swalloc
> +        Data allocations will be rounded up to stripe width boundaries when the
> +        current end of file is being extended and the file size is larger than
> +        the stripe width size.
> +
> +   wsync
> +        When specified, all filesystem namespace operations are executed
> +        synchronously. This ensures that when the namespace operation (create,
> +        unlink, etc) completes, the change to the namespace is on stable
> +        storage. This is useful in HA setups where failover must not result in
> +        clients seeing inconsistent namespace presentation during or after a
> +        failover event.
> +
> +
> +Deprecated Mount Options
> +========================
> +
> +  Name				Removal Schedule
> +  ----				----------------
> +
> +
> +Removed Mount Options
> +=====================
> +
> +  Name				Removed
> +  ----				-------
> +  delaylog/nodelaylog		v4.0
> +  ihashsize			v4.0
> +  irixsgid			v4.0
> +  osyncisdsync/osyncisosync	v4.0
> +  barrier			v4.19
> +  nobarrier			v4.19
> +
> +
> +sysctls
> +=======
> +
> +The following sysctls are available for the XFS filesystem:
> +
> +  fs.xfs.stats_clear		(Min: 0  Default: 0  Max: 1)
> +	Setting this to "1" clears accumulated XFS statistics
> +	in /proc/fs/xfs/stat.  It then immediately resets to "0".
> +
> +  fs.xfs.xfssyncd_centisecs	(Min: 100  Default: 3000  Max: 720000)
> +	The interval at which the filesystem flushes metadata
> +	out to disk and runs internal cache cleanup routines.
> +
> +  fs.xfs.filestream_centisecs	(Min: 1  Default: 3000  Max: 360000)
> +	The interval at which the filesystem ages filestreams cache
> +	references and returns timed-out AGs back to the free stream
> +	pool.
> +
> +  fs.xfs.speculative_prealloc_lifetime
> +		(Units: seconds   Min: 1  Default: 300  Max: 86400)
> +	The interval at which the background scanning for inodes
> +	with unused speculative preallocation runs. The scan
> +	removes unused preallocation from clean inodes and releases
> +	the unused space back to the free pool.
> +
> +  fs.xfs.error_level		(Min: 0  Default: 3  Max: 11)
> +	A volume knob for error reporting when internal errors occur.
> +	This will generate detailed messages & backtraces for filesystem
> +	shutdowns, for example.  Current threshold values are:
> +
> +		XFS_ERRLEVEL_OFF:       0
> +		XFS_ERRLEVEL_LOW:       1
> +		XFS_ERRLEVEL_HIGH:      5
> +
> +  fs.xfs.panic_mask		(Min: 0  Default: 0  Max: 256)
> +	Causes certain error conditions to call BUG(). Value is a bitmask;
> +	OR together the tags which represent errors which should cause panics:
> +
> +		XFS_NO_PTAG                     0
> +		XFS_PTAG_IFLUSH                 0x00000001
> +		XFS_PTAG_LOGRES                 0x00000002
> +		XFS_PTAG_AILDELETE              0x00000004
> +		XFS_PTAG_ERROR_REPORT           0x00000008
> +		XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
> +		XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
> +		XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
> +		XFS_PTAG_FSBLOCK_ZERO           0x00000080
> +		XFS_PTAG_VERIFIER_ERROR         0x00000100
> +
> +	This option is intended for debugging only.
> +
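
To illustrate "OR together the tags", here is a small sketch that
builds a panic_mask value from the tag table above.  The PanicTag class
and panic_mask() helper are invented for this example; only the tag
values come from the table.

```python
from enum import IntFlag


class PanicTag(IntFlag):
    """Tag values copied from the table above (XFS_PTAG_ prefix dropped)."""
    IFLUSH = 0x001
    LOGRES = 0x002
    AILDELETE = 0x004
    ERROR_REPORT = 0x008
    SHUTDOWN_CORRUPT = 0x010
    SHUTDOWN_IOERROR = 0x020
    SHUTDOWN_LOGERROR = 0x040
    FSBLOCK_ZERO = 0x080
    VERIFIER_ERROR = 0x100


def panic_mask(*tags: PanicTag) -> int:
    """OR the given tags into the integer to write to fs.xfs.panic_mask."""
    mask = 0
    for tag in tags:
        mask |= tag
    return int(mask)
```

For instance, panic_mask(PanicTag.SHUTDOWN_CORRUPT,
PanicTag.SHUTDOWN_IOERROR) gives 48 (0x30).  Note that if the "Max: 256"
shown above is enforced as written, VERIFIER_ERROR (0x100 = 256) could
not be combined with any other tag.
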
> +  fs.xfs.irix_symlink_mode	(Min: 0  Default: 0  Max: 1)
> +	Controls whether symlinks are created with mode 0777 (default)
> +	or whether their mode is affected by the umask (irix mode).
> +
> +  fs.xfs.irix_sgid_inherit	(Min: 0  Default: 0  Max: 1)
> +	Controls files created in SGID directories.
> +	If the group ID of the new file does not match the effective group
> +	ID or one of the supplementary group IDs of the parent dir, the
> +	ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
> +	is set.
> +
> +  fs.xfs.inherit_sync		(Min: 0  Default: 1  Max: 1)
> +	Setting this to "1" will cause the "sync" flag set
> +	by the **xfs_io(8)** chattr command on a directory to be
> +	inherited by files in that directory.
> +
> +  fs.xfs.inherit_nodump		(Min: 0  Default: 1  Max: 1)
> +	Setting this to "1" will cause the "nodump" flag set
> +	by the **xfs_io(8)** chattr command on a directory to be
> +	inherited by files in that directory.
> +
> +  fs.xfs.inherit_noatime	(Min: 0  Default: 1  Max: 1)
> +	Setting this to "1" will cause the "noatime" flag set
> +	by the **xfs_io(8)** chattr command on a directory to be
> +	inherited by files in that directory.
> +
> +  fs.xfs.inherit_nosymlinks	(Min: 0  Default: 1  Max: 1)
> +	Setting this to "1" will cause the "nosymlinks" flag set
> +	by the **xfs_io(8)** chattr command on a directory to be
> +	inherited by files in that directory.
> +
> +  fs.xfs.inherit_nodefrag	(Min: 0  Default: 1  Max: 1)
> +	Setting this to "1" will cause the "nodefrag" flag set
> +	by the **xfs_io(8)** chattr command on a directory to be
> +	inherited by files in that directory.
> +
> +  fs.xfs.rotorstep		(Min: 1  Default: 1  Max: 256)
> +	In "inode32" allocation mode, this option determines how many
> +	files the allocator attempts to allocate in the same allocation
> +	group before moving to the next allocation group.  The intent
> +	is to control the rate at which the allocator moves between
> +	allocation groups when allocating extents for new files.
> +
> +Deprecated Sysctls
> +==================
> +
> +None at present.
> +
> +
> +Removed Sysctls
> +===============
> +
> +  Name				Removed
> +  ----				-------
> +  fs.xfs.xfsbufd_centisec	v4.0
> +  fs.xfs.age_buffer_centisecs	v4.0
> +
> +
> +Error handling
> +==============
> +
> +XFS can act differently according to the type of error found during its
> +operation. The implementation introduces the following concepts to the error
> +handler:
> +
> + -failure speed:
> +	Defines how fast XFS should propagate an error upwards when a specific
> +	error is found during the filesystem operation. It can propagate
> +	immediately, after a defined number of retries, after a set time period,
> +	or simply retry forever.
> +
> + -error classes:
> +	Specifies the subsystem the error configuration will apply to, such as
> +	metadata IO or memory allocation. Different subsystems will have
> +	different error handlers for which behaviour can be configured.
> +
> + -error handlers:
> +	Defines the behavior for a specific error.
> +
> +The filesystem behavior during an error can be set via sysfs files. Each
> +error handler works independently - the first condition met by an error handler
> +for a specific class will cause the error to be propagated rather than reset and
> +retried.
> +
> +The action taken by the filesystem when the error is propagated is context
> +dependent - it may cause a shut down in the case of an unrecoverable error,
> +it may be reported back to userspace, or it may even be ignored because
> +there's nothing useful we can do with the error or anyone we can report it to
> +(e.g. during unmount).
> +
> +The configuration files are organized into the following hierarchy for each
> +mounted filesystem:
> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/
> +
> +Where:
> +  <dev>
> +	The short device name of the mounted filesystem. This is the same device
> +	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
> +
> +  <class>
> +	The subsystem the error configuration belongs to. As of 4.9, the defined
> +	classes are:
> +
> +		- "metadata": applies to metadata buffer write IO
> +
> +  <error>
> +	The individual error handler configurations.
> +
> +
> +Each filesystem has "global" error configuration options defined in their top
> +level directory:
> +
> +  /sys/fs/xfs/<dev>/error/
> +
> +  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
> +	Defines the filesystem error behavior at unmount time.
> +
> +	If set to a value of 1, XFS will override all other error configurations
> +	during unmount and replace them with "immediate fail" characteristics.
> +	i.e. no retries, no retry timeout. This will always allow unmount to
> +	succeed when there are persistent errors present.
> +
> +	If set to 0, the configured retry behaviour will continue until all
> +	retries and/or timeouts have been exhausted. This will delay unmount
> +	completion when there are persistent errors, and it may prevent the
> +	filesystem from ever unmounting fully in the case of "retry forever"
> +	handler configurations.
> +
> +	Note: there is no guarantee that fail_at_unmount can be set while an
> +	unmount is in progress. It is possible that the sysfs entries are
> +	removed by the unmounting filesystem before a "retry forever" error
> +	handler configuration causes unmount to hang, and hence the filesystem
> +	must be configured appropriately before unmount begins to prevent
> +	unmount hangs.
> +
> +Each filesystem has specific error class handlers that define the error
> +propagation behaviour for specific errors. There is also a "default" error
> +handler defined, which defines the behaviour for all errors that don't have
> +specific handlers defined. Where multiple retry constraints are configured for
> +a single error, the first retry configuration that expires will cause the error
> +to be propagated. The handler configurations are found in the directory:
> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/
> +
> +  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
> +	Defines the allowed number of retries of a specific error before
> +	the filesystem will propagate the error. The retry count for a given
> +	error context (e.g. a specific metadata buffer) is reset every time
> +	there is a successful completion of the operation.
> +
> +	Setting the value to "-1" will cause XFS to retry forever for this
> +	specific error.
> +
> +	Setting the value to "0" will cause XFS to fail immediately when the
> +	specific error is reported.
> +
> +	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
> +	operation "N" times before propagating the error.
> +
> +  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
> +	Define the amount of time (in seconds) that the filesystem is
> +	allowed to retry its operations when the specific error is
> +	found.
> +
> +	Setting the value to "-1" will allow XFS to retry forever for this
> +	specific error.
> +
> +	Setting the value to "0" will cause XFS to fail immediately when the
> +	specific error is reported.
> +
> +	Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
> +	operation for up to "N" seconds before propagating the error.
> +
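
The interaction of max_retries and retry_timeout_seconds ("the first
retry configuration that expires will cause the error to be
propagated") can be modelled roughly like this.  It is a sketch of the
documented semantics only, not the kernel's implementation, and the
function and parameter names are hypothetical.

```python
RETRY_FOREVER = -1  # documented meaning of "-1" for both settings


def should_propagate(retries_done: int, seconds_elapsed: float,
                     max_retries: int, retry_timeout_seconds: int) -> bool:
    """Return True once either configured limit has expired."""
    # A limit of -1 never expires; a limit of 0 expires immediately.
    retries_expired = (max_retries != RETRY_FOREVER
                       and retries_done >= max_retries)
    timeout_expired = (retry_timeout_seconds != RETRY_FOREVER
                       and seconds_elapsed >= retry_timeout_seconds)
    # Whichever limit expires first causes propagation.
    return retries_expired or timeout_expired
```

In this model, setting either value to "0" fails immediately regardless
of the other, and the error is retried forever only when both are "-1".
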
> +Note: The default behaviour for a specific error handler is dependent on both
> +the class and error context. For example, the default values for
> +"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
> +to "fail immediately" behaviour. This is done because ENODEV is a fatal,
> +unrecoverable error no matter how many times the metadata IO is retried.
> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
> deleted file mode 100644
> index a5cbb5e0e..000000000
> --- a/Documentation/filesystems/xfs.txt
> +++ /dev/null
> @@ -1,470 +0,0 @@
> -
> -The SGI XFS Filesystem
> -======================
> -
> -XFS is a high performance journaling filesystem which originated
> -on the SGI IRIX platform.  It is completely multi-threaded, can
> -support large files and large filesystems, extended attributes,
> -variable block sizes, is extent based, and makes extensive use of
> -Btrees (directories, extents, free space) to aid both performance
> -and scalability.
> -
> -Refer to the documentation at https://xfs.wiki.kernel.org/
> -for further details.  This implementation is on-disk compatible
> -with the IRIX version of XFS.
> -
> -
> -Mount Options
> -=============
> -
> -When mounting an XFS filesystem, the following options are accepted.
> -For boolean mount options, the names with the (*) suffix is the
> -default behaviour.
> -
> -  allocsize=size
> -	Sets the buffered I/O end-of-file preallocation size when
> -	doing delayed allocation writeout (default size is 64KiB).
> -	Valid values for this option are page size (typically 4KiB)
> -	through to 1GiB, inclusive, in power-of-2 increments.
> -
> -	The default behaviour is for dynamic end-of-file
> -	preallocation size, which uses a set of heuristics to
> -	optimise the preallocation size based on the current
> -	allocation patterns within the file and the access patterns
> -	to the file. Specifying a fixed allocsize value turns off
> -	the dynamic behaviour.
> -
> -  attr2
> -  noattr2
> -	The options enable/disable an "opportunistic" improvement to
> -	be made in the way inline extended attributes are stored
> -	on-disk.  When the new form is used for the first time when
> -	attr2 is selected (either when setting or removing extended
> -	attributes) the on-disk superblock feature bit field will be
> -	updated to reflect this format being in use.
> -
> -	The default behaviour is determined by the on-disk feature
> -	bit indicating that attr2 behaviour is active. If either
> -	mount option it set, then that becomes the new default used
> -	by the filesystem.
> -
> -	CRC enabled filesystems always use the attr2 format, and so
> -	will reject the noattr2 mount option if it is set.
> -
> -  discard
> -  nodiscard (*)
> -	Enable/disable the issuing of commands to let the block
> -	device reclaim space freed by the filesystem.  This is
> -	useful for SSD devices, thinly provisioned LUNs and virtual
> -	machine images, but may have a performance impact.
> -
> -	Note: It is currently recommended that you use the fstrim
> -	application to discard unused blocks rather than the discard
> -	mount option because the performance impact of this option
> -	is quite severe.
> -
> -  grpid/bsdgroups
> -  nogrpid/sysvgroups (*)
> -	These options define what group ID a newly created file
> -	gets.  When grpid is set, it takes the group ID of the
> -	directory in which it is created; otherwise it takes the
> -	fsgid of the current process, unless the directory has the
> -	setgid bit set, in which case it takes the gid from the
> -	parent directory, and also gets the setgid bit set if it is
> -	a directory itself.
> -
> -  filestreams
> -	Make the data allocator use the filestreams allocation mode
> -	across the entire filesystem rather than just on directories
> -	configured to use it.
> -
> -  ikeep
> -  noikeep (*)
> -	When ikeep is specified, XFS does not delete empty inode
> -	clusters and keeps them around on disk.  When noikeep is
> -	specified, empty inode clusters are returned to the free
> -	space pool.
> -
> -  inode32
> -  inode64 (*)
> -	When inode32 is specified, it indicates that XFS limits
> -	inode creation to locations which will not result in inode
> -	numbers with more than 32 bits of significance.
> -
> -	When inode64 is specified, it indicates that XFS is allowed
> -	to create inodes at any location in the filesystem,
> -	including those which will result in inode numbers occupying
> -	more than 32 bits of significance. 
> -
> -	inode32 is provided for backwards compatibility with older
> -	systems and applications, since 64 bits inode numbers might
> -	cause problems for some applications that cannot handle
> -	large inode numbers.  If applications are in use which do
> -	not handle inode numbers bigger than 32 bits, the inode32
> -	option should be specified.
> -
> -
> -  largeio
> -  nolargeio (*)
> -	If "nolargeio" is specified, the optimal I/O reported in
> -	st_blksize by stat(2) will be as small as possible to allow
> -	user applications to avoid inefficient read/modify/write
> -	I/O.  This is typically the page size of the machine, as
> -	this is the granularity of the page cache.
> -
> -	If "largeio" specified, a filesystem that was created with a
> -	"swidth" specified will return the "swidth" value (in bytes)
> -	in st_blksize. If the filesystem does not have a "swidth"
> -	specified but does specify an "allocsize" then "allocsize"
> -	(in bytes) will be returned instead. Otherwise the behaviour
> -	is the same as if "nolargeio" was specified.
> -
> -  logbufs=value
> -	Set the number of in-memory log buffers.  Valid numbers
> -	range from 2-8 inclusive.
> -
> -	The default value is 8 buffers.
> -
> -	If the memory cost of 8 log buffers is too high on small
> -	systems, then it may be reduced at some cost to performance
> -	on metadata intensive workloads. The logbsize option below
> -	controls the size of each buffer and so is also relevant to
> -	this case.
> -
> -  logbsize=value
> -	Set the size of each in-memory log buffer.  The size may be
> -	specified in bytes, or in kilobytes with a "k" suffix.
> -	Valid sizes for version 1 and version 2 logs are 16384 (16k)
> -	and 32768 (32k).  Valid sizes for version 2 logs also
> -	include 65536 (64k), 131072 (128k) and 262144 (256k). The
> -	logbsize must be an integer multiple of the log
> -	stripe unit configured at mkfs time.
> -
> -	The default value for for version 1 logs is 32768, while the
> -	default value for version 2 logs is MAX(32768, log_sunit).
> -
> -  logdev=device and rtdev=device
> -	Use an external log (metadata journal) and/or real-time device.
> -	An XFS filesystem has up to three parts: a data section, a log
> -	section, and a real-time section.  The real-time section is
> -	optional, and the log section can be separate from the data
> -	section or contained within it.
> -
> -  noalign
> -	Data allocations will not be aligned at stripe unit
> -	boundaries. This is only relevant to filesystems created
> -	with non-zero data alignment parameters (sunit, swidth) by
> -	mkfs.
> -
> -  norecovery
> -	The filesystem will be mounted without running log recovery.
> -	If the filesystem was not cleanly unmounted, it is likely to
> -	be inconsistent when mounted in "norecovery" mode.
> -	Some files or directories may not be accessible because of this.
> -	Filesystems mounted "norecovery" must be mounted read-only or
> -	the mount will fail.
> -
> -  nouuid
> -	Don't check for double mounted file systems using the file
> -	system uuid.  This is useful to mount LVM snapshot volumes,
> -	and often used in combination with "norecovery" for mounting
> -	read-only snapshots.
> -
> -  noquota
> -	Forcibly turns off all quota accounting and enforcement
> -	within the filesystem.
> -
> -  uquota/usrquota/uqnoenforce/quota
> -	User disk quota accounting enabled, and limits (optionally)
> -	enforced.  Refer to xfs_quota(8) for further details.
> -
> -  gquota/grpquota/gqnoenforce
> -	Group disk quota accounting enabled and limits (optionally)
> -	enforced.  Refer to xfs_quota(8) for further details.
> -
> -  pquota/prjquota/pqnoenforce
> -	Project disk quota accounting enabled and limits (optionally)
> -	enforced.  Refer to xfs_quota(8) for further details.
> -
> -  sunit=value and swidth=value
> -	Used to specify the stripe unit and width for a RAID device
> -	or a stripe volume.  "value" must be specified in 512-byte
> -	block units. These options are only relevant to filesystems
> -	that were created with non-zero data alignment parameters.
> -
> -	The sunit and swidth parameters specified must be compatible
> -	with the existing filesystem alignment characteristics.  In
> -	general, that means the only valid changes to sunit are
> -	increasing it by a power-of-2 multiple. Valid swidth values
> -	are any integer multiple of a valid sunit value.
> -
> -	Typically the only time these mount options are necessary if
> -	after an underlying RAID device has had it's geometry
> -	modified, such as adding a new disk to a RAID5 lun and
> -	reshaping it.
> -
> -  swalloc
> -	Data allocations will be rounded up to stripe width boundaries
> -	when the current end of file is being extended and the file
> -	size is larger than the stripe width size.
> -
> -  wsync
> -	When specified, all filesystem namespace operations are
> -	executed synchronously. This ensures that when the namespace
> -	operation (create, unlink, etc) completes, the change to the
> -	namespace is on stable storage. This is useful in HA setups
> -	where failover must not result in clients seeing
> -	inconsistent namespace presentation during or after a
> -	failover event.
> -
> -
> -Deprecated Mount Options
> -========================
> -
> -  Name				Removal Schedule
> -  ----				----------------
> -
> -
> -Removed Mount Options
> -=====================
> -
> -  Name				Removed
> -  ----				-------
> -  delaylog/nodelaylog		v4.0
> -  ihashsize			v4.0
> -  irixsgid			v4.0
> -  osyncisdsync/osyncisosync	v4.0
> -  barrier			v4.19
> -  nobarrier			v4.19
> -
> -
> -sysctls
> -=======
> -
> -The following sysctls are available for the XFS filesystem:
> -
> -  fs.xfs.stats_clear		(Min: 0  Default: 0  Max: 1)
> -	Setting this to "1" clears accumulated XFS statistics
> -	in /proc/fs/xfs/stat.  It then immediately resets to "0".
> -
> -  fs.xfs.xfssyncd_centisecs	(Min: 100  Default: 3000  Max: 720000)
> -	The interval at which the filesystem flushes metadata
> -	out to disk and runs internal cache cleanup routines.
> -
> -  fs.xfs.filestream_centisecs	(Min: 1  Default: 3000  Max: 360000)
> -	The interval at which the filesystem ages filestreams cache
> -	references and returns timed-out AGs back to the free stream
> -	pool.
> -
> -  fs.xfs.speculative_prealloc_lifetime
> -		(Units: seconds   Min: 1  Default: 300  Max: 86400)
> -	The interval at which the background scanning for inodes
> -	with unused speculative preallocation runs. The scan
> -	removes unused preallocation from clean inodes and releases
> -	the unused space back to the free pool.
> -
> -  fs.xfs.error_level		(Min: 0  Default: 3  Max: 11)
> -	A volume knob for error reporting when internal errors occur.
> -	This will generate detailed messages & backtraces for filesystem
> -	shutdowns, for example.  Current threshold values are:
> -
> -		XFS_ERRLEVEL_OFF:       0
> -		XFS_ERRLEVEL_LOW:       1
> -		XFS_ERRLEVEL_HIGH:      5
> -
> -  fs.xfs.panic_mask		(Min: 0  Default: 0  Max: 256)
> -	Causes certain error conditions to call BUG(). Value is a bitmask;
> -	OR together the tags which represent errors which should cause panics:
> -
> -		XFS_NO_PTAG                     0
> -		XFS_PTAG_IFLUSH                 0x00000001
> -		XFS_PTAG_LOGRES                 0x00000002
> -		XFS_PTAG_AILDELETE              0x00000004
> -		XFS_PTAG_ERROR_REPORT           0x00000008
> -		XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
> -		XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
> -		XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
> -		XFS_PTAG_FSBLOCK_ZERO           0x00000080
> -		XFS_PTAG_VERIFIER_ERROR         0x00000100
> -
> -	This option is intended for debugging only.
> -
> -  fs.xfs.irix_symlink_mode	(Min: 0  Default: 0  Max: 1)
> -	Controls whether symlinks are created with mode 0777 (default)
> -	or whether their mode is affected by the umask (irix mode).
> -
> -  fs.xfs.irix_sgid_inherit	(Min: 0  Default: 0  Max: 1)
> -	Controls files created in SGID directories.
> -	If the group ID of the new file does not match the effective group
> -	ID or one of the supplementary group IDs of the parent dir, the
> -	ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
> -	is set.
> -
> -  fs.xfs.inherit_sync		(Min: 0  Default: 1  Max: 1)
> -	Setting this to "1" will cause the "sync" flag set
> -	by the xfs_io(8) chattr command on a directory to be
> -	inherited by files in that directory.
> -
> -  fs.xfs.inherit_nodump		(Min: 0  Default: 1  Max: 1)
> -	Setting this to "1" will cause the "nodump" flag set
> -	by the xfs_io(8) chattr command on a directory to be
> -	inherited by files in that directory.
> -
> -  fs.xfs.inherit_noatime	(Min: 0  Default: 1  Max: 1)
> -	Setting this to "1" will cause the "noatime" flag set
> -	by the xfs_io(8) chattr command on a directory to be
> -	inherited by files in that directory.
> -
> -  fs.xfs.inherit_nosymlinks	(Min: 0  Default: 1  Max: 1)
> -	Setting this to "1" will cause the "nosymlinks" flag set
> -	by the xfs_io(8) chattr command on a directory to be
> -	inherited by files in that directory.
> -
> -  fs.xfs.inherit_nodefrag	(Min: 0  Default: 1  Max: 1)
> -	Setting this to "1" will cause the "nodefrag" flag set
> -	by the xfs_io(8) chattr command on a directory to be
> -	inherited by files in that directory.
> -
> -  fs.xfs.rotorstep		(Min: 1  Default: 1  Max: 256)
> -	In "inode32" allocation mode, this option determines how many
> -	files the allocator attempts to allocate in the same allocation
> -	group before moving to the next allocation group.  The intent
> -	is to control the rate at which the allocator moves between
> -	allocation groups when allocating extents for new files.
> -
> -Deprecated Sysctls
> -==================
> -
> -None at present.
> -
> -
> -Removed Sysctls
> -===============
> -
> -  Name				Removed
> -  ----				-------
> -  fs.xfs.xfsbufd_centisec	v4.0
> -  fs.xfs.age_buffer_centisecs	v4.0
> -
> -
> -Error handling
> -==============
> -
> -XFS can act differently according to the type of error found during its
> -operation. The implementation introduces the following concepts to the error
> -handler:
> -
> - -failure speed:
> -	Defines how fast XFS should propagate an error upwards when a specific
> -	error is found during the filesystem operation. It can propagate
> -	immediately, after a defined number of retries, after a set time period,
> -	or simply retry forever.
> -
> - -error classes:
> -	Specifies the subsystem the error configuration will apply to, such as
> -	metadata IO or memory allocation. Different subsystems will have
> -	different error handlers for which behaviour can be configured.
> -
> - -error handlers:
> -	Defines the behavior for a specific error.
> -
> -The filesystem behavior during an error can be set via sysfs files. Each
> -error handler works independently - the first condition met by an error handler
> -for a specific class will cause the error to be propagated rather than reset and
> -retried.
> -
> -The action taken by the filesystem when the error is propagated is context
> -dependent - it may cause a shut down in the case of an unrecoverable error,
> -it may be reported back to userspace, or it may even be ignored because
> -there's nothing useful we can with the error or anyone we can report it to (e.g.
> -during unmount).
> -
> -The configuration files are organized into the following hierarchy for each
> -mounted filesystem:
> -
> -  /sys/fs/xfs/<dev>/error/<class>/<error>/
> -
> -Where:
> -  <dev>
> -	The short device name of the mounted filesystem. This is the same device
> -	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
> -
> -  <class>
> -	The subsystem the error configuration belongs to. As of 4.9, the defined
> -	classes are:
> -
> -		- "metadata": applies metadata buffer write IO
> -
> -  <error>
> -	The individual error handler configurations.
> -
> -
> -Each filesystem has "global" error configuration options defined in their top
> -level directory:
> -
> -  /sys/fs/xfs/<dev>/error/
> -
> -  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
> -	Defines the filesystem error behavior at unmount time.
> -
> -	If set to a value of 1, XFS will override all other error configurations
> -	during unmount and replace them with "immediate fail" characteristics.
> -	i.e. no retries, no retry timeout. This will always allow unmount to
> -	succeed when there are persistent errors present.
> -
> -	If set to 0, the configured retry behaviour will continue until all
> -	retries and/or timeouts have been exhausted. This will delay unmount
> -	completion when there are persistent errors, and it may prevent the
> -	filesystem from ever unmounting fully in the case of "retry forever"
> -	handler configurations.
> -
> -	Note: there is no guarantee that fail_at_unmount can be set while an
> -	unmount is in progress. It is possible that the sysfs entries are
> -	removed by the unmounting filesystem before a "retry forever" error
> -	handler configuration causes unmount to hang, and hence the filesystem
> -	must be configured appropriately before unmount begins to prevent
> -	unmount hangs.
> -
> -Each filesystem has specific error class handlers that define the error
> -propagation behaviour for specific errors. There is also a "default" error
> -handler defined, which defines the behaviour for all errors that don't have
> -specific handlers defined. Where multiple retry constraints are configuredi for
> -a single error, the first retry configuration that expires will cause the error
> -to be propagated. The handler configurations are found in the directory:
> -
> -  /sys/fs/xfs/<dev>/error/<class>/<error>/
> -
> -  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
> -	Defines the allowed number of retries of a specific error before
> -	the filesystem will propagate the error. The retry count for a given
> -	error context (e.g. a specific metadata buffer) is reset every time
> -	there is a successful completion of the operation.
> -
> -	Setting the value to "-1" will cause XFS to retry forever for this
> -	specific error.
> -
> -	Setting the value to "0" will cause XFS to fail immediately when the
> -	specific error is reported.
> -
> -	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
> -	operation "N" times before propagating the error.
> -
> -  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
> -	Define the amount of time (in seconds) that the filesystem is
> -	allowed to retry its operations when the specific error is
> -	found.
> -
> -	Setting the value to "-1" will allow XFS to retry forever for this
> -	specific error.
> -
> -	Setting the value to "0" will cause XFS to fail immediately when the
> -	specific error is reported.
> -
> -	Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
> -	operation for up to "N" seconds before propagating the error.
> -
> -Note: The default behaviour for a specific error handler is dependent on both
> -the class and error context. For example, the default values for
> -"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
> -to "fail immediately" behaviour. This is done because ENODEV is a fatal,
> -unrecoverable error no matter how many times the metadata IO is retried.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d0ed73599..66e972e9a 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -17364,7 +17364,7 @@ L:	linux-xfs@vger.kernel.org
>  W:	http://xfs.org/
>  T:	git git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git
>  S:	Supported
> -F:	Documentation/filesystems/xfs.txt
> +F:	Documentation/filesystems/xfs.rst
>  F:	fs/xfs/
>  
>  XILINX AXI ETHERNET DRIVER
> -- 
> 2.22.0
> 


* Re: [linux-kernel-mentees] [PATCH v5] Doc : fs : convert xfs.txt to ReST
From: Darrick J. Wong @ 2019-07-02 15:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sheriff Esseson, skhan, linux-xfs, corbet, linux-doc,
	linux-kernel, linux-kernel-mentees
In-Reply-To: <20190702150452.GD1729@bombadil.infradead.org>

On Tue, Jul 02, 2019 at 08:04:52AM -0700, Matthew Wilcox wrote:
> On Tue, Jul 02, 2019 at 01:30:40PM +0100, Sheriff Esseson wrote:
> > +When mounting an XFS filesystem, the following options are accepted.  For
> > +boolean mount options, the names with the "(*)" prefix is the default behaviour.
> > +For example, take a behaviour enabled by default to be a one (1) or, a zero (0)
> > +otherwise, ``(*)[no]default`` would be 0 while ``[no](*)default`` , a 1.
> > -When mounting an XFS filesystem, the following options are accepted.
> > -For boolean mount options, the names with the (*) suffix is the
> > -default behaviour.
> 
> You seem to have reflowed all the text.  That means git no longer notices
> it's a rename, and quite frankly the shorter lines that were in use were
> better.

Agreed.  Please don't reflow text in a format conversion patch; it makes
it very difficult to figure out which changes were to accommodate rst.

If you want to reflow text (because of line length etc.) please do it as
a second patch.  I'd rather break the 80 column rule for a single commit
if it makes reviewing easy on the eyes.

> This is not an improvement; please undo it in the next version
> (which you should not post for several days to accumulate more feedback).

Seconded.  Thank you for sending v5 as a separate patch, though. :)

--D


* Re: [linux-kernel-mentees] [PATCH v5] Doc : fs : convert xfs.txt to ReST
From: Matthew Wilcox @ 2019-07-02 15:15 UTC (permalink / raw)
  To: Sheriff Esseson
  Cc: skhan, darrick.wong, linux-xfs, corbet, linux-doc, linux-kernel,
	linux-kernel-mentees
In-Reply-To: <20190702123040.GA30111@localhost>

On Tue, Jul 02, 2019 at 01:30:40PM +0100, Sheriff Esseson wrote:
> +++ b/Documentation/filesystems/index.rst
> @@ -40,4 +40,5 @@ Documentation for individual filesystem types can be found here.
>  .. toctree::
>     :maxdepth: 2
>  
> -   binderfs.rst
> +   binderfs
> +   xfs

I don't think this makes sense.  Look:

Kernel API documentation
------------------------
...
   filesystems/index

but the contents of xfs.rst are not kernel API documentation.  We have
precedent in Documentation/index.rst for:

Filesystem Documentation
------------------------

The documentation in this section are provided by specific filesystem
subprojects.

.. toctree::
   :maxdepth: 2

   filesystems/ext4/index

but that looks more like the xfs-delayed-logging-design.txt and
xfs-self-describing-metadata.txt files.

I think Documentation/filesystems/xfs.rst should be part of the
admin guide.  Fortunately, ext4 has again led the way here, and
Documentation/admin-guide/ext4.rst already exists.


* Re: [linux-kernel-mentees] [PATCH v5] Doc : fs : convert xfs.txt to ReST
From: Matthew Wilcox @ 2019-07-02 15:04 UTC (permalink / raw)
  To: Sheriff Esseson
  Cc: skhan, darrick.wong, linux-xfs, corbet, linux-doc, linux-kernel,
	linux-kernel-mentees
In-Reply-To: <20190702123040.GA30111@localhost>

On Tue, Jul 02, 2019 at 01:30:40PM +0100, Sheriff Esseson wrote:
> +When mounting an XFS filesystem, the following options are accepted.  For
> +boolean mount options, the names with the "(*)" prefix is the default behaviour.
> +For example, take a behaviour enabled by default to be a one (1) or, a zero (0)
> +otherwise, ``(*)[no]default`` would be 0 while ``[no](*)default`` , a 1.
> -When mounting an XFS filesystem, the following options are accepted.
> -For boolean mount options, the names with the (*) suffix is the
> -default behaviour.

You seem to have reflowed all the text.  That means git no longer notices
it's a rename, and quite frankly the shorter lines that were in use were
better.  This is not an improvement; please undo it in the next version
(which you should not post for several days to accumulate more feedback).


* Re: [PATCH v7 1/2] fTPM: firmware TPM running in TEE
From: Ilias Apalodimas @ 2019-07-02 14:21 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Sasha Levin, peterhuewe, jgg, corbet, linux-kernel, linux-doc,
	linux-integrity, linux-kernel, thiruan, bryankel, tee-dev,
	sumit.garg, rdunlap
In-Reply-To: <0893dc429d4c3f3b52d423f9e61c08a5012a7519.camel@linux.intel.com>

Hi,

> On Thu, 2019-06-27 at 16:30 +0300, Ilias Apalodimas wrote:
> > is really useful. I don't have hardware to test this at the moment, but once i
> > get it, i'll give it a spin.
> 
> Thank you for responding, really appreciate it.
> 
No worries
> Please note, however, that I already did my v5.3 PR so there is a lot of
> time to give it a spin. In all cases, we will find a way to put this to
> my v5.4 PR. I don't see any reason why not.
> 
> As soon as the cosmetic stuff is fixed that I remarked in v7 I'm ready
> to take this to my tree and after that soonish make it available on
> linux-next.
I managed to do some quick testing in QEMU.
Everything works fine when I build this as a module (using IBM's TPM 2.0 TSS)

- As module
# insmod /lib/modules/5.2.0-rc1/kernel/drivers/char/tpm/tpm_ftpm_tee.ko
# getrandom -by 8
randomBytes length 8
23 b9 3d c3 90 13 d9 6b 

- Built-in
# dmesg | grep optee
ftpm-tee firmware:optee: ftpm_tee_probe:tee_client_open_session failed,
err=ffff0008
ftpm-tee: probe of firmware:optee failed with error -22
# getrandom -by 8
random: fast init done
urandom_read: 2 callbacks suppressed
random: getrandom: uninitialized urandom read (32 bytes read)
TSS_Dev_Open: Error opening /dev/tpm0
getrandom: failed, rc 000b0008
TSS_RC_NO_CONNECTION - Failure connecting to lower layer

Am I missing anything?

Thanks
/Ilias


* Re: [RFC PATCH] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Dave Martin @ 2019-07-02 14:20 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue
In-Reply-To: <20190628172203.797-1-yu-cheng.yu@intel.com>

On Fri, Jun 28, 2019 at 10:22:03AM -0700, Yu-cheng Yu wrote:
> This patch was part of the Intel Control-flow Enforcement (CET) series at:
> 
>     https://lkml.org/lkml/2019/6/6/1014.
> 
> In the discussion, we decided to look at only an ELF header's
> PT_GNU_PROPERTY, which is a shortcut pointing to the file's
> .note.gnu.property.
> 
> The Linux gABI extension draft is here:
> 
>     https://github.com/hjl-tools/linux-abi/wiki/linux-abi-draft.pdf.
> 
> A few existing CET-enabled binary files were built without
> PT_GNU_PROPERTY; but those files' .note.gnu.property are checked by
> ld-linux, not Linux.  The compatibility impact from this change is
> therefore managable.

That's convenient :)

> An ELF file's .note.gnu.property indicates features the executable file
> can support.  For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
> indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
> GNU_PROPERTY_X86_FEATURE_1_SHSTK.
> 
> With this patch, if an arch needs to setup features from ELF properties,
> it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and specific
> arch_parse_property() and arch_setup_property().
> 
> This work is derived from code provided by H.J. Lu <hjl.tools@gmail.com>.

Thanks for reworking this ... comments below.

> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  fs/Kconfig.binfmt        |   3 +
>  fs/Makefile              |   1 +
>  fs/binfmt_elf.c          |  20 +++
>  fs/gnu_property.c        | 279 +++++++++++++++++++++++++++++++++++++++
>  include/linux/elf.h      |  11 ++
>  include/uapi/linux/elf.h |  14 ++
>  6 files changed, 328 insertions(+)
>  create mode 100644 fs/gnu_property.c
> 
> diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
> index f87ddd1b6d72..397138ab305b 100644
> --- a/fs/Kconfig.binfmt
> +++ b/fs/Kconfig.binfmt
> @@ -36,6 +36,9 @@ config COMPAT_BINFMT_ELF
>  config ARCH_BINFMT_ELF_STATE
>  	bool
>  
> +config ARCH_USE_GNU_PROPERTY
> +	bool
> +
>  config BINFMT_ELF_FDPIC
>  	bool "Kernel support for FDPIC ELF binaries"
>  	default y if !BINFMT_ELF
> diff --git a/fs/Makefile b/fs/Makefile
> index c9aea23aba56..b69f18c14e09 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -44,6 +44,7 @@ obj-$(CONFIG_BINFMT_ELF)	+= binfmt_elf.o
>  obj-$(CONFIG_COMPAT_BINFMT_ELF)	+= compat_binfmt_elf.o
>  obj-$(CONFIG_BINFMT_ELF_FDPIC)	+= binfmt_elf_fdpic.o
>  obj-$(CONFIG_BINFMT_FLAT)	+= binfmt_flat.o
> +obj-$(CONFIG_ARCH_USE_GNU_PROPERTY) += gnu_property.o
>  
>  obj-$(CONFIG_FS_MBCACHE)	+= mbcache.o
>  obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.o
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 8264b468f283..cbc6d68f4a18 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -852,6 +852,21 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  			}
>  	}
>  
> +	if (interpreter) {
> +		retval = arch_parse_property(&loc->interp_elf_ex,
> +					     interp_elf_phdata,
> +					     interpreter, true,
> +					     &arch_state);
> +	} else {
> +		retval = arch_parse_property(&loc->elf_ex,
> +					     elf_phdata,
> +					     bprm->file, false,
> +					     &arch_state);
> +	}
> +
> +	if (retval)
> +		goto out_free_dentry;
> +
>  	/*
>  	 * Allow arch code to reject the ELF at this point, whilst it's
>  	 * still possible to return an error to the code that invoked
> @@ -1080,6 +1095,11 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  		goto out_free_dentry;
>  	}
>  
> +	retval = arch_setup_property(&arch_state);
> +
> +	if (retval < 0)
> +		goto out_free_dentry;
> +
>  	if (interpreter) {
>  		unsigned long interp_map_addr = 0;
>  
> diff --git a/fs/gnu_property.c b/fs/gnu_property.c
> new file mode 100644
> index 000000000000..37cd503a0c48
> --- /dev/null
> +++ b/fs/gnu_property.c
> @@ -0,0 +1,279 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Extract an ELF file's .note.gnu.property.
> + *
> + * The path from the ELF header to the note section is the following:
> + * elfhdr->elf_phdr->elf_note->property[].
> + */
> +
> +#include <uapi/linux/elf-em.h>
> +#include <linux/processor.h>
> +#include <linux/binfmts.h>
> +#include <linux/elf.h>
> +#include <linux/slab.h>
> +#include <linux/fs.h>
> +#include <linux/uaccess.h>
> +#include <linux/string.h>
> +#include <linux/compat.h>
> +
> +/*
> + * The .note.gnu.property layout:
> + *
> + *	struct elf_note {
> + *		u32 n_namesz; --> sizeof(n_name[]); always (4)
> + *		u32 n_ndescsz;--> sizeof(property[])
> + *		u32 n_type;   --> always NT_GNU_PROPERTY_TYPE_0 (5)
> + *	};
> + *	char n_name[4]; --> always 'GNU\0'
> + *
> + *	struct {
> + *		struct gnu_property {
> + *			u32 pr_type;
> + *			u32 pr_datasz;
> + *		};
> + *		u8 pr_data[pr_datasz];
> + *	}[];
> + */
> +
> +typedef bool (test_item_fn)(void *buf, u32 *arg, u32 type);
> +typedef void *(next_item_fn)(void *buf, u32 *arg, u32 type);
> +
> +static bool test_property(void *buf, u32 *max_type, u32 pr_type)
> +{
> +	struct gnu_property *pr = buf;
> +
> +	/*
> +	 * Property types must be in ascending order.
> +	 * Keep track of the max when testing each.
> +	 */
> +	if (pr->pr_type > *max_type)
> +		*max_type = pr->pr_type;
> +
> +	return (pr->pr_type == pr_type);
> +}
> +
> +static void *next_property(void *buf, u32 *max_type, u32 pr_type)
> +{
> +	struct gnu_property *pr = buf;
> +
> +	if ((buf + sizeof(*pr) + pr->pr_datasz < buf) ||
> +	    (pr->pr_type > pr_type) ||
> +	    (pr->pr_type > *max_type))
> +		return NULL;
> +	else
> +		return (buf + sizeof(*pr) + pr->pr_datasz);
> +}
> +
> +/*
> + * Scan 'buf' for a pattern; return true if found.
> + * *pos is the distance from the beginning of buf to where
> + * the searched item or the next item is located.
> + */
> +static int scan(u8 *buf, u32 buf_size, int item_size, test_item_fn test_item,
> +		next_item_fn next_item, u32 *arg, u32 type, u32 *pos)
> +{
> +	int found = 0;
> +	u8 *p, *max;
> +
> +	max = buf + buf_size;
> +	if (max < buf)
> +		return 0;
> +
> +	p = buf;
> +
> +	while ((p + item_size < max) && (p + item_size > buf)) {
> +		if (test_item(p, arg, type)) {
> +			found = 1;
> +			break;
> +		}
> +
> +		p = next_item(p, arg, type);
> +	}
> +
> +	*pos = (p + item_size <= buf) ? 0 : (u32)(p - buf);
> +	return found;
> +}
> +
> +/*
> + * Search an NT_GNU_PROPERTY_TYPE_0 for the property of 'pr_type'.
> + */
> +static int find_property(u32 pr_type, u32 *property, struct file *file,
> +			 loff_t file_offset, unsigned long desc_size)
> +{
> +	u8 *buf;
> +	int buf_size;
> +
> +	u32 buf_pos;
> +	unsigned long read_size;
> +	unsigned long done;
> +	int found = 0;
> +	int ret = 0;
> +	u32 last_pr = 0;
> +
> +	*property = 0;
> +	buf_pos = 0;
> +
> +	buf_size = (desc_size > PAGE_SIZE) ? PAGE_SIZE : desc_size;
> +	buf = kmalloc(buf_size, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	for (done = 0; done < desc_size; done += buf_pos) {
> +		read_size = desc_size - done;
> +		if (read_size > buf_size)
> +			read_size = buf_size;
> +
> +		ret = kernel_read(file, buf, read_size, &file_offset);

This can be simpler if we just read the whole PT_GNU_PROPERTY segment
beforehand.

We should set some sanity limit on the size we accept, but I don't think
it's realistically going to be very big.

> +
> +		if (ret != read_size)
> +			return (ret < 0) ? ret : -EIO;
> +
> +		ret = 0;
> +		found = scan(buf, read_size, sizeof(struct gnu_property),
> +			     test_property, next_property,
> +			     &last_pr, pr_type, &buf_pos);
> +
> +		if ((!buf_pos) || found)
> +			break;
> +
> +		file_offset += buf_pos - read_size;
> +	}
> +
> +	if (found) {
> +		struct gnu_property *pr =
> +			(struct gnu_property *)(buf + buf_pos);
> +
> +		if (pr->pr_datasz == 4) {
> +			u32 *max =  (u32 *)(buf + read_size);
> +			u32 *data = (u32 *)((u8 *)pr + sizeof(*pr));
> +
> +			if (data + 1 <= max) {
> +				*property = *data;
> +			} else {
> +				file_offset += buf_pos - read_size;
> +				file_offset += sizeof(*pr);
> +				ret = kernel_read(file, property, 4,
> +						  &file_offset);
> +			}
> +		}
> +	}
> +
> +	kfree(buf);
> +	return ret;
> +}
> +
> +/*
> + * Look at an ELF file's PT_GNU_PROPERTY for the property of pr_type.
> + *
> + * Input:
> + *	file: the file to search;
> + *	phdr: the file's elf header;
> + *	phnum: number of entries in phdr;
> + *	pr_type: the property type.
> + *
> + * Output:
> + *	The property found.
> + *
> + * Return:
> + *	Zero or error.
> + */
> +
> +static int scan_segments_64(struct file *file, struct elf64_phdr *phdr,
> +			    int phnum, u32 pr_type, u32 *property)
> +{
> +	int i, err;
> +
> +	err = 0;
> +
> +	for (i = 0; i < phnum; i++, phdr++) {
> +		if (phdr->p_align != 8)
> +			continue;
> +
> +		if (phdr->p_type == PT_GNU_PROPERTY) {
> +			struct elf64_note n;
> +			loff_t pos;
> +
> +			/* read note header */
> +			pos = phdr->p_offset;
> +			err = kernel_read(file, &n, sizeof(n), &pos);
> +			if (err < sizeof(n))
> +				return -EIO;

Should we check n_type and n_name?

Maybe we don't need to bother if we trust the tools not to put garbage
on PT_GNU_PROPERTY.  I'm a little concerned that hjl's spec is pretty
vague on what the PT_GNU_PROPERTY segment should contain (even if it's
"obvious").

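If we do decide to check, something along these lines might be enough.
This is only a userspace sketch to show the shape of the test: struct
note_hdr and is_gnu_property_note() are made-up names, not anything in
the patch, and a kernel version would use the existing elf_note types
instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative userspace mirror of the 12-byte ELF note header. */
struct note_hdr {
	uint32_t n_namesz;
	uint32_t n_descsz;
	uint32_t n_type;
};

#define NOTE_NAME_GNU		"GNU"	/* 4 bytes including the NUL */
#define NT_GNU_PROPERTY_TYPE_0	5

/*
 * Return true if 'buf' (of 'len' bytes) starts with a note header that
 * identifies a GNU property note.  Hypothetical helper, not part of the
 * patch under review.
 */
static bool is_gnu_property_note(const void *buf, size_t len)
{
	struct note_hdr n;

	assert(buf != NULL);

	/* Need the header plus the 4-byte name that follows it. */
	if (len < sizeof(n) + sizeof(NOTE_NAME_GNU))
		return false;

	memcpy(&n, buf, sizeof(n));

	if (n.n_type != NT_GNU_PROPERTY_TYPE_0)
		return false;
	if (n.n_namesz != sizeof(NOTE_NAME_GNU))
		return false;

	return memcmp((const char *)buf + sizeof(n), NOTE_NAME_GNU,
		      sizeof(NOTE_NAME_GNU)) == 0;
}
```

The point is just that the check is cheap once the header has been read
anyway, so rejecting a mislabelled note costs us very little.
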
> +
> +			/* find note payload offset */
> +			pos = phdr->p_offset + round_up(sizeof(n) + n.n_namesz,
> +							phdr->p_align);
> +
> +			err = find_property(pr_type, property, file,
> +					    pos, n.n_descsz);
> +			break;
> +		}
> +	}
> +
> +	return err;
> +}
> +
> +static int scan_segments_32(struct file *file, struct elf32_phdr *phdr,
> +			    int phnum, u32 pr_type, u32 *property)
> +{
> +	int i, err;
> +
> +	err = 0;
> +
> +	for (i = 0; i < phnum; i++, phdr++) {
> +		if (phdr->p_align != 4)
> +			continue;
> +
> +		if (phdr->p_type == PT_GNU_PROPERTY) {

I wonder whether we should stick a printk_once here, along the lines of
"malformed PT_GNU_PROPERTY note ignored, go fix your toolchain".

Otherwise, maybe we don't need to bother to check this at all: if the
toolchain generates bad binaries, it's arguably not our problem?

(For example, we don't even bother to check that e_ident[EI_DATA]
matches the host endianness..., and we don't look at e_ident[EI_VERSION]
etc.)

		if (phdr->p_type != PT_GNU_PROPERTY)
			continue;

		if (phdr->p_align != 4) {
			/* complaining printk */
			break;
		}

		/* handle PT_GNU_PROPERTY */

> +			struct elf32_note n;
> +			loff_t pos;
> +
> +			/* read note header */
> +			pos = phdr->p_offset;
> +			err = kernel_read(file, &n, sizeof(n), &pos);

Would it be simpler just to load the whole segment using phdr->p_memsz?
This would allow us to do just a single kernel_read()?

> +			if (err < sizeof(n))
> +				return -EIO;
> +
> +			/* find note payload offset */
> +			pos = phdr->p_offset + round_up(sizeof(n) + n.n_namesz,
> +							phdr->p_align);
> +
> +			err = find_property(pr_type, property, file,
> +					    pos, n.n_descsz);
> +			break;
> +		}
> +	}
> +
> +	return err;
> +}

These two functions look the same except for trivial details.

Can we pass in a pointer to the ELF header, and a void * or union
pointer for the phdrs?  We already do those tricks for calling
get_gnu_property() anyway.

> +
> +int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
> +		     u32 pr_type, u32 *property)

Do we have to call this every time we want to fetch a property?

This will be costly if there are several properties we want to
look at.  I can also imagine that some properties will be generic
while others are arch-specific.

So, if the arch or generic code wants properties, we call this
from the generic code, and call out to arch and generic hooks to
handle any properties found.  That way we would only need to do
this scan once.
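
To make that concrete, here is a rough userspace sketch of a single-pass
walk over a descriptor that has already been read into memory in full.
All the names (walk_properties(), property_fn, find_u32_hook()) are
hypothetical, and I am assuming each property is padded to 4 bytes for
ELFCLASS32 and 8 bytes for ELFCLASS64:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative userspace mirror of the property header in the patch. */
struct gnu_property {
	uint32_t pr_type;
	uint32_t pr_datasz;
};

/* Hypothetical hook type: return nonzero to abort the walk. */
typedef int (*property_fn)(uint32_t type, const uint8_t *data,
			   uint32_t datasz, void *arg);

/*
 * Call 'hook' once for every property in 'desc'.  Returns 0 on success,
 * -1 if the descriptor is malformed, or the hook's nonzero return value.
 */
static int walk_properties(const uint8_t *desc, size_t desc_size,
			   size_t align, property_fn hook, void *arg)
{
	size_t off = 0;

	assert(align == 4 || align == 8);

	while (off < desc_size) {
		struct gnu_property pr;
		size_t pad;
		int ret;

		if (desc_size - off < sizeof(pr))
			return -1;
		memcpy(&pr, desc + off, sizeof(pr));
		off += sizeof(pr);

		/* The payload must fit in what is left of the buffer. */
		if (pr.pr_datasz > desc_size - off)
			return -1;

		ret = hook(pr.pr_type, desc + off, pr.pr_datasz, arg);
		if (ret)
			return ret;
		off += pr.pr_datasz;

		/* Skip padding; tolerate it being absent after the last one. */
		pad = (align - pr.pr_datasz % align) % align;
		if (pad > desc_size - off)
			pad = desc_size - off;
		off += pad;
	}

	return 0;
}

/* Example hook: remember the value of one 4-byte property. */
struct find_u32 {
	uint32_t type;
	uint32_t value;
	int found;
};

static int find_u32_hook(uint32_t type, const uint8_t *data,
			 uint32_t datasz, void *arg)
{
	struct find_u32 *f = arg;

	if (type == f->type && datasz == sizeof(f->value)) {
		memcpy(&f->value, data, sizeof(f->value));
		f->found = 1;
	}
	return 0;
}
```

The generic code would own the walk, and the arch hook would only ever
see one (type, data, size) triple at a time, so adding a second property
later does not mean a second pass over the file.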

> +{
> +	struct elf64_hdr *ehdr64 = ehdr_p;
> +	int err = 0;
> +
> +	*property = 0;
> +
> +	if (ehdr64->e_ident[EI_CLASS] == ELFCLASS64) {
> +		struct elf64_phdr *phdr64 = phdr_p;
> +
> +		err = scan_segments_64(f, phdr64, ehdr64->e_phnum,
> +				       pr_type, property);
> +		if (err < 0)
> +			goto out;
> +	} else {
> +		struct elf32_hdr *ehdr32 = ehdr_p;
> +
> +		if (ehdr32->e_ident[EI_CLASS] == ELFCLASS32) {
> +			struct elf32_phdr *phdr32 = phdr_p;
> +
> +			err = scan_segments_32(f, phdr32, ehdr32->e_phnum,
> +					       pr_type, property);
> +			if (err < 0)
> +				goto out;
> +		}
> +	}

We still do nothing and return 0 if e_ident[EI_CLASS] is neither
ELFCLASS32 nor ELFCLASS64, which seems a bit odd.

If we think this should never happen, it might be worth sticking a
WARN() in here and returning an error just in case.
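
Roughly what I have in mind, as a standalone sketch.  The helper name
gnu_property_align() is invented for illustration; the constants are the
usual ELF ones, repeated here only so the snippet compiles on its own:

```c
#include <errno.h>

/* Standard ELF identification constants, repeated so this stands alone. */
enum { EI_CLASS = 4, ELFCLASS32 = 1, ELFCLASS64 = 2 };

/*
 * Hypothetical helper: map the ELF class to the expected property note
 * alignment, and fail for anything unexpected instead of silently
 * returning success.
 */
static int gnu_property_align(const unsigned char *e_ident)
{
	switch (e_ident[EI_CLASS]) {
	case ELFCLASS32:
		return 4;
	case ELFCLASS64:
		return 8;
	default:
		/* In the kernel this is where a WARN() would go. */
		return -EINVAL;
	}
}
```

Having the class decided in one place like this would also help with
folding the 32- and 64-bit scan functions together.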

> +
> +out:
> +	return err;
> +}
> diff --git a/include/linux/elf.h b/include/linux/elf.h
> index e3649b3e970e..c86cbfd17382 100644
> --- a/include/linux/elf.h
> +++ b/include/linux/elf.h
> @@ -56,4 +56,15 @@ static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) {
>  extern int elf_coredump_extra_notes_size(void);
>  extern int elf_coredump_extra_notes_write(struct coredump_params *cprm);
>  #endif
> +
> +#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
> +extern int arch_parse_property(void *ehdr, void *phdr, struct file *f,
> +			       bool inter, struct arch_elf_state *state);
> +extern int arch_setup_property(struct arch_elf_state *state);
> +extern int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
> +			    u32 pr_type, u32 *feature);
> +#else
> +#define arch_parse_property(ehdr, phdr, file, inter, state) (0)
> +#define arch_setup_property(state) (0)

Can we make these fallbacks into static inlines, so that we still get
argument type checking?
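
Something like the following, sketched so that it builds outside the
kernel tree (the forward declarations of struct file and struct
arch_elf_state are stand-ins for the real headers):

```c
#include <stdbool.h>

/* Opaque stand-ins so the sketch compiles outside the kernel tree. */
struct file;
struct arch_elf_state;

#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
extern int arch_parse_property(void *ehdr, void *phdr, struct file *f,
			       bool inter, struct arch_elf_state *state);
extern int arch_setup_property(struct arch_elf_state *state);
#else
/* Fallbacks: do nothing, but still type-check every argument. */
static inline int arch_parse_property(void *ehdr, void *phdr,
				      struct file *f, bool inter,
				      struct arch_elf_state *state)
{
	return 0;
}

static inline int arch_setup_property(struct arch_elf_state *state)
{
	return 0;
}
#endif
```

With the macro version, a caller passing the wrong pointer type only
breaks the build on configs that select ARCH_USE_GNU_PROPERTY; with the
inlines it breaks everywhere, which is what we want.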

> +#endif
>  #endif /* _LINUX_ELF_H */
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index 34c02e4290fe..530ce08467c2 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -36,6 +36,7 @@ typedef __s64	Elf64_Sxword;
>  #define PT_LOPROC  0x70000000
>  #define PT_HIPROC  0x7fffffff
>  #define PT_GNU_EH_FRAME		0x6474e550
> +#define PT_GNU_PROPERTY		0x6474e553
>  
>  #define PT_GNU_STACK	(PT_LOOS + 0x474e551)
>  
> @@ -443,4 +444,17 @@ typedef struct elf64_note {
>    Elf64_Word n_type;	/* Content type */
>  } Elf64_Nhdr;
>  
> +/* NT_GNU_PROPERTY_TYPE_0 header */
> +struct gnu_property {
> +  __u32 pr_type;
> +  __u32 pr_datasz;

Would it make sense to have

	__u8 pr_data[];

here?

We should also be using the Elf types here for pr_type and pr_datasz.

Maybe we can follow hjl's lead on the definition of the type...

In linux-abi-draft.pdf, we already have

	typedef struct {
		Elf_Word pr_type;
		Elf_Word pr_datasz;
		unsigned char pr_data[PR_DATASZ];
		unsigned char pr_padding[PR_PADDING];
	} Elf_Prop;

This doesn't work as a generic definition due to the variable-sized
arrays, but we can omit pr_padding.  For Linux purposes, __u8 is
probably preferable to unsigned char for pr_data, which we can leave as
a flexible array member.

I see no reason not to introduce

typedef __u32 Elf_Word;

somewhere so that we don't have to pointlessly special-case Elf_Prop for
the 32- and 64-bit cases.
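
Roughly like this, as a userspace sketch with uint32_t/uint8_t standing
in for __u32/__u8.  gnu_property_size() is a hypothetical helper, not
something in the patch; it just shows that the padding can be computed
rather than declared.  My understanding is that the alignment is 4
bytes for ELFCLASS32 and 8 bytes for ELFCLASS64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Elf_Word;	/* stands in for "typedef __u32 Elf_Word" */

/* NT_GNU_PROPERTY_TYPE_0 property header, same layout for both classes */
struct gnu_property {
	Elf_Word pr_type;
	Elf_Word pr_datasz;
	uint8_t  pr_data[];	/* pr_datasz bytes; padding is not declared */
};

/*
 * Hypothetical helper: bytes occupied by one property, including the
 * trailing padding.  @align must be a power of two.
 */
static size_t gnu_property_size(const struct gnu_property *pr, size_t align)
{
	size_t sz = offsetof(struct gnu_property, pr_data) + pr->pr_datasz;

	return (sz + align - 1) & ~(align - 1);
}
```

The only class-dependent part is then the alignment passed in by the
caller, so the struct itself needs no 32-/64-bit variants.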

> +};
> +
> +/* .note.gnu.property types */
> +#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002
> +
> +/* Bits of GNU_PROPERTY_X86_FEATURE_1_AND */
> +#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001
> +#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002
> +

[...]

Cheers
---Dave

* [linux-kernel-mentees] [PATCH v5] Doc : fs : convert xfs.txt to ReST
From: Sheriff Esseson @ 2019-07-02 12:30 UTC (permalink / raw)
  To: skhan
  Cc: darrick.wong, linux-xfs, corbet, linux-doc, linux-kernel,
	linux-kernel-mentees

Convert xfs.txt to ReST, rename it to xfs.rst, and fix the references that the
rename would otherwise break.

Make the name "value" in "option=value" look like a variable (which it probably
is) by enclosing it in angle brackets "<>", rather than like something
predefined elsewhere. This is in line with the conventions used in manuals.

Also, prefix the defaults of boolean options with "(*)". This lets the options
be compressed to "[no]option" on a single line, which renders consistently and
nicely in htmldocs.

Lastly, enforce a "one option, one definition" policy to keep things
consistent and simple.


Signed-off-by: Sheriff Esseson <sheriffesseson@gmail.com>
---

v5 aims to comply with the guiding comments on its previous versions.

 Documentation/filesystems/dax.txt   |   2 +-
 Documentation/filesystems/index.rst |   5 +-
 Documentation/filesystems/xfs.rst   | 468 +++++++++++++++++++++++++++
 Documentation/filesystems/xfs.txt   | 470 ----------------------------
 MAINTAINERS                         |   2 +-
 5 files changed, 473 insertions(+), 474 deletions(-)
 create mode 100644 Documentation/filesystems/xfs.rst
 delete mode 100644 Documentation/filesystems/xfs.txt

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 6d2c0d340..c333285b8 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -76,7 +76,7 @@ exposure of uninitialized data through mmap.
 These filesystems may be used for inspiration:
 - ext2: see Documentation/filesystems/ext2.txt
 - ext4: see Documentation/filesystems/ext4/
-- xfs:  see Documentation/filesystems/xfs.txt
+- xfs:  see Documentation/filesystems/xfs.rst
 
 
 Handling Media Errors
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 1131c34d7..a4cf5fca4 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -16,7 +16,7 @@ algorithms work.
 .. toctree::
    :maxdepth: 2
 
-   path-lookup.rst
+   path-lookup
    api-summary
    splice
 
@@ -40,4 +40,5 @@ Documentation for individual filesystem types can be found here.
 .. toctree::
    :maxdepth: 2
 
-   binderfs.rst
+   binderfs
+   xfs
diff --git a/Documentation/filesystems/xfs.rst b/Documentation/filesystems/xfs.rst
new file mode 100644
index 000000000..d36ef042c
--- /dev/null
+++ b/Documentation/filesystems/xfs.rst
@@ -0,0 +1,468 @@
+.. SPDX-License-Identifier: GPL-2.0
+======================
+The SGI XFS Filesystem
+======================
+
+XFS is a high performance journaling filesystem which originated
+on the SGI IRIX platform.  It is completely multi-threaded, can
+support large files and large filesystems, extended attributes,
+variable block sizes, is extent based, and makes extensive use of
+Btrees (directories, extents, free space) to aid both performance
+and scalability.
+
+Refer to the documentation at https://xfs.wiki.kernel.org/
+for further details.  This implementation is on-disk compatible
+with the IRIX version of XFS.
+
+
+Mount Options
+=============
+
+When mounting an XFS filesystem, the following options are accepted.  For
+boolean mount options, the name with the "(*)" prefix is the default behaviour.
+For example, taking enabled as one (1) and disabled as zero (0),
+``(*)[no]default`` defaults to 0, while ``[no](*)default`` defaults to 1.
+
+   allocsize=<size>
+        Sets the buffered I/O end-of-file preallocation size when doing delayed
+        allocation writeout (default size is 64KiB).  Valid values for this
+        option are page size (typically 4KiB) through to 1GiB, inclusive, in
+        power-of-2 increments.
+
+        The default behaviour is for dynamic end-of-file preallocation size,
+        which uses a set of heuristics to optimise the preallocation size based
+        on the current allocation patterns within the file and the access
+        patterns to the file. Specifying a fixed allocsize value turns off the
+        dynamic behaviour.
+
+   [no]attr2
+        The options enable/disable an "opportunistic" improvement to be made in
+        the way inline extended attributes are stored on-disk.  When the new
+        form is used for the first time when ``attr2`` is selected (either when
+        setting or removing extended attributes) the on-disk superblock feature
+        bit field will be updated to reflect this format being in use.
+
+        The default behaviour is determined by the on-disk feature bit
+        indicating that ``attr2`` behaviour is active. If either mount option is
+        set, then that becomes the new default used by the filesystem. However,
+        CRC enabled filesystems always use the ``attr2`` format, and so will
+        reject the ``noattr2`` mount option if it is set.
+
+   (*)[no]discard
+        Enable/disable the issuing of commands to let the block device reclaim
+        space freed by the filesystem.  This is useful for SSD devices, thinly
+        provisioned LUNs and virtual machine images, but may have a performance
+        impact.
+
+        Note: It is currently recommended that you use the ``fstrim``
+        application to discard unused blocks rather than the ``discard`` mount
+        option because the performance impact of this option is quite severe.
+
+   grpid/bsdgroups
+   nogrpid/(*)sysvgroups
+        These options define what group ID a newly created file gets.  When
+        ``grpid`` is set, it takes the group ID of the directory in which it is
+        created; otherwise it takes the ``fsgid`` of the current process, unless
+        the directory has the ``setgid`` bit set, in which case it takes the
+        ``gid`` from the parent directory, and also gets the ``setgid`` bit set
+        if it is a directory itself.
+
+   filestreams
+        Make the data allocator use the filestreams allocation mode across the
+        entire filesystem rather than just on directories configured to use it.
+
+   (*)[no]ikeep
+        When ``ikeep`` is specified, XFS does not delete empty inode clusters
+        and keeps them around on disk.  When ``noikeep`` is specified, empty
+        inode clusters are returned to the free space pool.
+
+   inode32 | (*)inode64
+        When ``inode32`` is specified, it indicates that XFS limits inode
+        creation to locations which will not result in inode numbers with more
+        than 32 bits of significance.
+
+        When ``inode64`` is specified, it indicates that XFS is allowed to
+        create inodes at any location in the filesystem, including those which
+        will result in inode numbers occupying more than 32 bits of
+        significance.
+
+        ``inode32`` is provided for backwards compatibility with older systems
+        and applications, since 64 bits inode numbers might cause problems for
+        some applications that cannot handle large inode numbers.  If
+        applications are in use which do not handle inode numbers bigger than 32
+        bits, the ``inode32`` option should be specified.
+
+
+   (*)[no]largeio
+        If ``nolargeio`` is specified, the optimal I/O reported in st_blksize by
+        **stat(2)** will be as small as possible to allow user applications to
+        avoid inefficient read/modify/write I/O.  This is typically the page
+        size of the machine, as this is the granularity of the page cache.
+
+        If ``largeio`` is specified, a filesystem that was created with a
+        ``swidth`` specified will return the ``swidth`` value (in bytes) in
+        st_blksize. If the filesystem does not have a ``swidth`` specified but
+        does specify an ``allocsize`` then ``allocsize`` (in bytes) will be
+        returned instead. Otherwise the behaviour is the same as if
+        ``nolargeio`` was specified.
+
+   logbufs=<value>
+        Set the number of in-memory log buffers to ``value``.  Valid numbers
+        range from 2-8 inclusive.
+
+        The default value is 8 buffers.
+
+        If the memory cost of 8 log buffers is too high on small systems, then
+        it may be reduced at some cost to performance on metadata intensive
+        workloads. The ``logbsize`` option below controls the size of each
+        buffer and so is also relevant to this case.
+
+   logbsize=<value>
+        Set the size of each in-memory log buffer to ``value``.  The size may be
+        specified in bytes, or in kilobytes with a "k" suffix. Valid sizes for
+        version 1 and version 2 logs are 16384 (16k) and 32768 (32k).  Valid
+        sizes for version 2 logs also include 65536 (64k), 131072 (128k) and
+        262144 (256k). The ``logbsize`` must be an integer multiple of the
+        "log stripe unit" configured at mkfs time.
+
+        The default value for version 1 logs is 32768, while the default
+        value for version 2 logs is ``MAX(32768, log_sunit)``.
+
+   logdev=<device>
+        Use ``device`` as an external log (metadata journal).  In an XFS
+        filesystem, the log device can be separate from the data device or
+        contained within it.
+
+   rtdev=<device>
+        An XFS filesystem has up to three parts: a data section, a log section,
+        and a real-time section.  The real-time section is optional.  If
+        enabled, ``rtdev`` sets ``device`` to be used as an external real-time
+        section, similar to ``logdev`` above.
+
+   noalign
+        Data allocations will not be aligned at stripe unit boundaries. This is
+        only relevant to filesystems created with non-zero data alignment
+        parameters (sunit, swidth) by mkfs.
+
+   norecovery
+        The filesystem will be mounted without running log recovery.  If the
+        filesystem was not cleanly unmounted, it is likely to be inconsistent
+        when mounted in ``norecovery`` mode.  Some files or directories may not
+        be accessible because of this.  Filesystems mounted ``norecovery`` must
+        be mounted read-only or the mount will fail.
+
+   nouuid
+        Don't check for double mounted file systems using the file system uuid.
+        This is useful to mount LVM snapshot volumes, and often used in
+        combination with ``norecovery`` for mounting read-only snapshots.
+
+   noquota
+	Forcibly turns off all quota accounting and enforcement
+	within the filesystem.
+
+   uquota/usrquota/uqnoenforce/quota
+        User disk quota accounting enabled, and limits (optionally) enforced.
+        Refer to **xfs_quota(8)** for further details.
+
+   gquota/grpquota/gqnoenforce
+        Group disk quota accounting enabled and limits (optionally) enforced.
+        Refer to **xfs_quota(8)** for further details.
+
+   pquota/prjquota/pqnoenforce
+        Project disk quota accounting enabled and limits (optionally) enforced.
+        Refer to **xfs_quota(8)** for further details.
+
+   sunit=<value>
+        Used to specify the stripe unit for a RAID device or (in conjunction
+        with ``swidth`` below) a stripe volume.  ``value`` must be specified in
+        512-byte block units. This option is only relevant to filesystems that
+        were created with non-zero data alignment parameters.
+
+        The ``sunit`` parameter specified must be compatible with the existing
+        filesystem alignment characteristics.  In general, that means the only
+        valid changes to ``sunit`` are increasing it by a power-of-2 multiple.
+
+        Typically, this mount option is necessary only after an underlying RAID
+        device has had its geometry modified, such as adding a new disk to a
+        RAID5 lun and reshaping it.
+
+   swidth=<value>
+        Used to specify the stripe width for a RAID device or (in conjunction
+        with ``sunit`` above) a stripe volume.  ``value`` must be specified in
+        512-byte block units. This option, like ``sunit`` above, is only
+        relevant to filesystems that were created with non-zero data alignment
+        parameters.
+
+        The ``swidth`` parameter specified must be compatible with the existing
+        filesystem alignment characteristics.  In general, that means the only
+        valid swidth values are any integer multiple of a valid ``sunit`` value.
+
+        Typically, this mount option is necessary only after an underlying RAID
+        device has had its geometry modified, such as adding a new disk to a
+        RAID5 lun and reshaping it.
+
+
+   swalloc
+        Data allocations will be rounded up to stripe width boundaries when the
+        current end of file is being extended and the file size is larger than
+        the stripe width size.
+
+   wsync
+        When specified, all filesystem namespace operations are executed
+        synchronously. This ensures that when the namespace operation (create,
+        unlink, etc) completes, the change to the namespace is on stable
+        storage. This is useful in HA setups where failover must not result in
+        clients seeing inconsistent namespace presentation during or after a
+        failover event.
+
+
+Deprecated Mount Options
+========================
+
+  Name				Removal Schedule
+  ----				----------------
+
+
+Removed Mount Options
+=====================
+
+  Name				Removed
+  ----				-------
+  delaylog/nodelaylog		v4.0
+  ihashsize			v4.0
+  irixsgid			v4.0
+  osyncisdsync/osyncisosync	v4.0
+  barrier			v4.19
+  nobarrier			v4.19
+
+
+sysctls
+=======
+
+The following sysctls are available for the XFS filesystem:
+
+  fs.xfs.stats_clear		(Min: 0  Default: 0  Max: 1)
+	Setting this to "1" clears accumulated XFS statistics
+	in /proc/fs/xfs/stat.  It then immediately resets to "0".
+
+  fs.xfs.xfssyncd_centisecs	(Min: 100  Default: 3000  Max: 720000)
+	The interval at which the filesystem flushes metadata
+	out to disk and runs internal cache cleanup routines.
+
+  fs.xfs.filestream_centisecs	(Min: 1  Default: 3000  Max: 360000)
+	The interval at which the filesystem ages filestreams cache
+	references and returns timed-out AGs back to the free stream
+	pool.
+
+  fs.xfs.speculative_prealloc_lifetime
+		(Units: seconds   Min: 1  Default: 300  Max: 86400)
+	The interval at which the background scanning for inodes
+	with unused speculative preallocation runs. The scan
+	removes unused preallocation from clean inodes and releases
+	the unused space back to the free pool.
+
+  fs.xfs.error_level		(Min: 0  Default: 3  Max: 11)
+	A volume knob for error reporting when internal errors occur.
+	This will generate detailed messages & backtraces for filesystem
+	shutdowns, for example.  Current threshold values are:
+
+		XFS_ERRLEVEL_OFF:       0
+		XFS_ERRLEVEL_LOW:       1
+		XFS_ERRLEVEL_HIGH:      5
+
+  fs.xfs.panic_mask		(Min: 0  Default: 0  Max: 256)
+	Causes certain error conditions to call BUG(). Value is a bitmask;
+	OR together the tags which represent errors which should cause panics:
+
+		XFS_NO_PTAG                     0
+		XFS_PTAG_IFLUSH                 0x00000001
+		XFS_PTAG_LOGRES                 0x00000002
+		XFS_PTAG_AILDELETE              0x00000004
+		XFS_PTAG_ERROR_REPORT           0x00000008
+		XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
+		XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
+		XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
+		XFS_PTAG_FSBLOCK_ZERO           0x00000080
+		XFS_PTAG_VERIFIER_ERROR         0x00000100
+
+	This option is intended for debugging only.
+
+  fs.xfs.irix_symlink_mode	(Min: 0  Default: 0  Max: 1)
+	Controls whether symlinks are created with mode 0777 (default)
+	or whether their mode is affected by the umask (irix mode).
+
+  fs.xfs.irix_sgid_inherit	(Min: 0  Default: 0  Max: 1)
+	Controls files created in SGID directories.
+	If the group ID of the new file does not match the effective group
+	ID or one of the supplementary group IDs of the parent dir, the
+	ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
+	is set.
+
+  fs.xfs.inherit_sync		(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "sync" flag set
+	by the **xfs_io(8)** chattr command on a directory to be
+	inherited by files in that directory.
+
+  fs.xfs.inherit_nodump		(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "nodump" flag set
+	by the **xfs_io(8)** chattr command on a directory to be
+	inherited by files in that directory.
+
+  fs.xfs.inherit_noatime	(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "noatime" flag set
+	by the **xfs_io(8)** chattr command on a directory to be
+	inherited by files in that directory.
+
+  fs.xfs.inherit_nosymlinks	(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "nosymlinks" flag set
+	by the **xfs_io(8)** chattr command on a directory to be
+	inherited by files in that directory.
+
+  fs.xfs.inherit_nodefrag	(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "nodefrag" flag set
+	by the **xfs_io(8)** chattr command on a directory to be
+	inherited by files in that directory.
+
+  fs.xfs.rotorstep		(Min: 1  Default: 1  Max: 256)
+	In "inode32" allocation mode, this option determines how many
+	files the allocator attempts to allocate in the same allocation
+	group before moving to the next allocation group.  The intent
+	is to control the rate at which the allocator moves between
+	allocation groups when allocating extents for new files.
+
+Deprecated Sysctls
+==================
+
+None at present.
+
+
+Removed Sysctls
+===============
+
+  Name				Removed
+  ----				-------
+  fs.xfs.xfsbufd_centisec	v4.0
+  fs.xfs.age_buffer_centisecs	v4.0
+
+
+Error handling
+==============
+
+XFS can act differently according to the type of error found during its
+operation. The implementation introduces the following concepts to the error
+handler:
+
+ -failure speed:
+	Defines how fast XFS should propagate an error upwards when a specific
+	error is found during the filesystem operation. It can propagate
+	immediately, after a defined number of retries, after a set time period,
+	or simply retry forever.
+
+ -error classes:
+	Specifies the subsystem the error configuration will apply to, such as
+	metadata IO or memory allocation. Different subsystems will have
+	different error handlers for which behaviour can be configured.
+
+ -error handlers:
+	Defines the behavior for a specific error.
+
+The filesystem behavior during an error can be set via sysfs files. Each
+error handler works independently - the first condition met by an error handler
+for a specific class will cause the error to be propagated rather than reset and
+retried.
+
+The action taken by the filesystem when the error is propagated is context
+dependent - it may cause a shut down in the case of an unrecoverable error,
+it may be reported back to userspace, or it may even be ignored because
+there's nothing useful we can do with the error or anyone we can report it to
+(e.g. during unmount).
+
+The configuration files are organized into the following hierarchy for each
+mounted filesystem:
+
+  /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+Where:
+  <dev>
+	The short device name of the mounted filesystem. This is the same device
+	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
+
+  <class>
+	The subsystem the error configuration belongs to. As of 4.9, the defined
+	classes are:
+
+		- "metadata": applies to metadata buffer write IO
+
+  <error>
+	The individual error handler configurations.
+
+
+Each filesystem has "global" error configuration options defined in their top
+level directory:
+
+  /sys/fs/xfs/<dev>/error/
+
+  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
+	Defines the filesystem error behavior at unmount time.
+
+	If set to a value of 1, XFS will override all other error configurations
+	during unmount and replace them with "immediate fail" characteristics.
+	i.e. no retries, no retry timeout. This will always allow unmount to
+	succeed when there are persistent errors present.
+
+	If set to 0, the configured retry behaviour will continue until all
+	retries and/or timeouts have been exhausted. This will delay unmount
+	completion when there are persistent errors, and it may prevent the
+	filesystem from ever unmounting fully in the case of "retry forever"
+	handler configurations.
+
+	Note: there is no guarantee that fail_at_unmount can be set while an
+	unmount is in progress. It is possible that the sysfs entries are
+	removed by the unmounting filesystem before a "retry forever" error
+	handler configuration causes unmount to hang, and hence the filesystem
+	must be configured appropriately before unmount begins to prevent
+	unmount hangs.
+
+Each filesystem has specific error class handlers that define the error
+propagation behaviour for specific errors. There is also a "default" error
+handler defined, which defines the behaviour for all errors that don't have
+specific handlers defined. Where multiple retry constraints are configured for
+a single error, the first retry configuration that expires will cause the error
+to be propagated. The handler configurations are found in the directory:
+
+  /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
+	Defines the allowed number of retries of a specific error before
+	the filesystem will propagate the error. The retry count for a given
+	error context (e.g. a specific metadata buffer) is reset every time
+	there is a successful completion of the operation.
+
+	Setting the value to "-1" will cause XFS to retry forever for this
+	specific error.
+
+	Setting the value to "0" will cause XFS to fail immediately when the
+	specific error is reported.
+
+	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
+	operation "N" times before propagating the error.
+
+  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
+	Define the amount of time (in seconds) that the filesystem is
+	allowed to retry its operations when the specific error is
+	found.
+
+	Setting the value to "-1" will allow XFS to retry forever for this
+	specific error.
+
+	Setting the value to "0" will cause XFS to fail immediately when the
+	specific error is reported.
+
+	Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
+	operation for up to "N" seconds before propagating the error.
+
+Note: The default behaviour for a specific error handler is dependent on both
+the class and error context. For example, the default values for
+"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
+to "fail immediately" behaviour. This is done because ENODEV is a fatal,
+unrecoverable error no matter how many times the metadata IO is retried.
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
deleted file mode 100644
index a5cbb5e0e..000000000
--- a/Documentation/filesystems/xfs.txt
+++ /dev/null
@@ -1,470 +0,0 @@
-
-The SGI XFS Filesystem
-======================
-
-XFS is a high performance journaling filesystem which originated
-on the SGI IRIX platform.  It is completely multi-threaded, can
-support large files and large filesystems, extended attributes,
-variable block sizes, is extent based, and makes extensive use of
-Btrees (directories, extents, free space) to aid both performance
-and scalability.
-
-Refer to the documentation at https://xfs.wiki.kernel.org/
-for further details.  This implementation is on-disk compatible
-with the IRIX version of XFS.
-
-
-Mount Options
-=============
-
-When mounting an XFS filesystem, the following options are accepted.
-For boolean mount options, the names with the (*) suffix is the
-default behaviour.
-
-  allocsize=size
-	Sets the buffered I/O end-of-file preallocation size when
-	doing delayed allocation writeout (default size is 64KiB).
-	Valid values for this option are page size (typically 4KiB)
-	through to 1GiB, inclusive, in power-of-2 increments.
-
-	The default behaviour is for dynamic end-of-file
-	preallocation size, which uses a set of heuristics to
-	optimise the preallocation size based on the current
-	allocation patterns within the file and the access patterns
-	to the file. Specifying a fixed allocsize value turns off
-	the dynamic behaviour.
-
-  attr2
-  noattr2
-	The options enable/disable an "opportunistic" improvement to
-	be made in the way inline extended attributes are stored
-	on-disk.  When the new form is used for the first time when
-	attr2 is selected (either when setting or removing extended
-	attributes) the on-disk superblock feature bit field will be
-	updated to reflect this format being in use.
-
-	The default behaviour is determined by the on-disk feature
-	bit indicating that attr2 behaviour is active. If either
-	mount option it set, then that becomes the new default used
-	by the filesystem.
-
-	CRC enabled filesystems always use the attr2 format, and so
-	will reject the noattr2 mount option if it is set.
-
-  discard
-  nodiscard (*)
-	Enable/disable the issuing of commands to let the block
-	device reclaim space freed by the filesystem.  This is
-	useful for SSD devices, thinly provisioned LUNs and virtual
-	machine images, but may have a performance impact.
-
-	Note: It is currently recommended that you use the fstrim
-	application to discard unused blocks rather than the discard
-	mount option because the performance impact of this option
-	is quite severe.
-
-  grpid/bsdgroups
-  nogrpid/sysvgroups (*)
-	These options define what group ID a newly created file
-	gets.  When grpid is set, it takes the group ID of the
-	directory in which it is created; otherwise it takes the
-	fsgid of the current process, unless the directory has the
-	setgid bit set, in which case it takes the gid from the
-	parent directory, and also gets the setgid bit set if it is
-	a directory itself.
-
-  filestreams
-	Make the data allocator use the filestreams allocation mode
-	across the entire filesystem rather than just on directories
-	configured to use it.
-
-  ikeep
-  noikeep (*)
-	When ikeep is specified, XFS does not delete empty inode
-	clusters and keeps them around on disk.  When noikeep is
-	specified, empty inode clusters are returned to the free
-	space pool.
-
-  inode32
-  inode64 (*)
-	When inode32 is specified, it indicates that XFS limits
-	inode creation to locations which will not result in inode
-	numbers with more than 32 bits of significance.
-
-	When inode64 is specified, it indicates that XFS is allowed
-	to create inodes at any location in the filesystem,
-	including those which will result in inode numbers occupying
-	more than 32 bits of significance. 
-
-	inode32 is provided for backwards compatibility with older
-	systems and applications, since 64 bits inode numbers might
-	cause problems for some applications that cannot handle
-	large inode numbers.  If applications are in use which do
-	not handle inode numbers bigger than 32 bits, the inode32
-	option should be specified.
-
-
-  largeio
-  nolargeio (*)
-	If "nolargeio" is specified, the optimal I/O reported in
-	st_blksize by stat(2) will be as small as possible to allow
-	user applications to avoid inefficient read/modify/write
-	I/O.  This is typically the page size of the machine, as
-	this is the granularity of the page cache.
-
-	If "largeio" specified, a filesystem that was created with a
-	"swidth" specified will return the "swidth" value (in bytes)
-	in st_blksize. If the filesystem does not have a "swidth"
-	specified but does specify an "allocsize" then "allocsize"
-	(in bytes) will be returned instead. Otherwise the behaviour
-	is the same as if "nolargeio" was specified.
-
-  logbufs=value
-	Set the number of in-memory log buffers.  Valid numbers
-	range from 2-8 inclusive.
-
-	The default value is 8 buffers.
-
-	If the memory cost of 8 log buffers is too high on small
-	systems, then it may be reduced at some cost to performance
-	on metadata intensive workloads. The logbsize option below
-	controls the size of each buffer and so is also relevant to
-	this case.
-
-  logbsize=value
-	Set the size of each in-memory log buffer.  The size may be
-	specified in bytes, or in kilobytes with a "k" suffix.
-	Valid sizes for version 1 and version 2 logs are 16384 (16k)
-	and 32768 (32k).  Valid sizes for version 2 logs also
-	include 65536 (64k), 131072 (128k) and 262144 (256k). The
-	logbsize must be an integer multiple of the log
-	stripe unit configured at mkfs time.
-
-	The default value for for version 1 logs is 32768, while the
-	default value for version 2 logs is MAX(32768, log_sunit).
-
-  logdev=device and rtdev=device
-	Use an external log (metadata journal) and/or real-time device.
-	An XFS filesystem has up to three parts: a data section, a log
-	section, and a real-time section.  The real-time section is
-	optional, and the log section can be separate from the data
-	section or contained within it.
-
-  noalign
-	Data allocations will not be aligned at stripe unit
-	boundaries. This is only relevant to filesystems created
-	with non-zero data alignment parameters (sunit, swidth) by
-	mkfs.
-
-  norecovery
-	The filesystem will be mounted without running log recovery.
-	If the filesystem was not cleanly unmounted, it is likely to
-	be inconsistent when mounted in "norecovery" mode.
-	Some files or directories may not be accessible because of this.
-	Filesystems mounted "norecovery" must be mounted read-only or
-	the mount will fail.
-
-  nouuid
-	Don't check for double mounted file systems using the file
-	system uuid.  This is useful to mount LVM snapshot volumes,
-	and often used in combination with "norecovery" for mounting
-	read-only snapshots.
-
-  noquota
-	Forcibly turns off all quota accounting and enforcement
-	within the filesystem.
-
-  uquota/usrquota/uqnoenforce/quota
-	User disk quota accounting enabled, and limits (optionally)
-	enforced.  Refer to xfs_quota(8) for further details.
-
-  gquota/grpquota/gqnoenforce
-	Group disk quota accounting enabled and limits (optionally)
-	enforced.  Refer to xfs_quota(8) for further details.
-
-  pquota/prjquota/pqnoenforce
-	Project disk quota accounting enabled and limits (optionally)
-	enforced.  Refer to xfs_quota(8) for further details.
-
-  sunit=value and swidth=value
-	Used to specify the stripe unit and width for a RAID device
-	or a stripe volume.  "value" must be specified in 512-byte
-	block units. These options are only relevant to filesystems
-	that were created with non-zero data alignment parameters.
-
-	The sunit and swidth parameters specified must be compatible
-	with the existing filesystem alignment characteristics.  In
-	general, that means the only valid changes to sunit are
-	increasing it by a power-of-2 multiple. Valid swidth values
-	are any integer multiple of a valid sunit value.
-
-	Typically the only time these mount options are necessary if
-	after an underlying RAID device has had it's geometry
-	modified, such as adding a new disk to a RAID5 lun and
-	reshaping it.
-
-  swalloc
-	Data allocations will be rounded up to stripe width boundaries
-	when the current end of file is being extended and the file
-	size is larger than the stripe width size.
-
-  wsync
-	When specified, all filesystem namespace operations are
-	executed synchronously. This ensures that when the namespace
-	operation (create, unlink, etc) completes, the change to the
-	namespace is on stable storage. This is useful in HA setups
-	where failover must not result in clients seeing
-	inconsistent namespace presentation during or after a
-	failover event.
-
-
-Deprecated Mount Options
-========================
-
-  Name				Removal Schedule
-  ----				----------------
-
-
-Removed Mount Options
-=====================
-
-  Name				Removed
-  ----				-------
-  delaylog/nodelaylog		v4.0
-  ihashsize			v4.0
-  irixsgid			v4.0
-  osyncisdsync/osyncisosync	v4.0
-  barrier			v4.19
-  nobarrier			v4.19
-
-
-sysctls
-=======
-
-The following sysctls are available for the XFS filesystem:
-
-  fs.xfs.stats_clear		(Min: 0  Default: 0  Max: 1)
-	Setting this to "1" clears accumulated XFS statistics
-	in /proc/fs/xfs/stat.  It then immediately resets to "0".
-
-  fs.xfs.xfssyncd_centisecs	(Min: 100  Default: 3000  Max: 720000)
-	The interval at which the filesystem flushes metadata
-	out to disk and runs internal cache cleanup routines.
-
-  fs.xfs.filestream_centisecs	(Min: 1  Default: 3000  Max: 360000)
-	The interval at which the filesystem ages filestreams cache
-	references and returns timed-out AGs back to the free stream
-	pool.
-
-  fs.xfs.speculative_prealloc_lifetime
-		(Units: seconds   Min: 1  Default: 300  Max: 86400)
-	The interval at which the background scanning for inodes
-	with unused speculative preallocation runs. The scan
-	removes unused preallocation from clean inodes and releases
-	the unused space back to the free pool.
-
-  fs.xfs.error_level		(Min: 0  Default: 3  Max: 11)
-	A volume knob for error reporting when internal errors occur.
-	This will generate detailed messages & backtraces for filesystem
-	shutdowns, for example.  Current threshold values are:
-
-		XFS_ERRLEVEL_OFF:       0
-		XFS_ERRLEVEL_LOW:       1
-		XFS_ERRLEVEL_HIGH:      5
-
-  fs.xfs.panic_mask		(Min: 0  Default: 0  Max: 256)
-	Causes certain error conditions to call BUG(). Value is a bitmask;
-	OR together the tags which represent errors which should cause panics:
-
-		XFS_NO_PTAG                     0
-		XFS_PTAG_IFLUSH                 0x00000001
-		XFS_PTAG_LOGRES                 0x00000002
-		XFS_PTAG_AILDELETE              0x00000004
-		XFS_PTAG_ERROR_REPORT           0x00000008
-		XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
-		XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
-		XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
-		XFS_PTAG_FSBLOCK_ZERO           0x00000080
-		XFS_PTAG_VERIFIER_ERROR         0x00000100
-
-	This option is intended for debugging only.
-
-  fs.xfs.irix_symlink_mode	(Min: 0  Default: 0  Max: 1)
-	Controls whether symlinks are created with mode 0777 (default)
-	or whether their mode is affected by the umask (irix mode).
-
-  fs.xfs.irix_sgid_inherit	(Min: 0  Default: 0  Max: 1)
-	Controls files created in SGID directories.
-	If the group ID of the new file does not match the effective group
-	ID or one of the supplementary group IDs of the parent dir, the
-	ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
-	is set.
-
-  fs.xfs.inherit_sync		(Min: 0  Default: 1  Max: 1)
-	Setting this to "1" will cause the "sync" flag set
-	by the xfs_io(8) chattr command on a directory to be
-	inherited by files in that directory.
-
-  fs.xfs.inherit_nodump		(Min: 0  Default: 1  Max: 1)
-	Setting this to "1" will cause the "nodump" flag set
-	by the xfs_io(8) chattr command on a directory to be
-	inherited by files in that directory.
-
-  fs.xfs.inherit_noatime	(Min: 0  Default: 1  Max: 1)
-	Setting this to "1" will cause the "noatime" flag set
-	by the xfs_io(8) chattr command on a directory to be
-	inherited by files in that directory.
-
-  fs.xfs.inherit_nosymlinks	(Min: 0  Default: 1  Max: 1)
-	Setting this to "1" will cause the "nosymlinks" flag set
-	by the xfs_io(8) chattr command on a directory to be
-	inherited by files in that directory.
-
-  fs.xfs.inherit_nodefrag	(Min: 0  Default: 1  Max: 1)
-	Setting this to "1" will cause the "nodefrag" flag set
-	by the xfs_io(8) chattr command on a directory to be
-	inherited by files in that directory.
-
-  fs.xfs.rotorstep		(Min: 1  Default: 1  Max: 256)
-	In "inode32" allocation mode, this option determines how many
-	files the allocator attempts to allocate in the same allocation
-	group before moving to the next allocation group.  The intent
-	is to control the rate at which the allocator moves between
-	allocation groups when allocating extents for new files.
-
-Deprecated Sysctls
-==================
-
-None at present.
-
-
-Removed Sysctls
-===============
-
-  Name				Removed
-  ----				-------
-  fs.xfs.xfsbufd_centisec	v4.0
-  fs.xfs.age_buffer_centisecs	v4.0
-
-
-Error handling
-==============
-
-XFS can act differently according to the type of error found during its
-operation. The implementation introduces the following concepts to the error
-handler:
-
- -failure speed:
-	Defines how fast XFS should propagate an error upwards when a specific
-	error is found during the filesystem operation. It can propagate
-	immediately, after a defined number of retries, after a set time period,
-	or simply retry forever.
-
- -error classes:
-	Specifies the subsystem the error configuration will apply to, such as
-	metadata IO or memory allocation. Different subsystems will have
-	different error handlers for which behaviour can be configured.
-
- -error handlers:
-	Defines the behavior for a specific error.
-
-The filesystem behavior during an error can be set via sysfs files. Each
-error handler works independently - the first condition met by an error handler
-for a specific class will cause the error to be propagated rather than reset and
-retried.
-
-The action taken by the filesystem when the error is propagated is context
-dependent - it may cause a shut down in the case of an unrecoverable error,
-it may be reported back to userspace, or it may even be ignored because
-there's nothing useful we can with the error or anyone we can report it to (e.g.
-during unmount).
-
-The configuration files are organized into the following hierarchy for each
-mounted filesystem:
-
-  /sys/fs/xfs/<dev>/error/<class>/<error>/
-
-Where:
-  <dev>
-	The short device name of the mounted filesystem. This is the same device
-	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
-
-  <class>
-	The subsystem the error configuration belongs to. As of 4.9, the defined
-	classes are:
-
-		- "metadata": applies metadata buffer write IO
-
-  <error>
-	The individual error handler configurations.
-
-
-Each filesystem has "global" error configuration options defined in their top
-level directory:
-
-  /sys/fs/xfs/<dev>/error/
-
-  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
-	Defines the filesystem error behavior at unmount time.
-
-	If set to a value of 1, XFS will override all other error configurations
-	during unmount and replace them with "immediate fail" characteristics.
-	i.e. no retries, no retry timeout. This will always allow unmount to
-	succeed when there are persistent errors present.
-
-	If set to 0, the configured retry behaviour will continue until all
-	retries and/or timeouts have been exhausted. This will delay unmount
-	completion when there are persistent errors, and it may prevent the
-	filesystem from ever unmounting fully in the case of "retry forever"
-	handler configurations.
-
-	Note: there is no guarantee that fail_at_unmount can be set while an
-	unmount is in progress. It is possible that the sysfs entries are
-	removed by the unmounting filesystem before a "retry forever" error
-	handler configuration causes unmount to hang, and hence the filesystem
-	must be configured appropriately before unmount begins to prevent
-	unmount hangs.
-
-Each filesystem has specific error class handlers that define the error
-propagation behaviour for specific errors. There is also a "default" error
-handler defined, which defines the behaviour for all errors that don't have
-specific handlers defined. Where multiple retry constraints are configuredi for
-a single error, the first retry configuration that expires will cause the error
-to be propagated. The handler configurations are found in the directory:
-
-  /sys/fs/xfs/<dev>/error/<class>/<error>/
-
-  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
-	Defines the allowed number of retries of a specific error before
-	the filesystem will propagate the error. The retry count for a given
-	error context (e.g. a specific metadata buffer) is reset every time
-	there is a successful completion of the operation.
-
-	Setting the value to "-1" will cause XFS to retry forever for this
-	specific error.
-
-	Setting the value to "0" will cause XFS to fail immediately when the
-	specific error is reported.
-
-	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
-	operation "N" times before propagating the error.
-
-  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
-	Define the amount of time (in seconds) that the filesystem is
-	allowed to retry its operations when the specific error is
-	found.
-
-	Setting the value to "-1" will allow XFS to retry forever for this
-	specific error.
-
-	Setting the value to "0" will cause XFS to fail immediately when the
-	specific error is reported.
-
-	Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
-	operation for up to "N" seconds before propagating the error.
-
-Note: The default behaviour for a specific error handler is dependent on both
-the class and error context. For example, the default values for
-"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
-to "fail immediately" behaviour. This is done because ENODEV is a fatal,
-unrecoverable error no matter how many times the metadata IO is retried.
diff --git a/MAINTAINERS b/MAINTAINERS
index d0ed73599..66e972e9a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17364,7 +17364,7 @@ L:	linux-xfs@vger.kernel.org
 W:	http://xfs.org/
 T:	git git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git
 S:	Supported
-F:	Documentation/filesystems/xfs.txt
+F:	Documentation/filesystems/xfs.rst
 F:	fs/xfs/
 
 XILINX AXI ETHERNET DRIVER
-- 
2.22.0



* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
From: bsegall @ 2019-07-01 20:15 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner
In-Reply-To: <1561664970-1555-2-git-send-email-chiluk+linux@indeed.com>

Alright, this prototype of "maybe we should just 100% avoid stranding
runtime for a full period" appears to fix the fibtest synthetic example,
and seems like a theoretically-reasonable approach.

Things that may want improvement or at least thought (but it's a holiday
week in the US and I wanted any opinions on the basic approach):

- I don't like cfs_rq->slack_list, since logically it's mutually
  exclusive with throttled_list, but we have to iterate without
  locks, so I don't know that we can avoid it.

- I previously was using _rcu for the slack list, like throttled, but
  there is no list_for_each_entry_rcu_safe, so the list_del_init would
  be invalid and we'd have to use another flag or opencode the
  equivalent.

- (Actually, this just made me realize that distribute is sorta wrong if
  the unthrottled task immediately runs and rethrottles; this would just
  mean that we effectively restart the loop)

- We unconditionally start the slack timer, even if nothing is
  throttled. We could instead have throttle start the timer in this case
  (setting the timeout some definition of "appropriately"), but this
  bookkeeping would be a big hassle.

- We could try to do better about deciding what cfs_rqs are idle than
  "nr_running == 0", possibly requiring that to have been true for N<5
  ms, and restarting the slack timer if we didn't clear everything.

- Taking up-to-every rq->lock is bad and expensive and 5ms may be too
  short a delay for this. I haven't tried microbenchmarks on the cost of
  this vs min_cfs_rq_runtime = 0 vs baseline.

- runtime_expires vs expires_seq choice is basically rand(), much like
  the existing code. (I think the most consistent with existing code
  would be runtime_expires, since cfs_b lock is held; I think most
  consistent in general would change most of the existing ones as well
  to be seq)


-- >8 --
Subject: [PATCH] sched: avoid stranding cfs_bandwidth runtime

We avoid contention on the per-tg cfs_b->lock by keeping 1ms of runtime on a
cfs_rq even when all tasks in that cfs_rq dequeue. This way tasks doing frequent
wake/sleep can't hit this cross-cpu lock more than once per ms. This however
means that up to 1ms of runtime per cpu can be lost if no task does wake up on
that cpu, which is leading to issues on cgroups with low quota, many available
cpus, and a combination of threads that run for very little time and ones that
want to run constantly.

This was previously hidden by runtime expiration being broken, which allowed
this stranded runtime to be kept indefinitely across period resets. Thus after
an initial period or two any long-running tasks could use an appropriate portion
of their group's quota. The issue was that the group could also potentially
burst for 1ms * cpus more than their quota allowed, and in these situations this
is a significant increase.

Fix this by having the group's slack timer (which runs at most once per 5ms)
remove all runtime from empty cfs_rqs, not just redistribute any runtime above
that 1ms that was returned immediately.

Signed-off-by: Ben Segall <bsegall@google.com>
---
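For a sense of scale, the stranding described in the changelog can be
sketched in userspace C. This is only an illustration, not kernel code:
the 1ms figure is the per-cfs_rq reserve mentioned above
(min_cfs_rq_runtime), while the function names and the sample numbers
are assumptions made for the example.

```c
#include <stdint.h>

/* Assumed from the changelog: each cfs_rq may hold back up to 1ms. */
#define MIN_CFS_RQ_RUNTIME_NS 1000000ULL

/* Upper bound on runtime parked on idle cfs_rqs, in nanoseconds. */
uint64_t max_stranded_ns(unsigned int nr_idle_cpus)
{
	return (uint64_t)nr_idle_cpus * MIN_CFS_RQ_RUNTIME_NS;
}

/*
 * Quota left for long-running tasks in one period, in the worst case
 * where every idle cpu strands its full reserve until the period ends.
 */
uint64_t usable_quota_ns(uint64_t quota_ns, unsigned int nr_idle_cpus)
{
	uint64_t stranded = max_stranded_ns(nr_idle_cpus);

	return stranded >= quota_ns ? 0 : quota_ns - stranded;
}
```

With a hypothetical 20ms quota and 8 cpus that each ran a short-lived
thread, up to 8ms could sit unused, leaving 12ms for the threads that
want to run constantly; with a 5ms quota nothing would be left.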
 kernel/sched/fair.c  | 66 +++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  2 ++
 2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 300f2c54dea5..80b2198d9b29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4745,23 +4745,32 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
 
-	if (slack_runtime <= 0)
+	if (cfs_rq->runtime_remaining <= 0)
+		return;
+
+	if (slack_runtime <= 0 && !list_empty(&cfs_rq->slack_list))
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
-		cfs_b->runtime += slack_runtime;
+	    cfs_rq->expires_seq == cfs_b->expires_seq) {
+		if (slack_runtime > 0)
+			cfs_b->runtime += slack_runtime;
+		if (list_empty(&cfs_rq->slack_list))
+			list_add(&cfs_rq->slack_list, &cfs_b->slack_cfs_rq);
 
-		/* we are under rq->lock, defer unthrottling using a timer */
-		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
-		    !list_empty(&cfs_b->throttled_cfs_rq))
-			start_cfs_slack_bandwidth(cfs_b);
+		/*
+		 * After a timeout, gather our remaining runtime so it can't get
+		 * stranded. We need a timer anyways to distribute any of the
+		 * runtime due to locking issues.
+		 */
+		start_cfs_slack_bandwidth(cfs_b);
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
-	/* even if it's not valid for return we don't want to try again */
-	cfs_rq->runtime_remaining -= slack_runtime;
+	if (slack_runtime > 0)
+		/* even if it's not valid for return we don't want to try again */
+		cfs_rq->runtime_remaining -= slack_runtime;
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
  */
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
-	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 runtime = 0;
 	unsigned long flags;
 	u64 expires;
+	struct cfs_rq *cfs_rq, *temp;
+	LIST_HEAD(temp_head);
+
+	local_irq_save(flags);
+
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->slack_started = false;
+	list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
+	raw_spin_unlock(&cfs_b->lock);
+
+
+	/* Gather all left over runtime from all rqs */
+	list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
+		struct rq *rq = rq_of(cfs_rq);
+		struct rq_flags rf;
+
+		rq_lock(rq, &rf);
+
+		raw_spin_lock(&cfs_b->lock);
+		list_del_init(&cfs_rq->slack_list);
+		if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
+		    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+			cfs_b->runtime += cfs_rq->runtime_remaining;
+			cfs_rq->runtime_remaining = 0;
+		}
+		raw_spin_unlock(&cfs_b->lock);
+
+		rq_unlock(rq, &rf);
+	}
 
 	/* confirm we're still not at a refresh boundary */
-	raw_spin_lock_irqsave(&cfs_b->lock, flags);
+	raw_spin_lock(&cfs_b->lock);
 	cfs_b->slack_started = false;
 	if (cfs_b->distribute_running) {
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
@@ -4798,7 +4836,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 		return;
 	}
 
-	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
+	if (cfs_b->quota != RUNTIME_INF)
 		runtime = cfs_b->runtime;
 
 	expires = cfs_b->runtime_expires;
@@ -4946,6 +4984,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
+	INIT_LIST_HEAD(&cfs_b->slack_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
@@ -4958,6 +4997,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
+	INIT_LIST_HEAD(&cfs_rq->slack_list);
 }
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b08dee29ef5e..3b272ee894fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -345,6 +345,7 @@ struct cfs_bandwidth {
 	struct hrtimer		period_timer;
 	struct hrtimer		slack_timer;
 	struct list_head	throttled_cfs_rq;
+	struct list_head	slack_cfs_rq;
 
 	/* Statistics: */
 	int			nr_periods;
@@ -566,6 +567,7 @@ struct cfs_rq {
 	int			throttled;
 	int			throttle_count;
 	struct list_head	throttled_list;
+	struct list_head	slack_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
-- 
2.22.0.410.gd8fdbe21b5-goog



* Re: [RFC PATCH] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Yu-cheng Yu @ 2019-07-01 19:49 UTC (permalink / raw)
  To: Jann Horn
  Cc: the arch/x86 maintainers, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, kernel list, linux-doc, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar, Vedvyas Shanbhogue,
	Dave Martin
In-Reply-To: <CAG48ez0rHHfcRgiVZf5FP0YOzxsXigvpg6ci790cmiN6PBwkhQ@mail.gmail.com>

On Mon, 2019-07-01 at 21:49 +0200, Jann Horn wrote:
> On Fri, Jun 28, 2019 at 7:30 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> [...]
> > In the discussion, we decided to look at only an ELF header's
> > PT_GNU_PROPERTY, which is a shortcut pointing to the file's
> > .note.gnu.property.
> > 
> > The Linux gABI extension draft is here:
> > 
> >     https://github.com/hjl-tools/linux-abi/wiki/linux-abi-draft.pdf.
> > 
> > A few existing CET-enabled binary files were built without
> > PT_GNU_PROPERTY; but those files' .note.gnu.property are checked by
> > ld-linux, not Linux.  The compatibility impact from this change is
> > therefore manageable.
> > 
> > An ELF file's .note.gnu.property indicates features the executable file
> > can support.  For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
> > indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
> > GNU_PROPERTY_X86_FEATURE_1_SHSTK.
> > 
> > With this patch, if an arch needs to setup features from ELF properties,
> > it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and specific
> > arch_parse_property() and arch_setup_property().
> 
> [...]
> > +typedef bool (test_item_fn)(void *buf, u32 *arg, u32 type);
> > +typedef void *(next_item_fn)(void *buf, u32 *arg, u32 type);
> > +
> > +static bool test_property(void *buf, u32 *max_type, u32 pr_type)
> > +{
> > +       struct gnu_property *pr = buf;
> > +
> > +       /*
> > +        * Property types must be in ascending order.
> > +        * Keep track of the max when testing each.
> > +        */
> > +       if (pr->pr_type > *max_type)
> > +               *max_type = pr->pr_type;
> > +
> > +       return (pr->pr_type == pr_type);
> > +}
> > +
> > +static void *next_property(void *buf, u32 *max_type, u32 pr_type)
> > +{
> > +       struct gnu_property *pr = buf;
> > +
> > +       if ((buf + sizeof(*pr) + pr->pr_datasz < buf) ||
> 
> This looks like UB to me, see below.
> 
> > +           (pr->pr_type > pr_type) ||
> > +           (pr->pr_type > *max_type))
> > +               return NULL;
> > +       else
> > +               return (buf + sizeof(*pr) + pr->pr_datasz);
> > +}
> > +
> > +/*
> > + * Scan 'buf' for a pattern; return true if found.
> > + * *pos is the distance from the beginning of buf to where
> > + * the searched item or the next item is located.
> > + */
> > +static int scan(u8 *buf, u32 buf_size, int item_size, test_item_fn test_item,
> > +               next_item_fn next_item, u32 *arg, u32 type, u32 *pos)
> > +{
> > +       int found = 0;
> > +       u8 *p, *max;
> > +
> > +       max = buf + buf_size;
> > +       if (max < buf)
> > +               return 0;
> 
> How can this ever legitimately happen? If it can't, perhaps you meant
> to put a WARN_ON_ONCE() or something like that here?
> Also, computing out-of-bounds pointers is UB (section 6.5.6 of C99:
> "If both the pointer operand and the result point to elements of the
> same array object, or one past the last element of the array object,
> the evaluation shall not produce an overflow; otherwise, the behavior
> is undefined."), and if the addition makes the pointer wrap, that's
> certainly out of bounds; so I don't think this condition can trigger
> without UB.
> 
> > +
> > +       p = buf;
> > +
> > +       while ((p + item_size < max) && (p + item_size > buf)) {
> 
> Again, as far as I know, this is technically UB. Please rewrite this.
> For example, you could do something like:
> 
>     while (max - p >= item_size) {
> 
> and then make sure that next_item() never computes OOB pointers.
> 
> > +               if (test_item(p, arg, type)) {
> > +                       found = 1;
> > +                       break;
> > +               }
> > +
> > +               p = next_item(p, arg, type);
> > +       }
> > +
> > +       *pos = (p + item_size <= buf) ? 0 : (u32)(p - buf);
> > +       return found;
> > +}
> > +
> > +/*
> > + * Search an NT_GNU_PROPERTY_TYPE_0 for the property of 'pr_type'.
> > + */
> > +static int find_property(u32 pr_type, u32 *property, struct file *file,
> > +                        loff_t file_offset, unsigned long desc_size)
> > +{
> > +       u8 *buf;
> > +       int buf_size;
> > +
> > +       u32 buf_pos;
> > +       unsigned long read_size;
> > +       unsigned long done;
> > +       int found = 0;
> > +       int ret = 0;
> > +       u32 last_pr = 0;
> > +
> > +       *property = 0;
> > +       buf_pos = 0;
> > +
> > +       buf_size = (desc_size > PAGE_SIZE) ? PAGE_SIZE : desc_size;
> 
> open-coded min(desc_size, PAGE_SIZE)
> 
> > +       buf = kmalloc(buf_size, GFP_KERNEL);
> > +       if (!buf)
> > +               return -ENOMEM;
> > +
> > +       for (done = 0; done < desc_size; done += buf_pos) {
> > +               read_size = desc_size - done;
> > +               if (read_size > buf_size)
> > +                       read_size = buf_size;
> > +
> > +               ret = kernel_read(file, buf, read_size, &file_offset);
> > +
> > +               if (ret != read_size)
> > +                       return (ret < 0) ? ret : -EIO;
> 
> This leaks the memory allocated for `buf`.
> 
> > +
> > +               ret = 0;
> > +               found = scan(buf, read_size, sizeof(struct gnu_property),
> > +                            test_property, next_property,
> > +                            &last_pr, pr_type, &buf_pos);
> > +
> > +               if ((!buf_pos) || found)
> > +                       break;
> > +
> > +               file_offset += buf_pos - read_size;
> > +       }
> 
> [...]
> > +       kfree(buf);
> > +       return ret;
> > +}

I will fix these.

Thanks,
Yu-cheng


* Re: [RFC PATCH 3/3] Prevent user from writing to IBT bitmap.
From: Yu-cheng Yu @ 2019-07-01 19:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, LKML,
	open list:DOCUMENTATION, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin
In-Reply-To: <CALCETrXsXXJWTSJxUO8YxHUo=QJKmHyJa7iz+jOBjWMRhno4rA@mail.gmail.com>

On Sat, 2019-06-29 at 16:44 -0700, Andy Lutomirski wrote:
> On Fri, Jun 28, 2019 at 12:50 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > 
> > The IBT bitmap is visible from user-mode, but not writable.
> > 
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > ---
> >  arch/x86/mm/fault.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 59f4f66e4f2e..231196abb62e 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1454,6 +1454,13 @@ void do_user_addr_fault(struct pt_regs *regs,
> >          * we can handle it..
> >          */
> >  good_area:
> > +#define USER_MODE_WRITE (FAULT_FLAG_WRITE | FAULT_FLAG_USER)
> > +       if (((flags & USER_MODE_WRITE)  == USER_MODE_WRITE) &&
> > +           (vma->vm_flags & VM_IBT)) {
> > +               bad_area_access_error(regs, hw_error_code, address, vma);
> > +               return;
> > +       }
> > +
> 
> Just make the VMA have VM_WRITE and VM_MAYWRITE clear.  No new code
> like this should be required.

Ok, I will work on that.

Thanks,
Yu-cheng


* Re: [RFC PATCH] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Jann Horn @ 2019-07-01 19:49 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: the arch/x86 maintainers, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, kernel list, linux-doc, Linux-MM, linux-arch,
	Linux API, Arnd Bergmann, Andy Lutomirski, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jonathan Corbet,
	Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov, Pavel Machek,
	Peter Zijlstra, Randy Dunlap, Ravi V. Shankar, Vedvyas Shanbhogue,
	Dave Martin
In-Reply-To: <20190628172203.797-1-yu-cheng.yu@intel.com>

On Fri, Jun 28, 2019 at 7:30 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
[...]
> In the discussion, we decided to look at only an ELF header's
> PT_GNU_PROPERTY, which is a shortcut pointing to the file's
> .note.gnu.property.
>
> The Linux gABI extension draft is here:
>
>     https://github.com/hjl-tools/linux-abi/wiki/linux-abi-draft.pdf.
>
> A few existing CET-enabled binary files were built without
> PT_GNU_PROPERTY; but those files' .note.gnu.property are checked by
> ld-linux, not Linux.  The compatibility impact from this change is
> therefore manageable.
>
> An ELF file's .note.gnu.property indicates features the executable file
> can support.  For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
> indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
> GNU_PROPERTY_X86_FEATURE_1_SHSTK.
>
> With this patch, if an arch needs to setup features from ELF properties,
> it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and specific
> arch_parse_property() and arch_setup_property().
[...]
> +typedef bool (test_item_fn)(void *buf, u32 *arg, u32 type);
> +typedef void *(next_item_fn)(void *buf, u32 *arg, u32 type);
> +
> +static bool test_property(void *buf, u32 *max_type, u32 pr_type)
> +{
> +       struct gnu_property *pr = buf;
> +
> +       /*
> +        * Property types must be in ascending order.
> +        * Keep track of the max when testing each.
> +        */
> +       if (pr->pr_type > *max_type)
> +               *max_type = pr->pr_type;
> +
> +       return (pr->pr_type == pr_type);
> +}
> +
> +static void *next_property(void *buf, u32 *max_type, u32 pr_type)
> +{
> +       struct gnu_property *pr = buf;
> +
> +       if ((buf + sizeof(*pr) + pr->pr_datasz < buf) ||

This looks like UB to me, see below.

> +           (pr->pr_type > pr_type) ||
> +           (pr->pr_type > *max_type))
> +               return NULL;
> +       else
> +               return (buf + sizeof(*pr) + pr->pr_datasz);
> +}
> +
> +/*
> + * Scan 'buf' for a pattern; return true if found.
> + * *pos is the distance from the beginning of buf to where
> + * the searched item or the next item is located.
> + */
> +static int scan(u8 *buf, u32 buf_size, int item_size, test_item_fn test_item,
> +               next_item_fn next_item, u32 *arg, u32 type, u32 *pos)
> +{
> +       int found = 0;
> +       u8 *p, *max;
> +
> +       max = buf + buf_size;
> +       if (max < buf)
> +               return 0;

How can this ever legitimately happen? If it can't, perhaps you meant
to put a WARN_ON_ONCE() or something like that here?
Also, computing out-of-bounds pointers is UB (section 6.5.6 of C99:
"If both the pointer operand and the result point to elements of the
same array object, or one past the last element of the array object,
the evaluation shall not produce an overflow; otherwise, the behavior
is undefined."), and if the addition makes the pointer wrap, that's
certainly out of bounds; so I don't think this condition can trigger
without UB.

> +
> +       p = buf;
> +
> +       while ((p + item_size < max) && (p + item_size > buf)) {

Again, as far as I know, this is technically UB. Please rewrite this.
For example, you could do something like:

    while (max - p >= item_size) {

and then make sure that next_item() never computes OOB pointers.
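To make that concrete, here is a minimal userspace sketch of a scan loop
that only compares offsets and remaining lengths, so no pointer outside
the buffer is ever formed. struct prop_hdr and find_prop() are
hypothetical stand-ins for struct gnu_property and scan(); the sketch
also ignores alignment padding and the ascending-order check, so it is
not a drop-in replacement.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the patch's struct gnu_property header. */
struct prop_hdr {
	uint32_t pr_type;
	uint32_t pr_datasz;
};

/*
 * Return the offset of the first property of type 'want' in buf, or -1
 * if it is absent or the buffer is malformed.  The invariant off <= len
 * holds throughout, so "len - off" never underflows.
 */
ptrdiff_t find_prop(const uint8_t *buf, size_t len, uint32_t want)
{
	size_t off = 0;

	while (len - off >= sizeof(struct prop_hdr)) {
		struct prop_hdr hdr;

		/* Copy the header out to avoid unaligned accesses. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.pr_type == want)
			return (ptrdiff_t)off;

		/* Check that the payload fits before stepping over it. */
		if (hdr.pr_datasz > len - off - sizeof(hdr))
			return -1;
		off += sizeof(hdr) + hdr.pr_datasz;
	}
	return -1;
}
```

The length check on pr_datasz takes the place of the "did the pointer
wrap" test: a bogus size is rejected before any address is computed
from it.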

> +               if (test_item(p, arg, type)) {
> +                       found = 1;
> +                       break;
> +               }
> +
> +               p = next_item(p, arg, type);
> +       }
> +
> +       *pos = (p + item_size <= buf) ? 0 : (u32)(p - buf);
> +       return found;
> +}
> +
> +/*
> + * Search an NT_GNU_PROPERTY_TYPE_0 for the property of 'pr_type'.
> + */
> +static int find_property(u32 pr_type, u32 *property, struct file *file,
> +                        loff_t file_offset, unsigned long desc_size)
> +{
> +       u8 *buf;
> +       int buf_size;
> +
> +       u32 buf_pos;
> +       unsigned long read_size;
> +       unsigned long done;
> +       int found = 0;
> +       int ret = 0;
> +       u32 last_pr = 0;
> +
> +       *property = 0;
> +       buf_pos = 0;
> +
> +       buf_size = (desc_size > PAGE_SIZE) ? PAGE_SIZE : desc_size;

This is an open-coded min(desc_size, PAGE_SIZE); please use min() here.

> +       buf = kmalloc(buf_size, GFP_KERNEL);
> +       if (!buf)
> +               return -ENOMEM;
> +
> +       for (done = 0; done < desc_size; done += buf_pos) {
> +               read_size = desc_size - done;
> +               if (read_size > buf_size)
> +                       read_size = buf_size;
> +
> +               ret = kernel_read(file, buf, read_size, &file_offset);
> +
> +               if (ret != read_size)
> +                       return (ret < 0) ? ret : -EIO;

This leaks the memory allocated for `buf`.
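
One common way to avoid this is a single exit label that frees the
buffer. The following is a user-space sketch under stated assumptions,
not a drop-in replacement: `read_chunks()`, the `read_fn` callback type
and the three reader helpers are hypothetical, and malloc()/free() stand
in for kmalloc()/kfree():

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reader: returns bytes read, or a negative errno value. */
typedef long (*read_fn)(void *dst, size_t len, void *ctx);

static int read_chunks(read_fn rd, void *ctx, size_t total, size_t chunk)
{
	unsigned char *buf;
	size_t done;
	int ret = 0;

	if (!chunk)
		return -EINVAL;

	buf = malloc(chunk);
	if (!buf)
		return -ENOMEM;

	for (done = 0; done < total; ) {
		size_t want = total - done < chunk ? total - done : chunk;
		long n = rd(buf, want, ctx);

		if (n != (long)want) {
			ret = n < 0 ? (int)n : -EIO;
			goto out;	/* no early return: buf must be freed */
		}
		done += want;
	}

out:
	free(buf);
	return ret;
}

/* Illustrative readers used to exercise each path. */
static long full_read(void *dst, size_t len, void *ctx)
{
	(void)ctx;
	memset(dst, 0, len);
	return (long)len;
}

static long short_read(void *dst, size_t len, void *ctx)
{
	(void)dst;
	(void)ctx;
	return len ? (long)len - 1 : 0;
}

static long failing_read(void *dst, size_t len, void *ctx)
{
	(void)dst;
	(void)len;
	(void)ctx;
	return -EBADF;
}
```

Every path after the allocation goes through "out", so adding further
error checks later cannot reintroduce the leak.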

> +
> +               ret = 0;
> +               found = scan(buf, read_size, sizeof(struct gnu_property),
> +                            test_property, next_property,
> +                            &last_pr, pr_type, &buf_pos);
> +
> +               if ((!buf_pos) || found)
> +                       break;
> +
> +               file_offset += buf_pos - read_size;
> +       }
[...]
> +       kfree(buf);
> +       return ret;
> +}


* Re: [UPDATE][PATCH 10/10] tools/power/x86: A tool to validate Intel Speed Select commands
From: Srinivas Pandruvada @ 2019-07-01 15:18 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Darren Hart, Andy Shevchenko, Andriy Shevchenko, Jonathan Corbet,
	Rafael J. Wysocki, Alan Cox, Len Brown, prarit, darcari,
	Linux Documentation List, Linux Kernel Mailing List,
	Platform Driver
In-Reply-To: <CAHp75Vf-p3O10_Ns_NY4JoWBS1S34z-NW0jVJdCdqszdGVmoQw@mail.gmail.com>

On Mon, 2019-07-01 at 14:32 +0300, Andy Shevchenko wrote:
> On Sun, Jun 30, 2019 at 8:14 PM Srinivas Pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> > 
> > The Intel(R) Speed Select technologies contain four features.
> > 
> > Performance profile: A non-architectural mechanism that allows multiple
> > optimized performance profiles per system via static and/or dynamic
> > adjustment of core count, workload, Tjmax, and TDP, etc. aka ISS
> > in the documentation.
> > 
> > Base Frequency: Enables users to increase guaranteed base frequency on
> > certain cores (high priority cores) in exchange for lower base frequency
> > on remaining cores (low priority cores). aka PBF in the documentation.
> > 
> > Turbo frequency: Enables the ability to set different turbo ratio limits
> > to cores based on priority. aka FACT in the documentation.
> > 
> > Core power: An interface that allows the user to define per core/tile
> > priority.
> > 
> > There is a multi-level help for commands and options. This can be used
> > to check required arguments for each feature and commands for the
> > feature.
> > 
> > To start navigating the features start with
> > 
> > $sudo intel-speed-select --help
> > 
> > For help on a specific feature for example
> > $sudo intel-speed-select perf-profile --help
> > 
> > To get help for a command for a feature for example
> > $sudo intel-speed-select perf-profile get-lock-status --help
> > 
> > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > ---
> > Updates:
> > - Copied Makefile from tools/gpio and modified the Makefile here
> > - Added entry to tools/build/Makefile
> > - Rename directory to match the executable name
> > - Fix one error message
> 
> Thanks!
> I pushed to my review and testing queue, while still waiting for some
> ACKs.
> 
> It seems I can promote the driver itself now, w/o tools, if you want
> me to do so.
I am fine with a driver-only push if we don't get an ACK by your deadline
for the next kernel.

Thanks,
Srinivas




* [PATCH v14 03/13] dt-bindings: Add doc for the Ingenic TCU drivers
From: Paul Cercueil @ 2019-07-01 15:14 UTC (permalink / raw)
  To: Lee Jones, Jonathan Corbet, Ralf Baechle, Paul Burton,
	James Hogan, Michael Turquette, Stephen Boyd, Thomas Gleixner,
	Daniel Lezcano
  Cc: Mathieu Malaterre, od, devicetree, linux-kernel, linux-doc,
	linux-mips, linux-clk, Paul Cercueil, Rob Herring, Artur Rojek
In-Reply-To: <20190701151410.23127-1-paul@crapouillou.net>

Add documentation about how to properly use the Ingenic TCU
(Timer/Counter Unit) drivers from devicetree.

Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Reviewed-by: Rob Herring <robh@kernel.org>
Tested-by: Mathieu Malaterre <malat@debian.org>
Tested-by: Artur Rojek <contact@artur-rojek.eu>
---

Notes:
    v4: New patch in this series. Corresponds to V2 patches 3-4-5 with
     added content.
    
    v5:
     - Edited PWM/watchdog DT bindings documentation to point to the new
       document.
     - Moved main document to
       Documentation/devicetree/bindings/timer/ingenic,tcu.txt
     - Updated documentation to reflect the new devicetree bindings.
    
    v6:
     - Removed PWM/watchdog documentation files as asked by upstream
     - Removed doc about properties that should be implicit
     - Removed doc about ingenic,timer-channel /
       ingenic,clocksource-channel as they are gone
     - Fix WDT clock name in the binding doc
     - Fix lengths of register areas in watchdog/pwm nodes
    
    v7: No change
    
    v8:
     - Fix address of the PWM node
     - Added doc about system timer and clocksource children nodes
    
    v9:
     - Remove doc about system timer and clocksource children
       nodes...
     - Add doc about ingenic,pwm-channels-mask property
    
    v10: No change
    
    v11: Fix info about default value of ingenic,pwm-channels-mask
    
    v12: Drop sub-nodes for now; they will be introduced in a follow-up
         patchset.
    
    v13:
     - Revert back to v11. Turns out it was okay.
     - Remove 'interrupt-parent' of the list of required properties.
    
    v14: No change

 .../bindings/pwm/ingenic,jz47xx-pwm.txt       |  25 ----
 .../devicetree/bindings/timer/ingenic,tcu.txt | 136 ++++++++++++++++++
 .../bindings/watchdog/ingenic,jz4740-wdt.txt  |  17 ---
 3 files changed, 136 insertions(+), 42 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/pwm/ingenic,jz47xx-pwm.txt
 create mode 100644 Documentation/devicetree/bindings/timer/ingenic,tcu.txt
 delete mode 100644 Documentation/devicetree/bindings/watchdog/ingenic,jz4740-wdt.txt

diff --git a/Documentation/devicetree/bindings/pwm/ingenic,jz47xx-pwm.txt b/Documentation/devicetree/bindings/pwm/ingenic,jz47xx-pwm.txt
deleted file mode 100644
index 7d9d3f90641b..000000000000
--- a/Documentation/devicetree/bindings/pwm/ingenic,jz47xx-pwm.txt
+++ /dev/null
@@ -1,25 +0,0 @@
-Ingenic JZ47xx PWM Controller
-=============================
-
-Required properties:
-- compatible: One of:
-  * "ingenic,jz4740-pwm"
-  * "ingenic,jz4770-pwm"
-  * "ingenic,jz4780-pwm"
-- #pwm-cells: Should be 3. See pwm.txt in this directory for a description
-  of the cells format.
-- clocks : phandle to the external clock.
-- clock-names : Should be "ext".
-
-
-Example:
-
-	pwm: pwm@10002000 {
-		compatible = "ingenic,jz4740-pwm";
-		reg = <0x10002000 0x1000>;
-
-		#pwm-cells = <3>;
-
-		clocks = <&ext>;
-		clock-names = "ext";
-	};
diff --git a/Documentation/devicetree/bindings/timer/ingenic,tcu.txt b/Documentation/devicetree/bindings/timer/ingenic,tcu.txt
new file mode 100644
index 000000000000..87962da64561
--- /dev/null
+++ b/Documentation/devicetree/bindings/timer/ingenic,tcu.txt
@@ -0,0 +1,136 @@
+Ingenic JZ47xx SoCs Timer/Counter Unit devicetree bindings
+==========================================================
+
+For a description of the TCU hardware and drivers, have a look at
+Documentation/mips/ingenic-tcu.txt.
+
+Required properties:
+
+- compatible: Must be one of:
+  * ingenic,jz4740-tcu
+  * ingenic,jz4725b-tcu
+  * ingenic,jz4770-tcu
+- reg: Should be the offset/length value corresponding to the TCU registers
+- clocks: List of phandle & clock specifiers for clocks external to the TCU.
+  The "pclk", "rtc" and "ext" clocks should be provided. The "tcu" clock
+  should be provided if the SoC has it.
+- clock-names: List of name strings for the external clocks.
+- #clock-cells: Should be <1>;
+  Clock consumers specify this argument to identify a clock. The valid values
+  may be found in <dt-bindings/clock/ingenic,tcu.h>.
+- interrupt-controller : Identifies the node as an interrupt controller
+- #interrupt-cells : Specifies the number of cells needed to encode an
+  interrupt source. The value should be 1.
+- interrupts : Specifies the interrupt the controller is connected to.
+
+Optional properties:
+
+- ingenic,pwm-channels-mask: Bitmask of TCU channels reserved for PWM use.
+  Default value is 0xfc.
+
+
+Children nodes
+==========================================================
+
+
+PWM node:
+---------
+
+Required properties:
+
+- compatible: Must be one of:
+  * ingenic,jz4740-pwm
+  * ingenic,jz4725b-pwm
+- #pwm-cells: Should be 3. See ../pwm/pwm.txt for a description of the cell
+  format.
+- clocks: List of phandle & clock specifiers for the TCU clocks.
+- clock-names: List of name strings for the TCU clocks.
+
+
+Watchdog node:
+--------------
+
+Required properties:
+
+- compatible: Must be "ingenic,jz4740-watchdog"
+- clocks: phandle to the WDT clock
+- clock-names: should be "wdt"
+
+
+OS Timer node:
+--------------
+
+Required properties:
+
+- compatible: Must be one of:
+  * ingenic,jz4725b-ost
+  * ingenic,jz4770-ost
+- clocks: phandle to the OST clock
+- clock-names: should be "ost"
+- interrupts : Specifies the interrupt the OST is connected to.
+
+
+Example
+==========================================================
+
+#include <dt-bindings/clock/jz4770-cgu.h>
+#include <dt-bindings/clock/ingenic,tcu.h>
+
+/ {
+	tcu: timer@10002000 {
+		compatible = "ingenic,jz4770-tcu";
+		reg = <0x10002000 0x1000>;
+		#address-cells = <1>;
+		#size-cells = <1>;
+		ranges = <0x0 0x10002000 0x1000>;
+
+		#clock-cells = <1>;
+
+		clocks = <&cgu JZ4770_CLK_RTC
+			  &cgu JZ4770_CLK_EXT
+			  &cgu JZ4770_CLK_PCLK>;
+		clock-names = "rtc", "ext", "pclk";
+
+		interrupt-controller;
+		#interrupt-cells = <1>;
+
+		interrupt-parent = <&intc>;
+		interrupts = <27 26 25>;
+
+		watchdog: watchdog@0 {
+			compatible = "ingenic,jz4740-watchdog";
+			reg = <0x0 0xc>;
+
+			clocks = <&tcu TCU_CLK_WDT>;
+			clock-names = "wdt";
+		};
+
+		pwm: pwm@40 {
+			compatible = "ingenic,jz4740-pwm";
+			reg = <0x40 0x80>;
+
+			#pwm-cells = <3>;
+
+			clocks = <&tcu TCU_CLK_TIMER0
+				  &tcu TCU_CLK_TIMER1
+				  &tcu TCU_CLK_TIMER2
+				  &tcu TCU_CLK_TIMER3
+				  &tcu TCU_CLK_TIMER4
+				  &tcu TCU_CLK_TIMER5
+				  &tcu TCU_CLK_TIMER6
+				  &tcu TCU_CLK_TIMER7>;
+			clock-names = "timer0", "timer1", "timer2", "timer3",
+				      "timer4", "timer5", "timer6", "timer7";
+		};
+
+		ost: timer@e0 {
+			compatible = "ingenic,jz4770-ost";
+			reg = <0xe0 0x20>;
+
+			clocks = <&tcu TCU_CLK_OST>;
+			clock-names = "ost";
+
+			interrupts = <15>;
+		};
+	};
+};
diff --git a/Documentation/devicetree/bindings/watchdog/ingenic,jz4740-wdt.txt b/Documentation/devicetree/bindings/watchdog/ingenic,jz4740-wdt.txt
deleted file mode 100644
index ce1cb72d5345..000000000000
--- a/Documentation/devicetree/bindings/watchdog/ingenic,jz4740-wdt.txt
+++ /dev/null
@@ -1,17 +0,0 @@
-Ingenic Watchdog Timer (WDT) Controller for JZ4740 & JZ4780
-
-Required properties:
-compatible: "ingenic,jz4740-watchdog" or "ingenic,jz4780-watchdog"
-reg: Register address and length for watchdog registers
-clocks: phandle to the RTC clock
-clock-names: should be "rtc"
-
-Example:
-
-watchdog: jz4740-watchdog@10002000 {
-	compatible = "ingenic,jz4740-watchdog";
-	reg = <0x10002000 0x10>;
-
-	clocks = <&cgu JZ4740_CLK_RTC>;
-	clock-names = "rtc";
-};
-- 
2.21.0.593.g511ec345e18



* [PATCH v14 05/13] clk: ingenic: Add driver for the TCU clocks
From: Paul Cercueil @ 2019-07-01 15:14 UTC (permalink / raw)
  To: Lee Jones, Jonathan Corbet, Ralf Baechle, Paul Burton,
	James Hogan, Michael Turquette, Stephen Boyd, Thomas Gleixner,
	Daniel Lezcano
  Cc: Mathieu Malaterre, od, devicetree, linux-kernel, linux-doc,
	linux-mips, linux-clk, Paul Cercueil, Artur Rojek
In-Reply-To: <20190701151410.23127-1-paul@crapouillou.net>

Add driver to support the clocks provided by the Timer/Counter Unit
(TCU) of the JZ47xx SoCs from Ingenic.

Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Tested-by: Mathieu Malaterre <malat@debian.org>
Tested-by: Artur Rojek <contact@artur-rojek.eu>
---

Notes:
    v12: New patch
    
    v13:
     - Don't enable/disable the TCU clock on demand. Enable it in the probe
       and call it a day.
     - Register suspend callbacks to gate/ungate the TCU clock on
       suspend/resume.
     - Use pr_fmt and pr_crit instead of custom TCU_ERR() macro
     - Remove useless dependency on COMMON_CLK in Kconfig
     - Remove registration of clkdev
    
    v14: Change %i to %d

 drivers/clk/ingenic/Kconfig  |  10 +-
 drivers/clk/ingenic/Makefile |   1 +
 drivers/clk/ingenic/tcu.c    | 473 +++++++++++++++++++++++++++++++++++
 3 files changed, 483 insertions(+), 1 deletion(-)
 create mode 100644 drivers/clk/ingenic/tcu.c

diff --git a/drivers/clk/ingenic/Kconfig b/drivers/clk/ingenic/Kconfig
index fe8db93cf21a..0f39c038670f 100644
--- a/drivers/clk/ingenic/Kconfig
+++ b/drivers/clk/ingenic/Kconfig
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
-menu "Ingenic JZ47xx CGU drivers"
+menu "Ingenic JZ47xx drivers"
 	depends on MIPS
 
 config INGENIC_CGU_COMMON
@@ -45,4 +45,12 @@ config INGENIC_CGU_JZ4780
 
 	  If building for a JZ4780 SoC, you want to say Y here.
 
+config INGENIC_TCU_CLK
+	bool "Ingenic JZ47xx TCU clocks driver"
+	default MACH_INGENIC
+	select INGENIC_TCU
+	help
+	  Support the clocks of the Timer/Counter Unit (TCU) of the Ingenic
+	  JZ47xx SoCs.
+
 endmenu
diff --git a/drivers/clk/ingenic/Makefile b/drivers/clk/ingenic/Makefile
index ab58a6a862a5..d25a5801bd8a 100644
--- a/drivers/clk/ingenic/Makefile
+++ b/drivers/clk/ingenic/Makefile
@@ -4,3 +4,4 @@ obj-$(CONFIG_INGENIC_CGU_JZ4740)	+= jz4740-cgu.o
 obj-$(CONFIG_INGENIC_CGU_JZ4725B)	+= jz4725b-cgu.o
 obj-$(CONFIG_INGENIC_CGU_JZ4770)	+= jz4770-cgu.o
 obj-$(CONFIG_INGENIC_CGU_JZ4780)	+= jz4780-cgu.o
+obj-$(CONFIG_INGENIC_TCU_CLK)		+= tcu.o
diff --git a/drivers/clk/ingenic/tcu.c b/drivers/clk/ingenic/tcu.c
new file mode 100644
index 000000000000..528aa215eb0f
--- /dev/null
+++ b/drivers/clk/ingenic/tcu.c
@@ -0,0 +1,473 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * JZ47xx SoCs TCU clocks driver
+ * Copyright (C) 2019 Paul Cercueil <paul@crapouillou.net>
+ */
+
+#include <linux/clk.h>
+#include <linux/clk-provider.h>
+#include <linux/clockchips.h>
+#include <linux/mfd/ingenic-tcu.h>
+#include <linux/regmap.h>
+#include <linux/slab.h>
+#include <linux/syscore_ops.h>
+
+#include <dt-bindings/clock/ingenic,tcu.h>
+
+/* 8 channels max + watchdog + OST */
+#define TCU_CLK_COUNT	10
+
+#undef pr_fmt
+#define pr_fmt(fmt) "ingenic-tcu-clk: " fmt
+
+enum tcu_clk_parent {
+	TCU_PARENT_PCLK,
+	TCU_PARENT_RTC,
+	TCU_PARENT_EXT,
+};
+
+struct ingenic_soc_info {
+	unsigned int num_channels;
+	bool has_ost;
+	bool has_tcu_clk;
+};
+
+struct ingenic_tcu_clk_info {
+	struct clk_init_data init_data;
+	u8 gate_bit;
+	u8 tcsr_reg;
+};
+
+struct ingenic_tcu_clk {
+	struct clk_hw hw;
+	unsigned int idx;
+	struct ingenic_tcu *tcu;
+	const struct ingenic_tcu_clk_info *info;
+};
+
+struct ingenic_tcu {
+	const struct ingenic_soc_info *soc_info;
+	struct regmap *map;
+	struct clk *clk;
+
+	struct clk_hw_onecell_data *clocks;
+};
+
+static struct ingenic_tcu *ingenic_tcu;
+
+static inline struct ingenic_tcu_clk *to_tcu_clk(struct clk_hw *hw)
+{
+	return container_of(hw, struct ingenic_tcu_clk, hw);
+}
+
+static int ingenic_tcu_enable(struct clk_hw *hw)
+{
+	struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+	const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+	struct ingenic_tcu *tcu = tcu_clk->tcu;
+
+	regmap_write(tcu->map, TCU_REG_TSCR, BIT(info->gate_bit));
+
+	return 0;
+}
+
+static void ingenic_tcu_disable(struct clk_hw *hw)
+{
+	struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+	const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+	struct ingenic_tcu *tcu = tcu_clk->tcu;
+
+	regmap_write(tcu->map, TCU_REG_TSSR, BIT(info->gate_bit));
+}
+
+static int ingenic_tcu_is_enabled(struct clk_hw *hw)
+{
+	struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+	const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+	unsigned int value;
+
+	regmap_read(tcu_clk->tcu->map, TCU_REG_TSR, &value);
+
+	return !(value & BIT(info->gate_bit));
+}
+
+static bool ingenic_tcu_enable_regs(struct clk_hw *hw)
+{
+	struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+	const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+	struct ingenic_tcu *tcu = tcu_clk->tcu;
+	bool enabled = false;
+
+	/*
+	 * If the SoC has no global TCU clock, we must ungate the channel's
+	 * clock to be able to access its registers.
+	 * If we have a TCU clock, it will be enabled automatically as it has
+	 * been attached to the regmap.
+	 */
+	if (!tcu->clk) {
+		enabled = !!ingenic_tcu_is_enabled(hw);
+		regmap_write(tcu->map, TCU_REG_TSCR, BIT(info->gate_bit));
+	}
+
+	return enabled;
+}
+
+static void ingenic_tcu_disable_regs(struct clk_hw *hw)
+{
+	struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+	const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+	struct ingenic_tcu *tcu = tcu_clk->tcu;
+
+	if (!tcu->clk)
+		regmap_write(tcu->map, TCU_REG_TSSR, BIT(info->gate_bit));
+}
+
+static u8 ingenic_tcu_get_parent(struct clk_hw *hw)
+{
+	struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+	const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+	unsigned int val = 0;
+	int ret;
+
+	ret = regmap_read(tcu_clk->tcu->map, info->tcsr_reg, &val);
+	WARN_ONCE(ret < 0, "Unable to read TCSR %d", tcu_clk->idx);
+
+	return ffs(val & TCU_TCSR_PARENT_CLOCK_MASK) - 1;
+}
+
+static int ingenic_tcu_set_parent(struct clk_hw *hw, u8 idx)
+{
+	struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+	const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+	bool was_enabled;
+	int ret;
+
+	was_enabled = ingenic_tcu_enable_regs(hw);
+
+	ret = regmap_update_bits(tcu_clk->tcu->map, info->tcsr_reg,
+				 TCU_TCSR_PARENT_CLOCK_MASK, BIT(idx));
+	WARN_ONCE(ret < 0, "Unable to update TCSR %d", tcu_clk->idx);
+
+	if (!was_enabled)
+		ingenic_tcu_disable_regs(hw);
+
+	return 0;
+}
+
+static unsigned long ingenic_tcu_recalc_rate(struct clk_hw *hw,
+		unsigned long parent_rate)
+{
+	struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+	const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+	unsigned int prescale;
+	int ret;
+
+	ret = regmap_read(tcu_clk->tcu->map, info->tcsr_reg, &prescale);
+	WARN_ONCE(ret < 0, "Unable to read TCSR %d", tcu_clk->idx);
+
+	prescale = (prescale & TCU_TCSR_PRESCALE_MASK) >> TCU_TCSR_PRESCALE_LSB;
+
+	return parent_rate >> (prescale * 2);
+}
+
+static u8 ingenic_tcu_get_prescale(unsigned long rate, unsigned long req_rate)
+{
+	u8 prescale;
+
+	for (prescale = 0; prescale < 5; prescale++)
+		if ((rate >> (prescale * 2)) <= req_rate)
+			return prescale;
+
+	return 5; /* /1024 divider */
+}
+
+static long ingenic_tcu_round_rate(struct clk_hw *hw, unsigned long req_rate,
+		unsigned long *parent_rate)
+{
+	unsigned long rate = *parent_rate;
+	u8 prescale;
+
+	if (req_rate > rate)
+		return -EINVAL;
+
+	prescale = ingenic_tcu_get_prescale(rate, req_rate);
+
+	return rate >> (prescale * 2);
+}
+
+static int ingenic_tcu_set_rate(struct clk_hw *hw, unsigned long req_rate,
+		unsigned long parent_rate)
+{
+	struct ingenic_tcu_clk *tcu_clk = to_tcu_clk(hw);
+	const struct ingenic_tcu_clk_info *info = tcu_clk->info;
+	u8 prescale = ingenic_tcu_get_prescale(parent_rate, req_rate);
+	bool was_enabled;
+	int ret;
+
+	was_enabled = ingenic_tcu_enable_regs(hw);
+
+	ret = regmap_update_bits(tcu_clk->tcu->map, info->tcsr_reg,
+				 TCU_TCSR_PRESCALE_MASK,
+				 prescale << TCU_TCSR_PRESCALE_LSB);
+	WARN_ONCE(ret < 0, "Unable to update TCSR %d", tcu_clk->idx);
+
+	if (!was_enabled)
+		ingenic_tcu_disable_regs(hw);
+
+	return 0;
+}
+
+static const struct clk_ops ingenic_tcu_clk_ops = {
+	.get_parent	= ingenic_tcu_get_parent,
+	.set_parent	= ingenic_tcu_set_parent,
+
+	.recalc_rate	= ingenic_tcu_recalc_rate,
+	.round_rate	= ingenic_tcu_round_rate,
+	.set_rate	= ingenic_tcu_set_rate,
+
+	.enable		= ingenic_tcu_enable,
+	.disable	= ingenic_tcu_disable,
+	.is_enabled	= ingenic_tcu_is_enabled,
+};
+
+static const char * const ingenic_tcu_timer_parents[] = {
+	[TCU_PARENT_PCLK] = "pclk",
+	[TCU_PARENT_RTC]  = "rtc",
+	[TCU_PARENT_EXT]  = "ext",
+};
+
+#define DEF_TIMER(_name, _gate_bit, _tcsr)				\
+	{								\
+		.init_data = {						\
+			.name = _name,					\
+			.parent_names = ingenic_tcu_timer_parents,	\
+			.num_parents = ARRAY_SIZE(ingenic_tcu_timer_parents),\
+			.ops = &ingenic_tcu_clk_ops,			\
+			.flags = CLK_SET_RATE_UNGATE,			\
+		},							\
+		.gate_bit = _gate_bit,					\
+		.tcsr_reg = _tcsr,					\
+	}
+static const struct ingenic_tcu_clk_info ingenic_tcu_clk_info[] = {
+	[TCU_CLK_TIMER0] = DEF_TIMER("timer0", 0, TCU_REG_TCSRc(0)),
+	[TCU_CLK_TIMER1] = DEF_TIMER("timer1", 1, TCU_REG_TCSRc(1)),
+	[TCU_CLK_TIMER2] = DEF_TIMER("timer2", 2, TCU_REG_TCSRc(2)),
+	[TCU_CLK_TIMER3] = DEF_TIMER("timer3", 3, TCU_REG_TCSRc(3)),
+	[TCU_CLK_TIMER4] = DEF_TIMER("timer4", 4, TCU_REG_TCSRc(4)),
+	[TCU_CLK_TIMER5] = DEF_TIMER("timer5", 5, TCU_REG_TCSRc(5)),
+	[TCU_CLK_TIMER6] = DEF_TIMER("timer6", 6, TCU_REG_TCSRc(6)),
+	[TCU_CLK_TIMER7] = DEF_TIMER("timer7", 7, TCU_REG_TCSRc(7)),
+};
+
+static const struct ingenic_tcu_clk_info ingenic_tcu_watchdog_clk_info =
+					 DEF_TIMER("wdt", 16, TCU_REG_WDT_TCSR);
+static const struct ingenic_tcu_clk_info ingenic_tcu_ost_clk_info =
+					 DEF_TIMER("ost", 15, TCU_REG_OST_TCSR);
+#undef DEF_TIMER
+
+static int __init ingenic_tcu_register_clock(struct ingenic_tcu *tcu,
+			unsigned int idx, enum tcu_clk_parent parent,
+			const struct ingenic_tcu_clk_info *info,
+			struct clk_hw_onecell_data *clocks)
+{
+	struct ingenic_tcu_clk *tcu_clk;
+	int err;
+
+	tcu_clk = kzalloc(sizeof(*tcu_clk), GFP_KERNEL);
+	if (!tcu_clk)
+		return -ENOMEM;
+
+	tcu_clk->hw.init = &info->init_data;
+	tcu_clk->idx = idx;
+	tcu_clk->info = info;
+	tcu_clk->tcu = tcu;
+
+	/* Reset channel and clock divider, set default parent */
+	ingenic_tcu_enable_regs(&tcu_clk->hw);
+	regmap_update_bits(tcu->map, info->tcsr_reg, 0xffff, BIT(parent));
+	ingenic_tcu_disable_regs(&tcu_clk->hw);
+
+	err = clk_hw_register(NULL, &tcu_clk->hw);
+	if (err) {
+		kfree(tcu_clk);
+		return err;
+	}
+
+	clocks->hws[idx] = &tcu_clk->hw;
+
+	return 0;
+}
+
+static const struct ingenic_soc_info jz4740_soc_info = {
+	.num_channels = 8,
+	.has_ost = false,
+	.has_tcu_clk = true,
+};
+
+static const struct ingenic_soc_info jz4725b_soc_info = {
+	.num_channels = 6,
+	.has_ost = true,
+	.has_tcu_clk = true,
+};
+
+static const struct ingenic_soc_info jz4770_soc_info = {
+	.num_channels = 8,
+	.has_ost = true,
+	.has_tcu_clk = false,
+};
+
+static const struct of_device_id ingenic_tcu_of_match[] __initconst = {
+	{ .compatible = "ingenic,jz4740-tcu", .data = &jz4740_soc_info, },
+	{ .compatible = "ingenic,jz4725b-tcu", .data = &jz4725b_soc_info, },
+	{ .compatible = "ingenic,jz4770-tcu", .data = &jz4770_soc_info, },
+	{ }
+};
+
+static int __init ingenic_tcu_probe(struct device_node *np)
+{
+	const struct of_device_id *id = of_match_node(ingenic_tcu_of_match, np);
+	struct ingenic_tcu *tcu;
+	struct regmap *map;
+	unsigned int i;
+	int ret;
+
+	map = ingenic_tcu_get_regmap(np);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	tcu = kzalloc(sizeof(*tcu), GFP_KERNEL);
+	if (!tcu)
+		return -ENOMEM;
+
+	tcu->map = map;
+	tcu->soc_info = id->data;
+
+	if (tcu->soc_info->has_tcu_clk) {
+		tcu->clk = of_clk_get_by_name(np, "tcu");
+		if (IS_ERR(tcu->clk)) {
+			ret = PTR_ERR(tcu->clk);
+			pr_crit("Cannot get TCU clock\n");
+			goto err_free_tcu;
+		}
+
+		ret = clk_prepare_enable(tcu->clk);
+		if (ret) {
+			pr_crit("Unable to enable TCU clock\n");
+			goto err_put_clk;
+		}
+	}
+
+	tcu->clocks = kzalloc(sizeof(*tcu->clocks) +
+			      sizeof(*tcu->clocks->hws) * TCU_CLK_COUNT,
+			      GFP_KERNEL);
+	if (!tcu->clocks) {
+		ret = -ENOMEM;
+		goto err_clk_disable;
+	}
+
+	tcu->clocks->num = TCU_CLK_COUNT;
+
+	for (i = 0; i < tcu->soc_info->num_channels; i++) {
+		ret = ingenic_tcu_register_clock(tcu, i, TCU_PARENT_EXT,
+						 &ingenic_tcu_clk_info[i],
+						 tcu->clocks);
+		if (ret) {
+			pr_crit("cannot register clock %d\n", i);
+			goto err_unregister_timer_clocks;
+		}
+	}
+
+	/*
+	 * We set EXT as the default parent clock for all the TCU clocks
+	 * except for the watchdog one, where we set the RTC clock as the
+	 * parent. Since the EXT and PCLK are much faster than the RTC clock,
+	 * the watchdog would kick after a maximum time of 5s, and we might
+	 * want a slower kicking time.
+	 */
+	ret = ingenic_tcu_register_clock(tcu, TCU_CLK_WDT, TCU_PARENT_RTC,
+					 &ingenic_tcu_watchdog_clk_info,
+					 tcu->clocks);
+	if (ret) {
+		pr_crit("cannot register watchdog clock\n");
+		goto err_unregister_timer_clocks;
+	}
+
+	if (tcu->soc_info->has_ost) {
+		ret = ingenic_tcu_register_clock(tcu, TCU_CLK_OST,
+						 TCU_PARENT_EXT,
+						 &ingenic_tcu_ost_clk_info,
+						 tcu->clocks);
+		if (ret) {
+			pr_crit("cannot register ost clock\n");
+			goto err_unregister_watchdog_clock;
+		}
+	}
+
+	ret = of_clk_add_hw_provider(np, of_clk_hw_onecell_get, tcu->clocks);
+	if (ret) {
+		pr_crit("cannot add OF clock provider\n");
+		goto err_unregister_ost_clock;
+	}
+
+	ingenic_tcu = tcu;
+
+	return 0;
+
+err_unregister_ost_clock:
+	if (tcu->soc_info->has_ost)
+		clk_hw_unregister(tcu->clocks->hws[i + 1]);
+err_unregister_watchdog_clock:
+	clk_hw_unregister(tcu->clocks->hws[i]);
+err_unregister_timer_clocks:
+	for (i = 0; i < tcu->clocks->num; i++)
+		if (tcu->clocks->hws[i])
+			clk_hw_unregister(tcu->clocks->hws[i]);
+	kfree(tcu->clocks);
+err_clk_disable:
+	if (tcu->soc_info->has_tcu_clk)
+		clk_disable_unprepare(tcu->clk);
+err_put_clk:
+	if (tcu->soc_info->has_tcu_clk)
+		clk_put(tcu->clk);
+err_free_tcu:
+	kfree(tcu);
+	return ret;
+}
+
+static int __maybe_unused tcu_pm_suspend(void)
+{
+	struct ingenic_tcu *tcu = ingenic_tcu;
+
+	if (tcu->clk)
+		clk_disable(tcu->clk);
+
+	return 0;
+}
+
+static void __maybe_unused tcu_pm_resume(void)
+{
+	struct ingenic_tcu *tcu = ingenic_tcu;
+
+	if (tcu->clk)
+		clk_enable(tcu->clk);
+}
+
+static struct syscore_ops __maybe_unused tcu_pm_ops = {
+	.suspend = tcu_pm_suspend,
+	.resume = tcu_pm_resume,
+};
+
+static void __init ingenic_tcu_init(struct device_node *np)
+{
+	int ret = ingenic_tcu_probe(np);
+
+	if (ret)
+		pr_crit("Failed to initialize TCU clocks: %d\n", ret);
+
+	if (IS_ENABLED(CONFIG_PM_SLEEP))
+		register_syscore_ops(&tcu_pm_ops);
+}
+
+CLK_OF_DECLARE(jz4740_cgu, "ingenic,jz4740-tcu", ingenic_tcu_init);
+CLK_OF_DECLARE(jz4725b_cgu, "ingenic,jz4725b-tcu", ingenic_tcu_init);
+CLK_OF_DECLARE(jz4770_cgu, "ingenic,jz4770-tcu", ingenic_tcu_init);
-- 
2.21.0.593.g511ec345e18



* [PATCH v14 04/13] mfd: Add Ingenic TCU driver
From: Paul Cercueil @ 2019-07-01 15:14 UTC (permalink / raw)
  To: Lee Jones, Jonathan Corbet, Ralf Baechle, Paul Burton,
	James Hogan, Michael Turquette, Stephen Boyd, Thomas Gleixner,
	Daniel Lezcano
  Cc: Mathieu Malaterre, od, devicetree, linux-kernel, linux-doc,
	linux-mips, linux-clk, Paul Cercueil, Artur Rojek
In-Reply-To: <20190701151410.23127-1-paul@crapouillou.net>

This driver will provide a regmap that can be retrieved very early in
the boot process through the API function ingenic_tcu_get_regmap().

Additionally, it will call devm_of_platform_populate() so that all the
children devices will be probed.

Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Tested-by: Mathieu Malaterre <malat@debian.org>
Tested-by: Artur Rojek <contact@artur-rojek.eu>
Reviewed-by: Paul Burton <paul.burton@mips.com>
---

Notes:
    v12: New patch
    
    v13: No change
    
    v14:
     - Use ERR_CAST() instead of ERR_PTR(PTR_ERR())
     - Remove ingenic_tcu_can_use_pwm(); it belongs in the PWM driver.

 drivers/mfd/Kconfig             |  8 +++
 drivers/mfd/Makefile            |  1 +
 drivers/mfd/ingenic-tcu.c       | 87 +++++++++++++++++++++++++++++++++
 include/linux/mfd/ingenic-tcu.h |  6 +++
 4 files changed, 102 insertions(+)
 create mode 100644 drivers/mfd/ingenic-tcu.c

diff --git a/drivers/mfd/Kconfig b/drivers/mfd/Kconfig
index a17d275bf1d4..f3ed22613cbc 100644
--- a/drivers/mfd/Kconfig
+++ b/drivers/mfd/Kconfig
@@ -495,6 +495,14 @@ config HTC_I2CPLD
 	  This device provides input and output GPIOs through an I2C
 	  interface to one or more sub-chips.
 
+config INGENIC_TCU
+	bool "Ingenic Timer/Counter Unit (TCU) support"
+	depends on MIPS || COMPILE_TEST
+	select REGMAP_MMIO
+	help
+	  Say yes here to support the Timer/Counter Unit (TCU) IP present
+	  in the JZ47xx SoCs from Ingenic.
+
 config MFD_INTEL_QUARK_I2C_GPIO
 	tristate "Intel Quark MFD I2C GPIO"
 	depends on PCI
diff --git a/drivers/mfd/Makefile b/drivers/mfd/Makefile
index 52b1a90ff515..fb89e131ae98 100644
--- a/drivers/mfd/Makefile
+++ b/drivers/mfd/Makefile
@@ -180,6 +180,7 @@ obj-$(CONFIG_AB8500_CORE)	+= ab8500-core.o ab8500-sysctrl.o
 obj-$(CONFIG_MFD_TIMBERDALE)    += timberdale.o
 obj-$(CONFIG_PMIC_ADP5520)	+= adp5520.o
 obj-$(CONFIG_MFD_KEMPLD)	+= kempld-core.o
+obj-$(CONFIG_INGENIC_TCU)	+= ingenic-tcu.o
 obj-$(CONFIG_MFD_INTEL_QUARK_I2C_GPIO)	+= intel_quark_i2c_gpio.o
 obj-$(CONFIG_LPC_SCH)		+= lpc_sch.o
 obj-$(CONFIG_LPC_ICH)		+= lpc_ich.o
diff --git a/drivers/mfd/ingenic-tcu.c b/drivers/mfd/ingenic-tcu.c
new file mode 100644
index 000000000000..2d12bebd5430
--- /dev/null
+++ b/drivers/mfd/ingenic-tcu.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * JZ47xx SoCs TCU MFD driver
+ * Copyright (C) 2019 Paul Cercueil <paul@crapouillou.net>
+ */
+
+#include <linux/mfd/ingenic-tcu.h>
+#include <linux/of_address.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/regmap.h>
+
+static struct regmap *tcu_regmap __initdata;
+
+static const struct regmap_config ingenic_tcu_regmap_config = {
+	.reg_bits = 32,
+	.val_bits = 32,
+	.reg_stride = 4,
+	.max_register = TCU_REG_OST_CNTHBUF,
+};
+
+static const struct of_device_id ingenic_tcu_of_match[] = {
+	{ .compatible = "ingenic,jz4740-tcu" },
+	{ .compatible = "ingenic,jz4725b-tcu" },
+	{ .compatible = "ingenic,jz4770-tcu" },
+	{ /* sentinel */ }
+};
+
+static struct regmap * __init ingenic_tcu_create_regmap(struct device_node *np)
+{
+	struct resource res;
+	void __iomem *base;
+	struct regmap *map;
+
+	if (!of_match_node(ingenic_tcu_of_match, np))
+		return ERR_PTR(-EINVAL);
+
+	base = of_io_request_and_map(np, 0, "TCU");
+	if (IS_ERR(base))
+		return ERR_CAST(base);
+
+	map = regmap_init_mmio(NULL, base, &ingenic_tcu_regmap_config);
+	if (IS_ERR(map))
+		goto err_iounmap;
+
+	return map;
+
+err_iounmap:
+	iounmap(base);
+	of_address_to_resource(np, 0, &res);
+	release_mem_region(res.start, resource_size(&res));
+
+	return map;
+}
+
+static int __init ingenic_tcu_probe(struct platform_device *pdev)
+{
+	struct regmap *map = ingenic_tcu_get_regmap(pdev->dev.of_node);
+
+	platform_set_drvdata(pdev, map);
+
+	regmap_attach_dev(&pdev->dev, map, &ingenic_tcu_regmap_config);
+
+	return devm_of_platform_populate(&pdev->dev);
+}
+
+static struct platform_driver ingenic_tcu_driver = {
+	.driver = {
+		.name = "ingenic-tcu",
+		.of_match_table = ingenic_tcu_of_match,
+	},
+};
+
+static int __init ingenic_tcu_platform_init(void)
+{
+	return platform_driver_probe(&ingenic_tcu_driver,
+				     ingenic_tcu_probe);
+}
+subsys_initcall(ingenic_tcu_platform_init);
+
+struct regmap * __init ingenic_tcu_get_regmap(struct device_node *np)
+{
+	if (!tcu_regmap)
+		tcu_regmap = ingenic_tcu_create_regmap(np);
+
+	return tcu_regmap;
+}
diff --git a/include/linux/mfd/ingenic-tcu.h b/include/linux/mfd/ingenic-tcu.h
index 2083fa20821d..045fb90c57fd 100644
--- a/include/linux/mfd/ingenic-tcu.h
+++ b/include/linux/mfd/ingenic-tcu.h
@@ -6,6 +6,10 @@
 #define __LINUX_MFD_INGENIC_TCU_H_
 
 #include <linux/bitops.h>
+#include <linux/init.h>
+
+struct device_node;
+struct regmap;
 
 #define TCU_REG_WDT_TDR		0x00
 #define TCU_REG_WDT_TCER	0x04
@@ -53,4 +57,6 @@
 #define TCU_REG_TCNTc(c)	(TCU_REG_TCNT0 + ((c) * TCU_CHANNEL_STRIDE))
 #define TCU_REG_TCSRc(c)	(TCU_REG_TCSR0 + ((c) * TCU_CHANNEL_STRIDE))
 
+struct regmap * __init ingenic_tcu_get_regmap(struct device_node *np);
+
 #endif /* __LINUX_MFD_INGENIC_TCU_H_ */
-- 
2.21.0.593.g511ec345e18


* [PATCH v14 01/13] dt-bindings: ingenic: Add DT bindings for TCU clocks
From: Paul Cercueil @ 2019-07-01 15:13 UTC (permalink / raw)
  To: Lee Jones, Jonathan Corbet, Ralf Baechle, Paul Burton,
	James Hogan, Michael Turquette, Stephen Boyd, Thomas Gleixner,
	Daniel Lezcano
  Cc: Mathieu Malaterre, od, devicetree, linux-kernel, linux-doc,
	linux-mips, linux-clk, Paul Cercueil, Artur Rojek, Rob Herring
In-Reply-To: <20190701151410.23127-1-paul@crapouillou.net>

This header provides clock numbers for the ingenic,tcu
DT binding.

Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Tested-by: Mathieu Malaterre <malat@debian.org>
Tested-by: Artur Rojek <contact@artur-rojek.eu>
Reviewed-by: Rob Herring <robh@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
---

Notes:
    v2: Use SPDX identifier for the license
    
    v3/v4: No change
    
    v5: s/JZ47*_/TCU_/ and dropped *_CLK_LAST defines
    
    v6-v14: No change

 include/dt-bindings/clock/ingenic,tcu.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
 create mode 100644 include/dt-bindings/clock/ingenic,tcu.h

diff --git a/include/dt-bindings/clock/ingenic,tcu.h b/include/dt-bindings/clock/ingenic,tcu.h
new file mode 100644
index 000000000000..d569650a7945
--- /dev/null
+++ b/include/dt-bindings/clock/ingenic,tcu.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * This header provides clock numbers for the ingenic,tcu DT binding.
+ */
+
+#ifndef __DT_BINDINGS_CLOCK_INGENIC_TCU_H__
+#define __DT_BINDINGS_CLOCK_INGENIC_TCU_H__
+
+#define TCU_CLK_TIMER0	0
+#define TCU_CLK_TIMER1	1
+#define TCU_CLK_TIMER2	2
+#define TCU_CLK_TIMER3	3
+#define TCU_CLK_TIMER4	4
+#define TCU_CLK_TIMER5	5
+#define TCU_CLK_TIMER6	6
+#define TCU_CLK_TIMER7	7
+#define TCU_CLK_WDT	8
+#define TCU_CLK_OST	9
+
+#endif /* __DT_BINDINGS_CLOCK_INGENIC_TCU_H__ */
-- 
2.21.0.593.g511ec345e18
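
For readers following along: the header above only assigns an index to each
TCU clock (eight timer channels, then the watchdog and the OST). The sketch
below is not part of the patch. It is a minimal userspace illustration of how
a consumer might validate and name those indices. The macro values are copied
from the header; the name strings and the helpers tcu_clk_name() and
tcu_clk_id() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values copied from include/dt-bindings/clock/ingenic,tcu.h in this patch. */
#define TCU_CLK_TIMER0	0
#define TCU_CLK_TIMER1	1
#define TCU_CLK_TIMER2	2
#define TCU_CLK_TIMER3	3
#define TCU_CLK_TIMER4	4
#define TCU_CLK_TIMER5	5
#define TCU_CLK_TIMER6	6
#define TCU_CLK_TIMER7	7
#define TCU_CLK_WDT	8
#define TCU_CLK_OST	9

/* Hypothetical names, for illustration only; the patch defines no names. */
static const char * const tcu_clk_names[] = {
	[TCU_CLK_TIMER0] = "timer0",
	[TCU_CLK_TIMER1] = "timer1",
	[TCU_CLK_TIMER2] = "timer2",
	[TCU_CLK_TIMER3] = "timer3",
	[TCU_CLK_TIMER4] = "timer4",
	[TCU_CLK_TIMER5] = "timer5",
	[TCU_CLK_TIMER6] = "timer6",
	[TCU_CLK_TIMER7] = "timer7",
	[TCU_CLK_WDT]    = "wdt",
	[TCU_CLK_OST]    = "ost",
};

#define TCU_CLK_COUNT (sizeof(tcu_clk_names) / sizeof(tcu_clk_names[0]))

/* The indices are dense, so the table must have exactly OST + 1 entries. */
static_assert(TCU_CLK_COUNT == TCU_CLK_OST + 1, "one name per clock index");

/* Hypothetical helper: map a clock index to its name, NULL if out of range. */
const char *tcu_clk_name(unsigned int id)
{
	return id < TCU_CLK_COUNT ? tcu_clk_names[id] : NULL;
}

/* Hypothetical helper: map a name back to its clock index, -1 if unknown. */
int tcu_clk_id(const char *name)
{
	for (size_t i = 0; i < TCU_CLK_COUNT; i++)
		if (strcmp(tcu_clk_names[i], name) == 0)
			return (int)i;
	return -1;
}
```

Because the indices are contiguous from 0 to TCU_CLK_OST, a plain array with
designated initializers is enough here, and the static_assert would catch a
table that drifts out of sync with the header.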

