Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v13 1/6] sched/core: uclamp: Extend CPU's cgroup controller
From: Michal Koutný @ 2019-08-06 16:11 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, linux-api, cgroups, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Rafael J . Wysocki, Vincent Guittot,
	Viresh Kumar, Paul Turner, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes,
	Steve Muckle, Suren Baghdasaryan, Alessio Balsini
In-Reply-To: <20190802090853.4810-2-patrick.bellasi@arm.com>

On Fri, Aug 02, 2019 at 10:08:48AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> +static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
> +				size_t nbytes, loff_t off,
> +				enum uclamp_id clamp_id)
> +{
> +	struct uclamp_request req;
> +	struct task_group *tg;
> +
> +	req = capacity_from_percent(buf);
> +	if (req.ret)
> +		return req.ret;
> +
> +	rcu_read_lock();
This should be the uclamp_mutex.

(The compound results of the series is correct as the lock is introduced
in "sched/core: uclamp: Propagate parent clamps".
This is just for the happiness of cherry-pickers/bisectors.)

> +static inline void cpu_uclamp_print(struct seq_file *sf,
> +				    enum uclamp_id clamp_id)
> +{
> [...]
> +	rcu_read_lock();
> +	tg = css_tg(seq_css(sf));
> +	util_clamp = tg->uclamp_req[clamp_id].value;
> +	rcu_read_unlock();
Why is the rcu_read_lock() needed here? (I'm considering the comment in
of_css() that should apply here (and it seems that similar uses in other
seq_file handlers also skip this).)

> @@ -7369,6 +7506,20 @@ static struct cftype cpu_legacy_files[] = {
> [...]
> +		.name = "uclamp.min",
> [...]
> +		.name = "uclamp.max",
I don't see technical reasons why uclamp couldn't work on legacy
hierarchy and Tejun acked the series, despite that I'll ask -- should
the new attributes be exposed in v1 controller hierarchy (taking into
account the legacy API is sort of frozen and potential maintenance needs
spanning both hierarchies)?

^ permalink raw reply

* Re: [PATCH v13 2/6] sched/core: uclamp: Propagate parent clamps
From: Michal Koutný @ 2019-08-06 16:11 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, linux-api, cgroups, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Rafael J . Wysocki, Vincent Guittot,
	Viresh Kumar, Paul Turner, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes,
	Steve Muckle, Suren Baghdasaryan, Alessio Balsini
In-Reply-To: <20190802090853.4810-3-patrick.bellasi@arm.com>

[-- Attachment #1: Type: text/plain, Size: 932 bytes --]

On Fri, Aug 02, 2019 at 10:08:49AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> @@ -7095,6 +7149,7 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
>  	if (req.ret)
>  		return req.ret;
>  
> +	mutex_lock(&uclamp_mutex);
>  	rcu_read_lock();
>  
>  	tg = css_tg(of_css(of));
> @@ -7107,7 +7162,11 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
>  	 */
>  	tg->uclamp_pct[clamp_id] = req.percent;
>  
> +	/* Update effective clamps to track the most restrictive value */
> +	cpu_util_update_eff(of_css(of));
> +
>  	rcu_read_unlock();
> +	mutex_unlock(&uclamp_mutex);
Following my remarks to "[PATCH v13 1/6] sched/core: uclamp: Extend
CPU's cgroup", I wonder if the rcu_read_lock() couldn't be moved right
before cpu_util_update_eff(). And by extension rcu_read_(un)lock could
be hidden into cpu_util_update_eff() closer to its actual need.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v13 0/6] Add utilization clamping support (CGroups API)
From: Michal Koutný @ 2019-08-06 16:12 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, linux-api, cgroups, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Rafael J . Wysocki, Vincent Guittot,
	Viresh Kumar, Paul Turner, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes,
	Steve Muckle, Suren Baghdasaryan, Alessio Balsini
In-Reply-To: <20190802090853.4810-1-patrick.bellasi@arm.com>

[-- Attachment #1: Type: text/plain, Size: 750 bytes --]

On Fri, Aug 02, 2019 at 10:08:47AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> Patrick Bellasi (6):
>   sched/core: uclamp: Extend CPU's cgroup controller
>   sched/core: uclamp: Propagate parent clamps
>   sched/core: uclamp: Propagate system defaults to root group
>   sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
>   sched/core: uclamp: Update CPU's refcount on TG's clamp changes
>   sched/core: uclamp: always use enum uclamp_id for clamp_id values
Thank you Patrick for your patience. I used the time to revisit the
series once again and I think the RCU locks can be streamlined a bit. If
you find that correct, feel free to add my Reviewed-by to the updated
series (for 1/6 and legacy, I'm just asking).

Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Mickaël Salaün @ 2019-08-06 16:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mickaël Salaün, LKML, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk, Paul Moore
In-Reply-To: <20190801173534.etfls5ltixp5hfrh@ast-mbp.dhcp.thefacebook.com>


On 01/08/2019 19:35, Alexei Starovoitov wrote:
> On Wed, Jul 31, 2019 at 09:11:10PM +0200, Mickaël Salaün wrote:
>>
>>
>> On 31/07/2019 20:58, Alexei Starovoitov wrote:
>>> On Wed, Jul 31, 2019 at 11:46 AM Mickaël Salaün
>>> <mickael.salaun@ssi.gouv.fr> wrote:
>>>>>> +    for (i = 0; i < htab->n_buckets; i++) {
>>>>>> +            head = select_bucket(htab, i);
>>>>>> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>>>>>> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
>>>>>> +            }
>>>>>> +    }
>>>>>> +    htab_map_free(map);
>>>>>> +}
>>>>>
>>>>> user space can delete the map.
>>>>> that will trigger inode_htab_map_free() which will call
>>>>> landlock_inode_remove_map().
>>>>> which will simply itereate the list and delete from the list.
>>>>
>>>> landlock_inode_remove_map() removes the reference to the map (being
>>>> freed) from the inode (with an RCU lock).
>>>
>>> I'm going to ignore everything else for now and focus only on this bit,
>>> since it's fundamental issue to address before this discussion can
>>> go any further.
>>> rcu_lock is not a spin_lock. I'm pretty sure you know this.
>>> But you're arguing that it's somehow protecting from the race
>>> I mentioned above?
>>>
>>
>> I was just clarifying your comment to avoid misunderstanding about what
>> is being removed.
>>
>> As said in the full response, there is currently a race but, if I add a
>> bpf_map_inc() call when the map is referenced by inode->security, then I
>> don't see how a race could occur because such added map could only be
>> freed in a security_inode_free() (as long as it retains a reference to
>> this inode).
> 
> then it will be a cycle and a map will never be deleted?
> closing map_fd should delete a map. It cannot be alive if it's not
> pinned in bpffs, there are no FDs that are holding it, and no progs using it.
> So the map deletion will iterate over inodes that belong to this map.
> In parallel security_inode_free() will be called that will iterate
> over its link list that contains elements from different maps.
> So the same link list is modified by two cpus.
> Where is a lock that protects from concurrent links list manipulations?

Ok, I think I got it. What about this fix?

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4fc7755042f0..3226e50b6211 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1708,10 +1708,16 @@ static void inode_htab_map_free(struct bpf_map *map)

 	for (i = 0; i < htab->n_buckets; i++) {
 		head = select_bucket(htab, i);
-		hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
+		rcu_read_lock();
+		hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) {
 			landlock_inode_remove_map(*((struct inode **)l->key), map);
 		}
+		rcu_read_unlock();
 	}
+	/*
+	 * The last pending put_landlock_inode_map() may be called here, before
+	 * the rcu_barrier() from htab_map_free().
+	 */
 	htab_map_free(map);
 }

diff --git a/security/landlock/common.h b/security/landlock/common.h
index b0ba3f31ac7d..535c6a4292b9 100644
--- a/security/landlock/common.h
+++ b/security/landlock/common.h
@@ -58,6 +58,11 @@ struct landlock_prog_set {
 	refcount_t usage;
 };

+struct landlock_inode_security {
+	struct list_head list;
+	spinlock_t lock;
+};
+
 struct landlock_inode_map {
 	struct list_head list;
 	struct rcu_head rcu_put;
diff --git a/security/landlock/hooks_fs.c b/security/landlock/hooks_fs.c
index 8c9d6a333111..b9bfd558f8b8 100644
--- a/security/landlock/hooks_fs.c
+++ b/security/landlock/hooks_fs.c
@@ -10,6 +10,7 @@
 #include <linux/kernel.h> /* ARRAY_SIZE */
 #include <linux/lsm_hooks.h>
 #include <linux/rcupdate.h> /* synchronize_rcu() */
+#include <linux/spinlock.h>
 #include <linux/stat.h> /* S_ISDIR */
 #include <linux/stddef.h> /* offsetof */
 #include <linux/types.h> /* uintptr_t */
@@ -251,13 +252,16 @@ static int hook_sb_pivotroot(const struct path *old_path,

 /* inode helpers */

-static inline struct list_head *inode_landlock(const struct inode *inode)
+static inline struct landlock_inode_security *inode_landlock(
+		const struct inode *inode)
 {
 	return inode->i_security + landlock_blob_sizes.lbs_inode;
 }

 int landlock_inode_add_map(struct inode *inode, struct bpf_map *map)
 {
+	unsigned long flags;
+	struct landlock_inode_security *inode_sec = inode_landlock(inode);
 	struct landlock_inode_map *inode_map;

 	inode_map = kzalloc(sizeof(*inode_map), GFP_ATOMIC);
@@ -266,60 +270,66 @@ int landlock_inode_add_map(struct inode *inode, struct bpf_map *map)
 	INIT_LIST_HEAD(&inode_map->list);
 	inode_map->map = map;
 	inode_map->inode = inode;
-	list_add_tail(&inode_map->list, inode_landlock(inode));
+	spin_lock_irqsave(&inode_sec->lock, flags);
+	list_add_tail_rcu(&inode_map->list, &inode_sec->list);
+	spin_unlock_irqrestore(&inode_sec->lock, flags);
 	return 0;
 }

 static void put_landlock_inode_map(struct rcu_head *head)
 {
 	struct landlock_inode_map *inode_map;
-	int err;

 	inode_map = container_of(head, struct landlock_inode_map, rcu_put);
-	err = bpf_inode_ptr_unlocked_htab_map_delete_elem(inode_map->map,
+	bpf_inode_ptr_unlocked_htab_map_delete_elem(inode_map->map,
 			&inode_map->inode, false);
-	bpf_map_put(inode_map->map);
 	kfree(inode_map);
 }

 void landlock_inode_remove_map(struct inode *inode, const struct bpf_map *map)
 {
+	unsigned long flags;
+	struct landlock_inode_security *inode_sec = inode_landlock(inode);
 	struct landlock_inode_map *inode_map;
-	bool found = false;

+	spin_lock_irqsave(&inode_sec->lock, flags);
 	rcu_read_lock();
-	list_for_each_entry_rcu(inode_map, inode_landlock(inode), list) {
+	list_for_each_entry_rcu(inode_map, &inode_sec->list, list) {
 		if (inode_map->map == map) {
-			found = true;
 			list_del_rcu(&inode_map->list);
 			kfree_rcu(inode_map, rcu_put);
 			break;
 		}
 	}
 	rcu_read_unlock();
-	WARN_ON(!found);
+	spin_unlock_irqrestore(&inode_sec->lock, flags);
 }

 /* inode hooks */

 static int hook_inode_alloc_security(struct inode *inode)
 {
-	struct list_head *ll_inode = inode_landlock(inode);
+	struct landlock_inode_security *inode_sec = inode_landlock(inode);

-	INIT_LIST_HEAD(ll_inode);
+	INIT_LIST_HEAD(&inode_sec->list);
+	spin_lock_init(&inode_sec->lock);
 	return 0;
 }

 static void hook_inode_free_security(struct inode *inode)
 {
+	unsigned long flags;
+	struct landlock_inode_security *inode_sec = inode_landlock(inode);
 	struct landlock_inode_map *inode_map;

+	spin_lock_irqsave(&inode_sec->lock, flags);
 	rcu_read_lock();
-	list_for_each_entry_rcu(inode_map, inode_landlock(inode), list) {
+	list_for_each_entry_rcu(inode_map, &inode_sec->list, list) {
 		list_del_rcu(&inode_map->list);
 		call_rcu(&inode_map->rcu_put, put_landlock_inode_map);
 	}
 	rcu_read_unlock();
+	spin_unlock_irqrestore(&inode_sec->lock, flags);
 }

 /* a directory inode contains only one dentry */
diff --git a/security/landlock/init.c b/security/landlock/init.c
index 35165fc8a595..1305255f5d2e 100644
--- a/security/landlock/init.c
+++ b/security/landlock/init.c
@@ -137,7 +137,7 @@ static int __init landlock_init(void)
 }

 struct lsm_blob_sizes landlock_blob_sizes __lsm_ro_after_init = {
-	.lbs_inode = sizeof(struct list_head),
+	.lbs_inode = sizeof(struct landlock_inode_security),
 };

 DEFINE_LSM(LANDLOCK_NAME) = {


> 
>> Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your d
 ata, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.
> 
> Please get rid of this. It's absolutely not appropriate on public mailing list.
> Next time I'd have to ignore emails that contain such disclaimers.

Unfortunately this message is automatically appended (server-side) to all my
professional emails...

^ permalink raw reply related

* Re: [RFC PATCH v1 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-08-06 16:40 UTC (permalink / raw)
  To: Andy Lutomirski, Jan Kara, Song Liu, Alexei Starovoitov,
	Daniel Borkmann
  Cc: LKML, Al Viro, James Morris, Jonathan Corbet, Kees Cook,
	Matthew Garrett, Michael Kerrisk, Mickaël Salaün,
	Mimi Zohar, Philippe Trébuchet, Shuah Khan,
	Thibaut Sautereau, Vincent Strubel, Yves-Alexis Perez,
	Kernel Hardening, Linux API, LSM List, Linux FS Devel
In-Reply-To: <CALCETrVeZ0eufFXwfhtaG_j+AdvbzEWE0M3wjXMWVEO7pj+xkw@mail.gmail.com>


On 05/08/2019 01:55, Andy Lutomirski wrote:
> On Wed, Dec 12, 2018 at 6:43 AM Jan Kara <jack@suse.cz> wrote:
>>
>> On Wed 12-12-18 09:17:08, Mickaël Salaün wrote:
>>> When the O_MAYEXEC flag is passed, sys_open() may be subject to
>>> additional restrictions depending on a security policy implemented by an
>>> LSM through the inode_permission hook.
>>>
>>> The underlying idea is to be able to restrict scripts interpretation
>>> according to a policy defined by the system administrator.  For this to
>>> be possible, script interpreters must use the O_MAYEXEC flag
>>> appropriately.  To be fully effective, these interpreters also need to
>>> handle the other ways to execute code (for which the kernel can't help):
>>> command line parameters (e.g., option -e for Perl), module loading
>>> (e.g., option -m for Python), stdin, file sourcing, environment
>>> variables, configuration files...  According to the threat model, it may
>>> be acceptable to allow some script interpreters (e.g. Bash) to interpret
>>> commands from stdin, may it be a TTY or a pipe, because it may not be
>>> enough to (directly) perform syscalls.
>>>
>>> A simple security policy implementation is available in a following
>>> patch for Yama.
>>>
>>> This is an updated subset of the patch initially written by Vincent
>>> Strubel for CLIP OS:
>>> https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
>>> This patch has been used for more than 10 years with customized script
>>> interpreters.  Some examples can be found here:
>>> https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC
>>>
>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>> Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
>>> Signed-off-by: Vincent Strubel <vincent.strubel@ssi.gouv.fr>
>>> Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
>>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>>> Cc: Kees Cook <keescook@chromium.org>
>>> Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
>>
>> ...
>>
>>> diff --git a/fs/open.c b/fs/open.c
>>> index 0285ce7dbd51..75479b79a58f 100644
>>> --- a/fs/open.c
>>> +++ b/fs/open.c
>>> @@ -974,6 +974,10 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
>>>       if (flags & O_APPEND)
>>>               acc_mode |= MAY_APPEND;
>>>
>>> +     /* Check execution permissions on open. */
>>> +     if (flags & O_MAYEXEC)
>>> +             acc_mode |= MAY_OPENEXEC;
>>> +
>>>       op->acc_mode = acc_mode;
>>>
>>>       op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
>>
>> I don't feel experienced enough in security to tell whether we want this
>> functionality or not. But if we do this, shouldn't we also set FMODE_EXEC
>> on the resulting struct file? That way also security_file_open() can be
>> used to arbitrate such executable opens and in particular
>> fanotify permission event FAN_OPEN_EXEC will get properly generated which I
>> guess is desirable (support for it is sitting in my tree waiting for the
>> merge window) - adding some audit people involved in FAN_OPEN_EXEC to
>> CC. Just an idea...
>>
> 
> I would really like to land this patch.  I'm fiddling with making
> bpffs handle permissions intelligently, and the lack of a way to say
> "hey, I want to open this bpf program so that I can run it" is
> annoying.

Are you OK with this series? What about Aleksa's work on openat2(), and
Sean's work on SGX/noexec? Is it time to send a new patch series (with a
dedicated LSM instead of Yama)?

^ permalink raw reply

* Re: [PATCH v13 0/6] Add utilization clamping support (CGroups API)
From: Patrick Bellasi @ 2019-08-06 16:40 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-kernel, linux-pm, linux-api, cgroups, Ingo Molnar,
	Peter Zijlstra, Tejun Heo, Rafael J . Wysocki, Vincent Guittot,
	Viresh Kumar, Paul Turner, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes,
	Steve Muckle, Suren Baghdasaryan, Alessio Balsini
In-Reply-To: <20190806161206.GA20526@blackbody.suse.cz>


On Tue, Aug 06, 2019 at 17:12:06 +0100, Michal Koutný wrote...

> On Fri, Aug 02, 2019 at 10:08:47AM +0100, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>> Patrick Bellasi (6):
>>   sched/core: uclamp: Extend CPU's cgroup controller
>>   sched/core: uclamp: Propagate parent clamps
>>   sched/core: uclamp: Propagate system defaults to root group
>>   sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
>>   sched/core: uclamp: Update CPU's refcount on TG's clamp changes
>>   sched/core: uclamp: always use enum uclamp_id for clamp_id values

Hi Michal!

> Thank you Patrick for your patience.

Thanks to you for your reviews.

> I used the time to revisit the series once again and I think the RCU
> locks can be streamlined a bit.

I'll have a look at those, thanks!

> If you find that correct, feel free to add my Reviewed-by to the
> updated series (for 1/6 and legacy, I'm just asking).

Sure, actually sorry for not having already added that tag in the
current version, it will be there in v14 ;)

> Michal

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply

* Re: [PATCH v8 12/20] fscrypt: add an HKDF-SHA512 implementation
From: Paul Crowley via Linux-f2fs-devel @ 2019-08-06 20:43 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, Theodore Ts'o, linux-api, linux-f2fs-devel,
	linux-fscrypt, keyrings, linux-mtd, linux-crypto, linux-fsdevel,
	Jaegeuk Kim, linux-ext4
In-Reply-To: <20190805162521.90882-13-ebiggers@kernel.org>

On Mon, 5 Aug 2019 at 09:28, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Add an implementation of HKDF (RFC 5869) to fscrypt, for the purpose of
> deriving additional key material from the fscrypt master keys for v2
> encryption policies.  HKDF is a key derivation function built on top of
> HMAC.  We choose SHA-512 for the underlying unkeyed hash, and use an
> "hmac(sha512)" transform allocated from the crypto API.
>
> We'll be using this to replace the AES-ECB based KDF currently used to
> derive the per-file encryption keys.  While the AES-ECB based KDF is
> believed to meet the original security requirements, it is nonstandard
> and has problems that don't exist in modern KDFs such as HKDF:
>
> 1. It's reversible.  Given a derived key and nonce, an attacker can
>    easily compute the master key.  This is okay if the master key and
>    derived keys are equally hard to compromise, but now we'd like to be
>    more robust against threats such as a derived key being compromised
>    through a timing attack, or a derived key for an in-use file being
>    compromised after the master key has already been removed.
>
> 2. It doesn't evenly distribute the entropy from the master key; each 16
>    input bytes only affects the corresponding 16 output bytes.
>
> 3. It isn't easily extensible to deriving other values or keys, such as
>    a public hash for securely identifying the key, or per-mode keys.
>    Per-mode keys will be immediately useful for Adiantum encryption, for
>    which fscrypt currently uses the master key directly, introducing
>    unnecessary usage constraints.  Per-mode keys will also be useful for
>    hardware inline encryption, which is currently being worked on.
>
> HKDF solves all the above problems.
>
> Reviewed-by: Theodore Ts'o <tytso@mit.edu>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good, feel free to add:

Reviewed-by: Paul Crowley <paulcrowley@google.com>

^ permalink raw reply

* Re: [PATCH v8 13/20] fscrypt: v2 encryption policy support
From: Paul Crowley via Linux-f2fs-devel @ 2019-08-06 20:44 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Satya Tangirala, Theodore Ts'o, linux-api, linux-f2fs-devel,
	linux-fscrypt, keyrings, linux-mtd, linux-crypto, linux-fsdevel,
	Jaegeuk Kim, linux-ext4
In-Reply-To: <20190805162521.90882-14-ebiggers@kernel.org>

On Mon, 5 Aug 2019 at 09:28, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Add a new fscrypt policy version, "v2".  It has the following changes
> from the original policy version, which we call "v1" (*):
>
> - Master keys (the user-provided encryption keys) are only ever used as
>   input to HKDF-SHA512.  This is more flexible and less error-prone, and
>   it avoids the quirks and limitations of the AES-128-ECB based KDF.
>   Three classes of cryptographically isolated subkeys are defined:
>
>     - Per-file keys, like used in v1 policies except for the new KDF.
>
>     - Per-mode keys.  These implement the semantics of the DIRECT_KEY
>       flag, which for v1 policies made the master key be used directly.
>       These are also planned to be used for inline encryption when
>       support for it is added.
>
>     - Key identifiers (see below).
>
> - Each master key is identified by a 16-byte master_key_identifier,
>   which is derived from the key itself using HKDF-SHA512.  This prevents
>   users from associating the wrong key with an encrypted file or
>   directory.  This was easily possible with v1 policies, which
>   identified the key by an arbitrary 8-byte master_key_descriptor.
>
> - The key must be provided in the filesystem-level keyring, not in a
>   process-subscribed keyring.
>
> The following UAPI additions are made:
>
> - The existing ioctl FS_IOC_SET_ENCRYPTION_POLICY can now be passed a
>   fscrypt_policy_v2 to set a v2 encryption policy.  It's disambiguated
>   from fscrypt_policy/fscrypt_policy_v1 by the version code prefix.
>
> - A new ioctl FS_IOC_GET_ENCRYPTION_POLICY_EX is added.  It allows
>   getting the v1 or v2 encryption policy of an encrypted file or
>   directory.  The existing FS_IOC_GET_ENCRYPTION_POLICY ioctl could not
>   be used because it did not have a way for userspace to indicate which
>   policy structure is expected.  The new ioctl includes a size field, so
>   it is extensible to future fscrypt policy versions.
>
> - The ioctls FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY,
>   and FS_IOC_GET_ENCRYPTION_KEY_STATUS now support managing keys for v2
>   encryption policies.  Such keys are kept logically separate from keys
>   for v1 encryption policies, and are identified by 'identifier' rather
>   than by 'descriptor'.  The 'identifier' need not be provided when
>   adding a key, since the kernel will calculate it anyway.
>
> This patch temporarily keeps adding/removing v2 policy keys behind the
> same permission check done for adding/removing v1 policy keys:
> capable(CAP_SYS_ADMIN).  However, the next patch will carefully take
> advantage of the cryptographically secure master_key_identifier to allow
> non-root users to add/remove v2 policy keys, thus providing a full
> replacement for v1 policies.
>
> (*) Actually, in the API fscrypt_policy::version is 0 while on-disk
>     fscrypt_context::format is 1.  But I believe it makes the most sense
>     to advance both to '2' to have them be in sync, and to consider the
>     numbering to start at 1 except for the API quirk.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks good, feel free to add:

Reviewed-by: Paul Crowley <paulcrowley@google.com>

^ permalink raw reply

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Andrew Morton @ 2019-08-06 22:19 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Dobriyan, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer, H. Peter Anvin,
	Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook, kernel-team,
	linux-api, linux-doc, linux-fsdevel, linux-mm, Michal Hocko,
	Mike Rapoport, minchan, namhyung, paulmck, Robin Murphy,
	Roman Gushchin, Stephen Rothwell <sf>
In-Reply-To: <20190805170451.26009-1-joel@joelfernandes.org>

(cc Brendan's other email address, hoping for review input ;))

On Mon,  5 Aug 2019 13:04:47 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:

> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> 
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
> 
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.

Quite a lot of changes to the page_idle code.  Has this all been
runtime tested on architectures where
CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE=n?  That could be x86 with a little
Kconfig fiddle-for-testing-purposes.

> 8 files changed, 376 insertions(+), 45 deletions(-)

Quite a lot of new code unconditionally added to major architectures. 
Are we confident that everyone will want this feature?

>
> ...
>
> +static int proc_page_idle_open(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm;
> +
> +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> +	if (IS_ERR(mm))
> +		return PTR_ERR(mm);
> +	file->private_data = mm;
> +	return 0;
> +}
> +
> +static int proc_page_idle_release(struct inode *inode, struct file *file)
> +{
> +	struct mm_struct *mm = file->private_data;
> +
> +	if (mm)

I suspect the test isn't needed?  proc_page_idle_release) won't be
called if proc_page_idle_open() failed?

> +		mmdrop(mm);
> +	return 0;
> +}
>
> ...
>

^ permalink raw reply

* [PATCHv6 01/37] ns: Introduce Time Namespace
From: Dmitry Safonov @ 2019-08-07  0:24 UTC (permalink / raw)
  To: dima
  Cc: 0x7f454c46, adrian, arnd, avagin, christian.brauner, containers,
	criu, ebiederm, gorcunov, hpa, jannh, jdike, linux-api,
	linux-kernel, luto, mingo, oleg, shuah, tglx, vincenzo.frascino,
	x86, xemul
In-Reply-To: <20190729215758.28405-2-dima@arista.com>

From: Andrei Vagin <avagin@openvz.org>

Time Namespace isolates clock values.

The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.

CLOCK_REALTIME
      System-wide clock that measures real (i.e., wall-clock) time.

CLOCK_MONOTONIC
      Clock that cannot be set and represents monotonic time since
      some unspecified starting point.

CLOCK_BOOTTIME
      Identical to CLOCK_MONOTONIC, except it also includes any time
      that the system is suspended.

For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME).

But in a context of the checkpoint/restore functionality, monotonic and
bootime clocks become interesting. Both clocks are monotonic with
unspecified staring points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, we have to
guarantee that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that we need to be able to set CLOCK_MONOTONIC
and CLOCK_BOOTTIME clocks, what can be done by adding per-namespace
offsets for clocks.

A time namespace is similar to a pid namespace in a way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of
the process will be born in the new time namespace, or a process can
use the setns() system call to join a namespace.

This scheme allows setting clock offsets for a namespace, before any
processes appear in it.

All available clone flags have been used, so CLONE_NEWTIME uses the
highest bit of CSIGNAL. It means that we can use it with the unshare
system call only. Rith now, this works for us, because time namespace
offsets can be set only when a new time namespace is not populated. In a
future, we will have the clone3 system call [1] which will allow to use
the CSIGNAL mask for clone flags.

[1]: httmps://lkml.kernel.org/r/20190604160944.4058-1-christian@brauner.io

Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
v5..v6 Change:
  Used current_is_single_threaded() instead of thread_group_empty().
  Changed errno code to something more grepabble (EUSERS).

 MAINTAINERS                    |   2 +
 fs/proc/namespaces.c           |   4 +
 include/linux/nsproxy.h        |   2 +
 include/linux/proc_ns.h        |   3 +
 include/linux/time_namespace.h |  69 +++++++++++
 include/linux/user_namespace.h |   1 +
 include/uapi/linux/sched.h     |   6 +
 init/Kconfig                   |   7 ++
 kernel/Makefile                |   1 +
 kernel/fork.c                  |  16 ++-
 kernel/nsproxy.c               |  41 +++++--
 kernel/time_namespace.c        | 217 +++++++++++++++++++++++++++++++++
 12 files changed, 359 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/time_namespace.h
 create mode 100644 kernel/time_namespace.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a2c343ee3b2c..8a44c833f264 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12841,6 +12841,8 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
 S:	Maintained
 F:	fs/timerfd.c
 F:	include/linux/timer*
+F:	include/linux/time_namespace.h
+F:	kernel/time_namespace.c
 F:	kernel/time/*timer*
 
 POWER MANAGEMENT CORE
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..8b5c720fe5d7 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -33,6 +33,10 @@ static const struct proc_ns_operations *ns_entries[] = {
 #ifdef CONFIG_CGROUPS
 	&cgroupns_operations,
 #endif
+#ifdef CONFIG_TIME_NS
+	&timens_operations,
+	&timens_for_children_operations,
+#endif
 };
 
 static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 2ae1b1a4d84d..074f395b9ad2 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -35,6 +35,8 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct time_namespace *time_ns;
+	struct time_namespace *time_ns_for_children;
 	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..d312e6281e69 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -32,6 +32,8 @@ extern const struct proc_ns_operations pidns_for_children_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
 extern const struct proc_ns_operations cgroupns_operations;
+extern const struct proc_ns_operations timens_operations;
+extern const struct proc_ns_operations timens_for_children_operations;
 
 /*
  * We always define these enumerators
@@ -43,6 +45,7 @@ enum {
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
 	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
+	PROC_TIME_INIT_INO	= 0xEFFFFFFAU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
new file mode 100644
index 000000000000..9507ed7072fe
--- /dev/null
+++ b/include/linux/time_namespace.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TIMENS_H
+#define _LINUX_TIMENS_H
+
+
+#include <linux/sched.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/ns_common.h>
+#include <linux/err.h>
+
+struct user_namespace;
+extern struct user_namespace init_user_ns;
+
+struct time_namespace {
+	struct kref kref;
+	struct user_namespace *user_ns;
+	struct ucounts *ucounts;
+	struct ns_common ns;
+	struct timens_offsets *offsets;
+	bool   initialized;
+} __randomize_layout;
+extern struct time_namespace init_time_ns;
+
+#ifdef CONFIG_TIME_NS
+static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
+{
+	kref_get(&ns->kref);
+	return ns;
+}
+
+extern struct time_namespace *copy_time_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct time_namespace *old_ns);
+extern void free_time_ns(struct kref *kref);
+extern int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk);
+
+static inline void put_time_ns(struct time_namespace *ns)
+{
+	kref_put(&ns->kref, free_time_ns);
+}
+
+
+#else
+static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
+{
+	return NULL;
+}
+
+static inline void put_time_ns(struct time_namespace *ns)
+{
+}
+
+static inline struct time_namespace *copy_time_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct time_namespace *old_ns)
+{
+	if (flags & CLONE_NEWTIME)
+		return ERR_PTR(-EINVAL);
+
+	return old_ns;
+}
+
+static inline int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
+{
+	return 0;
+}
+
+#endif
+
+#endif /* _LINUX_TIMENS_H */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index fb9f4f799554..6ef1c7109fc4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -45,6 +45,7 @@ enum ucount_type {
 	UCOUNT_NET_NAMESPACES,
 	UCOUNT_MNT_NAMESPACES,
 	UCOUNT_CGROUP_NAMESPACES,
+	UCOUNT_TIME_NAMESPACES,
 #ifdef CONFIG_INOTIFY_USER
 	UCOUNT_INOTIFY_INSTANCES,
 	UCOUNT_INOTIFY_WATCHES,
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..551739227e96 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -33,6 +33,12 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+/*
+ * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
+ * syscalls only:
+ */
+#define CLONE_NEWTIME	0x00000080	/* New time namespace */
+
 /*
  * Arguments for the clone3 syscall
  */
diff --git a/init/Kconfig b/init/Kconfig
index bd7d650d4a99..a7cbc9b470c7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1069,6 +1069,13 @@ config UTS_NS
 	  In this namespace tasks see different info provided with the
 	  uname() system call
 
+config TIME_NS
+	bool "TIME namespace"
+	default y
+	help
+	  In this namespace boottime and monotonic clocks can be set.
+	  The time will keep going with the same pace.
+
 config IPC_NS
 	bool "IPC namespace"
 	depends on (SYSVIPC || POSIX_MQUEUE)
diff --git a/kernel/Makefile b/kernel/Makefile
index ef0d95a190b4..0f2d5affc377 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -68,6 +68,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup/
 obj-$(CONFIG_UTS_NS) += utsname.o
+obj-$(CONFIG_TIME_NS) += time_namespace.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
 obj-$(CONFIG_PID_NS) += pid_namespace.o
 obj-$(CONFIG_IKCONFIG) += configs.o
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..3dbb2d33dc52 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1777,6 +1777,7 @@ static __latent_entropy struct task_struct *copy_process(
 	struct multiprocess_signals delayed;
 	struct file *pidfile = NULL;
 	u64 clone_flags = args->flags;
+	struct nsproxy *nsp = current->nsproxy;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1819,8 +1820,16 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	if (clone_flags & CLONE_THREAD) {
 		if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
-		    (task_active_pid_ns(current) !=
-				current->nsproxy->pid_ns_for_children))
+		    (task_active_pid_ns(current) != nsp->pid_ns_for_children))
+			return ERR_PTR(-EINVAL);
+	}
+
+	/*
+	 * If the new process will be in a different time namespace
+	 * do not allow it to share VM or a thread group with the forking task.
+	 */
+	if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
+		if (nsp->time_ns != nsp->time_ns_for_children)
 			return ERR_PTR(-EINVAL);
 	}
 
@@ -2707,7 +2716,8 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP|
+				CLONE_NEWTIME))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index c815f58e6bc0..ed9882108cd2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -18,6 +18,7 @@
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/time_namespace.h>
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
@@ -40,6 +41,10 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_CGROUPS
 	.cgroup_ns		= &init_cgroup_ns,
 #endif
+#ifdef CONFIG_TIME_NS
+	.time_ns		= &init_time_ns,
+	.time_ns_for_children	= &init_time_ns,
+#endif
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -106,8 +111,18 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_net;
 	}
 
+	new_nsp->time_ns_for_children = copy_time_ns(flags, user_ns,
+					tsk->nsproxy->time_ns_for_children);
+	if (IS_ERR(new_nsp->time_ns_for_children)) {
+		err = PTR_ERR(new_nsp->time_ns_for_children);
+		goto out_time;
+	}
+	new_nsp->time_ns = get_time_ns(tsk->nsproxy->time_ns);
+
 	return new_nsp;
 
+out_time:
+	put_net(new_nsp->net_ns);
 out_net:
 	put_cgroup_ns(new_nsp->cgroup_ns);
 out_cgroup:
@@ -136,15 +151,16 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *old_ns = tsk->nsproxy;
 	struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
 	struct nsproxy *new_ns;
+	int ret;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
 			      CLONE_NEWPID | CLONE_NEWNET |
-			      CLONE_NEWCGROUP)))) {
-		get_nsproxy(old_ns);
-		return 0;
-	}
-
-	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+			      CLONE_NEWCGROUP | CLONE_NEWTIME)))) {
+		if (likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
+			get_nsproxy(old_ns);
+			return 0;
+		}
+	} else if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/*
@@ -162,6 +178,12 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	if (IS_ERR(new_ns))
 		return  PTR_ERR(new_ns);
 
+	ret = timens_on_fork(new_ns, tsk);
+	if (ret) {
+		free_nsproxy(new_ns);
+		return ret;
+	}
+
 	tsk->nsproxy = new_ns;
 	return 0;
 }
@@ -176,6 +198,10 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->time_ns)
+		put_time_ns(ns->time_ns);
+	if (ns->time_ns_for_children)
+		put_time_ns(ns->time_ns_for_children);
 	put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
@@ -192,7 +218,8 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP |
+			       CLONE_NEWTIME)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
new file mode 100644
index 000000000000..8fd8384b7261
--- /dev/null
+++ b/kernel/time_namespace.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Andrei Vagin <avagin@openvz.org>
+ * Author: Dmitry Safonov <dima@arista.com>
+ */
+
+#include <linux/time_namespace.h>
+#include <linux/user_namespace.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/proc_ns.h>
+#include <linux/export.h>
+#include <linux/time.h>
+#include <linux/slab.h>
+#include <linux/cred.h>
+#include <linux/err.h>
+
+static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
+{
+	return inc_ucount(ns, current_euid(), UCOUNT_TIME_NAMESPACES);
+}
+
+static void dec_time_namespaces(struct ucounts *ucounts)
+{
+	dec_ucount(ucounts, UCOUNT_TIME_NAMESPACES);
+}
+
+static struct time_namespace *create_time_ns(void)
+{
+	struct time_namespace *time_ns;
+
+	time_ns = kmalloc(sizeof(struct time_namespace), GFP_KERNEL);
+	if (time_ns) {
+		kref_init(&time_ns->kref);
+		time_ns->initialized = false;
+	}
+	return time_ns;
+}
+
+/*
+ * Clone a new ns copying @old_ns, setting refcount to 1
+ * @old_ns: namespace to clone
+ * Return the new ns or ERR_PTR.
+ */
+static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
+					  struct time_namespace *old_ns)
+{
+	struct time_namespace *ns;
+	struct ucounts *ucounts;
+	int err;
+
+	err = -ENOSPC;
+	ucounts = inc_time_namespaces(user_ns);
+	if (!ucounts)
+		goto fail;
+
+	err = -ENOMEM;
+	ns = create_time_ns();
+	if (!ns)
+		goto fail_dec;
+
+	err = ns_alloc_inum(&ns->ns);
+	if (err)
+		goto fail_free;
+
+	ns->ucounts = ucounts;
+	ns->ns.ops = &timens_operations;
+	ns->user_ns = get_user_ns(user_ns);
+	return ns;
+
+fail_free:
+	kfree(ns);
+fail_dec:
+	dec_time_namespaces(ucounts);
+fail:
+	return ERR_PTR(err);
+}
+
+/*
+ * Add a reference to old_ns, or clone it if @flags specify CLONE_NEWTIME.
+ * In latter case, changes to the time of this process won't be seen by parent,
+ * and vice versa.
+ */
+struct time_namespace *copy_time_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct time_namespace *old_ns)
+{
+	if (!(flags & CLONE_NEWTIME))
+		return get_time_ns(old_ns);
+
+	return clone_time_ns(user_ns, old_ns);
+}
+
+void free_time_ns(struct kref *kref)
+{
+	struct time_namespace *ns;
+
+	ns = container_of(kref, struct time_namespace, kref);
+	dec_time_namespaces(ns->ucounts);
+	put_user_ns(ns->user_ns);
+	ns_free_inum(&ns->ns);
+	kfree(ns);
+}
+
+static struct time_namespace *to_time_ns(struct ns_common *ns)
+{
+	return container_of(ns, struct time_namespace, ns);
+}
+
+static struct ns_common *timens_get(struct task_struct *task)
+{
+	struct time_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->time_ns;
+		get_time_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns ? &ns->ns : NULL;
+}
+
+static struct ns_common *timens_for_children_get(struct task_struct *task)
+{
+	struct time_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->time_ns_for_children;
+		get_time_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns ? &ns->ns : NULL;
+}
+
+static void timens_put(struct ns_common *ns)
+{
+	put_time_ns(to_time_ns(ns));
+}
+
+static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
+{
+	struct time_namespace *ns = to_time_ns(new);
+
+	if (!current_is_single_threaded())
+		return -EUSERS;
+
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
+	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+		return -EPERM;
+
+	get_time_ns(ns);
+	get_time_ns(ns);
+	put_time_ns(nsproxy->time_ns);
+	put_time_ns(nsproxy->time_ns_for_children);
+	nsproxy->time_ns = ns;
+	nsproxy->time_ns_for_children = ns;
+	ns->initialized = true;
+	return 0;
+}
+
+int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
+{
+	struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
+	struct time_namespace *ns = to_time_ns(nsc);
+
+	if (nsproxy->time_ns == nsproxy->time_ns_for_children)
+		return 0;
+
+	get_time_ns(ns);
+	put_time_ns(nsproxy->time_ns);
+	nsproxy->time_ns = ns;
+	ns->initialized = true;
+
+	return 0;
+}
+
+static struct user_namespace *timens_owner(struct ns_common *ns)
+{
+	return to_time_ns(ns)->user_ns;
+}
+
+const struct proc_ns_operations timens_operations = {
+	.name		= "time",
+	.type		= CLONE_NEWTIME,
+	.get		= timens_get,
+	.put		= timens_put,
+	.install	= timens_install,
+	.owner		= timens_owner,
+};
+
+const struct proc_ns_operations timens_for_children_operations = {
+	.name		= "time_for_children",
+	.type		= CLONE_NEWTIME,
+	.get		= timens_for_children_get,
+	.put		= timens_put,
+	.install	= timens_install,
+	.owner		= timens_owner,
+};
+
+struct time_namespace init_time_ns = {
+	.kref = KREF_INIT(3),
+	.user_ns = &init_user_ns,
+	.ns.inum = PROC_TIME_INIT_INO,
+	.ns.ops = &timens_operations,
+};
+
+static int __init time_ns_init(void)
+{
+	return 0;
+}
+subsys_initcall(time_ns_init);
-- 
2.22.0

^ permalink raw reply related

* [PATCHv6 25/37] x86/vdso: Switch image on setns()/clone()
From: Dmitry Safonov @ 2019-08-07  0:27 UTC (permalink / raw)
  To: dima
  Cc: 0x7f454c46, adrian, arnd, avagin, avagin, christian.brauner,
	containers, criu, ebiederm, gorcunov, hpa, jannh, jdike,
	linux-api, linux-kernel, luto, mingo, oleg, shuah, tglx,
	vincenzo.frascino, x86, xemul
In-Reply-To: <20190729215758.28405-26-dima@arista.com>

As it has been discussed on timens RFC, adding a new conditional branch
`if (inside_time_ns)` on VDSO for all processes is undesirable.
It will add a penalty for everybody as branch predictor may mispredict
the jump. Also there are instruction cache lines wasted on cmp/jmp.

Those effects of introducing time namespace are very much unwanted
having in mind how much work have been spent on micro-optimisation
vdso code.

Addressing those problems, there are two versions of VDSO's .so:
for host tasks (without any penalty) and for processes inside of time
namespace with clk_to_ns() that subtracts offsets from host's time.

Whenever a user does setns() or unshare(CLONE_TIMENS) followed
by clone(), change VDSO image in mm and zap VVAR/VDSO page tables.
They will be re-faulted with corresponding image and VVAR offsets.

Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
v5..v6 Change:
  Rebased over current_is_single_threaded() change in the first patch.

 arch/x86/entry/vdso/vma.c   | 23 +++++++++++++++++++++++
 arch/x86/include/asm/vdso.h |  1 +
 kernel/time_namespace.c     | 11 +++++++++++
 3 files changed, 35 insertions(+)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 8a8211fd4cfc..91cf5a5c8c9e 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -25,6 +25,7 @@
 #include <asm/cpufeature.h>
 #include <clocksource/hyperv_timer.h>
 #include <asm/page.h>
+#include <asm/tlb.h>
 
 #if defined(CONFIG_X86_64)
 unsigned int __read_mostly vdso64_enabled = 1;
@@ -266,6 +267,28 @@ static const struct vm_special_mapping vvar_mapping = {
 	.mremap = vvar_mremap,
 };
 
+#ifdef CONFIG_TIME_NS
+int vdso_join_timens(struct task_struct *task)
+{
+	struct mm_struct *mm = task->mm;
+	struct vm_area_struct *vma;
+
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		unsigned long size = vma->vm_end - vma->vm_start;
+
+		if (vma_is_special_mapping(vma, &vvar_mapping) ||
+		    vma_is_special_mapping(vma, &vdso_mapping))
+			zap_page_range(vma, vma->vm_start, size);
+	}
+
+	up_write(&mm->mmap_sem);
+	return 0;
+}
+#endif
+
 /*
  * Add vdso and vvar mappings to current process.
  * @image          - blob to map
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 03f468c63a24..ccf89dedd04f 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -45,6 +45,7 @@ extern struct vdso_image vdso_image_32;
 extern void __init init_vdso_image(struct vdso_image *image);
 
 extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
+extern int vdso_join_timens(struct task_struct *task);
 
 #endif /* __ASSEMBLER__ */
 
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index cdfa1b75bd0d..2e7e0af44f04 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -15,6 +15,7 @@
 #include <linux/cred.h>
 #include <linux/err.h>
 #include <linux/mm.h>
+#include <asm/vdso.h>
 
 ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim,
 				struct timens_offsets *ns_offsets)
@@ -199,6 +200,7 @@ static void timens_put(struct ns_common *ns)
 static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
 {
 	struct time_namespace *ns = to_time_ns(new);
+	int ret;
 
 	if (!current_is_single_threaded())
 		return -EUSERS;
@@ -207,6 +209,10 @@ static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
 		return -EPERM;
 
+	ret = vdso_join_timens(current);
+	if (ret)
+		return ret;
+
 	get_time_ns(ns);
 	get_time_ns(ns);
 	put_time_ns(nsproxy->time_ns);
@@ -221,10 +227,15 @@ int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
 {
 	struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
 	struct time_namespace *ns = to_time_ns(nsc);
+	int ret;
 
 	if (nsproxy->time_ns == nsproxy->time_ns_for_children)
 		return 0;
 
+	ret = vdso_join_timens(tsk);
+	if (ret)
+		return ret;
+
 	get_time_ns(ns);
 	put_time_ns(nsproxy->time_ns);
 	nsproxy->time_ns = ns;
-- 
2.22.0

^ permalink raw reply related

* Re: [PATCH] mm/mempolicy.c: Remove unnecessary nodemask check in kernel_migrate_pages()
From: Kefeng Wang @ 2019-08-07  0:58 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, linux-kernel
  Cc: Andrea Arcangeli, Dan Williams, Michal Hocko, Oscar Salvador,
	linux-mm, Linux API, linux-man@vger.kernel.org
In-Reply-To: <80f8da83-f425-1aab-f47e-8da41ec6dcbf@suse.cz>



On 2019/8/6 16:36, Vlastimil Babka wrote:
> On 8/6/19 4:36 AM, Kefeng Wang wrote:
[...]
>>
>> [QUESTION]
>>
>> SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
>>                 const unsigned long __user *, old_nodes,
>>                 const unsigned long __user *, new_nodes)
>> {
>>         return kernel_migrate_pages(pid, maxnode, old_nodes, new_nodes);
>> }
>>
>> The migrate_pages() takes pid argument, witch is the ID of the process
>> whose pages are to be moved. should the cpuset_mems_allowed(current) be
>> cpuset_mems_allowed(task)?
> 
> The check for cpuset_mems_allowed(task) is just above the code you change, so
> the new nodes have to be subset of the target task's cpuset.
> But they also have to be allowed by the calling task's cpuset. In manpage of
> migrate_pages(2), this is hinted by the NOTES "Use get_mempolicy(2) with the
> MPOL_F_MEMS_ALLOWED flag to obtain the set of nodes that are allowed by the
> calling process's cpuset..."
> 
> But perhaps the manpage should be better clarified:
> 
> - the EINVAL case includes "Or, none of the node IDs specified by new_nodes are
> on-line and allowed by the process's current cpuset context, or none of the
> specified nodes contain memory." - this should probably say "calling process" to
> disambiguate
> - the EPERM case should mention that new_nodes have to be subset of the target
> process' cpuset context. The caller should also have CAP_SYS_NICE and
> ptrace_may_access()

Get it, thanks for your detail explanation.

> 

^ permalink raw reply

* Re: [PATCH v4 04/12] fpga: dfl: afu: add userclock sysfs interfaces.
From: Wu Hao @ 2019-08-07  2:18 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Ananda Ravuri, Russ Weight, Xu Yilun
In-Reply-To: <20190805155113.GA8107@kroah.com>

On Mon, Aug 05, 2019 at 05:51:13PM +0200, Greg KH wrote:
> On Sun, Aug 04, 2019 at 06:20:14PM +0800, Wu Hao wrote:
> > This patch introduces userclock sysfs interfaces for AFU, user
> > could use these interfaces for clock setting to AFU.
> > 
> > Please note that, this is only working for port header feature
> > with revision 0, for later revisions, userclock setting is moved
> > to a separated private feature, so one revision sysfs interface
> > is exposed to userspace application for this purpose too.
> > 
> > Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
> > Signed-off-by: Russ Weight <russell.h.weight@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > Acked-by: Alan Tull <atull@kernel.org>
> > Signed-off-by: Moritz Fischer <mdf@kernel.org>
> > ---
> > v2: rebased, and switched to use device_add/remove_groups for sysfs
> > v3: update kernel version and date in sysfs doc
> > v4: rebased.
> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-port |  35 +++++++
> >  drivers/fpga/dfl-afu-main.c                       | 114 +++++++++++++++++++++-
> >  drivers/fpga/dfl.h                                |   9 ++
> >  3 files changed, 157 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > index 1ab3e6f..5663441 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > @@ -46,3 +46,38 @@ Contact:	Wu Hao <hao.wu@intel.com>
> >  Description:	Read-write. Read or set AFU latency tolerance reporting value.
> >  		Set ltr to 1 if the AFU can tolerate latency >= 40us or set it
> >  		to 0 if it is latency sensitive.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/revision
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. Read this file to get the revision of port header
> > +		feature.
> 
> What does "revision" mean?
> 
> It feels like you are creating a different set of sysfs files depending
> on the revision field.  Which is fine, sysfs is one-value-per-file and
> userspace needs to handle if the file is present or not.  So why not
> just rely on that and not have to mess with 'revision' at all?  What is
> userspace going to do with that information?

Hi Greg

Thanks for the review and comments,

Yes, different revision of private feature may have different hardware
features, so driver will expose different set of sysfs entries. revision
here is used to help userspace to distinguish them. I think it makes
sense to just rely on if sysfs entry exists or not, manage revision in
userspace code may be quit difficult. Plan to remove this entry in the
next version.

> 
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqcmd
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Write-only. User writes command to this interface to set
> > +		userclock to AFU.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqsts
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. Read this file to get the status of issued command
> > +		to userclck_freqcmd.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqcntrcmd
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Write-only. User writes command to this interface to set
> > +		userclock counter.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqcntrsts
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. Read this file to get the status of issued command
> > +		to userclck_freqcntrcmd.
> > diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> > index 12175bb..407c97d 100644
> > --- a/drivers/fpga/dfl-afu-main.c
> > +++ b/drivers/fpga/dfl-afu-main.c
> > @@ -142,6 +142,17 @@ static int port_get_id(struct platform_device *pdev)
> >  static DEVICE_ATTR_RO(id);
> >  
> >  static ssize_t
> > +revision_show(struct device *dev, struct device_attribute *attr, char *buf)
> > +{
> > +	void __iomem *base;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> > +
> > +	return sprintf(buf, "%x\n", dfl_feature_revision(base));
> > +}
> > +static DEVICE_ATTR_RO(revision);
> > +
> > +static ssize_t
> >  ltr_show(struct device *dev, struct device_attribute *attr, char *buf)
> >  {
> >  	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > @@ -276,6 +287,7 @@ static int port_get_id(struct platform_device *pdev)
> >  
> >  static struct attribute *port_hdr_attrs[] = {
> >  	&dev_attr_id.attr,
> > +	&dev_attr_revision.attr,
> >  	&dev_attr_ltr.attr,
> >  	&dev_attr_ap1_event.attr,
> >  	&dev_attr_ap2_event.attr,
> > @@ -284,14 +296,113 @@ static int port_get_id(struct platform_device *pdev)
> >  };
> >  ATTRIBUTE_GROUPS(port_hdr);
> >  
> > +static ssize_t
> > +userclk_freqcmd_store(struct device *dev, struct device_attribute *attr,
> > +		      const char *buf, size_t count)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +	u64 userclk_freq_cmd;
> > +	void __iomem *base;
> > +
> > +	if (kstrtou64(buf, 0, &userclk_freq_cmd))
> > +		return -EINVAL;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> > +
> > +	mutex_lock(&pdata->lock);
> > +	writeq(userclk_freq_cmd, base + PORT_HDR_USRCLK_CMD0);
> > +	mutex_unlock(&pdata->lock);
> > +
> > +	return count;
> > +}
> > +static DEVICE_ATTR_WO(userclk_freqcmd);
> > +
> > +static ssize_t
> > +userclk_freqcntrcmd_store(struct device *dev, struct device_attribute *attr,
> > +			  const char *buf, size_t count)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +	u64 userclk_freqcntr_cmd;
> > +	void __iomem *base;
> > +
> > +	if (kstrtou64(buf, 0, &userclk_freqcntr_cmd))
> > +		return -EINVAL;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> > +
> > +	mutex_lock(&pdata->lock);
> > +	writeq(userclk_freqcntr_cmd, base + PORT_HDR_USRCLK_CMD1);
> > +	mutex_unlock(&pdata->lock);
> > +
> > +	return count;
> > +}
> > +static DEVICE_ATTR_WO(userclk_freqcntrcmd);
> > +
> > +static ssize_t
> > +userclk_freqsts_show(struct device *dev, struct device_attribute *attr,
> > +		     char *buf)
> > +{
> > +	u64 userclk_freqsts;
> > +	void __iomem *base;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> > +
> > +	userclk_freqsts = readq(base + PORT_HDR_USRCLK_STS0);
> > +
> > +	return sprintf(buf, "0x%llx\n", (unsigned long long)userclk_freqsts);
> > +}
> > +static DEVICE_ATTR_RO(userclk_freqsts);
> > +
> > +static ssize_t
> > +userclk_freqcntrsts_show(struct device *dev, struct device_attribute *attr,
> > +			 char *buf)
> > +{
> > +	u64 userclk_freqcntrsts;
> > +	void __iomem *base;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> > +
> > +	userclk_freqcntrsts = readq(base + PORT_HDR_USRCLK_STS1);
> > +
> > +	return sprintf(buf, "0x%llx\n",
> > +		       (unsigned long long)userclk_freqcntrsts);
> > +}
> > +static DEVICE_ATTR_RO(userclk_freqcntrsts);
> > +
> > +static struct attribute *port_hdr_userclk_attrs[] = {
> > +	&dev_attr_userclk_freqcmd.attr,
> > +	&dev_attr_userclk_freqcntrcmd.attr,
> > +	&dev_attr_userclk_freqsts.attr,
> > +	&dev_attr_userclk_freqcntrsts.attr,
> > +	NULL,
> > +};
> > +ATTRIBUTE_GROUPS(port_hdr_userclk);
> > +
> >  static int port_hdr_init(struct platform_device *pdev,
> >  			 struct dfl_feature *feature)
> >  {
> > +	int ret;
> > +
> >  	dev_dbg(&pdev->dev, "PORT HDR Init.\n");
> >  
> >  	port_reset(pdev);
> >  
> > -	return device_add_groups(&pdev->dev, port_hdr_groups);
> > +	ret = device_add_groups(&pdev->dev, port_hdr_groups);
> 
> This all needs to be reworked based on the ability for devices to
> properly add groups when they are bound on probe (the core does it for
> you, no need for the driver to do it.)  But until then, you should at
> least consider:
> 
> > +	if (ret)
> > +		return ret;
> > +
> > +	/*
> > +	 * if revision > 0, the userclock will be moved from port hdr register
> > +	 * region to a separated private feature.
> > +	 */
> > +	if (dfl_feature_revision(feature->ioaddr) > 0)
> > +		return 0;
> > +
> > +	ret = device_add_groups(&pdev->dev, port_hdr_userclk_groups);
> > +	if (ret)
> > +		device_remove_groups(&pdev->dev, port_hdr_groups);
> 
> struct attribute_group has is_visible() as a callback to have the core
> show or not show, individual attributes when they are created.  So no
> need for a second group of attributes and you needing to add/remove
> them, just add them all and let the callback handle the "is visible"
> logic.  Makes cleanup _so_ much easier (i.e. you don't have to do it.)

Sure, will use is_visible() here instead in the next version, it does
make thing more clear. Thanks a lot of the comments.

Hao

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v4 06/12] fpga: dfl: afu: export __port_enable/disable function.
From: Wu Hao @ 2019-08-07  2:21 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Xu Yilun
In-Reply-To: <20190805155240.GB8107@kroah.com>

On Mon, Aug 05, 2019 at 05:52:40PM +0200, Greg KH wrote:
> On Sun, Aug 04, 2019 at 06:20:16PM +0800, Wu Hao wrote:
> > As these two functions are used by other private features. e.g.
> > in error reporting private feature, it requires to check port status
> > and reset port for error clearing.
> > 
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > Acked-by: Moritz Fischer <mdf@kernel.org>
> > Acked-by: Alan Tull <atull@kernel.org>
> > Signed-off-by: Moritz Fischer <mdf@kernel.org>
> > ---
> > v2: rebased
> > ---
> >  drivers/fpga/dfl-afu-main.c | 25 ++++++++++++++-----------
> >  drivers/fpga/dfl-afu.h      |  3 +++
> >  2 files changed, 17 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> > index e013afb..e312179 100644
> > --- a/drivers/fpga/dfl-afu-main.c
> > +++ b/drivers/fpga/dfl-afu-main.c
> > @@ -22,14 +22,16 @@
> >  #include "dfl-afu.h"
> >  
> >  /**
> > - * port_enable - enable a port
> > + * __port_enable - enable a port
> >   * @pdev: port platform device.
> >   *
> >   * Enable Port by clear the port soft reset bit, which is set by default.
> >   * The AFU is unable to respond to any MMIO access while in reset.
> > - * port_enable function should only be used after port_disable function.
> > + * __port_enable function should only be used after __port_disable function.
> > + *
> > + * The caller needs to hold lock for protection.
> >   */
> > -static void port_enable(struct platform_device *pdev)
> > +void __port_enable(struct platform_device *pdev)
> 
> worst global function name ever.
> 
> Don't polute the global namespace like this for a single driver.  If you
> REALLY need it, then use a prefix that shows it is your individual
> dfl_special_sauce_platform_device_only type thing.

Oh.. Sure.. Let me fix the naming in the next version.

Thanks
Hao

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v4 07/12] fpga: dfl: afu: add error reporting support.
From: Wu Hao @ 2019-08-07  2:35 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Xu Yilun
In-Reply-To: <20190805155437.GC8107@kroah.com>

On Mon, Aug 05, 2019 at 05:54:37PM +0200, Greg KH wrote:
> On Sun, Aug 04, 2019 at 06:20:17PM +0800, Wu Hao wrote:
> > Error reporting is one important private feature, it reports error
> > detected on port and accelerated function unit (AFU). It introduces
> > several sysfs interfaces to allow userspace to check and clear
> > errors detected by hardware.
> > 
> > Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> > Signed-off-by: Wu Hao <hao.wu@intel.com>
> > Acked-by: Alan Tull <atull@kernel.org>
> > Signed-off-by: Moritz Fischer <mdf@kernel.org>
> > ---
> > v2: switch to device_add/remove_group for sysfs.
> > v3: update kernel version and date in sysfs doc
> > v4: remove dev_dbg in init/uinit callback function.
> > ---
> >  Documentation/ABI/testing/sysfs-platform-dfl-port |  39 ++++
> >  drivers/fpga/Makefile                             |   1 +
> >  drivers/fpga/dfl-afu-error.c                      | 221 ++++++++++++++++++++++
> >  drivers/fpga/dfl-afu-main.c                       |   4 +
> >  drivers/fpga/dfl-afu.h                            |   4 +
> >  5 files changed, 269 insertions(+)
> >  create mode 100644 drivers/fpga/dfl-afu-error.c
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > index 5663441..3b6580b 100644
> > --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> > +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> > @@ -81,3 +81,42 @@ KernelVersion:	5.4
> >  Contact:	Wu Hao <hao.wu@intel.com>
> >  Description:	Read-only. Read this file to get the status of issued command
> >  		to userclck_freqcntrcmd.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/errors/revision
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. Read this file to get the revision of this error
> > +		reporting private feature.
> 
> Same revision question here that I had on an earlier patch.
> 
> 
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/errors/errors
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. Read this file to get errors detected on port and
> > +		Accelerated Function Unit (AFU).
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/errors/first_error
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. Read this file to get the first error detected by
> > +		hardware.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/errors/first_malformed_req
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Read-only. Read this file to get the first malformed request
> > +		captured by hardware.
> > +
> > +What:		/sys/bus/platform/devices/dfl-port.0/errors/clear
> > +Date:		August 2019
> > +KernelVersion:	5.4
> > +Contact:	Wu Hao <hao.wu@intel.com>
> > +Description:	Write-only. Write error code to this file to clear errors.
> > +		Write fails with -EINVAL if input parsing fails or input error
> > +		code doesn't match.
> > +		Write fails with -EBUSY or -ETIMEDOUT if error can't be cleared
> > +		as hardware is in low power state (-EBUSY) or not responding
> > +		(-ETIMEDOUT).
> > diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
> > index 312b937..7255891 100644
> > --- a/drivers/fpga/Makefile
> > +++ b/drivers/fpga/Makefile
> > @@ -41,6 +41,7 @@ obj-$(CONFIG_FPGA_DFL_AFU)		+= dfl-afu.o
> >  
> >  dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o
> >  dfl-afu-objs := dfl-afu-main.o dfl-afu-region.o dfl-afu-dma-region.o
> > +dfl-afu-objs += dfl-afu-error.o
> >  
> >  # Drivers for FPGAs which implement DFL
> >  obj-$(CONFIG_FPGA_DFL_PCI)		+= dfl-pci.o
> > diff --git a/drivers/fpga/dfl-afu-error.c b/drivers/fpga/dfl-afu-error.c
> > new file mode 100644
> > index 0000000..c5e0efa
> > --- /dev/null
> > +++ b/drivers/fpga/dfl-afu-error.c
> > @@ -0,0 +1,221 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Driver for FPGA Accelerated Function Unit (AFU) Error Reporting
> > + *
> > + * Copyright 2019 Intel Corporation, Inc.
> > + *
> > + * Authors:
> > + *   Wu Hao <hao.wu@linux.intel.com>
> > + *   Xiao Guangrong <guangrong.xiao@linux.intel.com>
> > + *   Joseph Grecco <joe.grecco@intel.com>
> > + *   Enno Luebbers <enno.luebbers@intel.com>
> > + *   Tim Whisonant <tim.whisonant@intel.com>
> > + *   Ananda Ravuri <ananda.ravuri@intel.com>
> > + *   Mitchel Henry <henry.mitchel@intel.com>
> > + */
> > +
> > +#include <linux/uaccess.h>
> > +
> > +#include "dfl-afu.h"
> > +
> > +#define PORT_ERROR_MASK		0x8
> > +#define PORT_ERROR		0x10
> > +#define PORT_FIRST_ERROR	0x18
> > +#define PORT_MALFORMED_REQ0	0x20
> > +#define PORT_MALFORMED_REQ1	0x28
> > +
> > +#define ERROR_MASK		GENMASK_ULL(63, 0)
> > +
> > +/* mask or unmask port errors by the error mask register. */
> > +static void __port_err_mask(struct device *dev, bool mask)
> > +{
> > +	void __iomem *base;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> > +
> > +	writeq(mask ? ERROR_MASK : 0, base + PORT_ERROR_MASK);
> > +}
> > +
> > +/* clear port errors. */
> > +static int __port_err_clear(struct device *dev, u64 err)
> > +{
> > +	struct platform_device *pdev = to_platform_device(dev);
> > +	void __iomem *base_err, *base_hdr;
> > +	int ret;
> > +	u64 v;
> > +
> > +	base_err = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> > +	base_hdr = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> > +
> > +	/*
> > +	 * clear Port Errors
> > +	 *
> > +	 * - Check for AP6 State
> > +	 * - Halt Port by keeping Port in reset
> > +	 * - Set PORT Error mask to all 1 to mask errors
> > +	 * - Clear all errors
> > +	 * - Set Port mask to all 0 to enable errors
> > +	 * - All errors start capturing new errors
> > +	 * - Enable Port by pulling the port out of reset
> > +	 */
> > +
> > +	/* if device is still in AP6 power state, can not clear any error. */
> > +	v = readq(base_hdr + PORT_HDR_STS);
> > +	if (FIELD_GET(PORT_STS_PWR_STATE, v) == PORT_STS_PWR_STATE_AP6) {
> > +		dev_err(dev, "Could not clear errors, device in AP6 state.\n");
> > +		return -EBUSY;
> > +	}
> > +
> > +	/* Halt Port by keeping Port in reset */
> > +	ret = __port_disable(pdev);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* Mask all errors */
> > +	__port_err_mask(dev, true);
> > +
> > +	/* Clear errors if err input matches with current port errors.*/
> > +	v = readq(base_err + PORT_ERROR);
> > +
> > +	if (v == err) {
> > +		writeq(v, base_err + PORT_ERROR);
> > +
> > +		v = readq(base_err + PORT_FIRST_ERROR);
> > +		writeq(v, base_err + PORT_FIRST_ERROR);
> > +	} else {
> > +		ret = -EINVAL;
> > +	}
> > +
> > +	/* Clear mask */
> > +	__port_err_mask(dev, false);
> > +
> > +	/* Enable the Port by clear the reset */
> > +	__port_enable(pdev);
> > +
> > +	return ret;
> > +}
> > +
> > +static ssize_t revision_show(struct device *dev, struct device_attribute *attr,
> > +			     char *buf)
> > +{
> > +	void __iomem *base;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> > +
> > +	return sprintf(buf, "%u\n", dfl_feature_revision(base));
> > +}
> > +static DEVICE_ATTR_RO(revision);
> > +
> > +static ssize_t errors_show(struct device *dev, struct device_attribute *attr,
> > +			   char *buf)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +	void __iomem *base;
> > +	u64 error;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> > +
> > +	mutex_lock(&pdata->lock);
> > +	error = readq(base + PORT_ERROR);
> > +	mutex_unlock(&pdata->lock);
> > +
> > +	return sprintf(buf, "0x%llx\n", (unsigned long long)error);
> > +}
> > +static DEVICE_ATTR_RO(errors);
> > +
> > +static ssize_t first_error_show(struct device *dev,
> > +				struct device_attribute *attr, char *buf)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +	void __iomem *base;
> > +	u64 error;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> > +
> > +	mutex_lock(&pdata->lock);
> > +	error = readq(base + PORT_FIRST_ERROR);
> > +	mutex_unlock(&pdata->lock);
> > +
> > +	return sprintf(buf, "0x%llx\n", (unsigned long long)error);
> > +}
> > +static DEVICE_ATTR_RO(first_error);
> > +
> > +static ssize_t first_malformed_req_show(struct device *dev,
> > +					struct device_attribute *attr,
> > +					char *buf)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +	void __iomem *base;
> > +	u64 req0, req1;
> > +
> > +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> > +
> > +	mutex_lock(&pdata->lock);
> > +	req0 = readq(base + PORT_MALFORMED_REQ0);
> > +	req1 = readq(base + PORT_MALFORMED_REQ1);
> > +	mutex_unlock(&pdata->lock);
> > +
> > +	return sprintf(buf, "0x%016llx%016llx\n",
> > +		       (unsigned long long)req1, (unsigned long long)req0);
> > +}
> > +static DEVICE_ATTR_RO(first_malformed_req);
> > +
> > +static ssize_t clear_store(struct device *dev, struct device_attribute *attr,
> > +			   const char *buff, size_t count)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> > +	u64 value;
> > +	int ret;
> > +
> > +	if (kstrtou64(buff, 0, &value))
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&pdata->lock);
> > +	ret = __port_err_clear(dev, value);
> > +	mutex_unlock(&pdata->lock);
> > +
> > +	return ret ? ret : count;
> > +}
> > +static DEVICE_ATTR_WO(clear);
> > +
> > +static struct attribute *port_err_attrs[] = {
> > +	&dev_attr_revision.attr,
> > +	&dev_attr_errors.attr,
> > +	&dev_attr_first_error.attr,
> > +	&dev_attr_first_malformed_req.attr,
> > +	&dev_attr_clear.attr,
> > +	NULL,
> > +};
> > +
> > +static struct attribute_group port_err_attr_group = {
> > +	.attrs = port_err_attrs,
> > +	.name = "errors",
> > +};
> > +
> > +static int port_err_init(struct platform_device *pdev,
> > +			 struct dfl_feature *feature)
> > +{
> > +	struct dfl_feature_platform_data *pdata = dev_get_platdata(&pdev->dev);
> > +
> > +	mutex_lock(&pdata->lock);
> > +	__port_err_mask(&pdev->dev, false);
> > +	mutex_unlock(&pdata->lock);
> 
> Locking one data structure and then modifying another one is up there
> with "things never to do in the kernel unless you want a huge
> headache!".

Actually we always use the same lock for protection as other places, but
the code may cause some misunderstanding, let me improve this part in
the next version.

> 
> > +
> > +	return device_add_group(&pdev->dev, &port_err_attr_group);
> 
> You raced userspace and lost :(

Do you mind giving some more hints on this one? I guess I didn't fully
understand this. :( Add handling if device_add_group failed here, or
something else I should fix?

Thanks
Hao

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v4 11/12] fpga: dfl: fme: add global error reporting support
From: Wu Hao @ 2019-08-07  2:45 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Luwei Kang, Ananda Ravuri, Xu Yilun
In-Reply-To: <20190805155626.GD8107@kroah.com>

On Mon, Aug 05, 2019 at 05:56:26PM +0200, Greg KH wrote:
> On Sun, Aug 04, 2019 at 06:20:21PM +0800, Wu Hao wrote:
> > +static int fme_global_err_init(struct platform_device *pdev,
> > +			       struct dfl_feature *feature)
> > +{
> > +	struct device *dev;
> > +	int ret = 0;
> > +
> > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +	if (!dev)
> > +		return -ENOMEM;
> > +
> > +	dev->parent = &pdev->dev;
> > +	dev->release = err_dev_release;
> > +	dev_set_name(dev, "errors");
> > +
> > +	fme_error_enable(feature);
> > +
> > +	ret = device_register(dev);
> > +	if (ret) {
> > +		put_device(dev);
> > +		return ret;
> > +	}
> > +
> > +	ret = device_add_groups(dev, error_groups);
> 
> cute, but no, you do not create a whole struct device for a subdir.  Use
> the attribute group name like you did on earlier patches.

Sure, let me fix it in the next version.

> 
> And again, you raced userspace and lost :(

Same here, could you please give some more hints here?

Thanks in advance.
Hao

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-07  5:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190806011134.p5baub5l3t5fkmou@ast-mbp>

On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> > It tries to make the kernel respect the access modes for fds.  Without
> > this patch, there seem to be some holes: nothing looked at program fds
> > and, unless I missed something, you could take a readonly fd for a
> > program, pin the program, and reopen it RW.
>
> I think it's by design. iirc Daniel had a use case for something like this.

That seems odd.  Daniel, can you elaborate?

> Hence unprivileged bpf is actually something that can be deprecated.

I hope not.  There are a couple setsockopt uses right now, and and
seccomp will surely want it someday.  And the bpf-inside-container use
case really is unprivileged bpf -- containers are, in many (most?)
cases, explicitly not trusted by the host.  But regardless, see below.

> > The ability to obtain an fd to a cgroup does *not* imply any right to
> > modify that cgroup.  The ability to write to a cgroup directory
> > already means something else -- it's the ability to create cgroups
> > under the group in question.  I'm suggesting that a new API be added
> > that allows attaching a bpf program to a cgroup without capabilities
> > and that instead requires write access to a new file in the cgroup
> > directory.  (It could be a single file for all bpf types or one file
> > per type.  I prefer the latter -- it gives the admin finer-grained
> > control.)
>
> This is something to discuss. I don't mind something like this,
> but in general bpf is not for untrusted users.
> Hence I don't want to overdesign.

I think the code would end up being extremely simple.  It would be an
ABI change, but I think that it could easily be arranged so that new
and old libbpf would be fully functional on new and old kernels except
that only new libbpf with a new kernel would be able to attach cgroup
programs without privilege.  I see no reason to remove the current
cgroup attach API.

> See https://github.com/systemd/systemd/blob/01234e1fe777602265ffd25ab6e73823c62b0efe/src/core/bpf-firewall.c#L671-L674
> bpf based IP sandboxing doesn't work in 'systemd --user'.
> That is just one of the problems that people complained about.

This isn't privilege *dropping* -- this is not having privilege in the
first place.  Giving systemd --user unrestricted use of bpf() is
tantamount to making the user root.  But again, see below.

>
> Inside containers and inside nested containers we need to start processes
> that will use bpf. All of the processes are trusted.

Trusted by whom?  In a non-nested container, the container manager
*might* be trusted by the outside world.  In a *nested* container,
unless the inner container management is controlled from outside the
outer container, it's not trusted.  I don't know much about how
Facebook's containers work, but the LXC/LXD/Podman world is moving
very strongly toward user namespaces and maximally-untrusted
containers, and I think bpf() should work in that context.

> To solve your concern of bypassing all capable checks...
> How about we do /dev/bpf/full_verifier first?
> It will replace capable() checks in the verifier only.

I'm not convinced that "in the verifier" is the right distinction.
Telling administrators that some setting lets certain users bypass
bpf() verifier checks doesn't have a clear enough meaning.  I propose,
instead, that the current capable() checks be divided into three
categories:

a) Those that, by design, control privileged operations.  This
includes most attach calls, but it also includes allow_ptr_leaks,
bpf_probe_read(), and quite a few other things.  It also includes all
of the by_id calls, I think, unless some clever modification to the
way they worked would isolate different users' objects.  I think that
persistent objects can do pretty much everything that by_id users
would need, so this isn't a big deal.

b) Those that protect code that is considered to be at higher risk of
exposing security bugs.  In other words, these are things that would
have no need for privilege if we could prove that the implementation
was perfect.  This includes bpf2bpf, LPM, etc.

c) Those that serve to prevent excessive resource usage.  This
includes the bounded loop code (I think -- this might also be category
(b)) and several explicit checks for program size.

I suppose that the speculative execution mitigation stuff could be
split out from (b) into its own category.

Based on this, how about a different division: /dev/bpf_dangerous for
(b) and /dev/bpf_resources for (c)?  And (a) is dealt with by
gradually reducing the privilege needed for the important operations.
/dev/bpf_dangerous isn't ideal in the long run though, since a lot of
the operations in that category will probably become less dangerous as
the code is better vetted, and people granting "dangerous" permission
might want to restrict it to only the specific parts that they need.
/dev/bpf_lpm, /dev/bpf2bpf, etc could work better.

This type of thing actually fits quite nicely into an idea I've been
thinking about for a while called "implicit rights". In very brief
summary, there would be objects called /dev/rights/xyz, where xyz is
the same of a "right".  If there is a readable object of the right
type at the literal path "/dev/rights/xyz", then you have right xyz.
There's a bit more flexibility on top of this.  BPF could use
/dev/rights/bpf/maptypes/lpm and
/dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
use cases include a biggie:
/dev/rights/namespace/create_unprivileged_userns.
/dev/rights/bind_port/80 would be nice, too.

I may have a bit of time over the next few days to actually prototype
this.  Would you be interested in using this for BPF?

^ permalink raw reply

* Re: [PATCHv5 04/37] posix-clocks: Rename *_clock_get() functions into *_clock_get_timespec()
From: Thomas Gleixner @ 2019-08-07  6:01 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Dmitry Safonov, Andrei Vagin, Adrian Reber,
	Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <20190729215758.28405-5-dima@arista.com>

On Mon, 29 Jul 2019, Dmitry Safonov wrote:
>  static const struct k_clock clock_monotonic = {
>  	.clock_getres		= posix_get_hrtimer_res,
> -	.clock_get_timespec	= posix_ktime_get_ts,
> +	.clock_get_timespec	= posix_get_timespec,

 posix_get_monotonic_timespec

Please.

^ permalink raw reply

* Re: [PATCHv5 06/37] alarmtimer: Provide get_timespec() callback
From: Thomas Gleixner @ 2019-08-07  6:04 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Dmitry Safonov, Andrei Vagin, Adrian Reber,
	Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <20190729215758.28405-7-dima@arista.com>

On Mon, 29 Jul 2019, Dmitry Safonov wrote:
>  /**
> @@ -869,8 +871,10 @@ static int __init alarmtimer_init(void)
>  	/* Initialize alarm bases */
>  	alarm_bases[ALARM_REALTIME].base_clockid = CLOCK_REALTIME;
>  	alarm_bases[ALARM_REALTIME].get_ktime = &ktime_get_real;
> +	alarm_bases[ALARM_REALTIME].get_timespec = posix_get_timespec,

That's just wrong:

>  /*
>   * Get monotonic time for posix timers
>   */
> -static int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp)
> +int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp)
>  {
>  	ktime_get_ts64(tp);
>  	return 0;

Using a proper function name would have avoided this.

^ permalink raw reply

* Re: [PATCHv5 09/37] posix-clocks: Introduce CLOCK_MONOTONIC time namespace offsets
From: Thomas Gleixner @ 2019-08-07  6:07 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, Dmitry Safonov, Andrei Vagin, Adrian Reber,
	Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <20190729215758.28405-10-dima@arista.com>

On Mon, 29 Jul 2019, Dmitry Safonov wrote:
>  
> +static inline void timens_add_monotonic(struct timespec64 *ts)
> +{
> +	struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
> +
> +	if (ns_offsets)
> +		*ts = timespec64_add(*ts, ns_offsets->monotonic);
> +}

This helper is not posix timer specific and should be introduced either in
the name space patches or in a separate patch,

Thanks

	tglx

^ permalink raw reply

* Re: [PATCH v4 11/12] fpga: dfl: fme: add global error reporting support
From: Wu Hao @ 2019-08-07  8:08 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Luwei Kang, Ananda Ravuri, Xu Yilun
In-Reply-To: <20190807024521.GD24158@hao-dev>

On Wed, Aug 07, 2019 at 10:45:22AM +0800, Wu Hao wrote:
> On Mon, Aug 05, 2019 at 05:56:26PM +0200, Greg KH wrote:
> > On Sun, Aug 04, 2019 at 06:20:21PM +0800, Wu Hao wrote:
> > > +static int fme_global_err_init(struct platform_device *pdev,
> > > +			       struct dfl_feature *feature)
> > > +{
> > > +	struct device *dev;
> > > +	int ret = 0;
> > > +
> > > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > > +	if (!dev)
> > > +		return -ENOMEM;
> > > +
> > > +	dev->parent = &pdev->dev;
> > > +	dev->release = err_dev_release;
> > > +	dev_set_name(dev, "errors");
> > > +
> > > +	fme_error_enable(feature);
> > > +
> > > +	ret = device_register(dev);
> > > +	if (ret) {
> > > +		put_device(dev);
> > > +		return ret;
> > > +	}
> > > +
> > > +	ret = device_add_groups(dev, error_groups);
> > 
> > cute, but no, you do not create a whole struct device for a subdir.  Use
> > the attribute group name like you did on earlier patches.
> 
> Sure, let me fix it in the next version.
> 
> > 
> > And again, you raced userspace and lost :(
> 
> Same here, could you please give some more hints here?

Oh.. I see..

I should follow [1] as this is a platform driver. I will fix it. Thanks!

[PATCH 00/11] Platform drivers, provide a way to add sysfs groups easily

[1]https://lkml.org/lkml/2019/7/4/181

Hao

> 
> Thanks in advance.
> Hao
> 
> > 
> > thanks,
> > 
> > greg k-h

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Lorenz Bauer @ 2019-08-07  9:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Jann Horn,
	Greg KH, Linux API, LSM List
In-Reply-To: <CALCETrXEHL3+NAY6P6vUj7Pvd9ZpZsYC6VCLXOaNxb90a_POGw@mail.gmail.com>

On Wed, 7 Aug 2019 at 06:24, Andy Lutomirski <luto@kernel.org> wrote:
> a) Those that, by design, control privileged operations.  This
> includes most attach calls, but it also includes allow_ptr_leaks,
> bpf_probe_read(), and quite a few other things.  It also includes all
> of the by_id calls, I think, unless some clever modification to the
> way they worked would isolate different users' objects.  I think that
> persistent objects can do pretty much everything that by_id users
> would need, so this isn't a big deal.

Slightly OT, since this is an implementation question: GET_MAP_FD_BY_ID
is useful to iterate a nested map. This isn't covered by rights to
persistent objects,
so it would need some thought.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply

* Re: [PATCH v4 11/12] fpga: dfl: fme: add global error reporting support
From: Greg KH @ 2019-08-07  9:30 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Luwei Kang, Ananda Ravuri, Xu Yilun
In-Reply-To: <20190807080825.GA10344@hao-dev>

On Wed, Aug 07, 2019 at 04:08:25PM +0800, Wu Hao wrote:
> On Wed, Aug 07, 2019 at 10:45:22AM +0800, Wu Hao wrote:
> > On Mon, Aug 05, 2019 at 05:56:26PM +0200, Greg KH wrote:
> > > On Sun, Aug 04, 2019 at 06:20:21PM +0800, Wu Hao wrote:
> > > > +static int fme_global_err_init(struct platform_device *pdev,
> > > > +			       struct dfl_feature *feature)
> > > > +{
> > > > +	struct device *dev;
> > > > +	int ret = 0;
> > > > +
> > > > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > > > +	if (!dev)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	dev->parent = &pdev->dev;
> > > > +	dev->release = err_dev_release;
> > > > +	dev_set_name(dev, "errors");
> > > > +
> > > > +	fme_error_enable(feature);
> > > > +
> > > > +	ret = device_register(dev);
> > > > +	if (ret) {
> > > > +		put_device(dev);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > > +	ret = device_add_groups(dev, error_groups);
> > > 
> > > cute, but no, you do not create a whole struct device for a subdir.  Use
> > > the attribute group name like you did on earlier patches.
> > 
> > Sure, let me fix it in the next version.
> > 
> > > 
> > > And again, you raced userspace and lost :(
> > 
> > Same here, could you please give some more hints here?
> 
> Oh.. I see..
> 
> I should follow [1] as this is a platform driver. I will fix it. Thanks!
> 
> [PATCH 00/11] Platform drivers, provide a way to add sysfs groups easily
> 
> [1]https://lkml.org/lkml/2019/7/4/181

Yes, that is the correct thing to do.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Joel Fernandes @ 2019-08-07 10:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexey Dobriyan, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer, H. Peter Anvin,
	Ingo Molnar, Jonathan Corbet, Kees Cook, kernel-team, linux-api,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	minchan, namhyung, paulmck, Robin Murphy, Roman Gushchin,
	Stephen Rothwell
In-Reply-To: <20190806151921.edec128271caccb5214fc1bd@linux-foundation.org>

On Tue, Aug 06, 2019 at 03:19:21PM -0700, Andrew Morton wrote:
> (cc Brendan's other email address, hoping for review input ;))

;)

> On Mon,  5 Aug 2019 13:04:47 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> 
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > Looking up PFN from pagemap in Android devices is not supported by
> > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> > the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> > looking up PFN through pagemap is not needed since the interface uses
> > virtual frame numbers, and at the same time also does not require
> > SYS_ADMIN.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time. This method solves the security issue
> > with userspace learning the PFN, and while at it is also shown to yield
> > better results than the pagemap lookup, the theory being that the window
> > where the address space can change is reduced by eliminating the
> > intermediate pagemap look up stage. In virtual address indexing, the
> > process's mmap_sem is held for the duration of the access.
> 
> Quite a lot of changes to the page_idle code.  Has this all been
> runtime tested on architectures where
> CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE=n?  That could be x86 with a little
> Kconfig fiddle-for-testing-purposes.

I will do this Kconfig fiddle test with CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE=n and test
the patch as well.

In previous series, this flag was not there (which should have been
equivalent to the above test), and things are working fine.

> > 8 files changed, 376 insertions(+), 45 deletions(-)
> 
> Quite a lot of new code unconditionally added to major architectures. 
> Are we confident that everyone will want this feature?

I did not follow, could you clarify more? All of this diff stat is not to
architecture code:

 arch/Kconfig                  |   3 ++
 fs/proc/base.c                |   3 ++
 fs/proc/internal.h            |   1 +
 fs/proc/task_mmu.c            |  43 +++++++++++++++++++++
 include/asm-generic/pgtable.h |   6 +++
 include/linux/page_idle.h     |   4 ++
 mm/page_idle.c                | 359 +++++++++++++++++++++++++++++..
 mm/rmap.c                     |   2 +
 8 files changed, 376 insertions(+), 45 deletions(-)

The arcitecture change is in a later patch, and is not that many lines.

Also, I am planning to split the swap functionality of the patch into a
separate one for easier review.

> > +static int proc_page_idle_open(struct inode *inode, struct file *file)
> > +{
> > +	struct mm_struct *mm;
> > +
> > +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> > +	if (IS_ERR(mm))
> > +		return PTR_ERR(mm);
> > +	file->private_data = mm;
> > +	return 0;
> > +}
> > +
> > +static int proc_page_idle_release(struct inode *inode, struct file *file)
> > +{
> > +	struct mm_struct *mm = file->private_data;
> > +
> > +	if (mm)
> 
> I suspect the test isn't needed?  proc_page_idle_release) won't be
> called if proc_page_idle_open() failed?

Yes you are right, will remove the test.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-07 13:52 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Andy Lutomirski, Alexei Starovoitov, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <CACAyw9_fVZFW_x4uyTAiRfeH6oq1KHv0uB2wO84u5JZyD+Unaw@mail.gmail.com>


> On Aug 7, 2019, at 2:03 AM, Lorenz Bauer <lmb@cloudflare.com> wrote:
> 
>> On Wed, 7 Aug 2019 at 06:24, Andy Lutomirski <luto@kernel.org> wrote:
>> a) Those that, by design, control privileged operations.  This
>> includes most attach calls, but it also includes allow_ptr_leaks,
>> bpf_probe_read(), and quite a few other things.  It also includes all
>> of the by_id calls, I think, unless some clever modification to the
>> way they worked would isolate different users' objects.  I think that
>> persistent objects can do pretty much everything that by_id users
>> would need, so this isn't a big deal.
> 
> Slightly OT, since this is an implementation question: GET_MAP_FD_BY_ID
> is useful to iterate a nested map. This isn't covered by rights to
> persistent objects,
> so it would need some thought.
> 
> 

A call to get an fd to a map referenced by a map to which you already have an fd seems reasonable to me. The new fd would inherit the old fd’s access mode.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox