Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 00/12] Add kdbus implementation
From: Karol Lewandowski @ 2014-10-30 19:55 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Kosina, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	John Stultz, Arnd Bergmann, Tejun Heo, Ryan Lortie,
	Simon McVittie, daniel-cYrQPVfZoowdnm+yROfE0A, David Herrmann,
	Paul Moore,
	casey.schaufler-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
	marcel-kz+m5ild9QBg9hUCZPvPmw, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ
In-Reply-To: <20141030144709.GA19721-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On 2014-10-30 15:47, Greg Kroah-Hartman wrote:
> On Thu, Oct 30, 2014 at 11:44:39AM +0100, Karol Lewandowski wrote:
>> [ Sorry for breaking thread and resend - gmane rejected my original message
>>   due to too long list of recipients... ]
>>
>> On 2014-10-30 00:40, Greg Kroah-Hartman wrote:
>>
>>> There is a 1815 line documentation file in this series, so we aren't
>>> trying to not provide this type of information here at all.  But yes,
>>> more background, about why this can't be done in userspace (zero copy,
>>> less context switches, proper credential passing, timestamping, availble
>>> at early-boot, LSM hooks for security models to tie into
>>
>> While you're at it... I did some work on proof-of-concept LSM patches for
>> kdbus some time ago, see [1][2].  Currently, these are completely of date.
>>
>>  [1] https://github.com/lmctl/linux/commits/kdbus-lsm-v4.for-systemd-v212
>>  [2] https://github.com/lmctl/kdbus/commit/aa0885489d19be92fa41c6f0a71df28763228a40
>>
>> May I ask if you guys have your own plan for LSM or maybe it would be
>> worth to resurrect [1]?
> 
> The core calls are already mediated by LSM today, right?  We don't want
> anyone to be parsing the data stream through an LSM, that idea got
> rejected a long time ago as something that is really not a good idea.

Parsing data is out of question, of course, but this is not what we were
proposing.

> Other than that, I don't know exactly what your patches do, or why they
> are needed, care to go into details?

Patches in question were supposed to add few hooks for kdbus-specific
operations that doesn't seem to have compatible semantics with hooks
currently available in LSM.

kdbus' bus introduces quite a few new concepts that we wanted to be able
to limit based on MAC label/context, eg.

 - check flags at HELO stage (say disallow fd passing),

 - restrict ability to acquire name to certain subjects (for system bus),

 - disallow creation of new buses,

 - limit scope of broadcasts,

 - etc.

Please take a look at hook list - I think most of names are self-explanatory:

  https://github.com/lmctl/linux/blob/a9fe4c33b6e5ab25a243e0590df406aabb6add12/include/linux/security.h#L1874

kdbus modifications were pretty light - with most visible change being
addition of opaque security pointer to kdbus_bus and similar structs.

Thanks!
-- 
Karol Lewandowski, Samsung R&D Institute Poland

^ permalink raw reply

* Re: kdbus: add code for buses, domains and endpoints
From: Simon McVittie @ 2014-10-30 18:46 UTC (permalink / raw)
  To: Djalal Harouni, Andy Lutomirski
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, John Stultz,
	Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
	Bastien Nocera, David Herrmann, Daniel Mack, alban.crequy,
	Javier Martinez Canillas, Tom Gundersen
In-Reply-To: <20141030180813.GA11850@dztty>

On 30/10/14 18:08, Djalal Harouni wrote:
> So, this is similar to AF_UNIX sockets. For them there's SCM_CREDENTIALS
> and SO_PEERCRED. The former uses credentials at the time of when
> messages are being sent, the latter uses the credentials at the time
> when when the connection was initially established.

Please note that dbus-daemon, the reference implementation of D-Bus,
does not actually ever use SCM_CREDENTIALS on its AF_UNIX sockets. We
prefer to use Linux's SO_PEERCRED, or the platform's closest available
equivalent if there is one. dbus-daemon has methods (RPC calls) to get a
specified peer's uid, pid or LSM data (e.g. SELinux context), but those
methods return the value that was true when the connection was opened or
shortly afterwards, not the value that is true right now. I believe the
plan is that kdbus has ioctls that are equivalent to those RPC calls,
but without needing to wait for asynchronous socket events to get an answer.

The reason I say "or shortly afterwards" is that for the benefit of
platforms where the "best" credentials transfer mechanism behaves like
Linux SCM_CREDENTIALS, such as FreeBSD's SCM_CREDS, the beginning of a
D-Bus protocol stream is that the client sends '\0' to dbus-daemon,
accompanied by SCM_CREDS or whatever if the platform needs it. On Linux
we just send a plain '\0' with no out-of-band data at that point.

The only out-of-band data we send with individual D-Bus RPC messages
later in the connection's lifetime is for fd-passing (SCM_RIGHTS).

It would be a perfectly reasonable feature request to have individual
D-Bus messages that contain proof that, *at the time of sending*, the
sender possessed a given uid/pid/gid/capability/whatever, but we do not
currently have that feature. It would be reasonable for kdbus to have
that feature even though traditional D-Bus doesn't, and it's entirely
possible that it is a feature that would be of benefit for e.g. systemd,
but it is not required for feature parity with traditional D-Bus over
AF_UNIX; it should be included in kdbus, or not, on its own merits.

    S

^ permalink raw reply

* Re: [PATCH net-next v4 0/4] netns: allow to identify peer netns
From: Eric W. Biederman @ 2014-10-30 18:41 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	cwang-xCSkyg8dI+0RB7SZvlqPiA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1414682728-4532-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> writes:

> The goal of this serie is to be able to multicast netlink messages with an
> attribute that identify a peer netns.
> This is needed by the userland to interpret some informations contained in
> netlink messages (like IFLA_LINK value, but also some other attributes in case
> of x-netns netdevice (see also
> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>
> Ids of peer netns are set by userland via a new genl messages. These ids are
> stored per netns and are local (ie only valid in the netns where they are set).
> To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
> the id of a peer netns. Note that it will be possible to add a table (struct net
> -> id) later to optimize this lookup if needed.
>
> Patch 1/4 introduces the netlink API mechanism to set and get these ids.
> Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
> messages. And patch 4/4 shows that the netlink messages can be symetric between
> a GET and a SET.
>
> iproute2 patches are available, I can send them on demand.

A quick reply.  I think this patchset is in the right general direction.
There are some oddball details that seem odd/awkward to me such as using
genetlink instead of rtnetlink to get and set the ids, and not having
ids if they are not set (that feels like a maintenance/usability challenge).

I would like to give your patches a deep review, but I won't be able to
do that for a couple of weeks.  I am deep in the process of moving,
and will be mostly offline until about the Nov 11th.

Eric


> Here is a small screenshot to show how it can be used by userland.
>
> First, setup netns and required ids:
> $ ip netns add foo
> $ ip netns del foo
> $ ip netns
> $ touch /var/run/netns/init_net
> $ mount --bind /proc/1/ns/net /var/run/netns/init_net
> $ ip netns add foo
> $ ip netns exec foo ip netns set init_net 0
> $ ip netns
> foo
> init_net
> $ ip netns exec foo ip netns
> foo
> init_net (id: 0)
>
> Now, add and display an ipip tunnel, with its link part in init_net (id 0 in
> netns foo) and the netdevice in foo:
> $ ip netns exec foo ip link add ipip1 link-netnsid 0 type ipip remote 10.16.0.121 local 10.16.0.249
> $ ip netns exec foo ip l ls ipip1
> 6: ipip1@NONE: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default 
>     link/ipip 10.16.0.249 peer 10.16.0.121 link-netnsid 0
>
> The parameter link-netnsid shows us where the interface sends and receives
> packets (and thus we know where encapsulated addresses are set).
>
> RFCv3 -> v4:
>   rebase on net-next
>   add copyright text in the new netns.h file
>
> RFCv2 -> RFCv3:
>   ids are now defined by userland (via netlink). Ids are stored in each netns
>   (and they are local to this netns).
>   add get_link_net support for ip6 tunnels
>   netnsid is now a s32 instead of a u32
>
> RFCv1 -> RFCv2:
>   remove useless ()
>   ids are now stored in the user ns. It's possible to get an id for a peer netns
>   only if the current netns and the peer netns have the same user ns parent.
>
>  MAINTAINERS                  |   1 +
>  include/net/ip6_tunnel.h     |   1 +
>  include/net/ip_tunnels.h     |   1 +
>  include/net/net_namespace.h  |   5 ++
>  include/net/rtnetlink.h      |   2 +
>  include/uapi/linux/Kbuild    |   1 +
>  include/uapi/linux/if_link.h |   1 +
>  include/uapi/linux/netns.h   |  38 +++++++++
>  net/core/net_namespace.c     | 195 +++++++++++++++++++++++++++++++++++++++++++
>  net/core/rtnetlink.c         |  38 ++++++++-
>  net/ipv4/ip_gre.c            |   2 +
>  net/ipv4/ip_tunnel.c         |   8 ++
>  net/ipv4/ip_vti.c            |   1 +
>  net/ipv4/ipip.c              |   1 +
>  net/ipv6/ip6_gre.c           |   1 +
>  net/ipv6/ip6_tunnel.c        |   9 ++
>  net/ipv6/ip6_vti.c           |   1 +
>  net/ipv6/sit.c               |   1 +
>  net/netlink/genetlink.c      |   4 +
>  19 files changed, 308 insertions(+), 3 deletions(-)
>
> Comments are welcome.
>
> Regards,
> Nicolas

^ permalink raw reply

* Re: [PATCH net-next v4 1/4] netns: add genl cmd to add and get peer netns ids
From: Eric W. Biederman @ 2014-10-30 18:35 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	cwang-xCSkyg8dI+0RB7SZvlqPiA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1414682728-4532-2-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> writes:

> With this patch, a user can define an id for a peer netns by providing a FD or a
> PID. These ids are local to netns (ie valid only into one netns).

Scratches head.  Do you actually find value in using the pid instead of
a file descriptor?

Doing things by pid was an early attempt to make things work, and has
been a bit clutsy.  If you don't find value in it I would recommend just
supporting getting/setting the network namespace by file descriptor.

Eric

> This will be useful for netlink messages when a x-netns interface is dumped.
>
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
> ---
>  MAINTAINERS                 |   1 +
>  include/net/net_namespace.h |   5 ++
>  include/uapi/linux/Kbuild   |   1 +
>  include/uapi/linux/netns.h  |  38 +++++++++
>  net/core/net_namespace.c    | 195 ++++++++++++++++++++++++++++++++++++++++++++
>  net/netlink/genetlink.c     |   4 +
>  6 files changed, 244 insertions(+)
>  create mode 100644 include/uapi/linux/netns.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 43898b1a8a2d..de7e6fcbd5c2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6382,6 +6382,7 @@ F:	include/linux/netdevice.h
>  F:	include/uapi/linux/in.h
>  F:	include/uapi/linux/net.h
>  F:	include/uapi/linux/netdevice.h
> +F:	include/uapi/linux/netns.h
>  F:	tools/net/
>  F:	tools/testing/selftests/net/
>  F:	lib/random32.c
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index e0d64667a4b3..0f1367a71b81 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -59,6 +59,7 @@ struct net {
>  	struct list_head	exit_list;	/* Use only net_mutex */
>  
>  	struct user_namespace   *user_ns;	/* Owning user namespace */
> +	struct idr		netns_ids;
>  
>  	unsigned int		proc_inum;
>  
> @@ -289,6 +290,10 @@ static inline struct net *read_pnet(struct net * const *pnet)
>  #define __net_initconst	__initconst
>  #endif
>  
> +int peernet2id(struct net *net, struct net *peer);
> +struct net *get_net_ns_by_id(struct net *net, int id);
> +int netns_genl_register(void);
> +
>  struct pernet_operations {
>  	struct list_head list;
>  	int (*init)(struct net *net);
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index 6cad97485bad..d7f49c69585a 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -277,6 +277,7 @@ header-y += netfilter_decnet.h
>  header-y += netfilter_ipv4.h
>  header-y += netfilter_ipv6.h
>  header-y += netlink.h
> +header-y += netns.h
>  header-y += netrom.h
>  header-y += nfc.h
>  header-y += nfs.h
> diff --git a/include/uapi/linux/netns.h b/include/uapi/linux/netns.h
> new file mode 100644
> index 000000000000..2edf129377de
> --- /dev/null
> +++ b/include/uapi/linux/netns.h
> @@ -0,0 +1,38 @@
> +/* Copyright (c) 2014 6WIND S.A.
> + * Author: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + */
> +#ifndef _UAPI_LINUX_NETNS_H_
> +#define _UAPI_LINUX_NETNS_H_
> +
> +/* Generic netlink messages */
> +
> +#define NETNS_GENL_NAME			"netns"
> +#define NETNS_GENL_VERSION		0x1
> +
> +/* Commands */
> +enum {
> +	NETNS_CMD_UNSPEC,
> +	NETNS_CMD_NEWID,
> +	NETNS_CMD_GETID,
> +	__NETNS_CMD_MAX,
> +};
> +
> +#define NETNS_CMD_MAX		(__NETNS_CMD_MAX - 1)
> +
> +/* Attributes */
> +enum {
> +	NETNSA_NONE,
> +#define NETNSA_NSINDEX_UNKNOWN	-1
> +	NETNSA_NSID,
> +	NETNSA_PID,
> +	NETNSA_FD,
> +	__NETNSA_MAX,
> +};
> +
> +#define NETNSA_MAX		(__NETNSA_MAX - 1)
> +
> +#endif /* _UAPI_LINUX_NETNS_H_ */
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 7f155175bba8..4a5680ed42fb 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -15,6 +15,8 @@
>  #include <linux/file.h>
>  #include <linux/export.h>
>  #include <linux/user_namespace.h>
> +#include <linux/netns.h>
> +#include <net/genetlink.h>
>  #include <net/net_namespace.h>
>  #include <net/netns/generic.h>
>  
> @@ -144,6 +146,50 @@ static void ops_free_list(const struct pernet_operations *ops,
>  	}
>  }
>  
> +/* This function is used by idr_for_each(). If net is equal to peer, the
> + * function returns the id so that idr_for_each() stops. Because we cannot
> + * returns the id 0 (idr_for_each() will not stop), we return the magic value
> + * -1 for it.
> + */
> +static int net_eq_idr(int id, void *net, void *peer)
> +{
> +	if (net_eq(net, peer))
> +		return id ? : -1;
> +	return 0;
> +}
> +
> +/* returns NETNSA_NSINDEX_UNKNOWN if not found */
> +int peernet2id(struct net *net, struct net *peer)
> +{
> +	int id = idr_for_each(&net->netns_ids, net_eq_idr, peer);
> +
> +	ASSERT_RTNL();
> +
> +	/* Magic value for id 0. */
> +	if (id == -1)
> +		return 0;
> +	if (id == 0)
> +		return NETNSA_NSINDEX_UNKNOWN;
> +
> +	return id;
> +}
> +
> +struct net *get_net_ns_by_id(struct net *net, int id)
> +{
> +	struct net *peer;
> +
> +	if (id < 0)
> +		return NULL;
> +
> +	rcu_read_lock();
> +	peer = idr_find(&net->netns_ids, id);
> +	if (peer)
> +		get_net(peer);
> +	rcu_read_unlock();
> +
> +	return peer;
> +}
> +
>  /*
>   * setup_net runs the initializers for the network namespace object.
>   */
> @@ -158,6 +204,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
>  	atomic_set(&net->passive, 1);
>  	net->dev_base_seq = 1;
>  	net->user_ns = user_ns;
> +	idr_init(&net->netns_ids);
>  
>  #ifdef NETNS_REFCNT_DEBUG
>  	atomic_set(&net->use_count, 0);
> @@ -288,6 +335,14 @@ static void cleanup_net(struct work_struct *work)
>  	list_for_each_entry(net, &net_kill_list, cleanup_list) {
>  		list_del_rcu(&net->list);
>  		list_add_tail(&net->exit_list, &net_exit_list);
> +		for_each_net(tmp) {
> +			int id = peernet2id(tmp, net);
> +
> +			if (id >= 0)
> +				idr_remove(&tmp->netns_ids, id);
> +		}
> +		idr_destroy(&net->netns_ids);
> +
>  	}
>  	rtnl_unlock();
>  
> @@ -399,6 +454,146 @@ static struct pernet_operations __net_initdata net_ns_ops = {
>  	.exit = net_ns_net_exit,
>  };
>  
> +static struct genl_family netns_genl_family = {
> +	.id		= GENL_ID_GENERATE,
> +	.name		= NETNS_GENL_NAME,
> +	.version	= NETNS_GENL_VERSION,
> +	.hdrsize	= 0,
> +	.maxattr	= NETNSA_MAX,
> +	.netnsok	= true,
> +};
> +
> +static struct nla_policy netns_nl_policy[NETNSA_MAX + 1] = {
> +	[NETNSA_NONE]		= { .type = NLA_UNSPEC },
> +	[NETNSA_NSID]		= { .type = NLA_S32 },
> +	[NETNSA_PID]		= { .type = NLA_U32 },
> +	[NETNSA_FD]		= { .type = NLA_U32 },
> +};
> +
> +static int netns_nl_cmd_newid(struct sk_buff *skb, struct genl_info *info)
> +{
> +	struct net *net = genl_info_net(info);
> +	struct net *peer;
> +	int nsid, err;
> +
> +	if (!info->attrs[NETNSA_NSID])
> +		return -EINVAL;
> +	nsid = nla_get_s32(info->attrs[NETNSA_NSID]);
> +	if (nsid < 0)
> +		return -EINVAL;
> +
> +	if (info->attrs[NETNSA_PID])
> +		peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
> +	else if (info->attrs[NETNSA_FD])
> +		peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
> +	else
> +		return -EINVAL;
> +	if (IS_ERR(peer))
> +		return PTR_ERR(peer);
> +
> +	rtnl_lock();
> +	if (peernet2id(net, peer) >= 0) {
> +		err = -EEXIST;
> +		goto out;
> +	}
> +
> +	err = idr_alloc(&net->netns_ids, peer, nsid, nsid + 1, GFP_KERNEL);
> +	if (err >= 0)
> +		err = 0;
> +out:
> +	rtnl_unlock();
> +	put_net(peer);
> +	return err;
> +}
> +
> +static int netns_nl_get_size(void)
> +{
> +	return nla_total_size(sizeof(s32)) /* NETNSA_NSID */
> +	       ;
> +}
> +
> +static int netns_nl_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
> +			 int cmd, struct net *net, struct net *peer)
> +{
> +	void *hdr;
> +	int id;
> +
> +	hdr = genlmsg_put(skb, portid, seq, &netns_genl_family, flags, cmd);
> +	if (!hdr)
> +		return -EMSGSIZE;
> +
> +	rtnl_lock();
> +	id = peernet2id(net, peer);
> +	rtnl_unlock();
> +	if (nla_put_s32(skb, NETNSA_NSID, id))
> +		goto nla_put_failure;
> +
> +	return genlmsg_end(skb, hdr);
> +
> +nla_put_failure:
> +	genlmsg_cancel(skb, hdr);
> +	return -EMSGSIZE;
> +}
> +
> +static int netns_nl_cmd_getid(struct sk_buff *skb, struct genl_info *info)
> +{
> +	struct net *net = genl_info_net(info);
> +	struct sk_buff *msg;
> +	int err = -ENOBUFS;
> +	struct net *peer;
> +
> +	if (info->attrs[NETNSA_PID])
> +		peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
> +	else if (info->attrs[NETNSA_FD])
> +		peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
> +	else
> +		return -EINVAL;
> +
> +	if (IS_ERR(peer))
> +		return PTR_ERR(peer);
> +
> +	msg = genlmsg_new(netns_nl_get_size(), GFP_KERNEL);
> +	if (!msg) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	err = netns_nl_fill(msg, info->snd_portid, info->snd_seq,
> +			    NLM_F_ACK, NETNS_CMD_GETID, net, peer);
> +	if (err < 0)
> +		goto err_out;
> +
> +	err = genlmsg_unicast(net, msg, info->snd_portid);
> +	goto out;
> +
> +err_out:
> +	nlmsg_free(msg);
> +out:
> +	put_net(peer);
> +	return err;
> +}
> +
> +static struct genl_ops netns_genl_ops[] = {
> +	{
> +		.cmd = NETNS_CMD_NEWID,
> +		.policy = netns_nl_policy,
> +		.doit = netns_nl_cmd_newid,
> +		.flags = GENL_ADMIN_PERM,
> +	},
> +	{
> +		.cmd = NETNS_CMD_GETID,
> +		.policy = netns_nl_policy,
> +		.doit = netns_nl_cmd_getid,
> +		.flags = GENL_ADMIN_PERM,
> +	},
> +};
> +
> +int netns_genl_register(void)
> +{
> +	return genl_register_family_with_ops(&netns_genl_family,
> +					     netns_genl_ops);
> +}
> +
>  static int __init net_ns_init(void)
>  {
>  	struct net_generic *ng;
> diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> index 76393f2f4b22..c6f39e40c9f3 100644
> --- a/net/netlink/genetlink.c
> +++ b/net/netlink/genetlink.c
> @@ -1029,6 +1029,10 @@ static int __init genl_init(void)
>  	if (err)
>  		goto problem;
>  
> +	err = netns_genl_register();
> +	if (err < 0)
> +		goto problem;
> +
>  	return 0;
>  
>  problem:

^ permalink raw reply

* Re: kdbus: add code for buses, domains and endpoints
From: Djalal Harouni @ 2014-10-30 18:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Linux API,
	linux-kernel@vger.kernel.org, John Stultz, Arnd Bergmann,
	Tejun Heo, Marcel Holtmann, Ryan Lortie, Bastien Nocera,
	David Herrmann, Simon McVittie, Daniel Mack, alban.crequy,
	Javier Martinez Canillas, Tom Gundersen
In-Reply-To: <CALCETrXeNQUuwiyaJrbbbZUrSW3NTucRgpg0kKEuS3yszP-09w@mail.gmail.com>

Hi Andy,

On Thu, Oct 30, 2014 at 07:58:04AM -0700, Andy Lutomirski wrote:
> On Thu, Oct 30, 2014 at 7:48 AM, Djalal Harouni <tixxdz@opendz.org> wrote:
> > On Thu, Oct 30, 2014 at 05:15:04AM -0700, Eric W. Biederman wrote:
> >> Djalal Harouni <tixxdz@opendz.org> writes:
> >> What others are doing makes it very hard to safely use allow those
> >> ioctls in a tightly sandboxed application, as it is unpredictable
> >> what the sandboxed ioctl can do with the file descriptor.
> >>
> >> Further an application that calls setresuid at different times during
> >> it's application will behave differently.  Which makes ioctls that do
> >> not have consistent behavior after open time inappropriate for use in
> >> userspace libraries.
> > We are consistent in our checks, you say that the application will
> > behave differently when it calls setresuid() sure! If it changes its
> > creds then regain of course it will behave differently! and the checks
> > are here to make sure that setresuid() and alike work correctly when the
> > application changes its creds and calls-in.
> >
> 
> Except that it isn't consistent.
>
> If I open a postgresql socket that wants me to be root and then I drop
> privileges, I can keep talking to postresql.  This is a good thing,
> because it means that I can keep talking to postgresql but I lose my
> privilege to do other things.
Yes, that's nice :-)

But here you are not following about those capable() checks in ioctl(),
here you are referring to the send (talking) logic! which is another
thing. But hey we do not break that use case, we support it.

> The new kdbus model breaks this.  If I start as root and drop
> privileges to UID_PRIVSEP, then my attempts to communicate over
> already-open connections shouldn't consider UID_PRIVSEP.  In the, they
> shouldn't tell the other endpoints that UID_PRIVSEP exists at all
> unless I've explicitly asked the kernel for this behavior.
Yes, but kdbus tries to follow D-Bus which is primarily an RPC system,
not just a stream of bytes.

So we really want to be able to perform real time checks and authorise
method calls on the bus, and not just connections. I mean yes we do our
kdbus talk access checks on send (Talk) requests using creds of the
connection at creation time, but in the other hand we also need and have
to deal with D-Bus method requests which is the primary usecase here.

So, this is similar to AF_UNIX sockets. For them there's SCM_CREDENTIALS
and SO_PEERCRED. The former uses credentials at the time of when
messages are being sent, the latter uses the credentials at the time
when when the connection was initially established. So to not break
things, we provide similar APIs for services:

1) To get the creds of the connection (when the connection was created).

2) To get the creds of the sender of the message during send time. This
is specially relevent to authorize specific D-Bus method calls, by
checking the creds of the caller, not the one who created the kdbus
connection.

Hmm, a comparison for this model can be made with the kernel's own
syscalls: the same way syscalls are authorized or refused based on the
caller's creds at the time of the syscall... you can't hide that
information. We follow the same semantics here, and use it to allow
inter-process method calls. Having an fd passed is just a detail, the
focus should be put on the methode calls and how to perform the correct
access checks against realtime creds.

Thank you Andy for your comments!

-- 
Djalal Harouni
http://opendz.org

^ permalink raw reply

* Re: [PATCH] kernel, add panic_on_warn
From: H. Peter Anvin @ 2014-10-30 17:24 UTC (permalink / raw)
  To: Prarit Bhargava, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Andi Kleen, Jonathan Corbet,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Rusty Russell,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, jbaron-JqFfY2XvxFXQT0dZR+AlfA,
	Fabian Frederick, isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Masami Hiramatsu, Andrew Morton,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <1414688627-5298-1-git-send-email-prarit-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On 10/30/2014 10:03 AM, Prarit Bhargava wrote:
> There have been several times where I have had to rebuild a kernel to
> cause a panic when hitting a WARN() in the code in order to get a crash
> dump from a system.  Sometimes this is easy to do, other times (such as
> in the case of a remote admin) it is not trivial to send new images to the
> user.
> 
> A much easier method would be a switch to change the WARN() over to a
> panic.  This makes debugging easier in that I can now test the actual
> image the WARN() was seen on and I do not have to engage in remote
> debugging.
> 
> This patch adds a panic_on_warn kernel parameter and
> /proc/sys/kernel/panic_on_warn calls panic() in the warn_slowpath_common()
> path.  The function will still print out the location of the warning.
> 
> An example of the panic_on_warn output:
> 
> The first line below is from the WARN_ON() to output the WARN_ON()'s location.
> After that the panic() output is displayed.
> 

There is another very valid use for this: many operators would rather a
machine shuts down than being potentially compromised either
functionally or security-wise.

	-hpa

^ permalink raw reply

* Re: [RFC PATCH 1/1] arm64: Fix up /proc/cpuinfo
From: Ian Campbell @ 2014-10-30 17:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: cross-distro-cunTk1MwBs8s++Sfvej+rw@public.gmane.org,
	Catalin Marinas, Serban Constantinescu,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ghackmann-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
In-Reply-To: <20141030171521.GL32589-5wv7dgnIgG8@public.gmane.org>

On Thu, 2014-10-30 at 17:15 +0000, Will Deacon wrote:
> Hi Mark,
> 
> On Fri, Oct 24, 2014 at 02:56:40PM +0100, Mark Rutland wrote:
> > Commit d7a49086f263164a (arm64: cpuinfo: print info for all CPUs)
> > attempted to clean up /proc/cpuinfo, but due to concerns regarding
> > further changes was reverted in commit 5e39977edf6500fd (Revert "arm64:
> > cpuinfo: print info for all CPUs").
> 
> I've applied this on our devel branch, with a view to queuing it for -next
> in a week or two. I think we should also CC stable on this, so could you
> see how far back it applies please? (note that I already had to resolve
> a conflict with a patch we've got queued from Ard).

<dons Debian hat>

Since we'll be shipping 3.16 in our next release I'd love to see it go
back at least as far as that ;-)

(I'd likely apply it myself even if it doesn't appear via stable).

Ian.

^ permalink raw reply

* Re: [RFC PATCH 1/1] arm64: Fix up /proc/cpuinfo
From: Will Deacon @ 2014-10-30 17:15 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	Catalin Marinas,
	ghackmann-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	ijc-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org,
	Serban Constantinescu,
	cross-distro-cunTk1MwBs8s++Sfvej+rw@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1414159000-27059-2-git-send-email-mark.rutland-5wv7dgnIgG8@public.gmane.org>

Hi Mark,

On Fri, Oct 24, 2014 at 02:56:40PM +0100, Mark Rutland wrote:
> Commit d7a49086f263164a (arm64: cpuinfo: print info for all CPUs)
> attempted to clean up /proc/cpuinfo, but due to concerns regarding
> further changes was reverted in commit 5e39977edf6500fd (Revert "arm64:
> cpuinfo: print info for all CPUs").

I've applied this on our devel branch, with a view to queuing it for -next
in a week or two. I think we should also CC stable on this, so could you
see how far back it applies please? (note that I already had to resolve
a conflict with a patch we've got queued from Ard).

Cheers,

Will

^ permalink raw reply

* [PATCH] kernel, add panic_on_warn
From: Prarit Bhargava @ 2014-10-30 17:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Prarit Bhargava, Jonathan Corbet, Andrew Morton, Rusty Russell,
	H. Peter Anvin, Andi Kleen, Masami Hiramatsu, Fabian Frederick,
	vgoyal, isimatu.yasuaki, jbaron, linux-doc, kexec, linux-api

There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system.  Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to the
user.

A much easier method would be a switch to change the WARN() over to a
panic.  This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.

This patch adds a panic_on_warn kernel parameter and
/proc/sys/kernel/panic_on_warn calls panic() in the warn_slowpath_common()
path.  The function will still print out the location of the warning.

An example of the panic_on_warn output:

The first line below is from the WARN_ON() to output the WARN_ON()'s location.
After that the panic() output is displayed.

WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
Kernel panic - not syncing: panic_on_warn set ...

CPU: 30 PID: 11698 Comm: insmod Tainted: G        W  OE  3.17.0+ #57
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
 0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
 0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
 ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
Call Trace:
 [<ffffffff81665190>] dump_stack+0x46/0x58
 [<ffffffff8165e2ec>] panic+0xd0/0x204
 [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module]
 [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0
 [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module]
 [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20
 [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module]
 [<ffffffff81002144>] do_one_initcall+0xd4/0x210
 [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
 [<ffffffff810f8889>] load_module+0x16a9/0x1b30
 [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
 [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
 [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
 [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17

Successfully tested by me.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: vgoyal@redhat.com
Cc: isimatu.yasuaki@jp.fujitsu.com
Cc: jbaron@akamai.com
Cc: linux-doc@vger.kernel.org
Cc: kexec@lists.infradead.org
Cc: linux-api@vger.kernel.org
Signed-off-by: Prarit Bhargava <prarit@redhat.com>

[v2]: add /proc/sys/kernel/panic_on_warn, additional documentation, modify
      !slowpath cases
[v3]: use proc_dointvec_minmax() in sysctl handler
[v4]: remove !slowpath cases, and add __read_mostly
[v5]: change to panic_on_warn, re-alphabetize Documentation/sysctl/kernel.txt
[v6]: fix typo in kernel/sysctl_binary.c
---
 Documentation/kdump/kdump.txt       |    7 ++++++
 Documentation/kernel-parameters.txt |    3 +++
 Documentation/sysctl/kernel.txt     |   40 +++++++++++++++++++++++------------
 include/linux/kernel.h              |    1 +
 include/uapi/linux/sysctl.h         |    1 +
 kernel/panic.c                      |   20 +++++++++++++++++-
 kernel/sysctl.c                     |    9 ++++++++
 kernel/sysctl_binary.c              |    1 +
 8 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index 6c0b9f2..bc4bd5a 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL:
 
    http://people.redhat.com/~anderson/
 
+Trigger Kdump on WARN()
+=======================
+
+The kernel parameter, panic_on_warn, calls panic() in all WARN() paths.  This
+will cause a kdump to occur at the panic() call.  In cases where a user wants
+to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
+to achieve the same behaviour.
 
 Contact
 =======
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 74339c5..262ff3b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2495,6 +2495,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			timeout < 0: reboot immediately
 			Format: <timeout>
 
+	panic_on_warn	panic() instead of WARN().  Useful to cause kdump
+			on a WARN().
+
 	crash_kexec_post_notifiers
 			Run kdump after running panic-notifiers and dumping
 			kmsg. This only for the users who doubt kdump always
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 57baff5..b5d0c85 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -54,8 +54,9 @@ show up in /proc/sys/kernel:
 - overflowuid
 - panic
 - panic_on_oops
-- panic_on_unrecovered_nmi
 - panic_on_stackoverflow
+- panic_on_unrecovered_nmi
+- panic_on_warn
 - pid_max
 - powersave-nap               [ PPC only ]
 - printk
@@ -527,19 +528,6 @@ the recommended setting is 60.
 
 ==============================================================
 
-panic_on_unrecovered_nmi:
-
-The default Linux behaviour on an NMI of either memory or unknown is
-to continue operation. For many environments such as scientific
-computing it is preferable that the box is taken out and the error
-dealt with than an uncorrected parity/ECC error get propagated.
-
-A small number of systems do generate NMI's for bizarre random reasons
-such as power management so the default is off. That sysctl works like
-the existing panic controls already in that directory.
-
-==============================================================
-
 panic_on_oops:
 
 Controls the kernel's behaviour when an oops or BUG is encountered.
@@ -563,6 +551,30 @@ This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.
 
 ==============================================================
 
+panic_on_unrecovered_nmi:
+
+The default Linux behaviour on an NMI of either memory or unknown is
+to continue operation. For many environments such as scientific
+computing it is preferable that the box is taken out and the error
+dealt with than an uncorrected parity/ECC error get propagated.
+
+A small number of systems do generate NMI's for bizarre random reasons
+such as power management so the default is off. That sysctl works like
+the existing panic controls already in that directory.
+
+==============================================================
+
+panic_on_warn:
+
+Calls panic() in the WARN() path when set to 1.  This is useful to avoid
+a kernel rebuild when attempting to kdump at the location of a WARN().
+
+0: only WARN(), default behaviour.
+
+1: call panic() after printing out WARN() location.
+
+==============================================================
+
 perf_cpu_time_max_percent:
 
 Hints to the kernel how much CPU time it should be allowed to
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3d770f55..d60d31d 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -422,6 +422,7 @@ extern int panic_timeout;
 extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
+extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
 /*
  * Only to be used by arch init code. If the user over-wrote the default
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 43aaba1..0956373 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -153,6 +153,7 @@ enum
 	KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+	KERN_PANIC_ON_WARN=77, /* int: call panic() in WARN() functions */
 };
 
 
diff --git a/kernel/panic.c b/kernel/panic.c
index d09dc5c..2ab2168 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -33,6 +33,7 @@ static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
 static bool crash_kexec_post_notifiers;
+int panic_on_warn __read_mostly;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
 EXPORT_SYMBOL_GPL(panic_timeout);
@@ -420,13 +421,23 @@ static void warn_slowpath_common(const char *file, int line, void *caller,
 {
 	disable_trace_on_warning();
 
-	pr_warn("------------[ cut here ]------------\n");
+	if (!panic_on_warn)
+		pr_warn("------------[ cut here ]------------\n");
 	pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS()\n",
 		raw_smp_processor_id(), current->pid, file, line, caller);
 
 	if (args)
 		vprintk(args->fmt, args->args);
 
+	if (panic_on_warn) {
+		/*
+		 * A flood of WARN()s may occur.  Prevent further WARN()s
+		 * from panicking the system.
+		 */
+		panic_on_warn = 0;
+		panic("panic_on_warn set ... \n");
+	}
+
 	print_modules();
 	dump_stack();
 	print_oops_end_marker();
@@ -501,3 +512,10 @@ static int __init oops_setup(char *s)
 	return 0;
 }
 early_param("oops", oops_setup);
+
+static int __init panic_on_warn_setup(char *s)
+{
+	panic_on_warn = 1;
+	return 0;
+}
+early_param("panic_on_warn", panic_on_warn_setup);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4aada6d..38deafa 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1103,6 +1103,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+	{
+		.procname	= "panic_on_warn",
+		.data		= &panic_on_warn,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 	{ }
 };
 
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 9a4f750..7e7746a 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -137,6 +137,7 @@ static const struct bin_table bin_kern_table[] = {
 	{ CTL_INT,	KERN_COMPAT_LOG,		"compat-log" },
 	{ CTL_INT,	KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
 	{ CTL_INT,	KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
+	{ CTL_INT,	KERN_PANIC_ON_WARN,		"panic_on_warn" },
 	{}
 };
 
-- 
1.7.9.3


^ permalink raw reply related

* Re: [v5 4/5] Adds ioctl interface support for ext4 project
From: Jan Kara @ 2014-10-30 16:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Li Xi, linux-fsdevel, linux-ext4, linux-api, tytso, adilger, jack,
	viro, hch, dmonakhov
In-Reply-To: <20141026215743.GD4317@dastard>

On Mon 27-10-14 08:57:43, Dave Chinner wrote:
> On Sun, Oct 26, 2014 at 01:22:52PM +0800, Li Xi wrote:
> > This patch adds ioctl interface for setting/getting project of ext4.
> 
> If you add support for the XFS_IOC_[GS]ETXATTR ioctls, then this
> patch needs to be dropped - we don't want (or need) multiple
> interfaces to perform the same functionality.
  Completely agreed.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply

* Re: [v5 3/5] Adds project quota support for ext4
From: Jan Kara @ 2014-10-30 16:50 UTC (permalink / raw)
  To: Li Xi
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1414300973-1118-4-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>

On Sun 26-10-14 13:22:51, Li Xi wrote:
> This patch adds mount options for enabling/disabling project quota
> accounting and enforcement. A new specific inode is also used for
> project quota accounting.
  The patch looks good except for one small issue:

...
> +	limit = dquot->dq_dqb.dqb_isoftlimit ?
> +		dquot->dq_dqb.dqb_isoftlimit :
> +		dquot->dq_dqb.dqb_ihardlimit;
> +	if (limit && buf->f_files > limit) {
> +		buf->f_files = limit;
> +		buf->f_ffree =
> +			(buf->f_files > dquot->dq_dqb.dqb_curinodes) ?
> +			 (buf->f_ffree - dquot->dq_dqb.dqb_curinodes) : 0;
                              ^^^^^ here should be f_files, shouldn't it?

> +	}
> +
> +	spin_unlock(&dq_data_lock);
> +	dqput(dquot);
> +	return 0;
> +}
> +
>  static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
>  {
>  	struct super_block *sb = dentry->d_sb;

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply

* Re: [v5 2/5] Adds project ID support for ext4
From: Jan Kara @ 2014-10-30 16:18 UTC (permalink / raw)
  To: Li Xi
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
	adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1414300973-1118-3-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>

On Sun 26-10-14 13:22:50, Li Xi wrote:
> This patch adds a new internal field of ext4 inode to save project
> identifier. Also a new flag EXT4_INODE_PROJINHERIT is added for
> inheriting project ID from parent directory.
  The patch looks good. You can add:
Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>

  Just a few formatting issues below:

> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 8012a5d..78a2775 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -756,6 +756,14 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>  		inode->i_gid = dir->i_gid;
>  	} else
>  		inode_init_owner(inode, dir, mode);
> +	if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
> +		EXT4_FEATURE_RO_COMPAT_PROJECT) &&
  This looks like excessive line wrapping. You can merge the above two
lines...

> +	    ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) {
> +		ei->i_projid = EXT4_I(dir)->i_projid;
> +	} else {
> +		ei->i_projid =
> +			make_kprojid(&init_user_ns, EXT4_DEF_PROJID);
  And here as well.

> +	}
>  	dquot_initialize(inode);
>  
>  	if (!goal)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e9777f9..ab3bd51 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3886,6 +3886,15 @@ static inline void ext4_iget_extra_inode(struct inode *inode,
>  		EXT4_I(inode)->i_inline_off = 0;
>  }
>  
> +int ext4_get_projid(struct inode *inode, kprojid_t *projid)
> +{
> +	if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> +			EXT4_FEATURE_RO_COMPAT_PROJECT))
  And here as well...

> +		return -EOPNOTSUPP;
> +	*projid = EXT4_I(inode)->i_projid;
> +	return 0;
> +}
> +
>  struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>  {
>  	struct ext4_iloc iloc;
> @@ -3897,6 +3906,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>  	int block;
>  	uid_t i_uid;
>  	gid_t i_gid;
> +	projid_t i_projid;
>  
>  	inode = iget_locked(sb, ino);
>  	if (!inode)
> @@ -3946,12 +3956,19 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>  	inode->i_mode = le16_to_cpu(raw_inode->i_mode);
>  	i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
>  	i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
> +	if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
> +			EXT4_FEATURE_RO_COMPAT_PROJECT))
  And here...

> +		i_projid = (projid_t)le32_to_cpu(raw_inode->i_projid);
> +	else
> +		i_projid = EXT4_DEF_PROJID;
> +
>  	if (!(test_opt(inode->i_sb, NO_UID32))) {
>  		i_uid |= le16_to_cpu(raw_inode->i_uid_high) << 16;
>  		i_gid |= le16_to_cpu(raw_inode->i_gid_high) << 16;
>  	}
>  	i_uid_write(inode, i_uid);
>  	i_gid_write(inode, i_gid);
> +	ei->i_projid = make_kprojid(&init_user_ns, i_projid);;
>  	set_nlink(inode, le16_to_cpu(raw_inode->i_links_count));
>  
>  	ext4_clear_state_flags(ei);	/* Only relevant on 32-bit archs */
...
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 61756f9..3229604 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2931,6 +2931,11 @@ static int ext4_link(struct dentry *old_dentry,
>  	if (inode->i_nlink >= EXT4_LINK_MAX)
>  		return -EMLINK;
>  
> +	if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) &&
> +	    (!projid_eq(EXT4_I(dir)->i_projid,
> +			EXT4_I(old_dentry->d_inode)->i_projid)))
> +		return -EXDEV;
> +
>  	dquot_initialize(dir);
>  
>  retry:
> @@ -3173,6 +3178,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
>  	int force_reread;
>  	int retval;
>  
> +	if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT)) &&
> +	    (!projid_eq(EXT4_I(new_dir)->i_projid,
> +		EXT4_I(old_dentry->d_inode)->i_projid)))
  The above line is wrongly indented. Move it one tab more to the right.

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Greg Kroah-Hartman @ 2014-10-30 16:17 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-api, linux-kernel, john.stultz, tj, marcel, desrt, hadess,
	dh.herrmann, tixxdz, simon.mcvittie, daniel, alban.crequy,
	javier.martinez, teg
In-Reply-To: <4382138.U2ODUIQ55c@wuerfel>

On Thu, Oct 30, 2014 at 09:33:56AM +0100, Arnd Bergmann wrote:
> On Wednesday 29 October 2014 15:00:44 Greg Kroah-Hartman wrote:
> >  drivers/misc/Kconfig                             |    1 +
> >  drivers/misc/Makefile                            |    1 +
> >  drivers/misc/kdbus/Kconfig                       |   11 +
> >  drivers/misc/kdbus/Makefile                      |   19 +
> >  drivers/misc/kdbus/bus.c                         |  450 ++++++
> >  drivers/misc/kdbus/bus.h                         |  107 ++
> >  drivers/misc/kdbus/connection.c                  | 1751 +++++++++++++++++++++
> >  drivers/misc/kdbus/connection.h                  |  177 +++
> >  drivers/misc/kdbus/domain.c                      |  477 ++++++
> > 
> 
> One very high-level common:
> 
> Since this is going to be a very commonly used IPC mechanism, I don't
> like the idea of stuffing it into drivers/misc.
> 
> How about putting it into drivers/kdbus or ipc/kdbus instead?

ipc/kdbus seems good to me.  I didn't want to "pollute" drivers/ with
any new subdirectories, it seems to grow fast enough as it is...

greg k-h

^ permalink raw reply

* Re: [v5 1/5] Adds general codes to enforces project quota limits
From: Jan Kara @ 2014-10-30 16:05 UTC (permalink / raw)
  To: Li Xi
  Cc: linux-fsdevel, linux-ext4, linux-api, tytso, adilger, jack, viro,
	hch, dmonakhov
In-Reply-To: <1414300973-1118-2-git-send-email-lixi@ddn.com>

On Sun 26-10-14 13:22:49, Li Xi wrote:
> This patch adds support for a new quota type PRJQUOTA for project quota
> enforcement. Also a new method get_projid() is added into dquot_operations
> structure.
> 
> Signed-off-by: Li Xi <lixi@ddn.com>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
...
> @@ -72,6 +76,8 @@ static int quota_quotaon(struct super_block *sb, int type, int cmd, qid_t id,
>  		return sb->s_qcop->quota_on_meta(sb, type, id);
>  	if (IS_ERR(path))
>  		return PTR_ERR(path);
> +	if (type == PRJQUOTA && sb->dq_op->get_projid == NULL)
> +		return -EOPNOTSUPP;
>  	return sb->s_qcop->quota_on(sb, type, id, path);
>  }
  Checking for ->get_projid() in quota_quotaon() isn't necessary. This will
be already handled by allowed_qtype bitmask in my patches. But you could
add a test in vfs_load_quota_inode() just after ->quota_write and
->quota_read tests. It will be mostly a safety check but I think it's
worthwhile.

Otherwise the patch looks fine to me.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Daniel Mack @ 2014-10-30 15:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg Kroah-Hartman, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, John Stultz,
	Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
	Bastien Nocera, David Herrmann, Djalal Harouni, Simon McVittie,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ, Javier Martinez Canillas,
	Tom Gundersen
In-Reply-To: <CALCETrWjOS0AHF33zN0Vy1NC1441To7AgNPge3sKCz8bn2d8gg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 10/30/2014 03:07 PM, Andy Lutomirski wrote:
> On Thu, Oct 30, 2014 at 1:45 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>>
>> 1. At open() time, So we can tell peers (through KDBUS_CMD_CONN_INFO)
>> about the credentials a connection had when it was created with
>> KDBUS_CMD_HELLO.
> 
> Then the API that tells peers about this information needs to make
> this very clear.

Yes, that's an API that tells you something about a connection, and with
that KDBUS_CMD_CONN_INFO call, you'll always get the information about
the creator of the connection. IOW: by calling that particular ioctl,
you'll always get the same kind of information. I guess it's valid to
just define the API that way?

>> 2. When a new bus is created through KDBUS_CMD_BUS_MAKE, so peers can
>> later query the credentials of the owner of the bus they're connected to.
> 
> Ditto.  Although, why on earth should a bus have credentials?  This
> sounds like a misdesign.  It seems to me that this type of policy
> belongs all the way in userspace.  If you want a bus, you ask the
> owner of the entire domain to make you a bus.  Or you make it yourself
> and hand off references in some authenticated way.

Yes, that's the way it works. However, the idea is that a bus stays
alive as long as the file descriptor that was used to the create that
bus remains open, and it is immediately shut down when the fd is closed.
We merely allow user that are connected to a bus to query the
credentials of the creator of the bus they're connected to. So it's not
the bus which has credentials, but its original creator, at the time of
creation.

>> 3. When we dispatch a KDBUS_CMD_MSG_SEND ioctl(), because we want to
>> attach the metadata that was authoritative when the message was sent.
>> IOW: We want metadata that actually matches the message payload.
> 
> What does that "metadata that actually matches the message payload"
> mean? 

A bus client posted something in some point in time. We want metadata of
the time the message was posted.

> If I create an endpoint and delegate some processing to a less
> privileged child, other things on the bus MUST NOT be able to detect
> that delegation in any sensible design.  Otherwise everything will
> appear to work in testing because other processes never checked the
> problematic credential, but then it will randomly fail because someone
> decided to do something daft and validate my unprivileged child's
> argv[0], which is, of course, not what they expected.

Not sure whether I got your point, but if a privileged service that
takes action on behalf of unprivileged clients, it may well depend on
certain credential information to be present along with the message, and
refuse to take action otherwise.

For example, if a privileged service can reboot the system, it checks
whether the asking peer has CAP_SYS_BOOT set. If it checks for uid==0
instead, and it works in your tests because you happen to test as root,
that just a bug in the service, right? But I might have missed your point.

> I suspect that, if you make credential sending opt-in, you will
> quickly discover that the current model for which credentials matter
> makes no sense.

While for the example above, opting-in for creds items on the sender
side might actually work (the asking service would be refused in his
request to reboot the machine). However, for any sort of logging or
system services, for example, allowing the sender to select which creds
it wants to reveal is supporting a hide-and-seek game, and that's
something that won't work.

Thanks for sharing your thoughts on this - I appreciate this discussion
stays on such technical grounds :)

Daniel

^ permalink raw reply

* [PATCH net-next v4 4/4] rtnl: allow to create device with IFLA_LINK_NETNSID set
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, Nicolas Dichtel,
	luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1414682728-4532-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

This patch adds the ability to create a netdevice in a specified netns and
then move it into the final netns. In fact, it allows to have a symetry between
get and set rtnl messages.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 net/core/rtnetlink.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1b9329512496..57959a85ed2c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1211,6 +1211,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
 	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_PORT_ID_LEN },
 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
+	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1983,7 +1984,7 @@ replay:
 		struct nlattr *slave_attr[m_ops ? m_ops->slave_maxtype + 1 : 0];
 		struct nlattr **data = NULL;
 		struct nlattr **slave_data = NULL;
-		struct net *dest_net;
+		struct net *dest_net, *link_net = NULL;
 
 		if (ops) {
 			if (ops->maxtype && linkinfo[IFLA_INFO_DATA]) {
@@ -2089,7 +2090,18 @@ replay:
 		if (IS_ERR(dest_net))
 			return PTR_ERR(dest_net);
 
-		dev = rtnl_create_link(dest_net, ifname, name_assign_type, ops, tb);
+		if (tb[IFLA_LINK_NETNSID]) {
+			int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
+
+			link_net = get_net_ns_by_id(dest_net, id);
+			if (link_net == NULL) {
+				err =  -EINVAL;
+				goto out;
+			}
+		}
+
+		dev = rtnl_create_link(link_net ? : dest_net, ifname,
+				       name_assign_type, ops, tb);
 		if (IS_ERR(dev)) {
 			err = PTR_ERR(dev);
 			goto out;
@@ -2117,9 +2129,16 @@ replay:
 			}
 		}
 		err = rtnl_configure_link(dev, ifm);
-		if (err < 0)
+		if (err < 0) {
 			unregister_netdevice(dev);
+			goto out;
+		}
+
+		if (link_net)
+			err = dev_change_net_namespace(dev, dest_net, ifname);
 out:
+		if (link_net)
+			put_net(link_net);
 		put_net(dest_net);
 		return err;
 	}
-- 
2.1.0

^ permalink raw reply related

* [PATCH net-next v4 3/4] iptunnels: advertise link netns via netlink
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, Nicolas Dichtel,
	luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1414682728-4532-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 include/net/ip6_tunnel.h | 1 +
 include/net/ip_tunnels.h | 1 +
 net/ipv4/ip_gre.c        | 2 ++
 net/ipv4/ip_tunnel.c     | 8 ++++++++
 net/ipv4/ip_vti.c        | 1 +
 net/ipv4/ipip.c          | 1 +
 net/ipv6/ip6_gre.c       | 1 +
 net/ipv6/ip6_tunnel.c    | 9 +++++++++
 net/ipv6/ip6_vti.c       | 1 +
 net/ipv6/sit.c           | 1 +
 10 files changed, 26 insertions(+)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index a5593dab6af7..8648519f4555 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -69,6 +69,7 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t);
 __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw);
 __u32 ip6_tnl_get_cap(struct ip6_tnl *t, const struct in6_addr *laddr,
 			     const struct in6_addr *raddr);
+struct net *ip6_tnl_get_link_net(const struct net_device *dev);
 
 static inline void ip6tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 {
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 5bc6edeb7143..ce4ff6161fab 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -122,6 +122,7 @@ struct ip_tunnel_net {
 int ip_tunnel_init(struct net_device *dev);
 void ip_tunnel_uninit(struct net_device *dev);
 void  ip_tunnel_dellink(struct net_device *dev, struct list_head *head);
+struct net *ip_tunnel_get_link_net(const struct net_device *dev);
 int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 		       struct rtnl_link_ops *ops, char *devname);
 
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 12055fdbe716..9e2e29a8c989 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -827,6 +827,7 @@ static struct rtnl_link_ops ipgre_link_ops __read_mostly = {
 	.dellink	= ip_tunnel_dellink,
 	.get_size	= ipgre_get_size,
 	.fill_info	= ipgre_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
@@ -841,6 +842,7 @@ static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
 	.dellink	= ip_tunnel_dellink,
 	.get_size	= ipgre_get_size,
 	.fill_info	= ipgre_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static int __net_init ipgre_tap_init_net(struct net *net)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 0bb8e141eacc..3e1edd544b27 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -972,6 +972,14 @@ void ip_tunnel_dellink(struct net_device *dev, struct list_head *head)
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_dellink);
 
+struct net *ip_tunnel_get_link_net(const struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	return tunnel->net;
+}
+EXPORT_SYMBOL(ip_tunnel_get_link_net);
+
 int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 				  struct rtnl_link_ops *ops, char *devname)
 {
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 3e861011e4a3..f0fab26e4ddc 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -530,6 +530,7 @@ static struct rtnl_link_ops vti_link_ops __read_mostly = {
 	.changelink	= vti_changelink,
 	.get_size	= vti_get_size,
 	.fill_info	= vti_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static int __init vti_init(void)
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 37096d64730e..e7a183baba0a 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -498,6 +498,7 @@ static struct rtnl_link_ops ipip_link_ops __read_mostly = {
 	.dellink	= ip_tunnel_dellink,
 	.get_size	= ipip_get_size,
 	.fill_info	= ipip_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static struct xfrm_tunnel ipip_handler __read_mostly = {
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 12c3c8ef3849..5165ac7fde22 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1661,6 +1661,7 @@ static struct rtnl_link_ops ip6gre_link_ops __read_mostly = {
 	.dellink	= ip6gre_dellink,
 	.get_size	= ip6gre_get_size,
 	.fill_info	= ip6gre_fill_info,
+	.get_link_net	= ip6_tnl_get_link_net,
 };
 
 static struct rtnl_link_ops ip6gre_tap_ops __read_mostly = {
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 9409887fb664..6b2534ea9c54 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1703,6 +1703,14 @@ nla_put_failure:
 	return -EMSGSIZE;
 }
 
+struct net *ip6_tnl_get_link_net(const struct net_device *dev)
+{
+	struct ip6_tnl *tunnel = netdev_priv(dev);
+
+	return tunnel->net;
+}
+EXPORT_SYMBOL(ip6_tnl_get_link_net);
+
 static const struct nla_policy ip6_tnl_policy[IFLA_IPTUN_MAX + 1] = {
 	[IFLA_IPTUN_LINK]		= { .type = NLA_U32 },
 	[IFLA_IPTUN_LOCAL]		= { .len = sizeof(struct in6_addr) },
@@ -1726,6 +1734,7 @@ static struct rtnl_link_ops ip6_link_ops __read_mostly = {
 	.dellink	= ip6_tnl_dellink,
 	.get_size	= ip6_tnl_get_size,
 	.fill_info	= ip6_tnl_fill_info,
+	.get_link_net	= ip6_tnl_get_link_net,
 };
 
 static struct xfrm6_tunnel ip4ip6_handler __read_mostly = {
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index d440bb585524..43966dcc9603 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -992,6 +992,7 @@ static struct rtnl_link_ops vti6_link_ops __read_mostly = {
 	.changelink	= vti6_changelink,
 	.get_size	= vti6_get_size,
 	.fill_info	= vti6_fill_info,
+	.get_link_net	= ip6_tnl_get_link_net,
 };
 
 static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 58e5b4710127..c858d0eb267a 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1765,6 +1765,7 @@ static struct rtnl_link_ops sit_link_ops __read_mostly = {
 	.get_size	= ipip6_get_size,
 	.fill_info	= ipip6_fill_info,
 	.dellink	= ipip6_dellink,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static struct xfrm_tunnel sit_handler __read_mostly = {
-- 
2.1.0

^ permalink raw reply related

* [PATCH net-next v4 2/4] rtnl: add link netns id to interface messages
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, Nicolas Dichtel,
	luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1414682728-4532-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

This patch adds a new attribute (IFLA_LINK_NETNSID) which contains the 'link'
netns id when this netns is different from the netns where the interface
stands (for example for x-net interfaces like ip tunnels). When there is no id,
we put NETNSA_NSINDEX_UNKNOWN into this attribute to indicate to userland that
the link netns is different from the interface netns. Hence, userland knows that
some information like IFLA_LINK are not interpretable.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 include/net/rtnetlink.h      |  2 ++
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 13 +++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index e21b9f9653c0..6c6d5393fc34 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -46,6 +46,7 @@ static inline int rtnl_msg_family(const struct nlmsghdr *nlh)
  *			    to create when creating a new device.
  *	@get_num_rx_queues: Function to determine number of receive queues
  *			    to create when creating a new device.
+ *	@get_link_net: Function to get the i/o netns of the device
  */
 struct rtnl_link_ops {
 	struct list_head	list;
@@ -93,6 +94,7 @@ struct rtnl_link_ops {
 	int			(*fill_slave_info)(struct sk_buff *skb,
 						   const struct net_device *dev,
 						   const struct net_device *slave_dev);
+	struct net		*(*get_link_net)(const struct net_device *dev);
 };
 
 int __rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 7072d8325016..d2729f63cf01 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -145,6 +145,7 @@ enum {
 	IFLA_CARRIER,
 	IFLA_PHYS_PORT_ID,
 	IFLA_CARRIER_CHANGES,
+	IFLA_LINK_NETNSID,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a6882686ca3a..1b9329512496 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -862,6 +862,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
+	       + nla_total_size(4) /* IFLA_LINK_NETNSID */
 	       + nla_total_size(ext_filter_mask
 			        & RTEXT_FILTER_VF ? 4 : 0) /* IFLA_NUM_VF */
 	       + rtnl_vfinfo_size(dev, ext_filter_mask) /* IFLA_VFINFO_LIST */
@@ -1134,6 +1135,18 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			goto nla_put_failure;
 	}
 
+	if (dev->rtnl_link_ops &&
+	    dev->rtnl_link_ops->get_link_net) {
+		struct net *link_net = dev->rtnl_link_ops->get_link_net(dev);
+
+		if (!net_eq(dev_net(dev), link_net)) {
+			int id = peernet2id(dev_net(dev), link_net);
+
+			if (nla_put_s32(skb, IFLA_LINK_NETNSID, id))
+				goto nla_put_failure;
+		}
+	}
+
 	if (!(af_spec = nla_nest_start(skb, IFLA_AF_SPEC)))
 		goto nla_put_failure;
 
-- 
2.1.0

^ permalink raw reply related

* [PATCH net-next v4 1/4] netns: add genl cmd to add and get peer netns ids
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC (permalink / raw)
  To: netdev, containers, linux-kernel, linux-api
  Cc: davem, ebiederm, stephen, akpm, luto, cwang, Nicolas Dichtel
In-Reply-To: <1414682728-4532-1-git-send-email-nicolas.dichtel@6wind.com>

With this patch, a user can define an id for a peer netns by providing a FD or a
PID. These ids are local to netns (ie valid only into one netns).

This will be useful for netlink messages when a x-netns interface is dumped.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 MAINTAINERS                 |   1 +
 include/net/net_namespace.h |   5 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/netns.h  |  38 +++++++++
 net/core/net_namespace.c    | 195 ++++++++++++++++++++++++++++++++++++++++++++
 net/netlink/genetlink.c     |   4 +
 6 files changed, 244 insertions(+)
 create mode 100644 include/uapi/linux/netns.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 43898b1a8a2d..de7e6fcbd5c2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6382,6 +6382,7 @@ F:	include/linux/netdevice.h
 F:	include/uapi/linux/in.h
 F:	include/uapi/linux/net.h
 F:	include/uapi/linux/netdevice.h
+F:	include/uapi/linux/netns.h
 F:	tools/net/
 F:	tools/testing/selftests/net/
 F:	lib/random32.c
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index e0d64667a4b3..0f1367a71b81 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -59,6 +59,7 @@ struct net {
 	struct list_head	exit_list;	/* Use only net_mutex */
 
 	struct user_namespace   *user_ns;	/* Owning user namespace */
+	struct idr		netns_ids;
 
 	unsigned int		proc_inum;
 
@@ -289,6 +290,10 @@ static inline struct net *read_pnet(struct net * const *pnet)
 #define __net_initconst	__initconst
 #endif
 
+int peernet2id(struct net *net, struct net *peer);
+struct net *get_net_ns_by_id(struct net *net, int id);
+int netns_genl_register(void);
+
 struct pernet_operations {
 	struct list_head list;
 	int (*init)(struct net *net);
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 6cad97485bad..d7f49c69585a 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -277,6 +277,7 @@ header-y += netfilter_decnet.h
 header-y += netfilter_ipv4.h
 header-y += netfilter_ipv6.h
 header-y += netlink.h
+header-y += netns.h
 header-y += netrom.h
 header-y += nfc.h
 header-y += nfs.h
diff --git a/include/uapi/linux/netns.h b/include/uapi/linux/netns.h
new file mode 100644
index 000000000000..2edf129377de
--- /dev/null
+++ b/include/uapi/linux/netns.h
@@ -0,0 +1,38 @@
+/* Copyright (c) 2014 6WIND S.A.
+ * Author: Nicolas Dichtel <nicolas.dichtel@6wind.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+#ifndef _UAPI_LINUX_NETNS_H_
+#define _UAPI_LINUX_NETNS_H_
+
+/* Generic netlink messages */
+
+#define NETNS_GENL_NAME			"netns"
+#define NETNS_GENL_VERSION		0x1
+
+/* Commands */
+enum {
+	NETNS_CMD_UNSPEC,
+	NETNS_CMD_NEWID,
+	NETNS_CMD_GETID,
+	__NETNS_CMD_MAX,
+};
+
+#define NETNS_CMD_MAX		(__NETNS_CMD_MAX - 1)
+
+/* Attributes */
+enum {
+	NETNSA_NONE,
+#define NETNSA_NSINDEX_UNKNOWN	-1
+	NETNSA_NSID,
+	NETNSA_PID,
+	NETNSA_FD,
+	__NETNSA_MAX,
+};
+
+#define NETNSA_MAX		(__NETNSA_MAX - 1)
+
+#endif /* _UAPI_LINUX_NETNS_H_ */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7f155175bba8..4a5680ed42fb 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -15,6 +15,8 @@
 #include <linux/file.h>
 #include <linux/export.h>
 #include <linux/user_namespace.h>
+#include <linux/netns.h>
+#include <net/genetlink.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -144,6 +146,50 @@ static void ops_free_list(const struct pernet_operations *ops,
 	}
 }
 
+/* This function is used by idr_for_each(). If net is equal to peer, the
+ * function returns the id so that idr_for_each() stops. Because we cannot
+ * returns the id 0 (idr_for_each() will not stop), we return the magic value
+ * -1 for it.
+ */
+static int net_eq_idr(int id, void *net, void *peer)
+{
+	if (net_eq(net, peer))
+		return id ? : -1;
+	return 0;
+}
+
+/* returns NETNSA_NSINDEX_UNKNOWN if not found */
+int peernet2id(struct net *net, struct net *peer)
+{
+	int id = idr_for_each(&net->netns_ids, net_eq_idr, peer);
+
+	ASSERT_RTNL();
+
+	/* Magic value for id 0. */
+	if (id == -1)
+		return 0;
+	if (id == 0)
+		return NETNSA_NSINDEX_UNKNOWN;
+
+	return id;
+}
+
+struct net *get_net_ns_by_id(struct net *net, int id)
+{
+	struct net *peer;
+
+	if (id < 0)
+		return NULL;
+
+	rcu_read_lock();
+	peer = idr_find(&net->netns_ids, id);
+	if (peer)
+		get_net(peer);
+	rcu_read_unlock();
+
+	return peer;
+}
+
 /*
  * setup_net runs the initializers for the network namespace object.
  */
@@ -158,6 +204,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	atomic_set(&net->passive, 1);
 	net->dev_base_seq = 1;
 	net->user_ns = user_ns;
+	idr_init(&net->netns_ids);
 
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_set(&net->use_count, 0);
@@ -288,6 +335,14 @@ static void cleanup_net(struct work_struct *work)
 	list_for_each_entry(net, &net_kill_list, cleanup_list) {
 		list_del_rcu(&net->list);
 		list_add_tail(&net->exit_list, &net_exit_list);
+		for_each_net(tmp) {
+			int id = peernet2id(tmp, net);
+
+			if (id >= 0)
+				idr_remove(&tmp->netns_ids, id);
+		}
+		idr_destroy(&net->netns_ids);
+
 	}
 	rtnl_unlock();
 
@@ -399,6 +454,146 @@ static struct pernet_operations __net_initdata net_ns_ops = {
 	.exit = net_ns_net_exit,
 };
 
+static struct genl_family netns_genl_family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= NETNS_GENL_NAME,
+	.version	= NETNS_GENL_VERSION,
+	.hdrsize	= 0,
+	.maxattr	= NETNSA_MAX,
+	.netnsok	= true,
+};
+
+static struct nla_policy netns_nl_policy[NETNSA_MAX + 1] = {
+	[NETNSA_NONE]		= { .type = NLA_UNSPEC },
+	[NETNSA_NSID]		= { .type = NLA_S32 },
+	[NETNSA_PID]		= { .type = NLA_U32 },
+	[NETNSA_FD]		= { .type = NLA_U32 },
+};
+
+static int netns_nl_cmd_newid(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net *net = genl_info_net(info);
+	struct net *peer;
+	int nsid, err;
+
+	if (!info->attrs[NETNSA_NSID])
+		return -EINVAL;
+	nsid = nla_get_s32(info->attrs[NETNSA_NSID]);
+	if (nsid < 0)
+		return -EINVAL;
+
+	if (info->attrs[NETNSA_PID])
+		peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+	else if (info->attrs[NETNSA_FD])
+		peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+	else
+		return -EINVAL;
+	if (IS_ERR(peer))
+		return PTR_ERR(peer);
+
+	rtnl_lock();
+	if (peernet2id(net, peer) >= 0) {
+		err = -EEXIST;
+		goto out;
+	}
+
+	err = idr_alloc(&net->netns_ids, peer, nsid, nsid + 1, GFP_KERNEL);
+	if (err >= 0)
+		err = 0;
+out:
+	rtnl_unlock();
+	put_net(peer);
+	return err;
+}
+
+static int netns_nl_get_size(void)
+{
+	return nla_total_size(sizeof(s32)) /* NETNSA_NSID */
+	       ;
+}
+
+static int netns_nl_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
+			 int cmd, struct net *net, struct net *peer)
+{
+	void *hdr;
+	int id;
+
+	hdr = genlmsg_put(skb, portid, seq, &netns_genl_family, flags, cmd);
+	if (!hdr)
+		return -EMSGSIZE;
+
+	rtnl_lock();
+	id = peernet2id(net, peer);
+	rtnl_unlock();
+	if (nla_put_s32(skb, NETNSA_NSID, id))
+		goto nla_put_failure;
+
+	return genlmsg_end(skb, hdr);
+
+nla_put_failure:
+	genlmsg_cancel(skb, hdr);
+	return -EMSGSIZE;
+}
+
+static int netns_nl_cmd_getid(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net *net = genl_info_net(info);
+	struct sk_buff *msg;
+	int err = -ENOBUFS;
+	struct net *peer;
+
+	if (info->attrs[NETNSA_PID])
+		peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+	else if (info->attrs[NETNSA_FD])
+		peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+	else
+		return -EINVAL;
+
+	if (IS_ERR(peer))
+		return PTR_ERR(peer);
+
+	msg = genlmsg_new(netns_nl_get_size(), GFP_KERNEL);
+	if (!msg) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	err = netns_nl_fill(msg, info->snd_portid, info->snd_seq,
+			    NLM_F_ACK, NETNS_CMD_GETID, net, peer);
+	if (err < 0)
+		goto err_out;
+
+	err = genlmsg_unicast(net, msg, info->snd_portid);
+	goto out;
+
+err_out:
+	nlmsg_free(msg);
+out:
+	put_net(peer);
+	return err;
+}
+
+static struct genl_ops netns_genl_ops[] = {
+	{
+		.cmd = NETNS_CMD_NEWID,
+		.policy = netns_nl_policy,
+		.doit = netns_nl_cmd_newid,
+		.flags = GENL_ADMIN_PERM,
+	},
+	{
+		.cmd = NETNS_CMD_GETID,
+		.policy = netns_nl_policy,
+		.doit = netns_nl_cmd_getid,
+		.flags = GENL_ADMIN_PERM,
+	},
+};
+
+int netns_genl_register(void)
+{
+	return genl_register_family_with_ops(&netns_genl_family,
+					     netns_genl_ops);
+}
+
 static int __init net_ns_init(void)
 {
 	struct net_generic *ng;
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 76393f2f4b22..c6f39e40c9f3 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1029,6 +1029,10 @@ static int __init genl_init(void)
 	if (err)
 		goto problem;
 
+	err = netns_genl_register();
+	if (err < 0)
+		goto problem;
+
 	return 0;
 
 problem:
-- 
2.1.0

^ permalink raw reply related

* [PATCH net-next v4 0/4] netns: allow to identify peer netns
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	luto-kltTT9wpgjJwATOyAt5JVQ, cwang-xCSkyg8dI+0RB7SZvlqPiA
In-Reply-To: <1412257690-31253-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

The goal of this serie is to be able to multicast netlink messages with an
attribute that identify a peer netns.
This is needed by the userland to interpret some informations contained in
netlink messages (like IFLA_LINK value, but also some other attributes in case
of x-netns netdevice (see also
http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).

Ids of peer netns are set by userland via a new genl messages. These ids are
stored per netns and are local (ie only valid in the netns where they are set).
To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
the id of a peer netns. Note that it will be possible to add a table (struct net
-> id) later to optimize this lookup if needed.

Patch 1/4 introduces the netlink API mechanism to set and get these ids.
Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
messages. And patch 4/4 shows that the netlink messages can be symetric between
a GET and a SET.

iproute2 patches are available, I can send them on demand.

Here is a small screenshot to show how it can be used by userland.

First, setup netns and required ids:
$ ip netns add foo
$ ip netns del foo
$ ip netns
$ touch /var/run/netns/init_net
$ mount --bind /proc/1/ns/net /var/run/netns/init_net
$ ip netns add foo
$ ip netns exec foo ip netns set init_net 0
$ ip netns
foo
init_net
$ ip netns exec foo ip netns
foo
init_net (id: 0)

Now, add and display an ipip tunnel, with its link part in init_net (id 0 in
netns foo) and the netdevice in foo:
$ ip netns exec foo ip link add ipip1 link-netnsid 0 type ipip remote 10.16.0.121 local 10.16.0.249
$ ip netns exec foo ip l ls ipip1
6: ipip1@NONE: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default 
    link/ipip 10.16.0.249 peer 10.16.0.121 link-netnsid 0

The parameter link-netnsid shows us where the interface sends and receives
packets (and thus we know where encapsulated addresses are set).

RFCv3 -> v4:
  rebase on net-next
  add copyright text in the new netns.h file

RFCv2 -> RFCv3:
  ids are now defined by userland (via netlink). Ids are stored in each netns
  (and they are local to this netns).
  add get_link_net support for ip6 tunnels
  netnsid is now a s32 instead of a u32

RFCv1 -> RFCv2:
  remove useless ()
  ids are now stored in the user ns. It's possible to get an id for a peer netns
  only if the current netns and the peer netns have the same user ns parent.

 MAINTAINERS                  |   1 +
 include/net/ip6_tunnel.h     |   1 +
 include/net/ip_tunnels.h     |   1 +
 include/net/net_namespace.h  |   5 ++
 include/net/rtnetlink.h      |   2 +
 include/uapi/linux/Kbuild    |   1 +
 include/uapi/linux/if_link.h |   1 +
 include/uapi/linux/netns.h   |  38 +++++++++
 net/core/net_namespace.c     | 195 +++++++++++++++++++++++++++++++++++++++++++
 net/core/rtnetlink.c         |  38 ++++++++-
 net/ipv4/ip_gre.c            |   2 +
 net/ipv4/ip_tunnel.c         |   8 ++
 net/ipv4/ip_vti.c            |   1 +
 net/ipv4/ipip.c              |   1 +
 net/ipv6/ip6_gre.c           |   1 +
 net/ipv6/ip6_tunnel.c        |   9 ++
 net/ipv6/ip6_vti.c           |   1 +
 net/ipv6/sit.c               |   1 +
 net/netlink/genetlink.c      |   4 +
 19 files changed, 308 insertions(+), 3 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply

* Re: kdbus: add code for buses, domains and endpoints
From: Andy Lutomirski @ 2014-10-30 14:58 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, John Stultz,
	Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
	Bastien Nocera, David Herrmann, Simon McVittie, Daniel Mack,
	alban.crequy, Javier Martinez Canillas, Tom Gundersen
In-Reply-To: <20141030144855.GA9705@dztty>

On Thu, Oct 30, 2014 at 7:48 AM, Djalal Harouni <tixxdz-Umm1ozX2/EEdnm+yROfE0A@public.gmane.org> wrote:
> On Thu, Oct 30, 2014 at 05:15:04AM -0700, Eric W. Biederman wrote:
>> Djalal Harouni <tixxdz-Umm1ozX2/EEdnm+yROfE0A@public.gmane.org> writes:
>> What others are doing makes it very hard to safely use allow those
>> ioctls in a tightly sandboxed application, as it is unpredictable
>> what the sandboxed ioctl can do with the file descriptor.
>>
>> Further an application that calls setresuid at different times during
>> it's application will behave differently.  Which makes ioctls that do
>> not have consistent behavior after open time inappropriate for use in
>> userspace libraries.
> We are consistent in our checks, you say that the application will
> behave differently when it calls setresuid() sure! If it changes its
> creds then regain of course it will behave differently! and the checks
> are here to make sure that setresuid() and alike work correctly when the
> application changes its creds and calls-in.
>

Except that it isn't consistent.

If I open a postgresql socket that wants me to be root and then I drop
privileges, I can keep talking to postresql.  This is a good thing,
because it means that I can keep talking to postgresql but I lose my
privilege to do other things.

The new kdbus model breaks this.  If I start as root and drop
privileges to UID_PRIVSEP, then my attempts to communicate over
already-open connections shouldn't consider UID_PRIVSEP.  In the, they
shouldn't tell the other endpoints that UID_PRIVSEP exists at all
unless I've explicitly asked the kernel for this behavior.

I suggest reading up on the object capability model.  Linux isn't one,
but large deviations (like kdbus') from an object capability model are
rarely a good thing.

--Andy

^ permalink raw reply

* Re: kdbus: add code for buses, domains and endpoints
From: Djalal Harouni @ 2014-10-30 14:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, arnd-r2nGTMty4D4,
	tj-DgEjT+Ai2ygdnm+yROfE0A, marcel-kz+m5ild9QBg9hUCZPvPmw,
	desrt-0xnayjDhYQY, hadess-0MeiytkfxGOsTnJN9+BGXg,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, teg-B22kvLQNl6c,
	Andy Lutomirski
In-Reply-To: <87wq7hiwjb.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Thu, Oct 30, 2014 at 05:15:04AM -0700, Eric W. Biederman wrote:
> Djalal Harouni <tixxdz-Umm1ozX2/EEdnm+yROfE0A@public.gmane.org> writes:
> 
> > On Wed, Oct 29, 2014 at 08:59:44PM -0700, Eric W. Biederman wrote:
> >> Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> writes:
> >> 
> >> The way capabilities are checked in this patch make me very nervous.
> >> 
> >> We are not checking permissions at open time.  Every other location
> >> of calling capable on file like objects has been show to be suceptible
> >> to file descriptor pass attacks.
> > Yes, I do understand the concern, this is valid for some cases! but we
> > can't apply it on the ioctl API ?! please see below:
> >
> > All (perhaps not all) the current ioctl do not check for fd passing
> > attacks! if a privileged do arbitrary ioctl on untrusted fds we are
> > already owned... the dumb privileged process is the one to blame, right?
> >
> >
> > Example:
> > 1) fs/ext4/ioctl.c:ext4_ioctl()
> >    they have:
> >    inode_owner_or_capable() + capable() checks
> >
> >    for all the restricted ioctl()
> >
> > 2) fs/xfs/xfs_ioctl.c:xfs_file_ioctl()
> >    they have:
> >    capable() checks
> >
> > 3) fs/btrfs/ioctl.c:btrfs_ioctl()
> >    they have capable() + inode_owner_or_capable()
> >
> > ... long list
> >
> > These are sensible API and they do not care at all about fd passing,
> > so I don't think we should care either ?! or perhaps I'm missing
> > something ?
> 
> - It is an easy mistake to make.
> - We have not performed extensive audits of the capable calls at this
>   time to veryify that fd passing is safe.
> - Unless it is egregious we are likely to grandfather the existing usage
>   in to avoid breaking userspace.
> 
> None of that is an excuse for kdbus to get it wrong once it has been
> pointed out in review.
Of course! but our goal here is not to produce some sort of new
capability checks or new security mechanisms in this field. We want to
follow what other API are doing and be consistent. So every one who reads
the code can understand it, it is the standard API, the standard scheme
used in every crucial part of the kernel. If there is really some sort
of proven bugs affecting these ioctl() API say in ext4, btrfs or other
devices, in this case we need to follow and update, we have too!


> > The capable() is done as it is, and for the inode_owner_or_capable() you
> > will notice that we followed the same logic and did use it in our
> > kdbus_bus_uid_is_privileged() to stay safe and follow what other API are
> > doing.
> 
> What others are doing makes it very hard to safely use allow those
> ioctls in a tightly sandboxed application, as it is unpredictable
> what the sandboxed ioctl can do with the file descriptor.
> 
> Further an application that calls setresuid at different times during
> it's application will behave differently.  Which makes ioctls that do
> not have consistent behavior after open time inappropriate for use in
> userspace libraries.
We are consistent in our checks, you say that the application will
behave differently when it calls setresuid() sure! If it changes its
creds then regain of course it will behave differently! and the checks
are here to make sure that setresuid() and alike work correctly when the
application changes its creds and calls-in.

Thanks!

-- 
Djalal Harouni
http://opendz.org

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Greg Kroah-Hartman @ 2014-10-30 14:47 UTC (permalink / raw)
  To: Karol Lewandowski
  Cc: Jiri Kosina, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	John Stultz, Arnd Bergmann, Tejun Heo, Ryan Lortie,
	Simon McVittie, daniel-cYrQPVfZoowdnm+yROfE0A, David Herrmann,
	Paul Moore,
	casey.schaufler-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
	marcel-kz+m5ild9QBg9hUCZPvPmw, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ
In-Reply-To: <54521697.1030900-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>

On Thu, Oct 30, 2014 at 11:44:39AM +0100, Karol Lewandowski wrote:
> [ Sorry for breaking thread and resend - gmane rejected my original message
>   due to too long list of recipients... ]
> 
> On 2014-10-30 00:40, Greg Kroah-Hartman wrote:
> 
> > There is a 1815 line documentation file in this series, so we aren't
> > trying to not provide this type of information here at all.  But yes,
> > more background, about why this can't be done in userspace (zero copy,
> > less context switches, proper credential passing, timestamping, availble
> > at early-boot, LSM hooks for security models to tie into
> 
> While you're at it... I did some work on proof-of-concept LSM patches for
> kdbus some time ago, see [1][2].  Currently, these are completely of date.
> 
>  [1] https://github.com/lmctl/linux/commits/kdbus-lsm-v4.for-systemd-v212
>  [2] https://github.com/lmctl/kdbus/commit/aa0885489d19be92fa41c6f0a71df28763228a40
> 
> May I ask if you guys have your own plan for LSM or maybe it would be
> worth to resurrect [1]?

The core calls are already mediated by LSM today, right?  We don't want
anyone to be parsing the data stream through an LSM, that idea got
rejected a long time ago as something that is really not a good idea.

Other than that, I don't know exactly what your patches do, or why they
are needed, care to go into details?

thanks,

greg k-h

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Andy Lutomirski @ 2014-10-30 14:07 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Greg Kroah-Hartman, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, John Stultz,
	Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
	Bastien Nocera, David Herrmann, Djalal Harouni, Simon McVittie,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ, Javier Martinez Canillas,
	Tom Gundersen
In-Reply-To: <5451FA9B.8070501-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

On Thu, Oct 30, 2014 at 1:45 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>
> 1. At open() time, So we can tell peers (through KDBUS_CMD_CONN_INFO)
> about the credentials a connection had when it was created with
> KDBUS_CMD_HELLO.

Then the API that tells peers about this information needs to make
this very clear.

>
> 2. When a new bus is created through KDBUS_CMD_BUS_MAKE, so peers can
> later query the credentials of the owner of the bus they're connected to.
>

Ditto.  Although, why on earth should a bus have credentials?  This
sounds like a misdesign.  It seems to me that this type of policy
belongs all the way in userspace.  If you want a bus, you ask the
owner of the entire domain to make you a bus.  Or you make it yourself
and hand off references in some authenticated way.

> 3. When we dispatch a KDBUS_CMD_MSG_SEND ioctl(), because we want to
> attach the metadata that was authoritative when the message was sent.
> IOW: We want metadata that actually matches the message payload.
>

What does that "metadata that actually matches the message payload"
mean?  If I create an endpoint and delegate some processing to a less
privileged child, other things on the bus MUST NOT be able to detect
that delegation in any sensible design.  Otherwise everything will
appear to work in testing because other processes never checked the
problematic credential, but then it will randomly fail because someone
decided to do something daft and validate my unprivileged child's
argv[0], which is, of course, not what they expected.

I suspect that, if you make credential sending opt-in, you will
quickly discover that the current model for which credentials matter
makes no sense.

>> And the ns_eq stuff is too far buried (and not even contained in this
>> patch!) to be easily verified as being correct, whatever correct means
>> in that context.
>
> I see that. As I explained earlier in my reply to Eric, we've chosen to
> submit the patch set this way to keep the reply threading clean, so it
> was some sort of a trade-off. Still, I think the best way to review it
> is to pull in Greg's patches and look at the actual files.

This wasn't a comment about the threading.  The call, in the patches
in the git tree, is buried and very difficult to follow.

--Andy

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Andy Lutomirski @ 2014-10-30 13:59 UTC (permalink / raw)
  To: Tom Gundersen
  Cc: Greg Kroah-Hartman, Jiri Kosina, Linux API,
	linux-kernel@vger.kernel.org, John Stultz, Arnd Bergmann,
	Tejun Heo, Marcel Holtmann, Ryan Lortie, Bastien Nocera,
	David Herrmann, Djalal Harouni, Simon McVittie, Daniel Mack,
	alban.crequy, Javier Martinez Canillas
In-Reply-To: <CAG-2HqX9RUQHiF1U_CXiDVVLS-7aUOQdYn7EVNSMZNdbe38cTA@mail.gmail.com>

On Thu, Oct 30, 2014 at 4:52 AM, Tom Gundersen <teg@jklm.no> wrote:
> On 10/30/2014 12:55 AM, Andy Lutomirski wrote:> It's worth noting that:
>>
>>  - Proper credential passing could be added to UNIX sockets, and we
>> may want to do that anyway.  Also, the current kdbus semantics seem to
>> be "spew lots of credentials and other miscellaneous
>> potentially-sensitive and sometime spoofable information all over the
>> place", which isn't obviously an improvement.  (This is fixable, but
>> it will almost certainly not be compatible with current systemd kdbus
>> code if fixed.)
>
> Care to elaborate on what you think is spoofable, and what needs to be fixed?

cmd and comm are trivially replaceable by any sender.

>
> Anyway, the idea is that by simply connecting to the bus and sending a
> message to some service, you implicitly agree to passing some metadata
> along to the service (and to a lesser extent to the bus). It's not
> that this information is leaked, or that the peer could actively
> access any of the sender's private memory.

To me, this smells like bad design.  By using kdbus, I implicitly
agree to send everyone my command line?!?  If I'm in a cgroup that
policy decrees should be privileged, then I should invoke that
privilege by specifically asking, *at the time of capture*, to send
that cgroup.  Otherwise it becomes unclear what things convey
privilege when, and that will lead immediately to incomprehensible
security models, and that will lead to exploits.

<snark>Sorry, but "implicitly agree" sounds a lot like using my
esteemed cellphone carrier.  When I use it, some argue that I
implicitly agree to have my identity prepended to all outgoing HTTP
requests.  This is *not* a good thing.</snark>

> Also note that this kind of
> metadata information is also available via /proc/$PID, and via
> SCM_CREDENTIALS/SO_PEERCRED and the socket seclabel APIs.

Not if you have a sensible LSM policy or if you use hidepid.  And,
once you've fixed the namespacing issues, not if the sender and
receiver are in different PID namespaces or if they don't have /proc
mounted at all.

>
> When credential information is passed between processes of different
> (PID) namespaces most of the attached metadata is suppressed.

This is a bug.  It prevents users from usefully sandboxing themselves
in a kdbus world.  If you create and enter a user namespace, then your
outside identity (which should be unchanged) is suppressed.  (Note
that anything that captures credentials other than at open time is
also an issue for sandboxes in the other direction: it may interfere
with selective privilege dropping.)

> This
> isn't too different from how SCM_CREDENTIALS works, which will zero
> out the bits it cannot translate as well.

SCM_CREDENTIALS translates the translatable parts.

> There are some major benefits regarding performance:
>
> * fewer userspace context switches. For a full-duplex method call it's
> down from five to two: instead of sender -> dbus daemon -> service ->
> dbus daemon -> sender it's just sender -> service -> sender.
> * fewer message copies in userspace. For a full-duplex method call
> it's down from eight to two: instead of copying the method call data
> into a socket, out of a socket, into a socket, out of a socket, and
> the same for the method reply, we just copy one message directly to
> the receiver, and the reply back.
> * generally fewer syscalls involved. A synchronous method call is now
> doable in a single ioctl on the sender side.
> * memfds can be used for transport purposes of larger payload. This
> way, we can cover substantial payload sizes instead of just small
> control messages, with no extra copies. kdbus, in its transport layer,
> makes sure only sealed memfds are passed in as payload, so the sender
> cannot modify the contents while the receiver is already parsing it.

There should be a number measured in, say, nanoseconds in here
somewhere.  The actual extent of the speedup is unmeasurable here.
Also, it's worth reading at least one of Linus' many rants about
zero-copy.  It's not an automatic win.

--Andy

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox