Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
From: Yann Droneaud @ 2015-01-29 20:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150129191801.GM11842-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Hi,

Le jeudi 29 janvier 2015 à 12:18 -0700, Jason Gunthorpe a écrit :
> On Thu, Jan 29, 2015 at 07:43:29PM +0100, Yann Droneaud wrote:
> 
> > The write() syscall must return the size buffer passed to it, or
> > less, but in such case it would ask for trouble as userspace would
> > be allowed to write() the remaining bytes.  Returning a size bigger
> > than the one passed to write() is not acceptable and would break any
> > expectation.
> 
> By that logic the 0 return is still wrong, and it should be ucore->in_len
> 

This is handled by ib_uverbs_write() in
drivers/infiniband/core/uverbs_main.c:

    709                 if (err)
    710                         return err;
    711 
    712                 return written_count;

> But I think we can return less without risking anything breaking, it
> already violates the invariant associated with write() - it mutates
> the buffer passed in!
> 

I don't think so, as only the response buffer is written to and the
response buffer pointer is provided in the buffer given to write().

AFAIK, no uverbs ever change the content of the input buffer (eg. the
request): I've managed to declare the various input buffers "const" so
it would surprising to find it use for writing to userspace.

Anyway, I recognize that uverb way of abusing write() syscall is 
borderline (at best) regarding other Linux subsystems and Unix paradigm 
in general. But it's not enough to screw it more.

Regards.

-- 
Yann Droneaud
OPTEYA

^ permalink raw reply

* Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
From: Yann Droneaud @ 2015-01-29 20:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1422557009.3133.172.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

Le jeudi 29 janvier 2015 à 19:43 +0100, Yann Droneaud a écrit :
> Le jeudi 29 janvier 2015 à 11:28 -0700, Jason Gunthorpe a écrit :
> > On Thu, Jan 29, 2015 at 06:59:58PM +0100, Yann Droneaud wrote:

> > > -		resp.odp_caps.general_caps = attr.odp_caps.general_caps;
> > > -		resp.odp_caps.per_transport_caps.rc_odp_caps =
> > > -			attr.odp_caps.per_transport_caps.rc_odp_caps;
> > > -		resp.odp_caps.per_transport_caps.uc_odp_caps =
> > > -			attr.odp_caps.per_transport_caps.uc_odp_caps;
> > > -		resp.odp_caps.per_transport_caps.ud_odp_caps =
> > > -			attr.odp_caps.per_transport_caps.ud_odp_caps;
> > > -		resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;
> > > -	}
> > > +	resp.odp_caps.general_caps = attr.odp_caps.general_caps;
> > > +	resp.odp_caps.per_transport_caps.rc_odp_caps =
> > > +		attr.odp_caps.per_transport_caps.rc_odp_caps;
> > > +	resp.odp_caps.per_transport_caps.uc_odp_caps =
> > > +		attr.odp_caps.per_transport_caps.uc_odp_caps;
> > > +	resp.odp_caps.per_transport_caps.ud_odp_caps =
> > > +		attr.odp_caps.per_transport_caps.ud_odp_caps;
> > >  #endif
> > > +	resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;
> > 
> > Not sure about this - if 0 is a valid null answer for all the _caps
> > then it is fine, and the comp_mask bit should just be removed as the
> > size alone should be enough.
> > 
> 
> That's difficult to say. But I hope 0 are valids answers in this case.
> 

Hum, according to include/rdma/ib_verbs.h, there's a bit set
for ODP support:

    148 enum ib_odp_general_cap_bits {
    149         IB_ODP_SUPPORT = 1 << 0,
    150 };

So it should be safe to set everything to 0 in struct
ib_uverbs_odp_caps.

Regards.

-- 
Yann Droneaud
OPTEYA

^ permalink raw reply

* [PATCH v12 2/2] crypto: AF_ALG: enable AEAD interface compilation
From: Stephan Mueller @ 2015-01-29 20:26 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, 'Linux API'
In-Reply-To: <1681008.mocmysOoQU-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>

Enable compilation of the AEAD AF_ALG support and provide a Kconfig
option to compile the AEAD AF_ALG support.

Signed-off-by: Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>
---
 crypto/Kconfig  | 9 +++++++++
 crypto/Makefile | 1 +
 2 files changed, 10 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 50f4da4..41a3fc5 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1523,6 +1523,15 @@ config CRYPTO_USER_API_RNG
 	  This option enables the user-spaces interface for random
 	  number generator algorithms.
 
+config CRYPTO_USER_API_AEAD
+	tristate "User-space interface for AEAD cipher algorithms"
+	depends on NET
+	select CRYPTO_AEAD
+	select CRYPTO_USER_API
+	help
+	  This option enables the user-spaces interface for AEAD
+	  cipher algorithms.
+
 config CRYPTO_HASH_INFO
 	bool
 
diff --git a/crypto/Makefile b/crypto/Makefile
index ba19465..97b7d3a 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
 obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
 obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o
+obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
 
 #
 # generic algorithms and the async_tx api
-- 
2.1.0

^ permalink raw reply related

* [PATCH v12 1/2] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2015-01-29 20:24 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
	linux-crypto, 'Linux API'
In-Reply-To: <1681008.mocmysOoQU@tachyon.chronox.de>

This patch adds the AEAD support for AF_ALG.

The implementation is based on algif_skcipher, but contains heavy
modifications to streamline the interface for AEAD uses.

To use AEAD, the user space consumer has to use the salg_type named
"aead".

The AEAD implementation includes some overhead to calculate the size of
the ciphertext, because the AEAD implementation of the kernel crypto API
makes implied assumption on the location of the authentication tag. When
performing an encryption, the tag will be added to the created
ciphertext (note, the tag is placed adjacent to the ciphertext). For
decryption, the caller must hand in the ciphertext with the tag appended
to the ciphertext. Therefore, the selection of the used memory
needs to add/subtract the tag size from the source/destination buffers
depending on the encryption type. The code is provided with comments
explaining when and how that operation is performed.

A fully working example using all aspects of AEAD is provided at
http://www.chronox.de/libkcapi.html

Signed-off-by: Stephan Mueller <smueller@chronox.de>
---
 crypto/algif_aead.c | 666 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 666 insertions(+)
 create mode 100644 crypto/algif_aead.c

diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
new file mode 100644
index 0000000..fe3dfdd
--- /dev/null
+++ b/crypto/algif_aead.c
@@ -0,0 +1,666 @@
+/*
+ * algif_aead: User-space interface for AEAD algorithms
+ *
+ * Copyright (C) 2014, Stephan Mueller <smueller@chronox.de>
+ *
+ * This file provides the user-space API for AEAD ciphers.
+ *
+ * This file is derived from algif_skcipher.c.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <crypto/scatterwalk.h>
+#include <crypto/if_alg.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <net/sock.h>
+
+struct aead_sg_list {
+	unsigned int cur;
+	struct scatterlist sg[ALG_MAX_PAGES];
+};
+
+struct aead_ctx {
+	struct aead_sg_list tsgl;
+	/*
+	 * RSGL_MAX_ENTRIES is an artificial limit where user space at maximum
+	 * can cause the kernel to allocate RSGL_MAX_ENTRIES * ALG_MAX_PAGES
+	 * bytes
+	 */
+#define RSGL_MAX_ENTRIES ALG_MAX_PAGES
+	struct af_alg_sgl rsgl[RSGL_MAX_ENTRIES];
+
+	void *iv;
+
+	struct af_alg_completion completion;
+
+	unsigned long used;
+
+	unsigned int len;
+	bool more;
+	bool merge;
+	bool enc;
+
+	size_t aead_assoclen;
+	struct aead_request aead_req;
+};
+
+static inline int aead_sndbuf(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+
+	return max_t(int, max_t(int, sk->sk_sndbuf & PAGE_MASK, PAGE_SIZE) -
+			  ctx->used, 0);
+}
+
+static inline bool aead_writable(struct sock *sk)
+{
+	return PAGE_SIZE <= aead_sndbuf(sk);
+}
+
+static inline bool aead_sufficient_data(struct aead_ctx *ctx)
+{
+	unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+
+	return (ctx->used >= (ctx->aead_assoclen + (ctx->enc ? 0 : as)));
+}
+
+static void aead_put_sgl(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	struct scatterlist *sg = sgl->sg;
+	unsigned int i;
+
+	for (i = 0; i < sgl->cur; i++) {
+		if (!sg_page(sg + i))
+			continue;
+
+		put_page(sg_page(sg + i));
+		sg_assign_page(sg + i, NULL);
+	}
+	sgl->cur = 0;
+	ctx->used = 0;
+	ctx->more = 0;
+	ctx->merge = 0;
+}
+
+static void aead_wmem_wakeup(struct sock *sk)
+{
+	struct socket_wq *wq;
+
+	if (!aead_writable(sk))
+		return;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
+							   POLLRDNORM |
+							   POLLRDBAND);
+	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	rcu_read_unlock();
+}
+
+static int aead_wait_for_data(struct sock *sk, unsigned flags)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	long timeout;
+	DEFINE_WAIT(wait);
+	int err = -ERESTARTSYS;
+
+	if (flags & MSG_DONTWAIT)
+		return -EAGAIN;
+
+	set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+	for (;;) {
+		if (signal_pending(current))
+			break;
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		timeout = MAX_SCHEDULE_TIMEOUT;
+		if (sk_wait_event(sk, &timeout, !ctx->more)) {
+			err = 0;
+			break;
+		}
+	}
+	finish_wait(sk_sleep(sk), &wait);
+
+	clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+	return err;
+}
+
+static void aead_data_wakeup(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	struct socket_wq *wq;
+
+	if (ctx->more)
+		return;
+	if (!ctx->used)
+		return;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
+							   POLLRDNORM |
+							   POLLRDBAND);
+	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+	rcu_read_unlock();
+}
+
+static int aead_sendmsg(struct kiocb *unused, struct socket *sock,
+			struct msghdr *msg, size_t size)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned ivsize =
+		crypto_aead_ivsize(crypto_aead_reqtfm(&ctx->aead_req));
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	struct af_alg_control con = {};
+	long copied = 0;
+	bool enc = 0;
+	bool init = 0;
+	int err = -EINVAL;
+
+	if (msg->msg_controllen) {
+		err = af_alg_cmsg_send(msg, &con);
+		if (err)
+			return err;
+
+		init = 1;
+		switch (con.op) {
+		case ALG_OP_ENCRYPT:
+			enc = 1;
+			break;
+		case ALG_OP_DECRYPT:
+			enc = 0;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (con.iv && con.iv->ivlen != ivsize)
+			return -EINVAL;
+	}
+
+	lock_sock(sk);
+	if (!ctx->more && ctx->used)
+		goto unlock;
+
+	if (init) {
+		ctx->enc = enc;
+		if (con.iv)
+			memcpy(ctx->iv, con.iv->iv, ivsize);
+
+		ctx->aead_assoclen = con.aead_assoclen;
+	}
+
+	while (size) {
+		unsigned long len = size;
+		struct scatterlist *sg = NULL;
+
+		/* use the existing memory in an allocated page */
+		if (ctx->merge) {
+			sg = sgl->sg + sgl->cur - 1;
+			len = min_t(unsigned long, len,
+				    PAGE_SIZE - sg->offset - sg->length);
+			err = memcpy_from_msg(page_address(sg_page(sg)) +
+					      sg->offset + sg->length,
+					      msg, len);
+			if (err)
+				goto unlock;
+
+			sg->length += len;
+			ctx->merge = (sg->offset + sg->length) &
+				     (PAGE_SIZE - 1);
+
+			ctx->used += len;
+			copied += len;
+			size -= len;
+			continue;
+		}
+
+		if (!aead_writable(sk)) {
+			/* user space sent too much data */
+			aead_put_sgl(sk);
+			err = -EMSGSIZE;
+			goto unlock;
+		}
+
+		/* allocate a new page */
+		len = min_t(unsigned long, size, aead_sndbuf(sk));
+		while (len) {
+			int plen = 0;
+
+			if (sgl->cur >= ALG_MAX_PAGES) {
+				aead_put_sgl(sk);
+				err = -E2BIG;
+				goto unlock;
+			}
+
+			sg = sgl->sg + sgl->cur;
+			plen = min_t(int, len, PAGE_SIZE);
+
+			sg_assign_page(sg, alloc_page(GFP_KERNEL));
+			err = -ENOMEM;
+			if (!sg_page(sg))
+				goto unlock;
+
+			err = memcpy_from_msg(page_address(sg_page(sg)),
+					      msg, plen);
+			if (err) {
+				__free_page(sg_page(sg));
+				sg_assign_page(sg, NULL);
+				goto unlock;
+			}
+
+			sg->offset = 0;
+			sg->length = plen;
+			len -= plen;
+			ctx->used += plen;
+			copied += plen;
+			sgl->cur++;
+			size -= plen;
+			ctx->merge = plen & (PAGE_SIZE - 1);
+		}
+	}
+
+	err = 0;
+
+	ctx->more = msg->msg_flags & MSG_MORE;
+	if (!ctx->more && !aead_sufficient_data(ctx)) {
+		aead_put_sgl(sk);
+		err = -EMSGSIZE;
+	}
+
+unlock:
+	aead_data_wakeup(sk);
+	release_sock(sk);
+
+	return err ?: copied;
+}
+
+static ssize_t aead_sendpage(struct socket *sock, struct page *page,
+			     int offset, size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	int err = -EINVAL;
+
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		flags |= MSG_MORE;
+
+	if (sgl->cur >= ALG_MAX_PAGES)
+		return -E2BIG;
+
+	lock_sock(sk);
+	if (!ctx->more && ctx->used)
+		goto unlock;
+
+	if (!size)
+		goto done;
+
+	if (!aead_writable(sk)) {
+		/* user space sent too much data */
+		aead_put_sgl(sk);
+		err = -EMSGSIZE;
+		goto unlock;
+	}
+
+	ctx->merge = 0;
+
+	get_page(page);
+	sg_set_page(sgl->sg + sgl->cur, page, size, offset);
+	sgl->cur++;
+	ctx->used += size;
+
+	err = 0;
+
+done:
+	ctx->more = flags & MSG_MORE;
+	if (!ctx->more && !aead_sufficient_data(ctx)) {
+		aead_put_sgl(sk);
+		err = -EMSGSIZE;
+	}
+
+unlock:
+	aead_data_wakeup(sk);
+	release_sock(sk);
+
+	return err ?: size;
+}
+
+static int aead_recvmsg(struct kiocb *unused, struct socket *sock,
+			struct msghdr *msg, size_t ignored, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned bs = crypto_aead_blocksize(crypto_aead_reqtfm(&ctx->aead_req));
+	unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+	struct aead_sg_list *sgl = &ctx->tsgl;
+	struct scatterlist *sg = NULL;
+	struct scatterlist assoc[ALG_MAX_PAGES];
+	size_t assoclen = 0;
+	unsigned int i = 0;
+	int err = -EINVAL;
+	unsigned long used = 0;
+	unsigned long outlen = 0;
+	const struct iovec *iov;
+	unsigned long iovlen = 0;
+	unsigned long usedpages = 0;
+
+	/* Limit number of IOV blocks to be accessed below */
+	if (msg->msg_iter.nr_segs > RSGL_MAX_ENTRIES)
+		return -ENOMSG;
+
+	lock_sock(sk);
+
+	/*
+	 * AEAD memory structure: For encryption, the tag is appended to the
+	 * ciphertext which implies that the memory allocated for the ciphertext
+	 * must be increased by the tag length. For decryption, the tag
+	 * is expected to be concatenated to the ciphertext. The plaintext
+	 * therefore has a memory size of the ciphertext minus the tag length.
+	 *
+	 * The memory structure for cipher operation has the following
+	 * structure:
+	 *	AEAD encryption input:  assoc data || plaintext
+	 *	AEAD encryption output: cipherntext || auth tag
+	 *	AEAD decryption input:  assoc data || ciphertext || auth tag
+	 *	AEAD decryption output: plaintext
+	 */
+
+	if (ctx->more) {
+		err = aead_wait_for_data(sk, flags);
+		if (err)
+			goto unlock;
+	}
+
+	used = ctx->used;
+
+	/*
+	 * Make sure sufficient data is present -- note, the same check is
+	 * is also present in sendmsg/sendpage. The checks in sendpage/sendmsg
+	 * shall provide an information to the data sender that something is
+	 * wrong, but they are irrelevant to maintain the kernel integrity.
+	 * We need this check here too in case user space decides to not honor
+	 * the error message in sendmsg/sendpage and still call recvmsg. This
+	 * check here protects the kernel integrity.
+	 */
+	if (!aead_sufficient_data(ctx))
+		goto unlock;
+
+	/*
+	 * The cipher operation input data is reduced by the associated data
+	 * length as this data is processed separately later on.
+	 */
+	used -= ctx->aead_assoclen;
+
+	if (ctx->enc) {
+		/* round up output buffer to multiple of block size */
+		outlen = ((used + bs - 1) / bs * bs);
+		/* add the size needed for the auth tag to be created */
+		outlen += as;
+	} else {
+		/* output data size is input without the authentication tag */
+		outlen = used - as;
+		/* round up output buffer to multiple of block size */
+		outlen = ((outlen + bs - 1) / bs * bs);
+	}
+
+	/* convert iovecs of output buffers into scatterlists */
+	for (iov = msg->msg_iter.iov, iovlen = 0;
+	     iovlen < msg->msg_iter.nr_segs; iovlen++, iov++) {
+		char __user *from = iov->iov_base;
+		unsigned long seglen = min_t(unsigned long, iov->iov_len,
+					     (outlen - usedpages));
+
+		/* make one iovec available as scatterlist */
+		err = af_alg_make_sg(&ctx->rsgl[iovlen], from, seglen, 1);
+		if (err < 0)
+			goto unlock;
+		usedpages += err;
+		/* chain the new scatterlist with initial list */
+		if (iovlen)
+			scatterwalk_crypto_chain(ctx->rsgl[0].sg,
+					ctx->rsgl[iovlen].sg, 1,
+					sg_nents(ctx->rsgl[iovlen-1].sg));
+		/* we do not need more iovecs as we have sufficient memory */
+		if (outlen <= usedpages)
+			break;
+	}
+
+	err = -EINVAL;
+	/* ensure output buffer is sufficiently large */
+	if (usedpages < outlen)
+		goto unlock;
+
+	sg_init_table(assoc, ALG_MAX_PAGES);
+	assoclen = ctx->aead_assoclen;
+	/*
+	 * Split scatterlist into two: first part becomes AD, second part
+	 * is plaintext / ciphertext. The first part is assigned to assoc
+	 * scatterlist. When this loop finishes, sg points to the start of the
+	 * plaintext / ciphertext.
+	 */
+	for (i = 0; i < ctx->tsgl.cur; i++) {
+		sg = sgl->sg + i;
+		if (sg->length <= assoclen) {
+			/* AD is larger than one page */
+			sg_set_page(assoc + i, sg_page(sg),
+				    sg->length, sg->offset);
+			assoclen -= sg->length;
+			if (i >= ctx->tsgl.cur)
+				goto unlock;
+		} else if (!assoclen) {
+			/* current page is to start of plaintext / ciphertext */
+			if (i)
+				/* AD terminates at page boundary */
+				sg_mark_end(assoc + i - 1);
+			else
+				/* AD size is zero */
+				sg_mark_end(assoc);
+			break;
+		} else {
+			/* AD does not terminate at page boundary */
+			sg_set_page(assoc + i, sg_page(sg),
+				    assoclen, sg->offset);
+			sg_mark_end(assoc + i);
+			/* plaintext / ciphertext starts after AD */
+			sg->length -= assoclen;
+			sg->offset += assoclen;
+			break;
+		}
+	}
+
+	aead_request_set_assoc(&ctx->aead_req, assoc, ctx->aead_assoclen);
+	aead_request_set_crypt(&ctx->aead_req, sg, ctx->rsgl[0].sg, used,
+			       ctx->iv);
+
+	err = af_alg_wait_for_completion(ctx->enc ?
+					 crypto_aead_encrypt(&ctx->aead_req) :
+					 crypto_aead_decrypt(&ctx->aead_req),
+					 &ctx->completion);
+
+	if (err) {
+		/* EBADMSG implies a valid cipher operation took place */
+		if (err == -EBADMSG)
+			aead_put_sgl(sk);
+		goto unlock;
+	}
+
+	aead_put_sgl(sk);
+
+	err = 0;
+
+unlock:
+	for (i = 0; i < iovlen; i++)
+		af_alg_free_sg(&ctx->rsgl[i]);
+
+	aead_wmem_wakeup(sk);
+	release_sock(sk);
+
+	return err ? err : outlen;
+}
+
+static unsigned int aead_poll(struct file *file, struct socket *sock,
+			      poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned int mask;
+
+	sock_poll_wait(file, sk_sleep(sk), wait);
+	mask = 0;
+
+	if (!ctx->more)
+		mask |= POLLIN | POLLRDNORM;
+
+	if (aead_writable(sk))
+		mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
+
+	return mask;
+}
+
+static struct proto_ops algif_aead_ops = {
+	.family		=	PF_ALG,
+
+	.connect	=	sock_no_connect,
+	.socketpair	=	sock_no_socketpair,
+	.getname	=	sock_no_getname,
+	.ioctl		=	sock_no_ioctl,
+	.listen		=	sock_no_listen,
+	.shutdown	=	sock_no_shutdown,
+	.getsockopt	=	sock_no_getsockopt,
+	.mmap		=	sock_no_mmap,
+	.bind		=	sock_no_bind,
+	.accept		=	sock_no_accept,
+	.setsockopt	=	sock_no_setsockopt,
+
+	.release	=	af_alg_release,
+	.sendmsg	=	aead_sendmsg,
+	.sendpage	=	aead_sendpage,
+	.recvmsg	=	aead_recvmsg,
+	.poll		=	aead_poll,
+};
+
+static void *aead_bind(const char *name, u32 type, u32 mask)
+{
+	return crypto_alloc_aead(name, type, mask);
+}
+
+static void aead_release(void *private)
+{
+	crypto_free_aead(private);
+}
+
+static int aead_setauthsize(void *private, unsigned int authsize)
+{
+	return crypto_aead_setauthsize(private, authsize);
+}
+
+static int aead_setkey(void *private, const u8 *key, unsigned int keylen)
+{
+	return crypto_aead_setkey(private, key, keylen);
+}
+
+static void aead_sock_destruct(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct aead_ctx *ctx = ask->private;
+	unsigned int ivlen = crypto_aead_ivsize(
+				crypto_aead_reqtfm(&ctx->aead_req));
+
+	aead_put_sgl(sk);
+	sock_kzfree_s(sk, ctx->iv, ivlen);
+	sock_kfree_s(sk, ctx, ctx->len);
+	af_alg_release_parent(sk);
+}
+
+static int aead_accept_parent(void *private, struct sock *sk)
+{
+	struct aead_ctx *ctx;
+	struct alg_sock *ask = alg_sk(sk);
+	unsigned int len = sizeof(*ctx) + crypto_aead_reqsize(private);
+	unsigned int ivlen = crypto_aead_ivsize(private);
+
+	ctx = sock_kmalloc(sk, len, GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+	memset(ctx, 0, len);
+
+	ctx->iv = sock_kmalloc(sk, ivlen, GFP_KERNEL);
+	if (!ctx->iv) {
+		sock_kfree_s(sk, ctx, len);
+		return -ENOMEM;
+	}
+	memset(ctx->iv, 0, ivlen);
+
+	ctx->len = len;
+	ctx->used = 0;
+	ctx->more = 0;
+	ctx->merge = 0;
+	ctx->enc = 0;
+	ctx->tsgl.cur = 0;
+	ctx->aead_assoclen = 0;
+	af_alg_init_completion(&ctx->completion);
+	sg_init_table(ctx->tsgl.sg, ALG_MAX_PAGES);
+
+	ask->private = ctx;
+
+	aead_request_set_tfm(&ctx->aead_req, private);
+	aead_request_set_callback(&ctx->aead_req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				  af_alg_complete, &ctx->completion);
+
+	sk->sk_destruct = aead_sock_destruct;
+
+	return 0;
+}
+
+static const struct af_alg_type algif_type_aead = {
+	.bind		=	aead_bind,
+	.release	=	aead_release,
+	.setkey		=	aead_setkey,
+	.setauthsize	=	aead_setauthsize,
+	.accept		=	aead_accept_parent,
+	.ops		=	&algif_aead_ops,
+	.name		=	"aead",
+	.owner		=	THIS_MODULE
+};
+
+static int __init algif_aead_init(void)
+{
+	return af_alg_register_type(&algif_type_aead);
+}
+
+static void __exit algif_aead_exit(void)
+{
+	int err = af_alg_unregister_type(&algif_type_aead);
+	BUG_ON(err);
+}
+
+module_init(algif_aead_init);
+module_exit(algif_aead_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Stephan Mueller <smueller@chronox.de>");
+MODULE_DESCRIPTION("AEAD kernel crypto API user space interface");
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH v1 2/5] IB/uverbs: ex_query_device: check request's comp_mask
From: Jason Gunthorpe @ 2015-01-29 20:24 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150129183648.GI11842-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Thu, Jan 29, 2015 at 11:36:48AM -0700, Jason Gunthorpe wrote:

> Also, the _ex varients were supposed to be supersets of the base call,
> so it is wrong that query_device_ex doesn't return all the same data
> as query_device, layed out so that the original response structure is
> a prefix of the extended response structure.

Never mind this, I see the copy_query_dev_fields call now!

Thanks,
Jason

^ permalink raw reply

* [PATCH v12 0/2] crypto: AF_ALG: add AEAD and RNG support
From: Stephan Mueller @ 2015-01-29 20:23 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: 'Quentin Gouchet', Daniel Borkmann, 'LKML',
	linux-crypto, 'Linux API'

Hi,

This patch set adds AEAD and RNG support to the AF_ALG interface
exported by the kernel crypto API. By extending AF_ALG with AEAD and RNG
support, all cipher types the kernel crypto API allows access to are
now accessible from userspace.

Both, AEAD and RNG implementations are stand-alone and do not depend
other AF_ALG interfaces (like hash or skcipher).

The AEAD implementation uses the same approach as provided with
skcipher by offering the following interfaces:

	* sendmsg and recvmsg interfaces allowing multiple
	  invocations supporting a threaded user space. To support
	  multi-threaded user space, kernel-side buffering
	  is implemented similarly to skcipher.

	* splice / vmsplice interfaces allowing a zero-copy
	  invocation

The RNG interface only implements the recvmsg interface as
zero-copy is not applicable.

The new AEAD and RNG interfaces are fully tested with the test application
provided at [1]. That test application exercises all newly added user space
interfaces. The testing covers:

	* use of the sendmsg/recvmsg interface

	* use of the splice / vmsplice interface

	* invocation of all AF_ALG types (aead, rng, skcipher, hash)

	* using all types of operation (encryption, decryption, keyed MD,
	  MD, random numbers, AEAD decryption with positive and negative
	  authentication verification)

	* stress testing by running all tests for 30 minutes in an
	  endless loop

	* test execution on 64 bit and 32 bit

[1] http://www.chronox.de/libkcapi.html

Changes v2:
* rebase to current cryptodev-2.6 tree
* use memzero_explicit to zeroize AEAD associated data
* use sizeof for determining length of AEAD associated data
* update algif_rng.c covering all suggestions from Daniel Borkmann
  <dborkman@redhat.com>
* addition of patch 9: add digestsize interface for hashes
* addition of patch to update documentation covering the userspace interface
* change numbers of getsockopt options: separate them from sendmsg interface
  definitions

Changes v3:
* remove getsockopt interface
* AEAD: associated data is set prepended to the plain/ciphertext
* AEAD: allowing arbitrary associated data lengths
* remove setkey patch as protection was already in the existing code

Changes v4:
* stand-alone implementation of AEAD
* testing of all interfaces offered by AEAD
* stress testing of AEAD and RNG

Changes v5:
* AEAD: add outer while(size) loop in aead_sendmsg to ensure all data is
  copied into the kernel (reporter Herbert Xu)
* AEAD: aead_sendmsg bug fix: change size -= len; to size -= plen;
* AF_ALG / AEAD: add aead_setauthsize and associated extension to
  struct af_alg_type as well as alg_setsockopt (reporter Herbert Xu)
* RNG: rng_recvmsg: use 128 byte stack variable for output of RNG instead
  of ctx->result (reporter Herbert Xu)
* RNG / AF_ALG: allow user space to seed RNG via setsockopt
* RNG: rng_recvmsg bug fix: use genlen as result variable for
  crypto_rng_get_bytes as previously no negative errors were obtained
* AF_ALG: alg_setop: zeroize buffer before free

Changes v6:
* AEAD/RNG: port to 3.19-rc1 with the iov_iter handling
* RNG: use the setkey interface to obtain the seed and drop the patch adding
  a separate reseeding interface
* extract the zeroization patch for alg_setkey into a stand-alone patch
  submission
* fix bug in aead_sufficient_data (reporter Herbert Xu)
* testing of all interfaces with test application provided with libkcapi version
  0.6.2

Changes v7:
* AEAD: aead_recvmsg: change error code from ENOMEM to EINVAL
* AEAD: drop aead_readable/aead_sufficient_data and only use ctx->more to decide
  whether the read side shall become active. This change requires that the
  patch for crypto_aead_decrypt ensuring that the ciphertext contains the
  authentication tag was added -- see https://lkml.org/lkml/2014/12/30/200.
  Otherwise, user space can trigger a kernel crash.
* RNG: patch dropped as it was applied
* AEAD: port Kconfig/Makefile patch forward to current code base

Changes v8:
* Removed check for aead_assoclen in aead_sendmsg
* Fix endless loop bug in aead_sendmsg (check for sgl->cur > ALG_MAX_PAGES in
  while condition removed -- this condition is checked within the loop already)
* Resurrect aead_sufficient_data and call it in aead_sendmsg, aead_sendpage to
  notify caller about wrong invocation
* Re-add aead_sufficient_data to aead_recvmsg to verify user input data before
  using them to ensure the kernel protects against malicious parameters
* Allow arbitrary size of AD (i.e. up to the maximum buffer size of
  ALG_MAX_PAGES)
* When aead_recvmsg receives an error from decryption, release all pages if the
  error is EBADMSG -- this error implies that a proper decryption was performed
  but the integrity of the message is lost. This error is considered to be a
  valid decryption result.
* Add test cases for sendmsg and splice interface to test large AD sizes (in
  case of sendmsg, use 65504 bytes AD and 32 bytes plaintext; in case of splice
  use 15 pages AD and 32 bytes in the 16th page for plaintext). See [1] for
  updated test case.

Changes v9:
* if socket is not writable during sendmsg/sendpage due to insufficient memory
  and a recvmsg operation is forced, inform userspace about truncated operation
  via MSG_TRUNC
* use -EMSGSIZE in case insufficient data was provided in sendmsg/sendpage
* release all buffers in case insufficient data was provided in sendmsg/sendpage
* bug fix in sendmsg: when a new page is allocated, reset sg->offset to 0 --
  the error is visible with the new tests in [1] when using the -d flag
  with the test application

Changes v10:
* initialize ctx->trunc in aead_accept_parent to zero
* fix one line with code formatting problems

Changes v11:
* return an error if user space sends too much data instead of waiting until
  reader side catches up with operation (suggested by Herbert Xu)
* remove now unneeded aead_wait_for_wmem service function
* remove now unneeded ctx->trunc and MSG_TRUNC error return

Changes v12:
* check ctx->used in aead_data_wakeup (reported by Herbert Xu)
* free socket buffers in case userspace sends too much data (reported by
  Herbert Xu)
* allow up to 16 IOVECs in recvmsg for the output buffer -- the code can handle
  an arbitrary number of IOVECs (just change RSGL_MAX_ENTRIES). Though, when
  changing the value, the maximum memory user space can cause the kernel to
  allocate should be considered as documented in the comment. The test code in
  [1] is updated to invoke recvmsg with 16 IOVECs (reported by Herbert Xu)
* prevent an edge condition error case in sendmsg (reported by Herbert Xu)
* correct some formatting as suggested by checkpatch.pl

Stephan Mueller (2):
  crypto: AF_ALG: add AEAD support
  crypto: AF_ALG: enable AEAD interface compilation

 crypto/Kconfig      |   9 +
 crypto/Makefile     |   1 +
 crypto/algif_aead.c | 666 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 676 insertions(+)
 create mode 100644 crypto/algif_aead.c

-- 
2.1.0

^ permalink raw reply

* Re: [PATCH v1 3/5] IB/uverbs: ex_query_device: answer must depend on response buffer's size
From: Yann Droneaud @ 2015-01-29 19:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150129183839.GJ11842-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Hi,

Le jeudi 29 janvier 2015 à 11:38 -0700, Jason Gunthorpe a écrit :
> On Thu, Jan 29, 2015 at 07:00:00PM +0100, Yann Droneaud wrote:
> > As specified in "Extending Verbs API" presentation [1] by Tzahi Oved
> > during OFA International Developer Workshop 2013, the request's
> > comp_mask should describe the request data: it's describe the
> > availability of extended fields in the request.
> > Conversely, the response's comp_mask should describe the presence
> > of extended fields in the response.
> 
> This makes sense to me as well.
> 
> > -	err = ib_copy_to_udata(ucore, &resp, sizeof(resp));
> > +end:
> > +	err = ib_copy_to_udata(ucore, &resp, resp_len);
> >  	if (err)
> >  		return err;
> >  
> 
> I think resp_len should be returned, not 0?
> 

As said previously it's not possible to use the effective size of the
response as the return value of the write() syscall: the syscall
must return the size of the input buffer, not the size of the output
buffer, otherwise it would break the semantics of the write() syscall.

Regards.

-- 
Yann Droneaud
OPTEYA


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v1 2/5] IB/uverbs: ex_query_device: check request's comp_mask
From: Yann Droneaud @ 2015-01-29 19:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150129183648.GI11842-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Hi,

Le jeudi 29 janvier 2015 à 11:36 -0700, Jason Gunthorpe a écrit :
> On Thu, Jan 29, 2015 at 06:59:59PM +0100, Yann Droneaud wrote:
> > This patch ensures the extended QUERY_DEVICE uverbs request's
> > comp_mask has only known and supported bits (currently none).
> 
> I think I would be happy to see the input comp_mask removed
> entirely. I can't see a possible use for input data to a QUERY command
> that wouldn't be better served by creating a new command
> 
> But forcing the value to 0 seems reasonable as well.
> 

I cannot forsee the future, but having at least one unused bit
available allow for any kind of yet unknown extension: having this
bit/these bits checked now, permit fixing mistake later.

So let it be.

> Reviewed-By: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> 

Thanks.

> Also, the _ex varients were supposed to be supersets of the base call,
> so it is wrong that query_device_ex doesn't return all the same data
> as query_device, layed out so that the original response structure is
> a prefix of the extended response structure.
> 
> The other _ex calls in verbs were designed that way so it will be
> surprising to the user that this one is different.
> 

It seems to me it has the layout you're expecting:

    224 struct ib_uverbs_ex_query_device_resp {
    225         struct ib_uverbs_query_device_resp base;
    226         __u32 comp_mask;
    227         __u32 reserved;
    228         struct ib_uverbs_odp_caps odp_caps;
    229 };

Regards.

-- 
Yann Droneaud
OPTEYA


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
From: Jason Gunthorpe @ 2015-01-29 19:18 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1422557009.3133.172.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

On Thu, Jan 29, 2015 at 07:43:29PM +0100, Yann Droneaud wrote:

> The write() syscall must return the size buffer passed to it, or
> less, but in such case it would ask for trouble as userspace would
> be allowed to write() the remaining bytes.  Returning a size bigger
> than the one passed to write() is not acceptable and would break any
> expectation.

By that logic the 0 return is still wrong, and it should be ucore->in_len

But I think we can return less without risking anything breaking, it
already violates the invariant associated with write() - it mutates
the buffer passed in!

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
From: Jason Gunthorpe @ 2015-01-29 19:14 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Haggai Eran, Sagi Grimberg, Shachar Raindel, Eli Cohen,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Tzahi Oved, Or Gerlitz,
	Matan Barak
In-Reply-To: <1422556514.3133.165.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

On Thu, Jan 29, 2015 at 07:35:14PM +0100, Yann Droneaud wrote:

> Unfortunately, the userspace don't get the size of the returned data:
> it's only a single "write()" syscall after all.

A write syscall that behaves nothing like write() actually should, so
I don't see why we can't have

resp_len = write(fd,inout_buf,sizeof(input_len));

Returning 0 from write makes no sense at all.

In the fullness of your patchset it will maintain the invariant that
resp_len <= sizeof(input_len)

Which seems OK to me considering what we have to work with, and a
significantly better choice than 0.

> So the kernel is left with the comp_mask in the response to express
> the returned size.

It was never the intent that the size should be computed from
comp_mask. If the size is necessary it must be explicit.

In this instance if the size is not returned then libibverbs will have
to zero the entire user buffer before passing it to the kernel,
because it has to ensure any tail for the user app is 0'd.

Remember for santity we want comp mask bits for things that can't be 0
and we want 0 for things that are not set.

struct ib_query_device_ex res;
ibv_query_device_ex(..,res,sizeof(res));

printf("%u",res.foo_cap); // 0 if unsupported is OK
if (res.comp_mask & COMP_BAR)
  printf("%u",res.bar_thingy);  // 0 has meaning, must use COMP_BAR

Ideally we want to minimize the number of COMP bits because it is a
huge burden on the end user to work with them.

> > The purpose of the output comp_mask is to allow drivers to declare
> > they do not support the new structure members, and comp_mask bits
> > should only be used with new structure members do not have a natural
> > 'null' value.

> It's not (yet) about drivers as the output's comp_mask (in the 
> response), is set by uverbs layer.
> 
> Do you think we have to enforce a 1 to 1 relation between the request 
> comp_mask's bits and the response comp_mask's bits ?

In the query context I would be happy enough to just treat the in
bound buffer as uninitialized buffer space, but certainly generally
speaking the comp_mask+size should refer to the structure - so
input/output are not directly related.

> > Further, we need to distinguish cases where additional data is
> > mandatory or optional.
> > 
> > query_device is clearly optional, the only purpose the input comp mask
> > serves is to reduce expensive work in the driver by indicating that
> > some result bits are not needed.
> 
> That's not how I've understand the purpose of the request's comp_mask
> after reading the presentation: request's comp_mask describe the content
> of the request. Then that additional content can trigger the presence 
> of additional fields in the response if needed.

Agreed - what I described was an abuse that some early Mellanox
patches for a query_ex included and it was just a shortcut to avoid
defining a dedicated input structure. It seems that same scheme was
copied into these new patches.

> >  It is perfectly OK for the kernel to
> > ignore the input comp mask, and OK for userspace to typically request
> > all bits. (indeed, I think this is a pretty silly optimization myself,
> > and the original patch that motivated this was restructured so it is
> > not needed)
> > 
> 
> It's not at all OK to ignore request's comp_mask: it has to be checked
> to find if there's a feature requested the kernel cannot fullfil, and 
> any unknown bit is a possible feature request. So the kernel has to 
> reject the request as a whole as it don't know how to process it.

In the context of the above scheme the input structure was just this:

struct query_input
{
  u64 requested_output;
};

ie it wasn't actually a comp_mask, it just overlapped the comp_mask
bytes on output.

Such a use was never explicitly documented, and IIRC was never
actually included in libibverbs.

Unless someone has a strong reason we need to do this I am inclined to
think that your interpretation is the better one, and we could always
add a requested_output to the input someday if it became urgent.

In any event, you are right, we can't ingore the input bytes today and
expect to give them meaning tomorrow.

> As we don't know yet how we would extend the extended QUERY_DEVICE
> uverbs, the kernel have to refuse any random value.
> 
> http://blog.ffwll.ch/2013/11/botching-up-ioctls.html

or the kernel treats QUERY_DEVICE as an output only function and never
inspects the in/out buffer at all.

> > > Thanks for the link to the presentation.
> > 
> > If I recall the presentation is old and had some flaws that were
> > addressed before things made it into libibverbs..
> 
> I have to have a look to this part of libibverbs: I'm not sure
> the extended QUERY_DEVICE is already implemented.

The patches turned out to be unnecessary and were dropped, IIRC.

> > Thank you for looking at this, it is very important that this scheme
> > is used properly, and it is very easy to make mistakes. I haven't had
> > time to review any of these new patches myself.

> I hope you would find some time to review my latest patchset so that
> we don't miss a corner case. It's starting to become urgent.

I have and will, thank you

Jason

^ permalink raw reply

* Re: [RESEND] [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
From: Darrick J. Wong @ 2015-01-29 19:01 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jens Axboe, Christoph Hellwig,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Jeff Layton, J. Bruce Fields
In-Reply-To: <2747023.BlTyJ4fNVf@wuerfel>

On Thu, Jan 29, 2015 at 11:02:58AM +0100, Arnd Bergmann wrote:
> On Wednesday 28 January 2015 18:00:25 Darrick J. Wong wrote:
> > Create a new ioctl to expose the block layer's newfound ability to
> > issue either a zeroing discard, a WRITE SAME with a zero page, or a
> > regular write with the zero page.  This BLKZEROOUT2 ioctl takes
> > {start, length, flags} as parameters.  So far, the only flag available
> > is to enable the zeroing discard part -- without it, the call invokes
> > the old BLKZEROOUT behavior.  start and length have the same meaning
> > as in BLKZEROOUT.
> > 
> > Furthermore, because BLKZEROOUT2 issues commands directly to the
> > storage device, we must invalidate the page cache (as a regular
> > O_DIRECT write would do) to avoid returning stale cache contents at a
> > later time.
> > 
> > This patch depends on "block: Add discard flag to
> > blkdev_issue_zeroout() function" in Jens' for-3.20/core branch.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > 
> 
> Would this work ok for devices that fill discarded areas with all-ones
> instead of all-zeroes? I believe SD cards can do either.

It won't do all-ones, because the underlying blkdev_issue_zeroout call only
knows how to tell the device to write zeroes or perform a discard if the flag
is set and the device is whitelisted.  This patch only exposes the existing
kernel call to userspace.

(All-ones could be plumbed into the storage stack, but that would have to be a
separate patch.)

--D

> 
> 	Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
From: Alexei Starovoitov @ 2015-01-29 18:55 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Masami Hiramatsu, Linux API, Network Development, LKML

On Thu, Jan 29, 2015 at 4:35 AM, Namhyung Kim <namhyung@kernel.org> wrote:
>>> Right.  I think bpf programs belong to a user process but events are
>>> global resource.  Maybe you also need to consider attaching bpf
>>> program via perf (ioctl?) interface..
>>
>> yes. I did. Please see my reply to Masami.
>> ioctl only works for tracepoints.
>
> What was the problem of kprobes then? :)

Looks like I misread the logic of attaching a filter via perf ioctl.
Looking at it again it seems to be a major change in design:
Instead of adding into ftrace_raw_* helpers, I would add
to perf_trace_* helpers which are very stack heavy
because of 'pt_regs'
Ex: perf_trace_kfree_skb() is using 224 bytes of stack
whereas ftrace_raw_event_kfree_skb() only 80.
which doesn't help in my quest for lowest overhead.
And the discussion about soft- and auto- enable/disable
becomes meaningless, since there is no such things
when it goes through perf events.
I guess it means no hooks through tracefs...
Anyway, I'll hook it up and see which way is cleaner.

^ permalink raw reply

* Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
From: Yann Droneaud @ 2015-01-29 18:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20150129182800.GH11842-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Hi,

Le jeudi 29 janvier 2015 à 11:28 -0700, Jason Gunthorpe a écrit :
> On Thu, Jan 29, 2015 at 06:59:58PM +0100, Yann Droneaud wrote:
> > As specified in "Extending Verbs API" presentation [1] by Tzahi Oved
> > during OFA International Developer Workshop 2013, the request's
> > comp_mask should describe the request data: it's describe the
> > availability of extended fields in the request.
> > Conversely, the response's comp_mask should describe the presence
> > of extended fields in the response.
> 
> Roland: I agree with Yann, these patches need to go in, or the ODP
> patches reverted.
> 
> >  #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
> > -	if (cmd.comp_mask & IB_USER_VERBS_EX_QUERY_DEVICE_ODP) {
> 
> Absolutely, a query output should never depend on the input comp_mask.
> 

We should keep this in mind then.

> > -		resp.odp_caps.general_caps = attr.odp_caps.general_caps;
> > -		resp.odp_caps.per_transport_caps.rc_odp_caps =
> > -			attr.odp_caps.per_transport_caps.rc_odp_caps;
> > -		resp.odp_caps.per_transport_caps.uc_odp_caps =
> > -			attr.odp_caps.per_transport_caps.uc_odp_caps;
> > -		resp.odp_caps.per_transport_caps.ud_odp_caps =
> > -			attr.odp_caps.per_transport_caps.ud_odp_caps;
> > -		resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;
> > -	}
> > +	resp.odp_caps.general_caps = attr.odp_caps.general_caps;
> > +	resp.odp_caps.per_transport_caps.rc_odp_caps =
> > +		attr.odp_caps.per_transport_caps.rc_odp_caps;
> > +	resp.odp_caps.per_transport_caps.uc_odp_caps =
> > +		attr.odp_caps.per_transport_caps.uc_odp_caps;
> > +	resp.odp_caps.per_transport_caps.ud_odp_caps =
> > +		attr.odp_caps.per_transport_caps.ud_odp_caps;
> >  #endif
> > +	resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;
> 
> Not sure about this - if 0 is a valid null answer for all the _caps
> then it is fine, and the comp_mask bit should just be removed as the
> size alone should be enough.
> 

That's difficult to say. But I hope 0 are valids answers in this case.

Anyway, the response's comp_mask is needed, as it's the sole way for 
the userspace to know the size of the response. See below.

> This looks wrong however:
> 
> >  	err = ib_copy_to_udata(ucore, &resp, sizeof(resp));
> >  	if (err)
> >           return err;
> >        return 0;
> 
> Why isn't this returning the filled structure size to userspace?  This
> would seem to be very urgently wrong to me? Am I missing something?
> 
> Patch 3 probably should gain:
> 
> - return 0;
> + return resp_len;
> 

The write() syscall must return the size buffer passed to it, or less,
but in such case it would ask for trouble as userspace would be allowed
to write() the remaining bytes.
Returning a size bigger than the one passed to write() is not acceptable
and would break any expectation.

> This patch looks like an improvement to me:
> 
> Reviewed-By: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
> 

Thanks a lot.

Regards.

-- 
Yann Droneaud
OPTEYA


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2] tpm: fix suspend/resume paths for TPM 2.0
From: Scot Doyle @ 2015-01-29 18:43 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Peter Huewe, Ashley Lai, tpmdd-devel, linux-kernel, josh,
	christophe.ricard, jason.gunthorpe, stefanb, linux-api,
	trousers-tech
In-Reply-To: <1422510227-23256-1-git-send-email-jarkko.sakkinen@linux.intel.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1386 bytes --]

On Thu, 29 Jan 2015, Jarkko Sakkinen wrote:
> Fixed suspend/resume paths for TPM 2.0 and consolidated all the
> associated code to the tpm_pm_suspend() and tpm_pm_resume()
> functions. Resume path should be handled by the firmware, i.e.
> Startup(CLEAR) for hibernate and Startup(STATE) for suspend.
> 
> There might be some non-PC embedded devices in the future where
> Startup() is not the handled by the FW but fixing the code for
> those IMHO should be postponed until there is hardware available
> to test the fixes although extra Startup in the driver code is
> essentially a NOP.
> 
> Added Shutdown(CLEAR) to the remove paths of TIS and CRB drivers.
> Changed tpm2_shutdown() to a void function because there isn't
> much you can do except print an error message if this fails with
> a system error.
> 
> Reported-by: Peter Hüwe <PeterHuewe@gmx.de>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> ---
>  drivers/char/tpm/tpm-interface.c |  6 ++++--
>  drivers/char/tpm/tpm.h           |  2 +-
>  drivers/char/tpm/tpm2-cmd.c      | 19 +++++++++++--------
>  drivers/char/tpm/tpm_crb.c       | 20 +++++---------------
>  drivers/char/tpm/tpm_tis.c       | 26 +++++++++++++-------------
>  5 files changed, 34 insertions(+), 39 deletions(-)

Resume still functions on TPM 1.2 chip, with and without CONFIG_TCG_CRB.

Tested-by: Scot Doyle <lkml14@scotdoyle.com>

^ permalink raw reply

* Re: [PATCH v1 4/5] IB/uverbs: ex_query_device: no need to clear the whole structure
From: Jason Gunthorpe @ 2015-01-29 18:39 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <0b646f62e9272bc962a1ff6ff03ad9523b454dfe.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

On Thu, Jan 29, 2015 at 07:00:01PM +0100, Yann Droneaud wrote:
> As only the requested fields are set and copied to userspace,
> there's no need to clear the content of the response structure
> beforehand.

Agreed.

Reviewed-By: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v1 3/5] IB/uverbs: ex_query_device: answer must depend on response buffer's size
From: Jason Gunthorpe @ 2015-01-29 18:38 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <a7b2b5adb3b207ec2a4330067a03ce7e4c2225aa.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

On Thu, Jan 29, 2015 at 07:00:00PM +0100, Yann Droneaud wrote:
> As specified in "Extending Verbs API" presentation [1] by Tzahi Oved
> during OFA International Developer Workshop 2013, the request's
> comp_mask should describe the request data: it's describe the
> availability of extended fields in the request.
> Conversely, the response's comp_mask should describe the presence
> of extended fields in the response.

This makes sense to me as well.

> -	err = ib_copy_to_udata(ucore, &resp, sizeof(resp));
> +end:
> +	err = ib_copy_to_udata(ucore, &resp, resp_len);
>  	if (err)
>  		return err;
>  

I think resp_len should be returned, not 0?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v1 2/5] IB/uverbs: ex_query_device: check request's comp_mask
From: Jason Gunthorpe @ 2015-01-29 18:36 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <335da45738872e446f63db338ca766a34608c90a.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

On Thu, Jan 29, 2015 at 06:59:59PM +0100, Yann Droneaud wrote:
> This patch ensures the extended QUERY_DEVICE uverbs request's
> comp_mask has only known and supported bits (currently none).

I think I would be happy to see the input comp_mask removed
entirely. I can't see a possible use for input data to a QUERY command
that wouldn't be better served by creating a new command

But forcing the value to 0 seems reasonable as well.

Reviewed-By: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Also, the _ex varients were supposed to be supersets of the base call,
so it is wrong that query_device_ex doesn't return all the same data
as query_device, layed out so that the original response structure is
a prefix of the extended response structure.

The other _ex calls in verbs were designed that way so it will be
surprising to the user that this one is different.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
From: Yann Droneaud @ 2015-01-29 18:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, Sagi Grimberg, Shachar Raindel, Eli Cohen,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Tzahi Oved, Or Gerlitz,
	Matan Barak
In-Reply-To: <20150129180956.GG11842-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Hi,

Le jeudi 29 janvier 2015 à 11:09 -0700, Jason Gunthorpe a écrit :
> On Wed, Jan 28, 2015 at 02:19:11PM +0100, Yann Droneaud wrote:
> 
> > But the same program (either binary or source code) might fail on
> > newer kernel where some bits in comp_mask gain a meaning not supported
> > by the program.
> 
> To clarify some of this:
> 
> The intention of the comp_mask scheme was to define a growing
> structure.
> 
> The invariant is: A bigger structure allways has all smaller
> structures (past versions) as a prefix.
> 
> So the size alone is enough for user space to know what is going
> on. userspace only processes structure members up to the size returned
> by the kernel, and there is only one structure layout.
> 

Unfortunately, the userspace don't get the size of the returned data:
it's only a single "write()" syscall after all.

So the kernel is left with the comp_mask in the response to express
the returned size.

> The purpose of the output comp_mask is to allow drivers to declare
> they do not support the new structure members, and comp_mask bits
> should only be used with new structure members do not have a natural
> 'null' value.
> 

It's not (yet) about drivers as the output's comp_mask (in the 
response), is set by uverbs layer.

Do you think we have to enforce a 1 to 1 relation between the request 
comp_mask's bits and the response comp_mask's bits ?

That would not fit with my understanding of "Extending Verbs API" 
presentation [1]: request's comp_mask describe the fields present in 
the request and response's comp_mask describe the fields present in the 
response.

> Further, we need to distinguish cases where additional data is
> mandatory or optional.
> 
> query_device is clearly optional, the only purpose the input comp mask
> serves is to reduce expensive work in the driver by indicating that
> some result bits are not needed.

That's not how I've understand the purpose of the request's comp_mask
after reading the presentation: request's comp_mask describe the content
of the request. Then that additional content can trigger the presence 
of additional fields in the response if needed.

>  It is perfectly OK for the kernel to
> ignore the input comp mask, and OK for userspace to typically request
> all bits. (indeed, I think this is a pretty silly optimization myself,
> and the original patch that motivated this was restructured so it is
> not needed)
> 

It's not at all OK to ignore request's comp_mask: it has to be checked
to find if there's a feature requested the kernel cannot fullfil, and 
any unknown bit is a possible feature request. So the kernel has to 
reject the request as a whole as it don't know how to process it.

As we don't know yet how we would extend the extended QUERY_DEVICE
uverbs, the kernel have to refuse any random value.

http://blog.ffwll.ch/2013/11/botching-up-ioctls.html

> Other verbs must be more strict, if one side does not understand the
> comp mask then the verb must fail in some way. Presumably the valid
> comp mask values are inferred in some way (eg query_device says the
> extended function is supported)
> 

> Everything should fit in one of those two general cases..

I believe every interface should return an error for any unknown value
(every unused bits of a data structure), that is, there's only one case.

>   
> > Thanks for the link to the presentation.
> 
> If I recall the presentation is old and had some flaws that were
> addressed before things made it into libibverbs..
> 

I have to have a look to this part of libibverbs: I'm not sure
the extended QUERY_DEVICE is already implemented.

> Thank you for looking at this, it is very important that this scheme
> is used properly, and it is very easy to make mistakes. I haven't had
> time to review any of these new patches myself.
> 

I hope you would find some time to review my latest patchset
so that we don't miss a corner case. It's starting to become urgent.

Regards.

-- 
Yann Droneaud
OPTEYA

^ permalink raw reply

* Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
From: Jason Gunthorpe @ 2015-01-29 18:28 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <24ceb1fc5b2b6563532e5776b1b2320b1ae37543.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

On Thu, Jan 29, 2015 at 06:59:58PM +0100, Yann Droneaud wrote:
> As specified in "Extending Verbs API" presentation [1] by Tzahi Oved
> during OFA International Developer Workshop 2013, the request's
> comp_mask should describe the request data: it's describe the
> availability of extended fields in the request.
> Conversely, the response's comp_mask should describe the presence
> of extended fields in the response.

Roland: I agree with Yann, these patches need to go in, or the ODP
patches reverted.

>  #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
> -	if (cmd.comp_mask & IB_USER_VERBS_EX_QUERY_DEVICE_ODP) {

Absolutely, a query output should never depend on the input comp_mask.

> -		resp.odp_caps.general_caps = attr.odp_caps.general_caps;
> -		resp.odp_caps.per_transport_caps.rc_odp_caps =
> -			attr.odp_caps.per_transport_caps.rc_odp_caps;
> -		resp.odp_caps.per_transport_caps.uc_odp_caps =
> -			attr.odp_caps.per_transport_caps.uc_odp_caps;
> -		resp.odp_caps.per_transport_caps.ud_odp_caps =
> -			attr.odp_caps.per_transport_caps.ud_odp_caps;
> -		resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;
> -	}
> +	resp.odp_caps.general_caps = attr.odp_caps.general_caps;
> +	resp.odp_caps.per_transport_caps.rc_odp_caps =
> +		attr.odp_caps.per_transport_caps.rc_odp_caps;
> +	resp.odp_caps.per_transport_caps.uc_odp_caps =
> +		attr.odp_caps.per_transport_caps.uc_odp_caps;
> +	resp.odp_caps.per_transport_caps.ud_odp_caps =
> +		attr.odp_caps.per_transport_caps.ud_odp_caps;
>  #endif
> +	resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;

Not sure about this - if 0 is a valid null answer for all the _caps
then it is fine, and the comp_mask bit should just be removed as the
size alone should be enough.

This looks wrong however:

>  	err = ib_copy_to_udata(ucore, &resp, sizeof(resp));
>  	if (err)
>           return err;
>        return 0;

Why isn't this returning the filled structure size to userspace?  This
would seem to be very urgently wrong to me? Am I missing something?

Patch 3 probably should gain:

- return 0;
+ return resp_len;

This patch looks like an improvement to me:

Reviewed-By: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Jason

^ permalink raw reply

* Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
From: Jason Gunthorpe @ 2015-01-29 18:09 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Haggai Eran, Sagi Grimberg, Shachar Raindel, Eli Cohen,
	Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Tzahi Oved, Or Gerlitz,
	Matan Barak
In-Reply-To: <1422451151.3133.130.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

On Wed, Jan 28, 2015 at 02:19:11PM +0100, Yann Droneaud wrote:

> But the same program (either binary or source code) might fail on
> newer kernel where some bits in comp_mask gain a meaning not supported
> by the program.

To clarify some of this:

The intention of the comp_mask scheme was to define a growing
structure.

The invariant is: A bigger structure allways has all smaller
structures (past versions) as a prefix.

So the size alone is enough for user space to know what is going
on. userspace only processes structure members up to the size returned
by the kernel, and there is only one structure layout.

The purpose of the output comp_mask is to allow drivers to declare
they do not support the new structure members, and comp_mask bits
should only be used with new structure members do not have a natural
'null' value.

Further, we need to distinguish cases where additional data is
mandatory or optional.

query_device is clearly optional, the only purpose the input comp mask
serves is to reduce expensive work in the driver by indicating that
some result bits are not needed. It is perfectly OK for the kernel to
ignore the input comp mask, and OK for userspace to typically request
all bits. (indeed, I think this is a pretty silly optimization myself,
and the original patch that motivated this was restructured so it is
not needed)

Other verbs must be more strict, if one side does not understand the
comp mask then the verb must fail in some way. Presumably the valid
comp mask values are inferred in some way (eg query_device says the
extended function is supported)

Everything should fit in one of those two general cases..

> Thanks for the link to the presentation.

If I recall the presentation is old and had some flaws that were
addressed before things made it into libibverbs..

Thank you for looking at this, it is very important that this scheme
is used properly, and it is very easy to make mistakes. I haven't had
time to review any of these new patches myself.

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
From: Yann Droneaud @ 2015-01-29 18:00 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Roland Dreier,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Tzahi Oved, Or Gerlitz,
	Matan Barak
In-Reply-To: <54C902E4.5010600-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Hi,

Le mercredi 28 janvier 2015 à 17:40 +0200, Haggai Eran a écrit :
> On 28/01/2015 15:19, Yann Droneaud wrote:
> > Le mardi 27 janvier 2015 à 08:50 +0200, Haggai Eran a écrit :
> >> On 26/01/2015 13:17, Yann Droneaud wrote:
> >>> ...
> >>> Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit :
> >>>> On 22/01/2015 15:28, Yann Droneaud wrote:
> >>>>> This patch ensures the extended QUERY_DEVICE uverbs request's
> >>>>> comp_mask has only known values. If userspace returns unknown
> >>>>> features, -EINVAL will be returned, allowing to probe/discover
> >>>>> which features are currently supported by the kernel.
> >>>>
> >>>> This probing process will be much more cumbersome than it needs to be
> >>>> because userspace will have to call QUERY_DEVICE repetitively with
> >>>> different comp_mask fields until it finds out the exact set of supported
> >>>> bits.
> >>>>
> >>>
> >>> O(log2(N))
> >>
> >> I don't think user space developers will be happy having to do trial and 
> >> error to determine what features the kernel driver supports. It might be 
> >> even more then O(log2(N)). If my understanding of comp_mask bits usage is 
> >> correct it would O(N). But it's not the time complexity I'm worried about,
> >> it's the fact that it requires user-space developers to go through hoops in
> >> order to get information that can be much more easily exported.
> >>
> > 
> > But there's currently *NO* such mean that could give a hint to userspace
> > developer whether one bit of request's comp_mask is currently effective
> > in extended QUERY_DEVICE uverbs. While we have such mean in CREATE_FLOW
> > and DESTROY_FLOW: unsupported bits trigger -EINVAL.
> > 
> > In QUERY_DEVICE, userspace developer is allowed to set request's 
> > comp_mask to -1: it won't hurt it on current kernel, so why not using 
> > this value or any other random value ? The program will run: it's "Works
> > For Me".
> > 
> > But the same program (either binary or source code) might fail on
> > newer kernel where some bits in comp_mask gain a meaning not supported
> > by the program.
> 
> Why don't we make the command comp_mask in QUERY_DEVICE zero in both
> versions. The behavior of the command with comp_mask = 0 will be to
> return the maximum amount of data that fits in the response buffer. The
> kernel will return -EINVAL if the input command comp_mask is not zero in
> the current version.
> 

Yes, that's exactly what I wanted to do.

> If in the future we want to alter the behavior or add more input fields
> to QUERY_DEVICE, we would use new bits.
> 

Yes.

> >>> Or you had to add a different interface, dedicated to retrieve the exact
> >>> supported feature mask.
> >>>
> >>>>> Moreover, it also ensure the requested features set in comp_mask
> >>>>> are sequentially set, not skipping intermediate features, eg. the
> >>>>> "highest" requested feature also request all the "lower" ones.
> >>>>> This way each new feature will have to be stacked on top of the
> >>>>> existing ones: this is especially important for the request and
> >>>>> response data structures where fields are added after the
> >>>>> current ones when expanded to support a new feature.
> >>>>
> >>>> I think it is perfectly acceptable that not all drivers will implement
> >>>> all available features, and so you shouldn't enforce this requirement.
> >>>
> >>> With regard to QUERY_DEVICE: the data structure layout depends on the
> >>> comp_mask value. So either you propose a way to express multipart data
> >>> structure (see CMSG or NETLINK), or we have to ensure the data structure
> >>> is ever-growing, with each new chunck stacked over the existing ones:
> >>> that's the purpose of :
> >>>
> >>> 	if (cmd.comp_mask & (cmd.comp_mask + 1))
> >>> 		return -EINVAL;
> >>>
> >>>> Also, it makes the comp_mask nothing more than a wasteful version number
> >>>> between 0 and 63.
> >>>
> >>> That's what I've already explained earlier in "Re: [PATCH v3 06/17]
> >>> IB/core: Add support for extended query device caps":
> >>>
> >>> http://mid.gmane.org/1421844612.13543.40.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
> >>
> >> Yes, you wrote there:
> >>> Regarding comp_mask (not for this particular verb):
> >>>
> >>> It's not clear whether request's comp_mask describe the request or the
> >>> response, as such I'm puzzled.
> >>>
> >>> How would the kernel and the userspace be able to parse the request and
> >>> the response if they ignore unknown bits ?
> >>>
> >>> How would they be able to skip the unrecognized chunk of the request and
> >>> response buffer ?
> >>>
> >>> How one bit in a comp_mask is related to a chunk in the request or
> >>> response ?
> >>>
> >>> It's likely the kernel or userspace would have to skip the remaining
> >>> comp_mask's bits after encountering an unknown bit as the size of the
> >>> corresponding chunk in request or response would be unknown, making
> >>> impossible to locate the corresponding chunk for the next bit set in
> >>> comp_mask. Having said that, comp_mask is just a complicated way of
> >>> expressing a version, which is turn describe a size (ever growing).
> >>
> >> It is my understanding that each comp_mask bit marks a set of fields in 
> >> the command or in the response struct as valid, so the struct format 
> >> remains the same and the kernel and userspace don't need to make 
> >> difficult calculation as to where each field is, but you can still pass 
> >> a high bit set in comp_mask with one of the lower bits cleared.
> >>
> > 
> > How can the struct format remain the same, if some fields are added
> > to implement newer feature ?
> 
> I meant that the binary format for an older version is the prefix of the
> binary format of the newer version.
> 

OK.

> >> I couldn't find this explicit detail in the mailing list, but I did found
> >> a presentation that was presented in OFA International Developer 
> >> Workshop 2013 [1], that gave an example of of an verb where each 
> >> comp_mask bit marked a different field as valid.
> >>
> > 
> > Thanks for the link to the presentation.
> > 
> > As I read it the presentation:
> > 
> > - in request, comp_mask give hint to the kernel of additional fields in
> > the request.
> > 
> > - in response, comp_mask give hint to userspace regarding the presence
> > of additional fields in the response.
> 
> I'm not sure that in request you can regard the comp_mask as a hint.

It's a hint because the kernel also have to check the size match:
if userspace say: "hey I've given you more fields" but the request
size is shorter than the kernel expect for such fields, the kernel
must return an error so that userspace is taught to fix its requests.

> I agree that you need to enforce it in general and reject unknown bits
> there (although I thought QUERY_DEVICE would be an exception).
> 

I think it's better to stick to one common behavior so that every
extended uverbs behave the same way.

> > But commit 860f10a799c8 ("IB/core: Add flags for on demand paging
> > support") on top of commit 5a77abf9a97a ("IB/core: Add support for
> > extended query device caps") is not doing it the expected way
> > as one bit set in request's comp_mask has direct effect on
> > the response's fields.
> > 
> > To be conformant with the "Extending Verbs API" presentation,
> > no check should be done on request's comp_mask, and on-demand paging
> > information should be returned to userspace in all case, provided 
> > there's enough room in the response buffer.
> > 
> > Anyway, allowing userspace to set any "hint" in the request's comp_mask
> > is opening a pandora box: being allowed to set any value, userspace
> > program will use any random value, as it will work with current kernel.
> > 
> > But the same program used on newer kernel is going to trigger some
> > behavior unknown to the application or return an error the application
> > is not ready to handle ... breaking any kind of forward compatibility 
> > promise.
> 
> I don't think that the application should handle unknown response bits
> as an error. As you wrote, I see these as hints about more data in the
> response that the application is free to ignore.
> 

I tried to explain the issue regarding older userspace setting random
bits in request's comp_mask: it older kernel allows such, the same
program will likely have issue with newer kernel where the request's
comp_mask bits have a meaning now: the program will either face
unknown behavior: the kernel does something the userspace was not
expecting, or the kernel refuse to do it because there's not enough
information in the request to fullfil it.

That's why we must catch now unknown value to prevent userspace to
use it now. If unknown value are not allowed, userspace program won't
use it and then it could run asis on newer kernel.


> > It's likely the kernel will have to use the size of the request to 
> > guess the hints in comp_mask are corrects to handle such. In such case, 
> > relying only on the size of the request might be enough to detect 
> > extended request, without the need for comp_mask.
> > 
> > Regarding the answer, if the response buffer is smaller than the size
> > of the extended response, will the kernel keep setting the response's 
> > comp_mask with all the bits it supports or will it unset the bits for 
> > the fields it wasn't able to fit in the response buffer ?
> > 
> > It's likely the kernel will have to use the size of the response buffer 
> > to set the response's comp_mask. 
> > 
> > Instead commit 860f10a799c8 ("IB/core: Add flags for on demand paging
> > support") on top of commit 5a77abf9a97a ("IB/core: Add support for
> > extended query device caps") make it possible for the kernel to return
> > truncated response with full comp_mask. Such is going to break silently 
> > when one will introduce a data structure with an ABI mismatch between
> > userspace and kernel (for example x86 vs amd64 ... we have some recent
> > exemples).
> 
> As I said, I think unknown bits in the comp_mask are hints about unknown
> fields in the response that can be ignored by the application. However,
> I can agree to having the kernel checking the response buffer size as
> you wrote, and only setting the valid comp_mask bits.
> 

OK.

> > 
> >>>
> >>>>
> >>>> In the specific case of QUERY_DEVICE you might argue that there isn't
> >>>> any need for input comp_mask, only for output, and then you may enforce
> >>>> the input comp_mask will always be zero.
> >>>
> >>> The extended QUERY_DEVICE uverbs as currently merged is using comp_mask
> >>> from input to choose to report on-demand-paging related value. So it
> >>> seems it's needed.
> >>>
> >>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297
> >>>
> >>>> However, you will in any case need to be able to extended the size of the response in the future.
> >>>>
> >>>
> >>> That's already the case for on demand paging.
> >>>
> >>>>>
> >>>>> Link: http://mid.gmane.org/cover.1421931555.git.ydroneaud@opteya.com
> >>>>> Cc: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>>> Cc: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>>> Cc: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>>> Cc: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>>> Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
> >>>>> ---
> >>>>>  drivers/infiniband/core/uverbs_cmd.c | 6 ++++++
> >>>>>  1 file changed, 6 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> >>>>> index 8668b328b7e6..80a1c90f1dbf 100644
> >>>>> --- a/drivers/infiniband/core/uverbs_cmd.c
> >>>>> +++ b/drivers/infiniband/core/uverbs_cmd.c
> >>>>> @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
> >>>>>  	if (err)
> >>>>>  		return err;
> >>>>>  
> >>>>> +	if (cmd.comp_mask & (cmd.comp_mask + 1))
> >>>>> +		return -EINVAL;
> >>>>> +
> >>>>> +	if (cmd.comp_mask & ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP)
> >>>>> +		return -EINVAL;
> >>>>> +
> >>>
> >>> If we keep the checks on output buffer size from patch 1, returning
> >>> -ENOSPC in case of size mismatch, and if we are sure that no bit in
> >>> input comp_mask will ever trigger a call to a kernel function that can
> >>> make the uverb fail, the latter check on known value could be dropped,
> >>> allowing the userspace to set the highest mask (-1): userspace
> >>> will use -ENOSPC to probe the expected size of the response buffer
> >>> to match the highest supported comp_mask. But it's going to hurt
> >>> userspace application not ready to receive -ENOSPC on newer kernel
> >>> with extended QUERY_DEVICE ABI ... Oops.
> >>>
> >>> So in this end, the safest way to ensure userspace is doing the correct
> >>> thing so that we have backward and forward compatibility is to check
> >>> for known values in comp_mask, check for response buffer size and ensure
> >>> that data structure chunk are stacked.
> >>>
> >>> The tighter are the checks now, the easier the interface could be
> >>> extended latter.
> >>
> >> I understand this position, and I generally agree, but I think that 
> >> specifically for a verb like QUERY_DEVICE that only reads information from 
> >> the kernel driver to user-space, there is no harm in the kernel just 
> >> providing all the information it can fit in the response buffer provided 
> >> by user-space.
> >>
> > 
> > Returning as much as possible information to userspace is OK,
> > but it's doing it the wrong way.
> > 
> > I'm not specifically discussing QUERY_DEVICE, as I'm concerned with 
> > every extended uverbs to be added to the kernel.
> > 
> >> Let me explain: newer fields are added to the kernel response struct at the 
> >> end, together with a new comp_mask bit.
> >>
> >> Older user-space with newer kernels will simply ask only for the buffer 
> >> size they care about. The fact that the struct is truncated doesn't affect
> >> these programs because the truncated struct is a valid struct that was 
> >> presented by the kernel in an older version.
> >>
> > 
> > You cannot ensure the userspace being correct when specifying a request.
> > It's a wrong assumption (perhaps not so wrong in the case of
> > InfiniBand/RDMA, as most userspace program are using libibverbs,
> > librdmacm and provider's libraries).
> > 
> > That's why we must not be liberal with the request and check any bit of
> > it for being something valid, so that erroneous values are catch now,
> > ensuring userspace is not trying to request things we don't know yet
> > and it's not aware of it too.
> 
> Does the solution I proposed above satisfy this requirement?
> - The kernel validates input size == command struct size and
> cmd.comp_mask == 0.
> - The kernel fills as much information as it can fit in the buffer
> provided by userspace.
> - The kernel marks which fields it was able to fill in the response's
> comp_mask.
> 

Yes. This scheme follow the "Extending Verbs API" presention
while ensuring the command can be extended regarding to good
practice one can find in
http://blog.ffwll.ch/2013/11/botching-up-ioctls.html

I'm going to submit the patches to have this behavior.

Regards.

-- 
Yann Droneaud
OPTEYA



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v1 5/5] IB/core: ib_copy_to_udata(): don't silently truncate response
From: Yann Droneaud @ 2015-01-29 18:00 UTC (permalink / raw)
  To: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier
  Cc: Yann Droneaud, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <cover.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

While ib_copy_to_udata() should check for the available output
space as already proposed in some other patches [1][2][3], the
changes brought by commit 5a77abf9a97a ("IB/core: Add support for
extended query device caps") are silently truncating the data to
be written to userspace if the output buffer is not large enough
to hold the response data.

Silently truncating the response is not a reliable behavior as
userspace is not given any hint about this truncation: userspace
is leaved with garbage to play with.

Not checking the response buffer size and writing past the
userspace buffer is no good either, but it's the current behavior.

So this patch revert the particular change on ib_copy_to_udata()
as a better behavior is implemented in the upper level function
ib_uverbs_ex_query_device().

[1] "[PATCH 00/22] infiniband: improve userspace input check"

http://mid.gmane.org/cover.1376847403.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org

[2] "[PATCH 03/22] infiniband: ib_copy_from_udata(): check input length"

http://mid.gmane.org/2bf102a41c51f61965ee09df827abe8fefb523a9.1376847403.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org

[3] "[PATCH 04/22] infiniband: ib_copy_to_udata(): check output length"

http://mid.gmane.org/d27716a3a1c180f832d153a7402f65ea8a75b734.1376847403.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org

Link: http://mid.gmane.org/cover.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Cc: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 include/rdma/ib_verbs.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0d74f1de99aa..65994a19e840 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1707,10 +1707,7 @@ static inline int ib_copy_from_udata(void *dest, struct ib_udata *udata, size_t
 
 static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len)
 {
-	size_t copy_sz;
-
-	copy_sz = min_t(size_t, len, udata->outlen);
-	return copy_to_user(udata->outbuf, src, copy_sz) ? -EFAULT : 0;
+	return copy_to_user(udata->outbuf, src, len) ? -EFAULT : 0;
 }
 
 /**
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v1 4/5] IB/uverbs: ex_query_device: no need to clear the whole structure
From: Yann Droneaud @ 2015-01-29 18:00 UTC (permalink / raw)
  To: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier
  Cc: Yann Droneaud, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <cover.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

As only the requested fields are set and copied to userspace,
there's no need to clear the content of the response structure
beforehand.

Link: http://mid.gmane.org/cover.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Cc: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/core/uverbs_cmd.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 81d8b5aa2eb6..9520880a163f 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3328,9 +3328,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	if (err)
 		return err;
 
-	memset(&resp, 0, sizeof(resp));
 	copy_query_dev_fields(file, &resp.base, &attr);
 	resp.comp_mask = 0;
+	resp.reserved = 0;
 
 	if (ucore->outlen < resp_len + sizeof(resp.odp_caps))
 		goto end;
@@ -3343,6 +3343,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 		attr.odp_caps.per_transport_caps.uc_odp_caps;
 	resp.odp_caps.per_transport_caps.ud_odp_caps =
 		attr.odp_caps.per_transport_caps.ud_odp_caps;
+	resp.odp_caps.reserved = 0;
+#else
+	memset(&resp.odp_caps, 0, sizeof(resp.odp_caps));
 #endif
 	resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;
 	resp_len += sizeof(resp.odp_caps);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v1 3/5] IB/uverbs: ex_query_device: answer must depend on response buffer's size
From: Yann Droneaud @ 2015-01-29 18:00 UTC (permalink / raw)
  To: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier
  Cc: Yann Droneaud, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <cover.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

As specified in "Extending Verbs API" presentation [1] by Tzahi Oved
during OFA International Developer Workshop 2013, the request's
comp_mask should describe the request data: it's describe the
availability of extended fields in the request.
Conversely, the response's comp_mask should describe the presence
of extended fields in the response.

So instead of silently truncating extended QUERY_DEVICE uverb's
response, see commit 5a77abf9a97a ("IB/core: Add support for
extended query device caps")), this patch makes function
ib_uverbs_ex_query_device() check the available space in the
response buffer against the minimal response structure and fail
with -ENOSPC if this base structure cannot be returned to
userspace: it's required to be able to return the comp_mask's
value, at least.

For extended features, currently only IB_USER_VERBS_EX_QUERY_DEVICE_ODP
per commit 860f10a799c8 ("IB/core: Add flags for on demand paging
support"), the extension's data is returned if the needed space
is available in the response buffer: it is expected that newer
userspace program pass a bigger response buffer so that it can
retrieve extended features. The comp_mask value will match the fields
that were effectively returned to userspace.

In the end:
- userspace won't get truncated responses;
- newer kernel would be able to support older binaries and
  older binaries will work flawlessly with newer kernel;
- additionally, newer binaries would even be able to support
  older kernel, as far as they don't set new feature bit in
  request's comp_mask.

Note: as offsetof() is used to retrieve the size of the lower chunk
of the response, beware that it only works if the upper chunk
is right after, without any implicit padding. And, as the size of
the latter chunk is added to the base size, implicit padding at the
end of the structure is not taken in account. Both point must be
taken in account when extending the uverbs functionalities.

[1] https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf

Link: http://mid.gmane.org/cover.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Cc: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/core/uverbs_cmd.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index fbcc54b86795..81d8b5aa2eb6 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3302,6 +3302,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	struct ib_uverbs_ex_query_device  cmd;
 	struct ib_device_attr attr;
 	struct ib_device *device;
+	size_t resp_len;
 	int err;

 	device = file->device->ib_dev;
@@ -3318,6 +3319,11 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	if (cmd.reserved)
 		return -EINVAL;

+	resp_len = offsetof(typeof(resp), odp_caps);
+
+	if (ucore->outlen < resp_len)
+		return -ENOSPC;
+
 	err = device->query_device(device, &attr);
 	if (err)
 		return err;
@@ -3326,6 +3332,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	copy_query_dev_fields(file, &resp.base, &attr);
 	resp.comp_mask = 0;

+	if (ucore->outlen < resp_len + sizeof(resp.odp_caps))
+		goto end;
+
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 	resp.odp_caps.general_caps = attr.odp_caps.general_caps;
 	resp.odp_caps.per_transport_caps.rc_odp_caps =
@@ -3336,8 +3345,10 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 		attr.odp_caps.per_transport_caps.ud_odp_caps;
 #endif
 	resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;
+	resp_len += sizeof(resp.odp_caps);

-	err = ib_copy_to_udata(ucore, &resp, sizeof(resp));
+end:
+	err = ib_copy_to_udata(ucore, &resp, resp_len);
 	if (err)
 		return err;

-- 
2.1.0

^ permalink raw reply related

* [PATCH v1 2/5] IB/uverbs: ex_query_device: check request's comp_mask
From: Yann Droneaud @ 2015-01-29 17:59 UTC (permalink / raw)
  To: Sagi Grimberg, Shachar Raindel, Eli Cohen, Haggai Eran,
	Roland Dreier
  Cc: Yann Droneaud, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <cover.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

This patch ensures the extended QUERY_DEVICE uverbs request's
comp_mask has only known and supported bits (currently none).

If userspace set unknown features bits, -EINVAL will be returned,
ensuring current programs are not allowed to set random feature
bits: such bits could enable new extended features in future kernel
versions and those features can trigger a behavior not unsupported
by the older programs or make the newer kernels return an error
for a request which was valid on older kernels.

Additionally, returning an error for unsupported feature would
allow userspace to probe/discover which extended features are
currently supported by a kernel.

Link: http://mid.gmane.org/cover.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Cc: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/core/uverbs_cmd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6ef06a9b4362..fbcc54b86795 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3312,6 +3312,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	if (err)
 		return err;
 
+	if (cmd.comp_mask)
+		return -EINVAL;
+
 	if (cmd.reserved)
 		return -EINVAL;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox