Netdev List
* Re: Freeing alive fib_info caused by ebc0ffae5
From: David Miller @ 2010-11-04 19:06 UTC (permalink / raw)
  To: michael; +Cc: eric.dumazet, netdev
In-Reply-To: <1288870526.30549.19.camel@concordia>

From: Michael Ellerman <michael@ellerman.id.au>
Date: Thu, 04 Nov 2010 22:35:26 +1100

> On Thu, 2010-11-04 at 12:21 +0100, Eric Dumazet wrote:
>> [PATCH] fib: fib_result_assign() should not change fib refcounts
>> 
>> After commit ebc0ffae5 (RCU conversion of fib_lookup()),
>> fib_result_assign()  should not change fib refcounts anymore.
>> 
>> Thanks to Michael who did the bisection and bug report.
 ...
> Perfect, that fixes it, thanks!

Applied, thanks everyone!


* Re: [RFC][net-next-2.6 PATCH 2/4] net: 8021Q consolidate header_ops routines
From: Jesse Gross @ 2010-11-04 18:26 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev@vger.kernel.org
In-Reply-To: <4CD2B883.30808@intel.com>

On Thu, Nov 4, 2010 at 6:43 AM, John Fastabend
<john.r.fastabend@intel.com> wrote:
> On 11/3/2010 5:47 PM, Jesse Gross wrote:
>> On Thu, Oct 21, 2010 at 3:10 PM, John Fastabend
>> <john.r.fastabend@intel.com> wrote:
>>> The only thing the 8021Q header ops routines are required
>>> for is the VLAN_FLAG_REORDER_HDR otherwise by the time
>>> the VLAN tag has been added the packet is already on
>>> its way down the stack. In this case using the Ethernet
>>> ops works OK.
>>>
>>> At present the VLAN_FLAG_REORDER_HDR flag does not work
>>> with vlan offloads. As I understand the flag the intent
>>> is to allow taps on the vlan device and possibly the
>>> QOS layer to see the vlan tag info.
>>>
>>> By inserting the tag in vlan_tci any taps or QOS policies
>>> should be able to retrieve the vlan info. This allows
>>> the flag to work the same in both the offload case and
>>> non-offloaded case. And allows us to use the underlying
>>> ethernet ops.
>>>
>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>
>> I noticed that you dropped this patch from your most recent series, so
>> I went back to take a look at it.  I realized that it probably works
>> inconsistently since header caching doesn't take into account
>> skb->vlan_tci, so whether you see the tag depends on the state of the
>> cache.
>>
>> It would be really good to have this type of code consolidation, both
>> for the sake of sanity and to eliminate the inconsistent behavior.  We
>> could do that by either not using header caching or making it work
>> with vlan offloading somehow.  However, I'm not sure that there's
>> really much point in that.  VLAN_FLAG_REORDER_HDR doesn't work with
>> cards that do vlan offloading, which is a pretty significant number of
>> them.  It similarly works inconsistently on the rx side.  So it's
>> broken most of the time and worse, the behavior changes depending on
>> the NIC (and now the ethtool setting).  Can we just eliminate it?
>
> Yes, this is why I have dropped it for now. Also, rebuild is broken as best I can tell, although I doubt anyone would notice: you would need to clear VLAN_FLAG_REORDER_HDR and be using one of the ARPHRD_{ROSE|AX25|NETROM}.
>
> The problem with caching the vlan header is the skb priority to vlan priority map. So we could cache the vid, sa, da, and protocols, but I cannot see any way to cache the vlan priority. Also, the cache would have to be flushed when the flag is toggled.

I agree, fixing this so that !VLAN_FLAG_REORDER_HDR works correctly in
all cases would be messy.

However, this has been broken for a long time and I don't know of
anyone complaining.  Since it is already a no-op in the accelerated
case, I would like to just drop the flag so we get consistent behavior
and less code.
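
To make the caching problem discussed above concrete, here is a minimal
user-space sketch, not kernel code.  It assumes the standard 802.1Q TCI
layout (3-bit priority, 1-bit CFI, 12-bit VLAN ID); struct egress_map
and vlan_make_tci() are hypothetical names used only for illustration.
Because the priority bits are derived per packet from the skb priority,
two packets to the same destination can need different tags, so a
single cached header cannot serve both.

```c
#include <assert.h>
#include <stdint.h>

#define VLAN_PRIO_SHIFT 13
#define VLAN_VID_MASK   0x0fff

/* Hypothetical egress map: skb priority (0..7 here) -> 802.1Q priority. */
struct egress_map {
	uint8_t pcp[8];
};

/* Build the 16-bit TCI for one packet from its VLAN ID and skb priority. */
static uint16_t vlan_make_tci(const struct egress_map *map,
			      uint16_t vid, unsigned int skb_prio)
{
	uint16_t pcp = map->pcp[skb_prio & 7] & 7;

	return (uint16_t)((pcp << VLAN_PRIO_SHIFT) | (vid & VLAN_VID_MASK));
}
```

With an identity map, priority 0 and priority 5 on VLAN 100 give
different TCI values even though every other header field is the same.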


* [PATCH 4/4] crypto: algif_skcipher - User-space interface for skcipher operations
From: Herbert Xu @ 2010-11-04 17:36 UTC (permalink / raw)
  To: Linux Crypto Mailing List, netdev, Linux Kernel Mailing List
In-Reply-To: <20101104173456.GA1321@gondor.apana.org.au>

crypto: algif_skcipher - User-space interface for skcipher operations

This patch adds the af_alg plugin for symmetric key ciphers,
corresponding to the ablkcipher kernel operation type.

Keys can optionally be set through the setsockopt interface.

Once a sendmsg call occurs without MSG_MORE no further writes
may be made to the socket until all previous data has been read.
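
The write rule above can be modelled in a few lines of plain C.  This
is only a sketch of the gating logic, assuming it depends on nothing
but the "used" and "more" fields of skcipher_ctx; struct tx_state and
the tx_*() helpers are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two skcipher_ctx fields that gate writes. */
struct tx_state {
	size_t used;	/* bytes queued but not yet read back */
	bool more;	/* last write was sent with MSG_MORE */
};

/* A write is refused while a finished request still has unread data. */
static bool tx_may_write(const struct tx_state *st)
{
	return st->more || st->used == 0;
}

/* Record a write of len bytes; returns false if it was refused. */
static bool tx_write(struct tx_state *st, size_t len, bool msg_more)
{
	if (!tx_may_write(st))
		return false;
	st->used += len;
	st->more = msg_more;
	return true;
}

/* Record that up to len bytes were read back by the consumer. */
static void tx_read(struct tx_state *st, size_t len)
{
	st->used -= len < st->used ? len : st->used;
}
```

In the patch itself the refused case corresponds to sendmsg returning
-EINVAL when !ctx->more && ctx->used.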

IVs and whether encryption/decryption is performed can be
set through the setsockopt interface or as a control message
to sendmsg.
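
As an illustration of the control-message path, the sketch below fills
a caller-supplied buffer with two control messages, one for the
operation and one for the IV.  It assumes the names ALG_SET_OP,
ALG_SET_IV, SOL_ALG and struct af_alg_iv as found in the
<linux/if_alg.h> that later shipped in mainline; the header in this
posting may differ.  alg_fill_cmsg() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

#ifndef SOL_ALG
#define SOL_ALG 279	/* value used by mainline; assumed here */
#endif

/*
 * Point msg->msg_control at buf and fill it with two control messages:
 * the operation and the IV.  Returns 0, or -1 if buf is too small.
 */
static int alg_fill_cmsg(struct msghdr *msg, void *buf, size_t buflen,
			 uint32_t op, const void *iv, uint32_t ivlen)
{
	size_t need = CMSG_SPACE(sizeof(op)) +
		      CMSG_SPACE(sizeof(struct af_alg_iv) + ivlen);
	struct cmsghdr *cmsg;
	struct af_alg_iv *aiv;

	if (buflen < need)
		return -1;

	/* Zero the buffer so CMSG_NXTHDR never sees stale lengths. */
	memset(buf, 0, need);
	msg->msg_control = buf;
	msg->msg_controllen = need;

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_ALG;
	cmsg->cmsg_type = ALG_SET_OP;
	cmsg->cmsg_len = CMSG_LEN(sizeof(op));
	memcpy(CMSG_DATA(cmsg), &op, sizeof(op));

	cmsg = CMSG_NXTHDR(msg, cmsg);
	cmsg->cmsg_level = SOL_ALG;
	cmsg->cmsg_type = ALG_SET_IV;
	cmsg->cmsg_len = CMSG_LEN(sizeof(*aiv) + ivlen);
	aiv = (struct af_alg_iv *)CMSG_DATA(cmsg);
	aiv->ivlen = ivlen;
	memcpy(aiv->iv, iv, ivlen);

	return 0;
}
```

The caller would then pass msg, together with the data iovec, to
sendmsg(2) on the operation socket.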

The interface is completely synchronous: all operations are
carried out in recvmsg(2) and will complete prior to the system
call returning.

The splice(2) interface supports reading the user-space data directly
without copying (except that the Crypto API itself may copy the data
if alignment is off).

The recvmsg(2) interface supports directly writing to user-space
without additional copying, i.e., the kernel crypto interface will
receive the user-space address as its output SG list.

Thanks to Miloslav Trmac for reviewing this and contributing
fixes and improvements.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 crypto/Kconfig          |    8 
 crypto/Makefile         |    1 
 crypto/algif_skcipher.c |  647 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 656 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 6db27d7..69437e2 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -852,6 +852,14 @@ config CRYPTO_USER_API_HASH
 	  This option enables the user-spaces interface for hash
 	  algorithms.
 
+config CRYPTO_USER_API_SKCIPHER
+	tristate "User-space interface for symmetric key cipher algorithms"
+	select CRYPTO_BLKCIPHER
+	select CRYPTO_USER_API
+	help
+	  This option enables the user-space interface for symmetric
+	  key cipher algorithms.
+
 source "drivers/crypto/Kconfig"
 
 endif	# if CRYPTO
diff --git a/crypto/Makefile b/crypto/Makefile
index 14ab405..efc0f18 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_CRYPTO_TEST) += tcrypt.o
 obj-$(CONFIG_CRYPTO_GHASH) += ghash-generic.o
 obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
 obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
+obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
 
 #
 # generic algorithms and the async_tx api
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
new file mode 100644
index 0000000..abc15b2
--- /dev/null
+++ b/crypto/algif_skcipher.c
@@ -0,0 +1,647 @@
+/*
+ * algif_skcipher: User-space interface for skcipher algorithms
+ *
+ * This file provides the user-space API for symmetric key ciphers.
+ *
+ * Copyright (c) 2010 Herbert Xu <herbert@gondor.apana.org.au>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#include <crypto/skcipher.h>
+#include <crypto/if_alg.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <net/sock.h>
+
+struct skcipher_sg_list {
+	struct list_head list;
+
+	int cur;
+
+	struct scatterlist sg[0];
+};
+
+struct skcipher_ctx {
+	struct list_head tsgl;
+	struct af_alg_sgl rsgl;
+
+	void *iv;
+
+	struct af_alg_completion completion;
+
+	unsigned used;
+
+	unsigned int len;
+	bool more;
+	bool merge;
+	bool enc;
+
+	struct ablkcipher_request req;
+};
+
+#define MAX_SGL_ENTS ((PAGE_SIZE - sizeof(struct skcipher_sg_list)) / \
+		      sizeof(struct scatterlist) - 1)
+
+static inline bool skcipher_writable(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+
+	return ctx->used + PAGE_SIZE <= max_t(int, sk->sk_sndbuf, PAGE_SIZE);
+}
+
+static int skcipher_alloc_sgl(struct sock *sk, int size)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+	struct skcipher_sg_list *sgl;
+	struct scatterlist *sg = NULL;
+
+	sgl = list_entry(ctx->tsgl.prev, struct skcipher_sg_list, list);
+	if (!list_empty(&ctx->tsgl))
+		sg = sgl->sg;
+
+	if (!sg || sgl->cur >= MAX_SGL_ENTS) {
+		sgl = sock_kmalloc(sk, sizeof(*sgl) +
+				       sizeof(sgl->sg[0]) * (MAX_SGL_ENTS + 1),
+				   GFP_KERNEL);
+		if (!sgl)
+			return -ENOMEM;
+
+		sg_init_table(sgl->sg, MAX_SGL_ENTS + 1);
+		sgl->cur = 0;
+
+		if (sg)
+			sg_chain(sg, MAX_SGL_ENTS + 1, sgl->sg);
+
+		list_add_tail(&sgl->list, &ctx->tsgl);
+	}
+
+	return 0;
+}
+
+static void skcipher_pull_sgl(struct sock *sk, int used)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+	struct skcipher_sg_list *sgl;
+	struct scatterlist *sg;
+	int i;
+
+	while (!list_empty(&ctx->tsgl)) {
+		sgl = list_first_entry(&ctx->tsgl, struct skcipher_sg_list,
+				       list);
+		sg = sgl->sg;
+
+		for (i = 0; i < sgl->cur; i++) {
+			int plen = min_t(int, used, sg[i].length);
+
+			if (!sg_page(sg + i))
+				continue;
+
+			if (!used)
+				return;
+
+			sg[i].length -= plen;
+			sg[i].offset += plen;
+
+			if (!sg[i].length) {
+				put_page(sg_page(sg + i));
+				sg_assign_page(sg + i, NULL);
+			}
+
+			used -= plen;
+			ctx->used -= plen;
+		}
+
+		list_del(&sgl->list);
+		sock_kfree_s(sk, sgl,
+			     sizeof(*sgl) + sizeof(sgl->sg[0]) *
+					    (MAX_SGL_ENTS + 1));
+	}
+
+	if (!ctx->used)
+		ctx->merge = 0;
+}
+
+static void skcipher_free_sgl(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+
+	skcipher_pull_sgl(sk, ctx->used);
+}
+
+static int skcipher_wait_for_wmem(struct sock *sk, unsigned flags)
+{
+	DEFINE_WAIT(wait);
+	int err = -ERESTARTSYS;
+
+	if (flags & MSG_DONTWAIT)
+		return -EAGAIN;
+
+	set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+	for (;;) {
+		if (signal_pending(current))
+			break;
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		if (skcipher_writable(sk)) {
+			err = 0;
+			break;
+		}
+		schedule();
+	}
+	finish_wait(sk_sleep(sk), &wait);
+
+	return err;
+}
+
+static void skcipher_wmem_wakeup(struct sock *sk)
+{
+	struct socket_wq *wq;
+
+	if (!skcipher_writable(sk))
+		return;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
+							   POLLRDNORM |
+							   POLLRDBAND);
+	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	rcu_read_unlock();
+}
+
+static int skcipher_wait_for_data(struct sock *sk, unsigned flags)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+	DEFINE_WAIT(wait);
+	int err = -ERESTARTSYS;
+
+	if (flags & MSG_DONTWAIT) {
+		return -EAGAIN;
+	}
+
+	set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+	for (;;) {
+		if (signal_pending(current))
+			break;
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		if (ctx->used) {
+			err = 0;
+			break;
+		}
+		schedule();
+	}
+	finish_wait(sk_sleep(sk), &wait);
+
+	clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+	return err;
+}
+
+static void skcipher_data_wakeup(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+	struct socket_wq *wq;
+
+	if (!ctx->used)
+		return;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
+							   POLLRDNORM |
+							   POLLRDBAND);
+	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+	rcu_read_unlock();
+}
+
+static int skcipher_sendmsg(struct kiocb *unused, struct socket *sock,
+			    struct msghdr *msg, size_t size)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+	struct crypto_ablkcipher *tfm = crypto_ablkcipher_reqtfm(&ctx->req);
+	unsigned ivsize = crypto_ablkcipher_ivsize(tfm);
+	struct skcipher_sg_list *sgl;
+	struct af_alg_control con = {};
+	long copied = 0;
+	bool enc = 0;
+	int limit;
+	int err;
+	int i;
+
+	if (msg->msg_controllen) {
+		err = af_alg_cmsg_send(msg, &con);
+		if (err)
+			return err;
+
+		switch (con.op) {
+		case ALG_OP_ENCRYPT:
+			enc = 1;
+			break;
+		case ALG_OP_DECRYPT:
+			enc = 0;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (con.iv && con.iv->ivlen != ivsize)
+			return -EINVAL;
+	}
+
+	err = -EINVAL;
+
+	lock_sock(sk);
+	if (!ctx->more && ctx->used)
+		goto unlock;
+
+	if (!ctx->used) {
+		ctx->enc = enc;
+		if (con.iv)
+			memcpy(ctx->iv, con.iv->iv, ivsize);
+	}
+
+	limit = max_t(int, sk->sk_sndbuf, PAGE_SIZE);
+	limit -= ctx->used;
+
+	while (size) {
+		struct scatterlist *sg;
+		unsigned long len = size;
+		int plen;
+
+		if (ctx->merge) {
+			sgl = list_entry(ctx->tsgl.prev,
+					 struct skcipher_sg_list, list);
+			sg = sgl->sg + sgl->cur - 1;
+			len = min_t(unsigned long, len, PAGE_SIZE - sg->length);
+
+			err = memcpy_fromiovec(page_address(sg_page(sg)) +
+					       sg->length, msg->msg_iov, len);
+			if (err)
+				goto unlock;
+
+			sg->length += len;
+			ctx->merge = sg->length & (PAGE_SIZE - 1);
+
+			size -= len;
+			copied += len;
+			continue;
+		}
+
+		if (limit < PAGE_SIZE) {
+			release_sock(sk);
+			err = skcipher_wait_for_wmem(sk, msg->msg_flags);
+			lock_sock(sk);
+			if (err)
+				goto unlock;
+
+			limit = max_t(int, sk->sk_sndbuf, PAGE_SIZE);
+			limit -= ctx->used;
+		}
+
+		len = min_t(unsigned long, len, limit);
+
+		err = skcipher_alloc_sgl(sk, len);
+		if (err)
+			goto unlock;
+
+		sgl = list_entry(ctx->tsgl.prev, struct skcipher_sg_list,
+				 list);
+		sg = sgl->sg;
+		do {
+			i = sgl->cur;
+			plen = min_t(int, len, PAGE_SIZE);
+
+			sg_assign_page(sg + i, alloc_page(GFP_KERNEL));
+			err = -ENOMEM;
+			if (!sg_page(sg + i))
+				goto unlock;
+
+			err = memcpy_fromiovec(page_address(sg_page(sg + i)),
+					       msg->msg_iov, plen);
+			if (err) {
+				__free_page(sg_page(sg + i));
+				sg_assign_page(sg + i, NULL);
+				goto unlock;
+			}
+
+			sg[i].length = plen;
+			len -= plen;
+			ctx->used += plen;
+			copied += plen;
+			size -= plen;
+			limit -= plen;
+			sgl->cur++;
+		} while (sg_page(sg + ++i));
+
+		ctx->merge = plen & (PAGE_SIZE - 1);
+		if (ctx->merge)
+			sgl->cur--;
+	}
+
+	err = 0;
+
+	ctx->more = msg->msg_flags & MSG_MORE;
+	if (!ctx->more && !list_empty(&ctx->tsgl)) {
+		sgl = list_entry(ctx->tsgl.prev, struct skcipher_sg_list, list);
+		sg_mark_end(sgl->sg + sgl->cur - 1);
+	}
+
+unlock:
+	skcipher_data_wakeup(sk);
+	release_sock(sk);
+
+	return copied ?: err;
+}
+
+static ssize_t skcipher_sendpage(struct socket *sock, struct page *page,
+				 int offset, size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+	struct skcipher_sg_list *sgl;
+	int err = -EINVAL;
+	int limit;
+
+	lock_sock(sk);
+	if (!ctx->more && ctx->used)
+		goto unlock;
+
+	if (!size)
+		goto done;
+
+	limit = max_t(int, sk->sk_sndbuf, PAGE_SIZE);
+	limit -= ctx->used;
+
+	if (limit < PAGE_SIZE) {
+		release_sock(sk);
+		err = skcipher_wait_for_wmem(sk, flags);
+		lock_sock(sk);
+		if (err)
+			goto unlock;
+
+		limit = max_t(int, sk->sk_sndbuf, PAGE_SIZE);
+		limit -= ctx->used;
+	}
+
+	err = skcipher_alloc_sgl(sk, 0);
+	if (err)
+		goto unlock;
+
+	ctx->merge = 0;
+	sgl = list_entry(ctx->tsgl.prev, struct skcipher_sg_list, list);
+
+	get_page(page);
+	sg_set_page(sgl->sg + sgl->cur, page, size, offset);
+	sgl->cur++;
+	ctx->used += size;
+
+done:
+	ctx->more = flags & MSG_MORE;
+	if (!ctx->more && !list_empty(&ctx->tsgl)) {
+		sgl = list_entry(ctx->tsgl.prev, struct skcipher_sg_list, list);
+		sg_mark_end(sgl->sg + sgl->cur - 1);
+	}
+
+unlock:
+	skcipher_data_wakeup(sk);
+	release_sock(sk);
+
+	return err ?: size;
+}
+
+static int skcipher_recvmsg(struct kiocb *unused, struct socket *sock,
+			    struct msghdr *msg, size_t ignored, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+	unsigned bs = crypto_ablkcipher_blocksize(crypto_ablkcipher_reqtfm(
+		&ctx->req));
+	struct skcipher_sg_list *sgl;
+	struct scatterlist *sg;
+	unsigned long iovlen;
+	struct iovec *iov;
+	int err = -EAGAIN;
+	int used;
+	long copied = 0;
+
+	lock_sock(sk);
+	for (iov = msg->msg_iov, iovlen = msg->msg_iovlen; iovlen > 0;
+	     iovlen--, iov++) {
+		unsigned long seglen = iov->iov_len;
+		char __user *from = iov->iov_base;
+
+		sgl = list_first_entry(&ctx->tsgl, struct skcipher_sg_list,
+				       list);
+		sg = sgl->sg;
+		while (!sg->length)
+			sg++;
+
+		while (seglen) {
+			used = ctx->used;
+			if (!used) {
+				release_sock(sk);
+				err = skcipher_wait_for_data(sk, flags);
+				lock_sock(sk);
+				if (err)
+					goto unlock;
+			}
+
+			used = min_t(unsigned long, used, seglen);
+
+			if (ctx->more || used < ctx->used)
+				used -= used % bs;
+
+			err = -EINVAL;
+			if (!used)
+				goto unlock;
+
+			used = af_alg_make_sg(&ctx->rsgl, from, used, 1);
+			if (used < 0)
+				goto unlock;
+
+			ablkcipher_request_set_crypt(&ctx->req, sg,
+						     ctx->rsgl.sg, used,
+						     ctx->iv);
+
+			err = af_alg_wait_for_completion(
+				ctx->enc ?
+					crypto_ablkcipher_encrypt(&ctx->req) :
+					crypto_ablkcipher_decrypt(&ctx->req),
+				&ctx->completion);
+
+			af_alg_free_sg(&ctx->rsgl);
+
+			if (err)
+				goto unlock;
+
+			copied += used;
+			from += used;
+			seglen -= used;
+			skcipher_pull_sgl(sk, used);
+		}
+	}
+
+	err = 0;
+
+unlock:
+	skcipher_wmem_wakeup(sk);
+	release_sock(sk);
+
+	return copied ?: err;
+}
+
+
+static unsigned int skcipher_poll(struct file *file, struct socket *sock,
+				  poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+	unsigned int mask;
+
+	sock_poll_wait(file, sk_sleep(sk), wait);
+	mask = 0;
+
+	if (ctx->used)
+		mask |= POLLIN | POLLRDNORM;
+
+	if (skcipher_writable(sk))
+		mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
+
+	return mask;
+}
+
+static struct proto_ops algif_skcipher_ops = {
+	.family		=	PF_ALG,
+
+	.connect	=	sock_no_connect,
+	.socketpair	=	sock_no_socketpair,
+	.getname	=	sock_no_getname,
+	.ioctl		=	sock_no_ioctl,
+	.listen		=	sock_no_listen,
+	.shutdown	=	sock_no_shutdown,
+	.getsockopt	=	sock_no_getsockopt,
+	.mmap		=	sock_no_mmap,
+	.bind		=	sock_no_bind,
+	.accept		=	sock_no_accept,
+	.setsockopt	=	sock_no_setsockopt,
+
+	.release	=	af_alg_release,
+	.sendmsg	=	skcipher_sendmsg,
+	.sendpage	=	skcipher_sendpage,
+	.recvmsg	=	skcipher_recvmsg,
+	.poll		=	skcipher_poll,
+};
+
+static void *skcipher_bind(const char *name, u32 type, u32 mask)
+{
+	return crypto_alloc_ablkcipher(name, type, mask);
+}
+
+static void skcipher_release(void *private)
+{
+	crypto_free_ablkcipher(private);
+}
+
+static int skcipher_setkey(void *private, const u8 *key, unsigned int keylen)
+{
+	return crypto_ablkcipher_setkey(private, key, keylen);
+}
+
+static void skcipher_sock_destruct(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct skcipher_ctx *ctx = ask->private;
+	struct crypto_ablkcipher *tfm = crypto_ablkcipher_reqtfm(&ctx->req);
+
+	skcipher_free_sgl(sk);
+	sock_kfree_s(sk, ctx->iv, crypto_ablkcipher_ivsize(tfm));
+	sock_kfree_s(sk, ctx, ctx->len);
+	af_alg_release_parent(sk);
+}
+
+static int skcipher_accept_parent(void *private, struct sock *sk)
+{
+	struct skcipher_ctx *ctx;
+	struct alg_sock *ask = alg_sk(sk);
+	unsigned int len = sizeof(*ctx) + crypto_ablkcipher_reqsize(private);
+
+	ctx = sock_kmalloc(sk, len, GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->iv = sock_kmalloc(sk, crypto_ablkcipher_ivsize(private),
+			       GFP_KERNEL);
+	if (!ctx->iv) {
+		sock_kfree_s(sk, ctx, len);
+		return -ENOMEM;
+	}
+
+	memset(ctx->iv, 0, crypto_ablkcipher_ivsize(private));
+
+	INIT_LIST_HEAD(&ctx->tsgl);
+	ctx->len = len;
+	ctx->used = 0;
+	ctx->more = 0;
+	ctx->merge = 0;
+	ctx->enc = 0;
+	af_alg_init_completion(&ctx->completion);
+
+	ask->private = ctx;
+
+	ablkcipher_request_set_tfm(&ctx->req, private);
+	ablkcipher_request_set_callback(&ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+					af_alg_complete, &ctx->completion);
+
+	sk->sk_destruct = skcipher_sock_destruct;
+
+	return 0;
+}
+
+static const struct af_alg_type algif_type_skcipher = {
+	.bind		=	skcipher_bind,
+	.release	=	skcipher_release,
+	.setkey		=	skcipher_setkey,
+	.accept		=	skcipher_accept_parent,
+	.ops		=	&algif_skcipher_ops,
+	.name		=	"skcipher",
+	.owner		=	THIS_MODULE
+};
+
+static int __init algif_skcipher_init(void)
+{
+	return af_alg_register_type(&algif_type_skcipher);
+}
+
+static void __exit algif_skcipher_exit(void)
+{
+	int err = af_alg_unregister_type(&algif_type_skcipher);
+	BUG_ON(err);
+}
+
+module_init(algif_skcipher_init);
+module_exit(algif_skcipher_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(AF_ALG);


* [PATCH 3/4] crypto: algif_hash - User-space interface for hash operations
From: Herbert Xu @ 2010-11-04 17:36 UTC (permalink / raw)
  To: Linux Crypto Mailing List, netdev, Linux Kernel Mailing List
In-Reply-To: <20101104173456.GA1321@gondor.apana.org.au>

crypto: algif_hash - User-space interface for hash operations

This patch adds the af_alg plugin for hash, corresponding to
the ahash kernel operation type.

Keys can optionally be set through the setsockopt interface.

Each sendmsg call will finalise the hash unless sent with a MSG_MORE
flag.

Partial hash states can be cloned using accept(2).

The interface is completely synchronous: all operations will
complete prior to the system call returning.

Both sendmsg(2) and splice(2) support reading the user-space
data directly without copying (except that the Crypto API itself
may copy the data if alignment is off).

For now only the splice(2) interface supports performing digest
instead of init/update/final.  In future the sendmsg(2) interface
will also be modified to use digest/finup where possible so that
hardware that cannot return a partial hash state can still benefit
from this interface.
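
For orientation, a user-space caller might drive the hash interface
roughly as sketched below.  This assumes the socket-level API as it
later appeared in mainline (struct sockaddr_alg from <linux/if_alg.h>,
bind(2) on the algorithm socket, accept(2) for the operation socket);
the details in this posting may differ.  alg_hash() is a hypothetical
helper, and it simply returns -1 on kernels without AF_ALG support.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

#ifndef AF_ALG
#define AF_ALG 38	/* value used by mainline; assumed here */
#endif

/*
 * Hash len bytes of data with the named algorithm (e.g. "sha1").
 * Returns the number of digest bytes written to out, or -1 on failure.
 */
static int alg_hash(const char *name, const void *data, size_t len,
		    unsigned char *out, size_t outlen)
{
	struct sockaddr_alg sa = { .salg_family = AF_ALG };
	int tfmfd, opfd;
	ssize_t n = -1;

	strncpy((char *)sa.salg_type, "hash", sizeof(sa.salg_type) - 1);
	strncpy((char *)sa.salg_name, name, sizeof(sa.salg_name) - 1);

	tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
	if (tfmfd < 0)
		return -1;

	if (bind(tfmfd, (struct sockaddr *)&sa, sizeof(sa)) == 0) {
		opfd = accept(tfmfd, NULL, NULL);
		if (opfd >= 0) {
			/* No MSG_MORE: this send finalises the hash. */
			if (send(opfd, data, len, 0) == (ssize_t)len)
				n = recv(opfd, out, outlen, 0);
			close(opfd);
		}
	}
	close(tfmfd);

	return n > 0 ? (int)n : -1;
}
```

Passing MSG_MORE to send(2) instead would keep the hash state open for
further updates, as described above.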

Thanks to Miloslav Trmac for reviewing this and contributing
fixes and improvements.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 crypto/Kconfig      |    8 +
 crypto/Makefile     |    1 
 crypto/algif_hash.c |  321 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 330 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 357e3ca..6db27d7 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -844,6 +844,14 @@ config CRYPTO_ANSI_CPRNG
 config CRYPTO_USER_API
 	tristate
 
+config CRYPTO_USER_API_HASH
+	tristate "User-space interface for hash algorithms"
+	select CRYPTO_HASH
+	select CRYPTO_USER_API
+	help
+	  This option enables the user-spaces interface for hash
+	  algorithms.
+
 source "drivers/crypto/Kconfig"
 
 endif	# if CRYPTO
diff --git a/crypto/Makefile b/crypto/Makefile
index 0b13197..14ab405 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_CRYPTO_ANSI_CPRNG) += ansi_cprng.o
 obj-$(CONFIG_CRYPTO_TEST) += tcrypt.o
 obj-$(CONFIG_CRYPTO_GHASH) += ghash-generic.o
 obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
+obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 
 #
 # generic algorithms and the async_tx api
diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
new file mode 100644
index 0000000..014ea43
--- /dev/null
+++ b/crypto/algif_hash.c
@@ -0,0 +1,321 @@
+/*
+ * algif_hash: User-space interface for hash algorithms
+ *
+ * This file provides the user-space API for hash algorithms.
+ *
+ * Copyright (c) 2010 Herbert Xu <herbert@gondor.apana.org.au>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#include <crypto/hash.h>
+#include <crypto/if_alg.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <net/sock.h>
+
+struct hash_ctx {
+	struct af_alg_sgl sgl;
+
+	u8 *result;
+
+	struct af_alg_completion completion;
+
+	unsigned int len;
+	int err;
+	bool more;
+
+	struct ahash_request req;
+};
+
+static int hash_sendmsg(struct kiocb *unused, struct socket *sock,
+			struct msghdr *msg, size_t ignored)
+{
+	int limit = ALG_MAX_PAGES * PAGE_SIZE;
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct hash_ctx *ctx = ask->private;
+	unsigned long iovlen;
+	struct iovec *iov;
+	long copied = 0;
+	int err;
+
+	if (limit > sk->sk_sndbuf)
+		limit = sk->sk_sndbuf;
+
+	lock_sock(sk);
+	if (!ctx->more) {
+		err = crypto_ahash_init(&ctx->req);
+		if (err)
+			goto unlock;
+	}
+
+	ctx->more = 0;
+
+	for (iov = msg->msg_iov, iovlen = msg->msg_iovlen; iovlen > 0;
+	     iovlen--, iov++) {
+		unsigned long seglen = iov->iov_len;
+		char __user *from = iov->iov_base;
+
+		while (seglen) {
+			int len = min_t(unsigned long, seglen, limit);
+			int newlen;
+
+			newlen = af_alg_make_sg(&ctx->sgl, from, len, 0);
+			if (newlen < 0)
+				goto unlock;
+
+			ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, NULL,
+						newlen);
+
+			err = af_alg_wait_for_completion(
+				crypto_ahash_update(&ctx->req),
+				&ctx->completion);
+
+			af_alg_free_sg(&ctx->sgl);
+
+			if (err)
+				goto unlock;
+
+			seglen -= newlen;
+			from += newlen;
+			copied += newlen;
+		}
+	}
+
+	err = 0;
+
+	ctx->more = msg->msg_flags & MSG_MORE;
+	if (!ctx->more) {
+		ahash_request_set_crypt(&ctx->req, NULL, ctx->result, 0);
+		err = af_alg_wait_for_completion(crypto_ahash_final(&ctx->req),
+						 &ctx->completion);
+	}
+
+unlock:
+	release_sock(sk);
+
+	return err ?: copied;
+}
+
+static ssize_t hash_sendpage(struct socket *sock, struct page *page,
+			     int offset, size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct hash_ctx *ctx = ask->private;
+	int err;
+
+	lock_sock(sk);
+	sg_init_table(ctx->sgl.sg, 1);
+	sg_set_page(ctx->sgl.sg, page, size, offset);
+
+	ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, ctx->result, size);
+
+	if (!(flags & MSG_MORE)) {
+		if (ctx->more)
+			err = crypto_ahash_finup(&ctx->req);
+		else
+			err = crypto_ahash_digest(&ctx->req);
+	} else {
+		if (!ctx->more) {
+			err = crypto_ahash_init(&ctx->req);
+			if (err)
+				goto unlock;
+		}
+
+		err = crypto_ahash_update(&ctx->req);
+	}
+
+	err = af_alg_wait_for_completion(err, &ctx->completion);
+	if (err)
+		goto unlock;
+
+	ctx->more = flags & MSG_MORE;
+
+unlock:
+	release_sock(sk);
+
+	return err ?: size;
+}
+
+static int hash_recvmsg(struct kiocb *unused, struct socket *sock,
+			struct msghdr *msg, size_t len, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct hash_ctx *ctx = ask->private;
+	unsigned ds = crypto_ahash_digestsize(crypto_ahash_reqtfm(&ctx->req));
+	int err;
+
+	if (len > ds)
+		len = ds;
+	else if (len < ds)
+		msg->msg_flags |= MSG_TRUNC;
+
+	lock_sock(sk);
+	if (ctx->more) {
+		ctx->more = 0;
+		ahash_request_set_crypt(&ctx->req, NULL, ctx->result, 0);
+		err = af_alg_wait_for_completion(crypto_ahash_final(&ctx->req),
+						 &ctx->completion);
+		if (err)
+			goto unlock;
+	}
+
+	err = memcpy_toiovec(msg->msg_iov, ctx->result, len);
+
+unlock:
+	release_sock(sk);
+
+	return err ?: len;
+}
+
+static int hash_accept(struct socket *sock, struct socket *newsock, int flags)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct hash_ctx *ctx = ask->private;
+	struct ahash_request *req = &ctx->req;
+	char state[crypto_ahash_statesize(crypto_ahash_reqtfm(req))];
+	struct sock *sk2;
+	struct alg_sock *ask2;
+	struct hash_ctx *ctx2;
+	int err;
+
+	err = crypto_ahash_export(req, state);
+	if (err)
+		return err;
+
+	err = af_alg_accept(ask->parent, newsock);
+	if (err)
+		return err;
+
+	sk2 = newsock->sk;
+	ask2 = alg_sk(sk2);
+	ctx2 = ask2->private;
+	ctx2->more = 1;
+
+	err = crypto_ahash_import(&ctx2->req, state);
+	if (err) {
+		sock_orphan(sk2);
+		sock_put(sk2);
+	}
+
+	return err;
+}
+
+static struct proto_ops algif_hash_ops = {
+	.family		=	PF_ALG,
+
+	.connect	=	sock_no_connect,
+	.socketpair	=	sock_no_socketpair,
+	.getname	=	sock_no_getname,
+	.ioctl		=	sock_no_ioctl,
+	.listen		=	sock_no_listen,
+	.shutdown	=	sock_no_shutdown,
+	.getsockopt	=	sock_no_getsockopt,
+	.mmap		=	sock_no_mmap,
+	.bind		=	sock_no_bind,
+	.setsockopt	=	sock_no_setsockopt,
+	.poll		=	sock_no_poll,
+
+	.release	=	af_alg_release,
+	.sendmsg	=	hash_sendmsg,
+	.sendpage	=	hash_sendpage,
+	.recvmsg	=	hash_recvmsg,
+	.accept		=	hash_accept,
+};
+
+static void *hash_bind(const char *name, u32 type, u32 mask)
+{
+	return crypto_alloc_ahash(name, type, mask);
+}
+
+static void hash_release(void *private)
+{
+	crypto_free_ahash(private);
+}
+
+static int hash_setkey(void *private, const u8 *key, unsigned int keylen)
+{
+	return crypto_ahash_setkey(private, key, keylen);
+}
+
+static void hash_sock_destruct(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct hash_ctx *ctx = ask->private;
+
+	sock_kfree_s(sk, ctx->result,
+		     crypto_ahash_digestsize(crypto_ahash_reqtfm(&ctx->req)));
+	sock_kfree_s(sk, ctx, ctx->len);
+	af_alg_release_parent(sk);
+}
+
+static int hash_accept_parent(void *private, struct sock *sk)
+{
+	struct hash_ctx *ctx;
+	struct alg_sock *ask = alg_sk(sk);
+	unsigned len = sizeof(*ctx) + crypto_ahash_reqsize(private);
+	unsigned ds = crypto_ahash_digestsize(private);
+
+	ctx = sock_kmalloc(sk, len, GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->result = sock_kmalloc(sk, ds, GFP_KERNEL);
+	if (!ctx->result) {
+		sock_kfree_s(sk, ctx, len);
+		return -ENOMEM;
+	}
+
+	memset(ctx->result, 0, ds);
+
+	ctx->len = len;
+	ctx->more = 0;
+	af_alg_init_completion(&ctx->completion);
+
+	ask->private = ctx;
+
+	ahash_request_set_tfm(&ctx->req, private);
+	ahash_request_set_callback(&ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   af_alg_complete, &ctx->completion);
+
+	sk->sk_destruct = hash_sock_destruct;
+
+	return 0;
+}
+
+static const struct af_alg_type algif_type_hash = {
+	.bind		=	hash_bind,
+	.release	=	hash_release,
+	.setkey		=	hash_setkey,
+	.accept		=	hash_accept_parent,
+	.ops		=	&algif_hash_ops,
+	.name		=	"hash",
+	.owner		=	THIS_MODULE
+};
+
+static int __init algif_hash_init(void)
+{
+	return af_alg_register_type(&algif_type_hash);
+}
+
+static void __exit algif_hash_exit(void)
+{
+	int err = af_alg_unregister_type(&algif_type_hash);
+	BUG_ON(err);
+}
+
+module_init(algif_hash_init);
+module_exit(algif_hash_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(AF_ALG);


* [PATCH 2/4] crypto: af_alg - User-space interface for Crypto API
From: Herbert Xu @ 2010-11-04 17:36 UTC (permalink / raw)
  To: Linux Crypto Mailing List, netdev, Linux Kernel Mailing List
In-Reply-To: <20101104173456.GA1321@gondor.apana.org.au>

crypto: af_alg - User-space interface for Crypto API

This patch creates the backbone of the user-space interface for
the Crypto API, through a new socket family AF_ALG.

Each session corresponds to one or more connections obtained from
that socket.  The number depends on the number of inputs/outputs
of that particular type of operation.  For most types there will
be a single connection/file descriptor that is used for both input
and output.  AEAD is one of the few that require two inputs.

Each algorithm type will provide its own implementation that plugs
into af_alg.  They're keyed using a string such as "skcipher" or
"hash".

IOW this patch only contains the boring bits that are required
to hold everything together.
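
The string-keyed plugin scheme described above can be sketched in
user-space C as follows.  This is a simplified model, assuming a plain
singly linked list and no locking; struct alg_type, alg_find_type() and
alg_register_type() are hypothetical names and not the functions added
by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct af_alg_type. */
struct alg_type {
	const char *name;	/* key, e.g. "hash" or "skcipher" */
	struct alg_type *next;
};

static struct alg_type *alg_types_head;

/* Look up a registered type by its string key; NULL if absent. */
static struct alg_type *alg_find_type(const char *name)
{
	struct alg_type *t;

	for (t = alg_types_head; t; t = t->next)
		if (strcmp(t->name, name) == 0)
			return t;
	return NULL;
}

/* Register a type; refuse a second registration under the same key. */
static int alg_register_type(struct alg_type *type)
{
	if (alg_find_type(type->name))
		return -EEXIST;
	type->next = alg_types_head;
	alg_types_head = type;
	return 0;
}
```

A real implementation would also need to serialise registration against
lookup and pin the owning module while a type is in use.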

Thanks to Miloslav Trmac for reviewing this and contributing
fixes and improvements.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 crypto/Kconfig          |    3 
 crypto/Makefile         |    1 
 crypto/af_alg.c         |  460 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/crypto/if_alg.h |   92 +++++++++
 include/linux/if_alg.h  |   40 ++++
 5 files changed, 596 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index e4bac29..357e3ca 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -841,6 +841,9 @@ config CRYPTO_ANSI_CPRNG
 	  ANSI X9.31 A.2.4. Note that this option must be enabled if
 	  CRYPTO_FIPS is selected
 
+config CRYPTO_USER_API
+	tristate
+
 source "drivers/crypto/Kconfig"
 
 endif	# if CRYPTO
diff --git a/crypto/Makefile b/crypto/Makefile
index 423b7de..0b13197 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_CRYPTO_RNG2) += krng.o
 obj-$(CONFIG_CRYPTO_ANSI_CPRNG) += ansi_cprng.o
 obj-$(CONFIG_CRYPTO_TEST) += tcrypt.o
 obj-$(CONFIG_CRYPTO_GHASH) += ghash-generic.o
+obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
 
 #
 # generic algorithms and the async_tx api
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
new file mode 100644
index 0000000..92cd9b1
--- /dev/null
+++ b/crypto/af_alg.c
@@ -0,0 +1,460 @@
+/*
+ * af_alg: User-space algorithm interface
+ *
+ * This file provides the user-space API for algorithms.
+ *
+ * Copyright (c) 2010 Herbert Xu <herbert@gondor.apana.org.au>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#include <asm/atomic.h>
+#include <crypto/if_alg.h>
+#include <linux/crypto.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/rwsem.h>
+
+struct alg_type_list {
+	const struct af_alg_type *type;
+	struct list_head list;
+};
+
+static atomic_t alg_memory_allocated;
+
+static struct proto alg_proto = {
+	.name			= "ALG",
+	.owner			= THIS_MODULE,
+	.memory_allocated	= &alg_memory_allocated,
+	.obj_size		= sizeof(struct alg_sock),
+};
+
+static LIST_HEAD(alg_types);
+static DECLARE_RWSEM(alg_types_sem);
+
+static const struct af_alg_type *alg_get_type(const char *name)
+{
+	const struct af_alg_type *type = ERR_PTR(-ENOENT);
+	struct alg_type_list *node;
+
+	down_read(&alg_types_sem);
+	list_for_each_entry(node, &alg_types, list) {
+		if (strcmp(node->type->name, name))
+			continue;
+
+		if (try_module_get(node->type->owner))
+			type = node->type;
+		break;
+	}
+	up_read(&alg_types_sem);
+
+	return type;
+}
+
+int af_alg_register_type(const struct af_alg_type *type)
+{
+	struct alg_type_list *node;
+	int err = -EEXIST;
+
+	down_write(&alg_types_sem);
+	list_for_each_entry(node, &alg_types, list) {
+		if (!strcmp(node->type->name, type->name))
+			goto unlock;
+	}
+
+	node = kmalloc(sizeof(*node), GFP_KERNEL);
+	err = -ENOMEM;
+	if (!node)
+		goto unlock;
+
+	type->ops->owner = THIS_MODULE;
+	node->type = type;
+	list_add(&node->list, &alg_types);
+	err = 0;
+
+unlock:
+	up_write(&alg_types_sem);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(af_alg_register_type);
+
+int af_alg_unregister_type(const struct af_alg_type *type)
+{
+	struct alg_type_list *node;
+	int err = -ENOENT;
+
+	down_write(&alg_types_sem);
+	list_for_each_entry(node, &alg_types, list) {
+		if (strcmp(node->type->name, type->name))
+			continue;
+
+		list_del(&node->list);
+		kfree(node);
+		err = 0;
+		break;
+	}
+	up_write(&alg_types_sem);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(af_alg_unregister_type);
+
+static void alg_do_release(const struct af_alg_type *type, void *private)
+{
+	if (!type)
+		return;
+
+	type->release(private);
+	module_put(type->owner);
+}
+
+int af_alg_release(struct socket *sock)
+{
+	if (sock->sk)
+		sock_put(sock->sk);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(af_alg_release);
+
+static int alg_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	struct sockaddr_alg *sa = (void *)uaddr;
+	const struct af_alg_type *type;
+	void *private;
+
+	if (sock->state == SS_CONNECTED)
+		return -EINVAL;
+
+	if (addr_len != sizeof(*sa))
+		return -EINVAL;
+
+	sa->salg_type[sizeof(sa->salg_type) - 1] = 0;
+	sa->salg_name[sizeof(sa->salg_name) - 1] = 0;
+
+	type = alg_get_type(sa->salg_type);
+	if (IS_ERR(type) && PTR_ERR(type) == -ENOENT) {
+		request_module("algif-%s", sa->salg_type);
+		type = alg_get_type(sa->salg_type);
+	}
+
+	if (IS_ERR(type))
+		return PTR_ERR(type);
+
+	private = type->bind(sa->salg_name, sa->salg_feat, sa->salg_mask);
+	if (IS_ERR(private)) {
+		module_put(type->owner);
+		return PTR_ERR(private);
+	}
+
+	lock_sock(sk);
+
+	swap(ask->type, type);
+	swap(ask->private, private);
+
+	release_sock(sk);
+
+	alg_do_release(type, private);
+
+	return 0;
+}
+
+static int alg_setkey(struct sock *sk, char __user *ukey,
+		      unsigned int keylen)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	const struct af_alg_type *type = ask->type;
+	u8 *key;
+	int err;
+
+	key = sock_kmalloc(sk, keylen, GFP_KERNEL);
+	if (!key)
+		return -ENOMEM;
+
+	if (copy_from_user(key, ukey, keylen))
+		err = -EFAULT;	/* fall through so the key buffer is freed */
+	else
+		err = type->setkey(ask->private, key, keylen);
+
+	sock_kfree_s(sk, key, keylen);
+
+	return err;
+}
+
+static int alg_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, unsigned int optlen)
+{
+	struct sock *sk = sock->sk;
+	struct alg_sock *ask = alg_sk(sk);
+	const struct af_alg_type *type = ask->type;
+
+	if (level != SOL_ALG || !type)
+		return -ENOPROTOOPT;
+
+	switch (optname) {
+	case ALG_SET_KEY:
+		if (sock->state == SS_CONNECTED)
+			return -ENOPROTOOPT;
+		if (!type->setkey)
+			return -ENOPROTOOPT;
+
+		return alg_setkey(sk, optval, optlen);
+	}
+
+	return -ENOPROTOOPT;
+}
+
+int af_alg_accept(struct sock *sk, struct socket *newsock)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	const struct af_alg_type *type = ask->type;
+	struct sock *sk2;
+	int err;
+
+	if (!type)
+		return -EINVAL;
+
+	sk2 = sk_alloc(sock_net(sk), PF_ALG, GFP_KERNEL, &alg_proto);
+	if (!sk2)
+		return -ENOMEM;
+
+	sock_init_data(newsock, sk2);
+
+	err = type->accept(ask->private, sk2);
+	if (err) {
+		sk_free(sk2);
+		return err;
+	}
+
+	sk2->sk_family = PF_ALG;
+
+	sock_hold(sk);
+	alg_sk(sk2)->parent = sk;
+	alg_sk(sk2)->type = type;
+
+	newsock->ops = type->ops;
+	newsock->state = SS_CONNECTED;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(af_alg_accept);
+
+static int alg_accept(struct socket *sock, struct socket *newsock, int flags)
+{
+	return af_alg_accept(sock->sk, newsock);
+}
+
+static const struct proto_ops alg_proto_ops = {
+	.family		=	PF_ALG,
+	.owner		=	THIS_MODULE,
+
+	.connect	=	sock_no_connect,
+	.socketpair	=	sock_no_socketpair,
+	.getname	=	sock_no_getname,
+	.ioctl		=	sock_no_ioctl,
+	.listen		=	sock_no_listen,
+	.shutdown	=	sock_no_shutdown,
+	.getsockopt	=	sock_no_getsockopt,
+	.mmap		=	sock_no_mmap,
+	.sendpage	=	sock_no_sendpage,
+	.sendmsg	=	sock_no_sendmsg,
+	.recvmsg	=	sock_no_recvmsg,
+	.poll		=	sock_no_poll,
+
+	.bind		=	alg_bind,
+	.release	=	af_alg_release,
+	.setsockopt	=	alg_setsockopt,
+	.accept		=	alg_accept,
+};
+
+static void alg_sock_destruct(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+
+	alg_do_release(ask->type, ask->private);
+}
+
+static int alg_create(struct net *net, struct socket *sock, int protocol,
+		      int kern)
+{
+	struct sock *sk;
+	int err;
+
+	if (sock->type != SOCK_SEQPACKET)
+		return -ESOCKTNOSUPPORT;
+	if (protocol != 0)
+		return -EPROTONOSUPPORT;
+
+	err = -ENOMEM;
+	sk = sk_alloc(net, PF_ALG, GFP_KERNEL, &alg_proto);
+	if (!sk)
+		goto out;
+
+	sock->ops = &alg_proto_ops;
+	sock_init_data(sock, sk);
+
+	sk->sk_family = PF_ALG;
+	sk->sk_destruct = alg_sock_destruct;
+
+	return 0;
+out:
+	return err;
+}
+
+static const struct net_proto_family alg_family = {
+	.family	=	PF_ALG,
+	.create	=	alg_create,
+	.owner	=	THIS_MODULE,
+};
+
+int af_alg_make_sg(struct af_alg_sgl *sgl, void __user *addr, int len,
+		   int write)
+{
+	unsigned long from = (unsigned long)addr;
+	unsigned long npages;
+	unsigned off;
+	int err;
+	int i;
+
+	err = -EFAULT;
+	if (!access_ok(write ? VERIFY_READ : VERIFY_WRITE, addr, len))
+		goto out;
+
+	off = from & ~PAGE_MASK;
+	npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	err = get_user_pages_fast(from, npages, write, sgl->pages);
+	if (err < 0)
+		goto out;
+
+	npages = err;
+	err = -EINVAL;
+	if (WARN_ON(npages == 0))
+		goto out;
+
+	err = 0;
+
+	sg_init_table(sgl->sg, npages);
+
+	for (i = 0; i < npages; i++) {
+		int plen = min_t(int, len, PAGE_SIZE - off);
+
+		sg_set_page(sgl->sg + i, sgl->pages[i], plen, off);
+
+		off = 0;
+		len -= plen;
+		err += plen;
+	}
+
+out:
+	return err;
+}
+EXPORT_SYMBOL_GPL(af_alg_make_sg);
+
+void af_alg_free_sg(struct af_alg_sgl *sgl)
+{
+	int i;
+
+	i = 0;
+	do {
+		put_page(sgl->pages[i]);
+	} while (!sg_is_last(sgl->sg + (i++)));
+}
+EXPORT_SYMBOL_GPL(af_alg_free_sg);
+
+int af_alg_cmsg_send(struct msghdr *msg, struct af_alg_control *con)
+{
+	struct cmsghdr *cmsg;
+
+	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+		if (!CMSG_OK(msg, cmsg))
+			return -EINVAL;
+		if (cmsg->cmsg_level != SOL_ALG)
+			continue;
+
+		switch(cmsg->cmsg_type) {
+		case ALG_SET_IV:
+			if (cmsg->cmsg_len < sizeof(*con->iv))
+				return -EINVAL;
+			con->iv = (void *)CMSG_DATA(cmsg);
+			if (cmsg->cmsg_len < con->iv->ivlen +
+					     sizeof(con->iv->ivlen))
+				return -EINVAL;
+			break;
+
+		case ALG_SET_OP:
+			if (cmsg->cmsg_len < sizeof(u32))
+				return -EINVAL;
+			con->op = *(u32 *)CMSG_DATA(cmsg);
+			break;
+
+		default:
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(af_alg_cmsg_send);
+
+int af_alg_wait_for_completion(int err, struct af_alg_completion *completion)
+{
+	switch (err) {
+	case -EINPROGRESS:
+	case -EBUSY:
+		wait_for_completion(&completion->completion);
+		INIT_COMPLETION(completion->completion);
+		err = completion->err;
+		break;
+	};
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(af_alg_wait_for_completion);
+
+void af_alg_complete(struct crypto_async_request *req, int err)
+{
+	struct af_alg_completion *completion = req->data;
+
+	completion->err = err;
+	complete(&completion->completion);
+}
+EXPORT_SYMBOL_GPL(af_alg_complete);
+
+static int __init af_alg_init(void)
+{
+	int err = proto_register(&alg_proto, 0);
+
+	if (err)
+		goto out;
+
+	err = sock_register(&alg_family);
+	if (err != 0)
+		goto out_unregister_proto;
+
+out:
+	return err;
+
+out_unregister_proto:
+	proto_unregister(&alg_proto);
+	goto out;
+}
+
+static void __exit af_alg_exit(void)
+{
+	sock_unregister(PF_ALG);
+	proto_unregister(&alg_proto);
+}
+
+module_init(af_alg_init);
+module_exit(af_alg_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(AF_ALG);
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
new file mode 100644
index 0000000..c5813c8
--- /dev/null
+++ b/include/crypto/if_alg.h
@@ -0,0 +1,92 @@
+/*
+ * if_alg: User-space algorithm interface
+ *
+ * Copyright (c) 2010 Herbert Xu <herbert@gondor.apana.org.au>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#ifndef _CRYPTO_IF_ALG_H
+#define _CRYPTO_IF_ALG_H
+
+#include <linux/compiler.h>
+#include <linux/completion.h>
+#include <linux/if_alg.h>
+#include <linux/types.h>
+#include <net/sock.h>
+
+#define ALG_MAX_PAGES			16
+
+struct crypto_async_request;
+
+struct alg_sock {
+	/* struct sock must be the first member of struct alg_sock */
+	struct sock sk;
+
+	struct sock *parent;
+
+	const struct af_alg_type *type;
+	void *private;
+};
+
+struct af_alg_completion {
+	struct completion completion;
+	int err;
+};
+
+struct af_alg_control {
+	struct af_alg_iv *iv;
+	int op;
+};
+
+struct af_alg_type {
+	void *(*bind)(const char *name, u32 type, u32 mask);
+	void (*release)(void *private);
+	int (*setkey)(void *private, const u8 *key, unsigned int keylen);
+	int (*accept)(void *private, struct sock *sk);
+
+	struct proto_ops *ops;
+	struct module *owner;
+	char name[14];
+};
+
+struct af_alg_sgl {
+	struct scatterlist sg[ALG_MAX_PAGES];
+	struct page *pages[ALG_MAX_PAGES];
+};
+
+int af_alg_register_type(const struct af_alg_type *type);
+int af_alg_unregister_type(const struct af_alg_type *type);
+
+int af_alg_release(struct socket *sock);
+int af_alg_accept(struct sock *sk, struct socket *newsock);
+
+int af_alg_make_sg(struct af_alg_sgl *sgl, void __user *addr, int len,
+		   int write);
+void af_alg_free_sg(struct af_alg_sgl *sgl);
+
+int af_alg_cmsg_send(struct msghdr *msg, struct af_alg_control *con);
+
+int af_alg_wait_for_completion(int err, struct af_alg_completion *completion);
+void af_alg_complete(struct crypto_async_request *req, int err);
+
+static inline struct alg_sock *alg_sk(struct sock *sk)
+{
+	return (struct alg_sock *)sk;
+}
+
+static inline void af_alg_release_parent(struct sock *sk)
+{
+	sock_put(alg_sk(sk)->parent);
+}
+
+static inline void af_alg_init_completion(struct af_alg_completion *completion)
+{
+	init_completion(&completion->completion);
+}
+
+#endif	/* _CRYPTO_IF_ALG_H */
diff --git a/include/linux/if_alg.h b/include/linux/if_alg.h
new file mode 100644
index 0000000..0f9acce
--- /dev/null
+++ b/include/linux/if_alg.h
@@ -0,0 +1,40 @@
+/*
+ * if_alg: User-space algorithm interface
+ *
+ * Copyright (c) 2010 Herbert Xu <herbert@gondor.apana.org.au>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#ifndef _LINUX_IF_ALG_H
+#define _LINUX_IF_ALG_H
+
+#include <linux/types.h>
+
+struct sockaddr_alg {
+	__u16	salg_family;
+	__u8	salg_type[14];
+	__u32	salg_feat;
+	__u32	salg_mask;
+	__u8	salg_name[64];
+};
+
+struct af_alg_iv {
+	__u32	ivlen;
+	__u8	iv[0];
+};
+
+/* Socket options */
+#define ALG_SET_KEY			1
+#define ALG_SET_IV			2
+#define ALG_SET_OP			3
+
+/* Operations */
+#define ALG_OP_DECRYPT			0
+#define ALG_OP_ENCRYPT			1
+
+#endif	/* _LINUX_IF_ALG_H */

^ permalink raw reply related

* [PATCH 1/4] net - Add AF_ALG macros
From: Herbert Xu @ 2010-11-04 17:36 UTC (permalink / raw)
  To: Linux Crypto Mailing List, netdev, Linux Kernel Mailing List
In-Reply-To: <20101104173456.GA1321@gondor.apana.org.au>

net - Add AF_ALG macros

This patch adds the socket family/level macros for the yet-to-be-born
AF_ALG family.  The AF_ALG family provides the user-space interface
for the kernel crypto API.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/socket.h |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5146b50..ebc081b 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -193,7 +193,8 @@ struct ucred {
 #define AF_PHONET	35	/* Phonet sockets		*/
 #define AF_IEEE802154	36	/* IEEE802154 sockets		*/
 #define AF_CAIF		37	/* CAIF sockets			*/
-#define AF_MAX		38	/* For now.. */
+#define AF_ALG		38	/* Algorithm sockets		*/
+#define AF_MAX		39	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -234,6 +235,7 @@ struct ucred {
 #define PF_PHONET	AF_PHONET
 #define PF_IEEE802154	AF_IEEE802154
 #define PF_CAIF		AF_CAIF
+#define PF_ALG		AF_ALG
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -307,6 +309,7 @@ struct ucred {
 #define SOL_RDS		276
 #define SOL_IUCV	277
 #define SOL_CAIF	278
+#define SOL_ALG		279
 
 /* IPX options */
 #define IPX_TYPE	1

^ permalink raw reply related

* Re: RFC: Crypto API User-interface
From: Herbert Xu @ 2010-11-04 17:34 UTC (permalink / raw)
  To: Linux Crypto Mailing List, netdev, Linux Kernel Mailing List
In-Reply-To: <20101019134418.GA13514@gondor.apana.org.au>

On Tue, Oct 19, 2010 at 09:44:18PM +0800, Herbert Xu wrote:
> 
> OK I've gone ahead and implemented the user-space API for hashes
> and ciphers.

Here is a revised series with bug fixes and improvements.  The
main change is that hashes can now be finalised by recvmsg instead
of requiring a preceding sendmsg with no MSG_MORE.
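
To make the recvmsg change concrete, here is a minimal user-space sketch
of hashing a buffer through the interface.  It assumes a kernel with this
series (or a later AF_ALG implementation) and the "hash" type available.
The helper name alg_hash is made up for illustration; the socket calls and
struct sockaddr_alg are the ones defined in the patches.  The data is sent
with MSG_MORE and the digest is then read with a plain recv, with no
closing sendmsg.

```c
#include <linux/if_alg.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef AF_ALG
#define AF_ALG 38	/* value assigned in patch 1/4 of this series */
#endif

/*
 * Hypothetical helper: hash `len` bytes of `data` with algorithm `alg`
 * (for example "sha1") through an AF_ALG socket.
 * Returns the digest length on success, or -1 on any failure, including
 * a kernel that has no AF_ALG support.
 */
ssize_t alg_hash(const char *alg, const void *data, size_t len,
		 unsigned char *digest, size_t digest_size)
{
	struct sockaddr_alg sa;
	ssize_t n = -1;
	int tfm, op;

	memset(&sa, 0, sizeof(sa));
	sa.salg_family = AF_ALG;
	strncpy((char *)sa.salg_type, "hash", sizeof(sa.salg_type) - 1);
	strncpy((char *)sa.salg_name, alg, sizeof(sa.salg_name) - 1);

	/* The parent socket selects the algorithm via bind(). */
	tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
	if (tfm < 0)
		return -1;

	if (bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		goto out_tfm;

	/* accept() yields the connection used for input and output. */
	op = accept(tfm, NULL, NULL);
	if (op < 0)
		goto out_tfm;

	/* Feed the data with MSG_MORE, then let recv() finalise the hash. */
	if (send(op, data, len, MSG_MORE) == (ssize_t)len)
		n = recv(op, digest, digest_size, 0);

	close(op);
out_tfm:
	close(tfm);
	return n;
}
```

With a 20-byte buffer, alg_hash("sha1", "abc", 3, digest, 20) should
return 20 and fill digest with the SHA-1 of "abc" where the kernel
supports it; otherwise the helper returns -1.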

Thanks to Miloslav Trmac for reviewing this and contributing
fixes and improvements.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Michael S. Tsirkin @ 2010-11-04 16:45 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: davem, netdev, Rusty Russell, yvugenfi
In-Reply-To: <OF9196A33D.75E1F25E-ON652577D1.0057D889-652577D1.00592104@in.ibm.com>

On Thu, Nov 04, 2010 at 09:47:04PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 11/04/2010 05:54:24 PM:
> 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > I thought about this some more.  I think the original
> > code is actually correct in returning ENOSPC: indirect
> > buffers are nice, but it's a mistake
> > to rely on them as a memory allocation might fail.
> >
> > And if you look at virtio-net, it is dropping packets
> > under memory pressure which is not really a happy outcome:
> > the packet will get freed, reallocated and we get another one,
> > adding pressure on the allocator instead of releasing it
> > until we free up some buffers.
> >
> > So I now think we should calculate the capacity
> > assuming non-indirect entries, and if we manage to
> > use indirect, all the better.
> >
> > So below is what I propose now - as a replacement for
> > my original patch.  Krishna Kumar, Rusty, what do you think?
> >
> > Separately I'm also considering moving the
> >    if (vq->num_free < out + in)
> > check earlier in the function to keep all users honest,
> > but need to check what the implications are for e.g. block.
> > Thoughts on this?
> 
> This looks like the right thing to do.  Besides this, I
> think virtio-net still needs to remove check for ENOMEM?

Yes, the only valid reason for failure would be an unexpected error.
No need to special-case ENOMEM anymore.

> I will test this patch tomorrow.
> 
> Another question about add_recvbuf_small and
> add_recvbuf_big - both call virtqueue_add_buf_gfp with
> in+out > 1, and that can fail with -ENOSPC.  So try_fill_recv
> gets -ENOSPC.  When that happens, oom is not set to true,
> I thought it should have got set.  Is this a bug?
> 
> Thanks,
> 
> - KK

I don't see a bug: on ENOSPC we don't need to (and can't) add any more
buffers, we know we will make progress since there must be some buffers
in the ring already, ENOMEM makes us try again later with more buffers
(and possibly more aggressive GFP flag).  What's wrong?
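
The distinction can be sketched as a small user-space model.  This is not
the virtio_net code: struct fake_vq, fake_add_recvbuf and
fake_try_fill_recv are hypothetical stand-ins, and the only point is that
the refill loop stops on either error but requests a later retry only
for -ENOMEM.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a receive virtqueue. */
struct fake_vq {
	int num_free;		/* free ring slots */
	bool alloc_fails;	/* simulate memory pressure */
};

/*
 * Hypothetical add-buffer step: the buffer is allocated first (which may
 * fail with -ENOMEM), then queued (which fails with -ENOSPC when the
 * ring has no free slot).
 */
int fake_add_recvbuf(struct fake_vq *vq)
{
	if (vq->alloc_fails)
		return -ENOMEM;
	if (vq->num_free == 0)
		return -ENOSPC;
	vq->num_free--;
	return 0;
}

/*
 * Fill the ring until an error occurs.  Returns true only when filling
 * stopped for lack of memory, i.e. when a later retry is worthwhile.
 * A full ring (-ENOSPC) is the normal way for the loop to end.
 */
bool fake_try_fill_recv(struct fake_vq *vq)
{
	bool oom = false;
	int err;

	do {
		err = fake_add_recvbuf(vq);
		if (err == -ENOMEM)
			oom = true;
	} while (err == 0);

	return oom;
}
```

In this model a ring that fills up completely reports no OOM, while a
ring left partly empty by a failed allocation does.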

-- 
MST

^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Krishna Kumar2 @ 2010-11-04 16:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: davem, netdev, Rusty Russell, yvugenfi
In-Reply-To: <20101104122424.GA29830@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/04/2010 05:54:24 PM:

> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> I thought about this some more.  I think the original
> code is actually correct in returning ENOSPC: indirect
> buffers are nice, but it's a mistake
> to rely on them as a memory allocation might fail.
>
> And if you look at virtio-net, it is dropping packets
> under memory pressure which is not really a happy outcome:
> the packet will get freed, reallocated and we get another one,
> adding pressure on the allocator instead of releasing it
> until we free up some buffers.
>
> So I now think we should calculate the capacity
> assuming non-indirect entries, and if we manage to
> use indirect, all the better.
>
> So below is what I propose now - as a replacement for
> my original patch.  Krishna Kumar, Rusty, what do you think?
>
> Separately I'm also considering moving the
>    if (vq->num_free < out + in)
> check earlier in the function to keep all users honest,
> but need to check what the implications are for e.g. block.
> Thoughts on this?

This looks like the right thing to do.  Besides this, I
think virtio-net still needs to remove check for ENOMEM?
I will test this patch tomorrow.

Another question about add_recvbuf_small and
add_recvbuf_big - both call virtqueue_add_buf_gfp with
in+out > 1, and that can fail with -ENOSPC.  So try_fill_recv
gets -ENOSPC.  When that happens, oom is not set to true,
I thought it should have got set.  Is this a bug?

Thanks,

- KK


^ permalink raw reply

* Re: Linux 2.6.37-rc1 (net/sched: cls_cgroup)
From: Randy Dunlap @ 2010-11-04 15:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, Linus Torvalds, Jamal Hadi Salim, Thomas Graf,
	Linux Kernel Mailing List, netdev, Ben Blum
In-Reply-To: <20101103233105.GA26124@gondor.apana.org.au>

On 11/03/10 16:31, Herbert Xu wrote:
> On Wed, Nov 03, 2010 at 11:01:17PM +0100, Eric Dumazet wrote:
>>
>> commits 8e039d84b323c450 
>> (cgroups: net_cls as module)
>>
>> followed by commit f845172531f
>> (cls_cgroup: Store classid in struct sock)
> 
> Indeed, it looks like the tree I worked on didn't have the first
> patch applied for some reason.
> 
> Anyway, this patch should fix the problem.  Thanks Eric!
> 
> cls_cgroup: Fix crash on module unload
> 
> Somewhere along the lines net_cls_subsys_id became a macro when
> cls_cgroup is built as a module.  Not only did it make cls_cgroup
> completely useless, it also causes it to crash on module unload.
> 
> This patch fixes this by removing that macro.
> 
> Thanks to Eric Dumazet for diagnosing this problem.
> 
> Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Tested-by: Randy Dunlap <randy.dunlap@oracle.com>

Thanks.

> 
> diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
> index 37dff78..d49c40f 100644
> --- a/net/sched/cls_cgroup.c
> +++ b/net/sched/cls_cgroup.c
> @@ -34,8 +34,6 @@ struct cgroup_subsys net_cls_subsys = {
>  	.populate	= cgrp_populate,
>  #ifdef CONFIG_NET_CLS_CGROUP
>  	.subsys_id	= net_cls_subsys_id,
> -#else
> -#define net_cls_subsys_id net_cls_subsys.subsys_id
>  #endif
>  	.module		= THIS_MODULE,
>  };
> 
> Cheers,


-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: Routing over multiple interfaces
From: Eric Dumazet @ 2010-11-04 14:01 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netdev
In-Reply-To: <1288875102.4357.40.camel@lat1>

Le jeudi 04 novembre 2010 à 13:51 +0100, Patrick Schaaf a écrit :
> > iptables -t mangle -A PREROUTING -d $EXTERNAL -m statistic --mode nth --every 2 -j MARK --set-mark 6
> 
> If statistics match is missing, a pretty good alternative I recently
> "found" is using u32 to match for a bit from the IP ID. That is a
> stateless decision, and here it probably has the theoretical advantage
> of putting all fragments of a given packet onto the same link.
> 
> iptables -t mangle -A PREROUTING ... -m u32 --u32 0x2&0x1=0x0 -j MARK
> --set-mark 6

Sure, that's a good tip/idea, but note many UDP frames have IP.id = 0




^ permalink raw reply

* Re: [RFC][net-next-2.6 PATCH 2/4] net: 8021Q consolidate header_ops routines
From: John Fastabend @ 2010-11-04 13:43 UTC (permalink / raw)
  To: Jesse Gross; +Cc: netdev@vger.kernel.org
In-Reply-To: <AANLkTi=pObCfPHkPCUxD8yicZL3pyTwm9s_z4KKda62k@mail.gmail.com>

On 11/3/2010 5:47 PM, Jesse Gross wrote:
> On Thu, Oct 21, 2010 at 3:10 PM, John Fastabend
> <john.r.fastabend@intel.com> wrote:
>> The only thing the 8021Q header ops routines are required
>> for is the VLAN_FLAG_REORDER_HDR otherwise by the time
>> the VLAN tag has been added the packet is already on
>> its way down the stack. In this case using the Ethernet
>> ops works OK.
>>
>> At present the VLAN_FLAG_REORDER_HDR flag does not work
>> with vlan offloads. As I understand the flag the intent
>> is to allow taps on the vlan device and possibly the
>> QOS layer to see the vlan tag info.
>>
>> By inserting the tag in vlan_tci any taps or QOS policies
>> should be able to retrieve the vlan info. This allows
>> the flag to work the same in both the offload case and
>> non-offloaded case. And allows us to use the underlying
>> ethernet ops.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> 
> I noticed that you dropped this patch from your most recent series, so
> I went back to take a look at it.  I realized that it probably works
> inconsistently since header caching doesn't take into account
> skb->vlan_tci, so whether you see the tag depends on the state of the
> cache.
> 
> It would be really good to have this type of code consolidation, both
> for the sake of sanity and to eliminate the inconsistent behavior.  We
> could do that by either not using header caching or making it work
> with vlan offloading somehow.  However, I'm not sure that there's
> really much point in that.  VLAN_FLAG_REORDER_HDR doesn't work with
> cards that do vlan offloading, which is a pretty significant number of
> them.  It similarly works inconsistently on the rx side.  So it's
> broken most of the time and worse, the behavior changes depending on
> the NIC (and now the ethtool setting).  Can we just eliminate it?

Yes, this is why I have dropped it for now. Also, rebuild is broken as best I can tell, although I doubt anyone would notice: you would need to clear VLAN_FLAG_REORDER_HDR and be using one of the ARPHRD_{ROSE|AX25|NETROM}.

The problem with caching the vlan header is the skb priority to vlan priority map. So we could cache the vid, sa, da, and protocols, but I cannot see any way to cache the vlan priority. Also the cache would have to be flushed when the flag is toggled.

Thanks,
John.

^ permalink raw reply

* Re: Routing over multiple interfaces
From: Patrick Schaaf @ 2010-11-04 12:51 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4CD08C6D.1090107@arndnet.de>


> iptables -t mangle -A PREROUTING -d $EXTERNAL -m statistic --mode nth --every 2 -j MARK --set-mark 6

If the statistic match is missing, a pretty good alternative I recently
"found" is using u32 to match for a bit from the IP ID. That is a
stateless decision, and here it probably has the theoretical advantage
of putting all fragments of a given packet onto the same link.

iptables -t mangle -A PREROUTING ... -m u32 --u32 0x2&0x1=0x0 -j MARK
--set-mark 6
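
For readers unfamiliar with the u32 syntax, the test above can be spelled
out in C.  This is only an illustration of what the expression computes,
not iptables code, and the function names are made up: u32 loads the
big-endian 32-bit word at byte offset 2 of the IP header (total length in
the upper half, IP ID in the lower half), masks it with 0x1 and compares
the result with 0, so it checks the lowest bit of the IP ID.

```c
#include <stddef.h>
#include <stdint.h>

/* Load a big-endian 32-bit word at byte offset `off`, as u32 does. */
uint32_t u32_load(const uint8_t *pkt, size_t off)
{
	return ((uint32_t)pkt[off] << 24) | ((uint32_t)pkt[off + 1] << 16) |
	       ((uint32_t)pkt[off + 2] << 8) | (uint32_t)pkt[off + 3];
}

/*
 * Equivalent of "--u32 0x2&0x1=0x0": returns 1 when the lowest bit of
 * the IP ID (header bytes 4-5) is clear, 0 otherwise.
 */
int ip_id_is_even(const uint8_t *iphdr)
{
	return (u32_load(iphdr, 2) & 0x1) == 0x0;
}
```

Since all fragments of a packet carry the same IP ID, they all give the
same answer.  A packet whose ID is 0 always matches.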

best regards
  Patrick


^ permalink raw reply

* Re: [PATCH 2/2] inet_diag: Make sure we actually run the same bytecode we audited.
From: Thomas Graf @ 2010-11-04 13:28 UTC (permalink / raw)
  To: Nelson Elhage; +Cc: netdev
In-Reply-To: <1288838141-17871-2-git-send-email-nelhage@ksplice.com>

On Wed, Nov 03, 2010 at 10:35:41PM -0400, Nelson Elhage wrote:
> We were using nlmsg_find_attr() to look up the bytecode by attribute when
> auditing, but then just using the first attribute when actually running
> bytecode. So, if we received a message with two attribute elements, where only
> the second had type INET_DIAG_REQ_BYTECODE, we would validate and run different
> bytecode strings.
> 
> Fix this by consistently using nlmsg_find_attr everywhere.
> 
> Signed-off-by: Nelson Elhage <nelhage@ksplice.com>

Both patches look good.

Signed-off-by: Thomas Graf <tgraf@infradead.org>
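
The mismatch described in the patch can be illustrated with a simplified
model.  The struct and function below are hypothetical and are not the
netlink API; they only show why "look up by type" and "take the first
attribute" can disagree when a message carries more than one attribute.

```c
#include <stddef.h>

/* Hypothetical, simplified attribute: a type tag plus a payload. */
struct attr {
	int type;
	const char *payload;
};

#define ATTR_BYTECODE 1	/* stand-in for INET_DIAG_REQ_BYTECODE */

/* Return the first attribute of the requested type, or NULL if absent. */
const struct attr *find_attr(const struct attr *attrs, size_t n, int type)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (attrs[i].type == type)
			return &attrs[i];
	return NULL;
}
```

For a message whose second attribute is the bytecode, find_attr returns
the second element, whereas code that blindly uses element 0 would act on
a payload that was never audited.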

^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Michael S. Tsirkin @ 2010-11-04 12:24 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Krishna Kumar2, davem, netdev, yvugenfi
In-Reply-To: <20101102161730.GA32311@redhat.com>

On Tue, Nov 02, 2010 at 06:17:30PM +0200, Michael S. Tsirkin wrote:
> On Fri, Oct 29, 2010 at 09:58:40PM +1030, Rusty Russell wrote:
> > On Fri, 29 Oct 2010 09:25:09 pm Krishna Kumar2 wrote:
> > > Rusty Russell <rusty@rustcorp.com.au> wrote on 10/29/2010 03:17:24 PM:
> > > 
> > > > > Oct 17 10:22:40 localhost kernel: net eth0: Unexpected TX queue
> > > failure: -28
> > > > > Oct 17 10:28:22 localhost kernel: net eth0: Unexpected TX queue
> > > failure: -28
> > > > > Oct 17 10:35:58 localhost kernel: net eth0: Unexpected TX queue
> > > failure: -28
> > > > > Oct 17 10:41:06 localhost kernel: net eth0: Unexpected TX queue
> > > failure: -28
> > > > >
> > > > > I initially changed the check from -ENOMEM to -ENOSPC, but
> > > > > virtqueue_add_buf can return only -ENOSPC when it doesn't have
> > > > > space for new request.  Patch removes redundant checks but
> > > > > displays the failure errno.
> > > > >
> > > > > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> > > > > ---
> > > > >  drivers/net/virtio_net.c |   15 ++++-----------
> > > > >  1 file changed, 4 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
> > > > > --- org/drivers/net/virtio_net.c   2010-10-11 10:20:02.000000000 +0530
> > > > > +++ new/drivers/net/virtio_net.c   2010-10-21 17:37:45.000000000 +0530
> > > > > @@ -570,17 +570,10 @@ static netdev_tx_t start_xmit(struct sk_
> > > > >
> > > > >     /* This can happen with OOM and indirect buffers. */
> > > > >     if (unlikely(capacity < 0)) {
> > > > > -      if (net_ratelimit()) {
> > > > > -         if (likely(capacity == -ENOMEM)) {
> > > > > -            dev_warn(&dev->dev,
> > > > > -                "TX queue failure: out of memory\n");
> > > > > -         } else {
> > > > > -            dev->stats.tx_fifo_errors++;
> > > > > -            dev_warn(&dev->dev,
> > > > > -                "Unexpected TX queue failure: %d\n",
> > > > > -                capacity);
> > > > > -         }
> > > > > -      }
> > > > > +      if (net_ratelimit())
> > > > > +         dev_warn(&dev->dev,
> > > > > +             "TX queue failure (%d): out of memory\n",
> > > > > +             capacity);
> > > >
> > > > Hold on... you were getting -ENOSPC, which shouldn't happen.  What makes
> > > you
> > > > think it's out of memory?
> > > 
> > > virtqueue_add_buf_gfp returns only -ENOSPC on failure, whether
> > > direct or indirect descriptors are used, so isn't -ENOSPC
> > > "expected"? (vring_add_indirect returns -ENOMEM on memory
> > > failure, but that is masked out and we go direct which is
> > > the failure point).
> > 
> > Ah, OK, gotchya.
> > I'm not even sure the fallback to linear makes sense; if we're failing
> > kmallocs we should probably just return -ENOMEM.  Would mean we can
> > tell the difference between "out of space" (which should never happen
> > since we stop the queue when we have < 2+MAX_SKB_FRAGS slots left)
> > and this case.
> > 
> > Michael, what do you think?
> > 
> > Thanks,
> > Rusty.
> 
> Let's make sure I understand the issue: we use indirect buffers
> so we assume there's still a lot of place in the ring, then
> allocation for the indirect fails and so we return -ENOSPC?
> 
> So first, I agree it's a bug.  But I am not sure killing the fallback
> is such a good idea: recovering from add buf failure is hard
> generally, we should try to accommodate if we can. Let's just fix
> the return code for now?
> 
> And generally, we should be smarter: as long as the ring is almost
> empty, and s/g list is short, it is a waste to use indirect buffers.
> BTW we have had a FIXME there for a long while, I think Yan suggested
> increasing that threshold to 3. Yan?
> 
> Further, maybe preallocating some memory for the indirect buffers might
> be a good idea.
> 
> In short, lots of good ideas, let's start with the minimal patch that is
> a good 2.6.37 candidate too. How about the following (untested)?
> 
> virtio: fix add_buf return code for OOM
> 
> add_buf returned ENOSPC on out of memory: this is a bug
> as at least virtio-net expects ENOMEM and handles it
> specially. Fix that.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

I thought about this some more.  I think the original
code is actually correct in returning ENOSPC: indirect
buffers are nice, but it's a mistake
to rely on them as a memory allocation might fail.

And if you look at virtio-net, it is dropping packets
under memory pressure which is not really a happy outcome:
the packet will get freed, reallocated and we get another one,
adding pressure on the allocator instead of releasing it
until we free up some buffers.

So I now think we should calculate the capacity
assuming non-indirect entries, and if we manage to
use indirect, all the better.
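
For illustration, the difference between the two capacity calculations
can be modelled in plain userspace C. This is only a sketch under stated
assumptions: struct ring_model and the helper names are hypothetical and
are not the definitions in virtio_ring.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the ring state (not the kernel struct). */
struct ring_model {
	unsigned int num;	/* total descriptors in the ring */
	unsigned int num_free;	/* descriptors currently unused */
	bool indirect;		/* indirect descriptors are enabled */
};

/*
 * Old calculation: with indirect descriptors, report the whole ring as
 * available whenever at least one slot is free. This silently assumes
 * that the indirect table allocation never fails.
 */
static unsigned int capacity_optimistic(const struct ring_model *r)
{
	assert(r->num_free <= r->num);
	if (r->indirect)
		return r->num_free ? r->num : 0;
	return r->num_free;
}

/* Proposed calculation: report only what is guaranteed without any allocation. */
static unsigned int capacity_guaranteed(const struct ring_model *r)
{
	assert(r->num_free <= r->num);
	return r->num_free;
}

/* Would a buffer needing `slots` direct descriptors still fit if the allocation fails? */
static bool fits_without_indirect(const struct ring_model *r, unsigned int slots)
{
	return slots <= r->num_free;
}
```

With num = 256, num_free = 3 and indirect set, the optimistic value is
256 while only 3 direct descriptors are guaranteed, so a buffer needing
more than 3 slots depends entirely on the allocation succeeding.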

So below is what I propose now - as a replacement for
my original patch.  Krishna Kumar, Rusty, what do you think?

Separately I'm also considering moving the
	if (vq->num_free < out + in)
check earlier in the function to keep all users honest,
but need to check what the implications are for e.g. block.
Thoughts on this?

---->

virtio: return correct capacity to users

We can't rely on indirect buffers for capacity
calculations because they need a memory allocation
which might fail.

So return the number of buffers we can guarantee users.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1475ed6..cc2f73e 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -230,9 +230,6 @@ add_head:
 	pr_debug("Added buffer head %i to %p\n", head, vq);
 	END_USE(vq);
 
-	/* If we're indirect, we can fit many (assuming not OOM). */
-	if (vq->indirect)
-		return vq->num_free ? vq->vring.num : 0;
 	return vq->num_free;
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);

^ permalink raw reply related

* Re: Freeing alive fib_info caused by ebc0ffae5
From: Michael Ellerman @ 2010-11-04 11:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1288869699.2659.77.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 1301 bytes --]

On Thu, 2010-11-04 at 12:21 +0100, Eric Dumazet wrote:
> 
> Hmm, a review of the code spotted a bug in fib_result_assign()
> 
> Please try the following patch:
> 
> Thanks again!
> 
> [PATCH] fib: fib_result_assign() should not change fib refcounts
> 
> After commit ebc0ffae5 (RCU conversion of fib_lookup()),
> fib_result_assign()  should not change fib refcounts anymore.
> 
> Thanks to Michael who did the bisection and bug report.
> 
> Reported-by: Michael Ellerman <michael@ellerman.id.au>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/ipv4/fib_lookup.h |    5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
> index a29edf2..c079cc0 100644
> --- a/net/ipv4/fib_lookup.h
> +++ b/net/ipv4/fib_lookup.h
> @@ -47,11 +47,8 @@ extern int fib_detect_death(struct fib_info *fi, int order,
>  static inline void fib_result_assign(struct fib_result *res,
>  				     struct fib_info *fi)
>  {
> -	if (res->fi != NULL)
> -		fib_info_put(res->fi);
> +	/* we used to play games with refcounts, but we now use RCU */
>  	res->fi = fi;
> -	if (fi != NULL)
> -		atomic_inc(&fi->fib_clntref);
>  }
>  
>  #endif /* _FIB_LOOKUP_H */

Perfect, that fixes it, thanks!

cheers



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* vhost-net-next updates
From: Michael S. Tsirkin @ 2010-11-04 11:26 UTC (permalink / raw)
  To: Shirley Ma; +Cc: krkumar2, netdev, kvm, linux-kernel

I pushed out some optimization patches on vhost-net-next
branch on my vhost tree (intended for 2.6.38).
It would be helpful if people working on vhost-net optimizations
base their work on that tree just to make sure comparisons
are apples to apples.

I might rebase this as I didn't send a pull request to Dave yet
but I'll try not to.  So far I have:

8b7347a vhost: get/put_user -> __get/__put_user
dfe5ac5 vhost: copy_to_user -> __copy_to_user
64e1c80 vhost-net: batch use/unuse mm
533a19b vhost: put mm after thread stop
3fcedec drivers/vhost/vhost.c: delete double assignment

Thanks!

-- 
MST

^ permalink raw reply

* Re: Freeing alive fib_info caused by ebc0ffae5
From: Michael Ellerman @ 2010-11-04 11:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1288869699.2659.77.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 1295 bytes --]

On Thu, 2010-11-04 at 12:21 +0100, Eric Dumazet wrote:
> On Thursday 04 November 2010 at 11:30 +0100, Eric Dumazet wrote:
> > On Thursday 04 November 2010 at 21:23 +1100, Michael Ellerman wrote:
> > > Hi all,
> > > 
> > > I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
> > > "Freeing alive fib_info" messages, from free_fib_info().
> > > 
> > > Actually I only get one per boot, when network interfaces come up.
> > > Seemingly related I am getting refcount problems when I shutdown, ie.
> > > unregister_netdevice() sees a usage count of 1, which never decrements.
> > > 
> > > Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.
> > > 
> > >     fib: RCU conversion of fib_lookup()
> > >     
> > >     fib_lookup() converted to be called in RCU protected context, no
> > >     reference taken and released on a contended cache line (fib_clntref)
> > >     
> > > 
> > > Is this a bug in that commit, or a driver bug exposed?
> > 
> > Hi Michael, thanks for the report (and painful bisection I guess)
> > 
> > That's hard to say... Is it reproducible on my machine?
> > 
> 
> Hmm, a review of the code spotted a bug in fib_result_assign()

Aha, I was just adding some debug in there. Let me test the patch.

cheers


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: Freeing alive fib_info caused by ebc0ffae5
From: Eric Dumazet @ 2010-11-04 11:21 UTC (permalink / raw)
  To: michael; +Cc: netdev
In-Reply-To: <1288866626.2659.71.camel@edumazet-laptop>

On Thursday 04 November 2010 at 11:30 +0100, Eric Dumazet wrote:
> On Thursday 04 November 2010 at 21:23 +1100, Michael Ellerman wrote:
> > Hi all,
> > 
> > I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
> > "Freeing alive fib_info" messages, from free_fib_info().
> > 
> > Actually I only get one per boot, when network interfaces come up.
> > Seemingly related I am getting refcount problems when I shutdown, ie.
> > unregister_netdevice() sees a usage count of 1, which never decrements.
> > 
> > Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.
> > 
> >     fib: RCU conversion of fib_lookup()
> >     
> >     fib_lookup() converted to be called in RCU protected context, no
> >     reference taken and released on a contended cache line (fib_clntref)
> >     
> > 
> > Is this a bug in that commit, or a driver bug exposed?
> 
> Hi Michael, thanks for the report (and painful bisection I guess)
> 
> That's hard to say... Is it reproducible on my machine?
> 

Hmm, a review of the code spotted a bug in fib_result_assign()

Please try the following patch:

Thanks again!

[PATCH] fib: fib_result_assign() should not change fib refcounts

After commit ebc0ffae5 (RCU conversion of fib_lookup()),
fib_result_assign()  should not change fib refcounts anymore.

Thanks to Michael who did the bisection and bug report.

Reported-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/fib_lookup.h |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index a29edf2..c079cc0 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -47,11 +47,8 @@ extern int fib_detect_death(struct fib_info *fi, int order,
 static inline void fib_result_assign(struct fib_result *res,
 				     struct fib_info *fi)
 {
-	if (res->fi != NULL)
-		fib_info_put(res->fi);
+	/* we used to play games with refcounts, but we now use RCU */
 	res->fi = fi;
-	if (fi != NULL)
-		atomic_inc(&fi->fib_clntref);
 }
 
 #endif /* _FIB_LOOKUP_H */
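
As a rough illustration of why the old helper unbalanced the counts once
lookups stopped taking a reference, here is a userspace model. It is a
sketch only: the *_model types and functions are hypothetical
simplifications, not the kernel code, and they assume each entry starts
with a single reference held by the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a fib_info: one counter and the "dead" flag. */
struct fib_info_model {
	int clntref;		/* models fib_clntref */
	bool dead;		/* models fib_dead */
	bool freed_alive;	/* set where the kernel would warn */
};

struct fib_result_model {
	struct fib_info_model *fi;
};

/* Drop a reference; reaching zero on a live entry is the reported symptom. */
static void model_put(struct fib_info_model *fi)
{
	assert(fi->clntref > 0);
	if (--fi->clntref == 0 && !fi->dead)
		fi->freed_alive = true;
}

/* RCU-style lookup: hands out the pointer and takes no reference. */
static void model_lookup(struct fib_result_model *res, struct fib_info_model *fi)
{
	res->fi = fi;
}

/* Old helper: puts a reference the lookup never took, then takes one nobody drops. */
static void assign_old(struct fib_result_model *res, struct fib_info_model *fi)
{
	if (res->fi != NULL)
		model_put(res->fi);
	res->fi = fi;
	if (fi != NULL)
		fi->clntref++;
}

/* Fixed helper: a plain pointer assignment, refcounts untouched. */
static void assign_new(struct fib_result_model *res, struct fib_info_model *fi)
{
	res->fi = fi;
}
```

In this model, a lookup followed by assign_old() drives the first
entry's count to zero while it is still alive and leaves the second
entry with an extra reference that is never released. That would be
consistent with both the warning and the stuck usage count in the
report.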



^ permalink raw reply related

* Re: Freeing alive fib_info caused by ebc0ffae5
From: Eric Dumazet @ 2010-11-04 10:46 UTC (permalink / raw)
  To: michael; +Cc: netdev
In-Reply-To: <1288866626.2659.71.camel@edumazet-laptop>

On Thursday 04 November 2010 at 11:30 +0100, Eric Dumazet wrote:
> On Thursday 04 November 2010 at 21:23 +1100, Michael Ellerman wrote:
> > Hi all,
> > 
> > I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
> > "Freeing alive fib_info" messages, from free_fib_info().
> > 
> > Actually I only get one per boot, when network interfaces come up.
> > Seemingly related I am getting refcount problems when I shutdown, ie.
> > unregister_netdevice() sees a usage count of 1, which never decrements.
> > 
> > Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.
> > 
> >     fib: RCU conversion of fib_lookup()
> >     
> >     fib_lookup() converted to be called in RCU protected context, no
> >     reference taken and released on a contended cache line (fib_clntref)
> >     
> > 
> > Is this a bug in that commit, or a driver bug exposed?
> 
> Hi Michael, thanks for the report (and painful bisection I guess)
> 
> That's hard to say... Is it reproducible on my machine?

You could possibly ask for a stack trace; this might help to spot the bug.

Thanks

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 3e0da3e..8039db0 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -159,6 +159,7 @@ void free_fib_info(struct fib_info *fi)
 {
 	if (fi->fib_dead == 0) {
 		pr_warning("Freeing alive fib_info %p\n", fi);
+		WARN_ON_ONCE(1);
 		return;
 	}
 	change_nexthops(fi) {
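
The warn-once idea in the hunk above can be sketched in userspace as
follows. This is an assumption-laden model, not the kernel macro:
MODEL_WARN_ON_ONCE, warnings_emitted and model_free_fib_info are
hypothetical names, and the model prints only a file and line, whereas
the kernel's WARN_ON_ONCE() also dumps a stack trace.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how many times the one-shot warning actually fired. */
static int warnings_emitted;

/*
 * Hypothetical userspace stand-in for a warn-once macro: each call site
 * gets its own static flag, so the location is reported a single time
 * no matter how often the condition is hit.
 */
#define MODEL_WARN_ON_ONCE(cond)					\
	do {								\
		static bool warned_;					\
		if ((cond) && !warned_) {				\
			warned_ = true;					\
			warnings_emitted++;				\
			fprintf(stderr, "WARNING at %s:%d\n",		\
				__FILE__, __LINE__);			\
		}							\
	} while (0)

/* Models the check in the hunk: refuse to free an entry that is not dead. */
static bool model_free_fib_info(bool dead)
{
	if (!dead) {
		fprintf(stderr, "Freeing alive fib_info\n");
		MODEL_WARN_ON_ONCE(1);
		return false;	/* nothing freed */
	}
	return true;		/* entry would be freed here */
}
```

Calling model_free_fib_info(false) repeatedly prints the short message
every time but the location warning only once, which keeps the log
readable when the same bad path is hit on every lookup.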




^ permalink raw reply related

* Re: Freeing alive fib_info caused by ebc0ffae5
From: Eric Dumazet @ 2010-11-04 10:30 UTC (permalink / raw)
  To: michael; +Cc: netdev
In-Reply-To: <1288866186.30549.10.camel@concordia>

On Thursday 04 November 2010 at 21:23 +1100, Michael Ellerman wrote:
> Hi all,
> 
> I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
> "Freeing alive fib_info" messages, from free_fib_info().
> 
> Actually I only get one per boot, when network interfaces come up.
> Seemingly related I am getting refcount problems when I shutdown, ie.
> unregister_netdevice() sees a usage count of 1, which never decrements.
> 
> Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.
> 
>     fib: RCU conversion of fib_lookup()
>     
>     fib_lookup() converted to be called in RCU protected context, no
>     reference taken and released on a contended cache line (fib_clntref)
>     
> 
> Is this a bug in that commit, or a driver bug exposed?

Hi Michael, thanks for the report (and painful bisection I guess)

That's hard to say... Is it reproducible on my machine?

Thanks



^ permalink raw reply

* Re: [RFC 0/3] MPEG2/TS drop analyzer iptables match extension
From: Jan Engelhardt @ 2010-11-04 10:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netfilter Developers, paulmck, Eric Dumazet, netdev, Solvik Blum
In-Reply-To: <Pine.LNX.4.64.1011040953590.19565@ask.diku.dk>


On Thursday 2010-11-04 10:20, Jesper Dangaard Brouer wrote:
> On Thu, 4 Nov 2010, Jan Engelhardt wrote:
>> On Tuesday 2010-10-19 16:21, Jesper Dangaard Brouer wrote:
>>>
>>> This is my iptables match module for analyzing IPTV MPEG2/TS streams.
>>> Currently it only detects dropped packets, but I want to extend it for
>>> analyzing jitter and bursts.
>>>
>>> Jan Engelhardt convinced me that I should just send the module as-is
>>> for review on the list.  I wrote the code in 2009, and have only done
>>> some minor changes to make it work on kernel 2.6.35 since.
>>
>> This now lives in the mp2t branch (since NFWS already actually) of xt-a,
>> and I have taken the liberty to start updating it to higher standards.
>> Please watch that branch, as I don't have any MPEG equipment around me
>> to do runtime tests.
>
> Jan, I would actually like to maintain the source via my own git tree. And I
> would gladly accept your patches against that tree.

I do not mind who is hosting what parts, as git repos can be
transferred easily, but I strongly suggest not to decouple xt_mp2t
from (any clone of) the xtables-addons structure base, because doing
so would bring you back to square one with regard to maintenance.

I recognize you may dislike splitting up the IPTV codebase, so I
propose that you make use of submodules, and have an Xt-a clone as
one submodule. That would allow merging in both directions.

^ permalink raw reply

* Freeing alive fib_info caused by ebc0ffae5
From: Michael Ellerman @ 2010-11-04 10:23 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet

[-- Attachment #1: Type: text/plain, Size: 694 bytes --]

Hi all,

I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
"Freeing alive fib_info" messages, from free_fib_info().

Actually I only get one per boot, when network interfaces come up.
Seemingly related I am getting refcount problems when I shutdown, ie.
unregister_netdevice() sees a usage count of 1, which never decrements.

Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.

    fib: RCU conversion of fib_lookup()
    
    fib_lookup() converted to be called in RCU protected context, no
    reference taken and released on a contended cache line (fib_clntref)
    

Is this a bug in that commit, or a driver bug exposed?

cheers

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

