Netdev List
 help / color / mirror / Atom feed
* Re: iproute2 tools for 2.6.31
From: Eric Dumazet @ 2009-09-11  7:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20090910120416.6fd03d1f@s6510>

Stephen Hemminger a écrit :
> I am putting together release for 2.6.31 based tools.
> The only open issue is how to deal with the error handling in commands
> that do monitoring filtering.  Right now leaning towards the two socket
> solution.
> 
> So if you have anything else that you have been waiting for,
> please drop me a note.
>

One thing that is IMHO strange is the output of sk information 
on 64 bits (x86_64 for example)

# ss -e dst 55.225.18.6
State      Recv-Q Send-Q                                Local Address:Port                                    Peer Address:Port
ESTAB      0      0                                      55.225.18.96:9273                                     55.225.18.6:37405    timer:(keepalive,20min,0) ino:57807651 sk:36e40c80ffff8100

True sk pointer is ffff8100ffff8100, not 36e40c80ffff8100



ss/misc.c

                printf(" sk:%08x", r->id.idiag_cookie[0]);
                if (r->id.idiag_cookie[1] != 0)
                        printf("%08x", r->id.idiag_cookie[1]);

while kernel does :
        r->id.idiag_cookie[0] = (u32)(unsigned long)sk;
        r->id.idiag_cookie[1] = (u32)(((unsigned long)sk >> 31) >> 1);



What do you think of following patch ?

[PATCH] ss: correct display of sk pointer

On 64bit arches, sk pointer was 32/32 reversed.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/misc/ss.c b/misc/ss.c
index 651fe3b..2447186 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -1393,9 +1393,10 @@ static int tcp_show_sock(struct nlmsghdr *nlh, struct filter *f)
 		if (r->idiag_uid)
 			printf(" uid:%u", (unsigned)r->idiag_uid);
 		printf(" ino:%u", r->idiag_inode);
-		printf(" sk:%08x", r->id.idiag_cookie[0]);
+		printf(" sk:");
 		if (r->id.idiag_cookie[1] != 0)
 			printf("%08x", r->id.idiag_cookie[1]);
+ 		printf("%08x", r->id.idiag_cookie[0]);
 	}
 	if (show_mem || show_tcpinfo) {
 		printf("\n\t");



^ permalink raw reply related

* Re: [PATCH RFC] tun: export underlying socket
From: Eric Dumazet @ 2009-09-11  6:10 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paul Moore, David Miller, netdev, herbert
In-Reply-To: <20090911053610.GA10324@redhat.com>

Michael S. Tsirkin a écrit :
> On Fri, Sep 11, 2009 at 07:59:43AM +0300, Michael S. Tsirkin wrote:
>> On Fri, Sep 11, 2009 at 12:17:27AM -0400, Paul Moore wrote:
>>> On Thursday 10 September 2009 08:59:29 am Michael S. Tsirkin wrote:
>>>> Tun device looks similar to a packet socket
>>>> in that both pass complete frames from/to userspace.
>>>>
>>>> This patch fills in enough fields in the socket underlying tun driver
>>>> to support sendmsg/recvmsg operations, and exports access to this socket
>>>> to modules.
>>>>
>>>> This way, code using raw sockets to inject packets
>>>> into a physical device, can support injecting
>>>> packets into host network stack almost without modification.
>>>>
>>>> First user of this interface will be vhost virtualization
>>>> accelerator.
>>> No comments on the code at this point - I'm just trying to understand the 
>>> intended user right now which I'm assuming is the vhost-net bits you sent 
>>> previously? 
>> Yes - these now use raw socket,
> 
> More specifically, vhost would then be patched with:
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index aeffb3a..b54f9d6 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -331,15 +331,26 @@ err:
>  	return ERR_PTR(r);
>  }
>  
> +static struct socket *get_tun_socket(int fd)
> +{
> +	struct file *file = fget(fd);
> +	if (!file)
> +		return ERR_PTR(-EBADF);
> +	return tun_get_socket(file);

This would leak a reference on file, if it happens not being a tun file 

> +}
> +
>  static struct socket *get_socket(int fd)
>  {
>  	struct socket *sock;
>  	sock = get_raw_socket(fd);
>  	if (!IS_ERR(sock))
>  		return sock;
> +	sock = get_tun_socket(fd);
> +	if (!IS_ERR(sock))
> +		return sock;
>  	return ERR_PTR(-ENOTSOCK);
>  }
>  
>  static long vhost_net_set_socket(struct vhost_net *n, int fd)
>  {
>  	struct socket *sock, *oldsock = NULL;


^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Michael S. Tsirkin @ 2009-09-11  5:36 UTC (permalink / raw)
  To: Paul Moore; +Cc: David Miller, netdev, herbert
In-Reply-To: <20090911045943.GA1613@redhat.com>

On Fri, Sep 11, 2009 at 07:59:43AM +0300, Michael S. Tsirkin wrote:
> On Fri, Sep 11, 2009 at 12:17:27AM -0400, Paul Moore wrote:
> > On Thursday 10 September 2009 08:59:29 am Michael S. Tsirkin wrote:
> > > Tun device looks similar to a packet socket
> > > in that both pass complete frames from/to userspace.
> > > 
> > > This patch fills in enough fields in the socket underlying tun driver
> > > to support sendmsg/recvmsg operations, and exports access to this socket
> > > to modules.
> > > 
> > > This way, code using raw sockets to inject packets
> > > into a physical device, can support injecting
> > > packets into host network stack almost without modification.
> > > 
> > > First user of this interface will be vhost virtualization
> > > accelerator.
> > 
> > No comments on the code at this point - I'm just trying to understand the 
> > intended user right now which I'm assuming is the vhost-net bits you sent 
> > previously? 
> 
> Yes - these now use raw socket,

More specifically, vhost would then be patched with:

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aeffb3a..b54f9d6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -331,15 +331,26 @@ err:
 	return ERR_PTR(r);
 }
 
+static struct socket *get_tun_socket(int fd)
+{
+	struct file *file = fget(fd);
+	if (!file)
+		return ERR_PTR(-EBADF);
+	return tun_get_socket(file);
+}
+
 static struct socket *get_socket(int fd)
 {
 	struct socket *sock;
 	sock = get_raw_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	sock = get_tun_socket(fd);
+	if (!IS_ERR(sock))
+		return sock;
 	return ERR_PTR(-ENOTSOCK);
 }
 
 static long vhost_net_set_socket(struct vhost_net *n, int fd)
 {
 	struct socket *sock, *oldsock = NULL;

^ permalink raw reply related

* [PATCH 8/8] fanotify: send events to userspace over socket reads
From: Eric Paris @ 2009-09-11  5:26 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, netdev; +Cc: davem, viro, alan, hch
In-Reply-To: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com>

fanotify sends event notification to userspace when userspace reads from the
fanotify socket. This patch implements the operations that happen at read
time.  These include opening the file descriptor to the original object and
then filling the userspace buffer.  The fd should be pollable to indicate when
it has data present and it should return how much data it has to send when the
FIONREAD ioctl is checked.

Signed-off-by: Eric Paris <eparis@redhat.com>
---

 fs/notify/fanotify/af_fanotify.c |  228 ++++++++++++++++++++++++++++++++++++++
 fs/notify/fanotify/fanotify.h    |    5 +
 include/linux/fanotify.h         |   25 ++++
 3 files changed, 256 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/af_fanotify.c b/fs/notify/fanotify/af_fanotify.c
index ac6aee1..2ae871b 100644
--- a/fs/notify/fanotify/af_fanotify.c
+++ b/fs/notify/fanotify/af_fanotify.c
@@ -2,6 +2,7 @@
 #include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fsnotify_backend.h>
+#include <linux/ima.h> /* ima_path_check */
 #include <linux/init.h>
 #include <linux/kernel.h> /* UINT_MAX */
 #include <linux/mount.h> /* mntget() */
@@ -16,6 +17,8 @@
 #include "fanotify.h"
 #include "af_fanotify.h"
 
+#include <asm/ioctls.h>
+
 static const struct proto_ops fanotify_proto_ops;
 static struct kmem_cache *fanotify_mark_cache __read_mostly;
 
@@ -114,6 +117,36 @@ static int fan_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	return 0;
 }
 
+static int fan_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
+{
+	struct fanotify_sock *fan_sock;
+	struct fsnotify_group *group;
+	struct fsnotify_event_holder *holder;
+	void __user *p;
+	int ret = -ENOTTY;
+	size_t send_len = 0;
+
+	if (sock->state != SS_CONNECTED)
+		return -EBADF;
+
+	fan_sock = fan_sk(sock->sk);
+	group = fan_sock->group;
+
+	p = (void __user *) arg;
+
+	switch (cmd) {
+	case FIONREAD:
+		mutex_lock(&group->notification_mutex);
+		list_for_each_entry(holder, &group->notification_list, event_list)
+			send_len += FAN_EVENT_METADATA_LEN;
+		mutex_unlock(&group->notification_mutex);
+		ret = put_user(send_len, (int __user *) p);
+		break;
+	}
+
+	return ret;
+}
+
 static void fanotify_free_mark(struct fsnotify_mark_entry *entry)
 {
 	kmem_cache_free(fanotify_mark_cache, entry);
@@ -278,6 +311,197 @@ static int fan_setsockopt(struct socket *sock, int level, int optname,
 	return ret;
 }
 
+/*
+ * Get an fsnotify notification event if one exists and is small
+ * enough to fit in "count". Return an error pointer if the count
+ * is not large enough.
+ *
+ * Called with the group->notification_mutex held.
+ */
+static struct fsnotify_event *get_one_event(struct fsnotify_group *group,
+					    size_t count)
+{
+	BUG_ON(!mutex_is_locked(&group->notification_mutex));
+
+	if (fsnotify_notify_queue_is_empty(group))
+		return NULL;
+
+	if (FAN_EVENT_METADATA_LEN > count)
+		return ERR_PTR(-EINVAL);
+
+	/* held the notification_mutex the whole time, so this is the
+	 * same event we peeked above */
+	return fsnotify_remove_notify_event(group);
+}
+
+static int create_and_fill_fd(struct fsnotify_group *group,
+			      struct fanotify_event_metadata *metadata,
+			      struct fsnotify_event *event)
+{
+	int client_fd, err;
+	struct dentry *dentry;
+	struct vfsmount *mnt;
+	struct file *new_file;
+
+	client_fd = get_unused_fd();
+	if (client_fd < 0)
+		return client_fd;
+
+	if (event->data_type != FSNOTIFY_EVENT_PATH) {
+		WARN_ON(1);
+		put_unused_fd(client_fd);
+		return -EINVAL;
+	}
+
+	/*
+	 * we need a new file handle for the userspace program so it can read even if it was
+	 * originally opened O_WRONLY.
+	 */
+	dentry = dget(event->path.dentry);
+	mnt = mntget(event->path.mnt);
+	/* it's possible this event was an overflow event.  in that case dentry and mnt
+	 * are NULL;  That's fine, just don't call dentry open */
+	if (dentry && mnt) {
+		err = ima_path_check(&event->path, MAY_READ, IMA_COUNT_UPDATE);
+		if (err)
+			new_file = ERR_PTR(err);
+		else
+			new_file = dentry_open(dentry, mnt, O_RDONLY | O_LARGEFILE,
+					       current_cred());
+	} else
+		new_file = ERR_PTR(-EOVERFLOW);
+	if (IS_ERR(new_file)) {
+		/*
+		 * we still send an event even if we can't open the file.  this
+		 * can happen when say tasks are gone and we try to open their
+		 * /proc entries or we try to open a WRONLY file like in sysfs
+		 * we just send the errno to userspace since there isn't much
+		 * else we can do.
+		 */
+		put_unused_fd(client_fd);
+		client_fd = PTR_ERR(new_file);
+	} else {
+		new_file->f_mode |= FMODE_NONOTIFY;
+		fd_install(client_fd, new_file);
+	}
+
+	metadata->fd = client_fd;
+
+	return 0;
+}
+
+static ssize_t fill_event_metadata(struct fsnotify_group *group,
+				   struct fanotify_event_metadata *metadata,
+				   struct fsnotify_event *event)
+{
+	pr_debug("%s: \n", __func__);
+
+	metadata->event_len = FAN_EVENT_METADATA_LEN;
+	metadata->vers = FANOTIFY_METADATA_VERSION;
+	metadata->mask = fanotify_outgoing_mask(event->mask);
+
+	return create_and_fill_fd(group, metadata, event);
+
+}
+
+static ssize_t copy_event_to_iov(struct fsnotify_group *group,
+				 struct fsnotify_event *event,
+				 struct iovec *iov)
+{
+	struct fanotify_event_metadata fanotify_event_metadata;
+	int ret;
+
+	pr_debug("%s: \n", __func__);
+
+	ret = fill_event_metadata(group, &fanotify_event_metadata, event);
+	if (ret)
+		return ret;
+
+	/* send the main event */
+	ret = memcpy_toiovec(iov, (unsigned char *)&fanotify_event_metadata,
+			     FAN_EVENT_METADATA_LEN);
+	if (ret < 0)
+		return ret;
+
+	return FAN_EVENT_METADATA_LEN;
+}
+
+static ssize_t fan_recv_events(struct fsnotify_group *group, struct msghdr *msg,
+				int count, int nonblock)
+{
+	struct fsnotify_event *event;
+	int ret, len_sent = 0;
+	DEFINE_WAIT(wait);
+
+	pr_debug("%s: \n", __func__);
+
+	while (1) {
+		prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
+
+		mutex_lock(&group->notification_mutex);
+		event = get_one_event(group, count);
+		mutex_unlock(&group->notification_mutex);
+
+		if (event) {
+			ret = PTR_ERR(event);
+			if (IS_ERR(event))
+				break;
+
+			ret = copy_event_to_iov(group, event, msg->msg_iov);
+			fsnotify_put_event(event);
+			if (ret < 0)
+				break;
+			len_sent += ret;
+			count -= ret;
+			continue;
+		}
+
+		ret = -EAGAIN;
+		if (nonblock)
+			break;
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+
+		if (len_sent)
+			break;
+
+		schedule();
+	}
+
+	finish_wait(&group->notification_waitq, &wait);
+	if (len_sent && ret != -EFAULT)
+		ret = len_sent;
+	return ret;
+}
+
+static int fan_recvmsg(struct kiocb *iocb, struct socket *sock,
+		       struct msghdr *msg, size_t size, int flags)
+{
+	struct fanotify_sock *fan_sock;
+	struct fsnotify_group *group;
+	int nonblock;
+
+	pr_debug("%s: \n", __func__);
+
+	if (sock->state != SS_CONNECTED)
+		return -EBADF;
+
+	if (size < FAN_EVENT_METADATA_LEN)
+		return -ENOMEM;
+
+	fan_sock = fan_sk(sock->sk);
+	group = fan_sock->group;
+
+	/* hey, nonblock no matter how they ask */
+	nonblock = !!(sock->file->f_flags & O_NONBLOCK);
+	nonblock |= !!(flags & MSG_DONTWAIT);
+
+	size = fan_recv_events(group, msg, size, nonblock);
+
+	return size;
+}
+
 static const struct net_proto_family fanotify_family_ops = {
 	.family		=	PF_FANOTIFY,
 	.create		=	fan_sock_create,
@@ -294,13 +518,13 @@ static const struct proto_ops fanotify_proto_ops = {
 	.accept =	sock_no_accept,
 	.getname =	sock_no_getname,
 	.poll =		sock_no_poll,
-	.ioctl =	sock_no_ioctl,
+	.ioctl =	fan_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	fan_setsockopt,
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	sock_no_sendmsg,
-	.recvmsg =	sock_no_recvmsg,
+	.recvmsg =	fan_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
 };
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 6c7bf06..4a5c785 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -12,3 +12,8 @@ static inline bool fanotify_is_mask_valid(__u32 mask)
 		return false;
 	return true;
 }
+
+static inline __u32 fanotify_outgoing_mask(__u32 mask)
+{
+	return mask & FAN_ALL_OUTGOING_EVENTS;
+}
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 6ecbcea..17f9550 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -35,6 +35,10 @@
  */
 #define FAN_ALL_INCOMING_EVENTS	(FAN_ALL_EVENTS |\
 				 FAN_EVENT_ON_CHILD)
+
+#define FAN_ALL_OUTGOING_EVENTS	(FAN_ALL_EVENTS |\
+				 FAN_Q_OVERFLOW)
+
 #ifndef SOL_FANOTIFY
 #define SOL_FANOTIFY	278
 #endif
@@ -63,6 +67,27 @@ struct fanotify_so_inode_mark {
 #define FANOTIFY_SET_MARK	1
 #define FANOTIFY_REMOVE_MARK	2
 
+#define FANOTIFY_METADATA_VERSION	1
+
+struct fanotify_event_metadata {
+	__u32 event_len;
+	__u32 vers;
+	__s32 fd;
+	__u32 mask;
+}  __attribute__((packed));
+
+
+/* Helper functions to deal with fanotify_event_metadata buffers */
+#define FAN_EVENT_METADATA_LEN	(sizeof(struct fanotify_event_metadata))
+
+#define FAN_EVENT_NEXT(meta, len) ((len) -= (meta)->event_len, \
+				   (struct fanotify_event_metadata*)(((char *)(meta)) + \
+				   (meta)->event_len))
+
+#define FAN_EVENT_OK(meta, len)	((long)(len) >= (long)FAN_EVENT_METADATA_LEN && \
+				 (long)(meta)->event_len >= (long)FAN_EVENT_METADATA_LEN && \
+				 (long)(meta)->event_len <= (long)(len))
+
 #ifdef __KERNEL__
 
 #endif /* __KERNEL__ */

^ permalink raw reply related

* [PATCH 7/8] fanotify: userspace can add and remove fsnotify inode marks
From: Eric Paris @ 2009-09-11  5:26 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, netdev; +Cc: davem, viro, alan, hch
In-Reply-To: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com>

Using setsockopt a user can add or remove fsnotify marks on inodes.  These
marks are used to determine which events for which inode are to be sent to
userspace.  They are very similar in nature to inotify_add_watch and
inotify_rm_watch.

Signed-off-by: Eric Paris <eparis@redhat.com>
---

 fs/notify/fanotify/af_fanotify.c |  169 ++++++++++++++++++++++++++++++++++++++
 include/linux/fanotify.h         |   10 ++
 2 files changed, 178 insertions(+), 1 deletions(-)

diff --git a/fs/notify/fanotify/af_fanotify.c b/fs/notify/fanotify/af_fanotify.c
index d7bf658..ac6aee1 100644
--- a/fs/notify/fanotify/af_fanotify.c
+++ b/fs/notify/fanotify/af_fanotify.c
@@ -17,6 +17,7 @@
 #include "af_fanotify.h"
 
 static const struct proto_ops fanotify_proto_ops;
+static struct kmem_cache *fanotify_mark_cache __read_mostly;
 
 static struct proto fanotify_proto = {
 	.name     = "FANOTIFY",
@@ -113,6 +114,170 @@ static int fan_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	return 0;
 }
 
+static void fanotify_free_mark(struct fsnotify_mark_entry *entry)
+{
+	kmem_cache_free(fanotify_mark_cache, entry);
+}
+
+static int fanotify_remove_inode_mark(struct fsnotify_group *group,
+				      struct fanotify_so_inode_mark *so_inode_mark)
+{
+	struct fsnotify_mark_entry *entry;
+	struct file *file;
+	struct inode *inode;
+	int fput_needed, ret = 0;
+
+	ret = -EBADF;
+	file = fget_light(so_inode_mark->fd, &fput_needed);
+	if (!file)
+		goto out;
+
+	inode = file->f_path.dentry->d_inode;
+
+	spin_lock(&inode->i_lock);
+	entry = fsnotify_find_mark_entry(group, inode);
+	spin_unlock(&inode->i_lock);
+
+	ret = -ENOENT;
+	if (!entry)
+		goto out_fput;
+
+	ret = 0;
+
+	fsnotify_destroy_mark_by_entry(entry);
+
+	/* matches the fsnotify_find_mark_entry() */
+	fsnotify_put_mark(entry);
+
+	fsnotify_recalc_group_mask(group);
+out_fput:
+	fput_light(file, fput_needed);
+out:
+	return ret;
+}
+
+static int fanotify_add_inode_mark(struct fsnotify_group *group,
+				   struct fanotify_so_inode_mark *so_inode_mark)
+{
+	struct fsnotify_mark_entry *entry;
+	struct file *file;
+	struct inode *inode;
+	__u32 old_mask, new_mask;
+	int fput_needed, ret;
+
+	ret = -EINVAL;
+	if (!fanotify_is_mask_valid(so_inode_mark->mask))
+		goto out;
+
+	ret = -EBADF;
+	file = fget_light(so_inode_mark->fd, &fput_needed);
+	if (!file)
+		goto out;
+
+	inode = file->f_path.dentry->d_inode;
+
+	spin_lock(&inode->i_lock);
+	entry = fsnotify_find_mark_entry(group, inode);
+	spin_unlock(&inode->i_lock);
+
+	if (!entry) {
+		struct fsnotify_mark_entry *new_entry;
+
+		ret = -ENOMEM;
+		new_entry = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
+		if (!new_entry)
+			goto out_fput;
+
+		fsnotify_init_mark(new_entry, fanotify_free_mark);
+		ret = fsnotify_add_mark(new_entry, group, inode, 0);
+		if (ret) {
+			fanotify_free_mark(new_entry);
+			goto out_fput;
+		}
+
+		entry = new_entry;
+	}
+
+	ret = 0;
+
+	spin_lock(&entry->lock);
+	old_mask = entry->mask;
+	entry->mask |= so_inode_mark->mask;
+	new_mask = entry->mask;
+	spin_unlock(&entry->lock);
+
+	/* we made changes to a mask, update the group mask and the inode mask
+	 * so things happen quickly. */
+	if (old_mask != new_mask) {
+		/* more bits in old than in new? */
+		int dropped = (old_mask & ~new_mask);
+		/* more bits in this entry than the inode's mask? */
+		int do_inode = (new_mask & ~inode->i_fsnotify_mask);
+		/* more bits in this entry than the group? */
+		int do_group = (new_mask & ~group->mask);
+
+		/* update the inode with this new entry */
+		if (dropped || do_inode)
+			fsnotify_recalc_inode_mask(inode);
+
+		/* update the group mask with the new mask */
+		if (dropped || do_group)
+			fsnotify_recalc_group_mask(group);
+	}
+
+	/* match the init or the find.... */
+	fsnotify_put_mark(entry);
+
+out_fput:
+	fput_light(file, fput_needed);
+out:
+	return ret;
+}
+
+static int fan_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int optlen)
+{
+	struct fanotify_sock *fan_sock;
+	struct fsnotify_group *group;
+	size_t copy_len;
+
+	union {
+		struct fanotify_so_inode_mark inode_mark;
+	} data;
+	int ret = 0;
+
+	if (sock->state != SS_CONNECTED)
+		return -EBADF;
+
+	if (level != SOL_FANOTIFY)
+		return -ENOPROTOOPT;
+
+	fan_sock = fan_sk(sock->sk);
+	group = fan_sock->group;
+
+	copy_len = min(optlen, (int)sizeof(data));
+	ret = copy_from_user(&data, optval, copy_len);
+	if (ret)
+		return ret;
+
+	switch (optname) {
+	case FANOTIFY_SET_MARK:
+	case FANOTIFY_REMOVE_MARK:
+		if (optlen < sizeof(struct fanotify_so_inode_mark))
+			return -ENOMEM;
+
+		if (optname == FANOTIFY_SET_MARK)
+			ret = fanotify_add_inode_mark(group, &data.inode_mark);
+		else if (optname == FANOTIFY_REMOVE_MARK)
+			ret = fanotify_remove_inode_mark(group, &data.inode_mark);
+		break;
+	default:
+		return -ENOPROTOOPT;
+	}
+
+	return ret;
+}
+
 static const struct net_proto_family fanotify_family_ops = {
 	.family		=	PF_FANOTIFY,
 	.create		=	fan_sock_create,
@@ -132,7 +297,7 @@ static const struct proto_ops fanotify_proto_ops = {
 	.ioctl =	sock_no_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
-	.setsockopt =	sock_no_setsockopt,
+	.setsockopt =	fan_setsockopt,
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	sock_no_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
@@ -142,6 +307,8 @@ static const struct proto_ops fanotify_proto_ops = {
 
 static int __init fanotify_init(void)
 {
+	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark_entry, SLAB_PANIC);
+
 	if (proto_register(&fanotify_proto, 0))
 		panic("unable to register fanotify protocol with network stack\n");
 
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 4c1c6cd..6ecbcea 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -53,6 +53,16 @@ struct fanotify_addr {
 	__u32 unused[16];
 }  __attribute__((packed));
 
+/* struct used for FANOTIFY_SET_MARK */
+struct fanotify_so_inode_mark {
+	__s32 fd;
+	__u32 mask;
+}  __attribute__((packed));
+
+/* fanotify setsockopt optvals */
+#define FANOTIFY_SET_MARK	1
+#define FANOTIFY_REMOVE_MARK	2
+
 #ifdef __KERNEL__
 
 #endif /* __KERNEL__ */


^ permalink raw reply related

* [PATCH 6/8] fanotify: userspace socket
From: Eric Paris @ 2009-09-11  5:26 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, netdev; +Cc: davem, viro, alan, hch
In-Reply-To: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com>

This patch implements an userspace interface for the fanotify notification
system.  An fanotify socket is created in userspace and is 'bound' to an
address.  That bind call actually creates the new fanotify listener much like
inotify_init() creates an inotify instance.

Requests for notification of events on certain fs objects is done using a
setsockopt() call.  (not implemented in this patch)  This setsockopt() call is
largely analogous to inotify_add_watch()

Events are retrieved from the kernel calling read on the bound socket.
This interface is designed to be forward looking as the kernel/userspace
interaction can be changed simply by implementing a new getsockopt option.

Macros are provided much like the netlink macros in order to allow of the
messages from the kernel to userspace to change in length in the future while
maintaining backwards compatibility.

This patch only implements the socket registration and the bind call.  The
getsockopt() calls and data read call are implemented in later patches.

Signed-off-by: Eric Paris <eparis@redhat.com>
---

 fs/notify/fanotify/Makefile      |    2 -
 fs/notify/fanotify/af_fanotify.c |  152 ++++++++++++++++++++++++++++++++++++++
 fs/notify/fanotify/af_fanotify.h |   21 +++++
 fs/notify/fanotify/fanotify.h    |    2 +
 include/linux/fanotify.h         |   19 +++++
 5 files changed, 195 insertions(+), 1 deletions(-)
 create mode 100644 fs/notify/fanotify/af_fanotify.c
 create mode 100644 fs/notify/fanotify/af_fanotify.h

diff --git a/fs/notify/fanotify/Makefile b/fs/notify/fanotify/Makefile
index e7d39c0..1196005 100644
--- a/fs/notify/fanotify/Makefile
+++ b/fs/notify/fanotify/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_FANOTIFY)		+= fanotify.o
+obj-$(CONFIG_FANOTIFY)		+= fanotify.o af_fanotify.o
diff --git a/fs/notify/fanotify/af_fanotify.c b/fs/notify/fanotify/af_fanotify.c
new file mode 100644
index 0000000..d7bf658
--- /dev/null
+++ b/fs/notify/fanotify/af_fanotify.c
@@ -0,0 +1,152 @@
+#include <linux/errno.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/fsnotify_backend.h>
+#include <linux/init.h>
+#include <linux/kernel.h> /* UINT_MAX */
+#include <linux/mount.h> /* mntget() */
+#include <linux/net.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/types.h>
+
+#include <net/net_namespace.h>
+#include <net/sock.h>
+
+#include "fanotify.h"
+#include "af_fanotify.h"
+
+static const struct proto_ops fanotify_proto_ops;
+
+static struct proto fanotify_proto = {
+	.name     = "FANOTIFY",
+	.owner    = THIS_MODULE,
+	.obj_size = sizeof(struct fanotify_sock),
+};
+
+static int fan_sock_create(struct net *net, struct socket *sock, int protocol)
+{
+	struct sock *sk;
+	struct fanotify_sock *fan_sock;
+
+	/* FIXME maybe a new LSM hook? */
+	if (!capable(CAP_NET_RAW))
+		return -EPERM;
+
+	if (protocol != 0)
+		return -ESOCKTNOSUPPORT;
+
+	if (sock->type != SOCK_RAW)
+		return -ESOCKTNOSUPPORT;
+
+	sock->state = SS_UNCONNECTED;
+
+	sk = sk_alloc(net, PF_FANOTIFY, GFP_KERNEL, &fanotify_proto);
+	if (sk == NULL)
+		return -ENOBUFS;
+
+	sock->ops = &fanotify_proto_ops;
+
+	sock_init_data(sock, sk);
+
+	sk->sk_family = PF_FANOTIFY;
+	sk_refcnt_debug_inc(sk);
+
+	fan_sock = fan_sk(sk);
+	fan_sock->group = NULL;
+
+	return 0;
+}
+
+static int fan_release(struct socket *sock)
+{
+	struct sock *sk;
+	struct fanotify_sock *fan_sock;
+
+	sk = sock->sk;
+	if (!sk)
+		return 0;
+
+	fan_sock = fan_sk(sk);
+
+	if (sock->state == SS_CONNECTED) {
+		sock->state = SS_UNCONNECTED;
+		fsnotify_put_group(fan_sock->group);
+	}
+
+	fan_sock->group = NULL;
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+
+	sk_refcnt_debug_release(sk);
+
+	sock_put(sk);
+
+	return 0;
+}
+
+static int fan_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+	struct fanotify_addr *fan_addr = (struct fanotify_addr *)addr;
+	struct fanotify_sock *fan_sock;
+
+	if (addr_len != sizeof(struct fanotify_addr))
+		return -EINVAL;
+
+	if (sock->state != SS_UNCONNECTED)
+		return -EINVAL;
+
+	if (!fanotify_is_mask_valid(fan_addr->mask))
+		return -EINVAL;
+
+	fan_sock = fan_sk(sock->sk);
+	fan_sock->group = fsnotify_obtain_group(fan_addr->mask, &fanotify_ops);
+
+	if (IS_ERR(fan_sock->group))
+		return PTR_ERR(fan_sock->group);
+
+	fan_sock->group->max_events = 16383;
+
+	sock->state = SS_CONNECTED;
+
+	return 0;
+}
+
+static const struct net_proto_family fanotify_family_ops = {
+	.family		=	PF_FANOTIFY,
+	.create		=	fan_sock_create,
+	.owner		=	THIS_MODULE,
+};
+
+static const struct proto_ops fanotify_proto_ops = {
+	.family =	PF_FANOTIFY,
+	.owner =	THIS_MODULE,
+	.release =	fan_release,
+	.bind =		fan_bind,
+	.connect =	sock_no_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	sock_no_getname,
+	.poll =		sock_no_poll,
+	.ioctl =	sock_no_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	sock_no_setsockopt,
+	.getsockopt =	sock_no_getsockopt,
+	.sendmsg =	sock_no_sendmsg,
+	.recvmsg =	sock_no_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+static int __init fanotify_init(void)
+{
+	if (proto_register(&fanotify_proto, 0))
+		panic("unable to register fanotify protocol with network stack\n");
+
+	sock_register(&fanotify_family_ops);
+
+	return 0;
+}
+device_initcall(fanotify_init);
diff --git a/fs/notify/fanotify/af_fanotify.h b/fs/notify/fanotify/af_fanotify.h
new file mode 100644
index 0000000..fff0e66
--- /dev/null
+++ b/fs/notify/fanotify/af_fanotify.h
@@ -0,0 +1,21 @@
+#ifndef _LINUX_AF_FANOTIFY_H
+#define _LINUX_AF_FANOTIFY_H
+
+#include <linux/fanotify.h>
+#include <net/sock.h>
+
+struct fanotify_sock {
+	struct sock		sock;
+	struct fsnotify_group	*group;
+};
+
+static inline struct fanotify_sock *fan_sk(struct sock *sock)
+{
+	struct fanotify_sock *fan_sock;
+
+	fan_sock = container_of(sock, struct fanotify_sock, sock);
+
+	return fan_sock;
+}
+
+#endif /* _LINUX_AF_NET_H */
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index a8785c1..6c7bf06 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -4,6 +4,8 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+extern const struct fsnotify_ops fanotify_ops;
+
 static inline bool fanotify_is_mask_valid(__u32 mask)
 {
 	if (mask & ~(FAN_ALL_INCOMING_EVENTS))
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index b560f86..4c1c6cd 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -1,6 +1,7 @@
 #ifndef _LINUX_FANOTIFY_H
 #define _LINUX_FANOTIFY_H
 
+#include <linux/socket.h>
 #include <linux/types.h>
 
 /* the following events that user-space can register for */
@@ -34,6 +35,24 @@
  */
 #define FAN_ALL_INCOMING_EVENTS	(FAN_ALL_EVENTS |\
 				 FAN_EVENT_ON_CHILD)
+#ifndef SOL_FANOTIFY
+#define SOL_FANOTIFY	278
+#endif
+
+#ifndef AF_FANOTIFY
+#define AF_FANOTIFY	37
+#define PF_FANOTIFY	AF_FANOTIFY
+#endif
+
+struct fanotify_addr {
+	sa_family_t family;
+	__u32 priority;	/* unused */
+	__u32 mask_hi;	/* unused */
+	__u32 mask;	/* unused */
+	__u32 f_flags;	/* unused */
+	__u32 unused[16];
+}  __attribute__((packed));
+
 #ifdef __KERNEL__
 
 #endif /* __KERNEL__ */

^ permalink raw reply related

* [PATCH 5/8] fanotify: merge notification events with different masks
From: Eric Paris @ 2009-09-11  5:26 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, netdev; +Cc: davem, viro, alan, hch
In-Reply-To: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com>

Instead of just merging fanotify events if they are exactly the same, merge
notification events with different masks.  To do this we have to clone the
old event, update the mask in the new event with the new merged mask, and
put the new event in place of the old event.

Signed-off-by: Eric Paris <eparis@redhat.com>
---

 fs/notify/fanotify/fanotify.c |   24 ++++++++++++++++++------
 1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index f054300..40ee137 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -10,8 +10,7 @@
 
 static bool try_merge(struct fsnotify_event *old, struct fsnotify_event *new)
 {
-	if ((old->mask == new->mask) &&
-	    (old->to_tell == new->to_tell) &&
+	if ((old->to_tell == new->to_tell) &&
 	    (old->data_type == new->data_type)) {
 		switch (old->data_type) {
 		case (FSNOTIFY_EVENT_PATH):
@@ -29,15 +28,28 @@ static bool try_merge(struct fsnotify_event *old, struct fsnotify_event *new)
 
 static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
 {
-	struct fsnotify_event_holder *holder;
+	struct fsnotify_event_holder *test_holder, *prev;
 	struct fsnotify_event *test_event;
+	struct fsnotify_event *new_event;
+	int ret;
 
 	/* and the list better be locked by something too! */
 
-	list_for_each_entry_reverse(holder, list, event_list) {
-		test_event = holder->event;
-		if (try_merge(test_event, event))
+	list_for_each_entry_safe_reverse(test_holder, prev, list, event_list) {
+		test_event = test_holder->event;
+		if (try_merge(test_event, event)) {
+			if (test_event->mask == event->mask)
+				return -EEXIST;
+			new_event = fsnotify_clone_event(test_event);
+			if (!new_event)
+				return 0;
+			new_event->mask = (test_event->mask | event->mask);
+			ret = fsnotify_replace_event(test_holder, new_event);
+			fsnotify_put_event(new_event); /* matches the ref from clone */
+			if (ret)
+				return ret;
 			return -EEXIST;
+		}
 	}
 
 	return 0;


^ permalink raw reply related

* [PATCH 4/8] fanotify:drop notification if they exist in the outgoing queue
From: Eric Paris @ 2009-09-11  5:26 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, netdev; +Cc: davem, viro, alan, hch
In-Reply-To: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com>

fanotify listeners get an open file descriptor to the object in question so
the ordering of operations is not as important as in other notification
systems.  inotify will drop events if the last event in the event FIFO is
the same as the current event.  This patch will drop fanotify events if
they are the same as another event anywhere in the event FIFO.

Signed-off-by: Eric Paris <eparis@redhat.com>
---

 fs/notify/fanotify/fanotify.c |   40 ++++++++++++++++++++++++++++++++++++++--
 1 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index b5c2e32..f054300 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -8,6 +8,40 @@
 
 #include "fanotify.h"
 
+static bool try_merge(struct fsnotify_event *old, struct fsnotify_event *new)
+{
+	if ((old->mask == new->mask) &&
+	    (old->to_tell == new->to_tell) &&
+	    (old->data_type == new->data_type)) {
+		switch (old->data_type) {
+		case (FSNOTIFY_EVENT_PATH):
+			if ((old->path.mnt == new->path.mnt) &&
+			    (old->path.dentry == new->path.dentry))
+				return true;
+		case (FSNOTIFY_EVENT_NONE):
+			return true;
+		default:
+			BUG();
+		};
+	}
+	return false;
+}
+
+static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
+{
+	struct fsnotify_event_holder *holder;
+	struct fsnotify_event *test_event;
+
+	/* and the list better be locked by something too! */
+
+	list_for_each_entry_reverse(holder, list, event_list) {
+		test_event = holder->event;
+		if (try_merge(test_event, event))
+			return -EEXIST;
+	}
+
+	return 0;
+}
 static int fanotify_handle_event(struct fsnotify_group *group, struct fsnotify_event *event)
 {
 	int ret;
@@ -20,8 +54,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, struct fsnotify_e
 	BUILD_BUG_ON(FAN_EVENT_ON_CHILD != FS_EVENT_ON_CHILD);
 	BUILD_BUG_ON(FAN_Q_OVERFLOW != FS_Q_OVERFLOW);
 
-	ret = fsnotify_add_notify_event(group, event, NULL, NULL);
-
+	ret = fsnotify_add_notify_event(group, event, NULL, fanotify_merge);
+	/* -EEXIST means this event was merged with another, not that it was an error */
+	if (ret == -EEXIST)
+		ret = 0;
 	return ret;
 }
 


^ permalink raw reply related

* [PATCH 3/8] fanotify: fscking all notification system
From: Eric Paris @ 2009-09-11  5:26 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, netdev; +Cc: davem, viro, alan, hch
In-Reply-To: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com>

fanotify is a novel file notification system which bases notification on
giving userspace both an event type (open, close, read, write) and an open
file descriptor to the object in question.  This should address a number of
races and problems with other notification systems like inotify and dnotify
and should allow the future implementation of blocking or access controlled
notification.  These are useful for on access scanners or hierachical storage
management schemes.

This patch just implements the basics of the fsnotify functions.

Signed-off-by: Eric Paris <eparis@redhat.com>
---

 fs/notify/Kconfig             |    1 
 fs/notify/Makefile            |    1 
 fs/notify/fanotify/Kconfig    |   11 +++++
 fs/notify/fanotify/Makefile   |    1 
 fs/notify/fanotify/fanotify.c |   86 +++++++++++++++++++++++++++++++++++++++++
 fs/notify/fanotify/fanotify.h |   12 ++++++
 include/linux/Kbuild          |    1 
 include/linux/fanotify.h      |   40 +++++++++++++++++++
 8 files changed, 153 insertions(+), 0 deletions(-)
 create mode 100644 fs/notify/fanotify/Kconfig
 create mode 100644 fs/notify/fanotify/Makefile
 create mode 100644 fs/notify/fanotify/fanotify.c
 create mode 100644 fs/notify/fanotify/fanotify.h
 create mode 100644 include/linux/fanotify.h

diff --git a/fs/notify/Kconfig b/fs/notify/Kconfig
index dffbb09..22c629e 100644
--- a/fs/notify/Kconfig
+++ b/fs/notify/Kconfig
@@ -3,3 +3,4 @@ config FSNOTIFY
 
 source "fs/notify/dnotify/Kconfig"
 source "fs/notify/inotify/Kconfig"
+source "fs/notify/fanotify/Kconfig"
diff --git a/fs/notify/Makefile b/fs/notify/Makefile
index 0922cc8..396a387 100644
--- a/fs/notify/Makefile
+++ b/fs/notify/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_FSNOTIFY)		+= fsnotify.o notification.o group.o inode_mark.o
 
 obj-y			+= dnotify/
 obj-y			+= inotify/
+obj-y			+= fanotify/
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
new file mode 100644
index 0000000..70631ed
--- /dev/null
+++ b/fs/notify/fanotify/Kconfig
@@ -0,0 +1,11 @@
+config FANOTIFY
+	bool "Filesystem wide access notification"
+	select FSNOTIFY
+	default y
+	---help---
+	   Say Y here to enable fanotify suport.  fanotify is a system wide
+	   file access notification interface.  Events are read from from a
+	   socket and in doing so an fd is created in the reading process
+	   which points to the same data as the one on which the event occured.
+
+	   If unsure, say Y.
diff --git a/fs/notify/fanotify/Makefile b/fs/notify/fanotify/Makefile
new file mode 100644
index 0000000..e7d39c0
--- /dev/null
+++ b/fs/notify/fanotify/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_FANOTIFY)		+= fanotify.o
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
new file mode 100644
index 0000000..b5c2e32
--- /dev/null
+++ b/fs/notify/fanotify/fanotify.c
@@ -0,0 +1,86 @@
+#include <linux/fdtable.h>
+#include <linux/fsnotify_backend.h>
+#include <linux/init.h>
+#include <linux/kernel.h> /* UINT_MAX */
+#include <linux/net.h> /* struct socket */
+#include <linux/sched.h> /* task_struct */
+#include <linux/types.h>
+
+#include "fanotify.h"
+
+static int fanotify_handle_event(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+	int ret;
+
+	BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
+	BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
+	BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
+	BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE);
+	BUILD_BUG_ON(FAN_OPEN != FS_OPEN);
+	BUILD_BUG_ON(FAN_EVENT_ON_CHILD != FS_EVENT_ON_CHILD);
+	BUILD_BUG_ON(FAN_Q_OVERFLOW != FS_Q_OVERFLOW);
+
+	ret = fsnotify_add_notify_event(group, event, NULL, NULL);
+
+	return ret;
+}
+
+static bool fanotify_should_send_event(struct fsnotify_group *group, struct inode *inode,
+				       __u32 mask, void *data, int data_type)
+{
+	struct fsnotify_mark_entry *entry;
+	bool send;
+
+	/* sorry, fanotify only gives a damn about files and dirs */
+	if (!S_ISREG(inode->i_mode) &&
+	    !S_ISDIR(inode->i_mode))
+		return false;
+
+	/* if we don't have enough info to send an event to userspace say no */
+	if ((data_type != FSNOTIFY_EVENT_FILE) &&
+	    (data_type != FSNOTIFY_EVENT_PATH))
+		return false;
+
+	/* if this file was opened by fanotify don't send events about it */
+	if (data_type == FSNOTIFY_EVENT_FILE) {
+		struct file *file;
+
+		file = (struct file *)data;
+		if (file->f_mode & FMODE_NONOTIFY)
+			return false;
+	}
+
+	spin_lock(&inode->i_lock);
+	entry = fsnotify_find_mark_entry(group, inode);
+	spin_unlock(&inode->i_lock);
+	if (!entry)
+		return false;
+
+	/* if the event is for a child and this inode doesn't care about
+	 * events on the child, don't send it! */
+	if ((mask & FS_EVENT_ON_CHILD) &&
+	    !(entry->mask & FS_EVENT_ON_CHILD))
+		send = false;
+	else {
+		if (!(entry->mask & FS_EVENT_ON_CHILD) &&
+		    (mask & FS_EVENT_ON_CHILD))
+			send = false;
+		else {
+			mask = (mask & ~FS_EVENT_ON_CHILD);
+			send = (entry->mask & mask);
+		}
+	}
+
+	/* find took a reference */
+	fsnotify_put_mark(entry);
+
+	return send;
+}
+
+const struct fsnotify_ops fanotify_ops = {
+	.handle_event = fanotify_handle_event,
+	.should_send_event = fanotify_should_send_event,
+	.free_group_priv = NULL,
+	.free_event_priv = NULL,
+	.freeing_mark = NULL,
+};
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
new file mode 100644
index 0000000..a8785c1
--- /dev/null
+++ b/fs/notify/fanotify/fanotify.h
@@ -0,0 +1,12 @@
+#include <linux/fanotify.h>
+#include <linux/fsnotify_backend.h>
+#include <linux/net.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+static inline bool fanotify_is_mask_valid(__u32 mask)
+{
+	if (mask & ~(FAN_ALL_INCOMING_EVENTS))
+		return false;
+	return true;
+}
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index e7d84ff..b298c0e 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -206,6 +206,7 @@ unifdef-y += ethtool.h
 unifdef-y += eventpoll.h
 unifdef-y += signalfd.h
 unifdef-y += ext2_fs.h
+unifdef-y += fanotify.h
 unifdef-y += fb.h
 unifdef-y += fcntl.h
 unifdef-y += filter.h
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
new file mode 100644
index 0000000..b560f86
--- /dev/null
+++ b/include/linux/fanotify.h
@@ -0,0 +1,40 @@
+#ifndef _LINUX_FANOTIFY_H
+#define _LINUX_FANOTIFY_H
+
+#include <linux/types.h>
+
+/* the following events that user-space can register for */
+#define FAN_ACCESS		0x00000001	/* File was accessed */
+#define FAN_MODIFY		0x00000002	/* File was modified */
+#define FAN_CLOSE_WRITE		0x00000008	/* Unwrittable file closed */
+#define FAN_CLOSE_NOWRITE	0x00000010	/* Writtable file closed */
+#define FAN_OPEN		0x00000020	/* File was opened */
+
+#define FAN_EVENT_ON_CHILD	0x08000000	/* interested in child events */
+
+/* FIXME currently Q's have no limit.... */
+#define FAN_Q_OVERFLOW		0x00004000	/* Event queued overflowed */
+
+/* helper events */
+#define FAN_CLOSE		(FAN_CLOSE_WRITE | FAN_CLOSE_NOWRITE) /* close */
+
+/*
+ * All of the events - we build the list by hand so that we can add flags in
+ * the future and not break backward compatibility.  Apps will get only the
+ * events that they originally wanted.  Be sure to add new events here!
+ */
+#define FAN_ALL_EVENTS (FAN_ACCESS |\
+			FAN_MODIFY |\
+			FAN_CLOSE |\
+			FAN_OPEN)
+
+/*
+ * All legal FAN bits userspace can request (although possibly not all
+ * at the same time.
+ */
+#define FAN_ALL_INCOMING_EVENTS	(FAN_ALL_EVENTS |\
+				 FAN_EVENT_ON_CHILD)
+#ifdef __KERNEL__
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_FANOTIFY_H */


^ permalink raw reply related

* [PATCH 2/8] vfs: introduce FMODE_NONOTIFY
From: Eric Paris @ 2009-09-11  5:26 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, netdev; +Cc: davem, viro, alan, hch
In-Reply-To: <20090911052558.32359.18075.stgit@paris.rdu.redhat.com>

This is a new f_mode which can only be set by the kernel.  It indicates
that the fd was opened by fanotify and should not cause future fanotify
events.  This is needed to prevent fanotify livelock.  An example of
obvious livelock is from fanotify close events.

Process A closes file1
This creates a close event for file1.
fanotify opens file1 for Listener X
Listener X deals with the event and closes its fd for file1.
This creates a close event for file1.
fanotify opens file1 for Listener X
Listener X deals with the event and closes its fd for file1.
This creates a close event for file1.
fanotify opens file1 for Listener X
Listener X deals with the event and closes its fd for file1.
notice a pattern?

The fix is to add the FMODE_NONOTIFY bit to the open filp done by the kernel
for fanotify.  Thus when that file is used it will not generate future
events.

This patch simply defines the bit.

Signed-off-by: Eric Paris <eparis@redhat.com>
---

 include/linux/fs.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02e111c..433db62 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -87,6 +87,9 @@ struct inodes_stat_t {
  */
 #define FMODE_NOCMTIME		((__force fmode_t)2048)
 
+/* File was opened by fanotify and shouldn't generate fanotify events */
+#define FMODE_NONOTIFY		((__force fmode_t)4096)
+
 /*
  * The below are the various read and write types that we support. Some of
  * them include behavioral modifiers that send information down to the

^ permalink raw reply related

* [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: Eric Paris @ 2009-09-11  5:25 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, netdev; +Cc: davem, viro, alan, hch

fanotify's user interface uses a custom socket (it doesn't use netlink
since work must be done in the context of the receive side of the socket)

This patch simply defines the fanotify socket number declarations.  The
actual implementation of the socket is in a later patch.

Signed-off-by: Eric Paris <eparis@redhat.com>
---

 include/linux/socket.h |    5 ++++-
 net/core/sock.c        |    6 +++---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 3b461df..e03f47b 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -195,7 +195,8 @@ struct ucred {
 #define AF_ISDN		34	/* mISDN sockets 		*/
 #define AF_PHONET	35	/* Phonet sockets		*/
 #define AF_IEEE802154	36	/* IEEE802154 sockets		*/
-#define AF_MAX		37	/* For now.. */
+#define AF_FANOTIFY	37	/* fscking all access sockets	*/
+#define AF_MAX		38	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -235,6 +236,7 @@ struct ucred {
 #define PF_ISDN		AF_ISDN
 #define PF_PHONET	AF_PHONET
 #define PF_IEEE802154	AF_IEEE802154
+#define PF_FANOTIFY	AF_FANOTIFY
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -306,6 +308,7 @@ struct ucred {
 #define SOL_PNPIPE	275
 #define SOL_RDS		276
 #define SOL_IUCV	277
+#define SOL_FANOTIFY	278
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/net/core/sock.c b/net/core/sock.c
index 30d5446..28a99bf 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -155,7 +155,7 @@ static const char *const af_family_key_strings[AF_MAX+1] = {
   "sk_lock-27"       , "sk_lock-28"          , "sk_lock-AF_CAN"      ,
   "sk_lock-AF_TIPC"  , "sk_lock-AF_BLUETOOTH", "sk_lock-IUCV"        ,
   "sk_lock-AF_RXRPC" , "sk_lock-AF_ISDN"     , "sk_lock-AF_PHONET"   ,
-  "sk_lock-AF_IEEE802154",
+  "sk_lock-AF_IEEE802154", "sk_lock-AF_FANOTIFY",
   "sk_lock-AF_MAX"
 };
 static const char *const af_family_slock_key_strings[AF_MAX+1] = {
@@ -171,7 +171,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
   "slock-27"       , "slock-28"          , "slock-AF_CAN"      ,
   "slock-AF_TIPC"  , "slock-AF_BLUETOOTH", "slock-AF_IUCV"     ,
   "slock-AF_RXRPC" , "slock-AF_ISDN"     , "slock-AF_PHONET"   ,
-  "slock-AF_IEEE802154",
+  "slock-AF_IEEE802154", "slock=AF_FANOTIFY",
   "slock-AF_MAX"
 };
 static const char *const af_family_clock_key_strings[AF_MAX+1] = {
@@ -187,7 +187,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
   "clock-27"       , "clock-28"          , "clock-AF_CAN"      ,
   "clock-AF_TIPC"  , "clock-AF_BLUETOOTH", "clock-AF_IUCV"     ,
   "clock-AF_RXRPC" , "clock-AF_ISDN"     , "clock-AF_PHONET"   ,
-  "clock-AF_IEEE802154",
+  "clock-AF_IEEE802154", "clock-AF_FANOTIFY",
   "clock-AF_MAX"
 };
 


^ permalink raw reply related

* Re: [PATCH RFC] tun: export underlying socket
From: Michael S. Tsirkin @ 2009-09-11  4:59 UTC (permalink / raw)
  To: Paul Moore; +Cc: David Miller, netdev, herbert
In-Reply-To: <200909110017.27668.paul.moore@hp.com>

On Fri, Sep 11, 2009 at 12:17:27AM -0400, Paul Moore wrote:
> On Thursday 10 September 2009 08:59:29 am Michael S. Tsirkin wrote:
> > Tun device looks similar to a packet socket
> > in that both pass complete frames from/to userspace.
> > 
> > This patch fills in enough fields in the socket underlying tun driver
> > to support sendmsg/recvmsg operations, and exports access to this socket
> > to modules.
> > 
> > This way, code using raw sockets to inject packets
> > into a physical device, can support injecting
> > packets into host network stack almost without modification.
> > 
> > First user of this interface will be vhost virtualization
> > accelerator.
> 
> No comments on the code at this point - I'm just trying to understand the 
> intended user right now which I'm assuming is the vhost-net bits you sent 
> previously? 

Yes - these now use raw socket, I'll add
something like 

	sock = tun_get_socket(file)
	if (!IS_ERR(sock)) {
		return sock;
	}


> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > 
> > This patch is on top of net-next master.
> > An alternative approach would be to add an ioctl to tun, to export the
> > underlying socket to userspace: a uniform way to work with a network
> > device and the host stack might be useful there, as well.
> > Kernel users could then do sockfd_lookup to get the socket.
> > I decided against it for now as it requires more code.
> > Please comment.
> > 
> >  drivers/net/tun.c      |   78
> >  +++++++++++++++++++++++++++++++++++++++++++---- include/linux/if_tun.h |  
> >  14 ++++++++
> >  2 files changed, 85 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 589a44a..76f5faa 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -151,6 +151,7 @@ static int tun_attach(struct tun_struct *tun, struct
> >  file *file) err = 0;
> >  	tfile->tun = tun;
> >  	tun->tfile = tfile;
> > +	tun->socket.file = file;
> >  	dev_hold(tun->dev);
> >  	sock_hold(tun->socket.sk);
> >  	atomic_inc(&tfile->count);
> > @@ -165,6 +166,7 @@ static void __tun_detach(struct tun_struct *tun)
> >  	/* Detach from net device */
> >  	netif_tx_lock_bh(tun->dev);
> >  	tun->tfile = NULL;
> > +	tun->socket.file = NULL;
> >  	netif_tx_unlock_bh(tun->dev);
> > 
> >  	/* Drop read queue */
> > @@ -750,7 +752,7 @@ static __inline__ ssize_t tun_put_user(struct
> >  tun_struct *tun, len = min_t(int, skb->len, len);
> > 
> >  	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
> > -	total += len;
> > +	total += skb->len;
> > 
> >  	tun->dev->stats.tx_packets++;
> >  	tun->dev->stats.tx_bytes += len;
> > @@ -758,12 +760,10 @@ static __inline__ ssize_t tun_put_user(struct
> >  tun_struct *tun, return total;
> >  }
> > 
> > -static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec
> >  *iv, -			    unsigned long count, loff_t pos)
> > +static ssize_t tun_do_read(struct tun_struct *tun,
> > +			   struct kiocb *iocb, const struct iovec *iv,
> > +			   unsigned long count, int noblock)
> >  {
> > -	struct file *file = iocb->ki_filp;
> > -	struct tun_file *tfile = file->private_data;
> > -	struct tun_struct *tun = __tun_get(tfile);
> >  	DECLARE_WAITQUEUE(wait, current);
> >  	struct sk_buff *skb;
> >  	ssize_t len, ret = 0;
> > @@ -785,7 +785,7 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb,
> >  const struct iovec *iv,
> > 
> >  		/* Read frames from the queue */
> >  		if (!(skb=skb_dequeue(&tun->socket.sk->sk_receive_queue))) {
> > -			if (file->f_flags & O_NONBLOCK) {
> > +			if (noblock) {
> >  				ret = -EAGAIN;
> >  				break;
> >  			}
> > @@ -813,6 +813,21 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb,
> >  const struct iovec *iv, remove_wait_queue(&tun->socket.wait, &wait);
> > 
> >  out:
> > +	return ret;
> > +}
> > +
> > +static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec
> >  *iv, +			    unsigned long count, loff_t pos)
> > +{
> > +	struct file *file = iocb->ki_filp;
> > +	struct tun_file *tfile = file->private_data;
> > +	struct tun_struct *tun = __tun_get(tfile);
> > +	ssize_t ret;
> > +
> > +	if (!tun)
> > +		return -EBADFD;
> > +	ret = tun_do_read(tun, iocb, iv, count, file->f_flags & O_NONBLOCK);
> > +	ret = min_t(ssize_t, ret, count);
> >  	tun_put(tun);
> >  	return ret;
> >  }
> > @@ -865,6 +880,37 @@ static void tun_sock_destruct(struct sock *sk)
> >  	free_netdev(container_of(sk, struct tun_sock, sk)->tun->dev);
> >  }
> > 
> > +static int tun_sendmsg(struct kiocb *iocb, struct socket *sock,
> > +		       struct msghdr *m, size_t total_len)
> > +{
> > +	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
> > +	return tun_get_user(tun, m->msg_iov, total_len,
> > +			    m->msg_flags & MSG_DONTWAIT);
> > +}
> > +
> > +static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
> > +		       struct msghdr *m, size_t total_len,
> > +		       int flags)
> > +{
> > +	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
> > +	int ret;
> > +	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
> > +		return -EINVAL;
> > +	ret = tun_do_read(tun, iocb, m->msg_iov, total_len,
> > +			  flags & MSG_DONTWAIT);
> > +	if (ret > total_len) {
> > +		m->msg_flags |= MSG_TRUNC;
> > +		ret = flags & MSG_TRUNC ? ret : total_len;
> > +	}
> > +	return ret;
> > +}
> > +
> > +/* Ops structure to mimic raw sockets with tun */
> > +static const struct proto_ops tun_socket_ops = {
> > +	.sendmsg = tun_sendmsg,
> > +	.recvmsg = tun_recvmsg,
> > +};
> > +
> >  static struct proto tun_proto = {
> >  	.name		= "tun",
> >  	.owner		= THIS_MODULE,
> > @@ -982,6 +1028,7 @@ static int tun_set_iff(struct net *net, struct file
> >  *file, struct ifreq *ifr) goto err_free_dev;
> > 
> >  		init_waitqueue_head(&tun->socket.wait);
> > +		tun->socket.ops = &tun_socket_ops;
> >  		sock_init_data(&tun->socket, sk);
> >  		sk->sk_write_space = tun_sock_write_space;
> >  		sk->sk_sndbuf = INT_MAX;
> > @@ -1483,6 +1530,23 @@ static void tun_cleanup(void)
> >  	rtnl_link_unregister(&tun_link_ops);
> >  }
> > 
> > +/* Get an underlying socket object from tun file.  Returns error unless
> >  file is + * attached to a device.  The returned object works like a packet
> >  socket, it + * can be used for sock_sendmsg/sock_recvmsg.  The caller is
> >  responsible for + * holding a reference to the file for as long as the
> >  socket is in use. */ +struct socket *tun_get_socket(struct file *file)
> > +{
> > +	struct tun_struct *tun;
> > +	if (file->f_op != &tun_fops)
> > +		return ERR_PTR(-EINVAL);
> > +	tun = tun_get(file);
> > +	if (!tun)
> > +		return ERR_PTR(-EBADFD);
> > +	tun_put(tun);
> > +	return &tun->socket;
> > +}
> > +EXPORT_SYMBOL_GPL(tun_get_socket);
> > +
> >  module_init(tun_init);
> >  module_exit(tun_cleanup);
> >  MODULE_DESCRIPTION(DRV_DESCRIPTION);
> > diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> > index 3f5fd52..404abe0 100644
> > --- a/include/linux/if_tun.h
> > +++ b/include/linux/if_tun.h
> > @@ -86,4 +86,18 @@ struct tun_filter {
> >  	__u8   addr[0][ETH_ALEN];
> >  };
> > 
> > +#ifdef __KERNEL__
> > +#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
> > +struct socket *tun_get_socket(struct file *);
> > +#else
> > +#include <linux/err.h>
> > +#include <linux/errno.h>
> > +struct file;
> > +struct socket;
> > +static inline struct socket *tun_get_socket(struct file *f)
> > +{
> > +	return ERR_PTR(-EINVAL);
> > +}
> > +#endif /* CONFIG_TUN */
> > +#endif /* __KERNEL__ */
> >  #endif /* __IF_TUN_H */
> > 
> 
> -- 
> paul moore
> linux @ hp

^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Paul Moore @ 2009-09-11  4:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, m.s.tsirkin, netdev, herbert
In-Reply-To: <20090910125929.GA32593@redhat.com>

On Thursday 10 September 2009 08:59:29 am Michael S. Tsirkin wrote:
> Tun device looks similar to a packet socket
> in that both pass complete frames from/to userspace.
> 
> This patch fills in enough fields in the socket underlying tun driver
> to support sendmsg/recvmsg operations, and exports access to this socket
> to modules.
> 
> This way, code using raw sockets to inject packets
> into a physical device, can support injecting
> packets into host network stack almost without modification.
> 
> First user of this interface will be vhost virtualization
> accelerator.

No comments on the code at this point - I'm just trying to understand the 
intended user right now which I'm assuming is the vhost-net bits you sent 
previously? 

> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> This patch is on top of net-next master.
> An alternative approach would be to add an ioctl to tun, to export the
> underlying socket to userspace: a uniform way to work with a network
> device and the host stack might be useful there, as well.
> Kernel users could then do sockfd_lookup to get the socket.
> I decided against it for now as it requires more code.
> Please comment.
> 
>  drivers/net/tun.c      |   78
>  +++++++++++++++++++++++++++++++++++++++++++---- include/linux/if_tun.h |  
>  14 ++++++++
>  2 files changed, 85 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 589a44a..76f5faa 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -151,6 +151,7 @@ static int tun_attach(struct tun_struct *tun, struct
>  file *file) err = 0;
>  	tfile->tun = tun;
>  	tun->tfile = tfile;
> +	tun->socket.file = file;
>  	dev_hold(tun->dev);
>  	sock_hold(tun->socket.sk);
>  	atomic_inc(&tfile->count);
> @@ -165,6 +166,7 @@ static void __tun_detach(struct tun_struct *tun)
>  	/* Detach from net device */
>  	netif_tx_lock_bh(tun->dev);
>  	tun->tfile = NULL;
> +	tun->socket.file = NULL;
>  	netif_tx_unlock_bh(tun->dev);
> 
>  	/* Drop read queue */
> @@ -750,7 +752,7 @@ static __inline__ ssize_t tun_put_user(struct
>  tun_struct *tun, len = min_t(int, skb->len, len);
> 
>  	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
> -	total += len;
> +	total += skb->len;
> 
>  	tun->dev->stats.tx_packets++;
>  	tun->dev->stats.tx_bytes += len;
> @@ -758,12 +760,10 @@ static __inline__ ssize_t tun_put_user(struct
>  tun_struct *tun, return total;
>  }
> 
> -static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec
>  *iv, -			    unsigned long count, loff_t pos)
> +static ssize_t tun_do_read(struct tun_struct *tun,
> +			   struct kiocb *iocb, const struct iovec *iv,
> +			   unsigned long count, int noblock)
>  {
> -	struct file *file = iocb->ki_filp;
> -	struct tun_file *tfile = file->private_data;
> -	struct tun_struct *tun = __tun_get(tfile);
>  	DECLARE_WAITQUEUE(wait, current);
>  	struct sk_buff *skb;
>  	ssize_t len, ret = 0;
> @@ -785,7 +785,7 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb,
>  const struct iovec *iv,
> 
>  		/* Read frames from the queue */
>  		if (!(skb=skb_dequeue(&tun->socket.sk->sk_receive_queue))) {
> -			if (file->f_flags & O_NONBLOCK) {
> +			if (noblock) {
>  				ret = -EAGAIN;
>  				break;
>  			}
> @@ -813,6 +813,21 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb,
>  const struct iovec *iv, remove_wait_queue(&tun->socket.wait, &wait);
> 
>  out:
> +	return ret;
> +}
> +
> +static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec
>  *iv, +			    unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct tun_file *tfile = file->private_data;
> +	struct tun_struct *tun = __tun_get(tfile);
> +	ssize_t ret;
> +
> +	if (!tun)
> +		return -EBADFD;
> +	ret = tun_do_read(tun, iocb, iv, count, file->f_flags & O_NONBLOCK);
> +	ret = min_t(ssize_t, ret, count);
>  	tun_put(tun);
>  	return ret;
>  }
> @@ -865,6 +880,37 @@ static void tun_sock_destruct(struct sock *sk)
>  	free_netdev(container_of(sk, struct tun_sock, sk)->tun->dev);
>  }
> 
> +static int tun_sendmsg(struct kiocb *iocb, struct socket *sock,
> +		       struct msghdr *m, size_t total_len)
> +{
> +	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
> +	return tun_get_user(tun, m->msg_iov, total_len,
> +			    m->msg_flags & MSG_DONTWAIT);
> +}
> +
> +static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
> +		       struct msghdr *m, size_t total_len,
> +		       int flags)
> +{
> +	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
> +	int ret;
> +	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
> +		return -EINVAL;
> +	ret = tun_do_read(tun, iocb, m->msg_iov, total_len,
> +			  flags & MSG_DONTWAIT);
> +	if (ret > total_len) {
> +		m->msg_flags |= MSG_TRUNC;
> +		ret = flags & MSG_TRUNC ? ret : total_len;
> +	}
> +	return ret;
> +}
> +
> +/* Ops structure to mimic raw sockets with tun */
> +static const struct proto_ops tun_socket_ops = {
> +	.sendmsg = tun_sendmsg,
> +	.recvmsg = tun_recvmsg,
> +};
> +
>  static struct proto tun_proto = {
>  	.name		= "tun",
>  	.owner		= THIS_MODULE,
> @@ -982,6 +1028,7 @@ static int tun_set_iff(struct net *net, struct file
>  *file, struct ifreq *ifr) goto err_free_dev;
> 
>  		init_waitqueue_head(&tun->socket.wait);
> +		tun->socket.ops = &tun_socket_ops;
>  		sock_init_data(&tun->socket, sk);
>  		sk->sk_write_space = tun_sock_write_space;
>  		sk->sk_sndbuf = INT_MAX;
> @@ -1483,6 +1530,23 @@ static void tun_cleanup(void)
>  	rtnl_link_unregister(&tun_link_ops);
>  }
> 
> +/* Get an underlying socket object from tun file.  Returns error unless
>  file is + * attached to a device.  The returned object works like a packet
>  socket, it + * can be used for sock_sendmsg/sock_recvmsg.  The caller is
>  responsible for + * holding a reference to the file for as long as the
>  socket is in use. */ +struct socket *tun_get_socket(struct file *file)
> +{
> +	struct tun_struct *tun;
> +	if (file->f_op != &tun_fops)
> +		return ERR_PTR(-EINVAL);
> +	tun = tun_get(file);
> +	if (!tun)
> +		return ERR_PTR(-EBADFD);
> +	tun_put(tun);
> +	return &tun->socket;
> +}
> +EXPORT_SYMBOL_GPL(tun_get_socket);
> +
>  module_init(tun_init);
>  module_exit(tun_cleanup);
>  MODULE_DESCRIPTION(DRV_DESCRIPTION);
> diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> index 3f5fd52..404abe0 100644
> --- a/include/linux/if_tun.h
> +++ b/include/linux/if_tun.h
> @@ -86,4 +86,18 @@ struct tun_filter {
>  	__u8   addr[0][ETH_ALEN];
>  };
> 
> +#ifdef __KERNEL__
> +#if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
> +struct socket *tun_get_socket(struct file *);
> +#else
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +struct file;
> +struct socket;
> +static inline struct socket *tun_get_socket(struct file *f)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +#endif /* CONFIG_TUN */
> +#endif /* __KERNEL__ */
>  #endif /* __IF_TUN_H */
> 

-- 
paul moore
linux @ hp

^ permalink raw reply

* Re: [net-next PATCH] igb: Use Intel OUI for VF MAC addresses
From: Joe Perches @ 2009-09-11  3:02 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jeff Kirsher, davem, netdev, gospo, Greg Rose, Don Skidmore
In-Reply-To: <20090910190703.25d14533@nehalam>

On Thu, 2009-09-10 at 19:07 -0700, Stephen Hemminger wrote:
> On Thu, 10 Sep 2009 18:48:27 -0700
> Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:
> > From: Gregory Rose <gregory.v.rose@intel.com>
> > This patch changes the default VF MAC address generation to use an Intel
> > Organizational Unit Identifier (OUI), instead of a fully randomized
> > Ethernet address.  This is to help prevent accidental MAC address
> > collisions.

I think this not a very good idea.

> How can probability of collision be lower when the address space
> is smaller? If you are going to use Intel OUI, then you should constrain
> the selection to a portion of that space that is not being used
> by other hardware. I.e if you know Intel assigns addresses to their
> devices in ranges, choose a range that is not in use.

Some other possibilities might be:

o Apply for a Linux specific OUI, maybe via the Linux Foundation,
  and assign MAC random addresses using only that OUI.
o Avoid assigning random MAC addresses that use the initial values
  of already assigned OUIs.

  http://standards.ieee.org/regauth/oui/oui.txt

  Unfortunately, this is not a complete list because several
  vendors have just picked a number at random.

  Nominally assigned leading bytes are:

  00 02 04 08 0C
  10 11 14 18 1C
  20 24 28
  30 34 3C
  40 44 48
  58
  60 64 68 6C
  70 74 78 7C
  80 88
  90 94 98 9C
  A0 A4 A8 AA AC
  B0 B4 B8 BC
  C0 C4 C8 CC
  D4 D8 DC
  E0 E4 E8 EC
  F0 F4

  Maybe just pick an unused specific leading byte.



^ permalink raw reply

* [patch] igb: Tidy vf initialisation
From: Simon Horman @ 2009-09-11  2:18 UTC (permalink / raw)
  To: e1000-devel, netdev
  Cc: John Ronciak, Bruce Allan, Jesse Brandeburg, Jeff Kirsher

This is purely cosmetic, but to my mind it makes the code a lot easier
to follow.

Signed-off-by: Simon Horman <horms@verge.net.au>

Index: net-next-2.6/drivers/net/igb/igb_main.c
===================================================================
--- net-next-2.6.orig/drivers/net/igb/igb_main.c	2009-09-09 15:04:34.000000000 +1000
+++ net-next-2.6/drivers/net/igb/igb_main.c	2009-09-09 16:06:59.000000000 +1000
@@ -1194,6 +1194,44 @@ static const struct net_device_ops igb_n
 #endif
 };
 
+/* Since IOV functionality isn't critical to base device function we can
+ * accept failure.  If it fails we don't allow IOV to be enabled */
+static void __devinit igb_probe_vf(struct pci_dev *pdev,
+				   struct igb_adapter *adapter)
+{
+#ifdef CONFIG_PCI_IOV
+	/* 82576 supports a maximum of 7 VFs in addition to the PF */
+	unsigned int num_vfs = (max_vfs > 7) ? 7 : max_vfs;
+	int i;
+	unsigned char mac_addr[ETH_ALEN];
+
+	if (adapter->hw.mac.type != e1000_82576 || !num_vfs)
+		return;
+
+	adapter->vf_data = kcalloc(num_vfs, sizeof(struct vf_data_storage),
+				   GFP_KERNEL);
+	if (!adapter->vf_data) {
+		dev_err(&pdev->dev, "Could not allocate VF private data - "
+				    "IOV enable failed\n");
+		return;
+	}
+
+	if (pci_enable_sriov(pdev, num_vfs)) {
+		kfree(adapter->vf_data);
+		adapter->vf_data = NULL;
+		return;
+	}
+
+	adapter->vfs_allocated_count = num_vfs;
+	dev_info(&pdev->dev, "%d vfs allocated\n", num_vfs);
+	for (i = 0; i < adapter->vfs_allocated_count; i++) {
+		random_ether_addr(mac_addr);
+		igb_set_vf_mac(adapter, i, mac_addr);
+	}
+#endif
+	return;
+}
+
 /**
  * igb_probe - Device Initialization Routine
  * @pdev: PCI device information struct
@@ -1307,46 +1345,8 @@ static int __devinit igb_probe(struct pc
 	if (err)
 		goto err_sw_init;
 
-#ifdef CONFIG_PCI_IOV
-	/* since iov functionality isn't critical to base device function we
-	 * can accept failure.  If it fails we don't allow iov to be enabled */
-	if (hw->mac.type == e1000_82576) {
-		/* 82576 supports a maximum of 7 VFs in addition to the PF */
-		unsigned int num_vfs = (max_vfs > 7) ? 7 : max_vfs;
-		int i;
-		unsigned char mac_addr[ETH_ALEN];
-
-		if (num_vfs) {
-			adapter->vf_data = kcalloc(num_vfs,
-						sizeof(struct vf_data_storage),
-						GFP_KERNEL);
-			if (!adapter->vf_data) {
-				dev_err(&pdev->dev,
-				        "Could not allocate VF private data - "
-					"IOV enable failed\n");
-			} else {
-				err = pci_enable_sriov(pdev, num_vfs);
-				if (!err) {
-					adapter->vfs_allocated_count = num_vfs;
-					dev_info(&pdev->dev,
-					         "%d vfs allocated\n",
-					         num_vfs);
-					for (i = 0;
-					     i < adapter->vfs_allocated_count;
-					     i++) {
-						random_ether_addr(mac_addr);
-						igb_set_vf_mac(adapter, i,
-						               mac_addr);
-					}
-				} else {
-					kfree(adapter->vf_data);
-					adapter->vf_data = NULL;
-				}
-			}
-		}
-	}
+	igb_probe_vf(pdev, adapter);
 
-#endif
 	/* setup the private structure */
 	err = igb_sw_init(adapter);
 	if (err)

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day 
trial. Simplify your report design, integration and deployment - and focus on 
what you do best, core application coding. Discover what's new with 
Crystal Reports now.  http://p.sf.net/sfu/bobj-july

^ permalink raw reply

* [patch] e1000: Tidy up logic in e1000_enable_tx_pkt_filtering()
From: Simon Horman @ 2009-09-11  2:17 UTC (permalink / raw)
  To: e1000-devel, netdev
  Cc: John Ronciak, Bruce Allan, Jesse Brandeburg, Jeff Kirsher

This is purely cosmetic, but to my mind it makes the code a lot easier
to follow.

Signed-off-by: Simon Horman <horms@verge.net.au>

--- 

Lightly tested.

Index: net-next-2.6/drivers/net/e1000/e1000_hw.c
===================================================================
--- net-next-2.6.orig/drivers/net/e1000/e1000_hw.c	2009-08-31 17:44:07.000000000 +1000
+++ net-next-2.6/drivers/net/e1000/e1000_hw.c	2009-09-09 11:40:14.000000000 +1000
@@ -7604,31 +7604,33 @@ bool e1000_enable_tx_pkt_filtering(struc
 {
     /* called in init as well as watchdog timer functions */
 
-    s32 ret_val, checksum;
-    bool tx_filter = false;
+    s32 checksum;
+    bool tx_filter = true;
     struct e1000_host_mng_dhcp_cookie *hdr = &(hw->mng_cookie);
     u8 *buffer = (u8 *) &(hw->mng_cookie);
 
-    if (e1000_check_mng_mode(hw)) {
-        ret_val = e1000_mng_enable_host_if(hw);
-        if (ret_val == E1000_SUCCESS) {
-            ret_val = e1000_host_if_read_cookie(hw, buffer);
-            if (ret_val == E1000_SUCCESS) {
-                checksum = hdr->checksum;
-                hdr->checksum = 0;
-                if ((hdr->signature == E1000_IAMT_SIGNATURE) &&
-                    checksum == e1000_calculate_mng_checksum((char *)buffer,
-                                               E1000_MNG_DHCP_COOKIE_LENGTH)) {
-                    if (hdr->status &
-                        E1000_MNG_DHCP_COOKIE_STATUS_PARSING_SUPPORT)
-                        tx_filter = true;
-                } else
-                    tx_filter = true;
-            } else
-                tx_filter = true;
-        }
-    }
+    if (!e1000_check_mng_mode(hw))
+        goto out_false;
+
+    if (e1000_mng_enable_host_if(hw) != E1000_SUCCESS)
+        goto out_false;
+
+    if (e1000_host_if_read_cookie(hw, buffer) != E1000_SUCCESS)
+	goto out_true;
 
+    checksum = hdr->checksum;
+    hdr->checksum = 0;
+    if ((hdr->signature != E1000_IAMT_SIGNATURE) ||
+        checksum != e1000_calculate_mng_checksum((char *)buffer,
+                                   E1000_MNG_DHCP_COOKIE_LENGTH))
+	goto out_true;
+
+    if (hdr->status & E1000_MNG_DHCP_COOKIE_STATUS_PARSING_SUPPORT)
+	goto out_true;
+
+out_false:
+    tx_filter = false;
+out_true:
     hw->tx_pkt_filtering = tx_filter;
     return tx_filter;
 }

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day 
trial. Simplify your report design, integration and deployment - and focus on 
what you do best, core application coding. Discover what's new with 
Crystal Reports now.  http://p.sf.net/sfu/bobj-july

^ permalink raw reply

* Re: [net-next PATCH] igb: Use Intel OUI for VF MAC addresses
From: Stephen Hemminger @ 2009-09-11  2:07 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, gospo, Greg Rose, Jeff Kirsher, Don Skidmore
In-Reply-To: <20090911014757.19631.66570.stgit@localhost.localdomain>

On Thu, 10 Sep 2009 18:48:27 -0700
Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:

> From: Gregory Rose <gregory.v.rose@intel.com>
> 
> This patch changes the default VF MAC address generation to use an Intel
> Organizational Unit Identifier (OUI), instead of a fully randomized
> Ethernet address.  This is to help prevent accidental MAC address
> collisions.

How can probability of collision be lower when the address space
is smaller? If you are going to use Intel OUI, then you should constrain
the selection to a portion of that space that is not being used
by other hardware. I.e if you know Intel assigns addresses to their
devices in ranges, choose a range that is not in use.

^ permalink raw reply

* [net-next PATCH] igb: Use Intel OUI for VF MAC addresses
From: Jeff Kirsher @ 2009-09-11  1:48 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Greg Rose, Jeff Kirsher, Don Skidmore

From: Gregory Rose <gregory.v.rose@intel.com>

This patch changes the default VF MAC address generation to use an Intel
Organizational Unit Identifier (OUI), instead of a fully randomized
Ethernet address.  This is to help prevent accidental MAC address
collisions.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
---

 drivers/net/igb/igb.h      |    1 +
 drivers/net/igb/igb_main.c |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 7126fea..463d178 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -65,6 +65,7 @@ struct igb_adapter;
 #define IGB_MAX_VF_MC_ENTRIES              30
 #define IGB_MAX_VF_FUNCTIONS               8
 #define IGB_MAX_VFTA_ENTRIES               128
+#define OUI_LEN                            3
 
 struct vf_data_storage {
 	unsigned char vf_mac_addresses[ETH_ALEN];
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 943186b..290555c 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -1315,6 +1315,7 @@ static int __devinit igb_probe(struct pci_dev *pdev,
 		unsigned int num_vfs = (max_vfs > 7) ? 7 : max_vfs;
 		int i;
 		unsigned char mac_addr[ETH_ALEN];
+		unsigned char oui[OUI_LEN] = {0x02, 0xAA, 0x00};
 
 		if (num_vfs) {
 			adapter->vf_data = kcalloc(num_vfs,
@@ -1335,6 +1336,7 @@ static int __devinit igb_probe(struct pci_dev *pdev,
 					     i < adapter->vfs_allocated_count;
 					     i++) {
 						random_ether_addr(mac_addr);
+						memcpy(mac_addr, oui, OUI_LEN);
 						igb_set_vf_mac(adapter, i,
 						               mac_addr);
 					}


^ permalink raw reply related

* Re: netfilter 00/31: netfilter 2.6.32 update
From: David Miller @ 2009-09-11  1:25 UTC (permalink / raw)
  To: kaber; +Cc: netdev, netfilter-devel
In-Reply-To: <20090910161142.31179.5256.sendpatchset@x2.localnet>

From: Patrick McHardy <kaber@trash.net>
Date: Thu, 10 Sep 2009 18:11:46 +0200 (MEST)

> Please apply or pull from:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6.git master

Pulled, thanks Patrick!

^ permalink raw reply

* Re: igb bandwidth allocation configuration
From: Simon Horman @ 2009-09-11  0:38 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: e1000-devel, netdev
In-Reply-To: <4AA8E939.6050208@trash.net>

On Thu, Sep 10, 2009 at 01:55:37PM +0200, Patrick McHardy wrote:
> Patrick McHardy wrote:
> > Simon Horman wrote:
> >>
> >> I have been looking into adding support the 82586's per-PF/VF
> >> bandwidth allocation to the igb driver. It seems that the trickiest
> >> part is working out how to expose things to user-space.
> >>
> >> ...
> >> Internally it seems that actually the limits are applied to HW Tx queues
> >> rather than directly VMs. There are 16 such queues. Accordingly it might
> >> be useful to design an interface to set limits per-queue using ethtool.
> >> But this would seem to also require exposing which queues are associated
> >> with which PF/VF.
> > 
> > Just an idea since I don't know much about this stuff:
> > 
> > Since we now have the mq packet scheduler, which exposes the device
> > queues as qdisc classes, how about adding driver-specific configuration
> > attributes that are passed to the driver by the mq scheduler? This
> > would allow to configure per-queue bandwidth limits using regular TC
> > commands and also use those limits without VFs for any kind of traffic.
> > Drivers not supporting this would refuse unsupported options.
> 
> Attached patch demonstrates the idea. Compile-tested only.
> 

Thanks, that seems like a pretty good idea to me.
I'll see if I can make it work.


^ permalink raw reply

* Re: [PATCH net-next-2.6] bonding: make ab_arp select active slaves as other modes
From: Jay Vosburgh @ 2009-09-11  0:32 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, bonding-devel, nicolas.2p.debian
In-Reply-To: <20090831210937.GA3152@psychotron.redhat.com>

Jiri Pirko <jpirko@redhat.com> wrote:

>When I was implementing primary_passive option (formely named primary_lazy) I've
>run into troubles with ab_arp. This is the only mode which is not using
>bond_select_active_slave() function to select active slave and instead it
>selects it itself. This seems to be not the right behaviour and it would be
>better to do it in bond_select_active_slave() for all cases. This patch makes
>this happen. Please review.

	Sorry for the delay in response; was out of the office.

	My first question is whether this affect the "current_arp_slave"
behavior, i.e., the round-robining of the ARP probes when no slaves are
active.  Is that something you checked?

	I'll give this a test tomorrow as well.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 7c0e0bd..6ebd88d 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1093,15 +1093,8 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
> 			return NULL; /* still no slave, return NULL */
> 	}
>
>-	/*
>-	 * first try the primary link; if arping, a link must tx/rx
>-	 * traffic before it can be considered the curr_active_slave.
>-	 * also, we would skip slaves between the curr_active_slave
>-	 * and primary_slave that may be up and able to arp
>-	 */
> 	if ((bond->primary_slave) &&
>-	    (!bond->params.arp_interval) &&
>-	    (IS_UP(bond->primary_slave->dev))) {
>+	    bond->primary_slave->link == BOND_LINK_UP) {
> 		new_active = bond->primary_slave;
> 	}
>
>@@ -1109,15 +1102,14 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
> 	old_active = new_active;
>
> 	bond_for_each_slave_from(bond, new_active, i, old_active) {
>-		if (IS_UP(new_active->dev)) {
>-			if (new_active->link == BOND_LINK_UP) {
>-				return new_active;
>-			} else if (new_active->link == BOND_LINK_BACK) {
>-				/* link up, but waiting for stabilization */
>-				if (new_active->delay < mintime) {
>-					mintime = new_active->delay;
>-					bestslave = new_active;
>-				}
>+		if (new_active->link == BOND_LINK_UP) {
>+			return new_active;
>+		} else if (new_active->link == BOND_LINK_BACK &&
>+			   IS_UP(new_active->dev)) {
>+			/* link up, but waiting for stabilization */
>+			if (new_active->delay < mintime) {
>+				mintime = new_active->delay;
>+				bestslave = new_active;

	Is there a functional reason for rearranging this (i.e., did the
use of select_active_slave need this for some reason)?


> 			}
> 		}
> 	}
>@@ -2929,18 +2921,6 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
> 		}
> 	}
>
>-	read_lock(&bond->curr_slave_lock);
>-
>-	/*
>-	 * Trigger a commit if the primary option setting has changed.
>-	 */
>-	if (bond->primary_slave &&
>-	    (bond->primary_slave != bond->curr_active_slave) &&
>-	    (bond->primary_slave->link == BOND_LINK_UP))
>-		commit++;
>-
>-	read_unlock(&bond->curr_slave_lock);
>-
> 	return commit;
> }
>
>@@ -2961,90 +2941,58 @@ static void bond_ab_arp_commit(struct bonding *bond, int delta_in_ticks)
> 			continue;
>
> 		case BOND_LINK_UP:
>-			write_lock_bh(&bond->curr_slave_lock);
>-
>-			if (!bond->curr_active_slave &&
>-			    time_before_eq(jiffies, dev_trans_start(slave->dev) +
>-					   delta_in_ticks)) {
>+			if ((!bond->curr_active_slave &&
>+			     time_before_eq(jiffies,
>+					    dev_trans_start(slave->dev) +
>+					    delta_in_ticks)) ||
>+			    bond->curr_active_slave != slave) {
> 				slave->link = BOND_LINK_UP;
>-				bond_change_active_slave(bond, slave);
> 				bond->current_arp_slave = NULL;
>
> 				pr_info(DRV_NAME
>-				       ": %s: %s is up and now the "
>-				       "active interface\n",
>-				       bond->dev->name, slave->dev->name);
>-
>-			} else if (bond->curr_active_slave != slave) {
>-				/* this slave has just come up but we
>-				 * already have a current slave; this can
>-				 * also happen if bond_enslave adds a new
>-				 * slave that is up while we are searching
>-				 * for a new slave
>-				 */
>-				slave->link = BOND_LINK_UP;
>-				bond_set_slave_inactive_flags(slave);
>-				bond->current_arp_slave = NULL;
>+					": %s: link status definitely "
>+					"up for interface %s.\n",
>+					bond->dev->name, slave->dev->name);
>
>-				pr_info(DRV_NAME
>-				       ": %s: backup interface %s is now up\n",
>-				       bond->dev->name, slave->dev->name);
>-			}
>+				if (!bond->curr_active_slave ||
>+				    (slave == bond->primary_slave))
>+					goto do_failover;
>
>-			write_unlock_bh(&bond->curr_slave_lock);
>+			}
>
>-			break;
>+			continue;
>
> 		case BOND_LINK_DOWN:
> 			if (slave->link_failure_count < UINT_MAX)
> 				slave->link_failure_count++;
>
> 			slave->link = BOND_LINK_DOWN;
>+			bond_set_slave_inactive_flags(slave);
>
>-			if (slave == bond->curr_active_slave) {
>-				pr_info(DRV_NAME
>-				       ": %s: link status down for active "
>-				       "interface %s, disabling it\n",
>-				       bond->dev->name, slave->dev->name);
>-
>-				bond_set_slave_inactive_flags(slave);
>-
>-				write_lock_bh(&bond->curr_slave_lock);
>-
>-				bond_select_active_slave(bond);
>-				if (bond->curr_active_slave)
>-					bond->curr_active_slave->jiffies =
>-						jiffies;
>-
>-				write_unlock_bh(&bond->curr_slave_lock);
>+			pr_info(DRV_NAME
>+				": %s: link status definitely down for "
>+				"interface %s, disabling it\n",
>+				bond->dev->name, slave->dev->name);
>
>+			if (slave == bond->curr_active_slave) {
> 				bond->current_arp_slave = NULL;
>-
>-			} else if (slave->state == BOND_STATE_BACKUP) {
>-				pr_info(DRV_NAME
>-				       ": %s: backup interface %s is now down\n",
>-				       bond->dev->name, slave->dev->name);
>-
>-				bond_set_slave_inactive_flags(slave);
>+				goto do_failover;
> 			}
>-			break;
>+
>+			continue;
>
> 		default:
> 			pr_err(DRV_NAME
> 			       ": %s: impossible: new_link %d on slave %s\n",
> 			       bond->dev->name, slave->new_link,
> 			       slave->dev->name);
>+			continue;
> 		}
>-	}
>
>-	/*
>-	 * No race with changes to primary via sysfs, as we hold rtnl.
>-	 */
>-	if (bond->primary_slave &&
>-	    (bond->primary_slave != bond->curr_active_slave) &&
>-	    (bond->primary_slave->link == BOND_LINK_UP)) {
>+do_failover:
>+		ASSERT_RTNL();
> 		write_lock_bh(&bond->curr_slave_lock);
>-		bond_change_active_slave(bond, bond->primary_slave);
>+		bond_select_active_slave(bond);
> 		write_unlock_bh(&bond->curr_slave_lock);
> 	}
>
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html




^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] bonding: introduce primary_passive option
From: Jay Vosburgh @ 2009-09-11  0:08 UTC (permalink / raw)
  To: =?ISO-8859-1?Q?Nicolas_de_Peslo=FCan?=
  Cc: Jiri Pirko, netdev, davem, bonding-devel
In-Reply-To: <4AA97855.8070301@free.fr>

Nicolas de Pesloüan <nicolas.2p.debian@free.fr> wrote:

>Jiri Pirko wrote:
>> (updated 2)
>>
>> In some cases there is not desirable to switch back to primary interface when
>> it's link recovers and rather stay with currently active one. We need to avoid
>> packetloss as much as we can in some cases. This is solved by introducing
>> primary_passive option. Note that enslaved primary slave is set as current
>> active no matter what.
>>
>> This patch depends on the following one:
>> [net-next-2.6] bonding: make ab_arp select active slaves as other modes
>> http://patchwork.ozlabs.org/patch/32684/
>>
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>>
>> diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
>> index d5181ce..e5f2c31 100644
>> --- a/Documentation/networking/bonding.txt
>> +++ b/Documentation/networking/bonding.txt
>> @@ -614,6 +614,32 @@ primary
>>   	The primary option is only valid for active-backup mode.
>>  +primary_passive
>> +
>> +	Specifies the behavior of the current active slave when the primary was
>> +	down and comes back up.  This option is designed to prevent
>> +	flip-flopping between the primary slave and other slaves.  The possible
>> +	values and their respective effects are:
>> +
>> +	disabled or 0 (default)
>> +
>> +		The primary slave becomes the active slave whenever it comes
>> +		back up.
>> +
>> +	better or 1
>> +
>> +		The primary slave becomes the active slave when it comes back
>> +		up, if the speed and duplex of the primary slave is better
>> +		than the speed and duplex of the current active slave.
>> +
>> +	always or 2
>> +
>> +		The primary slave becomes the active slave only if the current
>> +		active slave fails and the primary slave is up.
>> +
>> +	When no slave are active, if the primary comes back up, it becomes the
>> +	active slave, regardless of the value of primary_return
>> +
>>  updelay
>
>My feeling is that using primary_passive=disabled/better/always is far
>less clear than primary_return=always/better/failure_only.
>
>The normal (current) behavior is to always return to primay if it is
>up. You want to add the ability to return to it only on failure of the
>current active slave and I suggested to add the ability the return to it
>if it is "better" than the current active slave.

	Since this has changed from a real option to a "modify behavior
of another option" option, I'd call it something along the lines of
"primary_reselect" with the "always / better / failure" set of choices.

	Also, if "better" isn't implemented, leave it out entirely
(don't define the label or option stuff).  In the future, then, it can
be added in a complete patch, rather than splitting it across two
patches.

>Hence the suggested name for the option and option values.
>
>>   	Specifies the time, in milliseconds, to wait before enabling a
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index a7e731f..193eca4 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -94,6 +94,7 @@ static int downdelay;
>>  static int use_carrier	= 1;
>>  static char *mode;
>>  static char *primary;
>> +static char *primary_passive;
>>  static char *lacp_rate;
>>  static char *ad_select;
>>  static char *xmit_hash_policy;
>> @@ -126,6 +127,13 @@ MODULE_PARM_DESC(mode, "Mode of operation : 0 for balance-rr, "
>>  		       "6 for balance-alb");
>>  module_param(primary, charp, 0);
>>  MODULE_PARM_DESC(primary, "Primary network device to use");
>> +module_param(primary_passive, charp, 0);
>> +MODULE_PARM_DESC(primary_passive, "Do not set primary slave active "
>> +				  "once it comes up; "
>> +				  "0 for disabled (default), "
>> +				  "1 for on only if speed of primary is not "
>> +				  "better, "
>> +				  "2 for always on");
>
>You have a double negative assertion here : a "do not" option whose value
>is "disabled". For clarity, I suggest to have a "do" option whose value is
>"enabled".
>
>This is probably the reason why I suggested primary_return instead of
>primary_passive. primary_passive means "refuse to return to primary" and
>when disabled, it becomes "don't refuse to return to primary", which
>should be "accept to return to primary" instead :-)
>
>It might be seen as not being that important, but having an option whose
>name and values are easy to understand leads to an easy to use
>option... :-)
>
>>  module_param(lacp_rate, charp, 0);
>>  MODULE_PARM_DESC(lacp_rate, "LACPDU tx rate to request from 802.3ad partner "
>>  			    "(slow/fast)");
>> @@ -200,6 +208,13 @@ const struct bond_parm_tbl fail_over_mac_tbl[] = {
>>  {	NULL,			-1},
>>  };
>>  +const struct bond_parm_tbl pri_passive_tbl[] = {
>> +{	"disabled",		BOND_PRI_PASSIVE_DISABLED},
>> +{	"better",		BOND_PRI_PASSIVE_BETTER},
>> +{	"always",		BOND_PRI_PASSIVE_ALWAYS},
>> +{	NULL,			-1},
>> +};
>> +
>>  struct bond_parm_tbl ad_select_tbl[] = {
>>  {	"stable",	BOND_AD_STABLE},
>>  {	"bandwidth",	BOND_AD_BANDWIDTH},
>> @@ -1070,6 +1085,25 @@ out:
>>   }
>>  +static bool bond_should_loose_active(struct bonding *bond)

	I'm not sure what this function name means (beyond the
misspelling of "lose"); it's really "bond_should_change_active" as a
question, correct?

>> +{
>> +	struct slave *prim = bond->primary_slave;
>> +	struct slave *curr = bond->curr_active_slave;
>> +
>> +	if (!prim || !curr || curr->link != BOND_LINK_UP)
>> +		return true;
>> +	if (bond->force_primary) {
>> +		bond->force_primary = false;
>> +		return true;
>> +	}
>> +	if (bond->params.primary_passive == 1 &&
>
>You should use the constants you defined in bonding.h and used in pri_passive_tbl :
>
>if (bond->params.primary_passive == BOND_PRI_PASSIVE_BETTER &&
>
>> +	    (prim->speed < curr->speed ||
>> +	     (prim->speed == curr->speed && prim->duplex <= curr->duplex)))
>> +		return false;
>> +	if (bond->params.primary_passive == 2)

	Ah, ok, passive really is implemented; I just hunted for the
BETTER label.  So, yes, use the defined labels.

>You should use the constants you defined in bonding.h and used in pri_passive_tbl :
>
>if (bond->params.primary_passive == BOND_PRI_PASSIVE_ALWAYS)
>
>> +		return false;
>> +	return true;
>> +}
>>   /**
>>   * find_best_interface - select the best available slave to be the active one
>> @@ -1093,15 +1127,9 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
>>  			return NULL; /* still no slave, return NULL */
>>  	}
>>  -	/*
>> -	 * first try the primary link; if arping, a link must tx/rx
>> -	 * traffic before it can be considered the curr_active_slave.
>> -	 * also, we would skip slaves between the curr_active_slave
>> -	 * and primary_slave that may be up and able to arp
>> -	 */
>>  	if ((bond->primary_slave) &&
>> -	    (!bond->params.arp_interval) &&
>> -	    (IS_UP(bond->primary_slave->dev))) {
>> +	    bond->primary_slave->link == BOND_LINK_UP &&
>> +	    bond_should_loose_active(bond)) {
>>  		new_active = bond->primary_slave;
>>  	}
>>  @@ -1109,15 +1137,14 @@ static struct slave
>> *bond_find_best_slave(struct bonding *bond)
>>  	old_active = new_active;
>>   	bond_for_each_slave_from(bond, new_active, i, old_active) {
>> -		if (IS_UP(new_active->dev)) {
>> -			if (new_active->link == BOND_LINK_UP) {
>> -				return new_active;
>> -			} else if (new_active->link == BOND_LINK_BACK) {
>> -				/* link up, but waiting for stabilization */
>> -				if (new_active->delay < mintime) {
>> -					mintime = new_active->delay;
>> -					bestslave = new_active;
>> -				}
>> +		if (new_active->link == BOND_LINK_UP) {
>> +			return new_active;
>> +		} else if (new_active->link == BOND_LINK_BACK &&
>> +			   IS_UP(new_active->dev)) {
>> +			/* link up, but waiting for stabilization */
>> +			if (new_active->delay < mintime) {
>> +				mintime = new_active->delay;
>> +				bestslave = new_active;
>>  			}
>>  		}
>>  	}
>> @@ -1683,8 +1710,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>>   	if (USES_PRIMARY(bond->params.mode) && bond->params.primary[0]) {
>>  		/* if there is a primary slave, remember it */
>> -		if (strcmp(bond->params.primary, new_slave->dev->name) == 0)
>> +		if (strcmp(bond->params.primary, new_slave->dev->name) == 0) {
>>  			bond->primary_slave = new_slave;
>> +			bond->force_primary = true;
>> +		}
>>  	}
>>   	write_lock_bh(&bond->curr_slave_lock);
>> @@ -2929,18 +2958,6 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
>>  		}
>>  	}
>>  -	read_lock(&bond->curr_slave_lock);
>> -
>> -	/*
>> -	 * Trigger a commit if the primary option setting has changed.
>> -	 */
>> -	if (bond->primary_slave &&
>> -	    (bond->primary_slave != bond->curr_active_slave) &&
>> -	    (bond->primary_slave->link == BOND_LINK_UP))
>> -		commit++;
>> -
>> -	read_unlock(&bond->curr_slave_lock);
>> -
>>  	return commit;
>>  }
>>  @@ -2961,90 +2978,58 @@ static void bond_ab_arp_commit(struct bonding
>> *bond, int delta_in_ticks)
>>  			continue;
>>   		case BOND_LINK_UP:
>> -			write_lock_bh(&bond->curr_slave_lock);
>> -
>> -			if (!bond->curr_active_slave &&
>> -			    time_before_eq(jiffies, dev_trans_start(slave->dev) +
>> -					   delta_in_ticks)) {
>> +			if ((!bond->curr_active_slave &&
>> +			     time_before_eq(jiffies,
>> +					    dev_trans_start(slave->dev) +
>> +					    delta_in_ticks)) ||
>> +			    bond->curr_active_slave != slave) {
>>  				slave->link = BOND_LINK_UP;
>> -				bond_change_active_slave(bond, slave);
>>  				bond->current_arp_slave = NULL;
>>   				pr_info(DRV_NAME
>> -				       ": %s: %s is up and now the "
>> -				       "active interface\n",
>> -				       bond->dev->name, slave->dev->name);
>> -
>> -			} else if (bond->curr_active_slave != slave) {
>> -				/* this slave has just come up but we
>> -				 * already have a current slave; this can
>> -				 * also happen if bond_enslave adds a new
>> -				 * slave that is up while we are searching
>> -				 * for a new slave
>> -				 */
>> -				slave->link = BOND_LINK_UP;
>> -				bond_set_slave_inactive_flags(slave);
>> -				bond->current_arp_slave = NULL;
>> +					": %s: link status definitely "
>> +					"up for interface %s.\n",
>> +					bond->dev->name, slave->dev->name);
>>  -				pr_info(DRV_NAME
>> -				       ": %s: backup interface %s is now up\n",
>> -				       bond->dev->name, slave->dev->name);
>> -			}
>> +				if (!bond->curr_active_slave ||
>> +				    (slave == bond->primary_slave))
>> +					goto do_failover;
>>  -			write_unlock_bh(&bond->curr_slave_lock);
>> +			}
>>  -			break;
>> +			continue;
>>   		case BOND_LINK_DOWN:
>>  			if (slave->link_failure_count < UINT_MAX)
>>  				slave->link_failure_count++;
>>   			slave->link = BOND_LINK_DOWN;
>> +			bond_set_slave_inactive_flags(slave);
>>  -			if (slave == bond->curr_active_slave) {
>> -				pr_info(DRV_NAME
>> -				       ": %s: link status down for active "
>> -				       "interface %s, disabling it\n",
>> -				       bond->dev->name, slave->dev->name);
>> -
>> -				bond_set_slave_inactive_flags(slave);
>> -
>> -				write_lock_bh(&bond->curr_slave_lock);
>> -
>> -				bond_select_active_slave(bond);
>> -				if (bond->curr_active_slave)
>> -					bond->curr_active_slave->jiffies =
>> -						jiffies;
>> -
>> -				write_unlock_bh(&bond->curr_slave_lock);
>> +			pr_info(DRV_NAME
>> +				": %s: link status definitely down for "
>> +				"interface %s, disabling it\n",
>> +				bond->dev->name, slave->dev->name);
>>  +			if (slave == bond->curr_active_slave) {
>>  				bond->current_arp_slave = NULL;
>> -
>> -			} else if (slave->state == BOND_STATE_BACKUP) {
>> -				pr_info(DRV_NAME
>> -				       ": %s: backup interface %s is now down\n",
>> -				       bond->dev->name, slave->dev->name);
>> -
>> -				bond_set_slave_inactive_flags(slave);
>> +				goto do_failover;
>>  			}
>> -			break;
>> +
>> +			continue;
>>   		default:
>>  			pr_err(DRV_NAME
>>  			       ": %s: impossible: new_link %d on slave %s\n",
>>  			       bond->dev->name, slave->new_link,
>>  			       slave->dev->name);
>> +			continue;
>>  		}
>> -	}
>>  -	/*
>> -	 * No race with changes to primary via sysfs, as we hold rtnl.
>> -	 */
>> -	if (bond->primary_slave &&
>> -	    (bond->primary_slave != bond->curr_active_slave) &&
>> -	    (bond->primary_slave->link == BOND_LINK_UP)) {
>> +do_failover:
>> +		ASSERT_RTNL();
>>  		write_lock_bh(&bond->curr_slave_lock);
>> -		bond_change_active_slave(bond, bond->primary_slave);
>> +		bond_select_active_slave(bond);
>>  		write_unlock_bh(&bond->curr_slave_lock);
>>  	}
>>  @@ -4695,7 +4680,7 @@ int bond_parse_parm(const char *buf, const struct
>> bond_parm_tbl *tbl)
>>   static int bond_check_params(struct bond_params *params)
>>  {
>> -	int arp_validate_value, fail_over_mac_value;
>> +	int arp_validate_value, fail_over_mac_value, primary_passive_value;
>>   	/*
>>  	 * Convert string parameters.
>> @@ -4994,6 +4979,20 @@ static int bond_check_params(struct bond_params *params)
>>  		primary = NULL;
>>  	}
>>  +	if (primary && primary_passive) {
>> +		primary_passive_value = bond_parse_parm(primary_passive,
>> +							pri_passive_tbl);
>> +		if (primary_passive_value == -1) {
>> +			pr_err(DRV_NAME
>> +			       ": Error: Invalid primary_passive \"%s\"\n",
>> +			       primary_passive ==
>> +					NULL ? "NULL" : primary_passive);
>> +			return -EINVAL;
>> +		}
>> +	} else {
>> +		primary_passive_value = BOND_PRI_PASSIVE_DISABLED;
>> +	}
>> +
>>  	if (fail_over_mac) {
>>  		fail_over_mac_value = bond_parse_parm(fail_over_mac,
>>  						      fail_over_mac_tbl);
>> @@ -5025,6 +5024,7 @@ static int bond_check_params(struct bond_params *params)
>>  	params->use_carrier = use_carrier;
>>  	params->lacp_fast = lacp_fast;
>>  	params->primary[0] = 0;
>> +	params->primary_passive = primary_passive_value;
>>  	params->fail_over_mac = fail_over_mac_value;
>>   	if (primary) {
>> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>> index 6044e12..84ce933 100644
>> --- a/drivers/net/bonding/bond_sysfs.c
>> +++ b/drivers/net/bonding/bond_sysfs.c
>> @@ -1212,6 +1212,59 @@ static DEVICE_ATTR(primary, S_IRUGO | S_IWUSR,
>>  		   bonding_show_primary, bonding_store_primary);
>>   /*
>> + * Show and set the primary_passive flag.
>> + */
>> +static ssize_t bonding_show_primary_passive(struct device *d,
>> +					    struct device_attribute *attr,
>> +					    char *buf)
>> +{
>> +	struct bonding *bond = to_bond(d);
>> +
>> +	return sprintf(buf, "%s %d\n",
>> +		       pri_passive_tbl[bond->params.primary_passive].modename,
>> +		       bond->params.primary_passive);
>> +}
>> +
>> +static ssize_t bonding_store_primary_passive(struct device *d,
>> +					     struct device_attribute *attr,
>> +					     const char *buf, size_t count)
>> +{
>> +	int new_value, ret = count;
>> +	struct bonding *bond = to_bond(d);
>> +
>> +	if (!rtnl_trylock())
>> +		return restart_syscall();
>> +
>> +	new_value = bond_parse_parm(buf, pri_passive_tbl);
>> +	if (new_value < 0)  {
>> +		pr_err(DRV_NAME
>> +		       ": %s: Ignoring invalid primary_passive value %.*s.\n",
>> +		       bond->dev->name,
>> +		       (int) strlen(buf) - 1, buf);
>> +		ret = -EINVAL;
>> +		goto out;
>> +	} else {
>> +		bond->params.primary_passive = new_value;
>> +		pr_info(DRV_NAME ": %s: setting primary_passive to %s (%d).\n",
>> +		       bond->dev->name, pri_passive_tbl[new_value].modename,
>> +		       new_value);
>> +		if (new_value == 0 || new_value == 1) {

	Magic numbers again, but this test shouldn't be necessary.  The
new_value is known to be valid if bond_parse_parm didn't return failure.

>> +			bond->force_primary = true;
>> +			read_lock(&bond->lock);
>> +			write_lock_bh(&bond->curr_slave_lock);
>> +			bond_select_active_slave(bond);
>> +			write_unlock_bh(&bond->curr_slave_lock);
>> +			read_unlock(&bond->lock);
>> +		}
>> +	}
>> +out:
>> +	rtnl_unlock();
>> +	return ret;
>> +}
>> +static DEVICE_ATTR(primary_passive, S_IRUGO | S_IWUSR,
>> +		   bonding_show_primary_passive, bonding_store_primary_passive);
>> +
>> +/*
>>   * Show and set the use_carrier flag.
>>   */
>>  static ssize_t bonding_show_carrier(struct device *d,
>> @@ -1500,6 +1553,7 @@ static struct attribute *per_bond_attrs[] = {
>>  	&dev_attr_num_unsol_na.attr,
>>  	&dev_attr_miimon.attr,
>>  	&dev_attr_primary.attr,
>> +	&dev_attr_primary_passive.attr,
>>  	&dev_attr_use_carrier.attr,
>>  	&dev_attr_active_slave.attr,
>>  	&dev_attr_mii_status.attr,
>> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>> index 6824771..9a9e0c7 100644
>> --- a/drivers/net/bonding/bonding.h
>> +++ b/drivers/net/bonding/bonding.h
>> @@ -131,6 +131,7 @@ struct bond_params {
>>  	int lacp_fast;
>>  	int ad_select;
>>  	char primary[IFNAMSIZ];
>> +	int primary_passive;
>>  	__be32 arp_targets[BOND_MAX_ARP_TARGETS];
>>  };
>>  @@ -190,6 +191,7 @@ struct bonding {
>>  	struct   slave *curr_active_slave;
>>  	struct   slave *current_arp_slave;
>>  	struct   slave *primary_slave;
>> +	bool     force_primary;
>>  	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
>>  	rwlock_t lock;
>>  	rwlock_t curr_slave_lock;
>> @@ -258,6 +260,10 @@ static inline bool bond_is_lb(const struct bonding *bond)
>>  		|| bond->params.mode == BOND_MODE_ALB;
>>  }
>>  +#define BOND_PRI_PASSIVE_DISABLED	0
>> +#define BOND_PRI_PASSIVE_BETTER		1
>> +#define BOND_PRI_PASSIVE_ALWAYS		2
>> +
>>  #define BOND_FOM_NONE			0
>>  #define BOND_FOM_ACTIVE			1
>>  #define BOND_FOM_FOLLOW			2
>> @@ -348,6 +354,7 @@ extern const struct bond_parm_tbl bond_mode_tbl[];
>>  extern const struct bond_parm_tbl xmit_hashtype_tbl[];
>>  extern const struct bond_parm_tbl arp_validate_tbl[];
>>  extern const struct bond_parm_tbl fail_over_mac_tbl[];
>> +extern const struct bond_parm_tbl pri_passive_tbl[];
>>  extern struct bond_parm_tbl ad_select_tbl[];
>>   #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>>
>
>	Nicolas.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* [PATCH -next] isdn: fix netjet/isdnhdlc build errors
From: Randy Dunlap @ 2009-09-10 22:15 UTC (permalink / raw)
  To: netdev; +Cc: davem, Karsten Keil

From: Randy Dunlap <randy.dunlap@oracle.com>

The netjet driver uses isdnhdlc interfaces (as does another ISDN
driver), so move that driver source file into the top-level
drivers/isdn/ directory and make it available to any ISDN drivers.

Fixes these build errors:

ERROR: "isdnhdlc_decode" [drivers/isdn/hardware/mISDN/netjet.ko] undefined!
ERROR: "isdnhdlc_encode" [drivers/isdn/hardware/mISDN/netjet.ko] undefined!
ERROR: "isdnhdlc_out_init" [drivers/isdn/hardware/mISDN/netjet.ko] undefined!
ERROR: "isdnhdlc_rcv_init" [drivers/isdn/hardware/mISDN/netjet.ko] undefined!

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Karsten Keil <isdn@linux-pingi.de>
---
 drivers/isdn/Kconfig        |    5 
 drivers/isdn/Makefile       |    2 
 drivers/isdn/i4l/Kconfig    |    6 
 drivers/isdn/i4l/Makefile   |    1 
 drivers/isdn/i4l/isdnhdlc.c |  630 ----------------------------------
 drivers/isdn/isdnhdlc.c     |  630 ++++++++++++++++++++++++++++++++++
 6 files changed, 637 insertions(+), 637 deletions(-)

--- linux-next-20090909.orig/drivers/isdn/Makefile
+++ linux-next-20090909/drivers/isdn/Makefile
@@ -15,3 +15,5 @@ obj-$(CONFIG_ISDN_DRV_LOOP)		+= isdnloop
 obj-$(CONFIG_ISDN_DRV_ACT2000)		+= act2000/
 obj-$(CONFIG_HYSDN)			+= hysdn/
 obj-$(CONFIG_ISDN_DRV_GIGASET)		+= gigaset/
+
+obj-$(CONFIG_ISDN_HDLC)		+= isdnhdlc.o
--- linux-next-20090909.orig/drivers/isdn/i4l/Makefile
+++ linux-next-20090909/drivers/isdn/i4l/Makefile
@@ -4,7 +4,6 @@
 
 obj-$(CONFIG_ISDN_I4L)		+= isdn.o
 obj-$(CONFIG_ISDN_PPP_BSDCOMP)	+= isdn_bsdcomp.o
-obj-$(CONFIG_ISDN_HDLC)		+= isdnhdlc.o
 
 # Multipart objects.
 
--- linux-next-20090909.orig/drivers/isdn/i4l/isdnhdlc.c
+++ /dev/null
@@ -1,630 +0,0 @@
-/*
- * isdnhdlc.c  --  General purpose ISDN HDLC decoder.
- *
- * Copyright (C)
- *	2009	Karsten Keil		<keil@b1-systems.de>
- *	2002	Wolfgang Mües		<wolfgang@iksw-muees.de>
- *	2001	Frode Isaksen		<fisaksen@bewan.com>
- *      2001	Kai Germaschewski	<kai.germaschewski@gmx.de>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- */
-
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/crc-ccitt.h>
-#include <linux/isdn/hdlc.h>
-#include <linux/bitrev.h>
-
-/*-------------------------------------------------------------------*/
-
-MODULE_AUTHOR("Wolfgang Mües <wolfgang@iksw-muees.de>, "
-	      "Frode Isaksen <fisaksen@bewan.com>, "
-	      "Kai Germaschewski <kai.germaschewski@gmx.de>");
-MODULE_DESCRIPTION("General purpose ISDN HDLC decoder");
-MODULE_LICENSE("GPL");
-
-/*-------------------------------------------------------------------*/
-
-enum {
-	HDLC_FAST_IDLE, HDLC_GET_FLAG_B0, HDLC_GETFLAG_B1A6, HDLC_GETFLAG_B7,
-	HDLC_GET_DATA, HDLC_FAST_FLAG
-};
-
-enum {
-	HDLC_SEND_DATA, HDLC_SEND_CRC1, HDLC_SEND_FAST_FLAG,
-	HDLC_SEND_FIRST_FLAG, HDLC_SEND_CRC2, HDLC_SEND_CLOSING_FLAG,
-	HDLC_SEND_IDLE1, HDLC_SEND_FAST_IDLE, HDLC_SENDFLAG_B0,
-	HDLC_SENDFLAG_B1A6, HDLC_SENDFLAG_B7, STOPPED, HDLC_SENDFLAG_ONE
-};
-
-void isdnhdlc_rcv_init(struct isdnhdlc_vars *hdlc, u32 features)
-{
-	memset(hdlc, 0, sizeof(struct isdnhdlc_vars));
-	hdlc->state = HDLC_GET_DATA;
-	if (features & HDLC_56KBIT)
-		hdlc->do_adapt56 = 1;
-	if (features & HDLC_BITREVERSE)
-		hdlc->do_bitreverse = 1;
-}
-EXPORT_SYMBOL(isdnhdlc_out_init);
-
-void isdnhdlc_out_init(struct isdnhdlc_vars *hdlc, u32 features)
-{
-	memset(hdlc, 0, sizeof(struct isdnhdlc_vars));
-	if (features & HDLC_DCHANNEL) {
-		hdlc->dchannel = 1;
-		hdlc->state = HDLC_SEND_FIRST_FLAG;
-	} else {
-		hdlc->dchannel = 0;
-		hdlc->state = HDLC_SEND_FAST_FLAG;
-		hdlc->ffvalue = 0x7e;
-	}
-	hdlc->cbin = 0x7e;
-	if (features & HDLC_56KBIT) {
-		hdlc->do_adapt56 = 1;
-		hdlc->state = HDLC_SENDFLAG_B0;
-	} else
-		hdlc->data_bits = 8;
-	if (features & HDLC_BITREVERSE)
-		hdlc->do_bitreverse = 1;
-}
-EXPORT_SYMBOL(isdnhdlc_rcv_init);
-
-static int
-check_frame(struct isdnhdlc_vars *hdlc)
-{
-	int status;
-
-	if (hdlc->dstpos < 2) 	/* too small - framing error */
-		status = -HDLC_FRAMING_ERROR;
-	else if (hdlc->crc != 0xf0b8)	/* crc error */
-		status = -HDLC_CRC_ERROR;
-	else {
-		/* remove CRC */
-		hdlc->dstpos -= 2;
-		/* good frame */
-		status = hdlc->dstpos;
-	}
-	return status;
-}
-
-/*
-  isdnhdlc_decode - decodes HDLC frames from a transparent bit stream.
-
-  The source buffer is scanned for valid HDLC frames looking for
-  flags (01111110) to indicate the start of a frame. If the start of
-  the frame is found, the bit stuffing is removed (0 after 5 1's).
-  When a new flag is found, the complete frame has been received
-  and the CRC is checked.
-  If a valid frame is found, the function returns the frame length
-  excluding the CRC with the bit HDLC_END_OF_FRAME set.
-  If the beginning of a valid frame is found, the function returns
-  the length.
-  If a framing error is found (too many 1s and not a flag) the function
-  returns the length with the bit HDLC_FRAMING_ERROR set.
-  If a CRC error is found the function returns the length with the
-  bit HDLC_CRC_ERROR set.
-  If the frame length exceeds the destination buffer size, the function
-  returns the length with the bit HDLC_LENGTH_ERROR set.
-
-  src - source buffer
-  slen - source buffer length
-  count - number of bytes removed (decoded) from the source buffer
-  dst _ destination buffer
-  dsize - destination buffer size
-  returns - number of decoded bytes in the destination buffer and status
-  flag.
- */
-int isdnhdlc_decode(struct isdnhdlc_vars *hdlc, const u8 *src, int slen,
-	int *count, u8 *dst, int dsize)
-{
-	int status = 0;
-
-	static const unsigned char fast_flag[] = {
-		0x00, 0x00, 0x00, 0x20, 0x30, 0x38, 0x3c, 0x3e, 0x3f
-	};
-
-	static const unsigned char fast_flag_value[] = {
-		0x00, 0x7e, 0xfc, 0xf9, 0xf3, 0xe7, 0xcf, 0x9f, 0x3f
-	};
-
-	static const unsigned char fast_abort[] = {
-		0x00, 0x00, 0x80, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe, 0xff
-	};
-
-#define handle_fast_flag(h) \
-	do {\
-		if (h->cbin == fast_flag[h->bit_shift]) {\
-			h->ffvalue = fast_flag_value[h->bit_shift];\
-			h->state = HDLC_FAST_FLAG;\
-			h->ffbit_shift = h->bit_shift;\
-			h->bit_shift = 1;\
-		} else {\
-			h->state = HDLC_GET_DATA;\
-			h->data_received = 0;\
-		} \
-	} while (0)
-
-#define handle_abort(h) \
-	do {\
-		h->shift_reg = fast_abort[h->ffbit_shift - 1];\
-		h->hdlc_bits1 = h->ffbit_shift - 2;\
-		if (h->hdlc_bits1 < 0)\
-			h->hdlc_bits1 = 0;\
-		h->data_bits = h->ffbit_shift - 1;\
-		h->state = HDLC_GET_DATA;\
-		h->data_received = 0;\
-	} while (0)
-
-	*count = slen;
-
-	while (slen > 0) {
-		if (hdlc->bit_shift == 0) {
-			/* the code is for bitreverse streams */
-			if (hdlc->do_bitreverse == 0)
-				hdlc->cbin = bitrev8(*src++);
-			else
-				hdlc->cbin = *src++;
-			slen--;
-			hdlc->bit_shift = 8;
-			if (hdlc->do_adapt56)
-				hdlc->bit_shift--;
-		}
-
-		switch (hdlc->state) {
-		case STOPPED:
-			return 0;
-		case HDLC_FAST_IDLE:
-			if (hdlc->cbin == 0xff) {
-				hdlc->bit_shift = 0;
-				break;
-			}
-			hdlc->state = HDLC_GET_FLAG_B0;
-			hdlc->hdlc_bits1 = 0;
-			hdlc->bit_shift = 8;
-			break;
-		case HDLC_GET_FLAG_B0:
-			if (!(hdlc->cbin & 0x80)) {
-				hdlc->state = HDLC_GETFLAG_B1A6;
-				hdlc->hdlc_bits1 = 0;
-			} else {
-				if ((!hdlc->do_adapt56) &&
-				    (++hdlc->hdlc_bits1 >= 8) &&
-				    (hdlc->bit_shift == 1))
-						hdlc->state = HDLC_FAST_IDLE;
-			}
-			hdlc->cbin <<= 1;
-			hdlc->bit_shift--;
-			break;
-		case HDLC_GETFLAG_B1A6:
-			if (hdlc->cbin & 0x80) {
-				hdlc->hdlc_bits1++;
-				if (hdlc->hdlc_bits1 == 6)
-					hdlc->state = HDLC_GETFLAG_B7;
-			} else
-				hdlc->hdlc_bits1 = 0;
-			hdlc->cbin <<= 1;
-			hdlc->bit_shift--;
-			break;
-		case HDLC_GETFLAG_B7:
-			if (hdlc->cbin & 0x80) {
-				hdlc->state = HDLC_GET_FLAG_B0;
-			} else {
-				hdlc->state = HDLC_GET_DATA;
-				hdlc->crc = 0xffff;
-				hdlc->shift_reg = 0;
-				hdlc->hdlc_bits1 = 0;
-				hdlc->data_bits = 0;
-				hdlc->data_received = 0;
-			}
-			hdlc->cbin <<= 1;
-			hdlc->bit_shift--;
-			break;
-		case HDLC_GET_DATA:
-			if (hdlc->cbin & 0x80) {
-				hdlc->hdlc_bits1++;
-				switch (hdlc->hdlc_bits1) {
-				case 6:
-					break;
-				case 7:
-					if (hdlc->data_received)
-						/* bad frame */
-						status = -HDLC_FRAMING_ERROR;
-					if (!hdlc->do_adapt56) {
-						if (hdlc->cbin == fast_abort
-						    [hdlc->bit_shift + 1]) {
-							hdlc->state =
-								HDLC_FAST_IDLE;
-							hdlc->bit_shift = 1;
-							break;
-						}
-					} else
-						hdlc->state = HDLC_GET_FLAG_B0;
-					break;
-				default:
-					hdlc->shift_reg >>= 1;
-					hdlc->shift_reg |= 0x80;
-					hdlc->data_bits++;
-					break;
-				}
-			} else {
-				switch (hdlc->hdlc_bits1) {
-				case 5:
-					break;
-				case 6:
-					if (hdlc->data_received)
-						status = check_frame(hdlc);
-					hdlc->crc = 0xffff;
-					hdlc->shift_reg = 0;
-					hdlc->data_bits = 0;
-					if (!hdlc->do_adapt56)
-						handle_fast_flag(hdlc);
-					else {
-						hdlc->state = HDLC_GET_DATA;
-						hdlc->data_received = 0;
-					}
-					break;
-				default:
-					hdlc->shift_reg >>= 1;
-					hdlc->data_bits++;
-					break;
-				}
-				hdlc->hdlc_bits1 = 0;
-			}
-			if (status) {
-				hdlc->dstpos = 0;
-				*count -= slen;
-				hdlc->cbin <<= 1;
-				hdlc->bit_shift--;
-				return status;
-			}
-			if (hdlc->data_bits == 8) {
-				hdlc->data_bits = 0;
-				hdlc->data_received = 1;
-				hdlc->crc = crc_ccitt_byte(hdlc->crc,
-						hdlc->shift_reg);
-
-				/* good byte received */
-				if (hdlc->dstpos < dsize)
-					dst[hdlc->dstpos++] = hdlc->shift_reg;
-				else {
-					/* frame too long */
-					status = -HDLC_LENGTH_ERROR;
-					hdlc->dstpos = 0;
-				}
-			}
-			hdlc->cbin <<= 1;
-			hdlc->bit_shift--;
-			break;
-		case HDLC_FAST_FLAG:
-			if (hdlc->cbin == hdlc->ffvalue) {
-				hdlc->bit_shift = 0;
-				break;
-			} else {
-				if (hdlc->cbin == 0xff) {
-					hdlc->state = HDLC_FAST_IDLE;
-					hdlc->bit_shift = 0;
-				} else if (hdlc->ffbit_shift == 8) {
-					hdlc->state = HDLC_GETFLAG_B7;
-					break;
-				} else
-					handle_abort(hdlc);
-			}
-			break;
-		default:
-			break;
-		}
-	}
-	*count -= slen;
-	return 0;
-}
-EXPORT_SYMBOL(isdnhdlc_decode);
-/*
-  isdnhdlc_encode - encodes HDLC frames to a transparent bit stream.
-
-  The bit stream starts with a beginning flag (01111110). After
-  that each byte is added to the bit stream with bit stuffing added
-  (0 after 5 1's).
-  When the last byte has been removed from the source buffer, the
-  CRC (2 bytes is added) and the frame terminates with the ending flag.
-  For the dchannel, the idle character (all 1's) is also added at the end.
-  If this function is called with empty source buffer (slen=0), flags or
-  idle character will be generated.
-
-  src - source buffer
-  slen - source buffer length
-  count - number of bytes removed (encoded) from source buffer
-  dst _ destination buffer
-  dsize - destination buffer size
-  returns - number of encoded bytes in the destination buffer
-*/
-int isdnhdlc_encode(struct isdnhdlc_vars *hdlc, const u8 *src, u16 slen,
-	int *count, u8 *dst, int dsize)
-{
-	static const unsigned char xfast_flag_value[] = {
-		0x7e, 0x3f, 0x9f, 0xcf, 0xe7, 0xf3, 0xf9, 0xfc, 0x7e
-	};
-
-	int len = 0;
-
-	*count = slen;
-
-	/* special handling for one byte frames */
-	if ((slen == 1) && (hdlc->state == HDLC_SEND_FAST_FLAG))
-		hdlc->state = HDLC_SENDFLAG_ONE;
-	while (dsize > 0) {
-		if (hdlc->bit_shift == 0) {
-			if (slen && !hdlc->do_closing) {
-				hdlc->shift_reg = *src++;
-				slen--;
-				if (slen == 0)
-					/* closing sequence, CRC + flag(s) */
-					hdlc->do_closing = 1;
-				hdlc->bit_shift = 8;
-			} else {
-				if (hdlc->state == HDLC_SEND_DATA) {
-					if (hdlc->data_received) {
-						hdlc->state = HDLC_SEND_CRC1;
-						hdlc->crc ^= 0xffff;
-						hdlc->bit_shift = 8;
-						hdlc->shift_reg =
-							hdlc->crc & 0xff;
-					} else if (!hdlc->do_adapt56)
-						hdlc->state =
-							HDLC_SEND_FAST_FLAG;
-					else
-						hdlc->state =
-							HDLC_SENDFLAG_B0;
-				}
-
-			}
-		}
-
-		switch (hdlc->state) {
-		case STOPPED:
-			while (dsize--)
-				*dst++ = 0xff;
-			return dsize;
-		case HDLC_SEND_FAST_FLAG:
-			hdlc->do_closing = 0;
-			if (slen == 0) {
-				/* the code is for bitreverse streams */
-				if (hdlc->do_bitreverse == 0)
-					*dst++ = bitrev8(hdlc->ffvalue);
-				else
-					*dst++ = hdlc->ffvalue;
-				len++;
-				dsize--;
-				break;
-			}
-			/* fall through */
-		case HDLC_SENDFLAG_ONE:
-			if (hdlc->bit_shift == 8) {
-				hdlc->cbin = hdlc->ffvalue >>
-					(8 - hdlc->data_bits);
-				hdlc->state = HDLC_SEND_DATA;
-				hdlc->crc = 0xffff;
-				hdlc->hdlc_bits1 = 0;
-				hdlc->data_received = 1;
-			}
-			break;
-		case HDLC_SENDFLAG_B0:
-			hdlc->do_closing = 0;
-			hdlc->cbin <<= 1;
-			hdlc->data_bits++;
-			hdlc->hdlc_bits1 = 0;
-			hdlc->state = HDLC_SENDFLAG_B1A6;
-			break;
-		case HDLC_SENDFLAG_B1A6:
-			hdlc->cbin <<= 1;
-			hdlc->data_bits++;
-			hdlc->cbin++;
-			if (++hdlc->hdlc_bits1 == 6)
-				hdlc->state = HDLC_SENDFLAG_B7;
-			break;
-		case HDLC_SENDFLAG_B7:
-			hdlc->cbin <<= 1;
-			hdlc->data_bits++;
-			if (slen == 0) {
-				hdlc->state = HDLC_SENDFLAG_B0;
-				break;
-			}
-			if (hdlc->bit_shift == 8) {
-				hdlc->state = HDLC_SEND_DATA;
-				hdlc->crc = 0xffff;
-				hdlc->hdlc_bits1 = 0;
-				hdlc->data_received = 1;
-			}
-			break;
-		case HDLC_SEND_FIRST_FLAG:
-			hdlc->data_received = 1;
-			if (hdlc->data_bits == 8) {
-				hdlc->state = HDLC_SEND_DATA;
-				hdlc->crc = 0xffff;
-				hdlc->hdlc_bits1 = 0;
-				break;
-			}
-			hdlc->cbin <<= 1;
-			hdlc->data_bits++;
-			if (hdlc->shift_reg & 0x01)
-				hdlc->cbin++;
-			hdlc->shift_reg >>= 1;
-			hdlc->bit_shift--;
-			if (hdlc->bit_shift == 0) {
-				hdlc->state = HDLC_SEND_DATA;
-				hdlc->crc = 0xffff;
-				hdlc->hdlc_bits1 = 0;
-			}
-			break;
-		case HDLC_SEND_DATA:
-			hdlc->cbin <<= 1;
-			hdlc->data_bits++;
-			if (hdlc->hdlc_bits1 == 5) {
-				hdlc->hdlc_bits1 = 0;
-				break;
-			}
-			if (hdlc->bit_shift == 8)
-				hdlc->crc = crc_ccitt_byte(hdlc->crc,
-					hdlc->shift_reg);
-			if (hdlc->shift_reg & 0x01) {
-				hdlc->hdlc_bits1++;
-				hdlc->cbin++;
-				hdlc->shift_reg >>= 1;
-				hdlc->bit_shift--;
-			} else {
-				hdlc->hdlc_bits1 = 0;
-				hdlc->shift_reg >>= 1;
-				hdlc->bit_shift--;
-			}
-			break;
-		case HDLC_SEND_CRC1:
-			hdlc->cbin <<= 1;
-			hdlc->data_bits++;
-			if (hdlc->hdlc_bits1 == 5) {
-				hdlc->hdlc_bits1 = 0;
-				break;
-			}
-			if (hdlc->shift_reg & 0x01) {
-				hdlc->hdlc_bits1++;
-				hdlc->cbin++;
-				hdlc->shift_reg >>= 1;
-				hdlc->bit_shift--;
-			} else {
-				hdlc->hdlc_bits1 = 0;
-				hdlc->shift_reg >>= 1;
-				hdlc->bit_shift--;
-			}
-			if (hdlc->bit_shift == 0) {
-				hdlc->shift_reg = (hdlc->crc >> 8);
-				hdlc->state = HDLC_SEND_CRC2;
-				hdlc->bit_shift = 8;
-			}
-			break;
-		case HDLC_SEND_CRC2:
-			hdlc->cbin <<= 1;
-			hdlc->data_bits++;
-			if (hdlc->hdlc_bits1 == 5) {
-				hdlc->hdlc_bits1 = 0;
-				break;
-			}
-			if (hdlc->shift_reg & 0x01) {
-				hdlc->hdlc_bits1++;
-				hdlc->cbin++;
-				hdlc->shift_reg >>= 1;
-				hdlc->bit_shift--;
-			} else {
-				hdlc->hdlc_bits1 = 0;
-				hdlc->shift_reg >>= 1;
-				hdlc->bit_shift--;
-			}
-			if (hdlc->bit_shift == 0) {
-				hdlc->shift_reg = 0x7e;
-				hdlc->state = HDLC_SEND_CLOSING_FLAG;
-				hdlc->bit_shift = 8;
-			}
-			break;
-		case HDLC_SEND_CLOSING_FLAG:
-			hdlc->cbin <<= 1;
-			hdlc->data_bits++;
-			if (hdlc->hdlc_bits1 == 5) {
-				hdlc->hdlc_bits1 = 0;
-				break;
-			}
-			if (hdlc->shift_reg & 0x01)
-				hdlc->cbin++;
-			hdlc->shift_reg >>= 1;
-			hdlc->bit_shift--;
-			if (hdlc->bit_shift == 0) {
-				hdlc->ffvalue =
-					xfast_flag_value[hdlc->data_bits];
-				if (hdlc->dchannel) {
-					hdlc->ffvalue = 0x7e;
-					hdlc->state = HDLC_SEND_IDLE1;
-					hdlc->bit_shift = 8-hdlc->data_bits;
-					if (hdlc->bit_shift == 0)
-						hdlc->state =
-							HDLC_SEND_FAST_IDLE;
-				} else {
-					if (!hdlc->do_adapt56) {
-						hdlc->state =
-							HDLC_SEND_FAST_FLAG;
-						hdlc->data_received = 0;
-					} else {
-						hdlc->state = HDLC_SENDFLAG_B0;
-						hdlc->data_received = 0;
-					}
-					/* Finished this frame, send flags */
-					if (dsize > 1)
-						dsize = 1;
-				}
-			}
-			break;
-		case HDLC_SEND_IDLE1:
-			hdlc->do_closing = 0;
-			hdlc->cbin <<= 1;
-			hdlc->cbin++;
-			hdlc->data_bits++;
-			hdlc->bit_shift--;
-			if (hdlc->bit_shift == 0) {
-				hdlc->state = HDLC_SEND_FAST_IDLE;
-				hdlc->bit_shift = 0;
-			}
-			break;
-		case HDLC_SEND_FAST_IDLE:
-			hdlc->do_closing = 0;
-			hdlc->cbin = 0xff;
-			hdlc->data_bits = 8;
-			if (hdlc->bit_shift == 8) {
-				hdlc->cbin = 0x7e;
-				hdlc->state = HDLC_SEND_FIRST_FLAG;
-			} else {
-				/* the code is for bitreverse streams */
-				if (hdlc->do_bitreverse == 0)
-					*dst++ = bitrev8(hdlc->cbin);
-				else
-					*dst++ = hdlc->cbin;
-				hdlc->bit_shift = 0;
-				hdlc->data_bits = 0;
-				len++;
-				dsize = 0;
-			}
-			break;
-		default:
-			break;
-		}
-		if (hdlc->do_adapt56) {
-			if (hdlc->data_bits == 7) {
-				hdlc->cbin <<= 1;
-				hdlc->cbin++;
-				hdlc->data_bits++;
-			}
-		}
-		if (hdlc->data_bits == 8) {
-			/* the code is for bitreverse streams */
-			if (hdlc->do_bitreverse == 0)
-				*dst++ = bitrev8(hdlc->cbin);
-			else
-				*dst++ = hdlc->cbin;
-			hdlc->data_bits = 0;
-			len++;
-			dsize--;
-		}
-	}
-	*count -= slen;
-
-	return len;
-}
-EXPORT_SYMBOL(isdnhdlc_encode);
--- /dev/null
+++ linux-next-20090909/drivers/isdn/isdnhdlc.c
@@ -0,0 +1,630 @@
+/*
+ * isdnhdlc.c  --  General purpose ISDN HDLC decoder.
+ *
+ * Copyright (C)
+ *	2009	Karsten Keil		<keil@b1-systems.de>
+ *	2002	Wolfgang Mües		<wolfgang@iksw-muees.de>
+ *	2001	Frode Isaksen		<fisaksen@bewan.com>
+ *      2001	Kai Germaschewski	<kai.germaschewski@gmx.de>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/crc-ccitt.h>
+#include <linux/isdn/hdlc.h>
+#include <linux/bitrev.h>
+
+/*-------------------------------------------------------------------*/
+
+MODULE_AUTHOR("Wolfgang Mües <wolfgang@iksw-muees.de>, "
+	      "Frode Isaksen <fisaksen@bewan.com>, "
+	      "Kai Germaschewski <kai.germaschewski@gmx.de>");
+MODULE_DESCRIPTION("General purpose ISDN HDLC decoder");
+MODULE_LICENSE("GPL");
+
+/*-------------------------------------------------------------------*/
+
+enum {
+	HDLC_FAST_IDLE, HDLC_GET_FLAG_B0, HDLC_GETFLAG_B1A6, HDLC_GETFLAG_B7,
+	HDLC_GET_DATA, HDLC_FAST_FLAG
+};
+
+enum {
+	HDLC_SEND_DATA, HDLC_SEND_CRC1, HDLC_SEND_FAST_FLAG,
+	HDLC_SEND_FIRST_FLAG, HDLC_SEND_CRC2, HDLC_SEND_CLOSING_FLAG,
+	HDLC_SEND_IDLE1, HDLC_SEND_FAST_IDLE, HDLC_SENDFLAG_B0,
+	HDLC_SENDFLAG_B1A6, HDLC_SENDFLAG_B7, STOPPED, HDLC_SENDFLAG_ONE
+};
+
+void isdnhdlc_rcv_init(struct isdnhdlc_vars *hdlc, u32 features)
+{
+	memset(hdlc, 0, sizeof(struct isdnhdlc_vars));
+	hdlc->state = HDLC_GET_DATA;
+	if (features & HDLC_56KBIT)
+		hdlc->do_adapt56 = 1;
+	if (features & HDLC_BITREVERSE)
+		hdlc->do_bitreverse = 1;
+}
+EXPORT_SYMBOL(isdnhdlc_out_init);
+
+void isdnhdlc_out_init(struct isdnhdlc_vars *hdlc, u32 features)
+{
+	memset(hdlc, 0, sizeof(struct isdnhdlc_vars));
+	if (features & HDLC_DCHANNEL) {
+		hdlc->dchannel = 1;
+		hdlc->state = HDLC_SEND_FIRST_FLAG;
+	} else {
+		hdlc->dchannel = 0;
+		hdlc->state = HDLC_SEND_FAST_FLAG;
+		hdlc->ffvalue = 0x7e;
+	}
+	hdlc->cbin = 0x7e;
+	if (features & HDLC_56KBIT) {
+		hdlc->do_adapt56 = 1;
+		hdlc->state = HDLC_SENDFLAG_B0;
+	} else
+		hdlc->data_bits = 8;
+	if (features & HDLC_BITREVERSE)
+		hdlc->do_bitreverse = 1;
+}
+EXPORT_SYMBOL(isdnhdlc_rcv_init);
+
+static int
+check_frame(struct isdnhdlc_vars *hdlc)
+{
+	int status;
+
+	if (hdlc->dstpos < 2) 	/* too small - framing error */
+		status = -HDLC_FRAMING_ERROR;
+	else if (hdlc->crc != 0xf0b8)	/* crc error */
+		status = -HDLC_CRC_ERROR;
+	else {
+		/* remove CRC */
+		hdlc->dstpos -= 2;
+		/* good frame */
+		status = hdlc->dstpos;
+	}
+	return status;
+}
+
+/*
+  isdnhdlc_decode - decodes HDLC frames from a transparent bit stream.
+
+  The source buffer is scanned for valid HDLC frames looking for
+  flags (01111110) to indicate the start of a frame. If the start of
+  the frame is found, the bit stuffing is removed (0 after 5 1's).
+  When a new flag is found, the complete frame has been received
+  and the CRC is checked.
+  If a valid frame is found, the function returns the frame length
+  excluding the CRC with the bit HDLC_END_OF_FRAME set.
+  If the beginning of a valid frame is found, the function returns
+  the length.
+  If a framing error is found (too many 1s and not a flag) the function
+  returns the length with the bit HDLC_FRAMING_ERROR set.
+  If a CRC error is found the function returns the length with the
+  bit HDLC_CRC_ERROR set.
+  If the frame length exceeds the destination buffer size, the function
+  returns the length with the bit HDLC_LENGTH_ERROR set.
+
+  src - source buffer
+  slen - source buffer length
+  count - number of bytes removed (decoded) from the source buffer
+  dst _ destination buffer
+  dsize - destination buffer size
+  returns - number of decoded bytes in the destination buffer and status
+  flag.
+ */
+int isdnhdlc_decode(struct isdnhdlc_vars *hdlc, const u8 *src, int slen,
+	int *count, u8 *dst, int dsize)
+{
+	int status = 0;
+
+	static const unsigned char fast_flag[] = {
+		0x00, 0x00, 0x00, 0x20, 0x30, 0x38, 0x3c, 0x3e, 0x3f
+	};
+
+	static const unsigned char fast_flag_value[] = {
+		0x00, 0x7e, 0xfc, 0xf9, 0xf3, 0xe7, 0xcf, 0x9f, 0x3f
+	};
+
+	static const unsigned char fast_abort[] = {
+		0x00, 0x00, 0x80, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe, 0xff
+	};
+
+#define handle_fast_flag(h) \
+	do {\
+		if (h->cbin == fast_flag[h->bit_shift]) {\
+			h->ffvalue = fast_flag_value[h->bit_shift];\
+			h->state = HDLC_FAST_FLAG;\
+			h->ffbit_shift = h->bit_shift;\
+			h->bit_shift = 1;\
+		} else {\
+			h->state = HDLC_GET_DATA;\
+			h->data_received = 0;\
+		} \
+	} while (0)
+
+#define handle_abort(h) \
+	do {\
+		h->shift_reg = fast_abort[h->ffbit_shift - 1];\
+		h->hdlc_bits1 = h->ffbit_shift - 2;\
+		if (h->hdlc_bits1 < 0)\
+			h->hdlc_bits1 = 0;\
+		h->data_bits = h->ffbit_shift - 1;\
+		h->state = HDLC_GET_DATA;\
+		h->data_received = 0;\
+	} while (0)
+
+	*count = slen;
+
+	while (slen > 0) {
+		if (hdlc->bit_shift == 0) {
+			/* the code is for bitreverse streams */
+			if (hdlc->do_bitreverse == 0)
+				hdlc->cbin = bitrev8(*src++);
+			else
+				hdlc->cbin = *src++;
+			slen--;
+			hdlc->bit_shift = 8;
+			if (hdlc->do_adapt56)
+				hdlc->bit_shift--;
+		}
+
+		switch (hdlc->state) {
+		case STOPPED:
+			return 0;
+		case HDLC_FAST_IDLE:
+			if (hdlc->cbin == 0xff) {
+				hdlc->bit_shift = 0;
+				break;
+			}
+			hdlc->state = HDLC_GET_FLAG_B0;
+			hdlc->hdlc_bits1 = 0;
+			hdlc->bit_shift = 8;
+			break;
+		case HDLC_GET_FLAG_B0:
+			if (!(hdlc->cbin & 0x80)) {
+				hdlc->state = HDLC_GETFLAG_B1A6;
+				hdlc->hdlc_bits1 = 0;
+			} else {
+				if ((!hdlc->do_adapt56) &&
+				    (++hdlc->hdlc_bits1 >= 8) &&
+				    (hdlc->bit_shift == 1))
+						hdlc->state = HDLC_FAST_IDLE;
+			}
+			hdlc->cbin <<= 1;
+			hdlc->bit_shift--;
+			break;
+		case HDLC_GETFLAG_B1A6:
+			if (hdlc->cbin & 0x80) {
+				hdlc->hdlc_bits1++;
+				if (hdlc->hdlc_bits1 == 6)
+					hdlc->state = HDLC_GETFLAG_B7;
+			} else
+				hdlc->hdlc_bits1 = 0;
+			hdlc->cbin <<= 1;
+			hdlc->bit_shift--;
+			break;
+		case HDLC_GETFLAG_B7:
+			if (hdlc->cbin & 0x80) {
+				hdlc->state = HDLC_GET_FLAG_B0;
+			} else {
+				hdlc->state = HDLC_GET_DATA;
+				hdlc->crc = 0xffff;
+				hdlc->shift_reg = 0;
+				hdlc->hdlc_bits1 = 0;
+				hdlc->data_bits = 0;
+				hdlc->data_received = 0;
+			}
+			hdlc->cbin <<= 1;
+			hdlc->bit_shift--;
+			break;
+		case HDLC_GET_DATA:
+			if (hdlc->cbin & 0x80) {
+				hdlc->hdlc_bits1++;
+				switch (hdlc->hdlc_bits1) {
+				case 6:
+					break;
+				case 7:
+					if (hdlc->data_received)
+						/* bad frame */
+						status = -HDLC_FRAMING_ERROR;
+					if (!hdlc->do_adapt56) {
+						if (hdlc->cbin == fast_abort
+						    [hdlc->bit_shift + 1]) {
+							hdlc->state =
+								HDLC_FAST_IDLE;
+							hdlc->bit_shift = 1;
+							break;
+						}
+					} else
+						hdlc->state = HDLC_GET_FLAG_B0;
+					break;
+				default:
+					hdlc->shift_reg >>= 1;
+					hdlc->shift_reg |= 0x80;
+					hdlc->data_bits++;
+					break;
+				}
+			} else {
+				switch (hdlc->hdlc_bits1) {
+				case 5:
+					break;
+				case 6:
+					if (hdlc->data_received)
+						status = check_frame(hdlc);
+					hdlc->crc = 0xffff;
+					hdlc->shift_reg = 0;
+					hdlc->data_bits = 0;
+					if (!hdlc->do_adapt56)
+						handle_fast_flag(hdlc);
+					else {
+						hdlc->state = HDLC_GET_DATA;
+						hdlc->data_received = 0;
+					}
+					break;
+				default:
+					hdlc->shift_reg >>= 1;
+					hdlc->data_bits++;
+					break;
+				}
+				hdlc->hdlc_bits1 = 0;
+			}
+			if (status) {
+				hdlc->dstpos = 0;
+				*count -= slen;
+				hdlc->cbin <<= 1;
+				hdlc->bit_shift--;
+				return status;
+			}
+			if (hdlc->data_bits == 8) {
+				hdlc->data_bits = 0;
+				hdlc->data_received = 1;
+				hdlc->crc = crc_ccitt_byte(hdlc->crc,
+						hdlc->shift_reg);
+
+				/* good byte received */
+				if (hdlc->dstpos < dsize)
+					dst[hdlc->dstpos++] = hdlc->shift_reg;
+				else {
+					/* frame too long */
+					status = -HDLC_LENGTH_ERROR;
+					hdlc->dstpos = 0;
+				}
+			}
+			hdlc->cbin <<= 1;
+			hdlc->bit_shift--;
+			break;
+		case HDLC_FAST_FLAG:
+			if (hdlc->cbin == hdlc->ffvalue) {
+				hdlc->bit_shift = 0;
+				break;
+			} else {
+				if (hdlc->cbin == 0xff) {
+					hdlc->state = HDLC_FAST_IDLE;
+					hdlc->bit_shift = 0;
+				} else if (hdlc->ffbit_shift == 8) {
+					hdlc->state = HDLC_GETFLAG_B7;
+					break;
+				} else
+					handle_abort(hdlc);
+			}
+			break;
+		default:
+			break;
+		}
+	}
+	*count -= slen;
+	return 0;
+}
+EXPORT_SYMBOL(isdnhdlc_decode);
+/*
+  isdnhdlc_encode - encodes HDLC frames to a transparent bit stream.
+
+  The bit stream starts with a beginning flag (01111110). After
+  that each byte is added to the bit stream with bit stuffing added
+  (0 after 5 1's).
+  When the last byte has been removed from the source buffer, the
+  CRC (2 bytes is added) and the frame terminates with the ending flag.
+  For the dchannel, the idle character (all 1's) is also added at the end.
+  If this function is called with empty source buffer (slen=0), flags or
+  idle character will be generated.
+
+  src - source buffer
+  slen - source buffer length
+  count - number of bytes removed (encoded) from source buffer
+  dst _ destination buffer
+  dsize - destination buffer size
+  returns - number of encoded bytes in the destination buffer
+*/
+int isdnhdlc_encode(struct isdnhdlc_vars *hdlc, const u8 *src, u16 slen,
+	int *count, u8 *dst, int dsize)
+{
+	static const unsigned char xfast_flag_value[] = {
+		0x7e, 0x3f, 0x9f, 0xcf, 0xe7, 0xf3, 0xf9, 0xfc, 0x7e
+	};
+
+	int len = 0;
+
+	*count = slen;
+
+	/* special handling for one byte frames */
+	if ((slen == 1) && (hdlc->state == HDLC_SEND_FAST_FLAG))
+		hdlc->state = HDLC_SENDFLAG_ONE;
+	while (dsize > 0) {
+		if (hdlc->bit_shift == 0) {
+			if (slen && !hdlc->do_closing) {
+				hdlc->shift_reg = *src++;
+				slen--;
+				if (slen == 0)
+					/* closing sequence, CRC + flag(s) */
+					hdlc->do_closing = 1;
+				hdlc->bit_shift = 8;
+			} else {
+				if (hdlc->state == HDLC_SEND_DATA) {
+					if (hdlc->data_received) {
+						hdlc->state = HDLC_SEND_CRC1;
+						hdlc->crc ^= 0xffff;
+						hdlc->bit_shift = 8;
+						hdlc->shift_reg =
+							hdlc->crc & 0xff;
+					} else if (!hdlc->do_adapt56)
+						hdlc->state =
+							HDLC_SEND_FAST_FLAG;
+					else
+						hdlc->state =
+							HDLC_SENDFLAG_B0;
+				}
+
+			}
+		}
+
+		switch (hdlc->state) {
+		case STOPPED:
+			while (dsize--)
+				*dst++ = 0xff;
+			return dsize;
+		case HDLC_SEND_FAST_FLAG:
+			hdlc->do_closing = 0;
+			if (slen == 0) {
+				/* the code is for bitreverse streams */
+				if (hdlc->do_bitreverse == 0)
+					*dst++ = bitrev8(hdlc->ffvalue);
+				else
+					*dst++ = hdlc->ffvalue;
+				len++;
+				dsize--;
+				break;
+			}
+			/* fall through */
+		case HDLC_SENDFLAG_ONE:
+			if (hdlc->bit_shift == 8) {
+				hdlc->cbin = hdlc->ffvalue >>
+					(8 - hdlc->data_bits);
+				hdlc->state = HDLC_SEND_DATA;
+				hdlc->crc = 0xffff;
+				hdlc->hdlc_bits1 = 0;
+				hdlc->data_received = 1;
+			}
+			break;
+		case HDLC_SENDFLAG_B0:
+			hdlc->do_closing = 0;
+			hdlc->cbin <<= 1;
+			hdlc->data_bits++;
+			hdlc->hdlc_bits1 = 0;
+			hdlc->state = HDLC_SENDFLAG_B1A6;
+			break;
+		case HDLC_SENDFLAG_B1A6:
+			hdlc->cbin <<= 1;
+			hdlc->data_bits++;
+			hdlc->cbin++;
+			if (++hdlc->hdlc_bits1 == 6)
+				hdlc->state = HDLC_SENDFLAG_B7;
+			break;
+		case HDLC_SENDFLAG_B7:
+			hdlc->cbin <<= 1;
+			hdlc->data_bits++;
+			if (slen == 0) {
+				hdlc->state = HDLC_SENDFLAG_B0;
+				break;
+			}
+			if (hdlc->bit_shift == 8) {
+				hdlc->state = HDLC_SEND_DATA;
+				hdlc->crc = 0xffff;
+				hdlc->hdlc_bits1 = 0;
+				hdlc->data_received = 1;
+			}
+			break;
+		case HDLC_SEND_FIRST_FLAG:
+			hdlc->data_received = 1;
+			if (hdlc->data_bits == 8) {
+				hdlc->state = HDLC_SEND_DATA;
+				hdlc->crc = 0xffff;
+				hdlc->hdlc_bits1 = 0;
+				break;
+			}
+			hdlc->cbin <<= 1;
+			hdlc->data_bits++;
+			if (hdlc->shift_reg & 0x01)
+				hdlc->cbin++;
+			hdlc->shift_reg >>= 1;
+			hdlc->bit_shift--;
+			if (hdlc->bit_shift == 0) {
+				hdlc->state = HDLC_SEND_DATA;
+				hdlc->crc = 0xffff;
+				hdlc->hdlc_bits1 = 0;
+			}
+			break;
+		case HDLC_SEND_DATA:
+			hdlc->cbin <<= 1;
+			hdlc->data_bits++;
+			if (hdlc->hdlc_bits1 == 5) {
+				hdlc->hdlc_bits1 = 0;
+				break;
+			}
+			if (hdlc->bit_shift == 8)
+				hdlc->crc = crc_ccitt_byte(hdlc->crc,
+					hdlc->shift_reg);
+			if (hdlc->shift_reg & 0x01) {
+				hdlc->hdlc_bits1++;
+				hdlc->cbin++;
+				hdlc->shift_reg >>= 1;
+				hdlc->bit_shift--;
+			} else {
+				hdlc->hdlc_bits1 = 0;
+				hdlc->shift_reg >>= 1;
+				hdlc->bit_shift--;
+			}
+			break;
+		case HDLC_SEND_CRC1:
+			hdlc->cbin <<= 1;
+			hdlc->data_bits++;
+			if (hdlc->hdlc_bits1 == 5) {
+				hdlc->hdlc_bits1 = 0;
+				break;
+			}
+			if (hdlc->shift_reg & 0x01) {
+				hdlc->hdlc_bits1++;
+				hdlc->cbin++;
+				hdlc->shift_reg >>= 1;
+				hdlc->bit_shift--;
+			} else {
+				hdlc->hdlc_bits1 = 0;
+				hdlc->shift_reg >>= 1;
+				hdlc->bit_shift--;
+			}
+			if (hdlc->bit_shift == 0) {
+				hdlc->shift_reg = (hdlc->crc >> 8);
+				hdlc->state = HDLC_SEND_CRC2;
+				hdlc->bit_shift = 8;
+			}
+			break;
+		case HDLC_SEND_CRC2:
+			hdlc->cbin <<= 1;
+			hdlc->data_bits++;
+			if (hdlc->hdlc_bits1 == 5) {
+				hdlc->hdlc_bits1 = 0;
+				break;
+			}
+			if (hdlc->shift_reg & 0x01) {
+				hdlc->hdlc_bits1++;
+				hdlc->cbin++;
+				hdlc->shift_reg >>= 1;
+				hdlc->bit_shift--;
+			} else {
+				hdlc->hdlc_bits1 = 0;
+				hdlc->shift_reg >>= 1;
+				hdlc->bit_shift--;
+			}
+			if (hdlc->bit_shift == 0) {
+				hdlc->shift_reg = 0x7e;
+				hdlc->state = HDLC_SEND_CLOSING_FLAG;
+				hdlc->bit_shift = 8;
+			}
+			break;
+		case HDLC_SEND_CLOSING_FLAG:
+			hdlc->cbin <<= 1;
+			hdlc->data_bits++;
+			if (hdlc->hdlc_bits1 == 5) {
+				hdlc->hdlc_bits1 = 0;
+				break;
+			}
+			if (hdlc->shift_reg & 0x01)
+				hdlc->cbin++;
+			hdlc->shift_reg >>= 1;
+			hdlc->bit_shift--;
+			if (hdlc->bit_shift == 0) {
+				hdlc->ffvalue =
+					xfast_flag_value[hdlc->data_bits];
+				if (hdlc->dchannel) {
+					hdlc->ffvalue = 0x7e;
+					hdlc->state = HDLC_SEND_IDLE1;
+					hdlc->bit_shift = 8-hdlc->data_bits;
+					if (hdlc->bit_shift == 0)
+						hdlc->state =
+							HDLC_SEND_FAST_IDLE;
+				} else {
+					if (!hdlc->do_adapt56) {
+						hdlc->state =
+							HDLC_SEND_FAST_FLAG;
+						hdlc->data_received = 0;
+					} else {
+						hdlc->state = HDLC_SENDFLAG_B0;
+						hdlc->data_received = 0;
+					}
+					/* Finished this frame, send flags */
+					if (dsize > 1)
+						dsize = 1;
+				}
+			}
+			break;
+		case HDLC_SEND_IDLE1:
+			hdlc->do_closing = 0;
+			hdlc->cbin <<= 1;
+			hdlc->cbin++;
+			hdlc->data_bits++;
+			hdlc->bit_shift--;
+			if (hdlc->bit_shift == 0) {
+				hdlc->state = HDLC_SEND_FAST_IDLE;
+				hdlc->bit_shift = 0;
+			}
+			break;
+		case HDLC_SEND_FAST_IDLE:
+			hdlc->do_closing = 0;
+			hdlc->cbin = 0xff;
+			hdlc->data_bits = 8;
+			if (hdlc->bit_shift == 8) {
+				hdlc->cbin = 0x7e;
+				hdlc->state = HDLC_SEND_FIRST_FLAG;
+			} else {
+				/* the code is for bitreverse streams */
+				if (hdlc->do_bitreverse == 0)
+					*dst++ = bitrev8(hdlc->cbin);
+				else
+					*dst++ = hdlc->cbin;
+				hdlc->bit_shift = 0;
+				hdlc->data_bits = 0;
+				len++;
+				dsize = 0;
+			}
+			break;
+		default:
+			break;
+		}
+		if (hdlc->do_adapt56) {
+			if (hdlc->data_bits == 7) {
+				hdlc->cbin <<= 1;
+				hdlc->cbin++;
+				hdlc->data_bits++;
+			}
+		}
+		if (hdlc->data_bits == 8) {
+			/* the code is for bitreverse streams */
+			if (hdlc->do_bitreverse == 0)
+				*dst++ = bitrev8(hdlc->cbin);
+			else
+				*dst++ = hdlc->cbin;
+			hdlc->data_bits = 0;
+			len++;
+			dsize--;
+		}
+	}
+	*count -= slen;
+
+	return len;
+}
+EXPORT_SYMBOL(isdnhdlc_encode);
--- linux-next-20090909.orig/drivers/isdn/Kconfig
+++ linux-next-20090909/drivers/isdn/Kconfig
@@ -61,4 +61,9 @@ endif # ISDN_CAPI
 
 source "drivers/isdn/gigaset/Kconfig"
 
+config ISDN_HDLC
+	tristate
+	select CRC_CCITT
+	select BITREVERSE
+
 endif # ISDN
--- linux-next-20090909.orig/drivers/isdn/i4l/Kconfig
+++ linux-next-20090909/drivers/isdn/i4l/Kconfig
@@ -140,9 +140,3 @@ endmenu
 # end ISDN_I4L
 endif
 
-config ISDN_HDLC
-	tristate 
-	depends on HISAX_ST5481
-	select CRC_CCITT
-	select BITREVERSE
-




---
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] bonding: introduce primary_passive option
From: Nicolas de Pesloüan @ 2009-09-10 22:06 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, fubar, bonding-devel
In-Reply-To: <20090910182059.GC2972@psychotron.redhat.com>

Jiri Pirko wrote:
> (updated 2)
> 
> In some cases there is not desirable to switch back to primary interface when
> it's link recovers and rather stay with currently active one. We need to avoid
> packetloss as much as we can in some cases. This is solved by introducing
> primary_passive option. Note that enslaved primary slave is set as current
> active no matter what.
> 
> This patch depends on the following one:
> [net-next-2.6] bonding: make ab_arp select active slaves as other modes
> http://patchwork.ozlabs.org/patch/32684/
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> 
> diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
> index d5181ce..e5f2c31 100644
> --- a/Documentation/networking/bonding.txt
> +++ b/Documentation/networking/bonding.txt
> @@ -614,6 +614,32 @@ primary
>  
>  	The primary option is only valid for active-backup mode.
>  
> +primary_passive
> +
> +	Specifies the behavior of the current active slave when the primary was
> +	down and comes back up.  This option is designed to prevent
> +	flip-flopping between the primary slave and other slaves.  The possible
> +	values and their respective effects are:
> +
> +	disabled or 0 (default)
> +
> +		The primary slave becomes the active slave whenever it comes
> +		back up.
> +
> +	better or 1
> +
> +		The primary slave becomes the active slave when it comes back
> +		up, if the speed and duplex of the primary slave is better
> +		than the speed and duplex of the current active slave.
> +
> +	always or 2
> +
> +		The primary slave becomes the active slave only if the current
> +		active slave fails and the primary slave is up.
> +
> +	When no slave are active, if the primary comes back up, it becomes the
> +	active slave, regardless of the value of primary_return
> +
>  updelay

My feeling is that using primary_passive=disabled/better/always is far less 
clear than primary_return=always/better/failure_only.

The normal (current) behavior is to always return to primay if it is up. You 
want to add the ability to return to it only on failure of the current active 
slave and I suggested to add the ability the return to it if it is "better" than 
the current active slave.

Hence the suggested name for the option and option values.

>  
>  	Specifies the time, in milliseconds, to wait before enabling a
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index a7e731f..193eca4 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -94,6 +94,7 @@ static int downdelay;
>  static int use_carrier	= 1;
>  static char *mode;
>  static char *primary;
> +static char *primary_passive;
>  static char *lacp_rate;
>  static char *ad_select;
>  static char *xmit_hash_policy;
> @@ -126,6 +127,13 @@ MODULE_PARM_DESC(mode, "Mode of operation : 0 for balance-rr, "
>  		       "6 for balance-alb");
>  module_param(primary, charp, 0);
>  MODULE_PARM_DESC(primary, "Primary network device to use");
> +module_param(primary_passive, charp, 0);
> +MODULE_PARM_DESC(primary_passive, "Do not set primary slave active "
> +				  "once it comes up; "
> +				  "0 for disabled (default), "
> +				  "1 for on only if speed of primary is not "
> +				  "better, "
> +				  "2 for always on");

You have a double negative assertion here : a "do not" option whose value is 
"disabled". For clarity, I suggest to have a "do" option whose value is "enabled".

This is probably the reason why I suggested primary_return instead of 
primary_passive. primary_passive means "refuse to return to primary" and when 
disabled, it becomes "don't refuse to return to primary", which should be 
"accept to return to primary" instead :-)

It might be seen as not being that important, but having an option whose name 
and values are easy to understand leads to an easy to use option... :-)

>  module_param(lacp_rate, charp, 0);
>  MODULE_PARM_DESC(lacp_rate, "LACPDU tx rate to request from 802.3ad partner "
>  			    "(slow/fast)");
> @@ -200,6 +208,13 @@ const struct bond_parm_tbl fail_over_mac_tbl[] = {
>  {	NULL,			-1},
>  };
>  
> +const struct bond_parm_tbl pri_passive_tbl[] = {
> +{	"disabled",		BOND_PRI_PASSIVE_DISABLED},
> +{	"better",		BOND_PRI_PASSIVE_BETTER},
> +{	"always",		BOND_PRI_PASSIVE_ALWAYS},
> +{	NULL,			-1},
> +};
> +
>  struct bond_parm_tbl ad_select_tbl[] = {
>  {	"stable",	BOND_AD_STABLE},
>  {	"bandwidth",	BOND_AD_BANDWIDTH},
> @@ -1070,6 +1085,25 @@ out:
>  
>  }
>  
> +static bool bond_should_loose_active(struct bonding *bond)
> +{
> +	struct slave *prim = bond->primary_slave;
> +	struct slave *curr = bond->curr_active_slave;
> +
> +	if (!prim || !curr || curr->link != BOND_LINK_UP)
> +		return true;
> +	if (bond->force_primary) {
> +		bond->force_primary = false;
> +		return true;
> +	}
> +	if (bond->params.primary_passive == 1 &&

You should use the constants you defined in bonding.h and used in pri_passive_tbl :

if (bond->params.primary_passive == BOND_PRI_PASSIVE_BETTER &&

> +	    (prim->speed < curr->speed ||
> +	     (prim->speed == curr->speed && prim->duplex <= curr->duplex)))
> +		return false;
> +	if (bond->params.primary_passive == 2)

You should use the constants you defined in bonding.h and used in pri_passive_tbl :

if (bond->params.primary_passive == BOND_PRI_PASSIVE_ALWAYS)

> +		return false;
> +	return true;
> +}
>  
>  /**
>   * find_best_interface - select the best available slave to be the active one
> @@ -1093,15 +1127,9 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
>  			return NULL; /* still no slave, return NULL */
>  	}
>  
> -	/*
> -	 * first try the primary link; if arping, a link must tx/rx
> -	 * traffic before it can be considered the curr_active_slave.
> -	 * also, we would skip slaves between the curr_active_slave
> -	 * and primary_slave that may be up and able to arp
> -	 */
>  	if ((bond->primary_slave) &&
> -	    (!bond->params.arp_interval) &&
> -	    (IS_UP(bond->primary_slave->dev))) {
> +	    bond->primary_slave->link == BOND_LINK_UP &&
> +	    bond_should_loose_active(bond)) {
>  		new_active = bond->primary_slave;
>  	}
>  
> @@ -1109,15 +1137,14 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
>  	old_active = new_active;
>  
>  	bond_for_each_slave_from(bond, new_active, i, old_active) {
> -		if (IS_UP(new_active->dev)) {
> -			if (new_active->link == BOND_LINK_UP) {
> -				return new_active;
> -			} else if (new_active->link == BOND_LINK_BACK) {
> -				/* link up, but waiting for stabilization */
> -				if (new_active->delay < mintime) {
> -					mintime = new_active->delay;
> -					bestslave = new_active;
> -				}
> +		if (new_active->link == BOND_LINK_UP) {
> +			return new_active;
> +		} else if (new_active->link == BOND_LINK_BACK &&
> +			   IS_UP(new_active->dev)) {
> +			/* link up, but waiting for stabilization */
> +			if (new_active->delay < mintime) {
> +				mintime = new_active->delay;
> +				bestslave = new_active;
>  			}
>  		}
>  	}
> @@ -1683,8 +1710,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>  
>  	if (USES_PRIMARY(bond->params.mode) && bond->params.primary[0]) {
>  		/* if there is a primary slave, remember it */
> -		if (strcmp(bond->params.primary, new_slave->dev->name) == 0)
> +		if (strcmp(bond->params.primary, new_slave->dev->name) == 0) {
>  			bond->primary_slave = new_slave;
> +			bond->force_primary = true;
> +		}
>  	}
>  
>  	write_lock_bh(&bond->curr_slave_lock);
> @@ -2929,18 +2958,6 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
>  		}
>  	}
>  
> -	read_lock(&bond->curr_slave_lock);
> -
> -	/*
> -	 * Trigger a commit if the primary option setting has changed.
> -	 */
> -	if (bond->primary_slave &&
> -	    (bond->primary_slave != bond->curr_active_slave) &&
> -	    (bond->primary_slave->link == BOND_LINK_UP))
> -		commit++;
> -
> -	read_unlock(&bond->curr_slave_lock);
> -
>  	return commit;
>  }
>  
> @@ -2961,90 +2978,58 @@ static void bond_ab_arp_commit(struct bonding *bond, int delta_in_ticks)
>  			continue;
>  
>  		case BOND_LINK_UP:
> -			write_lock_bh(&bond->curr_slave_lock);
> -
> -			if (!bond->curr_active_slave &&
> -			    time_before_eq(jiffies, dev_trans_start(slave->dev) +
> -					   delta_in_ticks)) {
> +			if ((!bond->curr_active_slave &&
> +			     time_before_eq(jiffies,
> +					    dev_trans_start(slave->dev) +
> +					    delta_in_ticks)) ||
> +			    bond->curr_active_slave != slave) {
>  				slave->link = BOND_LINK_UP;
> -				bond_change_active_slave(bond, slave);
>  				bond->current_arp_slave = NULL;
>  
>  				pr_info(DRV_NAME
> -				       ": %s: %s is up and now the "
> -				       "active interface\n",
> -				       bond->dev->name, slave->dev->name);
> -
> -			} else if (bond->curr_active_slave != slave) {
> -				/* this slave has just come up but we
> -				 * already have a current slave; this can
> -				 * also happen if bond_enslave adds a new
> -				 * slave that is up while we are searching
> -				 * for a new slave
> -				 */
> -				slave->link = BOND_LINK_UP;
> -				bond_set_slave_inactive_flags(slave);
> -				bond->current_arp_slave = NULL;
> +					": %s: link status definitely "
> +					"up for interface %s.\n",
> +					bond->dev->name, slave->dev->name);
>  
> -				pr_info(DRV_NAME
> -				       ": %s: backup interface %s is now up\n",
> -				       bond->dev->name, slave->dev->name);
> -			}
> +				if (!bond->curr_active_slave ||
> +				    (slave == bond->primary_slave))
> +					goto do_failover;
>  
> -			write_unlock_bh(&bond->curr_slave_lock);
> +			}
>  
> -			break;
> +			continue;
>  
>  		case BOND_LINK_DOWN:
>  			if (slave->link_failure_count < UINT_MAX)
>  				slave->link_failure_count++;
>  
>  			slave->link = BOND_LINK_DOWN;
> +			bond_set_slave_inactive_flags(slave);
>  
> -			if (slave == bond->curr_active_slave) {
> -				pr_info(DRV_NAME
> -				       ": %s: link status down for active "
> -				       "interface %s, disabling it\n",
> -				       bond->dev->name, slave->dev->name);
> -
> -				bond_set_slave_inactive_flags(slave);
> -
> -				write_lock_bh(&bond->curr_slave_lock);
> -
> -				bond_select_active_slave(bond);
> -				if (bond->curr_active_slave)
> -					bond->curr_active_slave->jiffies =
> -						jiffies;
> -
> -				write_unlock_bh(&bond->curr_slave_lock);
> +			pr_info(DRV_NAME
> +				": %s: link status definitely down for "
> +				"interface %s, disabling it\n",
> +				bond->dev->name, slave->dev->name);
>  
> +			if (slave == bond->curr_active_slave) {
>  				bond->current_arp_slave = NULL;
> -
> -			} else if (slave->state == BOND_STATE_BACKUP) {
> -				pr_info(DRV_NAME
> -				       ": %s: backup interface %s is now down\n",
> -				       bond->dev->name, slave->dev->name);
> -
> -				bond_set_slave_inactive_flags(slave);
> +				goto do_failover;
>  			}
> -			break;
> +
> +			continue;
>  
>  		default:
>  			pr_err(DRV_NAME
>  			       ": %s: impossible: new_link %d on slave %s\n",
>  			       bond->dev->name, slave->new_link,
>  			       slave->dev->name);
> +			continue;
>  		}
> -	}
>  
> -	/*
> -	 * No race with changes to primary via sysfs, as we hold rtnl.
> -	 */
> -	if (bond->primary_slave &&
> -	    (bond->primary_slave != bond->curr_active_slave) &&
> -	    (bond->primary_slave->link == BOND_LINK_UP)) {
> +do_failover:
> +		ASSERT_RTNL();
>  		write_lock_bh(&bond->curr_slave_lock);
> -		bond_change_active_slave(bond, bond->primary_slave);
> +		bond_select_active_slave(bond);
>  		write_unlock_bh(&bond->curr_slave_lock);
>  	}
>  
> @@ -4695,7 +4680,7 @@ int bond_parse_parm(const char *buf, const struct bond_parm_tbl *tbl)
>  
>  static int bond_check_params(struct bond_params *params)
>  {
> -	int arp_validate_value, fail_over_mac_value;
> +	int arp_validate_value, fail_over_mac_value, primary_passive_value;
>  
>  	/*
>  	 * Convert string parameters.
> @@ -4994,6 +4979,20 @@ static int bond_check_params(struct bond_params *params)
>  		primary = NULL;
>  	}
>  
> +	if (primary && primary_passive) {
> +		primary_passive_value = bond_parse_parm(primary_passive,
> +							pri_passive_tbl);
> +		if (primary_passive_value == -1) {
> +			pr_err(DRV_NAME
> +			       ": Error: Invalid primary_passive \"%s\"\n",
> +			       primary_passive ==
> +					NULL ? "NULL" : primary_passive);
> +			return -EINVAL;
> +		}
> +	} else {
> +		primary_passive_value = BOND_PRI_PASSIVE_DISABLED;
> +	}
> +
>  	if (fail_over_mac) {
>  		fail_over_mac_value = bond_parse_parm(fail_over_mac,
>  						      fail_over_mac_tbl);
> @@ -5025,6 +5024,7 @@ static int bond_check_params(struct bond_params *params)
>  	params->use_carrier = use_carrier;
>  	params->lacp_fast = lacp_fast;
>  	params->primary[0] = 0;
> +	params->primary_passive = primary_passive_value;
>  	params->fail_over_mac = fail_over_mac_value;
>  
>  	if (primary) {
> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
> index 6044e12..84ce933 100644
> --- a/drivers/net/bonding/bond_sysfs.c
> +++ b/drivers/net/bonding/bond_sysfs.c
> @@ -1212,6 +1212,59 @@ static DEVICE_ATTR(primary, S_IRUGO | S_IWUSR,
>  		   bonding_show_primary, bonding_store_primary);
>  
>  /*
> + * Show and set the primary_passive flag.
> + */
> +static ssize_t bonding_show_primary_passive(struct device *d,
> +					    struct device_attribute *attr,
> +					    char *buf)
> +{
> +	struct bonding *bond = to_bond(d);
> +
> +	return sprintf(buf, "%s %d\n",
> +		       pri_passive_tbl[bond->params.primary_passive].modename,
> +		       bond->params.primary_passive);
> +}
> +
> +static ssize_t bonding_store_primary_passive(struct device *d,
> +					     struct device_attribute *attr,
> +					     const char *buf, size_t count)
> +{
> +	int new_value, ret = count;
> +	struct bonding *bond = to_bond(d);
> +
> +	if (!rtnl_trylock())
> +		return restart_syscall();
> +
> +	new_value = bond_parse_parm(buf, pri_passive_tbl);
> +	if (new_value < 0)  {
> +		pr_err(DRV_NAME
> +		       ": %s: Ignoring invalid primary_passive value %.*s.\n",
> +		       bond->dev->name,
> +		       (int) strlen(buf) - 1, buf);
> +		ret = -EINVAL;
> +		goto out;
> +	} else {
> +		bond->params.primary_passive = new_value;
> +		pr_info(DRV_NAME ": %s: setting primary_passive to %s (%d).\n",
> +		       bond->dev->name, pri_passive_tbl[new_value].modename,
> +		       new_value);
> +		if (new_value == 0 || new_value == 1) {
> +			bond->force_primary = true;
> +			read_lock(&bond->lock);
> +			write_lock_bh(&bond->curr_slave_lock);
> +			bond_select_active_slave(bond);
> +			write_unlock_bh(&bond->curr_slave_lock);
> +			read_unlock(&bond->lock);
> +		}
> +	}
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +static DEVICE_ATTR(primary_passive, S_IRUGO | S_IWUSR,
> +		   bonding_show_primary_passive, bonding_store_primary_passive);
> +
> +/*
>   * Show and set the use_carrier flag.
>   */
>  static ssize_t bonding_show_carrier(struct device *d,
> @@ -1500,6 +1553,7 @@ static struct attribute *per_bond_attrs[] = {
>  	&dev_attr_num_unsol_na.attr,
>  	&dev_attr_miimon.attr,
>  	&dev_attr_primary.attr,
> +	&dev_attr_primary_passive.attr,
>  	&dev_attr_use_carrier.attr,
>  	&dev_attr_active_slave.attr,
>  	&dev_attr_mii_status.attr,
> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> index 6824771..9a9e0c7 100644
> --- a/drivers/net/bonding/bonding.h
> +++ b/drivers/net/bonding/bonding.h
> @@ -131,6 +131,7 @@ struct bond_params {
>  	int lacp_fast;
>  	int ad_select;
>  	char primary[IFNAMSIZ];
> +	int primary_passive;
>  	__be32 arp_targets[BOND_MAX_ARP_TARGETS];
>  };
>  
> @@ -190,6 +191,7 @@ struct bonding {
>  	struct   slave *curr_active_slave;
>  	struct   slave *current_arp_slave;
>  	struct   slave *primary_slave;
> +	bool     force_primary;
>  	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
>  	rwlock_t lock;
>  	rwlock_t curr_slave_lock;
> @@ -258,6 +260,10 @@ static inline bool bond_is_lb(const struct bonding *bond)
>  		|| bond->params.mode == BOND_MODE_ALB;
>  }
>  
> +#define BOND_PRI_PASSIVE_DISABLED	0
> +#define BOND_PRI_PASSIVE_BETTER		1
> +#define BOND_PRI_PASSIVE_ALWAYS		2
> +
>  #define BOND_FOM_NONE			0
>  #define BOND_FOM_ACTIVE			1
>  #define BOND_FOM_FOLLOW			2
> @@ -348,6 +354,7 @@ extern const struct bond_parm_tbl bond_mode_tbl[];
>  extern const struct bond_parm_tbl xmit_hashtype_tbl[];
>  extern const struct bond_parm_tbl arp_validate_tbl[];
>  extern const struct bond_parm_tbl fail_over_mac_tbl[];
> +extern const struct bond_parm_tbl pri_passive_tbl[];
>  extern struct bond_parm_tbl ad_select_tbl[];
>  
>  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> 

	Nicolas.

^ permalink raw reply

* [PATCH] net: force bridge module(s) to be GPL
From: Stephen Hemminger @ 2009-09-10 21:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel

The only valid usage for the bridge frame hooks are by a
GPL components (such as the bridge module).
The kernel should not leave a crack in the door for proprietary
networking stacks to slip in.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/core/dev.c	2009-09-10 14:27:59.076074483 -0700
+++ b/net/core/dev.c	2009-09-10 14:28:43.659314263 -0700
@@ -2116,7 +2116,7 @@ static inline int deliver_skb(struct sk_
 /* This hook is defined here for ATM LANE */
 int (*br_fdb_test_addr_hook)(struct net_device *dev,
 			     unsigned char *addr) __read_mostly;
-EXPORT_SYMBOL(br_fdb_test_addr_hook);
+EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
 /*
@@ -2125,7 +2125,7 @@ EXPORT_SYMBOL(br_fdb_test_addr_hook);
  */
 struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
 					struct sk_buff *skb) __read_mostly;
-EXPORT_SYMBOL(br_handle_frame_hook);
+EXPORT_SYMBOL_GPL(br_handle_frame_hook);
 
 static inline struct sk_buff *handle_bridge(struct sk_buff *skb,
 					    struct packet_type **pt_prev, int *ret,

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox